Notes on a Flink Error

While testing a Flink program locally, it threw this exception: org.apache.flink.api.common.InvalidProgramException: This type (GenericType<com.bart.flink.datasource.WordWithCount>) cannot be used as key
Following the exception stack trace, the problem was traced to the keyBy call. The offending code:

KeyedStream<WordWithCount, Tuple> word = ds.keyBy("word"); // "word" is a field name of the POJO

Strange: the code follows the official example, so why does it fail?
Official documentation
The example given on the official site:

// some ordinary POJO (Plain old Java Object)
public class WC {
  public String word;
  public int count;
}
DataStream<WC> words = // [...]
DataStream<WC> wordCounts = words.keyBy("word").window(/*window specification*/);

My code:

class WordWithCount  {

    public String word;
    public int count;

    public WordWithCount() {}

    public WordWithCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public String toString() {
        return "WordWithCount{" +
                "word='" + word + '\'' +
                ", count=" + count +
                '}';
    }
}

After much digging, the cause turned out to be that I had defined the POJO as an inner class without the public modifier, which is why keyBy failed. Changing it to a public static nested class made the exception disappear...

    public static class WordWithCount  {

        public String word;
        public int count;

        public WordWithCount() {}

        public WordWithCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WordWithCount{" +
                    "word='" + word + '\'' +
                    ", count=" + count +
                    '}';
        }
    }

What a trap; entirely my own carelessness...
A careful read of the POJO definition in the official Flink documentation then revealed some very important points:
How Flink defines a POJO

Java and Scala classes are treated by Flink as a special POJO data type if they fulfill the following requirements:
The class must be public.
It must have a public constructor without arguments (default constructor).
All fields are either public or must be accessible through getter and setter functions. For a field called foo the getter and setter methods must be named getFoo() and setFoo().
The type of a field must be supported by Flink. At the moment, Flink uses Avro to serialize arbitrary objects (such as Date).

In plain terms:

In Flink, a Java or Scala class is treated as a POJO only if it meets all of the following conditions:
1. The class must be public (very important: in particular, when defining a nested class in Java, make sure it is a public static class);
2. It must have a public no-argument constructor;
3. All fields must be public, or, if private, accessible through getters and setters: for a field foo there must be getFoo() and setFoo() methods (see the sketch after this list);
4. The type of every field must be supported by Flink. Since most common Java types are supported, this is usually satisfied; but note that if a field is itself a POJO and is used as a key, that nested POJO must override the hashCode() method.
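For example, requirement 3 allows private fields as long as matching getters and setters exist. A minimal sketch (the Event class and its fields are hypothetical, not from my program):

public static class Event {
    private String name;       // private is fine because getName()/setName() exist
    private long timestamp;

    public Event() {}          // public no-arg constructor (requirement 2)

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public long getTimestamp() { return timestamp; }
    public void setTimestamp(long timestamp) { this.timestamp = timestamp; }
}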

Point 4 deserves special attention: the Flink documentation explicitly states the requirements for a type used as a keyBy key.
Flink documentation on keyBy
An excerpt:

A type cannot be a key if:
it is a POJO type but does not override the hashCode() method and relies on the Object.hashCode() implementation.
it is an array of any type.

In short:

A type cannot (note: cannot) be used as a keyBy key if:
1. it is a POJO that does not override the hashCode() method and relies on the Object.hashCode() implementation;
2. it is an array of any type.

In other words, as long as a type avoids the two cases above, it can be used as a key.
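For instance, keying by a field whose type is itself a POJO requires that nested POJO to override hashCode() (and, in practice, equals() as well, so that equal keys are grouped together). A minimal sketch, assuming a hypothetical composite key class WordKey:

public static class WordKey {
    public String word;
    public int shard;

    public WordKey() {}

    @Override
    public int hashCode() {
        // required: relying on Object.hashCode() would make this type unusable as a key
        return java.util.Objects.hash(word, shard);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WordKey)) return false;
        WordKey other = (WordKey) o;
        return java.util.Objects.equals(word, other.word) && shard == other.shard;
    }
}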

Here is the complete code I used for testing:

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;


public class DataSourcePoJO {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); // parallelism 1 to make the output easier to follow
        DataStreamSource<WordWithCount> ds = env.fromElements(
                new WordWithCount("hello", 1),
                new WordWithCount("good", 1),
                new WordWithCount("hello", 1),
                new WordWithCount("good", 2)
        );
        KeyedStream<WordWithCount, Tuple> word = ds.keyBy("word");
        SingleOutputStreamOperator<WordWithCount> res = word.sum("count");
        res.print();
        env.execute("pojo datasource");
    }

    public static class WordWithCount  {
        public String word;
        public int count;
        public WordWithCount() {} // the no-arg constructor is required!
        public WordWithCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WordWithCount{" +
                    "word='" + word + '\'' +
                    ", count=" + count +
                    '}';
        }
    }

}
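
As a side note, the same grouping can also be expressed with a KeySelector instead of a field-expression string, which avoids the reflective field lookup and gives a typed key. This is an alternative sketch, not the fix described above; it is a fragment meant to replace the keyBy line in the main method, and requires import org.apache.flink.api.java.functions.KeySelector;

// alternative to ds.keyBy("word"), using a KeySelector
KeyedStream<WordWithCount, String> byWord = ds.keyBy(new KeySelector<WordWithCount, String>() {
    @Override
    public String getKey(WordWithCount wc) {
        return wc.word; // typed key: the word field itself
    }
});
byWord.sum("count").print();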

Addendum:

When using Scala, a case class does not need an explicit public modifier; the compiler adds it automatically when generating the class file:

The Scala code:

package com.flink.example.datasource

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
object DataSourcePoJO {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val ds = env.fromElements(
      WordWithCount("hello", 1),
      WordWithCount("good", 1),
      WordWithCount("hello", 2),
      WordWithCount("good", 1)
    )
    // in the Scala API a case class is treated like a tuple, so positional keyBy also works
    //    ds.keyBy(0).sum(1).print() // OK
    ds.keyBy("word").sum("count").print()
    env.execute("custom datasource")
  }
  // define the case class
  case class WordWithCount(word:String, count:Int)
}

Locating DataSourcePoJO$WordWithCount.class among the generated class files and decompiling it gives:

public class DataSourcePoJO$WordWithCount implements Product, Serializable {
   private final String word;
   private final int count;
   public String word() {
      return this.word;
   }
   public int count() {
      return this.count;
   }
   public DataSourcePoJO$WordWithCount copy(String word, int count) {
      return new DataSourcePoJO$WordWithCount(word, count);
   }
// .... omitted
}

So the public modifier was indeed added, and each field got an accessor method. These accessors (word(), count()) are not named like Java getters (getWord()); this is presumably tied to how Flink's Scala API extracts fields from case classes, which I have not investigated further.

posted @ 2019-12-17 23:34 bartggg