Notes on a Flink error
While testing a Flink program locally, I hit this exception: org.apache.flink.api.common.InvalidProgramException: This type (GenericType<com.bart.flink.datasource.WordWithCount>) cannot be used as key
Walking the stack trace pointed me at the keyBy call. The offending code:
KeyedStream<WordWithCount, Tuple> word = ds.keyBy("word"); // "word" is a field name in the POJO
Strange: I had followed the example from the official docs, so why the error?
Official documentation
The example code from the official docs:
// some ordinary POJO (Plain old Java Object)
public class WC {
    public String word;
    public int count;
}
DataStream<WC> words = // [...]
DataStream<WC> wordCounts = words.keyBy("word").window(/*window specification*/);
My code:
class WordWithCount {
    public String word;
    public int count;

    public WordWithCount() {}

    public WordWithCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public String toString() {
        return "WordWithCount{" +
                "word='" + word + '\'' +
                ", count=" + count +
                '}';
    }
}
After much digging, I found the cause: I had defined the POJO as an inner class without the public modifier, which is why keyBy failed. Once I changed it to a public static inner class, the exception disappeared...
public static class WordWithCount {
    public String word;
    public int count;

    public WordWithCount() {}

    public WordWithCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public String toString() {
        return "WordWithCount{" +
                "word='" + word + '\'' +
                ", count=" + count +
                '}';
    }
}
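As a quick way to verify the fix (a sketch of mine, not from the original debugging session, using Flink's public TypeInformation API), you can print how Flink classifies the type, for example inside main():

import org.apache.flink.api.common.typeinfo.TypeInformation;

// A recognized POJO prints as PojoType<...>; a class that fails the POJO
// rules prints as GenericType<...> -- the same GenericType named in the
// exception message above.
System.out.println(TypeInformation.of(WordWithCount.class));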
I felt pretty silly; my own carelessness had caused the error...
After carefully rereading the POJO definition in the official Flink docs, I noticed two important points:
How Flink defines a POJO
Java and Scala classes are treated by Flink as a special POJO data type if they fulfill the following requirements:
The class must be public.
It must have a public constructor without arguments (default constructor).
All fields are either public or must be accessible through getter and setter functions. For a field called foo the getter and setter methods must be named getFoo() and setFoo().
The type of a field must be supported by Flink. At the moment, Flink uses Avro to serialize arbitrary objects (such as Date).
To paraphrase, with a few notes of my own:
A Java or Scala class is treated by Flink as a POJO only if:
1. The class is public (very important; in particular, a Java inner class must be a public static inner class);
2. It has a public no-arg (default) constructor;
3. Every field is either public, or private with getter/setter methods: a field foo needs getFoo() and setFoo() (see the sketch after this list);
4. Every field type is supported by Flink. Since Flink is written in Java this is mostly satisfied, but note that if a field is itself a POJO, that nested POJO type needs to override the hashCode() method.
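For point 3, here is a minimal sketch of my own (the class name PrivateWordCount is made up, not from the docs) of a POJO with private fields that Flink should still accept, since it follows the getFoo()/setFoo() convention:

public static class PrivateWordCount {
    private String word; // private fields are allowed...
    private int count;

    public PrivateWordCount() {} // ...given a public no-arg constructor

    // ...and getters/setters named after the fields
    public String getWord() { return word; }
    public void setWord(String word) { this.word = word; }

    public int getCount() { return count; }
    public void setCount(int count) { this.count = count; }
}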
Point 4 deserves attention: the Flink docs explicitly spell out the requirements for a field used in keyBy:
Flink docs on keyBy
An excerpt:
A type cannot be a key if:
it is a POJO type but does not override the hashCode() method and relies on the Object.hashCode() implementation.
it is an array of any type.
In plain terms, a type cannot (note: cannot) be used as a key if:
1. it is a POJO that does not override hashCode() and instead relies on the Object.hashCode() implementation;
2. it is an array of any type.
Put another way, as long as a type avoids both of these cases, it can be used as a key.
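Tying the two lists together: if the field you key by is itself a POJO (a composite key), that type must override hashCode(), or keyBy will reject it. A minimal sketch under that assumption; the CompositeKey class and its fields are my own illustration:

import java.util.Objects;

public static class CompositeKey {
    public String word;
    public int bucket;

    public CompositeKey() {}

    // Required for keying: without this override the type would rely on
    // Object.hashCode() and be rejected as a key.
    @Override
    public int hashCode() {
        return Objects.hash(word, bucket);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CompositeKey)) return false;
        CompositeKey that = (CompositeKey) o;
        return bucket == that.bucket && Objects.equals(word, that.word);
    }
}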
Here is the complete code I tested with:
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataSourcePoJO {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); // parallelism 1 for easier observation

        DataStreamSource<WordWithCount> ds = env.fromElements(
                new WordWithCount("hello", 1),
                new WordWithCount("good", 1),
                new WordWithCount("hello", 1),
                new WordWithCount("good", 2)
        );

        KeyedStream<WordWithCount, Tuple> word = ds.keyBy("word");
        SingleOutputStreamOperator<WordWithCount> res = word.sum("count");
        res.print();

        env.execute("pojo datasource");
    }

    public static class WordWithCount {
        public String word;
        public int count;

        public WordWithCount() {} // required!

        public WordWithCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WordWithCount{" +
                    "word='" + word + '\'' +
                    ", count=" + count +
                    '}';
        }
    }
}
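One aside, separate from the fix itself: newer Flink releases deprecate string-based field expressions like keyBy("word") in favor of a typed KeySelector, which avoids the opaque Tuple key type. A sketch against the same WordWithCount class:

// Typed key selector: the key is a String rather than a Tuple,
// and the field reference is checked by the compiler.
KeyedStream<WordWithCount, String> keyed = ds.keyBy(w -> w.word);
keyed.sum("count").print();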
Addendum:
When defining a case class in Scala you do not need to add a public modifier; the Scala compiler adds it automatically when generating the class file.
The Scala code:
package com.flink.example.datasource

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._

object DataSourcePoJO {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val ds = env.fromElements(
      WordWithCount("hello", 1),
      WordWithCount("good", 1),
      WordWithCount("hello", 2),
      WordWithCount("good", 1)
    )

    // For Scala, the case class is automatically treated as a tuple for keyBy
    // ds.keyBy(0).sum(1).print() // OK
    ds.keyBy("word").sum("count").print()

    env.execute("custom datasource")
  }

  // define the case class
  case class WordWithCount(word: String, count: Int)
}
Decompiling the generated DataSourcePoJO$WordWithCount.class gives:
public class DataSourcePoJO$WordWithCount implements Product, Serializable {
    private final String word;
    private final int count;

    public String word() {
        return this.word;
    }

    public int count() {
        return this.count;
    }

    public DataSourcePoJO$WordWithCount copy(String word, int count) {
        return new DataSourcePoJO$WordWithCount(word, count);
    }

    // ... omitted
}
So the public modifier was indeed added, and every field gets an accessor method. Note that these accessors (word(), count()) do not follow the Java getFoo() convention; this presumably works because Flink's Scala API has its own mechanism for accessing case class fields, which I have not looked into further.