Notes on a Flink Error

While testing a Flink program locally, it threw this exception: org.apache.flink.api.common.InvalidProgramException: This type (GenericType<com.bart.flink.datasource.WordWithCount>) cannot be used as key
Following the exception stack trace, the problem was traced to the keyBy call. The offending code:

KeyedStream<WordWithCount, Tuple> word = ds.keyBy("word"); // "word" is a field name of the POJO

Strange: the code follows the official example, so why does it fail?
Official documentation
The example given on the official site:

// some ordinary POJO (Plain old Java Object)
public class WC {
  public String word;
  public int count;
}
DataStream<WC> words = // [...]
DataStream<WC> wordCounts = words.keyBy("word").window(/*window specification*/);

My code:

class WordWithCount  {

    public String word;
    public int count;

    public WordWithCount() {}

    public WordWithCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public String toString() {
        return "WordWithCount{" +
                "word='" + word + '\'' +
                ", count=" + count +
                '}';
    }
}

After much digging, the cause turned out to be that I had defined the POJO as an inner class without the public modifier, which is why keyBy failed. Changing it to a public static nested class made the exception disappear...

    public static class WordWithCount  {

        public String word;
        public int count;

        public WordWithCount() {}

        public WordWithCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WordWithCount{" +
                    "word='" + word + '\'' +
                    ", count=" + count +
                    '}';
        }
    }

What a trap; entirely my own carelessness...
A careful read of the POJO definition in the official Flink documentation then revealed some very important points:
How Flink defines a POJO

Java and Scala classes are treated by Flink as a special POJO data type if they fulfill the following requirements:
The class must be public.
It must have a public constructor without arguments (default constructor).
All fields are either public or must be accessible through getter and setter functions. For a field called foo the getter and setter methods must be named getFoo() and setFoo().
The type of a field must be supported by Flink. At the moment, Flink uses Avro to serialize arbitrary objects (such as Date).

In plain terms:

In Flink, a Java or Scala class is treated as a POJO only if it meets all of the following conditions:
1. The class must be public (very important: in particular, when defining a nested class in Java, make sure it is a public static class);
2. It must have a public no-argument constructor;
3. All fields must be public, or, if private, accessible through getters and setters: for a field foo there must be getFoo() and setFoo() methods (see the sketch after this list);
4. The type of every field must be supported by Flink. Since most common Java types are supported, this is usually satisfied; but note that if a field is itself a POJO and is used as a key, that nested POJO must override the hashCode() method.
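For example, requirement 3 allows private fields as long as matching getters and setters exist. A minimal sketch (the Event class and its fields are hypothetical, not from my program):

public static class Event {
    private String name;       // private is fine because getName()/setName() exist
    private long timestamp;

    public Event() {}          // public no-arg constructor (requirement 2)

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public long getTimestamp() { return timestamp; }
    public void setTimestamp(long timestamp) { this.timestamp = timestamp; }
}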

Point 4 deserves special attention: the Flink documentation explicitly states the requirements for a type used as a keyBy key.
Flink documentation on keyBy
An excerpt:

A type cannot be a key if:
it is a POJO type but does not override the hashCode() method and relies on the Object.hashCode() implementation.
it is an array of any type.

In short:

A type cannot (note: cannot) be used as a keyBy key if:
1. it is a POJO that does not override the hashCode() method and relies on the Object.hashCode() implementation;
2. it is an array of any type.

In other words, as long as a type avoids the two cases above, it can be used as a key.
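For instance, keying by a field whose type is itself a POJO requires that nested POJO to override hashCode() (and, in practice, equals() as well, so that equal keys are grouped together). A minimal sketch, assuming a hypothetical composite key class WordKey:

public static class WordKey {
    public String word;
    public int shard;

    public WordKey() {}

    @Override
    public int hashCode() {
        // required: relying on Object.hashCode() would make this type unusable as a key
        return java.util.Objects.hash(word, shard);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WordKey)) return false;
        WordKey other = (WordKey) o;
        return java.util.Objects.equals(word, other.word) && shard == other.shard;
    }
}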

Here is the complete code I used for testing:

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;


public class DataSourcePoJO {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); // parallelism 1 to make the output easier to follow
        DataStreamSource<WordWithCount> ds = env.fromElements(
                new WordWithCount("hello", 1),
                new WordWithCount("good", 1),
                new WordWithCount("hello", 1),
                new WordWithCount("good", 2)
        );
        KeyedStream<WordWithCount, Tuple> word = ds.keyBy("word");
        SingleOutputStreamOperator<WordWithCount> res = word.sum("count");
        res.print();
        env.execute("pojo datasource");
    }

    public static class WordWithCount  {
        public String word;
        public int count;
        public WordWithCount() {} // the no-arg constructor is required!
        public WordWithCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WordWithCount{" +
                    "word='" + word + '\'' +
                    ", count=" + count +
                    '}';
        }
    }

}
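
As a side note, the same grouping can also be expressed with a KeySelector instead of a field-expression string, which avoids the reflective field lookup and gives a typed key. This is an alternative sketch, not the fix described above; it is a fragment meant to replace the keyBy line in the main method, and requires import org.apache.flink.api.java.functions.KeySelector;

// alternative to ds.keyBy("word"), using a KeySelector
KeyedStream<WordWithCount, String> byWord = ds.keyBy(new KeySelector<WordWithCount, String>() {
    @Override
    public String getKey(WordWithCount wc) {
        return wc.word; // typed key: the word field itself
    }
});
byWord.sum("count").print();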

Addendum:

When using Scala, a case class does not need an explicit public modifier; the compiler adds it automatically when generating the class file:

The Scala code:

package com.flink.example.datasource

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
object DataSourcePoJO {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val ds = env.fromElements(
      WordWithCount("hello", 1),
      WordWithCount("good", 1),
      WordWithCount("hello", 2),
      WordWithCount("good", 1)
    )
    // in the Scala API a case class is treated like a tuple, so positional keyBy also works
    //    ds.keyBy(0).sum(1).print() // OK
    ds.keyBy("word").sum("count").print()
    env.execute("custom datasource")
  }
  // define the case class
  case class WordWithCount(word:String, count:Int)
}

Locating DataSourcePoJO$WordWithCount.class among the generated class files and decompiling it gives:

public class DataSourcePoJO$WordWithCount implements Product, Serializable {
   private final String word;
   private final int count;
   public String word() {
      return this.word;
   }
   public int count() {
      return this.count;
   }
   public DataSourcePoJO$WordWithCount copy(String word, int count) {
      return new DataSourcePoJO$WordWithCount(word, count);
   }
// .... omitted
}

So the public modifier was indeed added, and each field got an accessor method. These accessors (word(), count()) are not named like Java getters (getWord()); this is presumably tied to how Flink's Scala API extracts fields from case classes, which I have not investigated further.

posted @ 2019-12-17 23:34 bartggg