Flink (5): Setting Up and Testing the Flink Development Environment in IDEA (3)

 

Developing an Offline Program in IDEA -- Batch Data Processing -- WordCountBatch

A DataSet program is the standard Flink batch program. A data set is initialized from a source (for example by reading a file or from a collection of elements), transformed into new data sets through transformations (filtering, mapping, joining, grouping), and finally written out through a sink, which can store the result in a distributed file system such as HDFS or simply print it to the console. Flink programs can run in several modes, such as local, a standalone Flink cluster, or YARN.
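The examples below build their input with fromElements and print to the console. To also make the file-based side of the source → transformation → sink lifecycle concrete, here is a minimal Java sketch that reads a text file and writes the word counts back to HDFS; the class name and the input/output paths are placeholders for illustration, and note that with a file sink (unlike print()) the job must be triggered explicitly with execute().

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class FileWordCountSketch {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Source: read lines from a text file (placeholder HDFS path; a local path also works)
        DataSet<String> text = env.readTextFile("hdfs://namenode:9000/input/words.txt");

        // Transformations: split each line into words, emit (word, 1), group by word, sum the counts
        DataSet<Tuple2<String, Integer>> counts = text
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.toLowerCase().split("\\W+")) {
                            if (!word.isEmpty()) {
                                out.collect(new Tuple2<>(word, 1));
                            }
                        }
                    }
                })
                .groupBy(0)
                .sum(1);

        // Sink: write the result to HDFS instead of printing (placeholder output path)
        counts.writeAsText("hdfs://namenode:9000/output/wordcount");

        // Without an eager print(), execute() is required to actually run the job
        env.execute("FileWordCountSketch");
    }
}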

1. Scala program

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object WordCountScala {
  def main(args: Array[String]) {
    // Initialize the execution environment
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Load data from strings
    val text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?")
    // Split the lines into words, emit (word, 1) tuples, group by the word, and sum the counts
    val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)
    // Print the result
    counts.print()
  }
}


2. Java program

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCountJava {
    public static void main(String[] args) throws Exception {
        // Build the execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Build a data set from strings
        DataSet<String> text = env.fromElements(
                "Who's there?",
                "I think I hear them. Stand, ho! Who's there?");
        // Split the lines, group by the word, and count occurrences of each word
        DataSet<Tuple2<String, Integer>> wordCounts = text
                .flatMap(new LineSplitter())
                .groupBy(0)
                .sum(1);
        // Print the result
        wordCounts.print();
    }

    // Splits a line into (word, 1) tuples
    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}


Run
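The word count can be run directly from IDEA. As noted in the introduction, the same code also runs on a standalone Flink cluster or on YARN: getExecutionEnvironment() returns a local environment inside the IDE and the cluster environment when the packaged jar is submitted with flink run. For completeness, here is a hedged sketch of creating the environment explicitly; the class name, host, port, jar path, and parallelism are placeholder values, not ones from this post.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class EnvironmentModesSketch {
    public static void main(String[] args) throws Exception {
        // Automatic choice: local environment in the IDE, cluster environment when
        // the packaged jar is submitted with "flink run" (standalone cluster or YARN)
        ExecutionEnvironment auto = ExecutionEnvironment.getExecutionEnvironment();

        // Explicit local environment with a fixed parallelism (placeholder value)
        ExecutionEnvironment local = ExecutionEnvironment.createLocalEnvironment(2);

        // Explicit remote environment: JobManager host/port plus the job jar
        // (all placeholder values, shown only as an alternative)
        ExecutionEnvironment remote = ExecutionEnvironment.createRemoteEnvironment(
                "jobmanager-host", 8081, "target/wordcount-job.jar");

        // The same pipeline runs unchanged on any of these environments
        DataSet<String> text = auto.fromElements("Who's there?", "I think I hear them.");
        text.print();
    }
}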

 

---- My own practice ----

Java code:

package WordCount;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

/**
 * @Author : ASUS and xinrong
 * @Version : 2020/8/29 & 1.0
 */
public class WordCountBatch {
    public static void main(String[] args) throws Exception {
        // 1. Create the runtime environment; it resolves to local or cluster automatically
        ExecutionEnvironment environment = ExecutionEnvironment.getExecutionEnvironment();
        // 2. Build a data set from strings
        DataSet<String> text = environment.fromElements("Who's there?", "I think I hear them. Stand, ho! Who's there?");
        /**
         * transformation
         * all transformation operations
         */
        DataSet<Tuple2<String, Integer>> wordCount = text.flatMap(new LinSplitter()).groupBy(0).sum(1); // custom tokenizer, then count per word
        wordCount.print();
        // Trigger execution: print() on the DataSet API already runs the job, so execute() is not needed here;
        // calling it anyway would print the result and then fail because no new sinks have been defined.
        //environment.execute();
    }

    /**
     * Custom tokenizer
     */
    private static class LinSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {

        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<>(word, 1));
            }
        }
    }
}

Result:

 

 

 
