Flink WordCount Explained
Preparing the pom dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.flink</groupId>
    <artifactId>flinkdemo</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>1.11.1</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients
             Since Flink 1.11, flink-streaming-java no longer depends on flink-clients -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>1.11.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>1.11.1</version>
        </dependency>
    </dependencies>
</project>
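One detail worth noting: flink-streaming-scala above is declared with provided scope, which keeps it off the runtime classpath of the packaged jar (the cluster supplies it). A side effect is that running the examples below directly from an IDE may fail with a NoClassDefFoundError. A common workaround (an assumption about your setup, not part of the original post) is to drop the provided scope for local runs, or in IntelliJ to tick "Include dependencies with 'Provided' scope" in the run configuration:

```xml
<!-- For local IDE runs only: same dependency without provided scope,
     so the Flink classes are on the runtime classpath -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.11.1</version>
</dependency>
```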
Flink WordCount comes in two flavors: batch processing and stream processing.
1. Batch processing:
import org.apache.flink.api.scala.ExecutionEnvironment

object DataSetFlink {
  def main(args: Array[String]): Unit = {
    val environment = ExecutionEnvironment.getExecutionEnvironment // get the batch execution environment
    import org.apache.flink.api.scala._
    val data = environment.readTextFile("D:\\a.txt")
    val result = data.flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map((_, 1))
      .groupBy(0)
      .sum(1)
    result.print() // for a batch job, print() triggers execution; the job ends once it finishes
  }
}
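To see what groupBy(0).sum(1) actually computes, here is the same word-count logic written with plain Scala collections (a sketch for illustration only; it runs in memory rather than on Flink):

```scala
// Plain-Scala equivalent of the Flink pipeline above:
// split lines into words, drop empties, pair each word with 1,
// group by the word (field 0) and sum the counts (field 1).
object WordCountLogic {
  def count(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}
```

For example, `WordCountLogic.count(Seq("hello world", "hello"))` yields `Map("hello" -> 2, "world" -> 1)`, matching the (word, count) tuples the batch job prints.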
Result: (output screenshot omitted)
2. Stream processing:
Prepare a machine (e.g. a virtual machine) and open a listening port, for example: nc -l 7777
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object StreamingFlink {
  def main(args: Array[String]): Unit = {
    val environment = StreamExecutionEnvironment.getExecutionEnvironment // get the streaming environment
    // wrap the external command-line arguments
    val tool = ParameterTool.fromArgs(args)
    val host: String = tool.get("host")
    val port = tool.getInt("port")
    val datastream = environment.socketTextStream(host, port)
    import org.apache.flink.api.scala._
    val value = datastream
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(a => (a, 1))
      .keyBy(0) // streaming uses keyBy, whereas batch uses groupBy
      .sum(1)
      .print()
    environment.execute() // execute() starts the job and keeps the streaming environment listening
  }
}
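The host and port are supplied as program arguments, e.g. --host node1 --port 7777. ParameterTool.fromArgs pairs up such "--key value" arguments, roughly like this pure-Scala sketch (illustrative only; Flink's real implementation handles more edge cases such as defaults and -key syntax):

```scala
// Rough sketch of how "--key value" program arguments get paired into a map.
// Not Flink's actual ParameterTool code; just the core pairing idea.
object ArgSketch {
  def parse(args: Array[String]): Map[String, String] =
    args.sliding(2, 2).collect {
      case Array(key, value) if key.startsWith("--") =>
        key.stripPrefix("--") -> value
    }.toMap
}
```

So `ArgSketch.parse(Array("--host", "node1", "--port", "7777"))` gives `Map("host" -> "node1", "port" -> "7777")`, which is what `tool.get("host")` and `tool.getInt("port")` then read.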
Result: (output screenshot omitted)
PS: Note that in the streaming output, the number before each record (e.g. "3>") is the index of the parallel subtask that emitted it; the default parallelism equals the number of CPU cores on the machine.
posted on 2020-11-05 19:11 by RICH-ATONE