SparkStreaming (2): WordCount
Requirement: use the netcat tool to keep sending data to port 9999, read the data from that port with Spark Streaming, and count how many times each distinct word appears.
Preparation
1) Reduce the excessive log output in the IDEA console
1. Go to the spark/conf directory, download the log4j.properties.template file to your local machine, and rename it to log4j.properties;
2. Copy the file into the resources directory of the Spark project;
3. Edit the file, replacing the WARN and INFO log levels with ERROR (for example, change log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console).
2) Install the netcat network tool and use it to simulate the data source, as shown below.
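A minimal way to start the data source, assuming a Linux host (chxy001 in the code below) with the nc/netcat package installed:

nc -lk 9999

Here -l listens for incoming connections and -k keeps the listener alive after a client disconnects; every line typed into this session is sent to port 9999 and becomes input for the streaming job.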
Add the dependency
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
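For reference, if the project uses sbt instead of Maven (an assumption about the build setup, with scalaVersion set to a 2.11 release), the equivalent dependency would be:

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.1"  // %% appends the _2.11 suffix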
Write the code
package sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {

  def main(args: Array[String]): Unit = {

    // 1. Initialize the Spark configuration
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamWordCount")

    // 2. Initialize the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // 3. Create a DStream by monitoring the port; data is read in line by line
    val lineStreams = ssc.socketTextStream("chxy001", 9999)

    // Split each line into individual words
    val wordStreams = lineStreams.flatMap(_.split(" "))

    // Map each word to a tuple (word, 1)
    val wordAndOneStreams = wordStreams.map((_, 1))

    // Count occurrences of the same word
    val wordAndCountStreams = wordAndOneStreams.reduceByKey(_ + _)

    // Print the result
    wordAndCountStreams.print()

    // Start the receiver
    ssc.start()

    // Keep the driver waiting for the receiver to finish
    ssc.awaitTermination()
  }
}
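To try it out, start the netcat listener from the preparation step on chxy001, run the program, and type a few words into the netcat session. Assuming the input hello world hello arrives within a single 5-second batch, the print() output should look roughly like:

-------------------------------------------
Time: ... ms
-------------------------------------------
(hello,2)
(world,1)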
Extensions
1) Commonly used constructors of StreamingContext
/**
 * Create a StreamingContext using an existing SparkContext.
 * @param sparkContext existing SparkContext
 * @param batchDuration the time interval at which streaming data will be divided into batches
 */
def this(sparkContext: SparkContext, batchDuration: Duration) = {
  this(sparkContext, null, batchDuration)
}

/**
 * Create a StreamingContext by providing the configuration necessary for a new SparkContext.
 * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
 * @param batchDuration the time interval at which streaming data will be divided into batches
 */
def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}
A StreamingContext is created by providing either a SparkConf object or an existing SparkContext object, together with a Duration object. Next, let's look at the Duration class:
case class Duration (private val millis: Long) {...}
Duration is a case class; it is constructed with a Long value in milliseconds that represents the batch (collection) interval. Specifying an interval such as one day in milliseconds would be very awkward, so Spark also provides other time units, exposed as singleton objects:
/**
 * Helper object that creates instance of [[org.apache.spark.streaming.Duration]] representing
 * a given number of seconds.
 */
object Seconds {
  def apply(seconds: Long): Duration = new Duration(seconds * 1000)
}
Under the hood it calls the Duration case class, multiplying the value it receives by 1000.
We then call this Seconds object:
new StreamingContext(conf, Seconds(5))
Note: this does not create an instance of a Seconds class; it invokes the apply method of the Seconds singleton object. See the post https://www.cnblogs.com/chxyshaodiao/p/12420320.html for details.
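Besides Seconds, the same package provides Milliseconds and Minutes helpers, and the first constructor above shows that a StreamingContext can also be built from an existing SparkContext. A short sketch of the one-day case mentioned earlier, assuming an already created SparkContext named sc (hypothetical here):

import org.apache.spark.streaming.{Minutes, StreamingContext}

// a batch interval of one day, without counting milliseconds by hand
val ssc = new StreamingContext(sc, Minutes(24 * 60))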
2) socketTextStream
/**
 * Creates an input stream from TCP source hostname:port. Data is received using
 * a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
 * lines.
 * @param hostname      Hostname to connect to for receiving data
 * @param port          Port to connect to for receiving data
 * @param storageLevel  Storage level to use for storing the received objects
 *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
 * @see [[socketStream]]
 */
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}
This method lets a StreamingContext create an input source from a remote hostname and port. Its doc comment tells us that the received bytes are interpreted as UTF-8 encoded, \n-delimited lines, which is exactly what the line-by-line processing above relies on.
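Because storageLevel has a default value, the WordCount example only passed the hostname and port. If the storage level for received blocks needs to be overridden, it can be supplied as the third argument; a minimal sketch (MEMORY_ONLY is just an illustration, not a recommendation):

import org.apache.spark.storage.StorageLevel

// same socket source, but keep received blocks in memory only, without replication
val lineStreams = ssc.socketTextStream("chxy001", 9999, StorageLevel.MEMORY_ONLY)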
