SparkStreaming (Part 2): WordCount

Requirement: use the netcat tool to keep sending data to port 9999, and use SparkStreaming to read the data from that port and count how many times each distinct word appears.

 

Preparation

1) Reduce the excessive log output in the IDEA console

1. In the spark/conf directory, download the log4j.properties.template file to your local machine and rename it to log4j.properties;

2. Copy the file into the resources directory of the Spark project;

3. Edit the file and replace the WARN and INFO log levels with ERROR (see the sketch below).
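
A minimal sketch of the key line after the change, assuming the stock Spark 2.x log4j.properties.template (the exact appender name may differ between versions):

# Only ERROR-level messages reach the IDEA console
log4j.rootCategory=ERROR, console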

 

2) Install the netcat network tool and use it to simulate the data source (see the example below).
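
On a Linux machine with netcat installed, the data source can be simulated by starting a listener on port 9999 (a sketch; whether the command is nc or netcat depends on your distribution):

nc -lk 9999

The -l flag listens on the port and -k keeps the listener alive across client disconnects; every line typed into this session is pushed to the connected streaming receiver.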

 

Add the dependency

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

 

Write the code

package sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {

  def main(args: Array[String]): Unit = {

    // 1. Initialize the Spark configuration
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamWordCount")

    // 2. Initialize the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // 3. Create a DStream by monitoring the port; the data comes in line by line
    val lineStreams = ssc.socketTextStream("chxy001", 9999)

    // Split each line into individual words
    val wordStreams = lineStreams.flatMap(_.split(" "))

    // Map each word to a tuple (word, 1)
    val wordAndOneStreams = wordStreams.map((_, 1))

    // Sum the counts of identical words
    val wordAndCountStreams = wordAndOneStreams.reduceByKey(_+_)

    // Print the result of each batch
    wordAndCountStreams.print()

    // Start the receiver
    ssc.start()
    // Keep the driver alive, waiting for the receiver to run
    ssc.awaitTermination()
  }
}
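
To try it out, first start the netcat listener described above, then launch WordCount from IDEA and type a few words into the netcat session. Roughly every 5 seconds a new batch result is printed; assuming the input line is "hello world hello", the output of print() looks roughly like this (the timestamp will of course differ):

-------------------------------------------
Time: 1583391600000 ms
-------------------------------------------
(hello,2)
(world,1)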

 

Extensions

1) Commonly used StreamingContext constructors

/**
   * Create a StreamingContext using an existing SparkContext.
   * @param sparkContext existing SparkContext
   * @param batchDuration the time interval at which streaming data will be divided into batches
   */
  def this(sparkContext: SparkContext, batchDuration: Duration) = {
    this(sparkContext, null, batchDuration)
  }

  /**
   * Create a StreamingContext by providing the configuration necessary for a new SparkContext.
   * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
   * @param batchDuration the time interval at which streaming data will be divided into batches
   */
  def this(conf: SparkConf, batchDuration: Duration) = {
    this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
  }
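
The second constructor is the one used in the WordCount example above. The first one is handy when a SparkContext already exists, for example when batch and streaming code live in the same application; a minimal sketch (the app name here is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Reuse an existing SparkContext instead of letting StreamingContext create one
val conf = new SparkConf().setMaster("local[*]").setAppName("StreamingFromExistingSC")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(5))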

So a StreamingContext is created from either a SparkConf object or an existing SparkContext object, plus a Duration object. Next, let's look at the Duration object:

case class Duration (private val millis: Long) {...}

Duration is a case class that takes a single Long value in milliseconds, representing the batch (collection) interval. If you wanted to set the interval to, say, one day, expressing it in milliseconds would be very inconvenient, so Spark also provides other time units, exposed as singleton objects:

/**
 * Helper object that creates instance of [[org.apache.spark.streaming.Duration]] representing
 * a given number of seconds.
 */
object Seconds {
  def apply(seconds: Long): Duration = new Duration(seconds * 1000)
}

Under the hood it calls the Duration case class, multiplying the argument by 1000.

We then use Seconds like this:

new StreamingContext(conf,Seconds(5))

Note: this does not create an instance of Seconds; it calls the apply method of the Seconds singleton object. See https://www.cnblogs.com/chxyshaodiao/p/12420320.html for details.
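
As a quick sketch, Seconds(5) is just sugar for Seconds.apply(5), and org.apache.spark.streaming ships similar helpers such as Milliseconds and Minutes that are built the same way:

import org.apache.spark.streaming.{Duration, Milliseconds, Minutes, Seconds}

val a: Duration = Seconds(5)        // sugar for Seconds.apply(5), i.e. Duration(5000)
val b: Duration = Seconds.apply(5)  // identical to the line above
val c: Duration = Minutes(1)        // Duration(60000)
val d: Duration = Milliseconds(500) // Duration(500)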

 

2) socketTextStream

/**
   * Creates an input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
   * lines.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   * @see [[socketStream]]
   */
  def socketTextStream(
      hostname: String,
      port: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
  }

This method lets the StreamingContext create an input stream from a remote host and port. As the doc comment says, the received bytes are interpreted as UTF-8 encoded, \n-delimited lines, which is exactly the per-line input our word-count processing relies on. The third parameter controls how the received blocks are stored and defaults to StorageLevel.MEMORY_AND_DISK_SER_2; an example of overriding it is shown below.
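
For example, the storage level can be overridden when the default serialized, replicated storage is not needed (a sketch; whether MEMORY_ONLY is acceptable depends on how much data loss you can tolerate):

import org.apache.spark.storage.StorageLevel

// Keep received blocks in memory only, without replication
val lines = ssc.socketTextStream("chxy001", 9999, StorageLevel.MEMORY_ONLY)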

 
