SparkStreaming (2): WordCount
Requirement: use the netcat tool to keep sending data to port 9999, read the data from that port with Spark Streaming, and count how many times each distinct word appears.
Preparation
1) Reduce the excessive log output in the IDEA console
1. Go to the spark/conf directory, download the log4j.properties.template file to your local machine, and rename it to log4j.properties;
2. Copy the file into the resources directory of the Spark project;
3. Edit the file, replacing the WARN and INFO log levels with ERROR (for example, change log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console).
2) Install the netcat network tool and use it to simulate the data source, as shown below.
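A minimal way to start the data source, assuming a Linux host (chxy001 in the code below) with the nc/netcat package installed:

nc -lk 9999

Here -l listens for incoming connections and -k keeps the listener alive after a client disconnects; every line typed into this session is sent to port 9999 and becomes input for the streaming job.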
Add the dependency
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
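For reference, if the project uses sbt instead of Maven (an assumption about the build setup, with scalaVersion set to a 2.11 release), the equivalent dependency would be:

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.1"  // %% appends the _2.11 suffix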
Write the code
package sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {

  def main(args: Array[String]): Unit = {

    // 1. Initialize the Spark configuration
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamWordCount")

    // 2. Initialize the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // 3. Create a DStream by monitoring the port; data is read in line by line
    val lineStreams = ssc.socketTextStream("chxy001", 9999)

    // Split each line into individual words
    val wordStreams = lineStreams.flatMap(_.split(" "))

    // Map each word to a tuple (word, 1)
    val wordAndOneStreams = wordStreams.map((_, 1))

    // Count occurrences of the same word
    val wordAndCountStreams = wordAndOneStreams.reduceByKey(_ + _)

    // Print the result
    wordAndCountStreams.print()

    // Start the receiver
    ssc.start()

    // Keep the driver waiting for the receiver to finish
    ssc.awaitTermination()
  }
}
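To try it out, start the netcat listener from the preparation step on chxy001, run the program, and type a few words into the netcat session. Assuming the input hello world hello arrives within a single 5-second batch, the print() output should look roughly like:

-------------------------------------------
Time: ... ms
-------------------------------------------
(hello,2)
(world,1)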
Extensions
1) Commonly used constructors of StreamingContext
/**
 * Create a StreamingContext using an existing SparkContext.
 * @param sparkContext existing SparkContext
 * @param batchDuration the time interval at which streaming data will be divided into batches
 */
def this(sparkContext: SparkContext, batchDuration: Duration) = {
  this(sparkContext, null, batchDuration)
}

/**
 * Create a StreamingContext by providing the configuration necessary for a new SparkContext.
 * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
 * @param batchDuration the time interval at which streaming data will be divided into batches
 */
def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}
A StreamingContext is created by providing either a SparkConf object or an existing SparkContext object, together with a Duration object. Next, let's look at the Duration class:
case class Duration (private val millis: Long) {...}
Duration is a case class; it is constructed with a Long value in milliseconds that represents the batch (collection) interval. Specifying an interval such as one day in milliseconds would be very awkward, so Spark also provides other time units, exposed as singleton objects:
/**
 * Helper object that creates instance of [[org.apache.spark.streaming.Duration]] representing
 * a given number of seconds.
 */
object Seconds {
  def apply(seconds: Long): Duration = new Duration(seconds * 1000)
}
Under the hood it calls the Duration case class, multiplying the value it receives by 1000.
We then call this Seconds object:
new StreamingContext(conf, Seconds(5))
Note: this does not create an instance of a Seconds class; it invokes the apply method of the Seconds singleton object. See the post https://www.cnblogs.com/chxyshaodiao/p/12420320.html for details.
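Besides Seconds, the same package provides Milliseconds and Minutes helpers, and the first constructor above shows that a StreamingContext can also be built from an existing SparkContext. A short sketch of the one-day case mentioned earlier, assuming an already created SparkContext named sc (hypothetical here):

import org.apache.spark.streaming.{Minutes, StreamingContext}

// a batch interval of one day, without counting milliseconds by hand
val ssc = new StreamingContext(sc, Minutes(24 * 60))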
2) socketTextStream
/**
 * Creates an input stream from TCP source hostname:port. Data is received using
 * a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
 * lines.
 * @param hostname      Hostname to connect to for receiving data
 * @param port          Port to connect to for receiving data
 * @param storageLevel  Storage level to use for storing the received objects
 *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
 * @see [[socketStream]]
 */
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}
This method lets a StreamingContext create an input source from a remote hostname and port. Its doc comment tells us that the received bytes are interpreted as UTF-8 encoded, \n-delimited lines, which is exactly what the line-by-line processing above relies on.
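Because storageLevel has a default value, the WordCount example only passed the hostname and port. If the storage level for received blocks needs to be overridden, it can be supplied as the third argument; a minimal sketch (MEMORY_ONLY is just an illustration, not a recommendation):

import org.apache.spark.storage.StorageLevel

// same socket source, but keep received blocks in memory only, without replication
val lineStreams = ssc.socketTextStream("chxy001", 9999, StorageLevel.MEMORY_ONLY)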
