Big Data Real-Time Processing with Spark: Real-Time Stream Processing 3 (Spark Streaming API)

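Common output operations
1) So far, after a series of transformations, the code only prints its results to the console, which is fine for testing only. In practice the results have to be written somewhere.
2) Official documentation: Spark Streaming - Spark 3.2.0 Documentation (apache.org)
3) Output operations allow a DStream's data to be pushed out to external systems such as a database or a file system.
4) Output operations are what allow external systems to consume the transformed data.
5) print(): prints the first ten elements of every batch to the console.
6) saveAsTextFiles(prefix, [suffix]): saves every batch as text files. Drawback: each batch produces its own output, so the number of files grows quickly.
7) foreachRDD(func): applies func to every RDD of the stream; this is the operation used to write results to an external system (such as a database).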
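
To make the "too many files" drawback of saveAsTextFiles concrete, here is a minimal plain-Scala sketch of the naming rule given in the Spark Streaming guide, "prefix-TIME_IN_MS[.suffix]". It does not call Spark; OutputPathSketch, outputPath and the starting timestamp are hypothetical names and values used only for illustration.

```scala
// Hypothetical helper that mirrors the documented naming rule of saveAsTextFiles:
// "prefix-TIME_IN_MS[.suffix]". It does not use Spark at all.
object OutputPathSketch {

  // Builds the output name for one batch from the prefix, the optional suffix
  // and the batch time in milliseconds.
  def outputPath(prefix: String, suffix: String, batchTimeMs: Long): String =
    if (suffix.isEmpty) s"$prefix-$batchTimeMs"
    else s"$prefix-$batchTimeMs.$suffix"

  def main(args: Array[String]): Unit = {
    val batchIntervalMs = 5000L        // same 5-second batch interval as Seconds(5)
    val firstBatchMs = 1647410640000L  // assumed start time, for illustration only

    // Every batch gets its own name, so three batches give three separate outputs.
    (0 until 3).foreach { i =>
      println(outputPath("wc", "txt", firstBatchMs + i * batchIntervalMs))
    }
  }
}
```

With a 5-second batch interval this rule yields 720 separate outputs per hour (3600 / 5), and each output is itself a directory of part files, which is why foreachRDD with a database is the more practical choice for the word count.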

Writing the word counts to a database
1) Official documentation: Spark Streaming - Spark 3.2.0 Documentation (apache.org)
2) Log in to MySQL

[hadoop@spark000 ~]$ mysql -uroot -proot
Warning: Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.6.42 MySQL Community Server (GPL)

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

3) Create the table in MySQL

mysql> show databases;
mysql> create database jieqiong;
mysql> use jieqiong;
mysql> create table wc(
->    word varchar(20),
->    cnt int(10)
->    );

4) ForeachRDDNetworkWordCount.scala

package com.imooc.bigdata.ss

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

import java.sql.Connection

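/*
 * ForeachRDDNetworkWordCount: word-count analysis with Spark Streaming (SS), writing the result to MySQL
 * Data source: a network port; test data is typed in through nc
 *
 * MySQL table structure
 * create table wc(
   word varchar(20),
   cnt int(10)
   );
 *
 * Spark Streaming programming pattern
 * 1) Write the main method
 * 2) Find the entry point: new StreamingContext().var
 * 3) Create the SparkConf for the constructor: new SparkConf().var
 * 4) Argument 1: pass sparkConf to new StreamingContext()
 * 5) Argument 2: pass Seconds(5) to new StreamingContext()
 * 6) Generate ssc: new StreamingContext(sparkConf,Seconds(5)).var
 * 7) Connect to the network data
 *   ssc.socketTextStream("spark000",9527).var
 * 8) Business logic
 *   Start the streaming job: ssc.start()
 *   Input is comma-separated: map pairs every word with 1, then reduceByKey adds the 1s up per word.
     lines.flatMap(_.split(",")).map((_,1))
     .reduceByKey(_+_).var
 *   Print the result:
 *   Wait for the streaming job to terminate: ssc.awaitTermination()
 * 9) If the run reports an error, add the settings (setAppName and setMaster) to val sparkConf = new SparkConf()
 */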

object ForeachRDDNetworkWordCount {
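  // ***** Step 1
  /*
     A Spark Streaming program such as NetworkWordCount also starts from a main method.
     In the IDE, type main and press Enter.
   */
  def main(args: Array[String]): Unit = {
    // ***** Step 2
    /*
       As with Kafka, start by finding the entry point.
       Official guide: https://spark.apache.org/docs/latest/streaming-programming-guide.html
       To develop a Spark Streaming application, the entry point is a StreamingContext: new StreamingContext()
       To read the source, Ctrl-click into StreamingContext.scala

     * Description in StreamingContext.scala
       Main entry point for Spark Streaming functionality. It provides methods used to create DStream:
       [[org.apache.spark.streaming.dstream.DStream]]
       So what is a DStream?
       A DStream (discretized stream) is a continuous series of RDDs, one per batch interval.

     * At this point, hovering over StreamingContext() shows an error: the constructor cannot be resolved,
       because no arguments have been passed yet.
       Cannot resolve overloaded constructor `StreamingContext`
       A Scala class has a primary constructor and auxiliary constructors.

     * These are the three parameters of the primary constructor
     * class StreamingContext private[streaming] (
       _sc: SparkContext,
       _cp: Checkpoint,
       _batchDur: Duration
       )

     * Auxiliary constructor 1: takes a SparkContext
     * batchDuration is the batch interval
     * def this(sparkContext: SparkContext, batchDuration: Duration) = {
       this(sparkContext, null, batchDuration)
       }

     * Auxiliary constructor 2: takes a SparkConf
     * def this(conf: SparkConf, batchDuration: Duration) = {
       this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
       }
     */

    // ***** Step 3
    /* Create the SparkConf for the constructor
       new SparkConf().var
       Then pick the name sparkConf. Adding an explicit type is not recommended.
       When packaging the jar, comment out the two settings below (setAppName and setMaster).
     */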
    val sparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .setMaster("local[2]")

    // ***** Step 2 (continued)
    /*
      new StreamingContext()
     */

    // ***** Step 4
    /*
       Pass the sparkConf created in step 3 into new StreamingContext().
     */

    // ***** Step 5
    /*
     * Add the batch interval as a Duration (in milliseconds); see the source.
     * Use
     * object Seconds {
       def apply(seconds: Long): Duration = new Duration(seconds * 1000)
       }

     * Import Seconds from org.apache.spark.streaming and pass 5 to Seconds(),
     * which makes every 5 seconds one batch.
     */

    // ***** Step 6
    /*
       new StreamingContext(sparkConf,Seconds(5)).var
       Name the value ssc.
     */
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // TODO... connect to the business data
    // ***** Step 7: connect to the network data source
    /*
       Creates an input stream from TCP source hostname:port. Data is received using
       a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited lines.
     */
    val lines = ssc.socketTextStream("spark000", 9527)

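    // TODO... business logic
    // ***** Step 9: split the input on commas and count each word
    val result = lines.flatMap(_.split(",")).map((_,1))
      .reduceByKey(_+_)

    // ***** Step 10:
    // TODO... write the result to MySQL through the foreachRDD operator
    // Use the foreachRDD overload whose function returns Unit to iterate over the RDDs
    result.foreachRDD(rdd => {
      // For every RDD, handle each partition it contains
      rdd.foreachPartition(partition => {
        // At first a null connection was declared here as a placeholder:
        // val connection:Connection = _
        // Once MySQLUtils was ready, the placeholder was commented out and a real connection is opened
        val connection = MySQLUtils.getConnection()
        try {
          // Iterate over the partition; every element is a (word, count) pair
          partition.foreach(pair => {
            // Build the insert statement for this pair
            val sql = s"insert into wc(word,cnt) values('${pair._1}',${pair._2})"
            // Execute the statement
            connection.createStatement().execute(sql)
          })
        } finally {
          // Close the connection through MySQLUtils, even if an insert fails
          MySQLUtils.close(connection)
        }
      })
    })

    // ***** Step 8: start the streaming job, then wait for it to terminate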
    ssc.start()
    ssc.awaitTermination()
  }
}
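
To see what one batch does without a cluster or a database, the flow above can be sketched in plain Scala. This is a simulation under stated assumptions, not the Spark code itself: BatchWriteSketch and FakeConnection are hypothetical names, groupBy stands in for reduceByKey, and a nested Seq stands in for the partitions of an RDD. It shows the two ideas in the listing: the word count for a batch of comma-separated lines, and one connection per partition rather than one per record.

```scala
import scala.collection.mutable.ArrayBuffer

// Plain-Scala simulation of one batch; none of this is Spark or JDBC API.
object BatchWriteSketch {

  // Hypothetical stand-in for a JDBC connection: it only records inserted rows.
  final class FakeConnection {
    val rows: ArrayBuffer[(String, Int)] = ArrayBuffer.empty
    def insert(word: String, cnt: Int): Unit = rows += ((word, cnt))
  }

  // Same shape as flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _),
  // applied to the lines of a single batch.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(","))
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  // Opens one (fake) connection per partition and reuses it for every record,
  // which is the pattern used inside rdd.foreachPartition in the listing.
  def writePartitions(partitions: Seq[Seq[(String, Int)]]): Seq[FakeConnection] =
    partitions.map { partition =>
      val connection = new FakeConnection
      partition.foreach { case (word, cnt) => connection.insert(word, cnt) }
      connection
    }

  def main(args: Array[String]): Unit = {
    val counts = countWords(Seq("a,b,a", "b,c"))
    // Split the counted pairs into groups of two to imitate partitions.
    val partitions = counts.toSeq.sortBy(_._1).grouped(2).toSeq
    val connections = writePartitions(partitions)
    println(s"counts = $counts")
    println(s"connections = ${connections.size}, rows = ${connections.map(_.rows.size).sum}")
  }
}
```

Opening the connection inside foreachPartition keeps the number of connections proportional to the number of partitions instead of the number of records, and it also avoids creating the connection on the driver, where it could not be shipped to the executors.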

5) MySQLUtils.scala

package com.imooc.bigdata.ss

import java.sql.{Connection, DriverManager}

object MySQLUtils {

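  // Open a connection
  def getConnection(): Connection = {
    // Load the MySQL JDBC driver (the class name used by Connector/J 5.1)
    Class.forName("com.mysql.jdbc.Driver")
    // The URL path names only the database (jieqiong); the table (wc) is named in the SQL
    DriverManager.getConnection("jdbc:mysql://spark000:3306/jieqiong","root","root")
  }
  // Close the connection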
  def close(connection: Connection): Unit={
    if(null != connection) {
      connection.close()
    }
  }
}
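
The insert in ForeachRDDNetworkWordCount builds its SQL by string interpolation, which fails as soon as a word contains a single quote. One possible alternative, sketched here with the standard java.sql API, is a PreparedStatement with batching. WordCountWriter and save are hypothetical names that are not part of the original project; the table and column names are the ones from the wc table above.

```scala
import java.sql.Connection

// Hypothetical helper (not in the original project) that writes the pairs of
// one partition through a single prepared statement.
object WordCountWriter {

  // The ? placeholders let the JDBC driver handle quoting of the word value.
  val InsertSql: String = "insert into wc(word, cnt) values(?, ?)"

  // Queues one insert per (word, count) pair and sends them as one JDBC batch.
  def save(connection: Connection, pairs: Iterator[(String, Int)]): Unit = {
    val statement = connection.prepareStatement(InsertSql)
    try {
      pairs.foreach { case (word, cnt) =>
        statement.setString(1, word)
        statement.setInt(2, cnt)
        statement.addBatch()   // queue the row instead of executing it immediately
      }
      statement.executeBatch() // send the queued rows to the database
    } finally {
      statement.close()        // release the statement even if the batch fails
    }
  }
}
```

Inside rdd.foreachPartition this would be called as WordCountWriter.save(connection, partition) between MySQLUtils.getConnection() and MySQLUtils.close(connection), since the partition is already an Iterator of (word, count) pairs.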

6) pom.xml (parent)

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.47</version>
        </dependency>

Spark SQL example

Data

[hadoop@spark000 ~]$ cd $SPARK_HOME/examples/src/main/resources
[hadoop@spark000 resources]$ pwd
/home/hadoop/app/spark-3.0.0-bin-2.6.0-cdh5.16.2/examples/src/main/resources
[hadoop@spark000 resources]$ cat employees.json
{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4500}
{"name":"Justin", "salary":3500}
{"name":"Berta", "salary":4000}

Dependencies to add
spark-sql, spark-core

Using Spark Streaming together with Spark SQL

posted @ 2022-03-16 14:04 酱汁怪兽
