Overview

  • The overall architecture of the project is as follows:

  

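  • On the Spark Streaming side:
    1. Flume to Spark Streaming: for simplicity, the push-based approach is used. It can lose data, but it is simple to set up.
    2. Spark Streaming's micro-batch architecture is a good fit for this real-time hot-search application.
    3. Real-time hot search is implemented with a sliding window. The batch interval is still to be decided (around 1 min), as is the window length (3 to N × batch interval); the slide interval equals the batch interval.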
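
To make the relationship between batch interval, window and slide concrete, here is a minimal plain-Scala sketch (no Spark involved). The per-batch counts are made-up sample data, and `WindowSketch`, `merge` and `windows` are hypothetical names used only for illustration. With a window of 3 batches and a slide of 1 batch, each window merges the counts of the 3 most recent batches.

```scala
object WindowSketch {
  // Hypothetical keyword counts, one Map per batch interval (oldest first).
  val batches: Vector[Map[String, Int]] = Vector(
    Map("spark" -> 2),
    Map("spark" -> 1, "flume" -> 3),
    Map("flume" -> 1),
    Map("kafka" -> 4)
  )

  // Merge the counts of all batches that fall into one window.
  def merge(window: Seq[Map[String, Int]]): Map[String, Int] =
    window.flatten.groupBy(_._1).map { case (word, kvs) => word -> kvs.map(_._2).sum }

  // Slide = 1 batch. Only full windows are produced here; this sketch does not
  // model the partially filled windows at the start of a stream.
  def windows(windowBatches: Int): Vector[Map[String, Int]] =
    batches.sliding(windowBatches, 1).map(merge).toVector
}
```

With this sample data, `WindowSketch.windows(3)` yields two windows: the first covers batches 1 to 3 (spark 3, flume 4), the second covers batches 2 to 4 (spark 1, flume 4, kafka 4).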

Step 1: Flume Configuration

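  • On the Flume side, the earlier configuration is extended to multiple channels plus multiple sinks, so that events are delivered to both HDFS and Spark Streaming. For the Hadoop-side configuration, see the post "Nginx+Flume+Hadoop log analysis, Ngram+AutoComplete".
  • For the Flume + Spark Streaming integration, the push-based approach is used for now. It is simple, but its fault tolerance is weak.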

Flume: multiple channels & multiple sinks

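  • First, Flume supports multiple channels plus multiple sinks.
  • Implementation:
    1. Add the new channel and sink to the agent's channels and sinks lists:
      • clusterLogAgent.sinks = HDFS sink2
        clusterLogAgent.channels = ch1 ch2
    2. Choose the selector
      • As covered in the earlier post on reading the Flume source code, the replicating selector is used here: it copies every event from the source into each of the channels.
    3. The multi-sink setup belongs on the Hadoop cluster side: one avro sink plus one HDFS sink. Configuring it on the web server side would, in theory, waste network bandwidth.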
  • A test with FlumeEventCount.scala went fine; the output is shown below.
    • On the web server, run: bin/flume-ng agent -n WebAccLo -c conf -f conf/flume-avro.conf
    • On the Spark side, run: ./bin/spark-submit --class com.wttttt.spark.FlumeEventCount --master yarn --deploy-mode client --driver-memory 1g --executor-memory 1g --executor-cores 2 /home/hhh/RealTimeLog.jar 10.3.242.99 4545 30000
-------------------------------------------
Time: 1495098360000 ms
-------------------------------------------
Received 10 flume events.

17/05/18 17:06:00 INFO JobScheduler: Finished job streaming job 1495098360000 ms.0 from job set of time 1495098360000 ms
17/05/18 17:06:00 INFO JobScheduler: Total delay: 0.298 s for time 1495098360000 ms (execution: 0.216 s)
17/05/18 17:06:00 INFO ReceivedBlockTracker: Deleting batches: 
17/05/18 17:06:00 INFO InputInfoTracker: remove old batch metadata: 
17/05/18 17:06:13 INFO BlockManagerInfo: Added input-0-1495098373000 in memory on host99:42342 (size: 319.0 B, free: 366.3 MB)
17/05/18 17:06:17 INFO BlockManagerInfo: Added input-0-1495098377200 in memory on host99:42342 (size: 620.0 B, free: 366.3 MB)
17/05/18 17:06:20 INFO BlockManagerInfo: Added input-0-1495098380400 in memory on host99:42342 (size: 620.0 B, free: 366.3 MB)
17/05/18 17:06:30 INFO JobScheduler: Added jobs for time 1495098390000 ms
17/05/18 17:06:30 INFO JobScheduler: Starting job streaming job 1495098390000 ms.0 from job set of time 1495098390000 ms
17/05/18 17:06:30 INFO SparkContext: Starting job: print at FlumeEventCount.scala:30
17/05/18 17:06:30 INFO DAGScheduler: Registering RDD 14 (union at DStream.scala:605)
17/05/18 17:06:30 INFO DAGScheduler: Got job 4 (print at FlumeEventCount.scala:30) with 1 output partitions
17/05/18 17:06:30 INFO DAGScheduler: Final stage: ResultStage 8 (print at FlumeEventCount.scala:30)
17/05/18 17:06:30 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 7)
17/05/18 17:06:30 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 7)
17/05/18 17:06:30 INFO DAGScheduler: Submitting ShuffleMapStage 7 (UnionRDD[14] at union at DStream.scala:605), which has no missing parents
17/05/18 17:06:30 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 3.3 KB, free 399.5 MB)
17/05/18 17:06:30 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 2.0 KB, free 399.5 MB)
17/05/18 17:06:30 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 10.3.242.99:36107 (size: 2.0 KB, free: 399.6 MB)
17/05/18 17:06:30 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:996
17/05/18 17:06:30 INFO DAGScheduler: Submitting 4 missing tasks from ShuffleMapStage 7 (UnionRDD[14] at union at DStream.scala:605)
17/05/18 17:06:30 INFO YarnScheduler: Adding task set 7.0 with 4 tasks
17/05/18 17:06:30 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 88, host99, executor 1, partition 0, NODE_LOCAL, 7290 bytes)
17/05/18 17:06:30 INFO TaskSetManager: Starting task 3.0 in stage 7.0 (TID 89, host101, executor 2, partition 3, PROCESS_LOCAL, 7470 bytes)
17/05/18 17:06:30 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on host99:42342 (size: 2.0 KB, free: 366.3 MB)
17/05/18 17:06:30 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on host101:45692 (size: 2.0 KB, free: 366.3 MB)
17/05/18 17:06:30 INFO TaskSetManager: Starting task 1.0 in stage 7.0 (TID 90, host99, executor 1, partition 1, NODE_LOCAL, 7290 bytes)
17/05/18 17:06:30 INFO TaskSetManager: Finished task 0.0 in stage 7.0 (TID 88) in 22 ms on host99 (executor 1) (1/4)
17/05/18 17:06:30 INFO TaskSetManager: Finished task 3.0 in stage 7.0 (TID 89) in 27 ms on host101 (executor 2) (2/4)
17/05/18 17:06:30 INFO TaskSetManager: Starting task 2.0 in stage 7.0 (TID 91, host99, executor 1, partition 2, NODE_LOCAL, 7290 bytes)
17/05/18 17:06:30 INFO TaskSetManager: Finished task 1.0 in stage 7.0 (TID 90) in 12 ms on host99 (executor 1) (3/4)
17/05/18 17:06:30 INFO TaskSetManager: Finished task 2.0 in stage 7.0 (TID 91) in 11 ms on host99 (executor 1) (4/4)
17/05/18 17:06:30 INFO YarnScheduler: Removed TaskSet 7.0, whose tasks have all completed, from pool 
17/05/18 17:06:30 INFO DAGScheduler: ShuffleMapStage 7 (union at DStream.scala:605) finished in 0.045 s
17/05/18 17:06:30 INFO DAGScheduler: looking for newly runnable stages
17/05/18 17:06:30 INFO DAGScheduler: running: Set(ResultStage 2)
17/05/18 17:06:30 INFO DAGScheduler: waiting: Set(ResultStage 8)
17/05/18 17:06:30 INFO DAGScheduler: failed: Set()
17/05/18 17:06:30 INFO DAGScheduler: Submitting ResultStage 8 (MapPartitionsRDD[17] at map at FlumeEventCount.scala:30), which has no missing parents
17/05/18 17:06:30 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 3.8 KB, free 399.5 MB)
17/05/18 17:06:30 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 2.1 KB, free 399.5 MB)
17/05/18 17:06:30 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on 10.3.242.99:36107 (size: 2.1 KB, free: 399.6 MB)
17/05/18 17:06:30 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:996
17/05/18 17:06:30 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 8 (MapPartitionsRDD[17] at map at FlumeEventCount.scala:30)
17/05/18 17:06:30 INFO YarnScheduler: Adding task set 8.0 with 1 tasks
17/05/18 17:06:30 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 92, host101, executor 2, partition 0, NODE_LOCAL, 7069 bytes)
17/05/18 17:06:30 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on host101:45692 (size: 2.1 KB, free: 366.3 MB)
17/05/18 17:06:30 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to 10.3.242.101:41672
17/05/18 17:06:30 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 164 bytes
17/05/18 17:06:30 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 92) in 26 ms on host101 (executor 2) (1/1)
17/05/18 17:06:30 INFO YarnScheduler: Removed TaskSet 8.0, whose tasks have all completed, from pool 
17/05/18 17:06:30 INFO DAGScheduler: ResultStage 8 (print at FlumeEventCount.scala:30) finished in 0.027 s
17/05/18 17:06:30 INFO DAGScheduler: Job 4 finished: print at FlumeEventCount.scala:30, took 0.089621 s
17/05/18 17:06:30 INFO SparkContext: Starting job: print at FlumeEventCount.scala:30
17/05/18 17:06:30 INFO DAGScheduler: Got job 5 (print at FlumeEventCount.scala:30) with 3 output partitions
17/05/18 17:06:30 INFO DAGScheduler: Final stage: ResultStage 10 (print at FlumeEventCount.scala:30)
17/05/18 17:06:30 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 9)
17/05/18 17:06:30 INFO DAGScheduler: Missing parents: List()
17/05/18 17:06:30 INFO DAGScheduler: Submitting ResultStage 10 (MapPartitionsRDD[17] at map at FlumeEventCount.scala:30), which has no missing parents
17/05/18 17:06:30 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 3.8 KB, free 399.5 MB)
17/05/18 17:06:30 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 2.1 KB, free 399.5 MB)
17/05/18 17:06:30 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on 10.3.242.99:36107 (size: 2.1 KB, free: 399.6 MB)
17/05/18 17:06:30 INFO SparkContext: Created broadcast 10 from broadcast at DAGScheduler.scala:996
17/05/18 17:06:30 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 10 (MapPartitionsRDD[17] at map at FlumeEventCount.scala:30)
17/05/18 17:06:30 INFO YarnScheduler: Adding task set 10.0 with 3 tasks
17/05/18 17:06:30 INFO TaskSetManager: Starting task 0.0 in stage 10.0 (TID 93, host99, executor 1, partition 1, PROCESS_LOCAL, 7069 bytes)
17/05/18 17:06:30 INFO TaskSetManager: Starting task 1.0 in stage 10.0 (TID 94, host101, executor 2, partition 2, PROCESS_LOCAL, 7069 bytes)
17/05/18 17:06:30 INFO TaskSetManager: Starting task 2.0 in stage 10.0 (TID 95, host101, executor 2, partition 3, PROCESS_LOCAL, 7069 bytes)
17/05/18 17:06:30 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on host99:42342 (size: 2.1 KB, free: 366.3 MB)
17/05/18 17:06:30 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on host101:45692 (size: 2.1 KB, free: 366.3 MB)
17/05/18 17:06:30 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to 10.3.242.99:35937
17/05/18 17:06:30 INFO TaskSetManager: Finished task 0.0 in stage 10.0 (TID 93) in 19 ms on host99 (executor 1) (1/3)
17/05/18 17:06:30 INFO TaskSetManager: Finished task 2.0 in stage 10.0 (TID 95) in 21 ms on host101 (executor 2) (2/3)
17/05/18 17:06:30 INFO TaskSetManager: Finished task 1.0 in stage 10.0 (TID 94) in 22 ms on host101 (executor 2) (3/3)
17/05/18 17:06:30 INFO YarnScheduler: Removed TaskSet 10.0, whose tasks have all completed, from pool 
17/05/18 17:06:30 INFO DAGScheduler: ResultStage 10 (print at FlumeEventCount.scala:30) finished in 0.025 s
17/05/18 17:06:30 INFO DAGScheduler: Job 5 finished: print at FlumeEventCount.scala:30, took 0.032431 s
-------------------------------------------
Time: 1495098390000 ms
-------------------------------------------
Received 5 flume events.
  • As a next step, Flume will filter the events first, dropping log events that contain no search query.
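
The multi-channel, multi-sink setup and the planned filter could be wired together roughly as in the fragment below. This is a sketch under stated assumptions, not the configuration actually used: the agent name and the `HDFS`, `sink2`, `ch1`, `ch2` names come from the snippet above and the Spark receiver address from the spark-submit command, while the source name `avroSrc`, its bind address and port, the interceptor name `searchOnly`, the filter regex and the HDFS path are all placeholders.

```properties
# Sketch only: avroSrc, its bind/port, searchOnly, the regex and the HDFS path are assumptions.
clusterLogAgent.sources = avroSrc
clusterLogAgent.channels = ch1 ch2
clusterLogAgent.sinks = HDFS sink2

# Avro source receiving events from the web server agent
clusterLogAgent.sources.avroSrc.type = avro
clusterLogAgent.sources.avroSrc.bind = 0.0.0.0
clusterLogAgent.sources.avroSrc.port = 4141
clusterLogAgent.sources.avroSrc.channels = ch1 ch2
# Replicating selector: every event is copied to both ch1 and ch2
clusterLogAgent.sources.avroSrc.selector.type = replicating

# Possible filter: keep only events whose body contains a search query
clusterLogAgent.sources.avroSrc.interceptors = searchOnly
clusterLogAgent.sources.avroSrc.interceptors.searchOnly.type = regex_filter
clusterLogAgent.sources.avroSrc.interceptors.searchOnly.regex = [?]Input=[^ ]+
clusterLogAgent.sources.avroSrc.interceptors.searchOnly.excludeEvents = false

clusterLogAgent.channels.ch1.type = memory
clusterLogAgent.channels.ch2.type = memory

# ch1 -> HDFS
clusterLogAgent.sinks.HDFS.type = hdfs
clusterLogAgent.sinks.HDFS.hdfs.path = /flume/weblogs
clusterLogAgent.sinks.HDFS.channel = ch1

# ch2 -> Spark Streaming receiver (push-based)
clusterLogAgent.sinks.sink2.type = avro
clusterLogAgent.sinks.sink2.hostname = 10.3.242.99
clusterLogAgent.sinks.sink2.port = 4545
clusterLogAgent.sinks.sink2.channel = ch2
```

Note that an interceptor attached to the source filters events before they are replicated, so in this sketch HDFS would also receive only the search events. Whether HDFS should keep the full log instead is a design choice left open here.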

Step 2: Spark Streaming

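  • On the Spark Streaming side, the program only needs to:
    1. Create a StreamingContext
    2. Receive the events sent by Flume with FlumeUtils.createStream
    3. mapPartitions (optimization): build one Pattern and one HashMap per partition
    4. reduceByKeyAndWindow (optimization): as the window slides, add and subtract the counts of identical keys
    5. sortByKey: sort, then take the top N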
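
Steps 4 and 5 can be illustrated without Spark. The plain-Scala sketch below shows the idea behind the add/subtract form of a windowed reduce: the new window is the old window plus the batch that enters, minus the batch that leaves, followed by a top-N sort. `IncrementalWindow`, `combine`, `slide` and `topN` are hypothetical names for this illustration only; it mimics the arithmetic, not Spark's internal implementation.

```scala
object IncrementalWindow {
  type Counts = Map[String, Int]

  // Apply `op` key by key; a key missing on one side counts as 0.
  private def combine(a: Counts, b: Counts)(op: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet).map(k => k -> op(a.getOrElse(k, 0), b.getOrElse(k, 0))).toMap

  // new window = old window + entering batch - leaving batch.
  // Keys whose count drops to 0 are removed so the state does not grow forever.
  def slide(oldWindow: Counts, entering: Counts, leaving: Counts): Counts =
    combine(combine(oldWindow, entering)(_ + _), leaving)(_ - _).filter(_._2 > 0)

  // Top-n keywords by count, highest first; ties are broken alphabetically.
  def topN(window: Counts, n: Int): Seq[(String, Int)] =
    window.toSeq.sortBy { case (word, count) => (-count, word) }.take(n)
}
```

The benefit of this form is that each slide touches only two batches instead of recomputing the whole window. Spark's `reduceByKeyAndWindow` with an inverse function also accepts an optional filter function, which plays the role of the `filter(_._2 > 0)` step here.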
  • Based on the approach above, a local test comes first:
    • Spark Streaming processing:
    • import java.util.regex.Pattern
      
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Milliseconds, StreamingContext}
      import org.slf4j.LoggerFactory
      
      import scala.collection.mutable
      
      object LocalTest {
        val logger = LoggerFactory.getLogger("LocalTest")
      
        def main(args: Array[String]): Unit = {
          // A 10 s window sliding every 5 s; both are multiples of the 5 s batch interval.
          val batchInterval = Milliseconds(5000)
          val windowLength = Milliseconds(10000)
          val slideInterval = Milliseconds(5000)
      
          val conf = new SparkConf()
            .setMaster("local[2]")
            .setAppName("LocalTest")
          // WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data,
          // otherwise Spark jobs will not get resources to process the received data.
          val ssc = new StreamingContext(conf, batchInterval)
          // Checkpointing is required by reduceByKeyAndWindow when an inverse function is used.
          ssc.checkpoint("flumeCheckpoint/")
      
          val stream = ssc.socketTextStream("localhost", 9998)
      
          // Build one Pattern and one HashMap per partition rather than per record.
          val counts = stream.mapPartitions { events =>
            val pattern = Pattern.compile("\\?Input=[^\\s]*\\s")
            val map = new mutable.HashMap[String, Int]()
            logger.info("Handling events, events is empty: " + events.isEmpty)
            while (events.hasNext) { // events is an Iterator
              val line = events.next()
              val m = pattern.matcher(line)
              if (m.find()) {
                // Text after "?Input="; trimmed because the match includes the trailing whitespace.
                val words = line.substring(m.start(), m.end()).split("=")(1).trim.toLowerCase()
                logger.info(s"Processing words $words")
                map.put(words, map.getOrElse(words, 0) + 1)
              }
            }
            map.iterator
          }
      
          // Add the counts of batches entering the window, subtract those leaving it.
          val window = counts.reduceByKeyAndWindow(_ + _, _ - _, windowLength, slideInterval)
          // window.print()
      
          // transform (and its variant transformWith) applies an arbitrary RDD-to-RDD function to a DStream;
          // it gives access to RDD operations that the DStream API does not expose.
          val sorted = window.transform { rdd =>
            val sortRdd = rdd.map(t => (t._2, t._1)).sortByKey(false).map(t => (t._2, t._1))
            val more = sortRdd.take(2)
            more.foreach(println)
            sortRdd
          }
      
          sorted.print()
      
          ssc.start()
          ssc.awaitTermination()
        }
      }
    • Meanwhile, run a separate program that generates log lines and sends them to port 9998:
    • import java.io.PrintWriter
      import java.net.ServerSocket
      
      object GenerateChar {
        def main(args: Array[String]): Unit = {
          val listener = new ServerSocket(9998)
          while (true) {
            val socket = listener.accept()
            // One thread per connected client.
            new Thread() {
              override def run(): Unit = {
                println("Got client connected from: " + socket.getInetAddress)
                val out = new PrintWriter(socket.getOutputStream, true)
                val context1 = "GET /result.html?Input=test1 HTTP/1.1"
                val context2 = "GET /result.html?Input=test2 HTTP/1.1"
                val context3 = "GET /result.html?Input=test3 HTTP/1.1"
                // Every 3 s send test1 once, test2 twice and test3 four times.
                // checkError() returns true once a write to the client has failed.
                while (!out.checkError()) {
                  Thread.sleep(3000)
                  println(context1)
                  println(context2)
                  println(context3)
                  out.write(Seq(context1, context2, context2, context3, context3, context3, context3).mkString("", "\n", "\n"))
                  out.flush()
                }
                socket.close()
              }
            }.start()
          }
        }
      }
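
To check the extraction logic outside Spark, the per-partition counting step can be run directly on the lines the generator emits in one round. This is a minimal sketch: `ExtractSketch`, `countKeywords` and `round` are hypothetical names, and the pattern uses a capture group (an assumption of this sketch) so that the keyword comes back without the trailing whitespace.

```scala
import java.util.regex.Pattern
import scala.collection.mutable

object ExtractSketch {
  // Count search keywords in one partition's worth of log lines.
  // The Pattern and the HashMap are created once per call, not once per line.
  def countKeywords(lines: Iterator[String]): Map[String, Int] = {
    val pattern = Pattern.compile("\\?Input=(\\S*)\\s")
    val counts = mutable.HashMap.empty[String, Int]
    for (line <- lines) {
      val m = pattern.matcher(line)
      if (m.find()) {
        val word = m.group(1).toLowerCase
        counts(word) = counts.getOrElse(word, 0) + 1
      }
    }
    counts.toMap
  }

  // The seven lines sent per round: test1 once, test2 twice, test3 four times.
  val round: Seq[String] =
    Seq(1, 2, 2, 3, 3, 3, 3).map(i => s"GET /result.html?Input=test$i HTTP/1.1")
}
```

For one round, `ExtractSketch.countKeywords(ExtractSketch.round.iterator)` gives test1 -> 1, test2 -> 2, test3 -> 4. Lines without an `Input` parameter are skipped.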
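  • Locally, all of the above works without any problem. Once packaged and submitted to the cluster, however, all kinds of bugs appear and there is no output: neither the logger.info messages nor System.out.println show up (the stdout file is empty), and a shuffleException is reported.
  • Most search results for this problem blame memory, but my data volume could hardly be any smaller. I then ran another test on the cluster without Flume, running a generateLog program locally on the driver to send data to port 9998. Even so, it can still fail with the following error: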
    • 17/05/24 15:07:17 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) from host101:37940
      java.io.IOException: Failed to connect to host101/10.3.242.101:37940
      

      Yet the results are correct.

 

TODO

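  • The architecture still has plenty of room for improvement. I only have two machines left now, so I will leave it as it is for the time being.
  • The biggest remaining problem is fault tolerance:
    • Flume is push-based, so a burst of events may put more write load on HDFS than it can handle, which leads to failures;
    • the Spark Streaming side also uses the push-based approach directly (I did not write a custom receiver, which seemed like too much trouble), so it has problems as well.
  • Going forward, the classic Kafka + Flume architecture is the way to go.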