Spark Big Data: WordCount Principle Analysis

1. The code is as follows

package cn.spark.study.core

import org.apache.spark.{SparkConf, SparkContext}

/**
 * @author: yangchun
 * @description: Word count with Spark Core: read a text file from HDFS, split it
 *               into words, count each word, and print the counts.
 * @date: Created in 2020-05-04 15:41
 */
object WordCountScala {
  def main(args: Array[String]): Unit = {
    // Configure the application and create the SparkContext, the entry point to Spark Core.
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    // Read the input file from HDFS; each element of the RDD is one line of text.
    val lines = sc.textFile("hdfs://spark1:9000/spark.txt")
    // Split each line on spaces and flatten the result into an RDD of words.
    val words = lines.flatMap { line => line.split(" ") }
    // Map each word to a (word, 1) pair.
    val pairs = words.map { word => (word, 1) }
    // Sum the counts per word; reduceByKey aggregates locally within each partition before shuffling.
    val wordCounts = pairs.reduceByKey { _ + _ }
    // Print each (word, count) pair.
    wordCounts.foreach(wordCount => println(wordCount._1 + " appeared " + wordCount._2 + " times"))
  }
}
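
A usage note, not part of the original code: when this job runs on a cluster, the foreach above executes on the executors, so the println output ends up in the executor logs rather than in the driver's console. Since a word-count result is small, one common alternative is to collect it to the driver first and print there; the sketch below would replace the last line of main:

    // Bring the (small) result back to the driver and print it there.
    wordCounts.collect().foreach { case (word, count) =>
      println(word + " appeared " + count + " times")
    }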

2. The principle diagram is as follows

[Figure: WordCount principle diagram]
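
A minimal sketch of how to see this structure directly from Spark (the object name WordCountLineage, the local[*] master, and the sample lines are assumptions so the snippet runs without an HDFS cluster): RDD.toDebugString prints an RDD's lineage, and the ShuffledRDD entry in the printout marks the stage boundary that reduceByKey introduces.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLineage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCountLineage").setMaster("local[*]"))
    // Small in-memory sample standing in for the HDFS file.
    val lines = sc.parallelize(Seq("hello spark", "hello hadoop"))
    val wordCounts = lines
      .flatMap(_.split(" "))   // narrow dependency: stays in the same stage
      .map((_, 1))             // narrow dependency: stays in the same stage
      .reduceByKey(_ + _)      // shuffle dependency: starts a new stage

    // The lineage printout shows a ShuffledRDD on top of the MapPartitionsRDDs,
    // i.e. the stage boundary introduced by reduceByKey.
    println(wordCounts.toDebugString)
    sc.stop()
  }
}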

3. Distributed, iterative, in-memory computation

Successive batches of data make up a series of RDDs, and the computation iterates over them in memory to produce the result. reduceByKey also performs a local aggregation within each partition first, and only then carries out the shuffle.

The input data is read from Hadoop HDFS.
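
To make the local-aggregation point concrete, here is a small comparison sketch (the object name CombineComparison and the sample data are illustrative, not from the original post): reduceByKey combines values within each partition before the shuffle, whereas groupByKey ships every individual (word, 1) record across the network and only sums afterwards.

import org.apache.spark.{SparkConf, SparkContext}

object CombineComparison {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CombineComparison").setMaster("local[*]"))
    // Small sample standing in for the words read from HDFS.
    val pairs = sc.parallelize(Seq("a", "a", "b", "a", "b")).map((_, 1))

    // Map-side combine: each partition pre-sums its own (word, 1) pairs,
    // so at most one partial sum per key per partition crosses the shuffle.
    val combined = pairs.reduceByKey(_ + _)

    // No map-side combine: every (word, 1) record is shuffled, then summed,
    // which moves more data for the same result.
    val grouped = pairs.groupByKey().mapValues(_.sum)

    combined.collect().foreach(println) // e.g. (a,3) and (b,2); ordering may vary
    grouped.collect().foreach(println)
    sc.stop()
  }
}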
