导航

公告

大数据入门到精通6---spark rdd reduce by key 的使用方法

1.前期数据准备（同之前的章节）

val collegesRdd= sc.textFile("/user/hdfs/CollegeNavigator.csv")
val header= collegesRdd.first

val headerlessRdd= collegesRdd.filter( line=>{ line!= header } )

2.获得map

val typeMapCount= headerlessRdd.map(line=>{
val strtype=line.split("\",\"")(3)
val strCount=line.split("\",\"")(7)
val stuCount=if (strCount.length()>0) strCount.toLong
else 0
(strtype,stuCount)
})
typeMapCount.take(10).foreach(println)

3使用reducebykey 方法

val typeReduce=typeMapCount.reduceByKey((sum,current)=>{
sum+current
})

4.数据排序

由于只有sortByKey这个方法，所以想按照后面的数据来排序，比较麻烦，必须把key value做两次置换，如下：

val typeReduce=typeMapCount.reduceByKey((sum,current)=>{
sum+current
}).map(line=>(line._2,line._1)).sortByKey().map(line=>(line._2,line._1))

typeReduce.take(10).foreach(println)

posted on 2018-11-23 11:59 David_Zhu 阅读(383) 评论(0) 收藏举报

刷新页面返回顶部