David_Zhu


Having covered filter on RDDs in the previous post, this one looks at Spark's map transformation.

1. Load the file

val collegesRdd= sc.textFile("/user/hdfs/CollegeNavigator.csv")
val header= collegesRdd.first

2. Use filter to drop the header line and keep only the data rows

val headerlessRdd= collegesRdd.filter( line=>{ line!= header } )

3. Look at the actual data format

scala> headerlessRdd.first
res1: String = "Aaniiih Nakoda College","269 Blackfeet Avenue Agency, Harlem, Montana 59526","www.ancollege.edu","2-year, Public","Less than one year certificate|One but less than two years certificate|Associate's degree","Rural: Remote","No","122","122","28%","31%","Fall 2014","$8,311","-","180203","02517500"

4. Define a map function that extracts the college name and the student counts. The result has the form (key,(value1,value2)): the value is a tuple, and the key could be a tuple as well.

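val collegesinfo = headerlessRdd.map(line => {
  // Fields are separated by the three characters ","
  val collegesList = line.split("\",\"")
  // Drop the leading quote that remains on the first field
  val name = collegesList(0).substring(1)
  val stuCount = collegesList(7)   // "Student population" column
  val underCount = collegesList(8) // "Undergraduate students" column
  (name, (stuCount, underCount))
})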

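The code above uses map. The important part is the last line, (name,(stuCount,underCount)), which is the value returned for each row; everything before it prepares the data. After a simple split some fields are still irregular (the first field, for example, keeps its leading quote) and have to be cleaned up.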

This cleanup is done with a few simple Scala operations.
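The cleanup can be tried in plain Scala before it is moved into an RDD map. The following is a minimal sketch rather than the code used in this post: `toCount` and `parseCollege` are hypothetical helper names, `sampleLine` is a shortened copy of the row shown in step 3, and turning unparseable counts into `None` is an assumption about how such values should be treated.

```scala
import scala.util.Try

// Hypothetical helper: convert a count field to Int, yielding None for
// values such as "-" that are not numbers.
def toCount(field: String): Option[Int] =
  Try(field.trim.toInt).toOption

// Hypothetical helper: split one line on the "," delimiter and return
// (name, (studentCount, undergraduateCount)), or None for short rows.
def parseCollege(line: String): Option[(String, (Option[Int], Option[Int]))] = {
  val fields = line.split("\",\"")
  if (fields.length < 9) None // too few columns to read indexes 7 and 8
  else {
    val name = fields(0).stripPrefix("\"") // first field keeps its opening quote
    Some((name, (toCount(fields(7)), toCount(fields(8)))))
  }
}

// Shortened copy of the first data row (fewer columns than the real file).
val sampleLine = "\"Aaniiih Nakoda College\",\"269 Blackfeet Avenue Agency, Harlem, Montana 59526\",\"www.ancollege.edu\",\"2-year, Public\",\"Less than one year certificate\",\"Rural: Remote\",\"No\",\"122\",\"122\",\"28%\""

val parsed = parseCollege(sampleLine)
```

Returning an `Option` lets malformed rows be dropped instead of failing the job; on an RDD, something like `headerlessRdd.flatMap(line => parseCollege(line))` should keep only the rows that parse.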

 To pull out one specific row and work on it, use

headerlessRdd.take(121)(120), which returns the element at index 120 (the 121st data row). Like every other row, it is a String.
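`take(n)` returns an ordinary `Array` to the driver, so the second pair of parentheses is plain zero-based array indexing. A small sketch with a local collection standing in for the RDD (the `rows` values are made up for illustration):

```scala
// Made-up stand-in for the rows of an RDD.
val rows = Vector("row0", "row1", "row2", "row3")

// take(3) keeps the first three elements; index 2 is the third of them.
val third = rows.take(3)(2)
```

Because every row up to the requested one is brought back to the driver, this approach is best kept for inspecting small prefixes of the data.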

 

5. Check the contents of the resulting RDD

scala> collegesinfo.take(10).foreach(println)
(Aaniiih Nakoda College,(122,122))
(Abraham Baldwin Agricultural College,(3394,3394))
(Academy for Careers and Technology,(39,39))
(Academy of Careers and Technology,(146,146))
(Adams State University,(3314,1960))
(Adirondack Community College,(3892,3892))
(Adult and Community Education-Hudson,(129,129))
(Adult and Continuing Education-BCTS,(78,78))
(Aiken Technical College,(2399,2399))
(Aims Community College,(5982,5982))

 6. The same logic can also be written as a named function and passed to map

def mapfunc(line:String):(String,(String,String)) ={
val collegesList=line.split("\",\"")
val name=collegesList(0).substring(1)
val stuCount=collegesList(7)
val underCount=collegesList(8)
(name,(stuCount,underCount))
}


val collegesinfo2= headerlessRdd.map(mapfunc)

7. Check the result: it is identical

scala> collegesinfo2.take(10).foreach(println)
(Aaniiih Nakoda College,(122,122))
(Abraham Baldwin Agricultural College,(3394,3394))
(Academy for Careers and Technology,(39,39))
(Academy of Careers and Technology,(146,146))
(Adams State University,(3314,1960))
(Adirondack Community College,(3892,3892))
(Adult and Community Education-Hudson,(129,129))
(Adult and Continuing Education-BCTS,(78,78))
(Aiken Technical College,(2399,2399))
(Aims Community College,(5982,5982))
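Passing a method name where a function is expected works on ordinary Scala collections too, which makes it easy to try a mapping function locally before applying it to an RDD. A minimal sketch, assuming a simplified two-column line format invented for this example (`nameAndCount` and `sampleLines` are hypothetical):

```scala
// Hypothetical mapping function for a simplified "name","count" line.
def nameAndCount(line: String): (String, String) = {
  val fields = line.split("\",\"")
  (fields(0).stripPrefix("\""), fields(1).stripSuffix("\""))
}

// Made-up sample lines in the simplified format.
val sampleLines = List("\"College A\",\"10\"", "\"College B\",\"20\"")

// The method is converted to a function automatically, as in map(mapfunc).
val pairs = sampleLines.map(nameAndCount)

// Tuple fields are read with _1 and _2.
val firstName = pairs.head._1
val firstCount = pairs.head._2
```

A named function is easier to reuse and to test in isolation than an inline lambda, while the result of the map is the same.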

8. Additional notes for working on an individual row

val collegesList=line.split("\",\"")

split returns an Array[String]; its size or length gives the total number of columns.

slice(startIndex,endIndex) returns the part of the array from startIndex up to, but not including, endIndex.
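As a quick illustration of `length` and `slice`, here is a sketch on a small array whose values are made up to stand in for the result of `split`:

```scala
// Made-up fields standing in for the result of split.
val fields = Array("Name", "Address", "Website", "Type")

val columnCount = fields.length // same value as fields.size
val middle = fields.slice(1, 3) // indexes 1 and 2; index 3 is excluded
```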

An array can be turned into a single string:

scala> collegeList.mkString(",")
res9: String = "Name,Address,Website,Type,Awards offered,Campus setting,Campus housing,Student population,Undergraduate students,Graduation Rate,Transfer-Out Rate,Cohort Year *,Net Price **,Largest Program,IPEDS ID,OPE ID"

Strings can be concatenated with the + operator.
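Both operations can be sketched together on a small made-up array (`parts` is a hypothetical name used only here):

```scala
// Made-up values for illustration.
val parts = Array("Name", "Address", "Website")

val joined = parts.mkString(",")              // "Name,Address,Website"
val bracketed = parts.mkString("[", "|", "]") // "[Name|Address|Website]"
val label = "columns: " + joined              // "columns: Name,Address,Website"
```

The three-argument form of `mkString` takes a prefix, a separator, and a suffix, which is convenient when rebuilding a delimited line.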

All of these come in handy when working with real data.

 

Posted on 2018-11-21 10:55 by David_Zhu