Having covered the RDD filter operation in the previous post, this one looks at Spark's map transformation.
1. Load the file
val collegesRdd= sc.textFile("/user/hdfs/CollegeNavigator.csv")
val header= collegesRdd.first
2. Use filter to drop the header line and keep only the data rows
val headerlessRdd= collegesRdd.filter( line=>{ line!= header } )
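The same predicate can be tried without a cluster. The following is a minimal sketch on a local Seq, assuming made-up sample lines (they are not taken from CollegeNavigator.csv) and a hypothetical object name HeaderFilterSketch; it only illustrates that comparing each line with the first line removes the header.

```scala
object HeaderFilterSketch {
  // Made-up sample lines for illustration only; the first one plays the role of the header.
  val lines: Seq[String] = Seq(
    "\"Name\",\"Address\"",
    "\"A College\",\"1 Main St\"",
    "\"B College\",\"2 Main St\""
  )

  // Equivalent of collegesRdd.first on a local collection.
  val header: String = lines.head

  // Same predicate as in step 2: keep every line that is not the header.
  val headerless: Seq[String] = lines.filter(line => line != header)

  def main(args: Array[String]): Unit = {
    headerless.foreach(println)
  }
}
```

Note that this predicate also drops any data line that happens to be identical to the header, which is normally what is wanted.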
3. Look at the actual data format
scala> headerlessRdd.first
res1: String = "Aaniiih Nakoda College","269 Blackfeet Avenue Agency, Harlem, Montana 59526","www.ancollege.edu","2-year, Public","Less than one year certificate|One but less than two years certificate|Associate's degree","Rural: Remote","No","122","122","28%","31%","Fall 2014","$8,311","-","180203","02517500"
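4. Define a map function that extracts the school name and the student counts. The map emits records in the form (key,(value1,value2)); the value is a tuple, and the key could be a tuple as well.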
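val collegesinfo = headerlessRdd.map(line => {
  // Split on the "," sequence that separates the quoted fields
  val collegesList = line.split("\",\"")
  // Drop the leading quote left on the first field
  val name = collegesList(0).substring(1)
  val stuCount = collegesList(7)    // Student population
  val underCount = collegesList(8)  // Undergraduate students
  (name, (stuCount, underCount))
})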
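The core of the map call is its last line, (name,(stuCount,underCount)); everything before it is data preparation. After a simple split the fields can still be irregular, and that has to be handled
with a little plain Scala.
To pull out one specific row and work on it, use
headerlessRdd.take(121)(120), which returns the row at index 120 (the 121st row) as a String.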
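As an illustration of that kind of clean-up, here is a minimal sketch that runs the same split on the sample row from step 3 and converts the two counts to numbers. The object name ParseLineSketch and the helper toCount are hypothetical, and stripping commas before parsing is an assumption about how larger counts might be formatted; the post itself keeps the counts as strings.

```scala
import scala.util.Try

object ParseLineSketch {
  // Field values copied from the sample row shown in step 3.
  val fields: Seq[String] = Seq(
    "Aaniiih Nakoda College",
    "269 Blackfeet Avenue Agency, Harlem, Montana 59526",
    "www.ancollege.edu", "2-year, Public",
    "Less than one year certificate|One but less than two years certificate|Associate's degree",
    "Rural: Remote", "No", "122", "122", "28%", "31%",
    "Fall 2014", "$8,311", "-", "180203", "02517500"
  )

  // Rebuild the raw line: every field quoted, fields joined by commas.
  val line: String = fields.map(f => "\"" + f + "\"").mkString(",")

  // Hypothetical helper: remove thousands separators, then parse.
  // Returns None for values such as "-" that are not numeric.
  def toCount(s: String): Option[Int] =
    Try(s.replace(",", "").trim.toInt).toOption

  // Same split as in step 4, plus the numeric conversion.
  def parse(l: String): (String, (Option[Int], Option[Int])) = {
    val cols = l.split("\",\"")
    (cols(0).substring(1), (toCount(cols(7)), toCount(cols(8))))
  }

  def main(args: Array[String]): Unit = {
    println(parse(line)) // (Aaniiih Nakoda College,(Some(122),Some(122)))
  }
}
```

Returning Option[Int] keeps a malformed row from failing the whole job; a function like parse could be passed to headerlessRdd.map in the same way as the anonymous function above.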
5. Check the resulting RDD
scala> collegesinfo.take(10).foreach(println)
(Aaniiih Nakoda College,(122,122))
(Abraham Baldwin Agricultural College,(3394,3394))
(Academy for Careers and Technology,(39,39))
(Academy of Careers and Technology,(146,146))
(Adams State University,(3314,1960))
(Adirondack Community College,(3892,3892))
(Adult and Community Education-Hudson,(129,129))
(Adult and Continuing Education-BCTS,(78,78))
(Aiken Technical College,(2399,2399))
(Aims Community College,(5982,5982))
6. Alternatively, define a named function and pass it to map
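def mapfunc(line: String): (String, (String, String)) = {
  // Split on the "," sequence that separates the quoted fields
  val collegesList = line.split("\",\"")
  // Drop the leading quote left on the first field
  val name = collegesList(0).substring(1)
  val stuCount = collegesList(7)    // Student population
  val underCount = collegesList(8)  // Undergraduate students
  (name, (stuCount, underCount))
}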
val collegesinfo2= headerlessRdd.map(mapfunc)
7. Check the result; it is identical
scala> collegesinfo2.take(10).foreach(println)
(Aaniiih Nakoda College,(122,122))
(Abraham Baldwin Agricultural College,(3394,3394))
(Academy for Careers and Technology,(39,39))
(Academy of Careers and Technology,(146,146))
(Adams State University,(3314,1960))
(Adirondack Community College,(3892,3892))
(Adult and Community Education-Hudson,(129,129))
(Adult and Continuing Education-BCTS,(78,78))
(Aiken Technical College,(2399,2399))
(Aims Community College,(5982,5982))
8. Additional notes for when an individual row needs processing
val collegesList=line.split("\",\"")
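split returns an Array[String]; its size or length gives the total number of columns.
The slice(startIndex,endIndex) method returns part of the array (endIndex is exclusive).
An array can be turned into a string: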
scala> collegeList.mkString(",")
res9: String = "Name,Address,Website,Type,Awards offered,Campus setting,Campus housing,Student population,Undergraduate students,Graduation Rate,Transfer-Out Rate,Cohort Year *,Net Price **,Largest Program,IPEDS ID,OPE ID"
Strings can be concatenated with the + operator.
All of these come in handy when working with real data.
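The helpers mentioned in this step can be tried together in a small sketch. It uses the column names from the mkString output above; the object name ArrayHelpersSketch is hypothetical.

```scala
object ArrayHelpersSketch {
  // Column names copied from the mkString output shown above.
  val cols: Array[String] = Array(
    "Name", "Address", "Website", "Type", "Awards offered",
    "Campus setting", "Campus housing", "Student population",
    "Undergraduate students", "Graduation Rate", "Transfer-Out Rate",
    "Cohort Year *", "Net Price **", "Largest Program", "IPEDS ID", "OPE ID"
  )

  // slice takes a start index (inclusive) and an end index (exclusive).
  val countCols: Array[String] = cols.slice(7, 9)

  // String concatenation with +.
  val label: String = cols(0) + " / " + cols(7)

  def main(args: Array[String]): Unit = {
    println(cols.length)              // 16
    println(countCols.mkString(","))  // Student population,Undergraduate students
    println(label)                    // Name / Student population
  }
}
```

Indexes 7 and 8 are the same positions that the map function in step 4 reads, which is an easy way to confirm the right columns are being picked.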