Log in to the operating system as the hdfs or spark user and run spark-shell.

spark-shell can also be launched with parameters, which override the defaults:

spark-shell --master yarn --num-executors 2 --executor-memory 2G --driver-memory 1536M 

The default values are usually set in /etc/spark/conf/spark-env.sh.
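For example, a spark-env.sh with such defaults might look like the following (an illustrative sketch only; these environment variable names are assumptions that depend on your Spark version and distribution, and newer setups usually put the equivalent properties such as spark.executor.memory in conf/spark-defaults.conf instead):

export SPARK_EXECUTOR_INSTANCES=2    # default for --num-executors (YARN mode)
export SPARK_EXECUTOR_MEMORY=2G      # default for --executor-memory
export SPARK_DRIVER_MEMORY=1536M     # default for --driver-memory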

I. Creating an RDD from an array

1. Create an array by enumerating its elements

val arr = Array(1, 2, 3, 4, 5, 6, 7)

val arrRdd = sc.parallelize(arr)

arrRdd.first

arrRdd.take(5).foreach(println)

2. Create a list from a range of values

val lst = (0 to 10000 by 5).toList

val lstRdd = sc.parallelize(lst)

lstRdd.take(10).foreach(println)

lstRdd.collect

Converting all of an RDD's data into an array (which is what collect does) pulls everything back to the driver and can require a lot of memory.

 

3. A string can also be turned directly into an RDD

val stringRdd = sc.parallelize("hello world")
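Note that a Scala String is a sequence of characters, so the RDD above has element type Char (one element per character). If you want an RDD whose elements are whole strings, parallelize a collection of strings instead; a minimal sketch:

val charRdd = sc.parallelize("hello world")           // RDD[Char]: h, e, l, l, o, ...
val wordRdd = sc.parallelize(Seq("hello", "world"))   // RDD[String]: "hello", "world"
wordRdd.collect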


II. Getting data from the local file system

1. Preparation: download a dataset from here:

https://nces.ed.gov/collegenavigator/?s=all&l=91+92+93+94&ct=1

Name the exported file CollegeNavigator.csv.

Upload the file to /home/hdfs/CollegeNavigator.csv on the server with a transfer tool.

Then log in as the hdfs user and run spark-shell.

scala> val arrayIterator = scala.io.Source.fromFile("/home/hdfs/CollegeNavigator.csv")
arrayIterator: scala.io.BufferedSource = non-empty iterator

 

scala> arrayIterator.next
res1: Char = "

scala> arrayIterator.next
res2: Char = N

2. Use getLines to get the lines directly

scala> val arrayIterator = scala.io.Source.fromFile("/home/hdfs/CollegeNavigator.csv").getLines
arrayIterator: Iterator[String] = non-empty iterator

scala> arrayIterator.next
res13: String = "Name","Address","Website","Type","Awards offered","Campus setting","Campus housing","Student population","Undergraduate students","Graduation Rate","Transfer-Out Rate","Cohort Year *","Net Price **","Largest Program","IPEDS ID","OPE ID"

 

scala> val collegesRdd= sc.parallelize(arrayIterator.toList)
collegesRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:26

scala> collegesRdd.first
You will notice that the first element of the resulting RDD is not the header row but an actual data row. That is because next was already called on the iterator once before the RDD was created, consuming the header. In other words, if you do not want the header row in an RDD you create later, you can call next once and then call parallelize.
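Equivalently, you can leave the iterator alone and drop the header explicitly when building the RDD; a small sketch, assuming the same file path:

val lines = scala.io.Source.fromFile("/home/hdfs/CollegeNavigator.csv").getLines.toList
val collegesRdd = sc.parallelize(lines.drop(1))   // drop(1) removes the header row
collegesRdd.first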

 

III. Getting data from files in HDFS

1. Load a single file with sc.textFile (this assumes the CSV has already been copied into HDFS, e.g. with hdfs dfs -put CollegeNavigator.csv /user/hdfs/):

scala> val CollegesRdd = sc.textFile("/user/hdfs/CollegeNavigator.csv")
CollegesRdd: org.apache.spark.rdd.RDD[String] = /user/hdfs/CollegeNavigator.csv MapPartitionsRDD[3] at textFile at <console>:24

scala> CollegesRdd.take(10).foreach(println)

2. With HDFS you can point sc.textFile at a directory and build an RDD from all the files in it.

scala> val dirRdd=sc.textFile("/user/hdfs/data")
dirRdd: org.apache.spark.rdd.RDD[String] = /user/hdfs/data MapPartitionsRDD[8] at textFile at <console>:24

scala> dirRdd.count
res19: Long = 1008

If loading a whole directory were not supported, you could achieve the same result with rdd1.union(rdd2).
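A minimal sketch of that union-based alternative, assuming the directory holds two files with the hypothetical names part1.csv and part2.csv:

val rdd1 = sc.textFile("/user/hdfs/data/part1.csv")
val rdd2 = sc.textFile("/user/hdfs/data/part2.csv")
val unionRdd = rdd1.union(rdd2)   // same rows as loading the whole directory at once
unionRdd.count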

3. Note the difference between this approach and spark.read.textFile("..."):

scala> val test=spark.read.textFile("/user/hdfs/CollegeNavigator.csv")
test: org.apache.spark.sql.Dataset[String] = [value: string]

This returns a Spark SQL Dataset[String] rather than an RDD.
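If you need a plain RDD from that Dataset, you can call .rdd on it; conversely, in spark-shell (which imports spark.implicits._ for you) an RDD[String] can be turned into a Dataset with toDS. A minimal sketch reusing the vals defined above:

val testRdd = test.rdd          // Dataset[String] -> RDD[String]
val testDs  = CollegesRdd.toDS  // RDD[String] -> Dataset[String]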

 

posted on 2018-11-14 15:39 by David_Zhu