Log in to the operating system as the hdfs or spark user and run spark-shell.
spark-shell can also be started with parameters, which override the defaults:
spark-shell --master yarn --num-executors 2 --executor-memory 2G --driver-memory 1536M
The default values are usually configured in /etc/spark/conf/spark-env.sh.
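Once spark-shell is up, a quick way to check which values actually took effect (the defaults from spark-env.sh or the overrides passed on the command line) is to look at the SparkConf. A minimal sketch, run inside spark-shell:
sc.getConf.get("spark.master")                    // e.g. yarn
sc.getConf.getOption("spark.executor.memory")     // None if nothing was set
sc.getConf.getAll.filter(_._1.startsWith("spark.")).foreach(println)   // dump all effective spark.* settings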
I. Creating an RDD from an array
1. Build an array by enumerating its elements
val arr=Array(1,2,3,4,5,6,7)
val arrRdd= sc.parallelize(arr)
arrRdd.first
arrRdd.take(5).foreach(println)
2. Build a list from a range
val lst = (0 to 10000 by 5).toList
val lstRdd=sc.parallelize(lst)
lstRdd.take(10).foreach(println)
lstRdd.collect
collect converts all the data in an RDD into an array on the driver, which can require a lot of memory.
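If the RDD is large, there are lighter ways to inspect it without pulling everything back to the driver. A sketch using the lstRdd defined above:
lstRdd.count                   // computed on the executors, only a number comes back
lstRdd.take(10)                // only the first 10 elements are returned to the driver
lstRdd.takeSample(false, 10)   // a random sample of 10 elements, without replacement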
3. A string can also be turned into an RDD directly
val stringRdd=sc.parallelize("hello world")
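Note that parallelize treats the String as a sequence of characters, so the RDD above is an RDD[Char], not an RDD of strings. If an RDD of whole strings is wanted instead, a sketch (wordsRdd is just an illustrative name):
stringRdd.count                                       // 11, one element per character of "hello world"
val wordsRdd = sc.parallelize(Seq("hello", "world"))  // RDD[String] with two elements
wordsRdd.collect                                      // Array(hello, world)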
II. Loading data from the local file system
1. Preparation: download a dataset from
https://nces.ed.gov/collegenavigator/?s=all&l=91+92+93+94&ct=1
Name the exported file CollegeNavigator.csv.
Upload the file to /home/hdfs/CollegeNavigator.csv on the operating system.
Then log in as the hdfs user and run spark-shell.
scala> val arrayIterator = scala.io.Source.fromFile("/home/hdfs/CollegeNavigator.csv")
arrayIterator: scala.io.BufferedSource = non-empty iterator
scala> arrayIterator.next
res1: Char = "
scala> arrayIterator.next
res2: Char = N
2. Use getLines to read the file line by line
scala> val arrayIterator = scala.io.Source.fromFile("/home/hdfs/CollegeNavigator.csv").getLines
arrayIterator: Iterator[String] = non-empty iterator
scala> arrayIterator.next
res13: String = "Name","Address","Website","Type","Awards offered","Campus setting","Campus housing","Student population","Undergraduate students","Graduation Rate","Transfer-Out Rate","Cohort Year *","Net Price **","Largest Program","IPEDS ID","OPE ID"
scala> val collegesRdd= sc.parallelize(arrayIterator.toList)
collegesRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:26
scala> collegesRdd.first
You will notice that the first line of the resulting RDD is not the header but data. This is because next was already called once on the iterator before the RDD was created. In other words, if you do not want the header in the resulting RDD, you can call next once and then call parallelize.
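If the header is still present in an RDD (for example one created directly with sc.textFile, where the next trick is not available), a common way to drop it is to filter out the first line. A sketch (dataRdd is an illustrative name):
val header = collegesRdd.first                            // the header line
val dataRdd = collegesRdd.filter(line => line != header)  // everything except the header
dataRdd.first                                             // now the first data row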
III. Loading data from files in HDFS
1. Load a single file
scala> val CollegesRdd = sc.textFile("/user/hdfs/CollegeNavigator.csv")
CollegesRdd: org.apache.spark.rdd.RDD[String] = /user/hdfs/CollegeNavigator.csv MapPartitionsRDD[3] at textFile at <console>:24
scala> CollegesRdd.take(10).foreach(println)
2. In HDFS, textFile can also be pointed at a directory, creating one RDD from all the files inside it.
scala> val dirRdd=sc.textFile("/user/hdfs/data")
dirRdd: org.apache.spark.rdd.RDD[String] = /user/hdfs/data MapPartitionsRDD[8] at textFile at <console>:24
scala> dirRdd.count
res19: Long = 1008
If this direct directory-loading capability were not available, the same result could be achieved by loading each file separately and combining them with rdd2.union(rdd1).
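A minimal sketch of the union approach, assuming two hypothetical files file1.csv and file2.csv under /user/hdfs/data (the file names are placeholders, not from the dataset above):
val rdd1 = sc.textFile("/user/hdfs/data/file1.csv")
val rdd2 = sc.textFile("/user/hdfs/data/file2.csv")
val combinedRdd = rdd2.union(rdd1)   // the same lines as loading the whole directory at once
combinedRdd.count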
3. Note the difference between this approach and spark.read.textFile("..")
scala> val test=spark.read.textFile("/user/hdfs/CollegeNavigator.csv")
test: org.apache.spark.sql.Dataset[String] = [value: string]
This returns a Spark SQL Dataset rather than an RDD.
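The two are easy to convert between. A sketch using the variables defined above (asRdd and asDs are illustrative names):
val asRdd = test.rdd          // back to org.apache.spark.rdd.RDD[String]
import spark.implicits._
val asDs = CollegesRdd.toDS   // an RDD[String] can become a Dataset[String]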