Converting an RDD to a DataFrame in Spark, then registering the DataFrame as a table so it can be queried with SQL.
1) Infer the schema via reflection. This requires a case class, so define one first:
scala> case class People(name:String,age:Int)
defined class People
2) Create an RDD:
scala> val rdd =sc.makeRDD(List(("zhangsn",20),("lisi",20),("wangwu",40)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[52] at makeRDD at <console>:24
3) Map each tuple to a People object:
scala> rdd.map(t=> {People(t._1,t._2)})
res27: org.apache.spark.rdd.RDD[People] = MapPartitionsRDD[53] at map at <console>:29
4) Save the mapped RDD of People objects as val peopleRDD:
scala> val peopleRDD = rdd.map(t=> {People(t._1,t._2)})
peopleRDD: org.apache.spark.rdd.RDD[People] = MapPartitionsRDD[54] at map at <console>:28
5) Convert it to a DataFrame with toDF(columnNames...). Because People is a case class, the column names can be inferred from its fields, so a bare toDF is enough; for an RDD whose elements carry no field names (e.g. plain tuples), you must pass the column names to toDF explicitly.
scala> peopleRDD.toDF
res28: org.apache.spark.sql.DataFrame = [name: string, age: int]
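For comparison, a minimal sketch of the no-case-class path: the original rdd of (String, Int) tuples has no field names, so toDF needs them explicitly (this assumes the same spark-shell session, where the toDF implicits are already in scope):

```scala
// The tuple RDD carries no field names, so supply them to toDF:
val tupleDF = rdd.toDF("name", "age")
tupleDF.printSchema()
```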
6) Assign the DataFrame to df and inspect it:
scala> val df = peopleRDD.toDF
df: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> df.show
+-------+---+
| name|age|
+-------+---+
|zhangsn| 20|
| lisi| 20|
| wangwu| 40|
+-------+---+
7) Querying with SQL is now straightforward: register the DataFrame as a global temporary view:
scala> df.createGlobalTempView("people")
8) Query it with spark.sql. Global temporary views live in the global_temp database, so the table name is global_temp.people:
scala> spark.sql("select * from global_temp.people").show()
+-------+---+
| name|age|
+-------+---+
|zhangsn| 20|
| lisi| 20|
| wangwu| 40|
+-------+---+
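A global temporary view is bound to the global_temp database and is shared across Spark sessions until the application ends. If cross-session sharing is not needed, a session-scoped view avoids the global_temp prefix; a sketch using the same df:

```scala
// Session-scoped view: visible only in this SparkSession, no global_temp prefix
df.createOrReplaceTempView("people")
spark.sql("select name from people where age > 20").show()
```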
PS: Spark can also read a file directly and query it with SQL.
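As a sketch of that idea (the file path and its contents here are hypothetical, not part of this walkthrough):

```scala
// Hypothetical input: a JSON file with one {"name": ..., "age": ...} record per line
val peopleDF = spark.read.json("examples/people.json")
peopleDF.createOrReplaceTempView("people_from_file")
spark.sql("select name, age from people_from_file").show()
```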