In Spark, convert an RDD to a DataFrame for structured queries, then run SQL queries against the DataFrame.

1) Infer the schema by reflection. This requires a case class, so define one first. (In spark-shell, `spark.implicits._` is already imported; in a standalone application you must import it yourself before calling `toDF`.)

scala> case class People(name:String,age:Int)
defined class People

2) Create an RDD:

scala> val rdd =sc.makeRDD(List(("zhangsn",20),("lisi",20),("wangwu",40)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[52] at makeRDD at <console>:24

3) Convert each tuple into a People object:

scala> rdd.map(t=> {People(t._1,t._2)})
res27: org.apache.spark.rdd.RDD[People] = MapPartitionsRDD[53] at map at <console>:29

4) Assign the mapped RDD of People objects to a val named peopleRDD:

scala> val peopleRDD = rdd.map(t=> {People(t._1,t._2)})
peopleRDD: org.apache.spark.rdd.RDD[People] = MapPartitionsRDD[54] at map at <console>:28

5) Convert to a DataFrame with toDF(columnNames...). If the elements already carry a schema (e.g. a case class), you can call toDF with no arguments; if they do not (e.g. plain tuples), you must pass column names.

scala> peopleRDD.toDF

res28: org.apache.spark.sql.DataFrame = [name: string, age: int]
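
By contrast, the tuple RDD from step 2 carries no field names, so the column names must be passed to toDF explicitly. A minimal sketch (the variable name df2 is ours, not from the original session):

```scala
// rdd is an RDD[(String, Int)]; plain tuples have no named fields,
// so the column names must be supplied to toDF
scala> val df2 = rdd.toDF("name", "age")
df2: org.apache.spark.sql.DataFrame = [name: string, age: int]
```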

6) Assign the result to df:

scala> val df = peopleRDD.toDF

df: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> df.show
+-------+---+
|   name|age|
+-------+---+
|zhangsn| 20|
|   lisi| 20|
| wangwu| 40|
+-------+---+
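
The same queries can also be expressed with the DataFrame DSL before reaching for SQL. A brief sketch (in spark-shell, where `spark.implicits._` is pre-imported, so the `$` column syntax works):

```scala
// keep only rows whose age column is greater than 20
scala> df.filter($"age" > 20).show()
+------+---+
|  name|age|
+------+---+
|wangwu| 40|
+------+---+
```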

7) Turning this into a SQL query is now simple: register a global temporary view.

scala> df.createGlobalTempView("people")

8) Query it with spark.sql, referencing the view as global_temp.people (global temporary views live in the global_temp database):

scala> spark.sql("select * from global_temp.people").show()
+-------+---+
|   name|age|
+-------+---+
|zhangsn| 20|
|   lisi| 20|
| wangwu| 40|
+-------+---+
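
A note on view scope: createGlobalTempView registers the view in the global_temp database, shared across Spark sessions. For queries within a single session, createOrReplaceTempView is the more common choice and needs no database prefix. A minimal sketch (the view name people2 is chosen here to avoid clashing with the global view above):

```scala
// session-scoped view: no global_temp prefix needed when querying
scala> df.createOrReplaceTempView("people2")

scala> spark.sql("select * from people2 where age > 20").show()
+------+---+
|  name|age|
+------+---+
|wangwu| 40|
+------+---+
```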

PS: Spark can also read a file directly and query it with SQL; see:

https://www.cnblogs.com/markecc121/p/11638049.html
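
For reference, reading a file and querying it with SQL follows the same view-registration pattern. A minimal sketch, assuming a hypothetical file /tmp/people.json with one JSON object per line (path, file contents, and view name are all assumptions, not from the original post):

```scala
// /tmp/people.json (hypothetical), one JSON object per line, e.g.:
// {"name":"zhangsn","age":20}
// {"name":"lisi","age":20}
scala> val fileDF = spark.read.json("/tmp/people.json")

// register a session-scoped view, then query it with SQL
scala> fileDF.createOrReplaceTempView("people_from_file")

scala> spark.sql("select * from people_from_file").show()
```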

posted @ 2019-10-08 22:35  markecc121