SparkSQL
1. Course Objectives
- Understand how Spark SQL works
- Become familiar with the Spark SQL programming models: DataFrame, Dataset, and SQL
- Be able to use Spark SQL to complete computation tasks
2. Spark SQL Overview
- Spark SQL is Apache Spark's module for working with structured data.
- Spark SQL is mainly used to process structured data.
- Structured data: in the strict sense, data in a relational database, with a defined structure: a table name, column names, and a specific data type (and length) for each column. In practice, data generally has to be given such a schema before it can be queried; system logs, for example, are usually semi-structured: JSON, JSON plus free text, or log lines that follow a fixed pattern.
- Spark SQL features:
  - Easy to integrate: Spark SQL and Spark Core can be used together in the same application, and multiple languages are supported: Scala, Java, Python, and R.
  - Unified data access
    - External data sources are read through a uniform API, e.g. spark.read.json("path to file")
    - Reading a CSV file: spark.read.csv("path")
  - Hive integration
    - Spark SQL integrates seamlessly with Hive and supports Hive features such as HiveQL, window functions, and Hive user-defined functions (UDF, UDAF).
  - Standard JDBC/ODBC connectivity
    - You can connect to Spark SQL through Java's JDBC API and run interactive queries.
3. Spark SQL Programming Models
- Spark SQL provides three programming models:
  - DataFrame
  - Dataset
  - SQL
- All three programming models are executed by the same underlying engine, as the sketch below illustrates.
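A minimal sketch of this point (the object name, data, and column names are illustrative, not from the course material): the same query written with the DataFrame DSL and with SQL goes through the same engine and returns the same result.

```scala
import org.apache.spark.sql.SparkSession

object UnifiedEngineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("UnifiedEngineSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Small illustrative dataset.
    val df = Seq(("zhangsan", 10), ("Tom", 20), ("Lucy", 30)).toDF("name", "age")
    df.createOrReplaceTempView("people")

    // DataFrame DSL version.
    df.filter($"age" > 15).select("name").show()
    // SQL version over the same data; both are planned and executed by the same engine.
    spark.sql("select name from people where age > 15").show()
  }
}
```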
3.1 DataFrame
- DataFrame: each DataFrame is conceptually equivalent to a table in a relational database.
- Characteristics:
  - A DataFrame consists of two parts: structure information (schema / metadata) plus the data itself.
  - A DataFrame offers a much richer API than an RDD.
  - DataFrame data can be stored off-heap in a binary format, so it is not subject to JVM garbage collection in the same way as regular objects.
  - A DataFrame can skip reading data it does not need (e.g. column pruning), which improves execution efficiency.
  - A DataFrame is also lazily evaluated; execution is only triggered by an output (action) operator, as the sketch below shows.
- DataFrame data sources
  - Add the Maven dependency to the pom file:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
```
- Creating a DataFrame
  - The entry point is SparkSession, so any Spark SQL program must first create a SparkSession; a minimal sketch follows.
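A minimal sketch of creating the SparkSession that the read examples below rely on (the application name and local master are placeholders):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Build (or reuse) the SparkSession: the entry point for all Spark SQL programs.
val spark: SparkSession = SparkSession.builder()
  .appName("SparkSQLDataSource")   // placeholder application name
  .master("local[*]")              // local mode for experimentation; drop this when submitting to a cluster
  .getOrCreate()

// Needed later for rdd.toDF(), toDS() and the $"column" syntax.
import spark.implicits._
```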
  - Using JSON data as the source:

```scala
// Use a JSON file as the data source.
val jsonDF: DataFrame = spark.read.json("people.json")
// Print the data.
jsonDF.show()
// Print the schema.
jsonDF.printSchema()
```
  - Using a CSV file as the source:

```scala
import org.apache.spark.sql.types._

// Read a CSV file with an explicit schema.
val csvStruct = StructType(
  StructField("name", StringType, true) ::
  StructField("age", IntegerType, false) ::
  StructField("post", StringType, false) :: Nil)
val csvDF: DataFrame = spark.read.schema(csvStruct).csv("people.csv")
csvDF.printSchema()
csvDF.show()
```
  - Using a Parquet file as the source:

```scala
// Read a Parquet file.
val userDF: DataFrame = spark.read.parquet("users.parquet")
userDF.printSchema()
userDF.show()
```
  - Using a relational database as the source:

```scala
import java.util.Properties

// Read data from MySQL over JDBC (requires the MySQL JDBC driver on the classpath).
val properties = new Properties()
properties.setProperty("user", "root")
properties.setProperty("password", "root")
val mysqlDF: DataFrame = spark.read.jdbc("jdbc:mysql://node-02:3306/test", "person", properties)
mysqlDF.printSchema()
mysqlDF.show()
```
  - Using a txt or log file as the source
    - Build the DataFrame with the help of a case class
    - Steps:
      - Define a case class (e.g. Person) that matches the records in the data source
      - Read the file with sparkContext to get an RDD[String]
      - Map each line to a case class instance to build an RDD[Person]
      - Convert the RDD[Person] to a DataFrame; this requires the implicit conversions: import spark.implicits._
      - Query columns on the DataFrame and display the result
```scala
// Build the schema for the txt file with a case class.
// The case class must be defined outside the method that uses it.
case class Person(name: String, age: Int, salary: Int)

object SparkSQLTxt {
  def main(args: Array[String]): Unit = {
    // Read the txt file.
    val spark = SparkSession.builder().appName("SparkSQLTxt").master("local[*]").getOrCreate()
    spark.sparkContext.setLogLevel("OFF")
    val lineRDD: RDD[String] = spark.sparkContext.textFile("C:\\Users\\tp\\Desktop\\深圳大数据7期\\spark\\spark_day03\\数据\\people.txt")
    val personRDD: RDD[Person] = lineRDD.map(line => {
      val personStrArray = line.split(",")
      Person(personStrArray(0), personStrArray(1).toInt, personStrArray(2).toInt)
    })
    // Convert the RDD to a DataFrame.
    import spark.implicits._
    val personDF: DataFrame = personRDD.toDF()
    personDF.show()

    // Alternative: build the DataFrame from an RDD of tuples and name the columns explicitly.
    val tupleRDD: RDD[(String, Int, Int)] = lineRDD.map(line => {
      val personStrArray = line.split(",")
      (personStrArray(0), personStrArray(1).toInt, personStrArray(2).toInt)
    })
    val frame = tupleRDD.toDF("name", "age", "salary")
    frame.show()
  }
}
```
  - Note: tuples are limited to 22 elements (case classes had the same 22-field limit in older Scala versions).
  - Building the DataFrame schema programmatically with a StructType (specifying the schema explicitly instead of inferring it by reflection from a case class):
```scala
object SparkSQLTxt2 {
  def main(args: Array[String]): Unit = {
    // Read the txt file.
    val spark = SparkSession.builder().appName("SparkSQLTxt").master("local[*]").getOrCreate()
    spark.sparkContext.setLogLevel("OFF")
    val lineRDD = spark.sparkContext.textFile("C:\\Users\\tp\\Desktop\\深圳大数据7期\\spark\\spark_day03\\数据\\people.txt")
    // Map each line to a Row object.
    val rowRDD: RDD[Row] = lineRDD.map(line => {
      val strArray = line.split(",")
      Row(strArray(0), strArray(1).toInt, strArray(2).toInt)
    })
    // Describe the structure of each Row, then combine it with the RDD[Row] to get a DataFrame.
    val structType = StructType(
      StructField("name", StringType, false) ::
      StructField("age", IntegerType, false) ::
      StructField("salary", IntegerType, false) :: Nil
    )
    val dataFrame: DataFrame = spark.createDataFrame(rowRDD, structType)
    dataFrame.printSchema()
    dataFrame.show()
  }
}
```
3.2 Working with Data via the DataFrame API
- Data file: game top-up (recharge) records
- Columns: gameId gatewayId chargeDate userName chargeAmount OS (a sketch of a matching case class follows)
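The code in this section assumes a case class matching those columns; a minimal sketch (the field types are assumptions inferred from how the values are parsed later):

```scala
// Hypothetical case class for one top-up record; types inferred from the later .toInt parsing.
case class GameOrder(
  gameId: Int,
  gatewayId: Int,
  chargeDate: String,
  userName: String,
  chargeAmount: Int,
  os: String
)
```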
- Steps (DSL programming style)
  - Build a DataFrame from the game top-up data:
```scala
// Build the DataFrame.
val spark = SparkSession.builder().appName("SparkSQLGame").master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("OFF")
val gameRDD = spark.sparkContext.textFile("C:\\Users\\tp\\Desktop\\深圳大数据7期\\spark\\spark_day03\\数据\\game数据.txt")
// GameOrder is the case class sketched above.
val gameOrderRDD = gameRDD.map(line => {
  val gameStr = line.split(",")
  GameOrder(gameStr(0).toInt, gameStr(1).toInt, gameStr(2), gameStr(3), gameStr(4).toInt, gameStr(5))
})
// Convert gameOrderRDD to a DataFrame.
import spark.implicits._
val gameOrderDF: DataFrame = gameOrderRDD.toDF()
// Display the data.
gameOrderDF.show()
```
  - Select specific columns:

```scala
// Select specific columns (several equivalent ways).
gameOrderDF.select("userName", "chargeAmount").show(5)
gameOrderDF.select(gameOrderDF.col("userName"), gameOrderDF.col("chargeAmount")).show(5)
gameOrderDF.select(gameOrderDF("userName"), gameOrderDF("chargeAmount")).show(5)
gameOrderDF.select($"userName", $"chargeAmount", $"chargeAmount" / 10).show(5)
gameOrderDF.selectExpr("userName", "chargeAmount+1 as charge").show(5)
```
  - Filter rows:

```scala
gameOrderDF.filter("chargeAmount>5000").filter("gatewayId=201001").show()
gameOrderDF.filter("chargeAmount>5000 and gatewayId=201001").show()
// where(...) takes the same kind of condition; it simply delegates to filter.
gameOrderDF.where("chargeAmount>5000").show()
```
  - Aggregate data:

```scala
// Aggregations.
// Total number of top-ups per gateway.
gameOrderDF.groupBy("gatewayId").count().show()              // gatewayId, count
// Total top-up amount per gateway.
gameOrderDF.groupBy("gatewayId").sum("chargeAmount").show()  // gatewayId, sum
// Per user: total amount, number of top-ups, and average top-up.
import org.apache.spark.sql.functions._
gameOrderDF.groupBy("userName").agg(
  sum("chargeAmount").as("sumCharge"),
  count("chargeAmount").as("chargeTimes"),
  avg("chargeAmount").as("avgCharge")
).show()

// Join: total RMB amount per user, joining with the rate table gameChargeRateDF
// (built elsewhere) on gameId and chargeAmount.
gameOrderDF.join(gameChargeRateDF, "gameId" :: "chargeAmount" :: Nil).groupBy("userName").agg(
  sum("rmb").as("chargeMoney")
).show()
```
3.3 SQL-Based Programming
- To query a DataFrame with SQL, you first need to register the DataFrame as a temporary view.
```scala
// Register DataFrames as views.
// createOrReplaceTempView: the view is scoped to the current SparkSession.
gameChargeRateDF.createOrReplaceTempView("game_charge_rate")
// createOrReplaceGlobalTempView: the view is shared across all SparkSessions in the application
// and must be qualified with the global_temp database.
gameOrderDF.createOrReplaceGlobalTempView("game_order")

// Display the data.
spark.sql("select * from game_charge_rate").show(5)
spark.sql("select * from global_temp.game_order").show(5)

// Top-up amount and count per operating system.
spark.sql("select os,sum(chargeAmount),count(*) from global_temp.game_order group by os").show()

// Top-up count and amount per user, keeping only users with more than 15 top-ups.
spark.sql(
  """
    |select
    |  userName,count(chargeAmount) as chargeTimes,sum(chargeAmount) as chargeMoney
    |from global_temp.game_order
    |group by userName
    |having chargeTimes>15
  """.stripMargin).show()

// Top 3 users by top-up amount per day (topN with a window function).
spark.sql(
  """
    |select chargeDate,userName,chargeTimes,chargeMoney,rank from (
    |select
    |  chargeDate,userName,count(chargeAmount) as chargeTimes,sum(chargeAmount) as chargeMoney,
    |  row_number() over(partition by chargeDate order by sum(chargeAmount) desc) as rank
    |from global_temp.game_order
    |group by chargeDate,userName) a
    |where rank<=3 order by chargeDate asc
  """.stripMargin).show(100)
```
3.4 Dataset
- A Dataset is also a resilient distributed collection of data; it combines the strengths of RDD programming (typed, object-oriented operators) and DataFrame programming (the DSL style).
- A DataFrame is in fact a special case of Dataset: DataFrame = Dataset[Row].
- A Dataset carries schema information that an RDD does not have.
- A Dataset can be processed with functional operators such as map and flatMap, as well as with SQL.
- A Dataset is strongly typed, whereas a DataFrame is weakly typed; the sketch below illustrates the difference.
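A minimal sketch of the strong/weak typing difference (it assumes a SparkSession `spark` with `import spark.implicits._`, and a hypothetical case class `User` not taken from the course material): Dataset operations are checked against the case class at compile time, while DataFrame columns are plain strings that are only resolved at runtime.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

// Hypothetical case class; define it outside the method that uses it.
case class User(name: String, age: Int)

// Assumes: import spark.implicits._
val ds: Dataset[User] = Seq(User("Tom", 20), User("Lucy", 30)).toDS()
val df: DataFrame = ds.toDF()   // a DataFrame is just a Dataset[Row]

// Strongly typed: u.age is checked by the compiler; a typo like u.agee would not compile.
ds.map(u => u.age + 1).show()

// Weakly typed: the column name is a string, so a typo such as "agee" only fails at runtime.
df.select($"age" + 1).show()
```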
- Dataset data sources:

```scala
// Build the SparkSession.
val spark = SparkSession.builder().master("local[*]").appName("SparkSQL_dataset").getOrCreate()
import spark.implicits._
// Prepare the data: a local Scala collection.
val list = List(
  ("zhangsan", 10, 1000),
  ("Tom", 20, 1000),
  ("Lucy", 30, 1000)
)
// Create the Dataset.
val listDataset: Dataset[(String, Int, Int)] = spark.createDataset(list)
listDataset.createOrReplaceTempView("person")
spark.sql("select * from person").show()
```
```scala
// Read a JSON file directly into a typed Dataset.
// Requires a case class Person2 whose fields match the JSON schema.
val ds = spark.read.json("C:\\Users\\tp\\Desktop\\深圳大数据7期\\spark\\spark_day03\\数据\\people2.json").as[Person2]
ds.show()
```
3.5 Saving Data to External Storage
```scala
// Write data out to external storage.
// Default format is Parquet:
// gameOrderDF.write.save("D:/df/")
// Save as JSON:
// gameOrderDF.write.json("d:/df/")
// Write to a database over JDBC:
val pro = new Properties()
pro.setProperty("user", "root")
pro.setProperty("password", "root")
gameOrderDF.write.mode(SaveMode.Append).jdbc("jdbc:mysql://node-02:3306/test?useUnicode=true&characterEncoding=utf8", "games", pro)
```
- Save modes (the sketch below shows how to set a mode explicitly):
  - SaveMode.Overwrite: overwrite existing data
  - SaveMode.Append: append to existing data
  - SaveMode.Ignore: do nothing if data already exists
  - SaveMode.ErrorIfExists: throw an exception if data already exists (the default mode)
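A minimal sketch of choosing the save mode explicitly (the output path is a placeholder); format(...).save(...) is the generic form behind the json/parquet/jdbc shortcuts used above.

```scala
import org.apache.spark.sql.SaveMode

// Overwrite any existing output; Append / Ignore / ErrorIfExists are set the same way.
gameOrderDF.write
  .mode(SaveMode.Overwrite)
  .format("json")          // or "parquet", "csv", "jdbc", ...
  .save("d:/df_json/")     // placeholder output path
```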
4. Integrating Spark SQL with Hive

Integration steps:
- Copy Hive's hive-site.xml file into spark/conf
- Upload the MySQL JDBC driver jar (mysql.jar) so Spark can reach the Hive metastore database
- Add the Maven dependency:

```xml
<!-- Dependency required for Spark to integrate with Hive -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
```
- Example program:

```scala
package cn.itcast.sparksql

import org.apache.spark.sql.SparkSession

object SparkSQL_hive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSQL_hive")
      .master("local[*]").enableHiveSupport().getOrCreate()
    // Create the table.
    spark.sql(
      """
        |create table game_order(
        |  gameId int,
        |  gatewayId int,
        |  chargeDate string,
        |  userName string,
        |  chargeAmount int,
        |  os string
        |)
        |ROW FORMAT DELIMITED
        |  FIELDS TERMINATED BY '\t'
      """.stripMargin)
    // Load the data file into the table.
    spark.sql(
      """
        |load data local inpath 'src/game数据.txt' into table game_order
      """.stripMargin)
    spark.sql("select * from game_order").show()
    // spark.sql("drop table game_order")
  }
}
```
UDF and UDAF
- UDF example (converts the in-game charge amount to currency at a 200 : 1 rate):

```scala
// Requires: import org.apache.spark.sql.api.java.UDF1 and import org.apache.spark.sql.types.DoubleType
class ChargeTo$ extends UDF1[Int, Double] {
  // chargeAmount -> currency, at a rate of 200 game units per unit of currency.
  override def call(t1: Int): Double = {
    t1.toDouble / 200.toDouble
  }
}

spark.udf.register("to_money", new ChargeTo$(), DoubleType)
spark.sql("select chargeAmount,to_money(chargeAmount) from game_order").show()
```
- UDAF example (a hand-written average):

```scala
// Requires: import org.apache.spark.sql.Row,
// import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
// and import org.apache.spark.sql.types._
// User-defined aggregate that computes the average charge: sum accumulates, count counts rows.
class Avg extends UserDefinedAggregateFunction {
  // Schema of the input values.
  override def inputSchema: StructType = {
    StructType(
      StructField("input", IntegerType, false) :: Nil
    )
  }
  // Schema of the intermediate buffer used during aggregation.
  override def bufferSchema: StructType = {
    StructType(
      StructField("sum", IntegerType, false) ::
      StructField("count", IntegerType, false) :: Nil
    )
  }
  // Type of the final result.
  override def dataType: DataType = {
    DoubleType
  }
  // Whether the function always returns the same result for the same input.
  override def deterministic: Boolean = {
    true
  }
  // Initialize the buffer.
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0
    buffer(1) = 0
  }
  // Per-partition update: called for each input row within a task.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    // Accumulate the sum.
    buffer(0) = buffer.getInt(0) + input.getInt(0)
    // Increment the count.
    buffer(1) = buffer.getInt(1) + 1
  }
  // Merge the partial results of different partitions.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getInt(0) + buffer2.getInt(0)
    buffer1(1) = buffer1.getInt(1) + buffer2.getInt(1)
  }
  // Produce the final result.
  override def evaluate(buffer: Row): Any = {
    buffer.getInt(0).toDouble / buffer.getInt(1).toDouble
  }
}

spark.udf.register("my_avg", new Avg)
spark.sql("select userName,my_avg(chargeAmount) from game_order group by userName").show()
```
5. Interactive Queries over Spark SQL via Standard JDBC
- Steps:
  - Start the Spark cluster
  - Start the Thrift server: start-thriftserver.sh
  - Add the Maven dependency:
```xml
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <!-- The version must match the Hive version on the cluster -->
    <version>1.2.0</version>
</dependency>
```
  - Connect with the Hive JDBC driver and run an interactive query:

```scala
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val connection = DriverManager.getConnection("jdbc:hive2://node-01:10000")
val statement = connection.createStatement()
val resultSet = statement.executeQuery("select * from person")
while (resultSet.next()) {
  // Columns: name | age | salary
  println(resultSet.getString("name") + "\t" + resultSet.getInt("age") + "\t" + resultSet.getInt("salary"))
}
resultSet.close()
statement.close()
connection.close()
```