Creating a DataFrame from Hive with HQL statements -- the common approach
- By default, SparkSession can neither read data from Hive nor execute HQL syntax.
To work with Hive data, Hive support must be enabled by calling
enableHiveSupport() when building the SparkSession:
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("demo")
val session = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()
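- To confirm that Hive support actually took effect, you can inspect the catalog implementation. A minimal sketch, assuming the session built above; the config key spark.sql.catalogImplementation reports "hive" when Hive support is on and "in-memory" otherwise:
// Sanity check: prints "hive" when enableHiveSupport() took effect
println(session.conf.get("spark.sql.catalogImplementation"))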
- Add the dependencies for Spark's Hive integration and the MySQL JDBC driver:
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.18</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.3.1</version>
</dependency>
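- If the project builds with sbt instead of Maven, the equivalent coordinates would look like this (a sketch using the same versions; %% resolves to the _2.11 suffix under Scala 2.11):
// sbt equivalent of the Maven dependencies above
libraryDependencies ++= Seq(
  "mysql" % "mysql-connector-java" % "8.0.18",
  "org.apache.spark" %% "spark-hive" % "2.3.1"
)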
- Spark SQL operates on Hive through the Hive metastore, so the hive-site.xml that configures the metastore database must be placed under the project's resources directory:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://node1:3306/hive_metastore?serverTimezone=UTC&amp;createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>Jsq123456...</value>
<description>password to use against metastore database</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>node1</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
<description>TCP port number to listen on, default 10000</description>
</property>
<property>
<name>hive.server2.thrift.http.port</name>
<value>10001</value>
</property>
<property>
<name>hive.server2.thrift.client.user</name>
<value>root</value>
<description>Username to use against thrift client</description>
</property>
<property>
<name>hive.server2.thrift.client.password</name>
<value>root</value>
<description>Password to use against thrift client</description>
</property>
</configuration>
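- Once the metastore is reachable, an existing Hive table can be read straight into a DataFrame. A minimal sketch, reusing the session built in the code below and assuming the project.area_pvs table mentioned in its comments exists in Hive:
// Read an existing Hive table into a DataFrame (table name assumed to exist)
val areaPvs: DataFrame = session.sql("select * from project.area_pvs")
// session.table("project.area_pvs") is an equivalent, SQL-free alternative
areaPvs.show()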
- Write the code
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

object HiveCreateDataFrame {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("demo")
    // Store databases created from Spark SQL on HDFS rather than a local path
    sparkConf.set("spark.sql.warehouse.dir", "hdfs://node1:9000/user/hive/warehouse")
    /*
      Enable Hive support so Spark SQL can connect to and operate on Hive.
      Once connected, the sql function accepts HQL statements
      (not only queries -- DDL such as CREATE also works).
      Note: databases created from Spark SQL are, by default, stored under a
      local path next to the code rather than on HDFS.
    */
    val session = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()
    // Run HQL against Hive; each call returns a DataFrame
    // (an existing table such as project.area_pvs could be queried the same way)
    val dataFrame0: DataFrame = session.sql("create database if not exists aa")
    val dataFrame1: DataFrame = session.sql("show databases")
    dataFrame0.show()
    dataFrame1.show()
    session.stop()
  }
}
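- Going the other way, a DataFrame can be persisted as a Hive-managed table. A minimal sketch, assuming the aa database created above; the scores table name and sample rows are made up for illustration:
// Persist a DataFrame as a Hive-managed table in the "aa" database
import session.implicits._
val scores = Seq(("zhangsan", 90), ("lisi", 85)).toDF("name", "score")
scores.write.mode("overwrite").saveAsTable("aa.scores")
session.sql("select * from aa.scores").show()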
- Notes
- Databases created from Spark SQL are by default stored under a local path next to the code, not on HDFS.
-- To store them on HDFS, either call sparkConf.set("spark.sql.warehouse.dir", "hdfs://node1:9000/user/hive/warehouse") or set hive.metastore.warehouse.dir to hdfs://node1:9000/user/hive/warehouse in hive-site.xml.
-- If local storage is acceptable, no extra setting is needed.