Hive Study Notes
This article covers Hive's basic concepts and usage, as well as its integration with other big-data products.
1. Introduction to Hive
Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to database tables and provides an SQL-like query language called HQL (Hive Query Language). Hive executes queries by translating HQL into MapReduce jobs. Hive also supports user-defined functions (UDFs) and plugins, which make it easy to extend its functionality.
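For example, a custom UDF is just a Java class. Below is a minimal sketch using the classic UDF base class, assuming Hive's hive-exec library is on the classpath; the class name ToUpperUDF is made up for illustration:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical example UDF that upper-cases a string column.
public class ToUpperUDF extends UDF {
    // Hive calls evaluate() once per row; a null input yields a null output.
    public Text evaluate(Text input) {
        return input == null ? null : new Text(input.toString().toUpperCase());
    }
}
After packaging the class into a jar, it can be registered from the Hive shell with ADD JAR and CREATE TEMPORARY FUNCTION.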
2. Installing and Deploying Hive
2.1 Single-node deployment
2.1.1 Install Hadoop
First install Hadoop, either from the official binary distribution or by compiling it yourself.
2.1.2 Download and unpack Hive
Download the Hive tarball from the official website and unpack it.
2.1.3 Configure environment variables
Add the following to the .bashrc file:
export HADOOP_HOME=/path/to/hadoop
export HIVE_HOME=/path/to/hive
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin
2.1.4 Start Hive
Start Hive by running the following command in a terminal:
hive
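Note: on Hive 2.x and later, the metastore schema must be initialized before Hive is started for the first time, for example with schematool -initSchema -dbType derby (the install script in section 5.1 does exactly this).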
2.2 Cluster deployment
2.2.1 Install a Hadoop cluster
Install Hive against an existing Hadoop cluster. Hive can either be installed on every node, or installed on a single node that connects to the rest of the Hadoop cluster remotely.
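For the remote-access variant, clients are usually pointed at a shared metastore service through hive-site.xml. A minimal sketch (metastore-host is a placeholder; 9083 is the default metastore port):
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>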
2.2.2 Download and unpack Hive
Download the Hive tarball from the official website and unpack it.
2.2.3 Configure environment variables
Add the following to the .bashrc file:
export HADOOP_HOME=/path/to/hadoop
export HIVE_HOME=/path/to/hive
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin
2.2.4 Start Hive
Start Hive by running the following command in a terminal:
hive
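The JDBC examples in section 3.2 also require the HiveServer2 service, which listens on port 10000 by default. A quick way to start it in the background and verify connectivity with the bundled Beeline client:
nohup hive --service hiveserver2 > hiveserver2.log 2>&1 &
beeline -u jdbc:hive2://localhost:10000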
3. Common Hive Commands
3.1 Shell commands
3.1.1 Create a database
CREATE DATABASE database_name;
3.1.2 Drop a database
DROP DATABASE database_name;
3.1.3 Create a table
CREATE TABLE table_name (column_name data_type, column_name data_type, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
3.1.4 Drop a table
DROP TABLE table_name;
3.1.5 Load data
LOAD DATA INPATH 'path/to/data' INTO TABLE table_name;
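Note: LOAD DATA INPATH moves a file that already resides on HDFS into the table's directory; to upload a file from the local file system instead, use LOAD DATA LOCAL INPATH (as the JDBC example in section 3.2.4 does).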
3.1.6 Query data
SELECT column_name FROM table_name;
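Putting these commands together, a worked example with a hypothetical users table (the table name, columns, and path are made up for illustration):
CREATE TABLE users (id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
LOAD DATA INPATH '/data/users.csv' INTO TABLE users;
SELECT name FROM users WHERE id > 100;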
3.2 Java (JDBC) operations
3.2.1 Create a database
import java.sql.*;

public class HiveCreateDatabase {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    // Class.forName can throw ClassNotFoundException, so it must be declared here
    public static void main(String[] args) throws SQLException, ClassNotFoundException {
        // Register the driver and open a connection to HiveServer2
        Class.forName(driverName);
        Connection connection = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
        // Create the database
        Statement statement = connection.createStatement();
        String query = "CREATE DATABASE IF NOT EXISTS database_name";
        statement.executeUpdate(query);
        // Close the statement and connection
        statement.close();
        connection.close();
    }
}
3.2.2 Create a table
import java.sql.*;

public class HiveCreateTable {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException, ClassNotFoundException {
        // Register the driver and open a connection to HiveServer2
        Class.forName(driverName);
        Connection connection = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
        // Create the table (template DDL: replace the column list with real columns before running)
        Statement statement = connection.createStatement();
        String query = "CREATE TABLE IF NOT EXISTS table_name (column_name data_type, column_name data_type, ...)";
        statement.executeUpdate(query);
        // Close the statement and connection
        statement.close();
        connection.close();
    }
}
3.2.3 Create and configure a Hive connection
import java.sql.*;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) {
        // Register the JDBC driver; exit if it is not on the classpath
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = null;
        Statement stmt = null;
        ResultSet rs = null;
        try {
            // Connect to HiveServer2, create a table, and run a simple query
            con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
            stmt = con.createStatement();
            String tableName = "test";
            stmt.execute("CREATE TABLE IF NOT EXISTS " + tableName + " (key INT, value STRING)");
            rs = stmt.executeQuery("SELECT * FROM " + tableName);
            while (rs.next()) {
                System.out.println(rs.getInt(1) + " " + rs.getString(2));
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            // Release JDBC resources in reverse order of acquisition
            try {
                if (rs != null) rs.close();
                if (stmt != null) stmt.close();
                if (con != null) con.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }
}
3.2.4 Execute HQL queries
import java.sql.*;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) {
        // Register the JDBC driver; exit if it is not on the classpath
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = null;
        Statement stmt = null;
        ResultSet rs = null;
        try {
            con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
            stmt = con.createStatement();
            String tableName = "test";
            stmt.execute("CREATE TABLE IF NOT EXISTS " + tableName + " (key INT, value STRING)");
            // Load a local file into the table, then query it back
            stmt.execute("LOAD DATA LOCAL INPATH '/path/to/data' INTO TABLE " + tableName);
            rs = stmt.executeQuery("SELECT * FROM " + tableName);
            while (rs.next()) {
                System.out.println(rs.getInt(1) + " " + rs.getString(2));
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            // Release JDBC resources in reverse order of acquisition
            try {
                if (rs != null) rs.close();
                if (stmt != null) stmt.close();
                if (con != null) con.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }
}
4. Hive Integration
4.1 Integrating HBase with Hive
HBase is a distributed, column-oriented NoSQL database. Integrating HBase with Hive combines HBase's real-time access with Hive's batch analytics. The detailed setup is omitted here, but the usual shape is sketched below.
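A minimal sketch: a Hive external table backed by HBase's storage handler, assuming an existing HBase table named hbase_table with a column family cf (all names are placeholders, and the Hive HBase handler jar must be available):
CREATE EXTERNAL TABLE hive_on_hbase (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:value")
TBLPROPERTIES ("hbase.table.name" = "hbase_table");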
4.2 Integrating Spark with Hive
Spark is a fast, general-purpose distributed computing engine. Integrating Spark with Hive enables efficient analysis and processing of Hive-managed data.
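A minimal sketch in Java, assuming a Spark build compiled with Hive support and access to the Hive metastore; the test table is the one created in section 3.2:
import org.apache.spark.sql.SparkSession;

public class SparkHiveExample {
    public static void main(String[] args) {
        // enableHiveSupport() wires Spark SQL to the Hive metastore
        SparkSession spark = SparkSession.builder()
                .appName("SparkHiveExample")
                .enableHiveSupport()
                .getOrCreate();
        // Query a Hive table through Spark SQL and print the rows
        spark.sql("SELECT * FROM test").show();
        spark.stop();
    }
}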
5. Install Script and Dockerfile
5.1 Install script
The following script automates a Hive installation:
#!/bin/bash
# Install Java
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk
# Download Hadoop
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
tar zxvf hadoop-3.2.2.tar.gz
# Configure Hadoop
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/hadoop-3.2.2/etc/hadoop/hadoop-env.sh
echo "export HADOOP_HOME=~/hadoop-3.2.2" >> ~/.bashrc
# Single quotes keep $PATH and $HADOOP_HOME from expanding now (HADOOP_HOME is not set yet);
# they expand later, when .bashrc is sourced
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
# Download Hive
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar zxvf apache-hive-3.1.2-bin.tar.gz
# Configure Hive
echo "export HIVE_HOME=~/apache-hive-3.1.2-bin" >> ~/.bashrc
echo 'export PATH=$PATH:$HIVE_HOME/bin' >> ~/.bashrc
source ~/.bashrc
# Start Hadoop
cd ~/hadoop-3.2.2
bin/hdfs namenode -format
sbin/start-dfs.sh
# Start the Hive metastore
cd ~/apache-hive-3.1.2-bin
bin/schematool -initSchema -dbType derby
nohup bin/hive --service metastore > metastore.log 2>&1 &
5.2 Dockerfile
A Dockerfile can also be used to build a Hive image:
FROM openjdk:8-jdk
# Set the time zone to Shanghai
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
# Point apt-get at the Aliyun mirror
RUN sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list
RUN sed -i 's/security.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list
# Install required packages
RUN apt-get update && \
apt-get install -y wget net-tools procps && \
rm -rf /var/lib/apt/lists/*
# Download Hadoop
RUN wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz && \
tar zxvf hadoop-3.3.0.tar.gz && \
rm hadoop-3.3.0.tar.gz
# Configure Hadoop
ENV JAVA_HOME /usr/local/openjdk-8
ENV HADOOP_HOME /hadoop-3.3.0
ENV PATH $PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
RUN echo "export JAVA_HOME=/usr/local/openjdk-8" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh && \
echo "export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin" >> /root/.bashrc
# Download Hive
RUN wget https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz && \
tar zxvf apache-hive-3.1.2-bin.tar.gz && \
rm apache-hive-3.1.2-bin.tar.gz
# Configure Hive
ENV HIVE_HOME /apache-hive-3.1.2-bin
ENV PATH $PATH:$HIVE_HOME/bin
RUN echo "export HIVE_HOME=/apache-hive-3.1.2-bin" >> /root/.bashrc && \
echo "export PATH=$PATH:$HIVE_HOME/bin" >> /root/.bashrc