24.11.1

实验6
熟悉Hive的基本操作

1.实验目的
(1)理解Hive作为数据仓库在Hadoop体系结构中的角色。
(2)熟练使用常用的HiveQL。
2.实验平台
操作系统:Ubuntu18.04(或Ubuntu16.04)。
Hadoop版本:2.7.3。
Hive版本:2.7.3。
JDK版本:1.8。
3.数据集
由《Hive编程指南》(O'Reilly系列,人民邮电出版社)提供,下载地址:
https://raw.githubusercontent.com/oreillymedia/programming_hive/master/prog-hive-1st-ed-data.zip
备用下载地址:
https://www.cocobolo.top/FileServer/prog-hive-1st-ed-data.zip
解压后可以得到本实验所需的stocks.csv和dividends.csv两个文件。
4.实验步骤
(1)创建一个内部表stocks,字段分隔符为英文逗号,表结构如表14-11所示。
表14-11 stocks表结构
col_name data_type
exchange string
symbol string
ymd string
price_open float
price_high float
price_low float
price_close float
volume int
price_adj_close float
CREATE TABLE stocks (exchange STRING, symbol STRING, ymd STRING, price_open FLOAT, price_high FLOAT, price_low FLOAT, price_close FLOAT, volume INT, price_adj_close FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

(2)创建一个外部分区表dividends(分区字段为exchange和symbol),字段分隔符为英文逗号,表结构如表14-12所示。
表14-12 dividends表结构
col_name data_type
ymd string
dividend float
exchange string
symbol string

CREATE EXTERNAL TABLE dividends (ymd STRING, dividend FLOAT) PARTITIONED BY (exchange STRING, symbol STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/path/to/dividends/';

(3)从stocks.csv文件向stocks表中导入数据。
先上传到hdfs,再从hdfs导入
LOAD DATA INPATH '/hdfs/hive/stocks.csv' INTO TABLE stocks;
这个还挺麻烦的,hive没有权限,然后给权限还需要hdfs用户
换了好几个权限,直接把目录归属者给了hive就行了

(4) 创建一个未分区的外部表dividends_unpartitioned,并从dividends.csv向其中导入数据,表结构如表14-13所示。
表14-13 dividends_unpartitioned表结构
col_name data_type
ymd string
dividend float
exchange string
symbol string
CREATE EXTERNAL TABLE dividends_unpartitioned (ymd STRING, dividend FLOAT, exchange STRING, symbol STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/hdfs/hive/dividends_unpartitioned/'; LOAD DATA LOCAL INPATH '/ hdfs/hive /dividends.csv' INTO TABLE dividends_unpartitioned;

(4)通过对dividends_unpartitioned的查询语句,利用Hive自动分区特性向分区表dividends各个分区中插入对应数据。

先分片加速不然内存不够
SET mapreduce.input.fileinputformat.split.maxsize=134217728; -- 128 MB

SET hive.exec.dynamic.partition = true; SET hive.exec.dynamic.partition.mode = nonstrict; INSERT OVERWRITE TABLE dividends PARTITION (exchange, symbol) SELECT ymd, dividend, exchange, symbol FROM dividends_unpartitioned;

(6)查询IBM公司(symbol=IBM)从2000年起所有支付股息的交易日(dividends表中有对应记录)的收盘价(price_close)。

(7)查询苹果公司(symbol=AAPL)2008年10月每个交易日的涨跌情况,涨显示rise,跌显示fall,不变显示unchange。
SELECT ymd,
CASE
WHEN price_close > price_open THEN 'rise'
WHEN price_close < price_open THEN 'fall'
ELSE 'unchange'
END AS status
FROM stocks
WHERE symbol = 'AAPL'
AND ymd LIKE '2008-10%';

(8)查询stocks表中收盘价(price_close)比开盘价(price_open)高得最多的那条记录的交易所(exchange)、股票代码(symbol)、日期(ymd)、收盘价、开盘价及二者差价。

SELECT exchange, symbol, ymd, price_close, price_open, (price_close - price_open) AS price_difference
FROM stocks
ORDER BY price_difference DESC
LIMIT 1;

(9)从stocks表中查询苹果公司(symbol=AAPL)年平均调整后收盘价(price_adj_close) 大于50美元的年份及年平均调整后收盘价。
WITH yearly_avg AS (
SELECT symbol, YEAR(ymd) AS year, AVG(price_adj_close) AS avg_price_adj_close
FROM stocks
GROUP BY symbol, YEAR(ymd)
)
SELECT year, symbol, avg_price_adj_close
FROM (
SELECT year, symbol, avg_price_adj_close,
ROW_NUMBER() OVER (PARTITION BY year ORDER BY avg_price_adj_close DESC) AS rank
FROM yearly_avg
) AS ranked
WHERE rank <= 3;

(10)查询每年年平均调整后收盘价(price_adj_close)前三名的公司的股票代码及年平均调整后收盘价。
WITH yearly_avg AS (
SELECT symbol, YEAR(ymd) AS year, AVG(price_adj_close) AS avg_price_adj_close
FROM stocks
GROUP BY symbol, YEAR(ymd)
)
SELECT year, symbol, avg_price_adj_close
FROM (
SELECT year, symbol, avg_price_adj_close,
ROW_NUMBER() OVER (PARTITION BY year ORDER BY avg_price_adj_close DESC) AS rank
FROM yearly_avg
) AS ranked
WHERE rank <= 3;
最后一个命令master节点卡死机了,不跑了
重新试了,跑了就死机

posted @ 2024-11-01 18:07  起名字真难_qmz  阅读(14)  评论(0)    收藏  举报