E-Commerce Project in Practice with Hive: Loading ETL Data into a Hive Table
1. Create a directory and upload the raw data
[hadoop@hadoop000 ~]$ hadoop fs -mkdir -p /project/input/raw
[hadoop@hadoop000 data]$ hadoop fs -put trackinfo_20130721.data /project/input/raw/
[hadoop@hadoop000 data]$ hadoop fs -ls /project/input/raw
-rw-r--r--   1 hadoop supergroup  173555592 2018-12-09 08:50 /project/input/raw/trackinfo_20130721.data
2. Run the ETL code to clean the raw data and produce log records containing only the fields we need
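The cleaning logic ships in the course jar as com.imooc.bigdata.hadoop.mr.project.mrv2.ETLApp (run in the next step). Purely as an illustration of the idea, a map-only cleaning step could look like the sketch below; the field separator and column positions are assumptions, not the course's actual log layout.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical cleaning mapper: parse each raw log line, drop malformed
// records, and emit only the columns the downstream Hive table needs.
public class ETLMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // assumed layout: tab-separated fields (the real separator may differ)
        String[] fields = value.toString().split("\t");
        if (fields.length < 5) {
            return; // skip lines that do not have enough fields
        }
        // assumed column positions for ip, url, and time
        String ip = fields[0];
        String url = fields[1];
        String time = fields[4];
        context.write(NullWritable.get(), new Text(ip + "\t" + url + "\t" + time));
    }
}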
3. Go back to the shell directory and create the ETL script
[hadoop@hadoop000 ~]$ cd shell/
[hadoop@hadoop000 shell]$ ls
[hadoop@hadoop000 shell]$ vi etl.sh
hadoop jar /home/hadoop/lib/hadoop-train-v2-1.0.jar com.imooc.bigdata.hadoop.mr.project.mrv2.ETLApp hdfs://hadoop000:8020/project/input/raw/ hdfs://hadoop000:8020/project/input/etl/
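A script created with vi is not executable by default, so grant execute permission before running it:

[hadoop@hadoop000 shell]$ chmod u+x etl.sh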
[hadoop@hadoop000 shell]$ ./etl.sh
[hadoop@hadoop000 shell]$ hadoop fs -du -s -h /project/input/etl/
35.7 M  35.7 M  /project/input/etl
4. Schedule the job with a crontab expression
Azkaban is recommended for scheduling, since it can model dependencies between jobs: ETLApp has to run first and the various dimension statistics are computed from its output, so there is a genuine dependency chain between them.
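If plain cron is enough for a first pass, a minimal sketch follows; the 01:00 schedule and the log path are assumptions, not part of the course setup:

# crontab -e: run the ETL job at 01:00 every day, appending output to a log
0 1 * * * /home/hadoop/shell/etl.sh >> /home/hadoop/shell/etl.log 2>&1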
5. Pull the cleaned data down to the local filesystem to inspect it
[hadoop@hadoop000 ~]$ cd data/
[hadoop@hadoop000 data]$ ll
[hadoop@hadoop000 data]$ rm part-r-00000
[hadoop@hadoop000 data]$ hadoop fs -get /project/input/etl/part-r-00000 .
[hadoop@hadoop000 data]$ ll
[hadoop@hadoop000 data]$ more part-r-00000
6. Load the data in part-r-00000 into the external table track_info
hive (testzhang_db)> LOAD DATA INPATH 'hdfs://hadoop000:8020/project/input/etl' OVERWRITE INTO TABLE track_info partition(day='2013-07-21');
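This step assumes track_info already exists as an external, day-partitioned table. A hypothetical DDL sketch (the column list, delimiter, and location are assumptions inferred from the fields used in this walkthrough):

CREATE EXTERNAL TABLE track_info (
  ip STRING,
  url STRING,
  time STRING,
  page STRING
) PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/project/trackinfo/';

Note that LOAD DATA INPATH moves (rather than copies) the files out of /project/input/etl into the partition's directory on HDFS.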
7. Spot-check the page column to verify the data landed in the external table
hive (testzhang_db)> select page from track_info where day='2013-07-21' limit 5;
8. Verify the row count
hive (testzhang_db)> select count(*) from track_info where day='2013-07-21';
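As a cross-check, the count should match the number of lines in the local copy of the ETL output pulled down in step 5:

[hadoop@hadoop000 data]$ wc -l part-r-00000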
