Spark / hadoop / hive - Post category - 机器狗mo

posted @ 2021-07-18 12:54 机器狗mo Views(116) Comments(0) Recommended(0)

Summary: select apply.*, label.* from apply_month apply left join overdue_label label on apply.transactionid = label.transid where apply.stat_month=${month} an …
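The left join in the excerpt above keeps every row of apply_month for the chosen month, whether or not a matching row exists in overdue_label. A minimal sketch of that behaviour, using Python's built-in sqlite3 in place of Hive: the table and join column names come from the excerpt, while the `overdue` column, the sample rows, and the helper name `applications_with_labels` are assumptions made up for illustration.

```python
import sqlite3

# In-memory toy database standing in for the Hive tables (sample rows are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE apply_month (transactionid TEXT, stat_month TEXT);
CREATE TABLE overdue_label (transid TEXT, overdue INTEGER);
INSERT INTO apply_month VALUES ('t1', '202009'), ('t2', '202009'), ('t3', '202008');
INSERT INTO overdue_label VALUES ('t1', 1);
""")


def applications_with_labels(conn, month):
    """Return applications for one month, with the label where one exists."""
    query = """
        SELECT a.transactionid, a.stat_month, l.overdue
        FROM apply_month AS a
        LEFT JOIN overdue_label AS l
          ON a.transactionid = l.transid
        WHERE a.stat_month = ?
        ORDER BY a.transactionid
    """
    return conn.execute(query, (month,)).fetchall()


rows = applications_with_labels(conn, "202009")
for row in rows:
    print(row)
```

The unmatched application `t2` is still returned, with `None` in the label column. The `?` placeholder plays the role that `${month}` plays in the original query.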

posted @ 2020-10-16 17:33 机器狗mo Views(748) Comments(0) Recommended(0)

Hive usage summary

Summary: Date and time. Converting a fixed date to a timestamp: select unix_timestamp('2016-08-16','yyyy-MM-dd') --1471276800 select unix_timestamp('20160816','yyyyMMdd') --1471276800 select unix_ …
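Hive's unix_timestamp interprets the date string in the session's time zone, and the value 1471276800 in the excerpt is what midnight on 2016-08-16 works out to at UTC+8. The sketch below reproduces the conversion in Python to make that dependency visible. The helper name `to_unix_timestamp` and the UTC+8 default are assumptions for illustration; note that Python uses `%Y-%m-%d` where Hive uses the Java pattern `yyyy-MM-dd`.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the cluster in the excerpt runs at UTC+8, which matches 1471276800.
UTC_PLUS_8 = timezone(timedelta(hours=8))


def to_unix_timestamp(text, fmt, tz=UTC_PLUS_8):
    """Parse a date string and return seconds since the Unix epoch."""
    naive = datetime.strptime(text, fmt)
    return int(naive.replace(tzinfo=tz).timestamp())


ts_dashed = to_unix_timestamp("2016-08-16", "%Y-%m-%d")
ts_compact = to_unix_timestamp("20160816", "%Y%m%d")
ts_utc = to_unix_timestamp("2016-08-16", "%Y-%m-%d", tz=timezone.utc)

print(ts_dashed)   # 1471276800, same value as in the Hive excerpt
print(ts_compact)  # 1471276800
print(ts_utc)      # 1471305600, eight hours later when read as UTC
```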

posted @ 2020-10-14 11:53 机器狗mo Views(126) Comments(0) Recommended(0)

hadoop streaming: bucketing and sorting

Summary: https://blog.csdn.net/qq_26033611/article/details/86541808

posted @ 2020-06-13 15:42 机器狗mo Views(41) Comments(0) Recommended(0)

hadoop streaming: getting the map input file path

Summary: ## Determine the input file import os import sys for line in sys.stdin: map_input_file = os.environ.get("map_input_file") if path in map_input_file: # do sth …
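The excerpt above is flattened and leaves `path` undefined, so here is a minimal runnable sketch of the same idea. Hadoop streaming exports job configuration to the mapper as environment variables with dots replaced by underscores, so the input file appears as `map_input_file` on older releases and as `mapreduce_map_input_file` on newer ones. The helper names and the sample path are assumptions for illustration, not part of the original post.

```python
import os


def current_input_file(environ=os.environ):
    """Return the path of the file this map task is reading, or '' if unknown."""
    return (environ.get("mapreduce_map_input_file")
            or environ.get("map_input_file")
            or "")


def select_lines(lines, path_fragment, environ=os.environ):
    """Yield input lines only when the task's input file contains path_fragment."""
    if path_fragment not in current_input_file(environ):
        return
    for line in lines:
        yield line.rstrip("\n")


# Demo with a fake environment so the sketch runs outside Hadoop.
fake_env = {"map_input_file": "hdfs://nn/data/orders/part-00000"}
kept = list(select_lines(["a\t1\n", "b\t2\n"], "/orders/", fake_env))
skipped = list(select_lines(["a\t1\n"], "/users/", fake_env))
print(kept, skipped)
```

In a real mapper the first argument would be `sys.stdin` and the default `os.environ` would be used.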

posted @ 2019-10-29 16:56 机器狗mo Views(356) Comments(0) Recommended(0)

pyspark notes

Summary: import os import sys spark_name = os.environ.get('SPARK_HOME',None) if not spark_name: raise ValueError('Spark environment is not configured properly') sys.path.insert(0,os.path …
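The excerpt above is cut off at `sys.path.insert(0,os.path`, so the rest of the author's code is not known. A common way to finish this kind of setup is to put `$SPARK_HOME/python` and the bundled py4j source zip on `sys.path`; the sketch below does that. The function names are hypothetical, and the directory layout is an assumption based on how Spark distributions are usually laid out.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """List the directories pyspark needs on sys.path for a given SPARK_HOME."""
    python_dir = os.path.join(spark_home, "python")
    # The py4j version differs between Spark releases, so match it with a glob.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(environ=os.environ):
    """Prepend Spark's Python directories to sys.path; fail loudly if unset."""
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        raise ValueError("SPARK_HOME is not set")
    # Insert in reverse so the first entry of the list ends up first on sys.path.
    for path in reversed(spark_python_paths(spark_home)):
        if path not in sys.path:
            sys.path.insert(0, path)
    return spark_home
```

After `add_spark_to_sys_path()` succeeds, `import pyspark` should resolve against the local Spark installation.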

posted @ 2018-11-23 22:08 机器狗mo Views(425) Comments(0) Recommended(0)

Connecting jupyter to pyspark

Summary: Reference: "Introduction to Spark and using pyspark"

posted @ 2018-09-07 20:23 机器狗mo Views(1222) Comments(0) Recommended(0)

hadoop daily usage notes

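Summary: 1. Hadoop Distributed File System (HDFS). HDFS is based on GFS (Google File System); it can store massive amounts of data and gives clients transparent access over a distributed network. HDFS is a block-structured filesystem: it splits each file into blocks of a fixed size, and the blocks of one file are stored on different nodes. In order to …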
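To make the block structure described above concrete, here is a toy sketch that cuts a file of a given size into fixed-size blocks and spreads them over nodes. The 128 MB block size is the default `dfs.blocksize` in Hadoop 2 and later. The round-robin assignment is a deliberate simplification and an assumption of this sketch: real HDFS placement also handles replication and rack awareness. Function and node names are made up.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_round_robin(blocks, nodes):
    """Toy placement: map each block offset to a node, cycling through nodes."""
    return {offset: nodes[i % len(nodes)] for i, (offset, _) in enumerate(blocks)}


blocks = split_into_blocks(300 * MB)
placement = assign_round_robin(blocks, ["node1", "node2"])
for offset, length in blocks:
    print(offset // MB, length // MB, placement[offset])
```

A 300 MB file yields two full 128 MB blocks and one 44 MB block, and the blocks land on different nodes.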

posted @ 2018-07-12 21:09 机器狗mo Views(451) Comments(0) Recommended(0)

Creating an external table in Hive

Summary: CREATE EXTERNAL TABLE `table_name`( `column1` string, `column2` string, `column3` string) PARTITIONED BY ( `proc_date` string) ROW FORMAT SERDE 'org.a …
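The table in the excerpt above is partitioned by `proc_date`, and for an external table each partition normally has to be registered before its files become visible to queries. The sketch below only builds the corresponding `ALTER TABLE ... ADD PARTITION` statement as a string. The helper name, the base location, and the `proc_date=<value>` directory layout are assumptions for illustration; the original post does not show this step.

```python
def add_partition_statement(table, proc_date, base_location):
    """Build a Hive statement registering one proc_date partition."""
    # Assumed layout: one sub-directory per partition, named proc_date=<value>.
    location = f"{base_location.rstrip('/')}/proc_date={proc_date}"
    return (
        f"ALTER TABLE `{table}` ADD IF NOT EXISTS "
        f"PARTITION (proc_date='{proc_date}') LOCATION '{location}'"
    )


statement = add_partition_statement("table_name", "2017-11-21", "/data/table_name/")
print(statement)
```

The resulting string can be passed to the hive CLI or any Hive client. The values here come from trusted code; input from users would need validation before being interpolated into SQL.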

posted @ 2017-11-21 15:47 机器狗mo Views(663) Comments(0) Recommended(0)

Installing spark on Ubuntu

Summary: Method 1: run jps to check for Java; sudo apt-get install openjdk** sudo apt-get install scala; choose a download source, then sudo wget <download link> sudo tar xf spark*** cd spark** sudo ./bin/pyspark ( …

posted @ 2017-03-01 13:51 机器狗mo Views(1010) Comments(0) Recommended(0)
