David_Zhu - Cnblogs (博客园)

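Big Data from Beginner to Mastery 18 -- Importing a relational database into HDFS and Hive tables with Sqoop

Summary: 1. Choose a database; here we use the standard MySQL sakila database: mysql -u root -D sakila -p 2. First, try importing the table data into HDFS files, so that it can later be processed with Spark as a DataFrame or an RDD: sqoop import --connect "jdbc:mysql: ... Read more

posted @ 2018-12-26 14:10 David_Zhu Views (634) Comments (0) Recommendations (0)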

Big Data from Beginner to Mastery 17 -- Using UNION ALL and DISTINCT

Summary: 1. Using UNION ALL. Use either UNION ALL or UNION: select * from rental where rental_id <10 union all select * from rental where rental_id >30 and rental_id <40 2. disc ... Read more

posted @ 2018-12-20 11:36 David_Zhu Views (255) Comments (0) Recommendations (0)
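
To make the difference between UNION ALL and UNION concrete, here is a minimal sketch that runs the query from the summary above against an in-memory SQLite database. SQLite is only a stand-in for Hive here, and the two-column rental table with 50 generated rows is an assumption for illustration, not the real sakila schema.

```python
import sqlite3

# Toy stand-in for the sakila rental table (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rental (rental_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO rental VALUES (?, ?)",
    [(i, i % 5) for i in range(1, 51)],
)

# The query from the summary: two disjoint ranges, so nothing is duplicated.
union_all_rows = conn.execute(
    "SELECT * FROM rental WHERE rental_id < 10 "
    "UNION ALL "
    "SELECT * FROM rental WHERE rental_id > 30 AND rental_id < 40"
).fetchall()

# Overlapping ranges show the difference: UNION ALL keeps duplicate rows,
# while plain UNION removes them.
overlap_all = conn.execute(
    "SELECT * FROM rental WHERE rental_id < 10 "
    "UNION ALL "
    "SELECT * FROM rental WHERE rental_id < 5"
).fetchall()
overlap_distinct = conn.execute(
    "SELECT * FROM rental WHERE rental_id < 10 "
    "UNION "
    "SELECT * FROM rental WHERE rental_id < 5"
).fetchall()

print(len(union_all_rows), len(overlap_all), len(overlap_distinct))
```

With disjoint ranges both operators return the same rows; the choice only matters when the branches can overlap, where UNION ALL avoids the cost of deduplication.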

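Big Data from Beginner to Mastery 16 -- Conditional expressions and aggregate functions in Hive

Summary: 1. Conditional expressions: case when ... then ... when ... then ... when ... then ... end select film_id,rpad(title,20," "),case when rating in ("G","PG","PG-13") then "YOUNG ... Read more

posted @ 2018-12-12 18:33 David_Zhu Views (848) Comments (0) Recommendations (0)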
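
As a rough illustration of combining a CASE expression with an aggregate, the sketch below uses an in-memory SQLite database in place of Hive. The film rows are invented, and the 'OTHER' label is a placeholder of my own; the summary above is cut off after "YOUNG", so the labels the post actually uses are not known.

```python
import sqlite3

# Hypothetical film rows; only film_id, title and rating mirror the summary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (film_id INTEGER, title TEXT, rating TEXT)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?)",
    [
        (1, "ALPHA", "G"),
        (2, "BRAVO", "PG-13"),
        (3, "CHARLIE", "R"),
        (4, "DELTA", "NC-17"),
        (5, "ECHO", "PG"),
    ],
)

# Bucket each film with CASE, then count films per bucket.
# 'OTHER' is a placeholder label, not taken from the post.
# SQLite lets GROUP BY refer to the alias; other engines may need the
# CASE expression repeated in the GROUP BY clause.
rows = conn.execute(
    """
    SELECT CASE WHEN rating IN ('G', 'PG', 'PG-13') THEN 'YOUNG'
                ELSE 'OTHER'
           END AS audience,
           COUNT(*) AS films
    FROM film
    GROUP BY audience
    ORDER BY audience
    """
).fetchall()

print(rows)
```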

Big Data from Beginner to Mastery 15 -- Handling the date type in Hive

Summary: 1. Basic date handling. // date handling select current_date; select current_timestamp; // to_date(time); to_date(string) select to_date(current_timestamp); select to_date(rent ... Read more

posted @ 2018-12-12 16:46 David_Zhu Views (1461) Comments (0) Recommendations (0)

Big Data from Beginner to Mastery 14 -- String operations in Hive

Summary: 1. Basic operations: concat(string,string,string) concat_ws(string,string,string) select customer_id,concat_ws(" ",first_name,last_name),email,address_id from custome ... Read more

posted @ 2018-12-12 14:42 David_Zhu Views (240) Comments (0) Recommendations (0)

Big Data from Beginner to Mastery 13 -- Preparing the MySQL database for later chapters

Summary: We will be using the sakila database extensively throughout the rest of the course, and it would be great if you could follow the installation process below. Read more

posted @ 2018-12-11 18:46 David_Zhu Views (142) Comments (0) Recommendations (0)

Big Data from Beginner to Mastery 12 -- Registering a Spark DataFrame as a Hive temporary table

Summary: 1. Obtain the initial data and build a DataFrame: val ny= sc.textFile("data/new_york/") val header=ny.first val filterNY =ny.filter(listing=>{ listing.split(",").size==14 && listin ... Read more

posted @ 2018-12-11 13:52 David_Zhu Views (1103) Comments (0) Recommendations (0)

Big Data from Beginner to Mastery 11 -- Basic Spark DataFrame operations

Summary: // dataframe is the topic 1. Obtain the base data. First load the data as an RDD: val ny= sc.textFile("data/new_york/") val header=ny.first val filterNY =ny.filter(listing=>{ listing.sp ... Read more

posted @ 2018-12-10 12:03 David_Zhu Views (315) Comments (0) Recommendations (0)

Big Data from Beginner to Mastery 10 -- Using groupByKey on Spark RDDs

Summary: // groupbykey 1. Prepare the data: val flights=sc.textFile("data/Flights/flights.csv") val sampleFlights=sc.parallelize(flights.take(1000)) val header=sampleFlights.fir ... Read more

posted @ 2018-12-07 17:10 David_Zhu Views (2737) Comments (0) Recommendations (0)
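
For readers unfamiliar with what groupByKey does to a pair RDD, the following plain-Python sketch mimics its semantics on a small list: every value is collected under its key, and any aggregation happens afterwards. It does not use Spark; the helper group_by_key and the (carrier, delay) pairs are hypothetical and are not taken from flights.csv.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all values that share a key, like groupByKey on a pair RDD."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Hypothetical (carrier, delay_minutes) pairs, invented for illustration.
pairs = [("AA", 5), ("DL", 12), ("AA", -3), ("UA", 0), ("DL", 7)]

grouped = group_by_key(pairs)

# Aggregate after grouping: mean delay per carrier.
mean_delay = {
    carrier: sum(delays) / len(delays) for carrier, delays in grouped.items()
}

print(grouped)
print(mean_delay)
```

Because groupByKey materializes every value for a key before aggregating, Spark code that only needs a sum or count per key is often written with reduceByKey instead.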

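Big Data from Beginner to Mastery 9 -- A real word count

Summary: This chapter implements a real word count program in Spark. 1. Load a dataset from the local file system: val speechRdd= sc.parallelize(scala.io.Source.fromFile("/home/hdfs/Data/WordCount/speech").getLines.toList) Read more

posted @ 2018-12-06 14:22 David_Zhu Views (186) Comments (0) Recommendations (0)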
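
The core idea of a word count (split each line into words, then tally occurrences per word) can be sketched without Spark. The version below is a plain-Python approximation, not the post's Spark program; the function word_count, the tokenizing regular expression, and the sample lines are assumptions made for illustration.

```python
import re
from collections import Counter


def word_count(lines):
    """Lowercase each line, split it into words, and tally every word."""
    counts = Counter()
    for line in lines:
        # Treat runs of letters and apostrophes as words (an assumed rule).
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts


# Invented sample lines standing in for the speech file.
lines = [
    "Four score and seven years ago",
    "our fathers brought forth",
    "Four more years",
]

counts = word_count(lines)
top_two = counts.most_common(2)

print(top_two)
```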
