DataFrame and Dataset

I. DataFrame

1. Creation

https://www.cnblogs.com/frankdeng/p/9301743.html

The ways to create a DataFrame are best grouped by data source. The sources are: plain txt files, JSON/Parquet files, a MySQL database, and a Hive data warehouse.

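1. Plain txt files:

(1) Create from a case class

(2) Create from a StructType

2. JSON/Parquet files:

Read directly

3. MySQL database: open a connection and read

4. Hive data warehouse: open a connection and read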
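
The source-based creation paths listed above can be sketched as follows. This is a minimal sketch, assuming Spark 2.x or later and a local SparkSession, run as a script (for example in spark-shell). The case class Person, the temporary directory, the file names, and the sample rows are made up for illustration. MySQL and Hive need an external system, so they appear only as commented outlines with placeholder connection details.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical record type used for the case-class approach.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("CreateDataFrame").getOrCreate()
import spark.implicits._ // enables toDF() on RDDs of case classes

// Write a small sample txt file so the sketch is self-contained (illustrative data).
val dir = Files.createTempDirectory("df-demo")
val txtPath = dir.resolve("people.txt")
Files.write(txtPath, "Ann,30\nBob,25\n".getBytes("UTF-8"))
val lines = spark.sparkContext.textFile(txtPath.toString)

// 1 (1): txt file + case class; the schema is derived from Person by reflection.
val byCaseClass = lines.map(_.split(",")).map(a => Person(a(0), a(1).trim.toInt)).toDF()

// 1 (2): txt file + StructType; the schema is declared explicitly.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val rowRdd = lines.map(_.split(",")).map(a => Row(a(0), a(1).trim.toInt))
val byStructType = spark.createDataFrame(rowRdd, schema)

// 2: JSON/Parquet files are read directly. They are written first here
// only so that there is something to read.
val jsonPath = dir.resolve("people.json").toString
val parquetPath = dir.resolve("people.parquet").toString
byCaseClass.write.json(jsonPath)
byCaseClass.write.parquet(parquetPath)
val fromJson = spark.read.json(jsonPath)
val fromParquet = spark.read.parquet(parquetPath)

// 3: MySQL over JDBC (outline only; URL, table and credentials are placeholders,
// and the MySQL JDBC driver must be on the classpath):
// val fromMysql = spark.read.format("jdbc")
//   .option("url", "jdbc:mysql://localhost:3306/testdb")
//   .option("dbtable", "people")
//   .option("user", "root").option("password", "secret")
//   .load()

// 4: Hive (outline only; needs a session built with .enableHiveSupport()
// and a reachable metastore; the table name is a placeholder):
// val fromHive = spark.sql("SELECT * FROM some_db.people")
```

With the case class, field names and types come from the class definition. With a StructType they can be assembled at runtime, which helps when the columns are not known when the code is written.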

2. Common DataFrame operations:

Reference blog: https://blog.csdn.net/dabokele/article/details/52802150
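
A few of the most frequently used DataFrame operations can be sketched as below. This is only an illustrative selection, not a summary of the referenced blog; the data and the column names (name, dept, age) are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

val spark = SparkSession.builder().master("local[*]").appName("DataFrameOps").getOrCreate()
import spark.implicits._

// Illustrative data; the column names are assumptions.
val df = Seq(("Ann", "dev", 30), ("Bob", "dev", 25), ("Cid", "ops", 41))
  .toDF("name", "dept", "age")

df.printSchema() // print column names and types
df.show()        // print the first rows as a table

val names   = df.select("name")                                   // projection
val older   = df.filter(col("age") > 28)                          // row filter
val avgAge  = df.groupBy("dept").agg(avg("age").as("avg_age"))    // aggregation
val sorted  = df.orderBy(col("age").desc)                         // sort descending
val renamed = df.withColumnRenamed("dept", "department")          // rename a column
```

All of these except printSchema and show are lazy transformations: nothing is computed until an action such as show, count, or collect is called on the result.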

II. Dataset

https://www.jianshu.com/p/77811ae29fdd

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have the support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally row.columnName). The case for R is similar.
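
The description above (a Dataset built from JVM objects, then manipulated with map, flatMap and filter) can be sketched like this. It is a minimal sketch, assuming a local SparkSession; the case class Sentence and the sample strings are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical JVM object type.
case class Sentence(id: Int, text: String)

val spark = SparkSession.builder().master("local[*]").appName("DatasetDemo").getOrCreate()
import spark.implicits._ // provides the encoders for case classes and primitives

// Construct a Dataset from JVM objects.
val sentences: Dataset[Sentence] =
  Seq(Sentence(1, "spark sql"), Sentence(2, "typed dataset api")).toDS()

// Functional transformations; each lambda is checked against the element type at compile time.
val words: Dataset[String]     = sentences.flatMap(s => s.text.split(" ").toSeq)
val longWords: Dataset[String] = words.filter(w => w.length > 3)
val lengths: Dataset[Int]      = longWords.map(w => w.length)
```

A typo such as s.txt would be rejected by the compiler here, which is the strong-typing benefit the paragraph refers to.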

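III. Differences between the two

Dataset and DataFrame have exactly the same member functions (since Spark 2.0, DataFrame in Scala is simply a type alias for Dataset[Row]); the only difference is the type of each row.

A DataFrame can also be called Dataset[Row]: every row has the type Row and is not parsed into a typed object. Which fields a row has, and what type each field has, cannot be known at compile time, so specific fields can only be pulled out with the getAs method or by pattern matching on the Row.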
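
The two access styles mentioned above, getAs and pattern matching on Row, can be sketched as follows. This is a minimal sketch, assuming a local SparkSession; the column names col1 and col2 and the sample rows are chosen only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("RowAccess").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("a", 1), ("b", 2)).toDF("col1", "col2")

// getAs: the caller supplies the field name and the expected type;
// a wrong name or type is only detected at runtime.
val viaGetAs: Array[(String, Int)] =
  df.collect().map(row => (row.getAs[String]("col1"), row.getAs[Int]("col2")))

// Pattern matching on Row: fields are bound by position and checked by type.
val viaMatch: Array[(String, Int)] = df.collect().map {
  case Row(c1: String, c2: Int) => (c1, c2)
}
```

In both cases the compiler has no knowledge of the schema, so the field names and types written in the code are promises made by the programmer.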

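In a Dataset, by contrast, the row type is not fixed. Once a case class has been defined, the information in each row can be accessed freely through its fields.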

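import org.apache.spark.sql.{Dataset, SparkSession}

case class Coltest(col1: String, col2: Int) // defines the field names and types

val spark = SparkSession.builder().master("local[*]").appName("ColtestDemo").getOrCreate()
import spark.implicits._ // provides the encoder needed by toDS

/**
rdd
("a", 1)
("b", 1)
("a", 1)
* */
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val test: Dataset[Coltest] = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS()
// collect() brings the rows to the driver so the output is printed there;
// a map that only calls println is lazy and has no Encoder for Unit.
test.collect().foreach { line =>
  println(line.col1)
  println(line.col2)
}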

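As this shows, a Dataset is very convenient when a particular field of a row needs to be accessed. However, when writing highly generic functions, the row type of a Dataset is not known in advance and may be any of several case classes, so the function cannot adapt to all of them. In that situation a DataFrame, that is, Dataset[Row], handles the problem better.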
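
The point about generic functions can be sketched as follows: a helper written against DataFrame works for any schema, and a typed Dataset is passed in through toDF(). This is a minimal sketch, assuming a local SparkSession; the helper describeColumns and the case classes User and Purchase are hypothetical names introduced for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Two unrelated, hypothetical row types.
case class User(name: String, age: Int)
case class Purchase(id: Long, amount: Double, note: String)

val spark = SparkSession.builder().master("local[*]").appName("GenericHelper").getOrCreate()
import spark.implicits._

// Hypothetical generic helper: it relies only on the runtime schema of Dataset[Row],
// so it does not need to know which case class produced the data.
def describeColumns(df: DataFrame): Seq[String] =
  df.schema.fields.toSeq.map(f => s"${f.name}: ${f.dataType.simpleString}")

val users     = Seq(User("Ann", 30)).toDS()
val purchases = Seq(Purchase(1L, 9.5, "gift")).toDS()

// toDF() converts a typed Dataset into Dataset[Row].
val userCols     = describeColumns(users.toDF())
val purchaseCols = describeColumns(purchases.toDF())
```

Writing the same helper against Dataset[User] would tie it to one case class; taking a DataFrame trades compile-time field checks for this flexibility.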

posted @ 2019-12-24 16:47 guoyu1
