Dataset.scala(sql)

object Dataset

private[sql]: the Dataset companion object is visible only at the sql package level.



tests & errors: text after ":" explains the source code; "//" marks inserted analysis.
spark.read.textFile("...")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
: read = new DataFrameReader(self); DataFrameReader extends org.apache.spark.internal.Logging

DataFrameReader is obtained through SparkSession.read; it is the interface for loading a Dataset from external storage systems (file systems or key-value stores).

// type DataFrame = Dataset[Row]; this alias lives in the package object sql
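// For reference, the alias as it appears (simplified) in Spark's package object:

package org.apache.spark

package object sql {
  type DataFrame = Dataset[Row]
}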

// sql converts a logical plan into zero or more SparkPlans. The interfaces here are exposed for experimenting with query-plan optimization; the stable interfaces are in sql.sources.

Because the data comes from outside, the source and schema are kept as private values. (When the schema is supplied up front, the schema-inference step, e.g. for JSON (https://openproceedings.org/2017/conf/edbt/paper-62.pdf), is skipped, saving one full scan over the data.)
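// A sketch of skipping inference; the path and schema below are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// hypothetical schema for a hypothetical people.json
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType)))

// with an explicit schema, the JSON inference pass (one extra scan) is skipped
val people = spark.read.schema(schema).json("/path/to/people.json")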

DataFrameReader also offers option (for adding input options) and load (overloaded for no path, a single path, or multiple paths => DataFrame).
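// The load overloads, sketched with made-up paths (spark as in the sketch above):

// single path, and multiple paths via the varargs overload
val one  = spark.read.format("csv").option("header", "true").load("/data/a.csv")
val many = spark.read.format("parquet").load("/data/part1", "/data/part2")
// the no-path overload suits sources configured purely through options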

jdbc (creates a DataFrame over a table reachable via JDBC. extraOptions defaults to a HashMap and is overwritten with the supplied settings; the JDBC connection properties are then formatted and load is invoked)

(@param: lowerBound and upperBound are the minimum and maximum values of columnName, used together with numPartitions to decide the partition strides, alongside url, table name, user, password and the other connection properties)

numPartitions is at least 1.

This, along with `lowerBound` (inclusive) and `upperBound` (exclusive), forms partition strides for generated WHERE clause expressions used to split the column `columnName` evenly.
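// A sketch of the stride-partitioned jdbc overload; url, table and credentials are made up, spark is the session from the sketch above. With lowerBound = 0, upperBound = 10000 and numPartitions = 10 the stride is 1000, so the generated WHERE clauses are roughly "id < 1000", "id >= 1000 AND id < 2000", ..., "id >= 9000".

import java.util.Properties

val props = new Properties()
props.setProperty("user", "reader")      // made-up credentials
props.setProperty("password", "secret")

val users = spark.read.jdbc(
  "jdbc:postgresql://db:5432/shop",      // made-up url
  "users",                               // table
  "id",                                  // columnName to stride over
  0L,                                    // lowerBound (inclusive)
  10000L,                                // upperBound (exclusive)
  10,                                    // numPartitions
  props)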

The `predicates` parameter gives a list of expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame.

predicates.zipWithIndex.map { case (part, i) => JDBCPartition(part, i): Partition }
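// The predicates overload, one partition per expression; table and predicates are made up, props as above:

val byRange = Array(
  "created <  '2017-01-01'",
  "created >= '2017-01-01' AND created < '2017-06-01'",
  "created >= '2017-06-01'")
// each string becomes one JDBCPartition, hence one task
val orders = spark.read.jdbc("jdbc:postgresql://db:5432/shop", "orders", byRange, props)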

 
