Dataset.scala(sql)
object Dataset
: private[sql], i.e. the companion object is visible only at the sql package level
Notation for these trial-and-error notes: text after `:` explains the source code; `//` marks inserted analysis.
spark.read.textFile("...")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
: read = new DataFrameReader(self); DataFrameReader extends org.apache.spark.internal.Logging
DataFrameReader is obtained via SparkSession.read; it is the interface for reading external data (filesystems or key-value stores) into a Dataset.
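A minimal sketch of the two entry points above, assuming an existing SparkSession named spark; the path is hypothetical:

// spark.read returns a DataFrameReader; textFile narrows the result to Dataset[String]
val lines: org.apache.spark.sql.Dataset[String] =
  spark.read.textFile("/tmp/input.txt")   // hypothetical path

// each element is one line of the file; the schema is the single column [value: string]
println(lines.filter(_.nonEmpty).count())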
// type DataFrame = Dataset[Row]; this alias is defined in the package object sql
// sql turns a logical plan into zero or more SparkPlans. The interfaces here are for experimenting with query-plan optimization; the stable interfaces live in sql.sources.
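That plan pipeline can be watched from the outside without touching those experimental interfaces. A small sketch, again assuming an existing SparkSession spark:

import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name").filter($"id" > 1)

df.explain(extended = true)               // prints the logical plans and the physical SparkPlan
println(df.queryExecution.optimizedPlan)  // optimized Catalyst logical plan
println(df.queryExecution.sparkPlan)      // the SparkPlan chosen by the planner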
Because the data comes from outside, the reader keeps the data source and schema as private fields. Supplying a schema up front skips, for example, JSON's schema-inference step (see https://openproceedings.org/2017/conf/edbt/paper-62.pdf), saving an extra scan over the data.
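A hedged sketch of handing the reader a schema up front so the inference scan is skipped; the path and fields are made up:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// with a user-specified schema, the JSON source does not scan the data to infer one
val people = spark.read.schema(schema).json("/tmp/people.json")   // hypothetical path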
DataFrameReader also exposes option (for adding an input option) and load (overloaded for zero, one, or multiple paths, each returning a DataFrame).
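The three load shapes, sketched with an illustrative format and hypothetical paths:

val one  = spark.read.format("csv").option("header", "true").load("/tmp/one.csv")   // one path
val many = spark.read.format("csv").load("/tmp/a.csv", "/tmp/b.csv")                // many paths
val none = spark.read.format("jdbc")          // no path: the source resolves itself
  .option("url", "jdbc:postgresql://host/db") //   from the accumulated options alone
  .option("dbtable", "t")
  .load()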
jdbc (once the DataFrame is created, it represents a table reachable over a JDBC connection; extraOptions, backed by a HashMap by default, is overwritten with the JDBC connection properties, after which format("jdbc").load() runs)
(@param: lowerBound and upperBound are the minimum and maximum values of columnName, used together with numPartitions to determine the partition strides, plus connection properties such as url, table name, user, and password)
numPartitions is at least 1.
This, along with `lowerBound` (inclusive), `upperBound` (exclusive), form partition strides for generated WHERE clause expressions used to split the column `columnName` evenly.
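A sketch of the partitioned jdbc overload; url, table, and credentials are placeholders. With lowerBound = 0, upperBound = 100, and numPartitions = 4 the stride is 25, so the generated clauses are roughly: id < 25 (plus NULLs), 25 <= id AND id < 50, 50 <= id AND id < 75, 75 <= id.

import java.util.Properties

val props = new Properties()
props.setProperty("user", "usr")
props.setProperty("password", "pass")

val events = spark.read.jdbc(
  "jdbc:postgresql://host/db",   // url (placeholder)
  "events",                      // table
  "id",                          // columnName to stride over
  0L,                            // lowerBound (inclusive)
  100L,                          // upperBound (exclusive)
  4,                             // numPartitions
  props)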
The `predicates` parameter gives a list of expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame.
predicates.zipWithIndex.map { case (part, i) => JDBCPartition(part, i): Partition }
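The predicates overload sketched the same way: each WHERE fragment becomes one JDBCPartition and hence one partition of the resulting DataFrame (connection values are placeholders):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "usr")
props.setProperty("password", "pass")

val predicates = Array(
  "created_at <  '2020-01-01'",   // partition 0
  "created_at >= '2020-01-01'")   // partition 1

val byRange = spark.read.jdbc("jdbc:postgresql://host/db", "events", predicates, props)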