Spark: Dataset and DataFrame
Spark 2.x introduced SparkSession, which wraps the SparkContext and SQLContext of the 1.x API.
In spark-shell, a variable named spark holds a ready-made SparkSession.
Examples of reading and writing (throughout, spark denotes a SparkSession, ds a Dataset, and df a DataFrame):
spark.read.textFile("input_file_path")
ds.write.text("output_file_path")
DataFrame is defined as a type alias in the org.apache.spark.sql package:
type DataFrame = Dataset[Row]
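Because a DataFrame is just a Dataset of generic Row objects, its fields are accessed untyped, by position or by column name, rather than through a case class. A minimal sketch, assuming an existing SparkSession bound to spark (the column values here are made up for illustration):

```scala
import org.apache.spark.sql.{DataFrame, Row}
import spark.implicits._ // supplies toDF on local collections

val df: DataFrame = Seq((1L, "Ann"), (2L, "Bob")).toDF("id", "name")
val first: Row = df.head()
first.getLong(0)            // untyped access by position
first.getAs[String]("name") // untyped access by column name
```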
The following example reads a text file and turns it into a DataFrame:
import org.apache.spark.sql.{Dataset, SparkSession}

object Test {
  def main(args: Array[String]): Unit = {
    val app_name = "test_" + System.currentTimeMillis()
    val spark = SparkSession.builder().appName(app_name).getOrCreate()
    // Read the file as a Dataset[String], one element per line
    val ds: Dataset[String] = spark.read.textFile("file:///root/dir/data/people")
    import spark.implicits._ // encoders needed by map and toDF
    val fs = ds.map(cov).toDF()
    fs.show(false)
    spark.stop()
  }

  case class People(id: Long, name: String, age: Int)

  // Parse one space-separated line into a People record
  def cov(row: String): People = {
    val words = row.split(" ")
    People(words(0).toLong, words(1), words(2).toInt)
  }
}
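The cov function splits each line on single spaces, so the input file is assumed to hold space-separated records, one per line. A hypothetical sample of what /root/dir/data/people might contain:

```
1 Ann 20
2 Bob 23
```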
Other common methods:
// rename the column name to newName
ds.withColumnRenamed("name", "newName")
// convert a Dataset to a DataFrame; column names can be given explicitly, or omitted to keep the existing names
val df = ds.toDF("id", "name", "age")
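A minimal sketch of the two calls above, assuming spark.implicits._ is in scope (the sample rows are made up for illustration):

```scala
import spark.implicits._ // supplies toDS/toDF on local collections

val ds = Seq((1, "Ann"), (2, "Bob")).toDS()
val df1 = ds.toDF()             // keeps the default column names _1, _2
val df2 = ds.toDF("id", "name") // assigns explicit column names
val df3 = df2.withColumnRenamed("name", "newName")
df3.printSchema() // columns are now id and newName
```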
Creating a Dataset from in-memory data:
val ds1 = spark.createDataset( List( (1,2), (3,4) ) )
ds1.show
+---+---+
| _1| _2|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
val ds2 = spark.createDataset( List( Array(1,2), Array(3,4) ) )
ds2.show
+------+
| value|
+------+
|[1, 2]|
|[3, 4]|
+------+
val ds3 = ds2.map( s => (s(0), s(1)) )
ds3.show
+---+---+
| _1| _2|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
val df = ds3.toDF("a", "b")
df.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
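Besides createDataset, the spark.implicits._ import also adds a toDS method to local collections, and using a case class gives the columns meaningful names instead of _1, _2. A sketch, assuming a live SparkSession named spark (in spark-shell, define the case class at the top level, outside any method):

```scala
import spark.implicits._

case class Point(x: Int, y: Int)
val ds = Seq(Point(1, 2), Point(3, 4)).toDS()
ds.show() // columns are named x and y, after the case class fields
```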
