Spark Study Notes: Converting Between RDD, DataFrame, and Dataset
1. Create an RDD
val rdd: RDD[(String, Int)] = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
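The snippets below assume an existing SparkSession named spark and a SparkContext named sc. A minimal setup sketch for trying them out locally (the app name and master are assumptions, not from the original notes):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical local setup for experimenting with these snippets
val spark = SparkSession.builder()
  .appName("RddDfDsNotes") // assumed app name
  .master("local[*]")      // assumed local master for experimentation
  .getOrCreate()
val sc = spark.sparkContext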
2. Create an RDD of a case class
case class WordCount(word: String, count: Long)
val rdd: RDD[WordCount] = sc.parallelize(Seq(WordCount("dog", 50), WordCount("cow", 60)))
3. RDD to DataFrame. Note that this requires the implicit conversions:
import spark.implicits._
val df = rdd.toDF()
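Column names can also be passed to toDF; without them the tuple fields become _1 and _2 (the names below are assumptions):
val df = rdd.toDF("word", "count")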
4. RDD to Dataset. Note that WordCount must be defined outside the main method, as in the layout sketch after the code:
import spark.implicits._
val ds: Dataset[WordCount] = rdd.toDS()
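A minimal sketch of the required layout: the case class sits at the top level, because encoder derivation via spark.implicits._ generally fails for a case class defined inside a method (the object and app names are assumptions):
import org.apache.spark.sql.{Dataset, SparkSession}

case class WordCount(word: String, count: Long) // top level, outside main

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCountApp").master("local[*]").getOrCreate()
    import spark.implicits._
    val ds: Dataset[WordCount] =
      spark.sparkContext.parallelize(Seq(WordCount("dog", 50), WordCount("cow", 60))).toDS()
    ds.show()
  }
}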
5. DataFrame to Dataset. Note that as[WordCount] needs an implicit Encoder[WordCount], which import spark.implicits._ provides:
val ds: Dataset[WordCount] = df.as[WordCount]
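The conversion also goes the other way; a typed Dataset can be dropped back to an untyped DataFrame:
val df2 = ds.toDF() // back to DataFrame; the WordCount fields become columns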
6. Thrift class: RDD to DataFrame
val df = spark.createDataFrame(rdd, classOf[MyBean])
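createDataFrame(rdd, beanClass) infers the schema from the bean's getters via reflection, so a quick schema check can confirm the mapping (a hypothetical check, not in the original notes):
df.printSchema() // column names and types come from MyBean's getter methods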
7. Thrift class: DataFrame to Dataset. An Encoder must be specified explicitly for the Thrift class:
val ds = df.as[MyBean](Encoders.bean(classOf[MyBean]))
or
implicit val mapEncoder = Encoders.bean(classOf[MyBean])
val ds = df.as[MyBean]
8. Avro class: RDD to DataFrame. Reference:
https://stackoverflow.com/questions/47264701/how-to-convert-rddgenericrecord-to-dataframe-in-scala
Use the Databricks spark-avro package:
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>3.2.0</version>
</dependency>
Utility class. Note that the package must be declared as com.databricks.spark.avro, because SchemaConverters.createConverterToSQL is package-private and only accessible from inside that package:
package com.databricks.spark.avro

import com.databricks.spark.avro.SchemaConverters.createConverterToSQL
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType

object SchemaConverterUtils {
  def converterSql(schema: Schema, sqlType: StructType) = {
    createConverterToSQL(schema, sqlType)
  }
}
Code:
import scala.util.Try
import com.databricks.spark.avro.SchemaConverters
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

val sqlType = SchemaConverters.toSqlType(test_serializer.SCHEMA$).dataType.asInstanceOf[StructType]
val converter = SchemaConverterUtils.converterSql(test_serializer.SCHEMA$, sqlType)
// Records that fail conversion are silently dropped by Try(...).toOption
val rowRdd = rdd.flatMap(record => {
  Try(converter(record).asInstanceOf[Row]).toOption
})
val df = sparkSession.createDataFrame(rowRdd, sqlType)
df.show()
