Reading and Saving Data in Spark
Reading a text file in Scala:
val input = sc.textFile("..")
To read all the files in a directory as (file name, file content) pairs, use the wholeTextFiles() method
Saving a text file
result.saveAsTextFile(outputFile)
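The three calls above can be tried together in local mode. The following is only a minimal sketch: the object name `TextFileSketch`, the helper `run`, the temporary directory and the sample file contents are all made up for illustration, and it assumes Spark is on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object TextFileSketch {
  // Returns (number of lines, number of files) for a small temporary input directory.
  def run(): (Long, Long) = {
    val conf = new SparkConf().setAppName("TextFileSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample input: two small files in a temp directory
      val inputDir = Files.createTempDirectory("spark-text-in")
      Files.write(inputDir.resolve("a.txt"), "hello\nworld\n".getBytes("UTF-8"))
      Files.write(inputDir.resolve("b.txt"), "spark\n".getBytes("UTF-8"))

      // textFile: one RDD element per line, across every file in the directory
      val lines = sc.textFile(inputDir.toString)
      // wholeTextFiles: one (path, content) pair per file
      val files = sc.wholeTextFiles(inputDir.toString)

      // saveAsTextFile writes a directory of part files; the path must not exist yet
      val outputDir = inputDir.resolveSibling(inputDir.getFileName.toString + "-out")
      lines.map(_.toUpperCase).saveAsTextFile(outputDir.toString)

      (lines.count(), files.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Note that textFile() also accepts a directory; the difference is that wholeTextFiles() keeps each file's content together and tells you which file it came from.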
Reading JSON in Scala
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.DeserializationFeature
...
case class Person(name: String, lovespandas: Boolean)
// Parse each record into a Person; records that fail to parse are dropped
val result = input.flatMap(record => {
  try {
    Some(mapper.readValue(record, classOf[Person]))
  } catch {
    case e: Exception => None
  }
})
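The snippet above uses a `mapper` whose setup is elided in these notes. As a rough sketch of the per-record step on its own (without Spark), the code below assumes a plain Jackson `ObjectMapper` with `DefaultScalaModule` registered; that setup, the object name `JsonParseSketch`, the helper `parse` and the sample records are assumptions for illustration, not necessarily what the original code did.

```scala
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import scala.util.Try

case class Person(name: String, lovespandas: Boolean)

object JsonParseSketch {
  // Assumed setup: a plain ObjectMapper that understands Scala case classes
  private val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)
  // Ignore JSON fields that Person does not declare
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

  // Some(person) when the record parses, None when Jackson throws
  def parse(record: String): Option[Person] =
    Try(mapper.readValue(record, classOf[Person])).toOption

  def main(args: Array[String]): Unit = {
    // Hypothetical sample records: one valid, one malformed
    val records = Seq("""{"name":"Holden","lovespandas":true}""", "not json")
    // flatMap keeps only the records that parsed
    records.flatMap(parse).foreach(println)
  }
}
```

Inside a Spark job the same function could be passed to `input.flatMap(...)`; creating the mapper once per partition (for example in `mapPartitions`) avoids having to serialize it with the closure.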
Reading CSV in Scala with textFile() (assuming no CSV field contains a newline)
import java.io.StringReader
import au.com.bytecode.opencsv.CSVReader
val input = sc.textFile(inputFile)
val result = input.map { line =>
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
}
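The line-by-line parsing step can be checked without a cluster. This is a small sketch, assuming opencsv 2.x (the `au.com.bytecode.opencsv` package) is on the classpath; `CsvLineSketch`, `parseLine` and the sample line are hypothetical names and data.

```scala
import java.io.StringReader
import au.com.bytecode.opencsv.CSVReader

object CsvLineSketch {
  // Parse one CSV line into its fields.
  // readNext() returns null when there is nothing to read (e.g. an empty line),
  // so the result is wrapped in Option.
  def parseLine(line: String): Option[Array[String]] = {
    val reader = new CSVReader(new StringReader(line))
    try {
      Option(reader.readNext())
    } finally {
      reader.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // A quoted field may contain the separator
    val fields = parseLine("\"Panda, Happy\",bamboo")
    fields.foreach(f => println(f.mkString(" | ")))
  }
}
```

Returning an Option means the function would be used with `flatMap` rather than `map` in the Spark job, so empty lines are skipped instead of producing null rows.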
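If fields contain embedded newlines, read each file in full and then parse it
import scala.collection.JavaConverters._
case class Person(name: String, fa: String)
val input = sc.wholeTextFiles(inputFile)
val result = input.flatMap { case (_, txt) =>
  val reader = new CSVReader(new StringReader(txt))
  // readAll() returns a java.util.List, so convert it before mapping
  reader.readAll().asScala.map(x => Person(x(0), x(1)))
}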
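To see why the whole file has to be parsed at once, here is a sketch of the parsing step on a string that contains a quoted multi-line field. It assumes opencsv 2.x; `CsvWholeFileSketch`, `parseAll` and the sample text are hypothetical, and the length check is an added safeguard that is not in the notes above. On Scala 2.13 and later, `scala.jdk.CollectionConverters._` is the non-deprecated replacement for `JavaConverters`.

```scala
import java.io.StringReader
import scala.collection.JavaConverters._
import au.com.bytecode.opencsv.CSVReader

case class Person(name: String, fa: String)

object CsvWholeFileSketch {
  // Parse the full text of one file; quoted fields may span several lines.
  def parseAll(txt: String): List[Person] = {
    val reader = new CSVReader(new StringReader(txt))
    try {
      reader.readAll().asScala.toList.collect {
        // Skip rows with fewer than two fields (for example blank lines)
        case row if row.length >= 2 => Person(row(0), row(1))
      }
    } finally {
      reader.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample: the first record has a newline inside a quoted field
    val txt = "holden,\"giant\npanda\"\nsparky,bear\n"
    parseAll(txt).foreach(println)
  }
}
```

Splitting this input by line first would cut the first record in half, which is the reason for using wholeTextFiles() in this case.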
Reading a SequenceFile
import org.apache.hadoop.io.{IntWritable, Text}
val data = sc.sequenceFile(inFile, classOf[Text], classOf[IntWritable]).
  map { case (x, y) => (x.toString, y.get()) }
Saving a SequenceFile
data.saveAsSequenceFile(outFile)
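A write-then-read round trip in local mode might look like the sketch below. The object name `SequenceFileSketch`, the helper `roundTrip`, the temporary path and the sample pairs are invented for illustration, and it assumes Spark (with its Hadoop dependencies) is on the classpath.

```scala
import java.nio.file.Files
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object SequenceFileSketch {
  // Writes a few (String, Int) pairs as a SequenceFile and reads them back.
  def roundTrip(): Seq[(String, Int)] = {
    val conf = new SparkConf().setAppName("SequenceFileSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical output location; the target path must not exist yet
      val outFile = Files.createTempDirectory("spark-seq").resolve("pandas").toString

      val data = sc.parallelize(Seq(("Panda", 3), ("Kay", 6), ("Snail", 2)))
      // String and Int are converted to Text and IntWritable implicitly
      data.saveAsSequenceFile(outFile)

      // Hadoop reuses Writable instances, so copy the values out
      // (toString / get) before collecting or caching
      val loaded = sc.sequenceFile(outFile, classOf[Text], classOf[IntWritable])
        .map { case (k, v) => (k.toString, v.get()) }
      loaded.collect().toSeq
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = roundTrip().foreach(println)
}
```

The conversion in the `map` step is the same one used when reading above: the key is a `Text` and the value is an `IntWritable`, so `get()` has to be called on the value, not the key.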