Reading and Saving Data in Spark

Reading a text file in Scala:

val input = sc.textFile("..")

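To read multiple files in one directory, use the wholeTextFiles() method. It returns a pair RDD whose key is the file path and whose value is that file's full contents.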
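As a rough illustration of wholeTextFiles(), the sketch below counts lines per file. The directory "data/logs", the object name, and the helper countLines are hypothetical, and the local[*] master is only an assumption for trying it out on one machine.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesSketch {
  // Hypothetical helper: rough line count of one file's contents.
  def countLines(content: String): Int = content.split("\n").length

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WholeTextFilesSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // "data/logs" is a placeholder directory, not taken from the original post.
      val files = sc.wholeTextFiles("data/logs") // RDD[(path, contents)]
      val lineCounts = files.mapValues(countLines)
      lineCounts.collect().foreach { case (path, n) => println(s"$path: $n lines") }
    } finally {
      sc.stop()
    }
  }
}
```

Because each file is loaded as a single value, this approach is best suited to directories of small files.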

Saving a text file:

result.saveAsTextFile(outputFile)

 

Reading JSON in Scala:

import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.DeserializationFeature
...
case class Person(name:String, lovespandas:Boolean)

// Parse each line as a Person; records that fail to parse are dropped.
val result = input.flatMap(record => {
  try {
    Some(mapper.readValue(record, classOf[Person]))
  } catch {
    case e: Exception => None
  }
})
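The snippet above uses a mapper whose setup is elided. One possible way to build it, as a sketch only: a plain Jackson ObjectMapper with DefaultScalaModule registered, created once per partition through mapPartitions. The file name "people.json", the object name, and the helpers newMapper and parsePerson are hypothetical, and the original post may have configured its mapper differently.

```scala
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.{SparkConf, SparkContext}

case class Person(name: String, lovespandas: Boolean)

object JsonReadSketch {
  // Assumed setup: register Scala support and ignore unknown JSON fields.
  def newMapper(): ObjectMapper = {
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)
    mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
    mapper
  }

  // Returns None instead of throwing when a record cannot be parsed.
  def parsePerson(mapper: ObjectMapper, record: String): Option[Person] =
    try Some(mapper.readValue(record, classOf[Person]))
    catch { case _: Exception => None }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JsonReadSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val input = sc.textFile("people.json") // placeholder path, one JSON object per line
      val result = input.mapPartitions { records =>
        val mapper = newMapper() // one mapper per partition, not per record
        records.flatMap(record => parsePerson(mapper, record))
      }
      result.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Building the mapper inside mapPartitions avoids both constructing it for every record and shipping it from the driver to the executors.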

Reading CSV in Scala with textFile() (assuming no CSV field contains an embedded newline):

import java.io.StringReader
import au.com.bytecode.opencsv.CSVReader

val input = sc.textFile(inputFile)
// Each line is one CSV record; readNext() returns its fields as an Array[String].
val result = input.map { line =>
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
}

If fields contain embedded newlines, read each file in full and then parse it:

case class Person(name: String, fa: String)

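import scala.collection.JavaConverters._

val input = sc.wholeTextFiles(inputFile)
val result = input.flatMap { case (_, txt) =>
  val reader = new CSVReader(new StringReader(txt))
  // readAll() returns a java.util.List[Array[String]]; convert it before mapping.
  reader.readAll().asScala.map(x => Person(x(0), x(1)))
}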

Reading a SequenceFile:

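import org.apache.hadoop.io.{IntWritable, Text}

// Convert the Hadoop Writable key and value to native Scala types.
val data = sc.sequenceFile(inFile, classOf[Text], classOf[IntWritable]).
  map { case (x, y) => (x.toString, y.get()) }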

Saving a SequenceFile:

data.saveAsSequenceFile(outFile)
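For a fuller picture, here is a minimal round-trip sketch: it writes a small pair RDD as a SequenceFile and reads it back. The sample pairs, the object name, and the output path "output/pandas-seq" are hypothetical. It assumes Spark 1.3 or later, where the String and Int keys and values are converted to Writables implicitly. Older versions may need import org.apache.spark.SparkContext._.

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object SequenceFileSketch {
  // Made-up sample data for illustration only.
  val samplePairs: Seq[(String, Int)] = Seq(("Panda", 3), ("Kay", 6), ("Snail", 2))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SequenceFileSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val outFile = "output/pandas-seq" // placeholder path; it must not already exist
      // Native String/Int pairs are written as Text/IntWritable.
      sc.parallelize(samplePairs).saveAsSequenceFile(outFile)

      // Read the file back, converting Writables to native Scala types.
      val data = sc.sequenceFile(outFile, classOf[Text], classOf[IntWritable])
        .map { case (k, v) => (k.toString, v.get()) }
      data.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Converting to native types right after reading matters because Hadoop reuses Writable objects across records, so holding on to the raw Text or IntWritable instances can give surprising results.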

  

 

 

Posted on 2016-04-06 19:30 by who_a
