2025/2/4
Scala integrates with Hadoop by letting you write MapReduce programs, enabling efficient processing of large data sets. This post shows how to write a simple MapReduce program in Scala that counts word occurrences.
The MapReduce program: write a Mapper and a Reducer.
Running the MapReduce job: package the Scala program and submit it to Hadoop.
Example code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}

object WordCount {

  // Mapper: splits each input line into whitespace-separated tokens
  // and emits a (word, 1) pair for every token.
  class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
    private val one  = new IntWritable(1)
    private val word = new Text()

    override def map(key: Object, value: Text,
                     context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
      // filter(_.nonEmpty) drops the empty token produced by leading whitespace
      value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
        word.set(w.toLowerCase)
        context.write(word, one)
      }
    }
  }

  // Reducer: sums all the counts received for a given word.
  class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
    private val result = new IntWritable()

    override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                        context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
      var sum = 0
      values.forEach(value => sum += value.get())
      result.set(sum)
      context.write(key, result)
    }
  }

  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(WordCount.getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[IntSumReducer])
    job.setReducerClass(classOf[IntSumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    job.setInputFormatClass(classOf[TextInputFormat])
    job.setOutputFormatClass(classOf[TextOutputFormat[Text, IntWritable]])
    // Input and output paths are configured through the FileInputFormat /
    // FileOutputFormat helpers; Job itself has no path-setter methods.
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
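A note on the structure: IntSumReducer is registered both as the combiner and as the reducer. This is safe here because integer addition is associative and commutative, so partial sums can be computed on the map side before the shuffle, cutting network traffic. Also note that input and output paths go through the FileInputFormat and FileOutputFormat helper classes; the Job API itself has no path setters.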
Steps to run:
Save the code above as WordCount.scala in your sbt project (conventionally under src/main/scala/).
Package the project with sbt:
sbt package
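For sbt package to work, the project needs a build definition that pulls in the Hadoop client API. A minimal build.sbt sketch is shown below; the Scala and hadoop-client versions are assumptions, and the latter should match the version running on your cluster:

name := "wordcount"
version := "0.1"
scalaVersion := "2.13.12"
// Provided scope: the cluster supplies the Hadoop jars at runtime,
// so they are not bundled into the application jar.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.3.6" % Provided

With this name, version, and scalaVersion, sbt package produces target/scala-2.13/wordcount_2.13-0.1.jar, which is the path used in the next step.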
Submit the generated JAR file to the Hadoop cluster:
hadoop jar target/scala-2.13/wordcount_2.13-0.1.jar WordCount input output
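This assumes the input directory already exists in HDFS. If it does not, upload some local text files first (sample.txt below is a hypothetical file name); note also that the job fails at startup if the output directory already exists, so remove it between runs:

hdfs dfs -mkdir -p input
hdfs dfs -put sample.txt input/
hdfs dfs -rm -r -f output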
View the output:
hdfs dfs -cat output/part-r-00000
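TextOutputFormat writes one key-value pair per line, with the key and value separated by a tab, so each output line is a word followed by its count. For a hypothetical input, the result would look something like:

hadoop	2
hello	3
scala	1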
Scala's integration with Hadoop lets us bring the full expressiveness of the language to MapReduce programs and process large-scale data sets.