Spark and the Java API (6): Computing Averages
This article shows how to compute the average value per key using Spark's Java API.
Create the project
Create a Maven project with the following pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.github.ralgond</groupId>
    <artifactId>spark-java-api</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.1.1</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
Write the Java class Average
Create a package com.github.ralgond.sparkjavaapi and, inside it, a class named Average:
package com.github.ralgond.sparkjavaapi;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class Average {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Average Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Input: "key value" lines, spread over 3 partitions.
        JavaRDD<String> data1 = sc.parallelize(Arrays.asList("A 0", "B 3", "C 3", "A 1", "B 4"), 3);

        // Parse each line into a (key, value) pair.
        JavaPairRDD<String, Long> data2 = data1.mapToPair(line -> {
            String[] a = line.split("\\s+", 2);
            return new Tuple2<String, Long>(a[0], Long.parseLong(a[1]));
        });

        // Attach a count of 1 to each value: (key, (value, 1)).
        JavaPairRDD<String, Tuple2<Long, Long>> data3 =
                data2.mapValues(x -> new Tuple2<Long, Long>(x, 1L));

        // Sum values and counts per key: (key, (sum, count)).
        JavaPairRDD<String, Tuple2<Long, Long>> data4 = data3.reduceByKey((x, y) ->
                new Tuple2<Long, Long>(x._1 + y._1, x._2 + y._2));

        // Divide sum by count to get the average per key.
        JavaPairRDD<String, Double> data5 = data4.mapToPair(x ->
                new Tuple2<String, Double>(x._1, x._2._1 * 1.0D / x._2._2));

        System.out.println(data5.collect());

        sc.stop();
    }
}
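The reason the pipeline reduces over (sum, count) pairs instead of averaging directly is that averages do not compose across partitions, while sums and counts do. That core logic can be checked locally without a Spark cluster; the sketch below mirrors mapValues + reduceByKey with a plain HashMap (the class name AverageLocal is ours, not part of the article's project):

```java
import java.util.HashMap;
import java.util.Map;

public class AverageLocal {
    // Same algorithm as the RDD pipeline: parse each line into (key, value),
    // accumulate a (sum, count) pair per key, then divide.
    static Map<String, Double> averages(String[] lines) {
        Map<String, long[]> acc = new HashMap<>(); // key -> {sum, count}
        for (String line : lines) {
            String[] a = line.split("\\s+", 2);
            long v = Long.parseLong(a[1]);
            // merge plays the role of reduceByKey: add sums, add counts
            acc.merge(a[0], new long[]{v, 1L},
                      (x, y) -> new long[]{x[0] + y[0], x[1] + y[1]});
        }
        Map<String, Double> avg = new HashMap<>();
        acc.forEach((k, sc) -> avg.put(k, sc[0] * 1.0 / sc[1]));
        return avg;
    }

    public static void main(String[] args) {
        System.out.println(averages(new String[]{"A 0", "B 3", "C 3", "A 1", "B 4"}));
    }
}
```

With the article's sample data this yields 0.5 for A, 3.5 for B, and 3.0 for C, which is what the Spark job should print (modulo tuple formatting and ordering).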
Build and run
Build the jar spark-java-api-0.0.1-SNAPSHOT.jar with mvn clean package.
Then, from the Spark installation directory, run:
bin\spark-submit --class com.github.ralgond.sparkjavaapi.Average {..}\spark-java-api-0.0.1-SNAPSHOT.jar
You should then see the per-key averages in the output, e.g. [(A,0.5), (B,3.5), (C,3.0)] (the order of keys may vary between runs).