Spark and the Java API (6): Computing Averages

This article shows how to compute the average of the values for each key using Spark's Java API.

Creating the project

Create a Maven project with the following pom.xml. The spark-core dependency is declared with provided scope because spark-submit supplies the Spark classes at runtime:

<project xmlns="http://maven.apache.org/POM/4.0.0"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.github.ralgond</groupId>
	<artifactId>spark-java-api</artifactId>
	<version>0.0.1-SNAPSHOT</version>

	<dependencies>
		<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-core_2.12</artifactId>
			<version>3.1.1</version>
			<scope>provided</scope>
		</dependency>

	</dependencies>

	<build>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.0</version>
				<configuration>
					<source>1.8</source>
					<target>1.8</target>
				</configuration>
			</plugin>
		</plugins>
	</build>
</project>

Writing the Average Java class

Create a package com.github.ralgond.sparkjavaapi and, inside it, a class named Average. The class parses each input line into a (key, value) pair, turns each value into a (sum, count) tuple, adds the tuples per key with reduceByKey, and finally divides each sum by its count:

package com.github.ralgond.sparkjavaapi;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class Average {
	public static void main(String[] args) {

		SparkConf conf = new SparkConf().setAppName("Average Application");

		JavaSparkContext sc = new JavaSparkContext(conf);

		// Input lines of the form "key value", spread across 3 partitions.
		JavaRDD<String> data1 = sc.parallelize(Arrays.asList("A 0", "B 3", "C 3", "A 1", "B 4"), 3);

		// Parse each line into a (key, value) pair.
		JavaPairRDD<String, Long> data2 = data1.mapToPair(line -> {
			String[] a = line.split("\\s+", 2);
			return new Tuple2<String, Long>(a[0], Long.parseLong(a[1]));
		});

		// Turn each value into a (sum, count) tuple so both can be reduced together.
		JavaPairRDD<String, Tuple2<Long, Long>> data3 = data2.mapValues(x -> new Tuple2<Long, Long>(x, 1L));

		// Add sums and counts per key.
		JavaPairRDD<String, Tuple2<Long, Long>> data4 = data3.reduceByKey((x, y) -> {
			return new Tuple2<Long, Long>(x._1 + y._1, x._2 + y._2);
		});

		// Average = sum / count, computed as a double.
		JavaPairRDD<String, Double> data5 = data4.mapToPair(x -> {
			return new Tuple2<String, Double>(x._1, x._2._1 * 1.0D / x._2._2);
		});

		System.out.println(data5.collect());

		sc.close();
	}
}
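For comparison, the same per-key average can be computed with a single aggregateByKey call, which folds the (sum, count) accumulation into one operation instead of a separate mapValues plus reduceByKey. This is a minimal sketch, not part of the original post; it assumes the data2 pair RDD from the class above:

		// Sketch (assumption, not from the original post): fold each value into a
		// per-key (sum, count) accumulator, then divide sum by count.
		JavaPairRDD<String, Double> avg = data2.aggregateByKey(
				new Tuple2<Long, Long>(0L, 0L),
				(acc, v) -> new Tuple2<Long, Long>(acc._1 + v, acc._2 + 1L),
				(a, b) -> new Tuple2<Long, Long>(a._1 + b._1, a._2 + b._2))
			.mapValues(t -> t._1 * 1.0D / t._2);
		System.out.println(avg.collect());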

Compiling and running

Build the jar spark-java-api-0.0.1-SNAPSHOT.jar with mvn clean package.

Then, from the Spark installation directory, run the following command (where {..} stands for the directory containing the jar):

bin\spark-submit --class com.github.ralgond.sparkjavaapi.Average {..}\spark-java-api-0.0.1-SNAPSHOT.jar
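Note that the SparkConf above does not set a master URL, so if none is configured elsewhere (e.g. in spark-defaults.conf), spark-submit needs to be told where to run. For a quick local test you can pass one explicitly, for example:

bin\spark-submit --master local[*] --class com.github.ralgond.sparkjavaapi.Average {..}\spark-java-api-0.0.1-SNAPSHOT.jar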

You should then see the per-key averages in the output:
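The ordering of the pairs depends on partitioning and may vary across runs, but the values should be as follows:

[(C,3.0), (A,0.5), (B,3.5)]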
