Single-Machine Spark Deployment and a Word Count Example
Single-machine Spark deployment
My Hadoop runs in cluster mode; even with a Hadoop cluster, Spark can still be deployed in single-machine mode.
Find the spark-3.4.0-bin-without-hadoop.tgz package and download it to the local machine.
Extract it:
tar -zxvf spark-3.4.0-bin-without-hadoop.tgz
Edit the configuration file
The conf directory ships with spark-env.sh.template, but that is only a sample: Spark reads spark-env.sh, not the .template file.
# Enter the Spark configuration directory
cd /export/server/spark-3.4.0-bin-without-hadoop/conf
# Create spark-env.sh from the template
cp spark-env.sh.template spark-env.sh
# Make the new file executable
chmod +x spark-env.sh
Add the line below to spark-env.sh, replacing the path with the location of your own Hadoop installation:
export SPARK_DIST_CLASSPATH=$(/export/server/hadoop-3.3.0/bin/hadoop classpath)
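To sanity-check the value, you can run the same command by hand; it should print a long colon-separated list of Hadoop jar and configuration directories (the exact output depends on your installation):
/export/server/hadoop-3.3.0/bin/hadoop classpath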
Then run the bundled example from Spark's bin directory; if it succeeds, you will see an estimated value of pi.
# Run the example
cd /export/server/spark-3.4.0-bin-without-hadoop/bin
./run-example SparkPi 10
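On success, a line of roughly this form should appear amid the INFO logging (the exact digits vary between runs, since SparkPi estimates pi by random sampling):
Pi is roughly 3.14...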
Word count example
Write the code and package it
WordCount.java
package com.example;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) {
        // Check arguments
        if (args.length < 1) {
            System.err.println("Usage: WordCount <input-file> [output-dir]");
            System.exit(1);
        }
        String inputPath = args[0];
        String outputPath = args.length > 1 ? args[1] : "wordcount-output";

        // Create the SparkSession
        SparkSession spark = SparkSession
                .builder()
                .appName("JavaWordCount")
                .config("spark.master", "local[*]")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        try {
            // Read the input file
            JavaRDD<String> lines = jsc.textFile(inputPath);

            // Run the word count: split on spaces, drop empty tokens
            // (produced by consecutive spaces), lowercase, then sum per word
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(SPACE.split(line)).iterator())
                    .filter(word -> !word.isEmpty())
                    .mapToPair(word -> new Tuple2<>(word.toLowerCase(), 1))
                    .reduceByKey(Integer::sum);

            // Collect and print the first 20 results (take() returns an
            // arbitrary 20 pairs, not the 20 most frequent words)
            List<Tuple2<String, Integer>> output = counts.take(20);
            System.out.println("\n=== First 20 Word Count Results ===");
            for (Tuple2<String, Integer> tuple : output) {
                System.out.println(tuple._1() + ": " + tuple._2());
            }

            // Save the full results to the output directory; note that
            // saveAsTextFile fails if the directory already exists
            counts.saveAsTextFile(outputPath);
            System.out.println("\n=== Full results saved to: " + outputPath + " ===");

            // Print summary statistics
            System.out.println("Total unique words: " + counts.count());
            System.out.println("Total words processed: " + counts.values().reduce(Integer::sum));
        } catch (Exception e) {
            System.err.println("Error processing file: " + e.getMessage());
            e.printStackTrace();
        } finally {
            // Shut down the SparkContext
            jsc.close();
            spark.stop();
        }
    }
}
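To make the pipeline concrete, here is how the first sample input line flows through the transformations (a sketch; lowercasing happens inside mapToPair):
"Hello Spark Hello World"
→ flatMap:     [Hello, Spark, Hello, World]
→ mapToPair:   [(hello,1), (spark,1), (hello,1), (world,1)]
→ reduceByKey: [(hello,2), (spark,1), (world,1)]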
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>spark-wordcount-java</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>Spark WordCount Java</name>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <spark.version>3.4.0</spark.version>
        <scala.version>2.12</scala.version>
    </properties>

    <dependencies>
        <!-- Spark Core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Spark SQL -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.4</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.example.WordCount</mainClass>
                                </transformer>
                            </transformers>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
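A note on the build configuration: the shade plugin bundles the application and its dependencies into one runnable jar, the ManifestResourceTransformer sets com.example.WordCount as the jar's Main-Class, and the filter strips META-INF signature files (*.SF, *.DSA, *.RSA) from dependencies, which would otherwise cause signature-verification errors when classes are loaded from the merged jar.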
Package the program into a jar.
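A minimal packaging sketch, assuming a standard Maven project layout and that the finished jar is copied to /export/server/ to match the path used by spark-submit below:
# Build the shaded jar; Maven writes it to target/spark-wordcount-java-1.0-SNAPSHOT.jar
mvn clean package
# Copy it next to the Spark installation, as referenced in the submit command
cp target/spark-wordcount-java-1.0-SNAPSHOT.jar /export/server/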
Run
Create input.txt under the /spark directory on HDFS, with the following content:
Hello Spark Hello World
This is a Spark Java application
We are learning Spark with Java
Spark is fast and powerful
Java and Spark work well together
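A sketch of one way to put this file on HDFS, assuming input.txt has first been written to the local working directory:
# Create the target directory on HDFS and upload the file
hdfs dfs -mkdir -p /spark
hdfs dfs -put -f input.txt /spark/input.txt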
The NameNode's RPC service is listening on port 8020.
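To confirm the address (and port) that HDFS clients should use, query the configured default filesystem; for the submit command below it should print hdfs://node1:8020:
hdfs getconf -confKey fs.defaultFS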
Submit the Spark job
cd /export/server/spark-3.4.0-bin-without-hadoop
# Set the Hadoop classpath (optional if spark-env.sh already exports it)
export SPARK_DIST_CLASSPATH=$(/export/server/hadoop-3.3.0/bin/hadoop classpath)
# Submit the job using the correct port, 8020
./bin/spark-submit \
--class com.example.WordCount \
--master local[2] \
/export/server/spark-wordcount-java-1.0-SNAPSHOT.jar \
hdfs://node1:8020/spark/input.txt \
hdfs://node1:8020/spark/wordcount-output
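The two trailing arguments are passed through to the application as args[0] and args[1], i.e. the HDFS input file and the output directory, while --master local[2] runs the job locally with two worker threads.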
View the results:
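One way to inspect the output, assuming the job wrote to the directory given above (Spark spreads the results across part-* files, one per partition):
hdfs dfs -cat /spark/wordcount-output/part-*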
