Spark Standalone Deployment and Word Count Example

Spark Standalone Deployment

My Hadoop runs in cluster mode; even on top of a Hadoop cluster, Spark can still be deployed in standalone (single-machine) mode.

Spark official download page

Find the spark-3.4.0-bin-without-hadoop.tgz file and download it to your machine.
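
If you prefer the command line, the same release can be pulled from the Apache archive; a minimal sketch (the archive URL is an assumption, adjust it for your mirror):

# Download the Hadoop-free Spark 3.4.0 build
wget https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-without-hadoop.tgz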

Extract the archive

tar -zxvf spark-3.4.0-bin-without-hadoop.tgz

Edit the configuration file

Under conf there is a spark-env.sh.template file, but it is only a template: Spark reads spark-env.sh, not the .template file.

# Enter the Spark configuration directory
cd /export/server/spark-3.4.0-bin-without-hadoop/conf

# Create spark-env.sh from the template
cp spark-env.sh.template spark-env.sh

# Make the new file executable
chmod +x spark-env.sh

Add the line below to spark-env.sh, changing the path to match your Hadoop installation:

export SPARK_DIST_CLASSPATH=$(/export/server/hadoop-3.3.0/bin/hadoop classpath)

Then run the example from Spark's bin directory; if it succeeds, you will see an estimate of pi in the output.

# Then run the example
cd /export/server/spark-3.4.0-bin-without-hadoop/bin
./run-example SparkPi 10

Word Count Example

Write the code and package it

WordCount.java

package com.example;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) {
        // Check arguments
        if (args.length < 1) {
            System.err.println("Usage: WordCount <input-file> [output-dir]");
            System.exit(1);
        }

        String inputPath = args[0];
        String outputPath = args.length > 1 ? args[1] : "wordcount-output";

        // Create the SparkSession
        SparkSession spark = SparkSession
                .builder()
                .appName("JavaWordCount")
                .config("spark.master", "local[*]")
                .getOrCreate();

        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        try {
            // Read the input file
            JavaRDD<String> lines = jsc.textFile(inputPath);

            // Run the word count
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(SPACE.split(line)).iterator())
                    .mapToPair(word -> new Tuple2<>(word.toLowerCase(), 1))
                    .reduceByKey(Integer::sum);

            // Collect and print the first 20 results
            List<Tuple2<String, Integer>> output = counts.take(20);
            System.out.println("\n=== Top 20 Word Count Results ===");
            for (Tuple2<String, Integer> tuple : output) {
                System.out.println(tuple._1() + ": " + tuple._2());
            }

            // Save the full results to the output directory
            counts.saveAsTextFile(outputPath);
            System.out.println("\n=== Full results saved to: " + outputPath + " ===");

            // Print summary statistics
            System.out.println("Total unique words: " + counts.count());
            System.out.println("Total words processed: " + counts.values().reduce(Integer::sum));

        } catch (Exception e) {
            System.err.println("Error processing file: " + e.getMessage());
            e.printStackTrace();
        } finally {
            // Shut down the SparkContext
            jsc.close();
            spark.stop();
        }
    }
}

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>spark-wordcount-java</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>Spark WordCount Java</name>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <spark.version>3.4.0</spark.version>
        <scala.version>2.12</scala.version>
    </properties>

    <dependencies>
        <!-- Spark Core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- Spark SQL -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.4</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.example.WordCount</mainClass>
                                </transformer>
                            </transformers>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Package the program as a jar
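
With the pom.xml above, a standard Maven build produces the shaded jar under target/; a minimal sketch (copying the jar to /export/server/ is an assumption so that it matches the path used by spark-submit below):

# Build the shaded jar from the project root (where pom.xml lives)
mvn clean package

# Copy the jar to the path used by spark-submit below
cp target/spark-wordcount-java-1.0-SNAPSHOT.jar /export/server/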

Run

Create input.txt under the spark directory on HDFS with the following content (the upload commands are shown after the sample text):

Hello Spark Hello World
This is a Spark Java application
We are learning Spark with Java
Spark is fast and powerful
Java and Spark work well together
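
One way to get this file onto HDFS is to save the sample text locally as input.txt and upload it with the hdfs CLI; a minimal sketch:

# Create the /spark directory on HDFS and upload the local input.txt
hdfs dfs -mkdir -p /spark
hdfs dfs -put -f input.txt /spark/input.txt

# Verify the upload
hdfs dfs -cat /spark/input.txt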

The NameNode RPC service is running on port 8020.
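
If you are not sure which RPC port your NameNode is using, you can print the configured HDFS entry point; a quick sketch:

# Print fs.defaultFS (NameNode host and RPC port); in this setup it should show hdfs://node1:8020
hdfs getconf -confKey fs.defaultFS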

Submit the Spark job

cd /export/server/spark-3.4.0-bin-without-hadoop

# Set the Hadoop classpath (optional)
export SPARK_DIST_CLASSPATH=$(/export/server/hadoop-3.3.0/bin/hadoop classpath)

# Submit the job using the correct port 8020
./bin/spark-submit \
    --class com.example.WordCount \
    --master local[2] \
    /export/server/spark-wordcount-java-1.0-SNAPSHOT.jar \
    hdfs://node1:8020/spark/input.txt \
    hdfs://node1:8020/spark/wordcount-output

View the results:
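
The job writes its word counts as part files under the output directory on HDFS; a minimal sketch of how to inspect them:

# List the output directory and print the counts
hdfs dfs -ls /spark/wordcount-output
hdfs dfs -cat /spark/wordcount-output/part-*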
