Spark 3.2.2 Java Programming Example
Create a new Maven project in VS Code and add the following dependency to the pom.xml:
<dependency> <!-- Spark dependency -->
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.2.2</version>
  <scope>provided</scope>
</dependency>
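Note that provided scope keeps the Spark jars out of the packaged artifact, since spark-submit supplies them on the cluster at runtime; running the class straight from the IDE will then fail with NoClassDefFoundError unless the scope is temporarily removed. Spark 3.2.2 with Scala 2.12 runs on Java 8/11, so it is also worth pinning the compiler level. A minimal sketch; the plugin version is an illustrative assumption:
<build>
  <plugins>
    <!-- compile for Java 8, which this Spark build supports -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.10.1</version>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
  </plugins>
</build>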
Then edit App.java:
package com.example;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;
/**
 * WordCount example: reads a text file from HDFS, counts how often each
 * word occurs, and prints the counts sorted in descending order.
 */
public class App {
    public static void main(String[] args) throws IOException {
        // 1. Build the SparkConf. Note: a master hard-coded here takes
        // precedence over spark-submit's --master flag, so with this line the
        // job runs in local mode; drop setMaster() for a real cluster run.
        SparkConf sparkConf = new SparkConf().setAppName("WordCount_Java").setMaster("local[2]");
        // 2. Build the Java SparkContext
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // Point the Hadoop client at the HDFS NameNode
        sc.hadoopConfiguration().set("fs.defaultFS", "hdfs://node-4:9000");
        // 3. Read the input file (resolved against fs.defaultFS, i.e. from HDFS)
        JavaRDD<String> dataRDD = sc.textFile("/home/hp/word.txt");
        // 4. Split every line into words
        JavaRDD<String> wordsRDD = dataRDD.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String s) throws Exception {
                String[] words = s.split(" ");
                return Arrays.asList(words).iterator();
            }
        });
        // 5. Map each word to a count of 1.
        // Spark provides extra operations for RDDs that hold key-value pairs;
        // such RDDs are called PairRDDs. mapToPair applies f to every element
        // of the RDD, turning each element of type T into a <K2, V2> pair,
        // where Tuple2 is Scala's two-element tuple.
        JavaPairRDD<String, Integer> wordAndOnePairRDD = wordsRDD
                .mapToPair(new PairFunction<String, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(String word) throws Exception {
                        return new Tuple2<String, Integer>(word, 1);
                    }
                });
        // 6. Sum the counts of identical words
        JavaPairRDD<String, Integer> resultJavaPairRDD = wordAndOnePairRDD
                .reduceByKey(new Function2<Integer, Integer, Integer>() {
                    @Override
                    public Integer call(Integer v1, Integer v2) throws Exception {
                        return v1 + v2;
                    }
                });
        // 7. Swap key and value so the count becomes the key
        JavaPairRDD<Integer, String> reverseJavaPairRDD = resultJavaPairRDD
                .mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
                    @Override
                    public Tuple2<Integer, String> call(Tuple2<String, Integer> tuple) throws Exception {
                        return new Tuple2<Integer, String>(tuple._2, tuple._1);
                    }
                });
        // 8. Sort by count (now the key) in descending order, then swap back
        // to (word, count) with another mapToPair before output
        JavaPairRDD<String, Integer> sortJavaPairRDD = reverseJavaPairRDD.sortByKey(false)
                .mapToPair(new PairFunction<Tuple2<Integer, String>, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(Tuple2<Integer, String> tuple) throws Exception {
                        // tuple.swap() would also build the swapped tuple directly
                        return new Tuple2<String, Integer>(tuple._2, tuple._1);
                    }
                });
        // Print the result on the driver
        System.out.println(sortJavaPairRDD.collect());
        // Write the result to a local file on the driver, as a test
        bufferedOutputStreamTest("/home/hp/test/job.log", sortJavaPairRDD.collect().toString());
        // Stop the SparkContext
        sc.stop();
    }

    private static void bufferedOutputStreamTest(String filepath, String content) throws IOException {
        try (BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(
                new FileOutputStream(filepath))) {
            bufferedOutputStream.write(content.getBytes());
        }
    }
}
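The anonymous inner classes above predate Java 8. Spark's Java functional interfaces each have a single abstract method, so the same pipeline can be written much more compactly with lambdas and method references; a minimal sketch of the equivalent chain, reusing sc and the input path from above (add java.util.List to the imports):
JavaRDD<String> lines = sc.textFile("/home/hp/word.txt");
List<Tuple2<String, Integer>> result = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator()) // split lines into words
        .mapToPair(word -> new Tuple2<>(word, 1))                   // pair each word with 1
        .reduceByKey(Integer::sum)                                  // sum counts per word
        .mapToPair(Tuple2::swap)                                    // (word, n) -> (n, word)
        .sortByKey(false)                                           // sort by count, descending
        .mapToPair(Tuple2::swap)                                    // swap back to (word, n)
        .collect();
System.out.println(result);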
Package the project with mvn, then upload the resulting jar to the machine running Spark:
mvn package
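Before uploading, it can be worth confirming that the class actually made it into the jar (the artifact name comes from this project's pom; adjust to yours):
jar tf target/demo-1.0-SNAPSHOT.jar | grep App.class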
Start the Hadoop and Spark clusters:
$HADOOP_HOME/sbin/start-all.sh
$SPARK_HOME/sbin/start-all.sh
Run the Spark application:
spark-submit --class com.example.App --master spark://node-4:7077 --num-executors 1 /home/hp/spark-3.2.2-bin-hadoop3.2/examples/jars/demo-1.0-SNAPSHOT.jar
Then inspect the output:
[hp@node-4 hadoop-3.2.4]$ spark-submit --class com.example.App --master spark://node-4:7077 --num-executors 1 /home/hp/spark-3.2.2-bin-hadoop3.2/examples/jars/demo-1.0-SNAPSHOT.jar
22/09/19 11:32:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[(of,18), (to,16), (the,14), (your,10), (I,9), (and,9), (my,5), (so,5), (it,5), (which,5), (not,5), (by,5), (have,4), (be,4), (on,4), (were,4), (this,3), (any,3), (as,3), (causes,3), (me.,3), (for,3), (though,3), (in,3), (those,3), (that,3), (what,2), (me,2), (want,2), (is,2), (with,2), (from,2), (last,2), (will,2), (night,2), (friend,2), (could,2), (you,,2), (required,2), (was,2), (an,2), (But,2), (had,2), (demand,2), (soon,2), (you,2), (which,,2), (or,2), (you.,2), (must,2), (farther,1), (letter,,1), ( Be,1), (its,1), (utmost,1), (apprehension,1), (equal,1), (still,1), (only,1), (been,1), (eldersister,,1), (evening,,1), (opinion,1), (evil,1), (without,1), (connection,1), (parties,1), (madam,,1), (immediately,1), (returning.,1), (too,1), (offend,1), (intention,1), (preserve,1), (Pardon,1), (concern,1), (read.,1), (sentiments,1), (aside,,1), (objections,1), (certain,,1), (nearest,1), (humbling,1), (there,1), (dwelling,1), (merely,1), (design,1), (propriety,1), (betrayed,1), (own,1), (myself,1), (connection.,1), (passed,1), (them,,1), (receiving,1), (bestow,1), (acknowledged,1), (freedom,1), (am,1), (even,1), (unwillingly,,1), (pardon,1), (day,1), (myself,,1), (letter,1), (Netherfield,1), (total,1), (sense,1), (defects,1), (amidst,1), (they,1), (instances,,1), (therefore,,1), (father.,1), (paining,1), (esteemed,1), (other,1), (family,,1), (degree,1), (praise,1), (perusal,1), (because,1), (effort,1), (marriage,1), (containing,1), (situation,1), (London,,1), (,1), (alarmed,,1), (three,1), (inducement,1), (honorable,1), (almost,1), (renewal,1), (These,1), (spared,,1), (unhappy,1), (censure,,1), (comparison,1), (bestowed,1), (existing,1), (than,1), (avoid,1), (The,1), (nothing,1), (force,1), (should,1), (great,1), (following,,1), (character,1), (heightened,1), (less,1), ( My,1), (repetition,1), (conducted,1), (existing,,1), (wishes,1), (put,1), (disgusting,1), (passion,1), (most,1), (attention;,1), (feelings,,1), (no,1), (all,1), (that,,1), (written,1), (younger,1), (like,1), (generally,1), (It,1), (disposition,1), (both,1), (forget,,1), (uniformly,1), (but,1), (justice.,1), (confirmed,,1), (happiness,1), (mother's,1), (repugnance;,1), (occasionally,1), (representation,1), (both,,1), (You,1), (relations,,1), (consider,1), (briefly.,1), (let,1), (before,1), (forgotten;,1), (at,1), (offers,1), (sisters,,1), (know,,1), (occasion,,1), (give,1), (endeavored,1), (left,1), (share,1), (He,1), (every,1), (must,,1), (cannot,1), (case;,1), (displeasure,1), (a,1), (say,1), (herself,,1), (stated,,1), (consolation,1), (yourselves,1), (pains,1), (frequently,,1), (before,,1), (both.,1), (objectionable,,1), (remember,,1), (write,1), (led,1), (formation,1)]
This shows that the Spark Java application successfully read the file from HDFS and ran the computation to completion.
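Note that bufferedOutputStreamTest writes to the local filesystem of the driver, not to HDFS. To persist the result in HDFS itself, the pair RDD can be saved directly; a one-line sketch with an illustrative output path (the target directory must not already exist):
// writes one "(word,count)" line per element into the given HDFS directory
sortJavaPairRDD.saveAsTextFile("/home/hp/test/wordcount-output");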
Troubleshooting
1. Hadoop connection failure
[hp@node-4 ~]$ spark-submit --class com.example.App --master spark://node-4:7077 --num-executors 1 /home/hp/spark-3.2.2-bin-hadoop3.2/examples/jars/demo-1.0-SNAPSHOT.jar
22/09/19 11:09:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.net.ConnectException: Call From node-4/192.168.208.132 to node-4:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Solution: check that the Hadoop services are actually running and that the NameNode port (9000 here) is correct and reachable.
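A few quick checks on the NameNode host usually narrow this down; the hostname and port below follow the setup above:
# are the HDFS daemons (NameNode, DataNode, SecondaryNameNode) running?
jps
# does HDFS report live DataNodes?
hdfs dfsadmin -report
# is anything listening on the NameNode RPC port?
ss -tlnp | grep 9000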
2. HDFS input file not found
[hp@node-4 ~]$ spark-submit --class com.example.App --master spark://node-4:7077 --num-executors 1 /home/hp/spark-3.2.2-bin-hadoop3.2/examples/jars/demo-1.0-SNAPSHOT.jar
22/09/19 11:18:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://node-4:9000/home/hp/word.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:304)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:244)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:332)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
Solution:
[hp@node-4 hadoop-3.2.4]$ hadoop fs -mkdir -p /home/hp/
[hp@node-4 hadoop-3.2.4]$ hadoop fs -ls /
Found 4 items
drwxr-xr-x - hp supergroup 0 2022-09-06 17:31 /home
drwxr-xr-x - hp supergroup 0 2022-09-02 12:12 /test
drwxrwx--- - hp supergroup 0 2022-09-02 11:52 /tmp
drwxr-xr-x - hp supergroup 0 2022-09-06 11:17 /user
[hp@node-4 hadoop-3.2.4]$ hadoop fs -put /home/hp/word.txt /home/hp
# verify the file landed correctly
hadoop fs -text /home/hp/word.txt | tail
Once the file is in place, rerun the application.
