Spark Core: writing data to different directories by key

When running statistical jobs, storing records in separate groups according to some condition makes later retrieval and analysis much easier.

Flink has split / side-output operators that can route a processed batch into separate streams. Spark has no such operator, but a similar effect can be achieved.

  • Write the data to different directories according to the key value:
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.hbase.client.Result; // assumption: Result comes from an HBase read in the elided code
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class XXXX
{
    public static void main(String[] args) throws IOException {
		/*
		..... (elided setup; the `input` and `output` paths used below are presumably defined here)
		*/
        SparkConf conf = new SparkConf();
        conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        SparkSession sparkSession = new SparkSession.Builder().config(conf).getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(sparkSession.sparkContext());
        jsc.setLogLevel("WARN");
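        // Assumption (not shown in the original post): `sixNum`, used inside
        // mapPartitionsToPair below, is presumably a long accumulator registered
        // on the SparkContext in the elided setup, along the lines of:
        //   LongAccumulator sixNum = jsc.sc().longAccumulator("sixNum");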

        jsc.textFile(input)
			/*
			... (elided: intermediate transformations; by this point the RDD elements are Tuple2<String, Result>)
			*/
			.mapPartitionsToPair(new PairFlatMapFunction<Iterator<Tuple2<String, Result>>, String, String>() { // the main point is getting the data into (key, value) pair form; other pair operators (e.g. mapToPair) would work as well
				@Override
				public Iterator<Tuple2<String, String>> call(Iterator<Tuple2<String, Result>> tuple2Iterator) throws Exception {

					return new Iterator<Tuple2<String, String>>() {
						@Override
						public boolean hasNext() {
							return tuple2Iterator.hasNext();
						}

						@Override
						public Tuple2<String, String> next() {
							Tuple2<String, Result> next = tuple2Iterator.next();
							int key = -1;
							String value = "";
							/*
							...
							...
							*/

							if (key == -1) {
								return new Tuple2<>("key1", value);
							} else if (0 <= key && key <= 3) {
								return new Tuple2<>("key2", value);
							} else if (4 <= key && key <= 5) {
								return new Tuple2<>("key3", value);
							} else if (key == 6) {
								sixNum.add(1L); // count the records in this bucket with the accumulator
								return new Tuple2<>("key4", value);
							} else if (7 <= key && key <= 11) {
								return new Tuple2<>("key5", value);
							} else {
								return new Tuple2<>("other", value);
							}
						}
					};
				}
			}).filter(data -> data != null)
			.coalesce(10)
			.saveAsHadoopFile(output, String.class, String.class, XXXX.RDDMultipleTextOutputFormat.class); // the first String.class is the key class, the second is the value class

        sparkSession.stop();
        jsc.stop();
    }

    public static class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat<String, String> {
        // Route each record to a sub-directory named after its key; `name` is the
        // default leaf file name (e.g. part-00000), so a record lands in <output>/<key>/<name>.
        @Override
        public String generateFileNameForKeyValue(String key, String value,
                                                  String name) {
            return key + "/" + name;
        }

        // Returning null here drops the key from the written output, so each line
        // contains only the value.
        @Override
        public String generateActualKey(String key, String value) {
            return null;
        }
    }
}

The output files therefore end up under <output>/<key>/, e.g. output/key1/part-00000, output/key2/part-00000, and so on.
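As a side note (not from the original post; a minimal sketch, and the class and column names below are made up for illustration), the DataFrame API can produce a similar per-key layout with partitionBy, assuming the data can be expressed as a DataFrame with an explicit key column. The sub-directories are then named bucket=key1 rather than key1:

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PartitionByExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("partitionBy-example")
                .master("local[*]")
                .getOrCreate();

        // Toy rows; in practice these would come from the real data source.
        StructType schema = new StructType()
                .add("bucket", DataTypes.StringType)
                .add("value", DataTypes.StringType);
        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
                RowFactory.create("key1", "a"),
                RowFactory.create("key2", "b"),
                RowFactory.create("key2", "c")), schema);

        // Each distinct bucket value becomes its own sub-directory:
        // output_partitioned/bucket=key1/part-..., output_partitioned/bucket=key2/part-...
        df.write().mode(SaveMode.Overwrite).partitionBy("bucket").csv("output_partitioned");

        spark.stop();
    }
}

This is not a drop-in replacement for the RDD approach above: it needs a DataFrame and uses Hive-style bucket=value directory names, but it avoids writing a custom OutputFormat.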
