[System Design] Notes 18: MapReduce

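Relevant to big data roles.

A distributed data processing framework.

How do we count the word frequency of a web page?

Single-machine approach: a for loop that stores the counts in a HashMap.

Drawbacks: there is only one machine, so it is slow and limited by that machine's memory.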
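As a point of comparison for the distributed version, here is a minimal sketch of the single-machine approach. The class name `SingleMachineWordCount` and the whitespace-based splitting are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class SingleMachineWordCount {
    // Count how often each whitespace-separated word occurs in the text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) {
                // merge() stores 1 for a new word, otherwise adds 1 to the old count.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("a b a c d d"));
    }
}
```

Everything lives in one in-memory map on one machine, which is exactly the limit that the multi-machine design is meant to remove.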

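With multiple machines, the work is processed in parallel.

The merge step then becomes the bottleneck.

map breaks the task apart; reduce merges the results.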

step1 input

0: a b a c d d

1: a b c c d b

step2 split: divide the input and hand the pieces to different machines

m1 - 0: a b a c d d

m2 - 1: a b c c d b

step3 map: each machine runs on its own piece, with no aggregation

m1 - a,1 b,1 a,1 c,1 d,1 d,1

m2 - a,1 b,1 c,1 c,1 d,1 b,1

step4 partition + sort

m1 - a,1 a,1 b,1 | c,1 d,1 d,1

m2 - a,1 b,1 b,1 | c,1 c,1 d,1

step5 fetch + merge sort

m3 - a,1 a,1 b,1 | a,1 b,1 b,1

m4 - c,1 d,1 d,1 | c,1 c,1 d,1

m3 - a,[1,1,1] b,[1,1,1]

m4 - c,[1,1,1] d,[1,1,1]

step6 reduce: combine the values

m3 - a,[3] b,[3]

m4 - c,[3] d,[3]

step7 output

a,[3] b,[3] c,[3] d,[3]
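The seven steps can be traced in a single process. The sketch below is a simulation only: the class `WordCountPipeline`, its method names, and the fixed partition rule (a and b to the first reducer, everything else to the second) are assumptions chosen to mirror the example, not a general partitioner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordCountPipeline {
    // map: emit (word, 1) for every token, with no aggregation.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(split);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // partition: keys up to "b" go to reducer 0, all later keys to reducer 1.
    static int partition(String key) {
        return key.compareTo("b") <= 0 ? 0 : 1;
    }

    static Map<String, Integer> run(List<String> splits) {
        // One sorted map per reducer: key -> values fetched from every mapper.
        List<TreeMap<String, List<Integer>>> reducers = new ArrayList<>();
        reducers.add(new TreeMap<>());
        reducers.add(new TreeMap<>());

        // split + map + partition + fetch: route each pair to its reducer.
        for (String split : splits) {
            for (Map.Entry<String, Integer> pair : map(split)) {
                reducers.get(partition(pair.getKey()))
                        .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }

        // reduce + output: sum the value list of every key.
        Map<String, Integer> output = new TreeMap<>();
        for (TreeMap<String, List<Integer>> grouped : reducers) {
            for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
                int sum = 0;
                for (int value : entry.getValue()) {
                    sum += value;
                }
                output.put(entry.getKey(), sum);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a c d d", "a b c c d b")));
        // prints {a=3, b=3, c=3, d=3}
    }
}
```

The `TreeMap` stands in for the sort and merge-sort steps: it keeps each reducer's keys in order with their values grouped together.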

Step 3 does not merge anything, so the mapper needs no HashMap.

import java.util.Iterator;
import java.util.StringTokenizer;

// Both classes are nested inside the job class; OutputCollector is assumed to come from the framework.
public static class Map {
    // key: where the article is stored; value: the article content
    public void map(String key, String value, OutputCollector<String, Integer> output) {
        // Split the article into words
        StringTokenizer tokenizer = new StringTokenizer(value);
        while (tokenizer.hasMoreTokens()) {
            String outputKey = tokenizer.nextToken();
            output.collect(outputKey, 1);
        }
    }
}

public static class Reduce {
    // key: the key emitted by map ..
    public void reduce(String key, Iterator<Integer> values, OutputCollector<String, Integer> output) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        output.collect(key, sum);
    }
}

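partition and sort

The master groups keys using consistent hashing. Sorting is an external sort on disk.

Each reduce worker fetches the sorted files to its own machine.

How many machines for map and reduce? For example, 1000 + 1000.

With more machines, each one has less to process and the total processing time drops, but startup takes longer.

The number of reducers is capped by the number of distinct keys.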
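To make the consistent hashing idea concrete, here is a small sketch of a hash ring that maps a key to a reducer. The class `ConsistentHashRing`, the virtual-node count, and the bit-mixing hash are all assumptions made for illustration; the notes do not say how the master implements its grouping.

```java
import java.util.Map;
import java.util.TreeMap;

class ConsistentHashRing {
    // Ring position -> reducer name; each reducer owns several virtual nodes.
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Mix the bits of hashCode() so similar strings land far apart on the ring.
    private static int hash(String s) {
        int h = s.hashCode();
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return h;
    }

    void addReducer(String name) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(name + "#" + i), name);
        }
    }

    void removeReducer(String name) {
        for (int i = 0; i < virtualNodes; i++) {
            // Two-argument remove: only delete the position if this reducer still owns it.
            ring.remove(hash(name + "#" + i), name);
        }
    }

    // Walk clockwise from the key's position to the first virtual node.
    String reducerFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no reducers on the ring");
        }
        Map.Entry<Integer, String> owner = ring.ceilingEntry(hash(key));
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(50);
        ring.addReducer("r1");
        ring.addReducer("r2");
        System.out.println("a -> " + ring.reducerFor("a"));
    }
}
```

The useful property is that removing a reducer only moves the keys that reducer owned; every other key keeps its assignment.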

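Given a forward index, build an inverted index: given a word, return the IDs of the articles that contain it.

key: a keyword of the article, value: the article ID

reduce deduplicates, to handle a keyword appearing twice in the same article.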

// Break each article apart into (word, article ID) pairs
// Document is assumed to expose an int id and a String content.
public static class Map {
    public void map(String key, Document value, OutputCollector<String, Integer> output) {
        StringTokenizer tokenizer = new StringTokenizer(value.content);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            output.collect(word, value.id);
        }
    }
}

// Merge the IDs for the same word
public static class Reduce {
    public void reduce(String key, Iterator<Integer> values, OutputCollector<String, List<Integer>> output) {
        List<Integer> results = new ArrayList<>();
        int left = -1; // previous ID; assumes IDs are non-negative
        while (values.hasNext()) {
            int now = values.next();
            // Only adjacent repeats are skipped, so the values must arrive sorted by ID
            if (left != now) {
                results.add(now);
            }
            left = now;
        }
        output.collect(key, results);
    }
}
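For checking the expected result on a small input, the same index can be built in one process. This sketch swaps in a `TreeSet` for the deduplication, so unlike the reducer it does not depend on the IDs arriving in sorted order; the class name `InvertedIndexLocal` and the sample articles are made up for illustration.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexLocal {
    // docs maps article ID -> article content; the result maps word -> sorted, unique IDs.
    static Map<String, TreeSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            StringTokenizer tokenizer = new StringTokenizer(doc.getValue());
            while (tokenizer.hasMoreTokens()) {
                // The set silently drops an ID that is already recorded for this word.
                index.computeIfAbsent(tokenizer.nextToken(), w -> new TreeSet<>())
                     .add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(build(Map.of(1, "map reduce map", 2, "reduce shuffle")));
        // prints {map=[1], reduce=[1, 2], shuffle=[2]}
    }
}
```

The word "map" appears twice in article 1 but its ID is listed once, which is the case the reduce step has to handle.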

anagram:

map key: the root of each word (its letters in sorted order), value: the word

public static class Map {
    public void map(String key, String value, OutputCollector<String, String> output) {
        StringTokenizer tokenizer = new StringTokenizer(value);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            // Anagrams share the same letters, so sorting them gives a common key
            char[] sc = word.toCharArray();
            Arrays.sort(sc);
            output.collect(new String(sc), word);
        }
    }
}

reduce key: the word root from map, value: list of words

public static class Reduce {
    public void reduce(String key, Iterator<String> values, OutputCollector<String, List<String>> output) {
        List<String> results = new ArrayList<>();
        while (values.hasNext()) {
            results.add(values.next());
        }
        output.collect(key, results);
    }
}
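A quick local run shows what the grouping produces. This is only a single-process sketch; the class `AnagramGroups` and the sample words are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AnagramGroups {
    // The root of a word is its letters in sorted order.
    static String root(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Group words by root, which is what map + shuffle + reduce do together.
    static Map<String, List<String>> group(List<String> words) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String word : words) {
            groups.computeIfAbsent(root(word), r -> new ArrayList<>()).add(word);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(group(List.of("lint", "intl", "inlt", "code")));
        // prints {cdeo=[code], ilnt=[lint, intl, inlt]}
    }
}
```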

top k frequency

class Pair {
    String key;
    int value;

    Pair(String k, int v) {
        key = k;
        value = v;
    }
}

public static class Map {
    public void map(String key, Document value, OutputCollector<String, Integer> output) {
        StringTokenizer tokenizer = new StringTokenizer(value.content);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            output.collect(word, 1);
        }
    }
}

public static class Reduce {
    private PriorityQueue<Pair> Q; // min-heap holding the k most frequent words seen so far
    private int k;

    // Orders by count ascending; on a tie the lexicographically larger key ranks lower
    private Comparator<Pair> cmp = new Comparator<Pair>() {
        public int compare(Pair a, Pair b) {
            if (a.value != b.value) {
                return a.value - b.value;
            }
            return b.key.compareTo(a.key);
        }
    };

    public void setup(int k) {
        Q = new PriorityQueue<Pair>(k, cmp);
        this.k = k;
    }

    public void reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        Pair cur = new Pair(key, sum);
        if (Q.size() < k) {
            Q.add(cur);
        } else {
            Pair peek = Q.peek();
            // Replace the heap minimum when the new pair ranks higher
            if (cmp.compare(cur, peek) > 0) {
                Q.poll();
                Q.add(cur);
            }
        }
    }

    public void cleanup(OutputCollector<String, Integer> output) {
        // Polling yields ascending order, so emit in reverse for descending frequency
        List<Pair> res = new ArrayList<>();
        while (!Q.isEmpty()) {
            res.add(Q.poll());
        }
        for (int i = res.size() - 1; i >= 0; i--) {
            Pair cur = res.get(i);
            output.collect(cur.key, cur.value);
        }
    }
}
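One point the code leaves implicit: if the job runs more than one reducer, each reducer only sees its own keys and therefore emits a local top k. The sketch below assumes that setup and shows one way to merge those local lists into a global top k with the same size-k min-heap idea. `TopKMerge` and its inputs are hypothetical names, not part of the original solution.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopKMerge {
    // Same ordering as the reducer: count ascending, ties broken by key descending.
    static final Comparator<Map.Entry<String, Integer>> CMP = (a, b) -> {
        if (!a.getValue().equals(b.getValue())) {
            return Integer.compare(a.getValue(), b.getValue());
        }
        return b.getKey().compareTo(a.getKey());
    };

    static List<Map.Entry<String, Integer>> merge(
            List<List<Map.Entry<String, Integer>>> localTopKs, int k) {
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(CMP);
        for (List<Map.Entry<String, Integer>> local : localTopKs) {
            for (Map.Entry<String, Integer> pair : local) {
                heap.add(pair);
                if (heap.size() > k) {
                    heap.poll(); // drop the lowest-ranked pair so at most k remain
                }
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(CMP.reversed()); // highest count first
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> reducerA = List.of(Map.entry("a", 5), Map.entry("b", 3));
        List<Map.Entry<String, Integer>> reducerB = List.of(Map.entry("c", 4), Map.entry("d", 1));
        System.out.println(merge(List.of(reducerA, reducerB), 2));
        // prints [a=5, c=4]
    }
}
```

This works because each key is counted by exactly one reducer, so any key in the global top k must also be in its own reducer's local top k.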

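design a MR system:

master controls the overall flow of the system; slaves do the actual work

1. The user specifies how many map and reduce workers to run, and the corresponding machines are started.

2. The master decides which slaves act as map workers and which as reduce workers.

3. The master splits the input as evenly as it can among the map workers; each reads its file and then runs the map job.

4. After the map job, each map worker writes its results to its local disk.

5. The transfer-and-organize step (shuffle) passes the map results to the reduce workers.

6. The reduce workers run and write out their results when they finish.

reduce starts only after map has finished.

If a machine goes down, another machine is assigned to take over.

If one key has far too many values for a single reducer, append a random suffix to the key, similar to a shard key: fb1, fb2, fb3 are then assigned to different reducers.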
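A minimal sketch of the random-suffix idea for a counting job follows. Two things here are assumptions beyond the notes: the `#` separator between key and suffix, and the second stage that strips the suffix and adds up the partial counts. `SaltedKeyCount` and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class SaltedKeyCount {
    static final String SEP = "#";

    // Map side: turn "fb" into one of "fb#0" .. "fb#(shards-1)" at random.
    static String salt(String key, int shards, Random random) {
        return key + SEP + random.nextInt(shards);
    }

    // Second stage: strip the suffix to recover the original key.
    static String unsalt(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf(SEP));
    }

    static Map<String, Integer> count(List<String> keys, int shards, Random random) {
        // Stage 1: partial counts per salted key; each salted key could sit on a different reducer.
        Map<String, Integer> partial = new HashMap<>();
        for (String key : keys) {
            partial.merge(salt(key, shards, random), 1, Integer::sum);
        }
        // Stage 2: merge the partial counts back under the original key.
        Map<String, Integer> total = new HashMap<>();
        for (Map.Entry<String, Integer> entry : partial.entrySet()) {
            total.merge(unsalt(entry.getKey()), entry.getValue(), Integer::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("fb", "fb", "fb", "fb", "gg");
        System.out.println(count(keys, 3, new Random(42)).get("fb"));
        // prints 4
    }
}
```

The hot key's load is spread over up to `shards` reducers in the first stage, and the second stage only has to combine a handful of partial sums per key.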

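input and output are stored on GFS.

The mapper output on local disk does not need to be saved to GFS; if it is lost, redo the map work. Intermediate data is not important.

There is a preprocessing step before the mappers and reducers run; they are placed on different machines.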

MapReduce whole process

1. start: the user program starts the master and the workers

2. assign task: the master assigns tasks to the map workers and reduce workers, and hands them the map and reduce code

3. split: the master splits the input data

4. map read: each map worker reads its split of the input data

5. map: each map worker runs the map job on its own machine

6. map output: each map worker writes its output to a file on its local disk

7. reduce fetch: each reduce worker fetches its data from the map workers

8. reduce: each reduce worker runs the reduce job on its own machine

9. reduce output: each reduce worker writes the final output data
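The control flow of the whole process can be sketched with a thread pool standing in for the worker machines. This is a toy illustration for word count under several assumptions: threads replace machines, in-memory lists replace local disks and GFS, and keys are routed to reducers by a plain hash. `MiniMapReduce` and its parameters are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class MiniMapReduce {
    static Map<String, Integer> run(List<String> splits, int numReducers) {
        // start: threads stand in for worker machines in this sketch.
        ExecutorService workers = Executors.newFixedThreadPool(4);
        try {
            // assign task + map: one task per split; every emitted word implies a count of 1.
            List<Callable<List<String>>> mapTasks = new ArrayList<>();
            for (String split : splits) {
                mapTasks.add(() -> {
                    List<String> words = new ArrayList<>();
                    StringTokenizer tokenizer = new StringTokenizer(split);
                    while (tokenizer.hasMoreTokens()) {
                        words.add(tokenizer.nextToken());
                    }
                    return words;
                });
            }
            // invokeAll blocks until every map task is done, so reduce starts only afterwards.
            List<Future<List<String>>> mapOutputs = workers.invokeAll(mapTasks);

            // reduce fetch: route each emitted word to a reducer by hash.
            List<List<String>> fetched = new ArrayList<>();
            for (int r = 0; r < numReducers; r++) {
                fetched.add(new ArrayList<>());
            }
            for (Future<List<String>> output : mapOutputs) {
                for (String word : output.get()) {
                    fetched.get(Math.floorMod(word.hashCode(), numReducers)).add(word);
                }
            }

            // reduce: each reducer counts the words it fetched.
            List<Callable<Map<String, Integer>>> reduceTasks = new ArrayList<>();
            for (List<String> words : fetched) {
                reduceTasks.add(() -> {
                    Map<String, Integer> counts = new TreeMap<>();
                    for (String word : words) {
                        counts.merge(word, 1, Integer::sum);
                    }
                    return counts;
                });
            }

            // reduce output: collect every reducer's result into the final output.
            Map<String, Integer> result = new TreeMap<>();
            for (Future<Map<String, Integer>> counts : workers.invokeAll(reduceTasks)) {
                result.putAll(counts.get());
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a c d d", "a b c c d b"), 2));
        // prints {a=3, b=3, c=3, d=3}
    }
}
```

A real system would also need the pieces this sketch leaves out, such as reassigning a failed worker's task and writing map output to local disk.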

posted on 2024-02-27 05:31 dddddcoke