[Distributed Systems Paper Notes] - MapReduce
Problems to solve
Many computations over large amounts of raw data, for example processing crawled documents or web request logs into derived data, are conceptually straightforward. However, when the input data set is extremely large, the work has to be distributed across hundreds or thousands of commodity machines.
Before MapReduce, programmers had to distribute the data to each machine themselves, which consumed a lot of network traffic; the data had to be stored on both the server and the clients, adding storage overhead; and they had to hand-write the code for computation, distribution, and result integration.
Summary of MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets on a large cluster of commodity machines.
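As a concrete illustration of the model, here is the paper's canonical word-count example sketched in Python (the paper presents it in C++-style pseudocode); `emit` stands in for the framework's intermediate-pair collector.

```python
# Word count, the canonical MapReduce example. The framework calls the
# map function once per document and the reduce function once per
# unique intermediate key.

def map_fn(doc_name, contents, emit):
    # key: document name, value: document contents
    for word in contents.split():
        emit(word, 1)            # one partial count per occurrence

def reduce_fn(word, counts):
    # key: a word, values: all partial counts emitted for that word
    return sum(counts)
```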
Key features
- Parallelization
- Fault-tolerance
    - the master pings every worker periodically
    - if a worker does not respond in time, it is marked as failed
    - its tasks are reset to the idle state and become eligible for rescheduling on other workers (see the failure-detector sketch after this list)
- Data distribution
    - Data-Partitioning: divide the data into multiple chunks
        - provided by the Google File System (GFS)
        - Properties:
            - shared and physically distributed across many machines
            - all files are split into 16-64 MB chunks
            - every chunk is replicated and available from at least 3 machines
    - Task-Partitioning:
        - map tasks are assigned to workers whose local disks already hold the input data, which conserves network bandwidth
- Load balancing
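The fault-tolerance bullets above can be made concrete with a minimal sketch of the master's failure detector. All names here (`Task`, `ping`, `PING_INTERVAL`) are illustrative, not from the paper, and a real master would run this alongside scheduling.

```python
import time
from dataclasses import dataclass

@dataclass
class Task:
    state: str = "idle"            # idle -> in-progress -> completed
    assigned_to: object = None     # worker currently running this task

PING_INTERVAL = 5.0                # seconds between health checks (assumed)

def monitor(workers, tasks):
    # Ping every worker; a worker that misses a ping is presumed failed,
    # and its in-progress tasks are reset so they can be rescheduled.
    while True:
        for worker in workers:
            if not worker.ping():
                for task in tasks:
                    if task.assigned_to is worker and task.state == "in-progress":
                        task.state = "idle"
                        task.assigned_to = None
        time.sleep(PING_INTERVAL)
```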
Implementation and refinement
Implementation
- Split the input files into M pieces (typically 16-64 MB each).
- The master assigns work to workers: M map tasks and R reduce tasks.
- map()
    - reads the contents of the corresponding input split
    - parses key/value pairs out of the input data and passes each pair to the user-defined Map function
    - intermediate key/value pairs are buffered in memory
- Buffered pairs are periodically written to local disk
    - partitioned into R regions by the partitioning function
    - the locations of the pairs are passed back to the master
- Each reducer reads the buffered data remotely from the map workers' disks
    - sorts it by the intermediate keys
    - groups all occurrences of the same key together
- reduce()
    - iterates over the sorted intermediate data
    - for each unique key encountered, passes the key and the corresponding values to the user-defined Reduce function
    - appends the Reduce output to a final output file
- Completion
    - the output is available in the R output files (one per reduce task)
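The whole flow above fits in a short single-process sketch. This is not the distributed implementation, just the same data movement run sequentially, with the paper's default hash(key) mod R partitioner.

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, R):
    # Map phase: partition each split's intermediate pairs into R
    # regions by hash(key) mod R.
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for key, value in map_fn(split):
            regions[hash(key) % R][key].append(value)

    # Reduce phase: within each region, iterate over the keys in sorted
    # order and apply the user's reduce function to each group.
    output = []
    for region in regions:
        for key in sorted(region):
            output.append((key, reduce_fn(key, region[key])))
    return output

# Word count adapted to this driver's calling convention.
def wc_map(split):
    for word in split.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

print(run_mapreduce(["the quick fox", "the lazy dog"], wc_map, wc_reduce, R=2))
```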
Refinement
A combiner function can be applied within each map task; it partially merges the map() output on the map node, reducing the amount of intermediate data before it leaves the machine.
After map() produces its output, a partitioning function assigns the intermediate key/value pairs to the R reduce tasks, aiming for fairly well-balanced partitions.
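A minimal sketch of a word-count combiner, assuming the reduce function is summation: because addition is associative and commutative, partial sums computed on the map node are safe and shrink the intermediate data before it crosses the network.

```python
from collections import Counter

def combine(pairs):
    # pairs: raw (word, 1) pairs emitted by one map task
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())   # usually far fewer pairs than the input

# combine([("a", 1), ("b", 1), ("a", 1)]) -> [("a", 2), ("b", 1)]
```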
Contribution (Why it matters)
The key contributions of the MapReduce framework are not the actual map and reduce functions ..., but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine.
The way MapReduce hides the sophisticated machinery behind fault tolerance and parallelization is its main success: it lets programmers use large clusters of machines without writing code for failure handling and data distribution.
Supplementary materials
- Hashing: the default partitioning function, hash(key) mod R, distributes the key space evenly over the reduce tasks
- Idempotence: an operation can be applied multiple times without changing the result beyond the initial application
    - this is mainly used for failure control
    - if a reducer fails (or straggles), the master sends the same task to another worker
    - even if both executions eventually succeed, there is no conflict (see the atomic-rename sketch after this list)
- Google File System
- Remote Procedure Call
    - a method call from one machine to another
    - in MapReduce these calls are blocking: the caller waits for the result before continuing
- Applications: TF/IDF (Term frequency - Inverse document frequency), PageRank
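On idempotence: the paper relies on the underlying file system's atomic rename to make duplicate executions harmless. A hedged sketch of that commit step, using POSIX atomic rename as a stand-in for GFS semantics:

```python
import os
import tempfile

def commit_output(final_path, data):
    # Write to a private temporary file, invisible to readers, then
    # atomically rename it into place. If two executions of the same
    # deterministic task both finish, the later rename just installs an
    # identical file, so duplicate execution causes no collision.
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp_path, final_path)   # atomic on POSIX
```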
