[Distributed Systems Paper Notes] - MapReduce
Problems to solve
Many computations over large amounts of raw data, for example processing crawled documents or web request logs into derived data, are conceptually straightforward. However, when the input data set is extremely large, the work has to be distributed across hundreds or thousands of commodity machines.
Before MapReduce, programmers had to distribute the data to each machine themselves, which consumed a lot of network traffic; the data had to be stored on both the server and the clients, adding storage overhead; and they had to hand-write the code for computation, distribution, and result integration.
Summary of MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets on a large cluster of commodity machines.
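As a concrete illustration of the model, here is the paper's canonical word-count example sketched in Python (the paper presents it in C++-style pseudocode); `emit` stands in for the framework's intermediate-pair collector.

```python
# Word count, the canonical MapReduce example. The framework calls the
# map function once per document and the reduce function once per
# unique intermediate key.

def map_fn(doc_name, contents, emit):
    # key: document name, value: document contents
    for word in contents.split():
        emit(word, 1)            # one partial count per occurrence

def reduce_fn(word, counts):
    # key: a word, values: all partial counts emitted for that word
    return sum(counts)
```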
Key features
- Parallelization
- Fault-tolerance
    - the master pings every worker periodically
    - if a worker does not respond in time, it is marked as failed
    - its tasks are reset to the idle state and become eligible for rescheduling on other workers (see the failure-detector sketch after this list)
- Data distribution
    - Data-Partitioning: divide the data into multiple chunks
        - provided by the Google File System (GFS)
        - Properties:
            - shared and physically distributed across many machines
            - all files are split into 16-64 MB chunks
            - every chunk is replicated and available from at least 3 machines
    - Task-Partitioning:
        - map tasks are assigned to workers whose local disks already hold the input data, which conserves network bandwidth
- Load balancing
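The fault-tolerance bullets above can be made concrete with a minimal sketch of the master's failure detector. All names here (`Task`, `ping`, `PING_INTERVAL`) are illustrative, not from the paper, and a real master would run this alongside scheduling.

```python
import time
from dataclasses import dataclass

@dataclass
class Task:
    state: str = "idle"            # idle -> in-progress -> completed
    assigned_to: object = None     # worker currently running this task

PING_INTERVAL = 5.0                # seconds between health checks (assumed)

def monitor(workers, tasks):
    # Ping every worker; a worker that misses a ping is presumed failed,
    # and its in-progress tasks are reset so they can be rescheduled.
    while True:
        for worker in workers:
            if not worker.ping():
                for task in tasks:
                    if task.assigned_to is worker and task.state == "in-progress":
                        task.state = "idle"
                        task.assigned_to = None
        time.sleep(PING_INTERVAL)
```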
Implementation and refinement
Implementation
- Split the input files into M pieces (typically 16-64 MB each).
- The master assigns work to workers: M map tasks and R reduce tasks.
- map()
    - reads the contents of the corresponding input split
    - parses key/value pairs out of the input data and passes each pair to the user-defined Map function
    - intermediate key/value pairs are buffered in memory
- Buffered pairs are periodically written to local disk
    - partitioned into R regions by the partitioning function
    - the locations of the pairs are passed back to the master
- Each reducer reads the buffered data remotely from the map workers' disks
    - sorts it by the intermediate keys
    - groups all occurrences of the same key together
- reduce()
    - iterates over the sorted intermediate data
    - for each unique key encountered, passes the key and the corresponding values to the user-defined Reduce function
    - appends the Reduce output to a final output file
- Completion
    - the output is available in the R output files (one per reduce task)
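The whole flow above fits in a short single-process sketch. This is not the distributed implementation, just the same data movement run sequentially, with the paper's default hash(key) mod R partitioner.

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, R):
    # Map phase: partition each split's intermediate pairs into R
    # regions by hash(key) mod R.
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for key, value in map_fn(split):
            regions[hash(key) % R][key].append(value)

    # Reduce phase: within each region, iterate over the keys in sorted
    # order and apply the user's reduce function to each group.
    output = []
    for region in regions:
        for key in sorted(region):
            output.append((key, reduce_fn(key, region[key])))
    return output

# Word count adapted to this driver's calling convention.
def wc_map(split):
    for word in split.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

print(run_mapreduce(["the quick fox", "the lazy dog"], wc_map, wc_reduce, R=2))
```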
Refinement
A combiner function can be applied within each map task; it partially merges the map() output on the map node, reducing the amount of intermediate data before it leaves the machine.
After map() produces its output, a partitioning function assigns the intermediate key/value pairs to the R reduce tasks, aiming for fairly well-balanced partitions.
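A minimal sketch of a word-count combiner, assuming the reduce function is summation: because addition is associative and commutative, partial sums computed on the map node are safe and shrink the intermediate data before it crosses the network.

```python
from collections import Counter

def combine(pairs):
    # pairs: raw (word, 1) pairs emitted by one map task
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())   # usually far fewer pairs than the input

# combine([("a", 1), ("b", 1), ("a", 1)]) -> [("a", 2), ("b", 1)]
```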
Contribution (Why it matters)
The key contributions of the MapReduce framework are not the actual map and reduce functions ..., but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine.
The way MapReduce hides the sophisticated machinery behind fault tolerance and parallelization is its main success: it lets programmers use large clusters of machines without writing code for failure handling and data distribution.
Supplementary materials
- Hashing: the default partitioning function, hash(key) mod R, distributes the key space evenly over the reduce tasks
- Idempotence: an operation can be applied multiple times without changing the result beyond the initial application
    - this is mainly used for failure control
    - if a reducer fails (or straggles), the master sends the same task to another worker
    - even if both executions eventually succeed, there is no conflict (see the atomic-rename sketch after this list)
- Google File System
- Remote Procedure Call
    - a method call from one machine to another
    - in MapReduce these calls are blocking: the caller waits for the result before continuing
- Applications: TF/IDF (Term frequency - Inverse document frequency), PageRank
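On idempotence: the paper relies on the underlying file system's atomic rename to make duplicate executions harmless. A hedged sketch of that commit step, using POSIX atomic rename as a stand-in for GFS semantics:

```python
import os
import tempfile

def commit_output(final_path, data):
    # Write to a private temporary file, invisible to readers, then
    # atomically rename it into place. If two executions of the same
    # deterministic task both finish, the later rename just installs an
    # identical file, so duplicate execution causes no collision.
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp_path, final_path)   # atomic on POSIX
```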
