Top K - System Design

Fast Path (short windows, e.g. 1 minute or 5 minutes)
Use a count-min sketch (a structure that estimates item frequencies using multiple hash functions) and aggregate data over a short period of time. The counts are approximate but available quickly, and there is no need to partition the data.
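The fast path can be illustrated with a minimal count-min sketch in Python. This is a sketch under stated assumptions, not the design from the referenced material: the class name CountMinSketch, the function top_k, the width and depth defaults, and the use of salted BLAKE2b as the family of hash functions are all choices made here for illustration. Estimates can only overcount, never undercount, because collisions add to a cell.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter (illustrative; sizes are assumptions)."""

    def __init__(self, width=1024, depth=4):
        self.width = width    # counters per row
        self.depth = depth    # number of rows, one hash function per row
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salting with the row number gives one hash function per row.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The minimum across rows is the least-inflated counter.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


def top_k(events, k, sketch=None):
    """Track the k items with the largest estimated counts in one window."""
    if sketch is None:
        sketch = CountMinSketch()
    candidates = {}
    for item in events:
        sketch.add(item)
        candidates[item] = sketch.estimate(item)
        if len(candidates) > k:
            # Evict the candidate with the smallest estimate.
            smallest = min(candidates, key=candidates.get)
            del candidates[smallest]
    return sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
```

A call such as top_k(["a", "a", "b", "c", "a", "b"], 2) returns the two heaviest items with their estimated counts. Memory stays fixed at width × depth counters plus k candidates, regardless of how many distinct items appear in the window.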
Slow Path (long windows, e.g. 1 hour or 1 day)
Data partitioners parse batches of events into individual events, hash-partition them by key, and send them on as messages.
Data processors aggregate the events of each partition and write the results to a file system.
MapReduce jobs then compute the frequency counts and select the top K in each job.
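The slow-path steps above can be sketched in a single process. This is only an illustration under assumptions: the function names partition, local_top_k, and global_top_k are hypothetical, zlib.crc32 stands in for whatever partitioning hash a real system uses, and in-memory lists replace the message queue, file system, and MapReduce jobs. Because hash partitioning sends every occurrence of a key to the same partition, each partition's counts are exact, so the global top K is always contained in the union of the per-partition top K lists.

```python
import heapq
import zlib
from collections import Counter


def partition(events, num_partitions):
    """Hash-partition events so all copies of a key land in one partition."""
    parts = [[] for _ in range(num_partitions)]
    for key in events:
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        parts[index].append(key)
    return parts


def local_top_k(part, k):
    """Exact frequency count within one partition, keeping the k largest."""
    return Counter(part).most_common(k)


def global_top_k(events, k, num_partitions=4):
    """Merge the per-partition top-k lists into the overall top k."""
    candidates = []
    for part in partition(events, num_partitions):
        candidates.extend(local_top_k(part, k))
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])
```

For example, global_top_k(["x"] * 4 + ["y"] * 3 + ["z"] * 2 + ["w"], 2) yields the pairs for "x" and "y" with counts 4 and 3. Unlike the fast path, these counts are exact, at the cost of partitioning and a full pass over the data.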
Papers
A Survey of Top-k Query Processing Techniques in Relational Database Systems
http://www.cs.umd.edu/~samir/498/topk.pdf
Efficient Computation of Frequent and Top-k Elements in Data Streams
http://www.cse.ust.hk/~raywong/comp5331/References/EfficientComputationOfFrequentAndTop-kElementsInDataStreams.pdf
Continuous Monitoring of Top-K Queries over Sliding Windows
http://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1546&context=sis_research
References
[1] https://www.youtube.com/watch?v=kx-XDoPjoHw
[2] https://soulmachine.gitbooks.io/system-design/content/cn/bigdata/heavy-hitters.html
[3] https://github.com/thachlp/system-design-concept/blob/master/linkedin/topk.md
