2014-04-24 22:01

Problem: You have 10 billion URLs. How do you detect whether any of them are duplicates?

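Solution: Hash each URL to compute a signature, then store the signatures in a K-V database and look them up to find duplicates.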

Code:

// 10.6 You have 10 billion URLs. How would you detect duplicates among them?
// Answer:
//    1. Use a digital signature (hash) algorithm to convert each URL string into a numeric checksum.
//    2. Use this signature as the hash key. If memory allows, use an in-memory hash table to detect duplicates.
//    3. If the data won't fit in memory, use a K-V database instead. At 8 bytes per signature, 10 billion signatures take about 80 GB, which one machine can still handle, so I won't distribute the work across machines.
int main()
{
    return 0;
}
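
The steps in the comments can be sketched roughly as follows. This is only a minimal illustration of the in-memory case (step 2), not a definitive implementation. The post does not name a signature algorithm, so the 64-bit FNV-1a hash used here is an assumption, and the function names `urlSignature` and `findDuplicateCandidates` are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// 64-bit FNV-1a hash, used here as the "signature" of a URL.
// Assumption: the original post does not specify which hash to use.
std::uint64_t urlSignature(const std::string& url) {
    std::uint64_t hash = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    for (unsigned char c : url) {
        hash ^= c;
        hash *= 0x100000001b3ULL;  // FNV-1a 64-bit prime
    }
    return hash;
}

// Returns every URL whose signature was already seen earlier in the input.
// A signature match is only a candidate duplicate, because two distinct
// URLs can in principle hash to the same 64-bit value.
std::vector<std::string> findDuplicateCandidates(const std::vector<std::string>& urls) {
    std::unordered_set<std::uint64_t> seen;
    std::vector<std::string> duplicates;
    for (const std::string& url : urls) {
        // insert().second is false when the signature is already in the set.
        if (!seen.insert(urlSignature(url)).second) {
            duplicates.push_back(url);
        }
    }
    return duplicates;
}
```

Calling `findDuplicateCandidates` on `{"http://a.com", "http://b.com", "http://a.com"}` reports the second `http://a.com`. For the full 10 billion URLs the set of signatures would not fit in typical RAM, so the `std::unordered_set` would be replaced by lookups in a K-V store, as step 3 suggests, and any candidate should be confirmed by comparing the actual URL strings.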

 

 posted on 2014-04-24 22:04 by zhuli19901106