cascading-simhash: a library to cluster by minhashes in Hadoop

simhashing

Say you have a large corpus of web documents and you want to group them together by some notion of “similarity”. For instance, you may want to detect plagiarism or find content that appears on multiple pages of a site.

In this scenario, a pairwise comparison of all documents is impractical, because the number of pairs grows quadratically with the size of the corpus. Fortunately, we can use simhashing.

Broadly speaking, simhashing is an algorithm that calculates a “cluster id” (the minimum hash, or minhash) from the content of each item. Because the minhash for an item is calculated independently of the other items in the set, minhashing is an ideal candidate for MapReduce.
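To make the idea concrete, here is a minimal sketch in plain Python. It is not the cascading-simhash API. The function names, the 3-word shingle size, and the use of MD5 as a stable hash are all assumptions chosen for illustration. Each document is split into overlapping word shingles, every shingle is hashed, and the smallest hash becomes the cluster id. Grouping by that id plays the role of the shuffle and reduce step.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles in text (k=3 is an illustrative choice)."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents collapse to a single shingle.
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def stable_hash(token):
    """Deterministic 64-bit hash of a token (MD5 is assumed here only for stability)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(text, k=3):
    """Cluster id for one document: the minimum hash over its shingles.

    It depends only on this document, so it could run inside a map task.
    """
    return min(stable_hash(s) for s in shingles(text, k))


def cluster(docs, k=3):
    """Group document ids by minhash, like a group-by-key after the map step."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[minhash(text, k)].append(doc_id)
    return dict(groups)
```

Calling `cluster({"a": "the quick brown fox", "b": "the quick brown fox"})` puts both ids under one key. With a single minhash, two documents that share most of their shingles land in the same cluster only when they also share the smallest-hashing shingle, so this sketch shows the principle rather than a tuned similarity measure.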

posted on 2012-09-22 13:34 by lexus