cascading-simhash: a library to cluster by minhashes in Hadoop

By Nate Murray | Published: May 9, 2011
simhashing
Say you have a large corpus of web documents and you want to group them by some notion of “similarity”. For instance, you may want to detect plagiarism or find content that appears on multiple pages of a site.
In this scenario, it’s impractical to do a pairwise comparison of all documents: n documents require n(n-1)/2 comparisons, which grows quadratically with the size of the corpus. Fortunately, we can use simhashing.
Broadly speaking, simhashing is an algorithm that calculates a “cluster id” (the minimum hash, or minhash) from the content of each document. Because the minhash for an item is calculated independently of the other items in the set, minhashing is an ideal candidate for MapReduce.
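To make the idea concrete, here is a minimal Python sketch of computing a minhash "cluster id" per document and then grouping by it. It is not the cascading-simhash API: the function names (`shingles`, `stable_hash`, `minhash`, `cluster`), the 3-word shingle size, and the use of MD5 as the hash are all assumptions chosen for illustration. Each document is hashed without looking at any other document (the "map" side), and grouping by the resulting id is the "reduce" side.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents become a single shingle.
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def stable_hash(s):
    """64-bit hash that is stable across processes (unlike built-in hash())."""
    digest = hashlib.md5(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(text, k=3):
    """Hypothetical 'cluster id': the minimum hash over the document's shingles."""
    return min(stable_hash(s) for s in shingles(text, k))


def cluster(docs, k=3):
    """Group document ids by minhash.

    The loop body depends only on one document, so it could run as a map
    step; collecting ids under the same key is what a reduce step would do.
    """
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[minhash(text, k)].append(doc_id)
    return dict(groups)


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "The quick  brown fox jumps over the lazy dog",  # same words, different case/spacing
    "c": "hadoop jobs process large files in parallel",
}
print(cluster(docs))
```

With a single minhash like this, two documents land in the same cluster with probability roughly equal to the Jaccard similarity of their shingle sets, so near-duplicates are likely, but not guaranteed, to share an id. Practical systems usually combine several minhashes to tune that trade-off.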

Posted on 2012-09-22 13:34 by lexus
