Simhashing in Erlang: beauty with binary comprehension

Simhashing is a popular technique for detecting near-duplicate content. Given two files, the similarity of their simhashes gives a mathematical way to measure the similarity of the documents. The algorithm works like this:

  • split the content into a set of features
  • for each feature, compute a hash of fixed width
  • for each bit in the hash, let 1 be a positive increment (+1) and 0 a negative increment (-1)
  • sum the increments per bit position across all the hashes
  • for each bit position, if the sum is positive the simhash has 1 in that position
  • for a negative sum the simhash has 0 in that position (the code below also uses 0 for a tie)

Working this out in a simplified form: if the text is “Twitter is littered with spam”, I break it into features, which are the individual words “twitter”, “is”, “littered”, “with”, “spam”. Suppose the hashes for these are 10101, 11001, 11000, 01100, 01000. These translate to 1 -1 1 -1 1, 1 1 -1 -1 1, 1 1 -1 -1 -1, -1 1 1 -1 -1, -1 1 -1 -1 -1. Adding these up per bit position gives 1 3 -1 -5 -1. The simhash for this is 11000.
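The steps can also be checked mechanically. Here is a minimal sketch that works on plain lists of bits instead of real hashes. The module name simhash_steps and its function names are made up for this illustration and are not part of the implementation further down.

```erlang
-module(simhash_steps).
-export([simhash/1, example/0]).

%% Translate one hash, given as a list of bits, into increments:
%% 1 becomes +1 and 0 becomes -1.
to_increments(Bits) ->
    [case B of 1 -> 1; 0 -> -1 end || B <- Bits].

%% Add two equally long increment lists position by position.
add(Xs, Ys) ->
    lists:zipwith(fun(X, Y) -> X + Y end, Xs, Ys).

%% Hashes is a non-empty list of equally long bit lists.
%% Returns {Sums, Bits}: the per-position sums and the simhash bits.
simhash([First | Rest]) ->
    Sums = lists:foldl(fun(H, Acc) -> add(to_increments(H), Acc) end,
                       to_increments(First), Rest),
    Bits = [case S > 0 of true -> 1; false -> 0 end || S <- Sums],
    {Sums, Bits}.

%% The five example hashes for "twitter", "is", "littered", "with", "spam".
example() ->
    simhash([[1,0,1,0,1],
             [1,1,0,0,1],
             [1,1,0,0,0],
             [0,1,1,0,0],
             [0,1,0,0,0]]).
```

Calling simhash_steps:example() returns the per-position sums together with the resulting bits, so the hand calculation can be compared against it.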

The Hamming distance between the simhashes of two documents can be used to figure out how similar the documents are: the fewer bits differ, the closer the documents. This presentation has a detailed explanation of the simhashing technique.
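Comparing two simhashes could look roughly like this. It is a sketch only: the module simhash_distance and the function hamming/2 are names invented here, and it assumes both simhashes are bitstrings of the same size.

```erlang
-module(simhash_distance).
-export([hamming/2]).

%% Number of bit positions in which two equally sized bitstrings differ.
%% Inputs of different sizes fail with function_clause.
hamming(A, B) when bit_size(A) =:= bit_size(B) ->
    Size = bit_size(A),
    <<X:Size>> = A,
    <<Y:Size>> = B,
    popcount(X bxor Y, 0).

%% Count the 1 bits of a non-negative integer.
popcount(0, Count) -> Count;
popcount(N, Count) -> popcount(N bsr 1, Count + (N band 1)).
```

For instance, simhash_distance:hamming(<<2#10101:5>>, <<2#01010:5>>) is 5 because every bit differs. What distance still counts as "near duplicate" depends on the content and the hash width, and has to be tuned.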

Now, tweets are not an ideal place to experiment with simhashes, given that we are limited to 140 characters. But it's a pretty effective technique for deciding similarity, and it can be expressed very easily. The power of Erlang's binary comprehensions and bit syntax allows us to calculate simhashes in a very nifty way. Without further blabber, here is an Erlang implementation.

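-module(simhash).

-define(HASH_RANGE, 1 bsl 128).
%% Bits reserved per counter slot in the accumulator, one slot per hash bit.
%% A slot overflows into its neighbour after 4095 features.
-define(HASH_ACCU, 12).
%% Width of each feature hash in bits (MD5 yields 128 bits).
-define(HASH_WIDTH, 128).
-export([hash_file/1]).

%% Read a file, split it on non-word characters and return its simhash
%% as a 128-bit binary.
hash_file(File) ->
  {ok, Binary} = file:read_file(File),
  Tokens = re:split(Binary, "\\W"),
  calculate_simhash(Tokens).

%% Hash one feature to a HASH_WIDTH-bit integer.
calculate_hash(A) ->
  %% Hash = erlang:phash2(A, ?HASH_RANGE),
  << Hash:(?HASH_WIDTH) >> = erlang:md5(A),
  Hash.

calculate_simhash(Tokens) ->
  FeatureHashes = [calculate_hash(A) || A <- Tokens, A =/= <<>>],
  {HashAcc, Len} = lists:foldl(fun accumulate_simhash/2, {0,0} , FeatureHashes),
  %% Collapse each counter slot back to one bit: 1 if more than half of
  %% the feature hashes had that bit set, 0 otherwise.
  << <<(is_hash_valid(B, (Len / 2))):1>> ||
     <<B:(?HASH_ACCU)>> <= <<HashAcc:(?HASH_WIDTH * ?HASH_ACCU)>> >>.

%% Widen every bit of Hash into its own HASH_ACCU-bit slot and add the
%% result to the running total, so each slot counts the 1 bits seen so far.
accumulate_simhash(Hash, {Accum, L}) ->
  <<A:(?HASH_WIDTH * ?HASH_ACCU)>> =
          << <<B:(?HASH_ACCU)>> || <<B:1>> <= << Hash:(?HASH_WIDTH) >> >>,
  {Accum + A, L + 1}.

is_hash_valid(E, Len) ->
  case (E > Len) of
    true -> 1;
    _ -> 0
  end.

The beauty of this is in the accumulate_simhash function. Here I expand each bit of the feature hash into a fixed-width slot of HASH_ACCU bits and just add the result to the accumulator, so every slot ends up counting how many feature hashes had a 1 in that position. Counting the 1 bits and comparing against half the number of hashes is equivalent to summing +1 and -1 and checking the sign. HASH_ACCU has to be sized for the number of features: with 12 bits a slot can count up to 4095 before it overflows into its neighbour, so raise it for larger documents. In the final binary comprehension of calculate_simhash the resultant hash is calculated by reducing each HASH_ACCU-bit slot back to a 1 or 0 based on the number of hashes found.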

Is that not beautiful?
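To watch the slot trick from accumulate_simhash in isolation, here is a scaled-down sketch with 3-bit hashes and 4-bit slots. The module simhash_slots and its functions are hypothetical helpers written for this illustration. They take the widths as arguments instead of using the HASH_WIDTH and HASH_ACCU macros.

```erlang
-module(simhash_slots).
-export([expand/3, slots/3, example/0]).

%% Widen each of the Width bits of Hash into its own SlotBits-bit slot
%% and return the result as one big integer.
expand(Hash, Width, SlotBits) ->
    Total = Width * SlotBits,
    <<Wide:Total>> = << <<B:SlotBits>> || <<B:1>> <= <<Hash:Width>> >>,
    Wide.

%% Read the per-position counters back out of an accumulator.
slots(Acc, Width, SlotBits) ->
    Total = Width * SlotBits,
    [C || <<C:SlotBits>> <= <<Acc:Total>>].

%% 1 if strictly more than half of the N hashes had the bit set.
majority(Count, N) when Count > N / 2 -> 1;
majority(_Count, _N) -> 0.

%% Three 3-bit hashes: 101, 100 and 011.
example() ->
    Hashes = [2#101, 2#100, 2#011],
    Acc = lists:sum([expand(H, 3, 4) || H <- Hashes]),
    Counts = slots(Acc, 3, 4),              % [2,1,2]
    N = length(Hashes),
    Simhash = << <<(majority(C, N)):1>> || C <- Counts >>,
    {Counts, Simhash}.                      % {[2,1,2], <<2#101:3>>}
```

Plain integer addition does all the counting because no slot ever carries into the next one, as long as the number of hashes stays below 2 to the power of the slot width.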

Original article: http://blog.jebu.net/2009/08/simhashing-in-erlang-beauty-with-binary-comprehension/

posted on 2013-06-02 10:25 by 应无所住而生其心
