Simhashing in Erlang–beauty with binary comprehension
Simhashing is popular technique to detect near duplicates in content. Given two files the similarity in their simhashes gives a mathematical way to compute the similarity of the documents. The algorithm works like this
- split the content into a set of features
- for each of the feature compute a hash of fixed width
- for each bit in the hash let 1 be a positive increment and 0 a negative increment
- sum up the bits in hashes using the above translation
- for each bit position if the value is positive the sim hash has 1 in the corresponding bit position
- for a negative value the resultant hash has 0 in that bit position
Working this out in a simplified form, if the text is “Twitter is littered with spam”, I break this into features which are the individual words “twitter” “is” “littered” “with” “spam”. Suppose the hashes for these are 10101, 11001, 11000, 01100, 01000. This translates then to 1 -1 1 -1 1, 1 1 -1 -1 1, 1 1 -1 -1 -1, -1 1 1 -1 -1, -1 1 -1 -1 -1. Adding these up 1 3 -3 -5 1. The simhash for this is 11001.
The hamming distance between the hashes of two documents can be used to figure out the similarity of the docs. This presentation has a detailed explanation of the simhashing technique.
Now tweets are not an ideal place to experiment with the simhashes given we are limited to 140 chars. But its a pretty effective technique for deciding similarity, and can be expressed very easily. The power of Erlang binary comprehensions and the bit syntaxes allows us to calculate simhashes in a very nifty way. Without further blabber here is an Erlang implementation.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
-module(simhash). -define(HASH_RANGE, 1 bsl 128). -define(HASH_ACCU, 12). -define(HASH_WIDTH, 128). -export([hash_file/1]). hash_file(File) -> {ok, Binary} = file:read_file(File), Tokens = re:split(Binary, "\\W"), calculate_simhash(Tokens). calculate_hash(A) -> %% Hash = erlang:phash2(A, ?HASH_RANGE), << Hash:(?HASH_WIDTH) >> = erlang:md5(A), Hash. calculate_simhash(Tokens) -> FeatureHashes = [calculate_hash(A) || A <- Tokens, A =/= <<>>], {HashAcc, Len} = lists:foldl(fun accumulate_simhash/2, {0,0} , FeatureHashes), << <<(is_hash_valid(B, (Len / 2))):1>> || <<B:(?HASH_ACCU)>> <= <<HashAcc:(?HASH_WIDTH * ?HASH_ACCU)>> >>. accumulate_simhash(Hash, {Accum, L}) -> <<A:(?HASH_WIDTH * ?HASH_ACCU)>> = << <<B:(?HASH_ACCU)>> || <<B:1>> <= << Hash:(?HASH_WIDTH) >> >>, {Accum + A, L + 1}. is_hash_valid(E, Len) -> case (E > Len) of true -> 1; _ -> 0 end. |
The beauty of this is in accumulate_simhash function. Here I expand each bit of the feature hash into an arbitrary width representation and just add it to the accumulator. Depending on the number of features change the HASH_ACCU which is the number of bits for each bit of the resultant accumulator. At line 21 the resultant hash is calculated by reducing HASH_ACCU bits back to a 1 or 0 based on the number of hashes found.
Is that not beautiful?
原文地址:http://blog.jebu.net/2009/08/simhashing-in-erlang-beauty-with-binary-comprehension/
浙公网安备 33010602011771号