Well, there goes the neighborhood…
Most clustering algorithms are frustratingly non-local, and what is frustrating at small scale becomes intractable at large scale. Limiting your scope to a neighborhood of items usually requires heuristics that are clustering algorithms in their own right (Yo dawg, I put some clustering in your clustering). Any algorithm that requires a notion of pairwise similarity at best requires fetching many items from your data store, and at worst requires $O(n^2)$ time and space.
Wouldn’t it be nice if you could look at an item once, and determine its cluster immediately without consulting any other data? Wouldn’t it be nice if the clusters were stable between runs, so that the existence of one item would never change the cluster of another? Simhashing does exactly this. There are many approaches to simhashing; in this document I’m going to talk only about my favorite. It’s simple to implement, mathematically elegant, works on anything with many binary features, and produces high-quality results. It’s also simple to analyze, so don’t let the notation scare you off.
Comparing Two Sets
Suppose you have two sets, $A$ and $B$, and you would like to know how similar they are. First you might ask, how big is their intersection?

$|A \cap B|$

That’s nice, but it isn’t comparable across sets of different sizes, so let’s normalize it by the size of their union.

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

This is called the Jaccard Index, and is a common measure of set similarity. It has the nice property of being 0 when the sets are disjoint, and 1 when they are identical.
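Since Python sets support intersection and union natively, the whole definition fits in a couple of lines (a quick illustration with made-up sets, not part of the algorithm):

import math  # not needed here; the Jaccard index is pure set arithmetic

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|: 0 for disjoint sets, 1 for identical ones.
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared of 4 total = 0.5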
Hashing and Sorting
Suppose you have a uniform pseudo-random hash function $H$ from elements in your set to the range $[0, 1]$. For simplicity, assume that the output of $H$ is unique for each input. I’ll use $H(A)$ to denote the set of hashes produced by applying $H$ to each element of $A$, i.e. $H(A) = \{H(x) : x \in A\}$.
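If you don’t have such an $H$ lying around, one is easy to improvise from a cryptographic hash. Here’s a sketch; blake2b and the 64-bit range are my choices, not a requirement of the algorithm:

import hashlib

def H(x):
    # Deterministic, roughly uniform map from any value into [0, 1).
    digest = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64

def hash_set(a):
    # H(A): the set of hashes of every element of A.
    return {H(x) for x in a}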
Consider $\min(H(A))$. When you insert and delete elements from $A$, how often does $\min(H(A))$ change?
If you delete $x$ from $A$, then $\min(H(A))$ will only change if $H(x) = \min(H(A))$. Since any element has an equal chance of having the minimum hash value, the probability of this is $\frac{1}{|A|}$.
If you insert $x$ into $A$, then $\min(H(A))$ will only change if $H(x) < \min(H(A))$. Again, since any element has an equal chance of having the minimum hash value, the probability of this is $\frac{1}{|A| + 1}$.
For our purposes, this means that $\min(H(A))$ is useful as a stable description of $A$.
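You can check the deletion claim directly with the $H$ sketched above: remove each element of a set in turn and count how often the minimum moves.

A = set(range(1000))
m = min(hash_set(A))
# Only the element that produced the minimum changes it when removed,
# so this ratio is exactly 1/|A| = 0.001.
changed = sum(min(hash_set(A - {x})) != m for x in A)
print(changed / len(A))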
Probability of a Match
What is the probability that $\min(H(A)) = \min(H(B))$?

If an element produces the minimum hash in both sets on their own, it also produces the minimum hash in their union. So $\min(H(A)) = \min(H(B))$ if and only if $\min(H(A \cup B)) = \min(H(A)) = \min(H(B))$.
Let $x$ be the member of $A \cup B$ that produces the minimum hash value. The probability that $A$ and $B$ share the minimum hash is equivalent to the probability that $x$ is a member of both $A$ and $B$. Since any element of $A \cup B$ has an equal chance of having the minimum hash value, this becomes

$P\big(\min(H(A)) = \min(H(B))\big) = \frac{|A \cap B|}{|A \cup B|}$
Look familiar? Presto, we now have a simhash.
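If you’d rather see it empirically, run the comparison under many different hash functions and count matches. The salted hash family below is my own convenience for the experiment:

import hashlib

def H_salted(x, salt):
    # One hash function per salt, so each trial is independent.
    data = f"{salt}:{x}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

A = set(range(0, 60))   # |A ∩ B| = 40, |A ∪ B| = 80
B = set(range(20, 80))
trials = 10000
matches = sum(min(H_salted(x, s) for x in A) == min(H_salted(x, s) for x in B)
              for s in range(trials))
print(matches / trials)  # hovers around 40/80 = 0.5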
Tuning for Precision
This may be too generous for your purposes, but it is easy to make it more restrictive. One approach is to repeat the whole process with $k$ independent hash functions, and concatenate the results. This makes the probability of a match

$\left(\frac{|A \cap B|}{|A \cup B|}\right)^k$
I prefer an alternate approach. Use only one hash function, but instead of selecting only the minimum value as the simhash, select the least $k$ values. The probability of a match then becomes

$\frac{\binom{|A \cap B|}{k}}{\binom{|A \cup B|}{k}}$

and if $|A \cup B| \gg k$, this approaches

$\left(\frac{|A \cap B|}{|A \cup B|}\right)^k$
The advantage of this over independent hash functions is that it sets a minimum on the number of members that the two sets must share in order to match. This mitigates the effect of extremely common set members on your clusters. With several independent hash functions, a very common set member that produces low values in a small number of hash functions can cause a huge blowup of the resulting clusters. Selecting the least $k$ values from a single hash function ensures that such a member can only affect one term. It is for this reason that many simhash implementations unrelated to this one take into account the global frequency of each feature, but this complicates their implementation.
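Both expressions are cheap to evaluate with the standard library, and the exact one makes the minimum-overlap guarantee visible (the numbers here are arbitrary):

from math import comb

m, u, k = 40, 80, 4   # |A ∩ B|, |A ∪ B|, number of values kept
exact = comb(m, k) / comb(u, k)
approx = (m / u) ** k
print(exact, approx)  # ≈ 0.0578 vs. 0.0625
# comb(m, k) is 0 whenever m < k: sets sharing fewer than k members
# can never match, no matter how the hashes fall.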
Turning Anything Into a Set
This algorithm works on a set, but the things we’d like to cluster usually aren’t sets. Mapping from one to the other is straightforward if each item has many binary features, but can require some experimentation to get good results. If your items are text documents, you can produce a set using a sliding window of n-grams. I’ve found 3-grams to work well on lyrics, but YMMV. Since there’s no order to the members of the set, it’s important to make them long enough to preserve some of the local structure of the thing you’d like to cluster.
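A shingling helper along these lines might look like the following; windowing over words rather than characters is just one of the choices you’d experiment with:

def split_item_to_set(text, n=3):
    # Slide a window of n consecutive words across the text;
    # each window becomes one member of the set.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

print(split_item_to_set("the quick brown fox jumps"))
# the three word-trigrams of the sentence, in arbitrary order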
Pseudocode
In runnable Python rather than strict pseudocode (split_item_to_set and hash_fn are the feature splitter and hash function discussed above):

import heapq

def simhash(item, restrictiveness):
    # Break the item into a set of binary features (see the previous section).
    features = split_item_to_set(item)
    # Hash every feature; a heap gives cheap access to the smallest values.
    heap = [hash_fn(x) for x in features]
    heapq.heapify(heap)
    # Fold the `restrictiveness` least hash values into a single simhash.
    result = 0
    for _ in range(restrictiveness):
        result ^= heapq.heappop(heap)
    return result
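To put it together, here’s a usage sketch: split_item_to_set comes from the previous section, and hash_fn below is a stand-in I chose for illustration.

import hashlib

def hash_fn(feature):
    # Any uniform integer hash will do; blake2b is just convenient.
    digest = hashlib.blake2b(feature.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

a = simhash("the quick brown fox jumps over the lazy dog", 2)
b = simhash("the quick brown fox jumps over a sleepy dog", 2)
print(a == b)  # True exactly when the two least feature hashes coincide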
Further Reading