4-byte hash factor

Reposted from: http://ddrv.cn/a/61882

Incomplete.

4-byte Integer Hashing

The hashes on this page (with the possible exception of HashMap.java’s) are all public domain. So are the ones on Thomas Wang’s page. Thomas recommends citing the author and page when using them.

Thomas Wang has an integer hash using multiplication that’s faster than any of mine on my Core 2 Duo using gcc -O3, and it passes my favorite sanity tests well. I’ve had reports it doesn’t do well with integer sequences with a multiple of 34.

 

uint32_t hash( uint32_t a)
{
    a = (a ^ 61) ^ (a >> 16);
    a = a + (a << 3);
    a = a ^ (a >> 4);
    a = a * 0x27d4eb2d;
    a = a ^ (a >> 15);
    return a;
}
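For a power-of-two table, the bucket index from a hash like this is usually taken by masking the low bits. A minimal sketch (the SIZE constant and bucket helper are illustrative, not part of the original):

```c
#include <stdint.h>

/* Thomas Wang's multiplicative hash, as above */
static uint32_t hash(uint32_t a)
{
    a = (a ^ 61) ^ (a >> 16);
    a = a + (a << 3);
    a = a ^ (a >> 4);
    a = a * 0x27d4eb2d;
    a = a ^ (a >> 15);
    return a;
}

/* SIZE must be a power of two so the mask keeps exactly the low bits. */
#define SIZE 1024u
static uint32_t bucket(uint32_t key)
{
    return hash(key) & (SIZE - 1);
}
```

Note that every step here (xor with a constant, xor-shift, add-shift, multiply by an odd constant) is invertible, so the hash is a permutation of the 32-bit integers: distinct keys always get distinct hash values, though not necessarily distinct buckets.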

Thomas has 64-bit integer hashes too. I don’t have any of those yet.

Here’s a way to do it in 6 shifts:

 

uint32_t hash( uint32_t a)
{
    a = (a+0x7ed55d16) + (a<<12);
    a = (a^0xc761c23c) ^ (a>>19);
    a = (a+0x165667b1) + (a<<5);
    a = (a+0xd3a2646c) ^ (a<<9);
    a = (a+0xfd7046c5) + (a<<3);
    a = (a^0xb55a4f09) ^ (a>>16);
    return a;
}

  

Or 7 shifts, if you don’t like adding those big magic constants:

uint32_t hash( uint32_t a)
{
    a -= (a<<6);
    a ^= (a>>17);
    a -= (a<<9);
    a ^= (a<<4);
    a -= (a<<3);
    a ^= (a<<10);
    a ^= (a>>15);
    return a;
}

Thomas Wang has a function that does it in 6 shifts (provided you use the low bits, hash & (SIZE-1), rather than the high bits if you can’t use the whole value):

uint32_t hashint( uint32_t a)
{
    a += ~(a<<15);
    a ^=  (a>>10);
    a +=  (a<<3);
    a ^=  (a>>6);
    a += ~(a<<11);
    a ^=  (a>>16);
    return a;
}

Here’s a 5-shift one where you have to use the high bits, hash >> (32-logSize), because the low bits are hardly mixed at all:

uint32_t hashint( uint32_t a)
{
    a = (a+0x479ab41d) + (a<<8);
    a = (a^0xe4aa10ce) ^ (a>>5);
    a = (a+0x9942f0a6) - (a<<14);
    a = (a^0x5aedd67d) ^ (a>>3);
    a = (a+0x17bea992) + (a<<7);
    return a;
}
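When a hash only mixes well into the high bits, the bucket index comes from a right shift rather than a mask. A sketch, assuming a table of 2^logSize buckets (the helper name is mine, not the original's):

```c
#include <stdint.h>

/* The 5-shift high-bits hash from above */
static uint32_t hashint(uint32_t a)
{
    a = (a+0x479ab41d) + (a<<8);
    a = (a^0xe4aa10ce) ^ (a>>5);
    a = (a+0x9942f0a6) - (a<<14);
    a = (a^0x5aedd67d) ^ (a>>3);
    a = (a+0x17bea992) + (a<<7);
    return a;
}

/* Use the high logSize bits: hash >> (32 - logSize).
   logSize must be between 1 and 31, since shifting a
   32-bit value by 32 is undefined behavior in C. */
static uint32_t bucket_high(uint32_t key, unsigned logSize)
{
    return hashint(key) >> (32 - logSize);
}
```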

Here’s one that takes 4 shifts. You need to use the bottom bits, and you need to use at least the bottom 11 bits. It doesn’t achieve avalanche at the high or the low end. It does pass my integer sequence tests, and all settings of any set of 4 bits usually map to 16 distinct values in the bottom 11 bits.

uint32_t hashint( uint32_t a)
{
    a = (a^0xdeadbeef) + (a<<4);
    a = a ^ (a>>10);
    a = a + (a<<7);
    a = a ^ (a>>13);
    return a;
}

 

And this one isn’t too bad, provided you promise to use at least the 17 lowest bits. Passes the integer sequence and 4-bit tests.

uint32_t hashint( uint32_t a)
{
    a = a ^ (a>>4);
    a = (a^0xdeadbeef) + (a<<5);
    a = a ^ (a>>11);
    return a;
}

More Wordy Stuff

Adam Zell points out that this hash is used by HashMap.java:

private static int newHash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

Although this hash leaves something to be desired (h & 2047 uses only 1/8 of the buckets for sequences incremented by 8), it does bring up an interesting point: full avalanche is stronger than what you really need for a hash function. All you really need is that the entropy in your keys be represented in the bits of the hash value that you use. Often you can show this by matching every differing input bit to a distinct bit that it changes in the portion of the hash value that you use.

One very non-avalanchy example of this is CRC hashing: every input bit affects only some output bits, the ones it affects it changes 100% of the time, and every input bit affects a different set of output bits. If the input bits that differ can be matched to distinct bits that you use in the hash value, you’re golden. Otherwise you’re not.
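One way to check claims like these is to measure, over random bases, how often flipping input bit i flips output bit j; the avalanche tables further down were produced by that kind of experiment. A rough sketch of such a measurement (the function names and trial count are my own):

```c
#include <stdint.h>
#include <stdlib.h>

/* Count how many times out of `trials` flipping input bit i
   flips output bit j of hash function f, over random bases.
   Full avalanche would put every count near trials/2. */
static int flips(uint32_t (*f)(uint32_t), int i, int j, int trials)
{
    int count = 0;
    for (int t = 0; t < trials; t++) {
        uint32_t base = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        uint32_t d = f(base) ^ f(base ^ (1u << i));
        if (d & (1u << j))
            count++;
    }
    return count;
}

/* Sanity check: with the identity function, input bit i
   affects output bit i always, and every other bit never. */
static uint32_t identity(uint32_t a) { return a; }
```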

4-byte integer hash, half avalanche

Full avalanche says that differences in any input bit can cause differences in any output bit. A weaker property is also good enough for integer hashes if you always use the high bits of a hash value: every input bit affects its own position and every higher position. I’ll call this half avalanche. (Multiplication is like this, in that every bit affects only itself and higher bits. But multiplication can’t cause every bit to affect EVERY higher bit, especially if you measure “affect” by both – and ^.) Half-avalanche is sufficient: if you use the high n bits and hash 2^n keys that cover all possible values of n input bits, all those bit positions will affect all n high bits, so you can reach up to 2^n distinct hash values. It’s also sometimes necessary: if you use the high n+1 bits, and the high n input bits only affect their position and greater, and you take the 2^(n+1) keys differing in the high n bits plus one other bit, then the only way to get over 2^n hash values is if that one other input bit affects position n+1 from the top. That argument holds for every n up to that bit’s own position, so it has to affect itself and all higher bits.

Actually, that wasn’t quite right. Half-avalanche says that an input bit will change its output bit (and all higher output bits) half the time. But if the later output bits are all dedicated to representing other input bits, you want this output bit to be affected 100% of the time by this input bit, not 50% of the time. This doesn’t entirely kill the idea, though. If every bit affects itself and all higher bits, plus a couple of lower bits, and you use just the high-order bits, then the lowest high-order bit you use still contains entropy from several differing input bits. So it might work. Hmm. Better check how this does in practice!

Similarly for low-order bits, it would be enough for every input bit to affect only its own position and all lower bits in the output (plus the next few higher ones). Half-avalanche is easier to achieve for high-order bits than low-order bits because a*=k (for odd k), a+=(a<<k), a-=(a<<k), a^=(a<<k) are all permutations that affect higher bits, but only a^=(a>>k) is a permutation that affects lower bits. (There’s also table lookup, but unless you get a lot of parallelism that’s going to be slower than shifts.)
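Each of the operations listed above is a permutation because it amounts to multiplying by an odd constant mod 2^32 (or is a xor-shift, which is a bijection in its own right). For example, a += (a<<3) is the same as a *= 9, and since 9 is odd it has a modular inverse. A sketch of undoing that step (the inverse constant is 9^-1 mod 2^32; helper names are mine):

```c
#include <stdint.h>

/* a + (a<<3) is the same as a*9 (mod 2^32). */
static uint32_t step(uint32_t a)   { return a + (a << 3); }

/* 9 * 0x38e38e39 == 2^33 + 1 == 1 (mod 2^32),
   so multiplying by 0x38e38e39 undoes step(). */
static uint32_t unstep(uint32_t a) { return a * 0x38e38e39u; }
```

Because each step is invertible, the composed hashes lose no information; they only decide *where* in the 32-bit output the input entropy ends up.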

Here’s a 5-shift function that does half-avalanche in the high bits:

uint32_t half_avalanche( uint32_t a)
{
    a = (a+0x479ab41d) + (a<<8);
    a = (a^0xe4aa10ce) ^ (a>>5);
    a = (a+0x9942f0a6) - (a<<14);
    a = (a^0x5aedd67d) ^ (a>>3);
    a = (a+0x17bea992) + (a<<7);
    return a;
}

 

Every input bit affects itself and all higher output bits, plus a few lower output bits. I hashed sequences of n consecutive integers into an n-bucket hash table, for n being the powers of two 2^1 .. 2^20, starting at 0, incremented by odd numbers 1..15, and it did OK for all of them. Also, for “differ” defined by +, -, ^, or ^~, for nearly-zero or random bases, inputs that differ in any bit or pair of input bits will change each equal or higher output bit position between 1/4 and 3/4 of the time. Here’s a table of how the ith input bit (rows) affects the jth output bit (columns) in that hash (single bit differences, “differ” defined as ^, with a random base):

 

51	46	48	51	55	52	45	51	53	50	50	50	50	50	49	50	50	51	50	50	50	49	50	50	51	50	56	50	44	65	50	44
51	32	46	52	54	55	55	51	45	51	50	53	51	50	50	46	50	50	53	51	50	50	48	50	50	52	50	56	50	44	65	50
52	47	38	50	50	55	48	49	51	47	50	50	51	51	50	50	47	50	50	53	50	50	50	48	50	50	52	50	56	50	44	65
85	60	43	33	51	48	57	51	51	50	45	51	53	53	50	50	50	45	50	51	55	50	50	50	47	50	50	53	50	56	50	44
15	93	58	45	32	50	48	58	50	51	50	50	49	50	50	50	50	50	50	50	51	50	50	50	50	47	50	50	53	50	56	50
54	61	62	53	51	39	49	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	49	50	50	51	50	56
51	68	40	52	54	40	38	50	51	51	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	48	50	50	52	50
51	32	81	53	55	42	48	50	50	53	50	50	49	50	50	49	50	50	51	50	50	50	50	50	50	50	50	50	47	50	50	50
100	50	44	64	55	54	45	54	50	50	53	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	46	50	50
0	100	50	47	60	55	51	49	63	51	49	50	50	50	50	50	50	49	50	50	50	50	50	50	50	50	50	50	50	50	42	50
0	0	100	50	48	64	50	49	51	61	50	50	52	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	47	42
0	0	0	100	51	43	63	49	58	50	60	50	50	53	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	48
0	0	0	0	100	50	48	60	50	54	49	49	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50
0	0	0	0	0	100	50	45	70	50	54	51	56	50	50	52	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50	50
0	0	0	0	0	0	100	50	37	74	50	59	50	56	50	50	52	50	50	50	50	50	50	50	50	50	50	50	50	51	50	50
0	0	0	0	0	0	0	100	50	38	72	50	60	50	56	50	50	52	50	50	50	50	50	50	50	50	50	50	50	50	50	50
0	0	0	0	0	0	0	0	100	50	40	67	50	61	50	56	50	50	52	50	50	50	50	50	50	50	50	50	50	50	50	50
0	0	0	0	0	0	0	0	0	100	50	50	62	50	51	50	55	50	50	51	50	50	50	50	50	50	50	50	50	50	50	51
0	0	0	0	0	0	0	0	0	0	100	50	43	67	50	57	50	56	50	50	52	50	50	50	50	50	50	50	51	50	53	50
0	0	0	0	0	0	0	0	0	0	0	100	50	44	66	50	56	50	56	50	50	52	50	50	50	50	50	50	50	50	50	52
0	0	0	0	0	0	0	0	0	0	0	0	100	50	44	66	50	57	50	56	50	50	52	50	50	50	50	50	50	50	50	50
0	0	0	0	0	0	0	0	0	0	0	0	0	100	50	44	66	50	57	50	56	50	50	51	50	50	50	50	50	50	50	50
0	0	0	0	0	0	0	0	0	0	0	0	0	0	100	50	44	66	50	57	50	56	50	50	52	50	50	50	50	45	50	50
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	100	50	44	66	50	57	50	56	50	50	53	50	50	50	50	40	50
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	100	50	44	66	50	57	50	56	50	50	54	50	55	50	44	66
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	100	50	44	66	50	57	50	56	50	50	54	50	56	50	44
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	100	50	44	66	50	57	50	56	50	50	54	50	55	50
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	100	50	44	66	50	57	50	60	50	50	55	50	55
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	100	50	44	66	50	57	50	52	50	51	53	50
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	100	50	44	66	50	57	50	61	50	50	51
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	100	50	44	70	50	57	50	62	50	50
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	100	50	38	75	50	62	50	73	50

If you use high-order bits for hash values, adding a bit to the hash value to double the size of the hash table will add a low-order bit, so old bucket 0 maps to the new 0,1, old bucket 1 maps to the new 2,3, and so forth. They overlap. It’s not as nice as the low-order bits, where the new buckets are all beyond the end of the old table. Also, using the n high-order bits is done by (a>>(32-n)) instead of (a&((1<<n)-1)); note that >> takes 2 cycles while & takes only 1. But, on the plus side, if you use high-order bits for buckets and order keys inside a bucket by the full hash value, and you split a bucket, all the keys in the low bucket precede all the keys in the high bucket (Shalev ’03, split-ordered lists). Incrementally splitting the table is still feasible if you split high buckets before low buckets; that way old buckets will be empty by the time new buckets take their place.
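The doubling behavior described above can be seen directly: with high-order buckets, adding one bit to the index splits each old bucket b into new buckets 2b and 2b+1. A small sketch (the helper name is mine):

```c
#include <stdint.h>

/* Bucket index from the high n bits of a hash value h.
   n must be between 1 and 31 (a 32-bit shift by 32 is UB). */
static uint32_t high_bucket(uint32_t h, unsigned n)
{
    return h >> (32 - n);
}
```

Growing the table from 2^n to 2^(n+1) buckets, high_bucket(h, n+1) is always either 2*b or 2*b+1 where b = high_bucket(h, n), which is exactly the overlap the text describes.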

4-byte integer hash, full avalanche

I was able to do it in 6 shifts.

uint32_t hash( uint32_t a)
{
    a = (a+0x7ed55d16) + (a<<12);
    a = (a^0xc761c23c) ^ (a>>19);
    a = (a+0x165667b1) + (a<<5);
    a = (a+0xd3a2646c) ^ (a<<9);
    a = (a+0xfd7046c5) + (a<<3);
    a = (a^0xb55a4f09) ^ (a>>16);
    return a;
}

These magic constants also worked: 0x7fb9b1ee, 0xab35dd63, 0x41ed960d, 0xc7d0125e, 0x071f9f8f, 0x55ab55b9 .

For one or two bit diffs, for “diff” defined as subtraction or xor, for random or nearly-zero bases, every output bit changes with probability between 1/4 and 3/4. I also hashed integer sequences incremented by odd 1..31 times powers of two; low bits did marvelously, high bits did sorta OK. Here’s the table for one-bit diffs on random bases with “diff” defined as XOR:

 

Posted 2020-08-11 11:53 by XZHDJH