About the Implementation of the .NET Framework's Hashtable

From Rotor (Shared Source CLI):
/*
    Implementation Notes:
    Dictionary was copied from Hashtable's source - any bug fixes here
    probably need to be made to Dictionary as well.

    This Hashtable uses double hashing. There are hashsize buckets in the
    table, and each bucket can contain 0 or 1 element. We [use] a bit to mark
    whether there's been a collision when we inserted multiple elements
    (ie, an inserted item was hashed at least a second time and we probed
    this bucket, but it was already in use). Using the collision bit, we
    can terminate lookups & removes for elements that aren't in the hash
    table more quickly. We steal the most significant bit from the hash code
    to store the collision bit.

    Our hash function is of the following form:

    h(key, n) = h1(key) + n*h2(key)

    where n is the number of times we've hit a collided bucket and rehashed
    (on this particular lookup). Here are our hash functions:

    h1(key) = GetHash(key); // default implementation calls key.GetHashCode();
    h2(key) = 1 + (((h1(key) >> 5) + 1) % (hashsize - 1));

    The h1 can return any number. h2 must return a number between 1 and
    hashsize - 1 that is relatively prime to hashsize (not a problem if
    hashsize is prime). (Knuth's Art of Computer Programming, Vol. 3, p. 528-9)
    If this is true, then we are guaranteed to visit every bucket in exactly
    hashsize probes, since the least common multiple of hashsize and h2(key)
    will be hashsize * h2(key). (This is the first number where adding h2 to
    h1 mod hashsize will be 0 and we will search the same bucket twice).

    We previously used a different h2(key, n) that was not constant. That is a
    horrifically bad idea, unless you can prove that series will never produce
    any identical numbers that overlap when you mod them by hashsize, for all
    subranges from i to i+hashsize, for all i. It's not worth investigating,
    since there was no clear benefit from using that hash function, and it was
    broken.

    For efficiency reasons, we've implemented this by storing h1 and h2 in a
    temporary, and setting a variable called seed equal to h1. We do a probe,
    and if we collided, we simply add h2 to seed each time through the loop.

    A good test for h2() is to subclass Hashtable, provide your own implementation
    of GetHash() that returns a constant, then add many items to the hash table.
    Make sure Count equals the number of items you inserted.

    Note that when we remove an item from the hash table, we set the key
    equal to buckets, if there was a collision in this bucket. Otherwise
    we'd either wipe out the collision bit, or we'd still have an item in
    the hash table.
*/
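The probe arithmetic in these notes can be checked with a small model. The sketch below is in Python rather than C# purely to keep it short. `probe_sequence` and the table size 11 are illustrative choices, not names or values from Rotor, and `h1` is assumed to be a non-negative integer (Python integers do not overflow, so any `uint` wraparound is ignored here).

```python
def h2(h1: int, hashsize: int) -> int:
    """Step size as given in the notes: 1 + (((h1 >> 5) + 1) % (hashsize - 1))."""
    return 1 + (((h1 >> 5) + 1) % (hashsize - 1))


def probe_sequence(h1: int, hashsize: int) -> list:
    """Bucket indices probed for n = 0 .. hashsize - 1 (hypothetical helper)."""
    seed = h1
    incr = h2(h1, hashsize)
    visited = []
    for _ in range(hashsize):
        visited.append(seed % hashsize)
        seed += incr  # same trick as the notes: add h2 to seed on each pass
    return visited


HASHSIZE = 11  # assumed prime table size, chosen only for this demo
for h1 in (0, 7, 12345, 0x7FFFFFFF):
    seq = probe_sequence(h1, HASHSIZE)
    # With a prime hashsize, every bucket shows up exactly once.
    assert sorted(seq) == list(range(HASHSIZE))
    print(h1, h2(h1, HASHSIZE), seq)
```

Because h2 always lands between 1 and hashsize - 1, a prime hashsize is enough to make the step relatively prime to the table size, which is why each sequence covers all buckets.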

The Insert method of Hashtable:
// Inserts an entry into this hashtable. This method is called from the Set
// and Add methods. If the add parameter is true and the given key already
// exists in the hashtable, an exception is thrown.
private void Insert (Object key, Object nvalue, bool add) {
    if (key == null) {
        throw new ArgumentNullException("key", Environment.GetResourceString("ArgumentNull_Key"));
    }
    if (count >= loadsize)
        expand();
    uint seed;
    uint incr;
    // Assume we only have one thread writing concurrently. Modify
    // buckets to contain new data, as long as we insert in the right order.
    uint hashcode = InitHash(key, buckets.Length, out seed, out incr);
    int ntry = 0;
    int emptySlotNumber = -1; // We use the empty slot number to cache the first empty slot. We chose to reuse slots
                              // create by remove that have the collision bit set over using up new slots.

    do {
        int bucketNumber = (int) (seed % (uint)buckets.Length);

        if (emptySlotNumber == -1 && (buckets[bucketNumber].key == buckets) && (buckets[bucketNumber].hash_coll < 0))//(((buckets[bucketNumber].hash_coll & unchecked(0x80000000))!=0)))
            emptySlotNumber = bucketNumber;

        //We need to check if the collision bit is set because we have the possibility where the first
        //item in the hash-chain has been deleted.
        if ((buckets[bucketNumber].key == null) ||
            (buckets[bucketNumber].key == buckets && ((buckets[bucketNumber].hash_coll & unchecked(0x80000000))==0))) {
            if (emptySlotNumber != -1) // Reuse slot
                bucketNumber = emptySlotNumber;

            // We pretty much have to insert in this order. Don't set hash
            // code until the value & key are set appropriately.
            buckets[bucketNumber].val = nvalue;
            buckets[bucketNumber].key = key;
            buckets[bucketNumber].hash_coll |= (int) hashcode;
            count++;
            version++;
            return;
        }
        if (((buckets[bucketNumber].hash_coll & 0x7FFFFFFF) == hashcode) &&
            KeyEquals (key, buckets[bucketNumber].key)) {
            if (add) {
                throw new ArgumentException(Environment.GetResourceString("Argument_AddingDuplicate__", buckets[bucketNumber].key, key));
            }
            buckets[bucketNumber].val = nvalue;
            version++;
            return;
        }
        if (emptySlotNumber == -1) // We don't need to set the collision bit here since we already have an empty slot
            buckets[bucketNumber].hash_coll |= unchecked((int)0x80000000);
        seed += incr;
    } while (++ntry < buckets.Length);

    if (emptySlotNumber != -1)
    {
        // We pretty much have to insert in this order. Don't set hash
        // code until the value & key are set appropriately.
        buckets[emptySlotNumber].val = nvalue;
        buckets[emptySlotNumber].key = key;
        buckets[emptySlotNumber].hash_coll |= (int) hashcode;
        count++;
        version++;
        return;
    }

    // If you see this assert, make sure load factor & count are reasonable.
    // Then verify that our double hash function (h2, described at top of file)
    // meets the requirements described above. You should never see this assert.
    BCLDebug.Assert(false, "hash table insert failed! Load factor too high, or our double hashing function is incorrect.");
    throw new InvalidOperationException(Environment.GetResourceString("InvalidOperation_HashInsertFailed"));
}
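To see how the collision bit pays off, here is a deliberately simplified Python model of the insert loop above. It is a sketch, not the CLR's code: `MiniTable`, `hash_fn` and `lookup` are hypothetical names, the collision bit is kept in a separate list instead of the top bit of the hash, and resizing, removal and the `emptySlotNumber` reuse path are left out. The lookup rule (stop at an occupied bucket whose collision bit is clear) is an assumption based on the notes' remark that the bit lets lookups terminate early.

```python
class MiniTable:
    """Toy double-hashing table with a per-bucket collision flag."""

    def __init__(self, size=11, hash_fn=hash):
        self.size = size              # assumed to be prime
        self.hash_fn = hash_fn        # stands in for GetHash(key)
        self.keys = [None] * size
        self.vals = [None] * size
        self.hashes = [0] * size
        self.coll = [False] * size    # stands in for the stolen high bit
        self.count = 0

    def _init_hash(self, key):
        if key is None:
            raise ValueError("key must not be None")
        h = self.hash_fn(key) & 0x7FFFFFFF
        incr = 1 + (((h >> 5) + 1) % (self.size - 1))
        return h, incr

    def insert(self, key, value):
        h, incr = self._init_hash(key)
        seed = h
        for _ in range(self.size):
            b = seed % self.size
            if self.keys[b] is None:          # free bucket: store here
                self.vals[b] = value
                self.keys[b] = key
                self.hashes[b] = h
                self.count += 1
                return
            if self.hashes[b] == h and self.keys[b] == key:
                self.vals[b] = value          # existing key: overwrite
                return
            self.coll[b] = True               # we probed past an occupied bucket
            seed += incr
        raise RuntimeError("insert failed: table is full")

    def lookup(self, key):
        """Return (value or None, number of buckets probed)."""
        h, incr = self._init_hash(key)
        seed = h
        probes = 0
        for _ in range(self.size):
            b = seed % self.size
            probes += 1
            if self.keys[b] is None:
                break
            if self.hashes[b] == h and self.keys[b] == key:
                return self.vals[b], probes
            if not self.coll[b]:              # nothing was ever pushed past here
                break
            seed += incr
        return None, probes


# The test suggested in the notes: a constant hash puts every key on one probe sequence.
t = MiniTable(size=11, hash_fn=lambda key: 42)
for k in range(8):
    t.insert(k, str(k))
assert t.count == 8

# Early exit: bucket 3 holds key 3 with a clear collision flag, so key 14 is rejected in one probe.
u = MiniTable(size=11)
u.insert(3, "three")
print(u.lookup(14))
```

The flag is what lets a miss stop at the first occupied bucket that nobody has ever probed past, instead of walking on until an empty bucket turns up.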

Double Hashing in "Introduction to Algorithms":

Double hashing is one of the best methods available for open addressing because the permutations produced have many of the characteristics of randomly chosen permutations. Double hashing uses a hash function of the form

h(k, i) = (h₁(k) + ih₂(k)) mod m,

where h₁ and h₂ are auxiliary hash functions. The initial probe is to position T[h₁(k)]; successive probe positions are offset from previous positions by the amount h₂(k), modulo m. Thus, unlike the case of linear or quadratic probing, the probe sequence here depends in two ways upon the key k, since the initial probe position, the offset, or both, may vary. Figure 11.5 gives an example of insertion by double hashing.

Figure 11.5: Insertion by double hashing. Here we have a hash table of size 13 with h₁(k) = k mod 13 and h₂(k) = 1 + (k mod 11). Since 14 ≡ 1 (mod 13) and 14 ≡ 3 (mod 11), the key 14 is inserted into empty slot 9, after slots 1 and 5 are examined and found to be occupied.
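The caption's arithmetic can be replayed in a few lines of Python. This is only a sketch: `find_slot` is a hypothetical helper, and the set of occupied slots contains just the two slots the caption names, since the figure itself is not reproduced here.

```python
M = 13  # table size from the caption


def h1(k):
    return k % 13


def h2(k):
    return 1 + (k % 11)


def find_slot(k, occupied):
    """Return (slots examined and found occupied, first free slot) for key k."""
    examined = []
    for i in range(M):
        slot = (h1(k) + i * h2(k)) % M
        if slot not in occupied:
            return examined, slot
        examined.append(slot)
    raise RuntimeError("no free slot")


examined, slot = find_slot(14, occupied={1, 5})
print(examined, slot)  # [1, 5] 9
```

Key 14 starts at h₁(14) = 1 and steps by h₂(14) = 4, so it examines slots 1 and 5 before settling in slot 9.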

The value h₂(k) must be relatively prime to the hash-table size m for the entire hash table to be searched. (See Exercise 11.4-3.) A convenient way to ensure this condition is to let m be a power of 2 and to design h₂ so that it always produces an odd number. Another way is to let m be prime and to design h₂ so that it always returns a positive integer less than m. For example, we could choose m prime and let
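A quick way to convince yourself of the relative-primality requirement is to count how many slots a given step size actually reaches. The sketch below uses a hypothetical helper `slots_reached` and a handful of table sizes and steps picked only for illustration.

```python
from math import gcd


def slots_reached(start, step, m):
    """Set of slots hit by m probes of the form (start + i*step) mod m."""
    return {(start + i * step) % m for i in range(m)}


for m, step in [(8, 3), (8, 4), (8, 6), (13, 4), (12, 8)]:
    reached = slots_reached(0, step, m)
    # Only m / gcd(step, m) slots are reachable; all m only when the gcd is 1.
    assert len(reached) == m // gcd(step, m)
    print(f"m={m} step={step} gcd={gcd(step, m)} reached={sorted(reached)}")
```

With m = 8, the odd step 3 reaches all eight slots while the even step 4 bounces between just two of them, which is why the power-of-2 variant needs h₂ to be odd.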

h₁(k) = k mod m,
h₂(k) = 1 + (k mod m'),

where m' is chosen to be slightly less than m (say, m - 1). For example, if k = 123456, m = 701, and m' = 700, we have h₁(k) = 80 and h₂(k) = 257, so the first probe is to position 80, and then every 257th slot (modulo m) is examined until the key is found or every slot is examined.
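The numbers in this example are easy to verify. The short Python check below recomputes h₁ and h₂ for k = 123456 and lists the first few probe positions; the variable names are just for illustration.

```python
k, m, m_prime = 123456, 701, 700

h1 = k % m                # initial probe position
h2 = 1 + (k % m_prime)    # step between successive probes
assert (h1, h2) == (80, 257)

probes = [(h1 + i * h2) % m for i in range(m)]
print(probes[:5])  # [80, 337, 594, 150, 407]

# m = 701 is prime, so the m probes touch every slot exactly once.
assert len(set(probes)) == m
```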

Double hashing improves over linear or quadratic probing in that Θ(m²) probe sequences are used, rather than Θ(m), since each possible (h₁(k), h₂(k)) pair yields a distinct probe sequence. As a result, the performance of double hashing appears to be very close to the performance of the "ideal" scheme of uniform hashing.
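The gap between Θ(m) and Θ(m²) shows up even for a tiny table. The following sketch enumerates every probe sequence for m = 7, assuming h₂ ranges over 1 to m - 1 as in the prime-m scheme above; `linear_sequence` and `double_sequence` are hypothetical helpers.

```python
def linear_sequence(start, m):
    """Probe order for linear probing from a given start slot."""
    return tuple((start + i) % m for i in range(m))


def double_sequence(start, step, m):
    """Probe order for double hashing with the given (h1, h2) pair."""
    return tuple((start + i * step) % m for i in range(m))


m = 7
linear = {linear_sequence(s, m) for s in range(m)}
double = {double_sequence(s, t, m) for s in range(m) for t in range(1, m)}

print(len(linear), len(double))  # 7 42
```

Linear probing can only produce m different sequences, one per starting slot, while each of the m(m - 1) (h₁, h₂) pairs gives double hashing its own sequence.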

Posted on 2006-11-12 17:25 by Adrian H.

Adrian's Tech Blog
