Some Notes on unordered_map

Thanks to neal for the article!

Principle

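Some problems really need an \(O(1)\) hash map: map is too slow and gets time-limited, and plain unordered_map can be hacked as well 😃

unordered_map runs in \(O(1)\) average time per operation when the keys spread evenly over the buckets, but specially constructed input can break that bound.

In fact, unordered_map uses separate chaining. It keeps a number of buckets, puts keys whose hash values map to the same bucket index into the same bucket, and on lookup scans the keys in that bucket one by one.

With the default construction, the concrete flow is: the key is passed to std::hash<Key> to compute a hash value, and that value is taken modulo the current number of buckets to get the index of the bucket to insert into.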

struct _Mod_range_hashing
  {
    typedef std::size_t first_argument_type;
    typedef std::size_t second_argument_type;
    typedef std::size_t result_type;

    result_type
    operator()(first_argument_type __num,
	       second_argument_type __den) const noexcept
    { return __num % __den; }
  };

This is the functor that performs the modulo to pick a bucket. The value of __num is passed in through __detail::_Hash_code_base, which contains this line:

const _Hash&
      _M_hash() const { return __ebo_hash::_M_cget(); }

What gets passed here is std::hash<Key>, the default hash function.
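The relation between hash value and bucket index can be checked from user code with the standard bucket interface. The following is a minimal sketch, assuming libstdc++ (where the bucket index is the plain remainder); the helper name bucket_matches_modulo is made up for this illustration.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>

// Hypothetical helper: inserts `count` consecutive keys starting at
// `first_key`, then checks that every key sits in bucket
// hash(key) % bucket_count(). Assumes an implementation such as libstdc++
// that picks the bucket by a plain modulo.
bool bucket_matches_modulo(long long first_key, int count) {
    std::unordered_map<long long, int> mp;
    for (int i = 0; i < count; ++i) mp[first_key + i] = i;

    std::hash<long long> hasher;
    for (const auto& kv : mp) {
        std::size_t expected = hasher(kv.first) % mp.bucket_count();
        if (mp.bucket(kv.first) != expected) return false;
    }
    return true;
}
```

Calling bucket_matches_modulo(1, 1000) should return true under that assumption; other standard library implementations are free to map hash values to buckets differently.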

Next we find _Prime_rehash_policy. Its job is to return a larger bucket count for the rehash once the load exceeds the maximum load factor. Digging further, we reach _M_next_bkt in hashtable_policy.h:

  // Return a prime no smaller than n.
  inline std::size_t
  _Prime_rehash_policy::
  _M_next_bkt(std::size_t __n) const
  {
    // Don't include the last prime in the search, so that anything
    // higher than the second-to-last prime returns a past-the-end
    // iterator that can be dereferenced to get the last prime.
    const unsigned long* __p
      = std::lower_bound(__prime_list, __prime_list + _S_n_primes - 1, __n);
    _M_next_resize = 
      static_cast<std::size_t>(__builtin_ceil(*__p * _M_max_load_factor));
    return *__p;
  }

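It searches the prime table for the next bucket count that is no smaller than the requested value. The prime table lives in hashtable_aux. Notably, there is also a constant _S_growth_factor, which requires the next bucket count to be at least the given multiple of the previous one; its value is usually \(2\).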
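The growth of the bucket count can be observed without reading the library source. Below is a small sketch that records every distinct value of bucket_count() while keys are inserted; the function name bucket_count_history is hypothetical, and the concrete numbers it returns depend on the standard library and its version.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper: inserts the keys 0..n-1 and records each distinct
// bucket count the table goes through, starting with the initial one.
std::vector<std::size_t> bucket_count_history(int n) {
    std::unordered_map<int, int> mp;
    std::vector<std::size_t> history{mp.bucket_count()};
    for (int i = 0; i < n; ++i) {
        mp[i] = i;
        if (mp.bucket_count() != history.back())
            history.push_back(mp.bucket_count());  // a rehash just happened
    }
    return history;
}
```

Printing the result of bucket_count_history(600000) is a quick way to see which bucket counts a given compiler actually passes through on the way to \(6 \times 10^5\) elements.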

Note that in libstdc++, std::hash for integer types returns the value itself and applies no further mixing.
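This is easy to confirm directly. A minimal check, assuming libstdc++ (the C++ standard does not require this behavior, and the name hash_is_identity is invented for the example):

```cpp
#include <cstddef>
#include <functional>

// Hypothetical helper: true if std::hash<long long> leaves x unchanged
// (apart from the conversion to std::size_t).
bool hash_is_identity(long long x) {
    return std::hash<long long>{}(x) == static_cast<std::size_t>(x);
}
```

If this returns true for arbitrary inputs, an attacker knows the hash value of every integer key in advance, which is what makes the construction in the next paragraph possible.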

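During a rehash, every stored key has its bucket index recomputed and is inserted again. Putting it all together: fix a prime \(P\) and keep choosing keys of the form \(A+Px\). Since \((A+Px) \bmod P = A \bmod P\), once the table rehashes to \(P\) buckets all of these keys end up in a single bucket, and a single lookup degrades to \(O(n)\). The prime should be fairly large, though; otherwise the table quickly rehashes to a bigger size and the collisions disappear.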
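The collision effect can be reproduced on a small scale. The sketch below asks the table for at least min_buckets buckets, reads back the actual bucket count \(P\), and inserts multiples of \(P\) (the case \(A=0\)). It assumes an identity std::hash and modulo bucket selection as in libstdc++, and that count stays below \(P\) so no rehash is triggered; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>

// Hypothetical helper: returns the size of the fullest bucket after
// inserting `count` keys that are all multiples of the bucket count.
std::size_t largest_bucket_with_congruent_keys(std::size_t min_buckets,
                                               int count) {
    std::unordered_map<long long, int> mp;
    mp.rehash(min_buckets);  // bucket_count() is now >= min_buckets
    const long long p = static_cast<long long>(mp.bucket_count());

    // Every key is congruent to 0 modulo p.
    for (int i = 1; i <= count; ++i) mp[p * i] = i;

    std::size_t largest = 0;
    for (std::size_t b = 0; b < mp.bucket_count(); ++b)
        largest = std::max(largest, mp.bucket_size(b));
    return largest;
}
```

Under the stated assumptions, largest_bucket_with_congruent_keys(1000, 500) reports that all 500 keys share one bucket, so each insertion has to walk the whole chain.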

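unordered_map has a bucket_count function that returns the current number of buckets. With \(6 \times 10^5\) elements the bucket count reaches \(712697\), so the program below gets hacked (it is already too slow during the insertions):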

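#include<bits/stdc++.h>
using namespace std;
typedef long long ll;
const int N=6e5; // number of keys to insert

// Insert N multiples of x, then look every one of them up again.
// Once the table has x buckets, all of these keys share a single bucket.
void insert(ll x) {
    unordered_map<ll,int>mp;
    for(int i=1;i<=N;i++) mp[1ll*i*x]=i;
    cout<<"bucket_size:"<<mp.bucket_count()<<endl;
    ll sum=0; // checksum over all lookups
    for(int i=1;i<=N;i++) {
        ll value=mp[1ll*i*x];
        sum+=1ll*value*i;
    }
    cout<<sum<<endl;
}

int main() {
    insert(712697); // the bucket count reached at N elements
}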

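Defense

The root of the problem is that std::hash does no extra mixing. Adding a random mapping improves things. One approach is to record the current system timestamp \(t\) and map \(x \gets x+t\). On its own this does not change congruence: every key is shifted by the same \(t\), so keys that were congruent modulo \(P\) stay congruent. We therefore run the result through an xorshift as well, which makes the mapping reasonably random.

Why add the timestamp? Without it the whole mapping is deterministic, so an attacker can work backwards from the hash values to the keys that produce them (the preimages). Some randomization is therefore required.

Do not use time(NULL): its value only changes once per second, so it is easy to guess.

The code is as follows: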

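struct custom_hash{
    // xorshift-style mixing of the bits of x
    static uint64_t gen(uint64_t x) {
        x^=(x<<17);
        x^=(x>>7);
        x^=(x<<11);
        return x;
    }
    size_t operator()(uint64_t x) const {
        // random offset, fixed once per run, taken from a high-resolution clock
        static const uint64_t FIXED_NUMBER = chrono::steady_clock::now().time_since_epoch().count();
        return gen(x+FIXED_NUMBER);
    }
};
// pass custom_hash as the third template argument
unordered_map<ll,int,custom_hash> hash_table;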

Result of the program above after adding custom_hash:

bucket_size:712697
72000180000100000

real    0m0.301s
user    0m0.261s
sys     0m0.040s
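Besides timing, the effect of the custom hash can be checked by looking at bucket sizes directly. The sketch below restates the custom_hash from this post so that it compiles on its own, then inserts keys that are all multiples of the bucket count and reports the fullest bucket. The function name largest_bucket_with_custom_hash is hypothetical, and the exact result varies from run to run because the offset is taken from the clock.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Same hash as in the post, restated to keep this block self-contained.
struct custom_hash {
    static std::uint64_t gen(std::uint64_t x) {
        x ^= (x << 17);
        x ^= (x >> 7);
        x ^= (x << 11);
        return x;
    }
    std::size_t operator()(std::uint64_t x) const {
        static const std::uint64_t FIXED_NUMBER =
            std::chrono::steady_clock::now().time_since_epoch().count();
        return gen(x + FIXED_NUMBER);
    }
};

// Hypothetical helper: size of the fullest bucket after inserting `count`
// multiples of the bucket count into a map that uses custom_hash.
std::size_t largest_bucket_with_custom_hash(std::size_t min_buckets,
                                            int count) {
    std::unordered_map<long long, int, custom_hash> mp;
    mp.rehash(min_buckets);
    const long long p = static_cast<long long>(mp.bucket_count());
    for (int i = 1; i <= count; ++i) mp[p * i] = i;

    std::size_t largest = 0;
    for (std::size_t b = 0; b < mp.bucket_count(); ++b)
        largest = std::max(largest, mp.bucket_size(b));
    return largest;
}
```

With the default hash, the same keys would be expected to pile into a single bucket on libstdc++; with custom_hash the fullest bucket should hold only a handful of them.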

posted @ 2025-05-19 13:11 蒻蒻虫
