#1. String Hashing

"A very simple, easy-to-follow tutorial on basic string hashing"

String Hashing

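0x01. What Is Hashing

Definition (from OI wiki)

We define a function $f$ that maps strings to integers; this $f$ is called a $Hash$ function.
We want $f$ to make it easy to decide whether two strings are equal.

Put simply:

Represent each string as a number computed in a specific way, then compare the numbers directly to decide whether the strings are equal or different. Different numbers guarantee different strings; equal numbers mean the strings are equal with high probability.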

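0x02. Why Hash

Complexity optimization

Let $L$ be the string length. Comparing two strings directly is clearly $O(L)$. If each string is first mapped to an integer in a specific way (an $O(L)$ cost paid once per string), every later comparison takes $O(1)$, which saves a lot of time (the classic space-for-time trade!).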

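0x03. "A Specific Way"

Let $S[i]$ denote the $i$-th character of the string and $A(x)$ the $ASCII$ value of character $x$.

A very crude first idea:

Following the definition, multiply each $A(S[i])$ by a positive integer $Base$ and add up the results.

hash += s[i] * Base

Too weak.

Hack: whenever $A(S[m])+A(S[n])=A(S[i])$, the two characters $S[m]S[n]$ hash to the same value as the single character $S[i]$. In fact, any two strings made of the same characters in a different order already break this hash.

This situation is what we call a hash collision.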
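To make the hack concrete, here is a minimal sketch of the crude scheme above. The function name `weak_hash` and the value 133 for `Base` are illustrative assumptions, not something fixed by the text.

```cpp
#include <string>

// Illustrative only: the crude scheme, summing A(S[i]) * Base over all
// positions. `weak_hash` is a hypothetical name; Base = 133 is arbitrary.
const long long Base = 133;

long long weak_hash(const std::string& s) {
    long long h = 0;
    for (char c : s) {
        h += static_cast<long long>(c) * Base;  // the position of c is ignored
    }
    return h;
}
```

Because only the sum of character codes matters, `weak_hash("ab")` equals `weak_hash("ba")`, and `weak_hash("A!")` equals `weak_hash("b")` since $65+33=98$.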

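0x04. Hash Collisions

Definition (from OI wiki)

A $Hash$ collision occurs when two different strings map to the same $Hash$ value.
Let $d$ be the size of the $Hash$ value space (the number of distinct values the $Hash$ function can produce) and $n$ the number of strings to be hashed.
Assuming the hash values are uniformly distributed, the probability of at least one $Hash$ collision is approximately $p(n,d)\approx 1-\exp\left(-\frac{n(n-1)}{2d}\right)$

Beginners need not study this formula in depth; the original post links to its proof.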
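To get a feel for the numbers, the formula can be evaluated directly. The sketch below is only an illustration: `collision_probability` is a hypothetical helper name, and it assumes hash values behave as if uniformly distributed.

```cpp
#include <cmath>

// Approximate probability that at least two of n strings collide when
// hash values are spread uniformly over d possible values:
//   p(n, d) = 1 - exp(-n(n-1) / (2d))
// std::expm1(x) computes exp(x) - 1 accurately even for tiny x.
double collision_probability(double n, double d) {
    return -std::expm1(-n * (n - 1.0) / (2.0 * d));
}
```

For example, with $n=10^4$ strings and $d=10^9+7$ the estimate is a little under 5%, while with $n=2\times10^5$ strings and the same $d$ a collision is almost certain.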

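0x05. A Commonly Used Hash

To make the same character at different positions, and different characters at the same position, contribute different $Hash$ values as far as possible, the following was proposed:

Polynomial Hash:

$Hash(S)=A(S[1])*Base^{L-1}+A(S[2])*Base^{L-2}+...+A(S[L])*Base^{0}$

Implementation:

const int Base = 133; // a compile-time constant lets the compiler optimize the multiplication
int get_Hash(string S, int L) {
    int ans = 0;
    for (int i = 0; i < L; i++) {
       ans = ans * Base + S[i]; // shift earlier characters up one power of Base, then add S[i] (0-indexed)
    }
    return ans; // overflows int for long strings; see 0x06
}

*Base is best chosen as a prime close to the number of distinct characters the strings may contain, which helps avoid $Hash$ collisions.
E.g., if the strings contain upper- and lower-case letters and digits, there are $26+26+10=62$ distinct characters, so $61$ (or $67$) can serve as $Base$ once the characters are mapped to $1..62$. If the raw $ASCII$ codes are used as digits, as in the code above, $Base$ should exceed the largest code, so pick a prime just above the $ASCII$ upper bound $127$, such as $131$ or $137$. The value $133$ used in the code in this post is above $127$ but is not itself prime ($133=7\times 19$).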
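The loop `ans = ans * Base + S[i]` evaluates the polynomial so that the first character carries the highest power of $Base$. The sketch below checks this against a term-by-term evaluation. The names `poly_hash_horner` and `poly_hash_direct` are illustrative, and `long long` without a modulus is an assumption that only holds for short strings (up to 8 characters with $Base=133$).

```cpp
#include <string>

const long long Base = 133;

// Same loop as get_Hash: multiply the running value by Base, add the next character.
long long poly_hash_horner(const std::string& s) {
    long long h = 0;
    for (char c : s) h = h * Base + c;
    return h;
}

// Term-by-term evaluation: the last character gets Base^0,
// the one before it Base^1, and so on up to the first character.
long long poly_hash_direct(const std::string& s) {
    long long h = 0, power = 1;
    for (int i = static_cast<int>(s.size()) - 1; i >= 0; --i) {
        h += s[i] * power;
        power *= Base;
    }
    return h;
}
```

With this scheme "ab" gives $97\times133+98=12999$ and "ba" gives $98\times133+97=13131$, so reordering the characters now changes the value.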

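0x06. The Overflow Problem

In practice, the polynomial $Hash$ of a sufficiently long string quickly exceeds the range of int, and even of long long.
Consider the following solutions:

1. Natural-overflow Hash

Store the $Hash$ value as an unsigned long long. Unsigned arithmetic wraps around on overflow, so every operation is automatically taken $\bmod 2^{64}$.

Implementation:

const int Base = 133;
unsigned long long get_Hash(string S, int L) {
    unsigned long long ans = 0;
    for (int i = 0; i < L; i++) {
       ans = ans * Base + S[i]; // wraps modulo 2^64 on overflow
    }
    return ans;
}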
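The wrap-around behaviour that natural-overflow hashing relies on can be seen in isolation. This is a small sketch, assuming `unsigned long long` is exactly 64 bits wide (true on mainstream platforms); the names `kMax`, `wrap_demo_add` and `wrap_demo_mul` are made up for illustration.

```cpp
#include <limits>

// Largest value an unsigned long long can hold: 2^64 - 1 on 64-bit types.
const unsigned long long kMax = std::numeric_limits<unsigned long long>::max();

// (2^64 - 1) + 1 wraps to 0: the sum is taken modulo 2^64.
unsigned long long wrap_demo_add() { return kMax + 1ULL; }

// 2^32 * 2^32 = 2^64, which is congruent to 0 modulo 2^64.
unsigned long long wrap_demo_mul() { return (1ULL << 32) * (1ULL << 32); }
```

Unlike signed overflow, which is undefined behaviour in C++, this wrap-around is well defined, which is why the hash can simply be allowed to overflow.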

2. Large-modulus Hash

Reduce the $Hash$ value modulo a large prime (such as $998244353$ or $10^9+7$) after every step.

Implementation:

const int Base = 133;
const int Mod = 1e9 + 7;
int get_Hash(string S, int L) {
    int ans = 0;
    for (int i = 0; i < L; i++) {
       ans = ((long long)ans * Base + S[i]) % Mod; // cast ans first: ans * Base can exceed the int range, so multiply in long long
    }
    return ans;
}
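A quick bit of arithmetic shows why the product has to be computed in long long. The sketch below assumes a 32-bit `int` and characters no larger than the $ASCII$ bound 127; `worst_case_intermediate` and `fits_in_int` are hypothetical helper names.

```cpp
#include <climits>

const long long Base = 133;
const long long Mod = 1000000007;

// Largest value ans * Base + S[i] can reach when 0 <= ans < Mod
// and S[i] <= 127 (the ASCII upper bound).
long long worst_case_intermediate() {
    return (Mod - 1) * Base + 127;
}

// True if v is representable as an int.
bool fits_in_int(long long v) {
    return v >= INT_MIN && v <= INT_MAX;
}
```

The worst case is $1000000006\times133+127=133000000925$, far beyond what a 32-bit int can hold but comfortably inside long long, so the multiplication must already happen in the wider type and only the reduced result goes back into an int.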

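0x07. Hash Killer

In fact, the $Hash$ schemes above are not foolproof; deliberately constructed data can still break them.
In short, an adversary uses the collision formula from 0x04. $Hash$ collisions to force a collision on purpose: for a single modulus near $10^9$, the formula says a few tens of thousands of strings already make a collision likely.
See the OI wiki sections on breaking large-modulus Hash and breaking natural-overflow Hash.
So how can we improve the Hash to reduce the collision rate further?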
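The collision formula can be turned into a simple experiment: draw random strings until two different ones share a single large-modulus hash. The following is a rough sketch of that birthday-style search, not the construction described on OI wiki. The names `mod_hash` and `find_collision`, the fixed seed, the string length 10 and the try limit are all assumptions chosen for illustration.

```cpp
#include <random>
#include <string>
#include <unordered_map>
#include <utility>

const int Base = 133;
const int Mod = 1000000007;

// Single large-modulus polynomial hash, computed in long long.
int mod_hash(const std::string& s) {
    long long h = 0;
    for (unsigned char c : s) h = (h * Base + c) % Mod;
    return static_cast<int>(h);
}

// Draw random lowercase strings until two different ones share a hash.
// Returns the colliding pair, or two empty strings if max_tries runs out.
std::pair<std::string, std::string> find_collision(int max_tries = 2000000) {
    std::mt19937 rng(12345);                     // fixed seed for repeatability
    std::unordered_map<int, std::string> seen;   // hash value -> first string seen
    for (int t = 0; t < max_tries; ++t) {
        std::string s(10, 'a');
        for (char& c : s) c = static_cast<char>('a' + rng() % 26);
        int h = mod_hash(s);
        auto it = seen.find(h);
        if (it != seen.end() && it->second != s) return {it->second, s};
        seen.emplace(h, s);
    }
    return {"", ""};
}
```

With about $10^9$ possible hash values, a pair typically turns up after some tens of thousands of draws, which is why a single modulus of this size cannot be trusted against adversarial data.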

0x08. Double Hash

Run $get\_hash$ twice, each time with a different $Base$ and $Mod$; two strings are treated as equal only if both $Hash$ values match.

Implementation:

const int Base1 = 133, Base2 = 137;
const int Mod1 = 1e9 + 7, Mod2 = 998244353;
int get_Hash_1(string S, int L) {
    int ans = 0;
    for (int i = 0; i < L; i++) {
       ans = ((long long)ans * Base1 + S[i]) % Mod1; // cast before multiplying to avoid int overflow
    }
    return ans;
}
int get_Hash_2(string S, int L) {
    int ans = 0;
    for (int i = 0; i < L; i++) {
       ans = ((long long)ans * Base2 + S[i]) % Mod2; // cast before multiplying to avoid int overflow
    }
    return ans;
}
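When the two values are always compared together, it can be convenient to pack them into one 64-bit key. This is one possible sketch, not part of the original code: `double_hash` is a hypothetical name, and the packing `h1 * Mod2 + h2` is just one way to encode the pair.

```cpp
#include <string>

const int Base1 = 133, Base2 = 137;
const int Mod1 = 1000000007, Mod2 = 998244353;

// Compute both hashes in one pass and pack them into a single key.
// Since 0 <= h2 < Mod2, the key h1 * Mod2 + h2 identifies the pair (h1, h2)
// uniquely, and it stays below Mod1 * Mod2 (about 10^18), so it fits in 64 bits.
unsigned long long double_hash(const std::string& s) {
    long long h1 = 0, h2 = 0;
    for (unsigned char c : s) {
        h1 = (h1 * Base1 + c) % Mod1;
        h2 = (h2 * Base2 + c) % Mod2;
    }
    return static_cast<unsigned long long>(h1) * Mod2 + h2;
}
```

If the two hashes behave roughly independently, the effective value space grows to about $Mod1\times Mod2\approx10^{18}$, which by the formula in 0x04 makes accidental collisions among $10^4$ strings extremely unlikely.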

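0x09. Example Problems

At this point, string $Hash$ is good enough for typical algorithm problems. Let's look at an example.

P3370 【模板】字符串哈希 (Template: String Hashing)

Analysis: a hashing template problem. The statement does not say whether $Hash$-breaking data is included, so try a single $Hash$ first.

It got hacked...
So there is nothing for it but to write a double $Hash$.

Implementation (brute-force matching):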

#include<bits/stdc++.h>
using namespace std;
const int mod1 = 1e9 + 7, mod2 = 998244353;
const int base1 = 137, base2 = 131;
// Polynomial hash of the first L characters of s, modulo mod1
int get_hash_1(const string& s, int L) {
    int ans = 0;
    for (int i = 0; i < L; i++) {
        ans = ((long long)ans * base1 + s[i]) % mod1;
    }
    return ans;
}
// Same hash with a different base and modulus
int get_hash_2(const string& s, int L) {
    int ans = 0;
    for (int i = 0; i < L; i++) {
        ans = ((long long)ans * base2 + s[i]) % mod2;
    }
    return ans;
}
string str;
int ans = 0, n;
int s_hash1[10005], s_hash2[10005];
int main() {
    scanf("%d", &n);
    for (int i = 1; i <= n; i++) {
        cin >> str;
        s_hash1[i] = get_hash_1(str, str.length());
        s_hash2[i] = get_hash_2(str, str.length());
        // Brute force: compare against every earlier string; both hashes must match
        bool fg = true;
        for (int j = i - 1; j >= 1; j--) {
            if (s_hash1[i] == s_hash1[j] && s_hash2[i] == s_hash2[j]) { fg = false; break; }
        }
        if (fg) ++ans; // no earlier string has the same hash pair, so count it as new
    }
    printf("%d", ans);
    return 0;
}

With double hashing, the solution gets AC!
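The brute-force matching compares every pair of hash values, which is $O(N^2)$ comparisons. One possible alternative, sketched here under the assumption that the two hashes are packed into a single 64-bit key, is to sort the keys and count the distinct ones. `double_hash` and `count_distinct` are illustrative names, not part of the solution above.

```cpp
#include <algorithm>
#include <string>
#include <vector>

const int Base1 = 137, Base2 = 131;
const int Mod1 = 1000000007, Mod2 = 998244353;

// Pack the two modular hashes into one key (h2 < Mod2, so the pair is unique).
unsigned long long double_hash(const std::string& s) {
    long long h1 = 0, h2 = 0;
    for (unsigned char c : s) {
        h1 = (h1 * Base1 + c) % Mod1;
        h2 = (h2 * Base2 + c) % Mod2;
    }
    return static_cast<unsigned long long>(h1) * Mod2 + h2;
}

// Count strings with distinct hash keys: sort, then drop adjacent duplicates.
int count_distinct(const std::vector<std::string>& strs) {
    std::vector<unsigned long long> keys;
    keys.reserve(strs.size());
    for (const std::string& s : strs) keys.push_back(double_hash(s));
    std::sort(keys.begin(), keys.end());
    return static_cast<int>(std::unique(keys.begin(), keys.end()) - keys.begin());
}
```

To use it for the problem, read the $N$ strings into a vector and print `count_distinct(strs)`; the sort brings the cost of the counting step down to $O(N\log N)$.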

posted @ 2024-08-18 22:06 Ydoc770
