Can a perfect hash be used for word segmentation?

CMPH - C Minimal Perfect Hashing Library

Installation is easy. apt list shows: libcmph0 libcmph-dev libcmph-tools

The test data is 229837 words, running from 意义, 一一, ... to 中航技进出口有限责任公司, about 2MB in total.

On an Intel N100, cmph -g words took only 0.142 seconds, which surprised me. The generated words.mph is 1.9M.

Then run the lookups with cmph -v -m words.mph query.

Note that without the -v option the query results are not shown. Also, query is a file, not the string to look up.

First, cmph -v -m words.mph words looks up the key file against itself: every word gets a unique ID.

Then I made a query file with the following contents:

aa
ccc
dd
e
fgh
这也不是个词
这可不是个词

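Every line got an id, so all of them were reported as "found". A minimal perfect hash maps any input string to some id; on its own it does not test whether the string is in the key set.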


Then I generated w2 from words so that every word is exactly 12 Chinese characters long, padding the shorter ones with "无", e.g.: 意义无无无无无无无无无无
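The padding step can be sketched as below. This is a minimal illustration, assuming the word list is well-formed UTF-8; `utf8_length`, `pad_word` and `check_pad_word` are hypothetical names, not part of CMPH or of the script actually used to build w2.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of characters (code points) in well-formed UTF-8 text:
// every byte that is not a continuation byte (10xxxxxx) starts a character.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Pad a word on the right with `filler` until it is `width` characters long.
// Words that already have `width` characters or more are returned unchanged.
// Assumption: `filler` is a single character; this source file is UTF-8.
std::string pad_word(std::string word, std::size_t width = 12,
                     const std::string& filler = "无") {
    for (std::size_t n = utf8_length(word); n < width; ++n) word += filler;
    return word;
}

// Small self-check using two words from the test data.
void check_pad_word() {
    const std::string padded = pad_word("意义");
    assert(utf8_length(padded) == 12);
    assert(padded.size() == 36);  // 12 characters x 3 bytes each in UTF-8
    assert(padded.compare(0, 6, "意义") == 0);
    // The longest word is already 12 characters and stays as it is.
    assert(pad_word("中航技进出口有限责任公司") == "中航技进出口有限责任公司");
}
```

With every record the same width, each padded word occupies 36 bytes in UTF-8.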

This time -g ran even faster: 0.109s, and w2.mph is still 1.9M. The query results:

aa -> 0
ccc -> 127860
dd -> 217306
More than 1 keys were mapped to bin 0
Duplicated or unknown key e in the input
e -> 0
fgh -> 90963
More than 1 keys were mapped to bin 0
Duplicated or unknown key 这也不是个词 in the input
这也不是个词 -> 0
这可不是个词 -> 147619

So: look up the hash, then do one memcmp against the stored word to confirm the match?
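A rough sketch of that "hash, then memcmp" idea follows. It does not use CMPH: as a stand-in it builds a toy perfect hash by brute-force searching for an FNV-1a seed that gives every key its own slot, which only works for tiny key sets. `ToyMph`, `build_toy_mph`, `contains` and `check_toy_mph` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Toy stand-in for a minimal perfect hash (NOT CMPH's algorithm or API).
struct ToyMph {
    std::uint64_t seed;
    std::vector<std::string> slots;  // slots[id] holds the key that owns id
};

// Seeded FNV-1a, with the high half folded into the low half at the end.
std::uint64_t fnv1a(const std::string& s, std::uint64_t seed) {
    std::uint64_t h = 14695981039346656037ULL ^ seed;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h ^ (h >> 32);
}

// Try seeds until no two keys share a slot. Keys must be distinct,
// otherwise this loop never terminates.
ToyMph build_toy_mph(const std::vector<std::string>& keys) {
    assert(!keys.empty());
    const std::size_t n = keys.size();
    for (std::uint64_t seed = 0;; ++seed) {
        std::vector<std::string> slots(n);
        std::vector<bool> used(n, false);
        bool ok = true;
        for (const std::string& k : keys) {
            const std::size_t id = fnv1a(k, seed) % n;
            if (used[id]) { ok = false; break; }
            used[id] = true;
            slots[id] = k;
        }
        if (ok) return ToyMph{seed, slots};
    }
}

// Any string hashes to some slot, so membership needs one extra comparison
// against the key stored in that slot.
bool contains(const ToyMph& mph, const std::string& query) {
    const std::size_t id = fnv1a(query, mph.seed) % mph.slots.size();
    const std::string& owner = mph.slots[id];
    return owner.size() == query.size() &&
           std::memcmp(owner.data(), query.data(), query.size()) == 0;
}

// Self-check with words from the test data and two of the non-word queries.
void check_toy_mph() {
    const std::vector<std::string> words = {"意义", "一一", "中航技进出口有限责任公司"};
    const ToyMph mph = build_toy_mph(words);
    for (const std::string& w : words) assert(contains(mph, w));
    assert(!contains(mph, "aa"));            // gets a slot id, but memcmp rejects it
    assert(!contains(mph, "这也不是个词"));
}
```

The cost of the check is one string comparison per lookup, plus keeping the original keys in a table indexed by id.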

How do I define the ids of the keys? You don't. The ids will be assigned by the algorithm creating the minimal perfect hash function. If the algorithm creates an ordered minimal perfect hash function, the ids will be the indices of the keys in the input. Otherwise, you have no guarantee of the distribution of the ids -- CMPH FAQ


Results of apt search gperf:

  • ace-gperf ACE perfect hash function generator
  • gperf Perfect hash function generator
  • triehash Generates perfect hash functions as native machine code

Dynamic Perfect Hash Function (dphf) generate a perfect hash function object according to an user provided array.

In order to use dphf:

  1. include "dphf.hpp"
  2. define a class derived from dphf_hook
  3. populate a vector of your class object (defined in step 2)
  4. construct a dphf object using the vector (created in step 3)
  5. using the object to find the desired item.

The author is Charles Zhang.


PTHash is a C++ library implementing fast and compact minimal perfect hash functions as described in the following research papers:

PHOBIC: Perfect Hashing with Optimized Bucket Sizes and Interleaved Coding (ESA 2024).

This is probably the newest of them, but the API does not look simple, so I did not try it.


An AI said:


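unordered_map does not always require you to implement a hash function yourself. For types the standard library already supports, it uses the built-in hash function; you only need to supply your own hash function when the key is a user-defined type.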

For example, the hash for struct Person { string name; int age; } can simply return hash<string>()(name)
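Spelled out, the Person example might look like the sketch below. `PersonHash`, `PersonTable` and `check_person_table` are illustrative names; the equality operator is added because unordered_map needs it as well as the hash.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

struct Person {
    std::string name;
    int age;
    // unordered_map needs equality as well, to tell apart keys that
    // land in the same bucket.
    bool operator==(const Person& other) const {
        return name == other.name && age == other.age;
    }
};

// Hash on the name only, reusing the built-in std::hash<std::string>.
// People with the same name but different ages collide and are then
// separated by operator==.
struct PersonHash {
    std::size_t operator()(const Person& p) const {
        return std::hash<std::string>()(p.name);
    }
};

using PersonTable = std::unordered_map<Person, std::string, PersonHash>;

// Self-check: a built-in key type needs no hash, the custom one does.
void check_person_table() {
    std::unordered_map<std::string, int> by_word;  // std::hash<std::string> is built in
    by_word["意义"] = 1;
    assert(by_word.count("意义") == 1);

    PersonTable jobs;
    jobs[Person{"Ann", 30}] = "engineer";
    jobs[Person{"Ann", 31}] = "teacher";
    assert(jobs.size() == 2);  // same hash, still two distinct keys
}
```

Hashing only the name is valid but gives more collisions than combining both fields; whether that matters depends on how often names repeat.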


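Chinese characters (the vast majority, and all the commonly used ones) take one byte fewer in UTF-16 than in UTF-8: 2 bytes instead of 3. If the corpus is in UTF-16, the segmentation program can conveniently mmap() it.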

wiki/a/00.txt is 1010K; iconv -f utf-8 -t utf-16 -c took 0.017 seconds; the result is 722K.
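The size difference can be checked without converting anything. The sketch below assumes well-formed UTF-8 input and ignores any byte-order mark; `utf16_bytes` and `check_utf16_sizes` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to store well-formed UTF-8 text as UTF-16, without a BOM.
// No validation is performed on the input.
std::size_t utf16_bytes(const std::string& utf8) {
    std::size_t units = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) == 0x80) continue;  // continuation byte, already counted
        units += (c >= 0xF0) ? 2 : 1;      // 4-byte sequences need a surrogate pair
    }
    return units * 2;
}

// Self-check; assumes this source file is saved as UTF-8.
void check_utf16_sizes() {
    assert(std::string("意义").size() == 6);  // 3 bytes per character in UTF-8
    assert(utf16_bytes("意义") == 4);         // 2 bytes per character in UTF-16
    assert(utf16_bytes("ab") == 4);           // ASCII doubles in size
}
```

Pure Chinese text would shrink to 2/3 of its UTF-8 size. The measured 722K out of 1010K is a little above that, which would be consistent with some ASCII and newlines in the file, since those double in size.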


posted @ 2025-11-08 14:04  华容道专家