Modify whoosh to support HanLP Chinese word segmentation
First, modify two Python source files under anaconda3/lib/python3.7/site-packages/whoosh/analysis.
In analyzers.py, add one line:
from whoosh.analysis.tokenizers import ChineseTokenizer
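Note that step2 in the usage section below imports a ChineseAnalyzer, which the import above alone does not define. A minimal sketch of a wrapper that could also be added to analyzers.py, assuming the analyzer simply passes the ChineseTokenizer (defined in the next step) straight through; the name and the bare pass-through are assumptions, not shown in the original:

# Assumed helper (not in the original snippet): expose ChineseAnalyzer so that
# `from whoosh.analysis import ChineseAnalyzer` works. Filters such as
# LowercaseFilter() could be chained on with `|` if desired.
def ChineseAnalyzer():
    return ChineseTokenizer()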
Then add a ChineseTokenizer class at the bottom of tokenizers.py:
# These imports need to be added near the top of tokenizers.py;
# HanLP is assumed to come from the pyhanlp package.
import re
from pyhanlp import HanLP


class ChineseTokenizer(Tokenizer):
    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=False, removestops=True,
                 start_pos=0, start_char=0, mode="", **kwargs):
        t = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
        seglist = HanLP.segment(value)
        for wf in seglist:
            # Each HanLP term prints as "word/pos-tag"
            w = str(wf).strip().split('/')[0]
            f = str(wf).strip().split('/')[1]
            # Only index nouns (pos tag containing 'n', excluding the
            # 'begin'/'end' markers), pronouns ('rr') and time words ('t')
            if (re.search('n', f) and f != 'begin' and f != 'end') or f == 'rr' or f == 't':
                t.original = t.text = w
                t.boost = 1.0
                if positions:
                    t.pos = start_pos + value.find(w)
                if chars:
                    t.startchar = start_char + value.find(w)
                    t.endchar = start_char + value.find(w) + len(w)
                yield t
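A quick standalone check of the tokenizer (a sketch only; the sample sentence is illustrative and pyhanlp must be installed for HanLP.segment to run):

from whoosh.analysis.tokenizers import ChineseTokenizer

# Each HanLP term prints as "word/pos", e.g. "北京/ns"; the tokenizer keeps
# only nouns, pronouns ('rr') and time words ('t').
tokenizer = ChineseTokenizer()
for token in tokenizer("今天我们在北京参观博物馆", positions=True, chars=True):
    print(token.text, token.pos, token.startchar, token.endchar)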
How to use it after the modification:
step1 >>> Put tokenizers.py and analyzers.py into site-packages/whoosh/analysis of your Python environment
(for example, mine is anaconda3/lib/python3.7/site-packages/whoosh/analysis)
step2 >>> Then import the analyzer as follows:
from whoosh.analysis import ChineseAnalyzer
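From there the analyzer attaches to a schema field like any other whoosh analyzer. A minimal indexing and search sketch; the index directory, field names and sample text are illustrative, not from the original:

import os
from whoosh.analysis import ChineseAnalyzer
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in
from whoosh.qparser import QueryParser

# Build a schema whose content field is analyzed with HanLP segmentation
schema = Schema(title=ID(stored=True),
                content=TEXT(stored=True, analyzer=ChineseAnalyzer()))

if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = create_in("indexdir", schema)

# Index one sample document
writer = ix.writer()
writer.add_document(title="1", content="自然语言处理是人工智能的一个重要方向")
writer.commit()

# Search the content field; the query string is segmented by the same analyzer
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("自然语言处理")
    for hit in searcher.search(query):
        print(hit["title"], hit["content"])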