Python search engine Whoosh: Chinese documentation and code, plus using jieba

Note: it is best to avoid underscores in database table names.

Chinese documentation links:

　　　　　　https://mr-zhao.gitbooks.io/whoosh/content/%E5%A6%82%E4%BD%95%E7%B4%A2%E5%BC%95%E6%96%87%E6%A1%A3.html

　　　　　　https://mr-zhao.gitbooks.io/whoosh/content/如何索引文档.html?q=

Code:

 https://github.com/renfanzi/myWhoosh

jieba usage example (version 3.5)

from jieba.analyse import ChineseAnalyzer
import jieba

analyzer = ChineseAnalyzer()
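# The analyzer returns only the meaningful tokens (lowercased and stemmed,
# with stop words and punctuation dropped), which is cleaner and more convenient.
a = analyzer("我的好朋友是李明;我爱北京天安门;IBM和Microsoft; I have a dream. this is intetesting and interested me a lot")
print([i.text for i in a])

# cut_all=False is accurate mode: nothing is filtered, so punctuation and spaces are kept as tokens.
seg_list = jieba.cut("我的好朋友是李明;我爱北京天安门;IBM和Microsoft; I have a dream. this is intetesting and interested me a lot", cut_all=False)
# cut_all=True is full mode: in the output below, punctuation and spaces no longer appear once empty strings are dropped by `if i`.
seg_list1 = jieba.cut("IBM和Microsoft; I have a dream. this is intetesting and interested me a lot", cut_all=True)
print([i for i in seg_list if i])
print([i for i in seg_list1 if i])
# Note: the tokens can also be concatenated with str.join.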

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.865 seconds.
Prefix dict has been built succesfully.
['我', '好', '朋友', '是', '李明', '我', '爱', '北京', '天安', '天安门', 'ibm', 'microsoft', 'dream', 'intetest', 'interest', 'me', 'lot']
['我', '的', '好', '朋友', '是', '李明', ';', '我', '爱', '北京', '天安门', ';', 'IBM', '和', 'Microsoft', ';', ' ', 'I', ' ', 'have', ' ', 'a', ' ', 'dream', '.', ' ', 'this', ' ', 'is', ' ', 'intetesting', ' ', 'and', ' ', 'interested', ' ', 'me', ' ', 'a', ' ', 'lot']
['IBM', '和', 'Microsoft', 'I', 'have', 'a', 'dream', 'this', 'is', 'intetesting', 'and', 'interested', 'me', 'a', 'lot']
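
As a small illustration of the join note in the code above, the sketch below takes a shortened copy of the accurate-mode token list, drops tokens that are only punctuation or whitespace, and joins the rest with a separator. The helper name is_word and the "/" separator are assumptions chosen for this example.

```python
# Shortened copy of the accurate-mode (cut_all=False) tokens printed above.
tokens = ['我', '爱', '北京', '天安门', ';', 'IBM', '和', 'Microsoft',
          ';', ' ', 'I', ' ', 'have', ' ', 'a', ' ', 'dream', '.']


def is_word(token):
    """Return True if the token contains at least one letter, digit, or CJK character."""
    return any(ch.isalnum() for ch in token)


# Keep only real words, then join them into a single string.
words = [t for t in tokens if is_word(t)]
joined = "/".join(words)
print(joined)
```

Filtering this way keeps single-character words such as "I" and "a", which the analyzer-based output above drops.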

posted @ 2017-07-17 16:15 by 我当道士那儿些年
