Python search engine Whoosh: Chinese documentation and code, plus how to use jieba

Note: database table names should preferably not contain underscores.

 

 

Chinese documentation links:

      https://mr-zhao.gitbooks.io/whoosh/content/%E5%A6%82%E4%BD%95%E7%B4%A2%E5%BC%95%E6%96%87%E6%A1%A3.html

      https://mr-zhao.gitbooks.io/whoosh/content/如何索引文档.html?q=

 

Code:

 https://github.com/renfanzi/myWhoosh

 

 

jieba usage example (version 3.5)

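# ChineseAnalyzer is only importable when Whoosh is installed.
from jieba.analyse import ChineseAnalyzer
import jieba

analyzer = ChineseAnalyzer()
# Calling the analyzer tokenizes the text and filters out stop words in one step, which is more concise and convenient.
a = analyzer("我的好朋友是李明;我爱北京天安门;IBM和Microsoft; I have a dream. this is intetesting and interested me a lot")
print([i.text for i in a])

# cut_all=False is accurate mode: nothing is filtered, so punctuation and spaces come back as tokens.
seg_list = jieba.cut("我的好朋友是李明;我爱北京天安门;IBM和Microsoft; I have a dream. this is intetesting and interested me a lot", cut_all=False)
# cut_all=True is full mode: in the result below, punctuation and spaces no longer appear.
seg_list1 = jieba.cut("IBM和Microsoft; I have a dream. this is intetesting and interested me a lot", cut_all=True)
print([i for i in seg_list if i])
print([i for i in seg_list1 if i])
# Note: the tokens can also be concatenated with join.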
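
The remark about join can be illustrated with a small sketch. It does not call jieba: the token list is hard-coded from part of the sample sentence so that it runs on its own, and the helper name `join_tokens` and the `NOISE` set are assumptions made for this illustration only.

```python
import string

# Tokens in the shape jieba.cut(..., cut_all=False) produces for part of the
# sample sentence (hard-coded here so the sketch runs without jieba installed).
tokens = ['北京', '天安门', ';', 'IBM', '和', 'Microsoft', ';', ' ',
          'I', ' ', 'have', ' ', 'a', ' ', 'dream', '.']

# Characters treated as noise: ASCII punctuation plus the full-width semicolon.
NOISE = set(string.punctuation) | {';'}


def join_tokens(tokens, sep='/'):
    """Drop whitespace-only and punctuation-only tokens, then join the rest."""
    kept = [t for t in tokens if t.strip() and not all(ch in NOISE for ch in t)]
    return sep.join(kept)


print(join_tokens(tokens))
```

Passing `sep=' '` instead yields a space-separated string, which may be handy when the segmented text is fed to a tool that expects whitespace-delimited words.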

 

 

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.865 seconds.
Prefix dict has been built succesfully.
['我', '好', '朋友', '是', '李明', '我', '爱', '北京', '天安', '天安门', 'ibm', 'microsoft', 'dream', 'intetest', 'interest', 'me', 'lot']
['我', '的', '好', '朋友', '是', '李明', ';', '我', '爱', '北京', '天安门', ';', 'IBM', '和', 'Microsoft', ';', ' ', 'I', ' ', 'have', ' ', 'a', ' ', 'dream', '.', ' ', 'this', ' ', 'is', ' ', 'intetesting', ' ', 'and', ' ', 'interested', ' ', 'me', ' ', 'a', ' ', 'lot']
['IBM', '和', 'Microsoft', 'I', 'have', 'a', 'dream', 'this', 'is', 'intetesting', 'and', 'interested', 'me', 'a', 'lot']
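
The first list (from the analyzer) differs from the other two in ways that can be read off the output: tokens are lowercased, common words such as "have", "a", "this", "is" and "and" are missing, and "interested" has been reduced to "interest". The sketch below is a toy imitation of the first two effects only. It is not jieba's code, the function name `toy_analyze` is hypothetical, the stop-word set is an assumption chosen to match the words missing above, and stemming is not reproduced.

```python
# Hypothetical stop-word set, chosen only to match the words that are missing
# from the analyzer result shown above; it is not jieba's actual list.
STOP_WORDS = {'a', 'and', 'have', 'is', 'this', '的', '和'}


def toy_analyze(tokens, stop_words=STOP_WORDS):
    """Lowercase tokens, then drop blanks, punctuation, and stop words."""
    result = []
    for tok in tokens:
        word = tok.strip().lower()
        # Skip empty strings and tokens with no letter or digit (punctuation).
        if not word or not any(ch.isalnum() for ch in word):
            continue
        # Skip stop words and single ASCII characters such as "i".
        if word in stop_words or (len(word) == 1 and ord(word) < 128):
            continue
        result.append(word)
    return result


tokens = ['IBM', '和', 'Microsoft', ';', ' ', 'I', ' ', 'have', ' ', 'a', ' ',
          'dream', '.', ' ', 'me', ' ', 'a', ' ', 'lot']
print(toy_analyze(tokens))
```

Run on these tokens, the sketch keeps only `ibm`, `microsoft`, `dream`, `me` and `lot`, which lines up with the corresponding part of the analyzer result.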

 

posted @ 2017-07-17 16:15 by 我当道士那儿些年