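Elasticsearch + IK Chinese analyzer
IK analyzer
IK is a widely used Chinese analyzer for Elasticsearch (ES). It supports custom dictionaries and hot dictionary updates, so dictionary changes take effect without restarting the ES cluster.
GitHub: https://github.com/medcl/elasticsearch-analysis-ik
IK provides two analyzers, ik_smart and ik_max_word, and two tokenizers with the same names, ik_smart and ik_max_word.
ik_max_word: splits text at the finest granularity and exhausts every possible combination. For example, "中华人民共和国国歌" is split into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌".
ik_smart: splits text at the coarsest granularity. For example, "中华人民共和国国歌" is split into "中华人民共和国, 国歌".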
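The difference between the two granularities can be sketched with a toy dictionary-based segmenter. This is only an illustration and not IK's actual algorithm: the dictionary contents and the function names `fine_grained` and `coarse_grained` are assumptions made for this example, and single-character tokens are left out.

```python
# Toy dictionary for illustration; IK's real dictionary is far larger
# and its segmentation algorithm is more sophisticated.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def fine_grained(text):
    """Emit every dictionary word found at every start position (ik_max_word-like)."""
    tokens = []
    for start in range(len(text)):
        # Try longer candidates first, then shorter ones at the same position.
        for end in range(min(len(text), start + MAX_WORD_LEN), start, -1):
            if text[start:end] in DICTIONARY:
                tokens.append(text[start:end])
    return tokens


def coarse_grained(text):
    """Greedy longest match without overlaps (ik_smart-like)."""
    tokens = []
    start = 0
    while start < len(text):
        for end in range(min(len(text), start + MAX_WORD_LEN), start, -1):
            if text[start:end] in DICTIONARY:
                tokens.append(text[start:end])
                start = end
                break
        else:
            # No dictionary word starts here; skip this character.
            start += 1
    return tokens


print(fine_grained("中华人民共和国国歌"))
print(coarse_grained("中华人民共和国国歌"))
```

With this toy dictionary, `coarse_grained` returns the same two tokens as the ik_smart example above, while `fine_grained` returns a subset of the ik_max_word tokens, because overlapping words are all kept.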
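IK ships with its own dictionaries: a keyword dictionary and a stop word dictionary. Both can be extended with custom dictionaries. The keyword dictionary determines how a search phrase is cut into terms, while words in the stop word dictionary are dropped outright and never become tokens.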
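According to the IK plugin's README, the hot update mentioned above works by polling a remote dictionary URL configured under the `remote_ext_dict` entry of `IKAnalyzer.cfg.xml`. The endpoint should return `Last-Modified` and `ETag` headers plus a body with one word per line, and IK reloads the dictionary when either header changes. The sketch below is one hypothetical way to serve such a dictionary using only the Python standard library. The word list, the port, and the names `render_dictionary`, `DictHandler`, and `make_server` are assumptions for illustration, not part of IK.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory word list; a real service would read a file or database.
WORDS = ["一家人", "刘德华"]
# Fixed at start-up in this sketch; only the ETag changes when WORDS changes.
LAST_MODIFIED = formatdate(usegmt=True)


def render_dictionary(words):
    """Return (body, etag): one word per line, UTF-8 encoded."""
    body = "\n".join(words).encode("utf-8")
    etag = hashlib.sha256(body).hexdigest()
    return body, etag


class DictHandler(BaseHTTPRequestHandler):
    """Serves the dictionary for both HEAD (change check) and GET (download)."""

    def _send_headers(self, body, etag):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Last-Modified", LAST_MODIFIED)
        self.send_header("ETag", etag)
        self.end_headers()

    def do_HEAD(self):
        body, etag = render_dictionary(WORDS)
        self._send_headers(body, etag)

    def do_GET(self):
        body, etag = render_dictionary(WORDS)
        self._send_headers(body, etag)
        self.wfile.write(body)


def make_server(port=8000):
    """Build the HTTP server; call serve_forever() on the result to start it."""
    return HTTPServer(("0.0.0.0", port), DictHandler)
```

Calling `make_server().serve_forever()` starts the service. Its URL would then be the value of `remote_ext_dict`, and appending a word to `WORDS` changes the ETag on the next poll.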
Segmentation examples:
ik_max_word: fine-grained segmentation
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": ["我们是一家人"]
}
Response:
{
  "tokens" : [
    {
      "token" : "是一家",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是一",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "一家人",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "一家",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "家人",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
ik_smart: coarse-grained segmentation
GET _analyze
{
  "analyzer": "ik_smart",
  "text": ["我们是一家人"]
}
Response:
{
  "tokens" : [
    {
      "token" : "是一",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "家人",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
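synonym (synonyms)
ES ships with a built-in synonym token filter; see the official Elasticsearch documentation for details.
Dictionary format:
苹果,apple
苹果,手机
apple,手机 => 苹果
With the comma-separated format, such as 苹果,apple, the terms are treated as equivalent: a search for 苹果 is expanded so that both 苹果 and apple are searched.
With the => format, the terms before => are replaced by the terms after it: a search for 手机 or apple ends up searching only for 苹果.
IK and synonyms can be used together to get better analysis results.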
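The two rule formats described above can be illustrated with a small parser. This is a minimal sketch of the expansion semantics only, assuming single-token synonyms. The names `parse_rules` and `expand` are made up for this example, and the real synonym token filter handles more cases, such as multi-word synonyms.

```python
def parse_rules(lines):
    """Parse synonym rules into a mapping from a term to its replacement terms."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: terms on the left are replaced by terms on the right.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",") if t.strip()]
            targets = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # Equivalent synonyms: every term expands to the whole group.
            sources = [t.strip() for t in line.split(",") if t.strip()]
            targets = list(sources)
        for term in sources:
            bucket = mapping.setdefault(term, [])
            for target in targets:
                if target not in bucket:
                    bucket.append(target)
    return mapping


def expand(term, mapping):
    """Return the terms that are actually searched for the given input term."""
    return mapping.get(term, [term])


equivalent = parse_rules(["苹果,apple"])
print(expand("苹果", equivalent))   # both 苹果 and apple are searched

explicit = parse_rules(["apple,手机 => 苹果"])
print(expand("手机", explicit))     # only 苹果 is searched
```

Each rule is parsed on its own here to keep the two behaviours visible; passing several rules in one list merges the replacements for terms that appear in more than one rule.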
Usage example:
PUT /test_synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms_path": "config/synonyms.txt"
        }
      },
      "analyzer": {
        "my_synonyms_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": [
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "my_synonyms_analyzer"
        }
      }
    }
  }
}
Testing the synonym analyzer:
GET test_synonyms/_analyze
{
  "analyzer": "my_synonyms_analyzer",
  "text": "苹果"
}
Response:
{
  "tokens" : [
    {
      "token" : "苹果",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "apple",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}
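The type field shows how each token was produced: in the response above, CN_WORD is a word from IK segmentation and SYNONYM is a term added by the synonym filter.
There is also a dynamic synonym plugin; you can use it as a reference to adapt a version for your own ES release:
bells/elasticsearch-analysis-dynamic-synonym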
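pinyin analyzer
The pinyin analyzer converts Chinese text to pinyin, so content written in Chinese characters can be found by typing pinyin.
GitHub: medcl/elasticsearch-analysis-pinyin
Analysis result:
GET _analyze
{
  "analyzer": "pinyin",
  "text": "刘德华"
}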
Response:
{
  "tokens" : [
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hua",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    }
  ]
}
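The tokens in the response above fall into two kinds: the full pinyin of each character (liu, de, hua) and the joined first letters (ldh). The sketch below reproduces that idea in miniature. The three-entry lookup table and the function name `pinyin_tokens` are assumptions for illustration; the real plugin relies on a complete pinyin dictionary and many more options.

```python
# Hypothetical lookup table covering only the characters in this example.
PINYIN = {"刘": "liu", "德": "de", "华": "hua"}


def pinyin_tokens(text):
    """Return (full pinyin syllables, joined first letters) for known characters."""
    syllables = [PINYIN[ch] for ch in text if ch in PINYIN]
    first_letters = "".join(syllable[0] for syllable in syllables)
    return syllables, first_letters


syllables, first_letters = pinyin_tokens("刘德华")
print(syllables)       # ['liu', 'de', 'hua']
print(first_letters)   # ldh
```

Indexing both kinds of token is what lets a user find 刘德华 by typing either "liudehua" style input or the abbreviation "ldh".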
Using it in an index:
PUT /pingyin_test/
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "pinyin_analyzer" : {
          "tokenizer" : "my_pinyin"
        }
      },
      "tokenizer" : {
        "my_pinyin" : {
          "type" : "pinyin",
          "keep_separate_first_letter" : false,
          "keep_full_pinyin" : true,
          "keep_original" : true,
          "limit_first_letter_length" : 16,
          "lowercase" : true,
          "remove_duplicated_term" : true
        }
      }
    }
  }
}
An example that combines IK, synonyms, and pinyin for search. The definition of the product_search_v1 index (mappings and settings) is:
{
  "product_search_v1" : {
    "mappings" : {
      "product_search" : {
        "properties" : {
          "product_name" : {
            "type" : "text",
            "fields" : {
              "first_py" : {
                "type" : "text",
                "analyzer" : "first_py_analyzer"
              },
              "full_pinyin" : {
                "type" : "text",
                "analyzer" : "full_pinyin_analyzer"
              },
              "ik_pinyin" : {
                "type" : "text",
                "analyzer" : "my_ik_pinyin"
              },
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            },
            "analyzer" : "ik_syno_max",
            "search_analyzer" : "ik_syno"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "analysis" : {
          "filter" : {
            "my_synonym_filter" : {
              "type" : "synonym",
              "synonyms_path" : "config/synonyms.txt"
            },
            "my_pinyin_filter" : {
              "lowercase" : "true",
              "keep_original" : "true",
              "remove_duplicated_term" : "true",
              "keep_separate_first_letter" : "false",
              "type" : "pinyin",
              "limit_first_letter_length" : "16",
              "keep_full_pinyin" : "true"
            }
          },
          "analyzer" : {
            "full_pinyin_analyzer" : {
              "tokenizer" : "full_pinyin_tokenizer"
            },
            "first_py_analyzer" : {
              "tokenizer" : "first_py_letter_tokenizer"
            },
            "ik_syno" : {
              "filter" : [
                "lowercase",
                "my_synonym_filter"
              ],
              "type" : "custom",
              "tokenizer" : "ik_smart"
            },
            "ik_syno_max" : {
              "filter" : [
                "lowercase",
                "my_synonym_filter"
              ],
              "type" : "custom",
              "tokenizer" : "ik_max_word"
            },
            "my_ik_pinyin" : {
              "filter" : [
                "lowercase",
                "my_pinyin_filter"
              ],
              "tokenizer" : "ik_max_word"
            }
          },
          "tokenizer" : {
            "first_py_letter_tokenizer" : {
              "keep_none_chinese_in_first_letter" : "false",
              "lowercase" : "true",
              "none_chinese_pinyin_tokenize" : "false",
              "keep_none_chinese_in_joined_full_pinyin" : "true",
              "keep_original" : "false",
              "keep_first_letter" : "true",
              "trim_whitespace" : "true",
              "type" : "pinyin",
              "keep_none_chinese" : "true",
              "limit_first_letter_length" : "16",
              "keep_full_pinyin" : "false"
            },
            "full_pinyin_tokenizer" : {
              "keep_joined_full_pinyin" : "true",
              "lowercase" : "true",
              "none_chinese_pinyin_tokenize" : "false",
              "keep_none_chinese_in_joined_full_pinyin" : "true",
              "keep_original" : "true",
              "remove_duplicated_term" : "true",
              "keep_separate_first_letter" : "false",
              "type" : "pinyin",
              "limit_first_letter_length" : "16",
              "keep_full_pinyin" : "true"
            }
          }
        }
      }
    }
  }
}
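One possible way to query this index is a `multi_match` over `product_name` and its pinyin sub-fields, so that Chinese input, full pinyin, and first-letter abbreviations can all match. The sketch below only builds the request body as a Python dict. The function name `build_product_query`, the `most_fields` type, and the `^3` boost on the main field are assumptions chosen for illustration, not settings taken from the original index.

```python
import json


def build_product_query(user_input):
    """Build a search body that checks the IK field and the pinyin sub-fields."""
    return {
        "query": {
            "multi_match": {
                "query": user_input,
                # Assumption: combine scores from every matching field.
                "type": "most_fields",
                "fields": [
                    "product_name^3",            # IK + synonyms, boosted (assumed weight)
                    "product_name.ik_pinyin",    # IK tokens converted to pinyin
                    "product_name.full_pinyin",  # full pinyin of the name
                    "product_name.first_py",     # first-letter abbreviation
                ],
            }
        }
    }


# Print the body that would be sent to GET product_search_v1/_search.
print(json.dumps(build_product_query("ldh"), ensure_ascii=False, indent=2))
```

The boost keeps exact Chinese matches ahead of pinyin-only matches; the right weights depend on the data and would need tuning.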
Reposted from https://zhuanlan.zhihu.com/p/357625669
