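Elastic + IK Chinese analyzer

IK analyzer

IK is a widely used Chinese analyzer for ES. It supports custom dictionaries and hot dictionary updates, so the ES cluster does not need to be restarted when a dictionary changes.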

GitHub repository.

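IK provides two analyzers, ik_smart and ik_max_word, and two tokenizers with the same names, ik_smart and ik_max_word.

ik_max_word: splits text at the finest granularity. For example, it splits "中华人民共和国国歌" into "中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌", exhausting every possible combination;

ik_smart: splits text at the coarsest granularity. For example, it splits "中华人民共和国国歌" into "中华人民共和国,国歌".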
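To make the difference between the two granularities concrete, here is a toy Python sketch. It is not IK's actual algorithm: the fine-grained function simply lists every substring found in a dictionary, and the coarse-grained function uses greedy forward maximum matching. The vocabulary `VOCAB` is a small hypothetical subset chosen for this illustration.

```python
# Hypothetical mini-dictionary; IK's real main dictionary is far larger.
VOCAB = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}
TEXT = "中华人民共和国国歌"


def all_dictionary_matches(text, vocab):
    """Fine-grained: every substring of `text` that appears in `vocab`."""
    return [
        text[start:end]
        for start in range(len(text))
        for end in range(start + 1, len(text) + 1)
        if text[start:end] in vocab
    ]


def forward_max_match(text, vocab):
    """Coarse-grained: greedily take the longest dictionary word at each position."""
    tokens, start = [], 0
    while start < len(text):
        end = len(text)
        # Shrink the window until it is a dictionary word or a single character.
        while end > start + 1 and text[start:end] not in vocab:
            end -= 1
        tokens.append(text[start:end])
        start = end
    return tokens


fine = all_dictionary_matches(TEXT, VOCAB)
coarse = forward_max_match(TEXT, VOCAB)  # ['中华人民共和国', '国歌']
```

The coarse result has two tokens, while the fine result contains overlapping words such as 中华, 华人, and 人民, which is the trade-off between recall (ik_max_word) and precision (ik_smart) described above.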

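IK ships with its own dictionaries: a keyword dictionary and a stop word dictionary. It also supports custom extension dictionaries. The keyword dictionary determines how a search phrase is cut into terms, while words in the stop word dictionary are dropped outright and take no part in analysis.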

Analysis examples:

ik_max_word: fine-grained analysis

GET _analyze
{
  "analyzer": "ik_max_word", 
  "text": ["我们是一家人"]
}
Response:
{
  "tokens" : [
    {
      "token" : "是一家",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是一",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "一家人",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "一家",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "家人",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

ik_smart: coarse-grained analysis

GET _analyze
{
  "analyzer": "ik_smart", 
  "text": ["我们是一家人"]
}

Response:
{
  "tokens" : [
    {
      "token" : "是一",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "家人",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
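The same `_analyze` calls can be prepared from a script. The sketch below only builds the HTTP request and parses a response body; it does not contact a cluster. The base URL `http://localhost:9200` and the helper names `build_analyze_request` and `token_strings` are assumptions made for this example.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) an _analyze request for the given analyzer."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # _analyze accepts both GET and POST
    )


def token_strings(response_body):
    """Extract just the token text from an _analyze JSON response."""
    return [tok["token"] for tok in json.loads(response_body)["tokens"]]


# The ik_smart response shown above, reduced to the fields used here.
SAMPLE_RESPONSE = """
{"tokens": [
  {"token": "是一", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 0},
  {"token": "家人", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 1}
]}
"""

request = build_analyze_request("ik_smart", "我们是一家人")
tokens = token_strings(SAMPLE_RESPONSE)  # ['是一', '家人']
```

To actually send the request against a running cluster, pass `request` to `urllib.request.urlopen` and feed the response body to `token_strings`.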

Synonyms

ES has a built-in synonym token filter; see the official documentation.

Dictionary format

苹果,apple
苹果,手机
apple,手机 => 苹果

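With the comma-separated format (苹果, apple), a search for 苹果 is expanded so that both 苹果 and apple are searched.

With the => format, the terms before => are rewritten to the terms after it. For example, a search for 手机 or apple ends up searching only for 苹果.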
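The two rule formats can be modeled with a few lines of Python. This is a simplified sketch that interprets one rule at a time; it is not Lucene's synonym parser and does not model how several rules in the same file are merged. The function names `parse_rule` and `rewrite` are made up for this illustration.

```python
def parse_rule(line):
    """Map each input term of one synonym rule to the terms it is rewritten to."""
    if "=>" in line:
        # Explicit mapping: terms on the left are replaced by terms on the right.
        left, right = line.split("=>", 1)
        sources = [t.strip() for t in left.split(",") if t.strip()]
        targets = [t.strip() for t in right.split(",") if t.strip()]
    else:
        # Equivalent synonyms: every term expands to the whole group.
        sources = targets = [t.strip() for t in line.split(",") if t.strip()]
    return {source: list(targets) for source in sources}


def rewrite(term, rule):
    """Return the terms searched for `term` under a single rule."""
    return parse_rule(rule).get(term, [term])


expanded = rewrite("苹果", "苹果,apple")           # ['苹果', 'apple']
replaced = rewrite("手机", "apple,手机 => 苹果")    # ['苹果']
untouched = rewrite("香蕉", "苹果,apple")          # ['香蕉']
```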

IK and synonyms can be used together to get better analysis results.

Usage example:

PUT /test_synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms_path": "config/synonyms.txt"
        }
      },
      "analyzer": {
        "my_synonyms_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": [
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "my_synonyms_analyzer"
        }
      }
    }
  }
}

Using synonyms:

GET test_synonyms/_analyze
{
  "analyzer": "my_synonyms_analyzer",
  "text": "苹果"
}
Response:
{
  "tokens" : [
    {
      "token" : "苹果",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "apple",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}

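The type field indicates the token type of each term: in the response above, CN_WORD marks the Chinese word produced by IK and SYNONYM marks the term added by the synonym filter.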
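In the response above, the original term and its synonym also share the same position value. A small Python sketch (the helper name `group_by_position` is an assumption for this example) groups `_analyze` tokens by position, which makes stacked synonyms easy to spot:

```python
from collections import defaultdict


def group_by_position(tokens):
    """Group (token, type) pairs by their position in the token stream."""
    groups = defaultdict(list)
    for tok in tokens:
        groups[tok["position"]].append((tok["token"], tok["type"]))
    return dict(sorted(groups.items()))


# Tokens copied from the my_synonyms_analyzer response above.
TOKENS = [
    {"token": "苹果", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
    {"token": "apple", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 0},
]

grouped = group_by_position(TOKENS)
# {0: [('苹果', 'CN_WORD'), ('apple', 'SYNONYM')]}
```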

There is also a dynamic synonym plugin; you can use it as a reference and adapt a build to your own ES version.

bells/elasticsearch-analysis-dynamic-synonym

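Pinyin analyzer

The pinyin analyzer converts Chinese text to pinyin terms, so that Chinese text can be found by typing its pinyin.

GitHub repository: medcl/elasticsearch-analysis-pinyin

Analysis result: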

GET _analyze
{
  "analyzer": "pinyin", 
  "text": "刘德华"
}

Response:
{
  "tokens" : [
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hua",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    }
  ]
}

Using it in an index:

PUT /pingyin_test/ 
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                    }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}
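To get a feel for what some of these tokenizer options control, here is a toy Python model. It is not the plugin's implementation: the pinyin syllables are supplied by hand rather than looked up, the term order is arbitrary, and the argument defaults are chosen only for this illustration. The option names mirror the settings above.

```python
def pinyin_terms(
    original,
    syllables,
    *,
    keep_first_letter=True,
    keep_full_pinyin=True,
    keep_original=False,
    limit_first_letter_length=16,
    lowercase=True,
    remove_duplicated_term=False,
):
    """Toy model: build the terms a pinyin tokenizer might emit for one input."""
    terms = []
    if keep_full_pinyin:
        terms.extend(syllables)              # one term per syllable, e.g. liu, de, hua
    if keep_original:
        terms.append(original)               # the untouched input text
    if keep_first_letter:
        initials = "".join(s[0] for s in syllables if s)
        terms.append(initials[:limit_first_letter_length])  # e.g. ldh
    if lowercase:
        terms = [t.lower() for t in terms]
    if remove_duplicated_term:
        terms = list(dict.fromkeys(terms))   # drop repeats, keep first occurrence
    return terms


with_original = pinyin_terms("刘德华", ["liu", "de", "hua"], keep_original=True)
# ['liu', 'de', 'hua', '刘德华', 'ldh']
deduplicated = pinyin_terms("哈哈", ["ha", "ha"], remove_duplicated_term=True)
# ['ha', 'hh']
```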

An example that combines IK, synonyms, and pinyin search:

{
  "product_search_v1" : {
    "mappings" : {
      "product_search" : {
        "properties" : {
          "product_name" : {
            "type" : "text",
            "fields" : {
              "first_py" : {
                "type" : "text",
                "analyzer" : "first_py_analyzer"
              },
              "full_pinyin" : {
                "type" : "text",
                "analyzer" : "full_pinyin_analyzer"
              },
              "ik_pinyin" : {
                "type" : "text",
                "analyzer" : "my_ik_pinyin"
              },
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            },
            "analyzer" : "ik_syno_max",
            "search_analyzer" : "ik_syno"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "analysis" : {
          "filter" : {
            "my_synonym_filter" : {
              "type" : "synonym",
              "synonyms_path" : "config/synonyms.txt"
            },
            "my_pinyin_filter" : {
              "lowercase" : "true",
              "keep_original" : "true",
              "remove_duplicated_term" : "true",
              "keep_separate_first_letter" : "false",
              "type" : "pinyin",
              "limit_first_letter_length" : "16",
              "keep_full_pinyin" : "true"
            }
          },
          "analyzer" : {
            "full_pinyin_analyzer" : {
              "tokenizer" : "full_pinyin_tokenizer"
            },
            "first_py_analyzer" : {
              "tokenizer" : "first_py_letter_tokenizer"
            },
            "ik_syno" : {
              "filter" : [
                "lowercase",
                "my_synonym_filter"
              ],
              "type" : "custom",
              "tokenizer" : "ik_smart"
            },
            "ik_syno_max" : {
              "filter" : [
                "lowercase",
                "my_synonym_filter"
              ],
              "type" : "custom",
              "tokenizer" : "ik_max_word"
            },
            "my_ik_pinyin" : {
              "filter" : [
                "lowercase",
                "my_pinyin_filter"
              ],
              "tokenizer" : "ik_max_word"
            }
          },
          "tokenizer" : {
            "first_py_letter_tokenizer" : {
              "keep_none_chinese_in_first_letter" : "false",
              "lowercase" : "true",
              "none_chinese_pinyin_tokenize" : "false",
              "keep_none_chinese_in_joined_full_pinyin" : "true",
              "keep_original" : "false",
              "keep_first_letter" : "true",
              "trim_whitespace" : "true",
              "type" : "pinyin",
              "keep_none_chinese" : "true",
              "limit_first_letter_length" : "16",
              "keep_full_pinyin" : "false"
            },
            "full_pinyin_tokenizer" : {
              "keep_joined_full_pinyin" : "true",
              "lowercase" : "true",
              "none_chinese_pinyin_tokenize" : "false",
              "keep_none_chinese_in_joined_full_pinyin" : "true",
              "keep_original" : "true",
              "remove_duplicated_term" : "true",
              "keep_separate_first_letter" : "false",
              "type" : "pinyin",
              "limit_first_letter_length" : "16",
              "keep_full_pinyin" : "true"
            }
          }
        }
      }
    }
  }
}
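One way to query an index like this is a multi_match query across the main field and its pinyin sub-fields. The sketch below only builds the request body; the field names come from the mapping above, while the choice of multi_match, the absence of per-field boosts, and the helper name `build_product_query` are assumptions for this example.

```python
import json


def build_product_query(keyword):
    """Build a search body that matches Chinese text, full pinyin, or initials."""
    return {
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": [
                    "product_name",              # IK + synonyms
                    "product_name.ik_pinyin",    # IK tokens converted to pinyin
                    "product_name.full_pinyin",  # full pinyin of the text
                    "product_name.first_py",     # first letters only
                ],
            }
        }
    }


body = build_product_query("pingguo")
payload = json.dumps(body, ensure_ascii=False)
```

The resulting `payload` could be sent as the body of `GET product_search_v1/_search`; whether each sub-field should be weighted differently depends on the data and is left open here.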
// Reposted from https://zhuanlan.zhihu.com/p/357625669

posted @ 2022-09-08 00:17  v_nice