7. Elasticsearch Series: Text Analysis (Tokenization)

1. Analysis and Analyzer

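Analysis: text analysis is the process of converting full text into a series of terms (term/token); it is also called tokenization.
Analysis is performed by an Analyzer. You can use one of Elasticsearch's built-in analyzers or define a custom analyzer as needed.
Text is converted into terms when data is written, and the query string must be analyzed with the same analyzer at search time so that the query terms match the indexed terms.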
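
Why the query has to go through the same analysis can be seen with a toy inverted index. The sketch below is pure Python and does not use Elasticsearch; the `analyze`, `build_index` and `search` helpers are hypothetical names made up for this illustration, and the analyzer only splits on non-word characters and lowercases.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: split on non-word characters and lowercase each term."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query, analyze_query=True):
    """Return the ids of documents that contain every query term."""
    terms = analyze(query) if analyze_query else query.split()
    if not terms:
        return set()
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits)


docs = {1: "I Love China", 2: "I love Elasticsearch"}
index = build_index(docs)

# Query analyzed like the documents: "Love" becomes "love" and matches.
print(search(index, "Love China"))                       # {1}
# Raw query terms: "Love" was never stored in that form, so nothing matches.
print(search(index, "Love China", analyze_query=False))  # set()
```

The index only ever contains lowercased terms, so a query that skips analysis looks up terms that were never stored.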

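2. Components of an Analyzer

An Analyzer consists of three parts, applied in this order:

Character Filters (process the raw text, e.g. strip HTML)
Tokenizer (splits the text into terms according to its rules)
Token Filters (post-process the terms: lowercase them, remove stopwords, add synonyms)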
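
The three stages can be sketched as a small pipeline. This is a pure-Python illustration of the idea, not how Elasticsearch implements it; the function names and the `STOPWORDS` set are assumptions made up for the example.

```python
import re

STOPWORDS = {"a", "an", "the", "is", "of"}  # hypothetical stop list


def strip_html(text):
    """Character filter: replace HTML tags with a space before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: cut the text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: drop tokens found in the stop list."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


result = analyze(
    "<p>The Capital of <b>China</b></p>",
    char_filters=[strip_html],
    tokenizer=tokenize,
    token_filters=[lowercase, remove_stopwords],
)
print(result)  # ['capital', 'china']
```

The order of the token filters matters here: the stop list is lowercase, so `remove_stopwords` only catches "The" because `lowercase` runs first.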

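3. Elasticsearch Built-in Analyzers

Standard Analyzer - the default analyzer; splits text on word boundaries and lowercases the terms
Keyword Analyzer - no tokenization; the input is emitted unchanged as a single term
Custom Analyzer - a user-defined analyzer
Others: Simple Analyzer / Stop Analyzer / Whitespace Analyzer / Pattern Analyzer / Language Analyzers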
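
To get a feel for how some of these differ, here is a rough pure-Python approximation of three of them based on their documented behavior. It only imitates the output for simple inputs and is not the real implementation; the function names are made up for this sketch.

```python
import re


def keyword_analyzer(text):
    """No tokenization: the whole input is a single term."""
    return [text]


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Split at every non-letter character, then lowercase."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


text = "I Love China-2022"
print(keyword_analyzer(text))     # ['I Love China-2022']
print(whitespace_analyzer(text))  # ['I', 'Love', 'China-2022']
print(simple_analyzer(text))      # ['i', 'love', 'china']
```

To see the real output, send the same text to the _analyze API with the corresponding analyzer name.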

4. Using the _analyze API

# Test an analyzer directly
GET _analyze
{
  "analyzer": "standard",
  "text": "I Love China"
}
# Test with the analyzer configured for a field of an index
POST books/_analyze
{
  "field": "title",
  "text": "I Love China"
}
# Test an ad hoc combination of tokenizer and token filters
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "I Love China"
}

# Result
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "love",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "china",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
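
The start_offset and end_offset fields point back into the original text, while token holds the term after filtering. The sketch below checks this against the response shown above. It is plain Python with the response pasted in as a string; the `source_slices` helper is a hypothetical name, and it assumes the offsets can be used directly as Python string indices, which holds for this ASCII input.

```python
import json

text = "I Love China"

# The _analyze response from above, pasted in as JSON.
response = json.loads("""
{
  "tokens": [
    {"token": "i", "start_offset": 0, "end_offset": 1, "type": "<ALPHANUM>", "position": 0},
    {"token": "love", "start_offset": 2, "end_offset": 6, "type": "<ALPHANUM>", "position": 1},
    {"token": "china", "start_offset": 7, "end_offset": 12, "type": "<ALPHANUM>", "position": 2}
  ]
}
""")


def source_slices(text, response):
    """Pair each emitted term with the slice of the input it came from."""
    return [
        (tok["token"], text[tok["start_offset"]:tok["end_offset"]])
        for tok in response["tokens"]
    ]


print(source_slices(text, response))
# [('i', 'I'), ('love', 'Love'), ('china', 'China')]
```

The terms are lowercased, but the offsets still address the original capitalized words, so the original text can be located even though the indexed term differs.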

5. Installing and Using the IK Chinese Analyzer

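Download the analyzer release that matches your Elasticsearch version: https://github.com/medcl/elasticsearch-analysis-ik/releases

Extract it into the plugins directory under the ES installation directory, rename the folder to ik, and restart ES.

For a docker-compose deployment, see https://gitee.com/SJshenjian/blog-code/tree/master/src/main/java/online/shenjian/es, mount the plugins directory as shown in its docker-compose.yml, and restart.

Verify the IK Chinese analyzer: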

# Coarsest-grained segmentation
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我爱中华人民共和国"
}
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "爱",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中华人民共和国",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
# Finest-grained segmentation
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我爱中华人民共和国"
}
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "爱",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中华人民共和国",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中华人民",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "中华",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "华人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "人民共和国",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "人民",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "共和国",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "共和",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "国",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 10
    }
  ]
}
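
One way to read the two results: in the ik_smart output every character belongs to exactly one token, while ik_max_word also emits shorter words nested inside longer ones. The sketch below checks this on the (token, start_offset, end_offset) values copied from the two responses above. It is plain Python, and `overlapping_pairs` is a hypothetical helper written for this illustration.

```python
def overlapping_pairs(spans):
    """Return pairs of tokens whose [start, end) ranges overlap."""
    pairs = []
    for i, (tok_a, start_a, end_a) in enumerate(spans):
        for tok_b, start_b, end_b in spans[i + 1:]:
            if start_a < end_b and start_b < end_a:
                pairs.append((tok_a, tok_b))
    return pairs


# (token, start_offset, end_offset) copied from the ik_smart response.
ik_smart = [("我", 0, 1), ("爱", 1, 2), ("中华人民共和国", 2, 9)]

# (token, start_offset, end_offset) copied from the ik_max_word response.
ik_max_word = [
    ("我", 0, 1), ("爱", 1, 2), ("中华人民共和国", 2, 9),
    ("中华人民", 2, 6), ("中华", 2, 4), ("华人", 3, 5),
    ("人民共和国", 4, 9), ("人民", 4, 6), ("共和国", 6, 9),
    ("共和", 6, 8), ("国", 8, 9),
]

print(overlapping_pairs(ik_smart))          # []
print(overlapping_pairs(ik_max_word)[:2])
# [('中华人民共和国', '中华人民'), ('中华人民共和国', '中华')]
```

The extra overlapping terms are what let a search for 人民 or 共和国 match a document that contains 中华人民共和国.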


posted @ 2022-10-18 21:05 算法小生