Elasticsearch: Analyzers

1. Built-in Analyzers

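Analysis steps
1). character filter: preprocesses the text before it is tokenized, e.g. stripping HTML tags (<span>hello</span> -> hello) or mapping & -> and (I & you -> I and you)
2). tokenizer: splits the text into tokens, e.g. hello you and me -> hello, you, and, me
3). token filter: normalizes each token, e.g. lowercase (Tom -> tom), stop-word removal (a/the/an are dropped; in Chinese, 了/的/呢), singular/plural stemming (dogs -> dog), tense stemming (liked -> like), abbreviations (mother -> mom), synonyms (small -> little).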
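
The three steps can be sketched in plain Python. This is only a toy model of the pipeline, not Elasticsearch's implementation: the whitespace tokenizer, the `NORMALIZE` lookup table and the `STOPWORDS` set are hypothetical stand-ins chosen to mirror the examples above.

```python
def char_filter(text):
    # Step 1: rewrite the raw text before tokenizing ("&" -> "and").
    return text.replace("&", "and")


def tokenizer(text):
    # Step 2: split into tokens (plain whitespace split, for illustration only).
    return text.split()


# Hypothetical lookup table standing in for stemming and synonym filters.
NORMALIZE = {"dogs": "dog", "liked": "like", "small": "little"}
# Hypothetical stop-word list.
STOPWORDS = {"a", "an", "the"}


def token_filters(tokens):
    # Step 3: lowercase, drop stop words, then apply the lookup table.
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOPWORDS:
            continue
        out.append(NORMALIZE.get(tok, tok))
    return out


def analyze(text):
    # The stages always run in this order: char filter, tokenizer, token filters.
    return token_filters(tokenizer(char_filter(text)))


print(analyze("Tom & the small dogs"))  # ['tom', 'and', 'little', 'dog']
```

The order matters: the character filter sees the whole string, while token filters only ever see individual tokens.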
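The standard analyzer

An analyzer has three components: character filter (preprocessing), tokenizer (splitting), and token filter (normalization).

standard tokenizer: splits text on word boundaries
standard token filter: does nothing (a no-op; it was removed in Elasticsearch 7.0)
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stop words, e.g. a, in, the, is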

Changing analyzer settings
Enable the English stop words

DELETE my_index

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": { # custom analyzer name
          "type": "standard",
          "stopwords": "_english_" # enable the English stop-word list
        }
      }
    }
  }
}

Result with the default standard analyzer

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog in the house"
}

Result with the analyzer that has stop words enabled

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog in the house"
}
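
The same two requests can also be sent from a script instead of the Kibana console. The sketch below uses only the Python standard library and assumes an unsecured cluster at `http://localhost:9200` (an assumption, adjust as needed); the helper names are hypothetical.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster without authentication


def build_analyze_request(index, analyzer, text):
    """Build (but do not send) a request for /{index}/_analyze."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response_json):
    """Keep only the token strings from an _analyze response."""
    return [t["token"] for t in response_json["tokens"]]


def analyze(index, analyzer, text):
    """Send the request; needs a running cluster and an existing index."""
    request = build_analyze_request(index, analyzer, text)
    with urllib.request.urlopen(request) as resp:
        return token_texts(json.load(resp))
```

With the index created as above, `analyze("my_index", "standard", "a dog in the house")` should return all five words, while `analyze("my_index", "es_std", "a dog in the house")` should return only `dog` and `house`, because `a`, `in` and `the` are in the `_english_` stop-word list.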

2. Custom Analyzers

DELETE my_index


PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": { # custom preprocessing step
        "&_to_and": { # name
          "type": "mapping",
          "mappings": ["&=>and"] # replace & with and
        }
      },
      "filter": { # custom normalization step
        "my_stopwords": { # name
          "type": "stop",
          "stopwords": ["the", "a"] # stop words to remove
        }
      }, 
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"], 
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

Verify

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "tom&jerry are a friend in the house, <a> HAHA!!!"
}
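
To reason about what this request should return, here is a rough Python imitation of `my_analyzer`. The regular expressions are simplified stand-ins for `html_strip` and the standard tokenizer, so treat the result as an expectation to check against the real `_analyze` output rather than as a guarantee.

```python
import re


def html_strip(text):
    # Rough stand-in for html_strip: drop anything that looks like a tag.
    return re.sub(r"<[^>]+>", " ", text)


def ampersand_to_and(text):
    # Mirrors the "&=>and" mapping: no spaces are added around "and".
    return text.replace("&", "and")


def standard_like_tokenize(text):
    # Rough stand-in for the standard tokenizer: runs of word characters.
    return re.findall(r"\w+", text)


MY_STOPWORDS = {"the", "a"}  # same list as the my_stopwords filter


def my_analyzer(text):
    # Char filters run first, in the order listed in the analyzer definition.
    text = ampersand_to_and(html_strip(text))
    tokens = [tok.lower() for tok in standard_like_tokenize(text)]
    return [tok for tok in tokens if tok not in MY_STOPWORDS]


print(my_analyzer("tom&jerry are a friend in the house, <a> HAHA!!!"))
# toy output: ['tomandjerry', 'are', 'friend', 'in', 'house', 'haha']
```

Two details are worth noticing: because the mapping is `&=>and` with no surrounding spaces, `tom&jerry` becomes the single token `tomandjerry`, and `in` survives because the custom stop list only contains `the` and `a`.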

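3. Chinese Analyzers

Installing the es-ik Chinese analysis plugin
Download it from the official site: IK Chinese analysis plugin
Note: the es-ik plugin version must match the installed Elasticsearch version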

Step 1: download the IK plugin for Elasticsearch
Step 2: upload it to elasticsearch/plugins/ik/ and extract it with unzip
Step 3: restart Elasticsearch

IK analyzer basics

ik_max_word: finest-grained segmentation
ik_smart: coarsest-grained segmentation

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国人民大会堂"
}
# splits into "中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民", "共和国", "共和", "国人", "人民大会堂", "人民大会", "人民", "大会堂", "大会", "会堂"
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国人民大会堂"
}
# splits into "中华人民共和国", "人民大会堂"
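
The difference between fine-grained and coarse-grained segmentation can be illustrated with a small dictionary-based toy. This is not IK's actual algorithm: `fine_grained` simply lists every dictionary word found in the text, `coarse_grained` uses forward maximum matching, and `DICTIONARY` is a hypothetical mini word list.

```python
# Hypothetical mini-dictionary; IK's real main.dic is far larger.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民",
    "共和国", "共和", "国人", "人民大会堂", "人民大会", "大会堂", "大会", "会堂",
}
MAX_LEN = max(len(word) for word in DICTIONARY)


def fine_grained(text):
    """List every dictionary word found at any position (overlaps allowed)."""
    found = []
    for start in range(len(text)):
        # Try the longest candidate first, then shorter ones.
        for end in range(min(len(text), start + MAX_LEN), start, -1):
            if text[start:end] in DICTIONARY:
                found.append(text[start:end])
    return found


def coarse_grained(text):
    """Forward maximum matching: take the longest word, then jump past it."""
    found, start = [], 0
    while start < len(text):
        for end in range(min(len(text), start + MAX_LEN), start, -1):
            if text[start:end] in DICTIONARY:
                break
        else:
            end = start + 1  # no dictionary word starts here: emit one character
        found.append(text[start:end])
        start = end
    return found


print(len(fine_grained("中华人民共和国人民大会堂")))  # 15
print(coarse_grained("中华人民共和国人民大会堂"))  # ['中华人民共和国', '人民大会堂']
```

Fine-grained output overlaps heavily, which suits indexing; coarse-grained output covers the text once with the longest words.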

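Using the IK analyzers

Index with ik_max_word and search with ik_smart, so that documents are indexed under every candidate term while queries are split into fewer, longer terms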

PUT /my_index
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

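Configuration files in plugins/ik/config/

IKAnalyzer.cfg.xml: configures custom dictionaries
main.dic: IK's built-in Chinese dictionary, with more than 270,000 entries; any word listed here is kept together as a single token
preposition.dic: prepositions
quantifier.dic: units and measure words
suffix.dic: suffixes
surname.dic: Chinese surnames
stopword.dic: English stop words

The two most important built-in IK files:

main.dic
stopword.dic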

Custom extension dictionary

Create a .dic file in elasticsearch/plugins/ik/config/, with one word per line
vi new_word.dic

马云
王者荣耀
公式相声

Edit the IKAnalyzer.cfg.xml file (vi IKAnalyzer.cfg.xml)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- configure your own extension dictionary here -->
        <entry key="ext_dict">new_word.dic</entry>
        <!-- configure your own extension stop-word dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- configure a remote extension stop-word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

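Restart Elasticsearch after the change.
Then re-run an _analyze request to check that the new words come back as single tokens.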

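Hot-updating the dictionary

The officially documented approach hot-updates the dictionary by having the plugin request a URL over HTTP (the remote_ext_dict entry in IKAnalyzer.cfg.xml)
The IK dictionary can also be hot-updated from MySQL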
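
As a rough sketch of the HTTP approach: as far as I recall from the IK README, the plugin polls the configured URL and reloads the word list when the `Last-Modified` or `ETag` response header changes, expecting one word per line in UTF-8. The server below is a minimal illustration under that assumption; `WORDS`, `LAST_CHANGED`, the host and the port are all hypothetical.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory word list; a real setup would read a file or a database.
WORDS = ["马云", "王者荣耀"]
LAST_CHANGED = 1712478600.0  # hypothetical Unix timestamp of the last edit


def build_response(words, changed_at):
    """Return (headers, body) for a one-word-per-line UTF-8 dictionary."""
    body = ("\n".join(words) + "\n").encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
        # The ETag changes whenever the word list changes.
        "ETag": '"' + hashlib.sha256(body).hexdigest() + '"',
        "Last-Modified": formatdate(changed_at, usegmt=True),
    }
    return headers, body


class DictHandler(BaseHTTPRequestHandler):
    def _send(self, include_body):
        headers, body = build_response(WORDS, LAST_CHANGED)
        self.send_response(200)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_HEAD(self):
        # Header-only check, so a poller can detect changes cheaply.
        self._send(include_body=False)

    def do_GET(self):
        self._send(include_body=True)


def serve(host="127.0.0.1", port=8000):
    """Blocking call that serves the dictionary until interrupted."""
    HTTPServer((host, port), DictHandler).serve_forever()
```

Calling `serve()` and setting `remote_ext_dict` to the server's URL would be the way to try it; the host must be reachable from every Elasticsearch node.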

Posted 2024-04-07 16:30 by py卡卡
