Elasticsearch Text analysis

This document introduces analyzers: how to use the built-in analyzers and how to define custom ones.

Concepts

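An analyzer is usually made up of three parts:

Character filters: zero or more. They transform the raw text before tokenization, for example by converting or replacing characters.

Tokenizer: exactly one. For example, it splits "Quick brown fox!" into [Quick, brown, fox!].

Token filters: zero or more. For example, they convert tokens to lowercase.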
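
The order in which these parts run can be pictured as a small pipeline. The following is a minimal Python sketch, not Elasticsearch code: the names analyze, ampersand_to_and, whitespace_tokenizer and lowercase_filter are hypothetical and exist only to illustrate the flow from character filters to tokenizer to token filters.

```python
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Run text through char filters, one tokenizer, then token filters."""
    for char_filter in char_filters:    # zero or more, applied in order
        text = char_filter(text)
    tokens = tokenizer(text)            # exactly one
    for token_filter in token_filters:  # zero or more, applied in order
        tokens = token_filter(tokens)
    return tokens


def ampersand_to_and(text: str) -> str:
    """Hypothetical character filter: rewrite '&' as the word 'and'."""
    return text.replace("&", "and")


def whitespace_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Hypothetical token filter: lowercase every token."""
    return [token.lower() for token in tokens]


print(analyze("Quick brown fox!", [], whitespace_tokenizer, []))
print(analyze("Quick brown fox!", [], whitespace_tokenizer, [lowercase_filter]))
```

The first call reproduces the tokenizer example above; the second adds the lowercase step on top of it.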

Analyzers are applied at both index time and search time. The two stages usually use the same analyzer, but they can use different ones.
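
In a field mapping, the "analyzer" parameter sets the analyzer used at index time, and the optional "search_analyzer" parameter overrides it at search time. The sketch below builds such a mapping as a plain Python dictionary. The helper text_field and the field name title are hypothetical, and pairing the built-in standard and simple analyzers is only an illustration, not a recommendation.

```python
import json
from typing import Optional


def text_field(index_analyzer: str, search_analyzer: Optional[str] = None) -> dict:
    """Build a text field mapping; omit search_analyzer to reuse the index analyzer."""
    field = {"type": "text", "analyzer": index_analyzer}
    if search_analyzer is not None:
        field["search_analyzer"] = search_analyzer
    return field


# Hypothetical index body: "title" is analyzed with "standard" when indexing
# and with "simple" when searching.
body = {"mappings": {"properties": {"title": text_field("standard", "simple")}}}
print(json.dumps(body, indent=2))
```

When search_analyzer is left out, the same analyzer is used for both stages, which is the usual case.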

Configure text analysis

Before using an analyzer, you can test whether its output matches what you expect:

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}
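
To get a feel for what this request returns, the whitespace analyzer can be approximated locally. The sketch below is a rough Python stand-in, not the Elasticsearch implementation: the function whitespace_analyze is hypothetical, and the dictionary keys only mirror the general shape of the tokens in an _analyze response (token, offsets, type, position).

```python
import re
from typing import Dict, List


def whitespace_analyze(text: str) -> List[Dict]:
    """Approximate the whitespace analyzer: split on whitespace, change nothing else."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),         # token text, case and punctuation kept
            "start_offset": match.start(),  # character offset where the token starts
            "end_offset": match.end(),      # offset just past the token's last character
            "type": "word",
            "position": position,           # ordinal position in the token stream
        })
    return tokens


for token in whitespace_analyze("The quick brown fox."):
    print(token)
```

Note that the final token keeps its trailing period, because this analyzer only splits on whitespace.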

Using a built-in analyzer (here, the standard analyzer configured with English stop words):

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": { 
          "type":      "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type":     "text",
        "analyzer": "standard", 
        "fields": {
          "english": {
            "type":     "text",
            "analyzer": "std_english" 
          }
        }
      }
    }
  }
}
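
With this mapping, my_text is analyzed by the unmodified standard analyzer, while my_text.english goes through std_english, which additionally drops English stop words. The difference can be sketched in Python under two loud assumptions: a simple \w+ regular expression stands in for the standard tokenizer, and STOPWORDS_SUBSET is a small hand-picked subset of the _english_ list, not the full list. The function name standard_like is hypothetical.

```python
import re
from typing import FrozenSet, List

# Assumption: a few entries only, NOT the complete _english_ stop word list.
STOPWORDS_SUBSET: FrozenSet[str] = frozenset({"a", "an", "and", "is", "of", "the", "to"})


def standard_like(text: str, stopwords: FrozenSet[str] = frozenset()) -> List[str]:
    """Rough stand-in: split on word characters, lowercase, then drop stop words."""
    tokens = [token.lower() for token in re.findall(r"\w+", text)]
    return [token for token in tokens if token not in stopwords]


text = "The old brown cow"
print(standard_like(text))                    # like the my_text field
print(standard_like(text, STOPWORDS_SUBSET))  # like the my_text.english field
```

The second call removes "the", which is the kind of difference to expect between the two fields.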

Create a custom analyzer

Specify an analyzer

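The following index body defines a custom analyzer and registers it under the name "default", which makes it the default analyzer of the index: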

{
  "settings": {
    "index": {
      "refresh_interval": "1s",
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "mapping.total_fields.limit": 5000,
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "standard",
            "char_filter": [
              "my_mappings_char_filter"
            ],
            "filter": [
              "lowercase",
              "asciifolding"
            ]
          }
        },
        "char_filter": {
          "my_mappings_char_filter": {
            "type": "mapping",
            "mappings": [
              "_ => -"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "dynamic": "true",
    "dynamic_date_formats": [
      "yyyy-MM-dd HH:mm:ss"
    ],
    "properties": {
      "name": {
        "type": "text"
      },
      "createdAt": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      "createdBy": {
        "type": "keyword"
      },
      "status": {
        "type": "integer"
      },
      "updatedAt": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      "updatedBy": {
        "type": "keyword"
      }
    }
  }
}
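
The effect of this default analyzer can be approximated step by step. The Python sketch below is an illustration under stated assumptions, not the Elasticsearch implementation: a \w+ regular expression stands in for the standard tokenizer, Unicode decomposition stands in for the asciifolding filter (which covers more characters than this), and all function names are hypothetical. It suggests why the "_ => -" mapping is useful: an underscore-joined word would otherwise stay a single token.

```python
import re
import unicodedata
from typing import List


def mapping_char_filter(text: str) -> str:
    """Mimic the "_ => -" mapping: replace every underscore with a hyphen."""
    return text.replace("_", "-")


def ascii_fold(token: str) -> str:
    """Approximate asciifolding: decompose characters and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def default_analyzer(text: str) -> List[str]:
    """Char filter, then tokenizer, then lowercase and folding, in that order."""
    text = mapping_char_filter(text)
    tokens = re.findall(r"\w+", text)  # assumption: stand-in for the standard tokenizer
    tokens = [token.lower() for token in tokens]
    return [ascii_fold(token) for token in tokens]


print(default_analyzer("Caf\u00e9_Au_Lait"))
```

Without the character filter, the \w+ stand-in would keep the whole input as one token, since the underscore counts as a word character.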

Anatomy of an analyzer

posted on 2021-11-11 10:28 icodegarden