Elasticsearch Mapping: Field data types

Binary accepts a Base64-encoded string, which must not contain embedded newlines (\n). By default the field is not stored and is not searchable.

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "blob": {
        "type": "binary"
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "name": "Some binary blob",
  "blob": "U29tZSBiaW5hcnkgYmxvYg==" 
}

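The following parameters are supported:

`doc_values`	Whether the field is stored on disk in a column-stride fashion so that it can be used for sorting, aggregations, or scripting. true or false (default false).
`store`	Whether the field value is stored and retrievable separately from _source. true or false (default false).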
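
As a small illustration of preparing a value for a binary field, the sketch below Base64-encodes raw bytes on the client side. The helper name `encode_blob` is made up for this example; only Python's standard library is used.

```python
import base64
import json


def encode_blob(raw: bytes) -> str:
    """Return the Base64 text form of raw bytes as a single line."""
    # b64encode never inserts line breaks (unlike base64.encodebytes),
    # which matters because the field rejects embedded newlines.
    return base64.b64encode(raw).decode("ascii")


# Build the same document body as in the indexing request above.
doc = {"name": "Some binary blob", "blob": encode_blob(b"Some binary blob")}
print(json.dumps(doc))
```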

Boolean accepts JSON true and false, or their string forms. False values: false, "false", "" (empty string). True values: true, "true".
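
The accepted inputs can be sketched as a small client-side check. This is only a model of the list above, not Elasticsearch's own parser, and `parse_boolean` is a hypothetical helper name.

```python
def parse_boolean(value):
    """Map an input to True/False following the accepted values listed above."""
    # "is True" / "is False" keep 1 and 0 from being accepted by accident.
    if value is True or value == "true":
        return True
    if value is False or value in ("false", ""):
        return False
    raise ValueError(f"not an accepted boolean value: {value!r}")


for candidate in (True, "true", False, "false", ""):
    print(repr(candidate), "->", parse_boolean(candidate))
```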

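The following parameters are supported:

`boost`	Query-time boost for the field's relevance score. A float, default 1.0.
`doc_values`	Whether the field is stored on disk in a column-stride fashion so that it can be used for sorting, aggregations, or scripting. true or false (default true).
`index`	Whether the field is searchable. Default true.
`null_value`	The value substituted for an explicit null. Defaults to null, which means the field is treated as missing. Cannot be set if the script parameter is used.
`on_script_error`	What to do if the script throws an error at index time. The default, `fail`, rejects the whole document; `continue` ignores the field and carries on indexing. Can only be set if script is also set.
`script`	If set, the field indexes values generated by this script instead of reading them from the input. If the input document already has a value for this field, the document is rejected with an error.
`store`	Whether the field value is stored and retrievable separately from _source. true or false (default false).
`meta`	Metadata about the field.

The following example uses a script, here in a runtime field definition: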

PUT my-index-000001/
{
  "mappings": {
    "runtime": {
      "day_of_week": {
        "type": "keyword",
        "script": {
          "source": "emit(doc['@timestamp'].value.dayOfWeekEnum.getDisplayName(TextStyle.FULL, Locale.ROOT))"
        }
      }
    },
    "properties": {
      "@timestamp": {"type": "date"}
    }
  }
}

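Keyword is a family made up of:

keyword, for structured content such as IDs, email addresses, status codes, and tags
constant_keyword, for fields that always hold the same value
wildcard, optimized for fields with large values or high cardinality. High cardinality here means the field's values are large and a high proportion of them are unique.

keyword: if an integer or long field never needs range queries, mapping it as keyword makes searches faster. If range queries are still needed, a multi-field can map the field as both keyword and numeric.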

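The following parameters are supported:

boost

Query-time boost for the field's relevance score. A float, default 1.0.

dimension

Whether the field is marked as a time series dimension. Default false. If true, the following constraints apply:

The doc_values and index parameters must be true.
Field values cannot be an array or multi-value.
Field values cannot be larger than 1024 bytes.
The field cannot use the normalizer parameter.

doc_values

Whether the field is stored on disk in a column-stride fashion so that it can be used for sorting, aggregations, or scripting. true or false (default true).

eager_global_ordinals

Default false. Whether global ordinals are built eagerly on index refresh, loading the terms into cache so that later terms aggregations run faster. Enabling this is a good idea on fields that are frequently used for terms aggregations.

Enabling eager_global_ordinals affects write performance, because new global ordinals are built on every refresh. To limit the extra cost of rebuilding global ordinals on frequent refreshes, consider increasing refresh_interval.

fields

Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations.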

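ignore_above

Values longer than this limit are not indexed. The default, 2147483647, accepts values of any size; fields created by dynamic mapping use 256 instead.

index

Whether the field is searchable. Default true.

index_options

What information is stored in the index for scoring purposes. Default docs; for other values see https://www.elastic.co/guide/en/elasticsearch/reference/7.15/index-options.html

norms

Whether field length is taken into account when scoring queries. Default false.

null_value

The value substituted for an explicit null. Defaults to null, which means the field is treated as missing. Cannot be set if the script parameter is used.

on_script_error

What to do if the script throws an error at index time. The default, fail, rejects the whole document; continue ignores the field and carries on indexing. Can only be set if script is also set.

script

If set, the field indexes values generated by this script instead of reading them from the input. If the input document already has a value for this field, the document is rejected with an error.

store

Whether the field value is stored and retrievable separately from _source. true or false (default false).

similarity

Which scoring algorithm or similarity is used. Default BM25. See https://www.elastic.co/guide/en/elasticsearch/reference/7.15/similarity.html

normalizer

How to pre-process the keyword before indexing. Defaults to null, meaning the keyword is kept as-is. A normalizer can, for example, lowercase the value. See https://www.elastic.co/guide/en/elasticsearch/reference/7.15/normalizer.html

split_queries_on_whitespace

Whether full text queries split the input on whitespace when building a query for this field. Default false, because this is a keyword field.

meta

Metadata about the field.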

Constant keyword is for fields whose value is the same for every document in the index.

PUT logs-debug
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "message": {
        "type": "text"
      },
      "level": {
        "type": "constant_keyword",
        "value": "debug"
      }
    }
  }
}

Here the level field can only hold the value debug; any other value is rejected. The following two requests are equivalent:

POST logs-debug/_doc
{
  "date": "2019-12-12",
  "message": "Starting up Elasticsearch",
  "level": "debug"
}

POST logs-debug/_doc
{
  "date": "2019-12-12",
  "message": "Starting up Elasticsearch"
}

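The following parameters are supported:

`meta`	Metadata about the field.
`value`	The constant value of the field. If not set, the value provided by the first indexed document is used. Setting it explicitly is recommended.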

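Wildcard is optimized for large values and high-cardinality fields.

If the field content is human-readable, such as an email body or a product description, full text queries are the better fit for search.

Typical use cases for wildcard are machine-generated, unstructured content where either:

The field contains more than a million unique values and you plan to search it with patterns that have a leading wildcard, such as *foo or *baz.

The field contains values larger than 32KB and you plan to search it with any wildcard pattern.

For example: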

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_wildcard": {
        "type": "wildcard"
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "my_wildcard" : "This string can be quite lengthy"
}

GET my-index-000001/_search
{
  "query": {
    "wildcard": {
      "my_wildcard": {
        "value": "*quite*lengthy"
      }
    }
  }
}

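The following parameters are supported:

`null_value`	The value substituted for an explicit null. Defaults to null, which means the field is treated as missing.
`ignore_above`	Values longer than this limit are not indexed. The default, 2147483647, accepts values of any size.

When running wildcard queries, any rewrite parameter is ignored. The score is always a constant.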
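
To make the pattern semantics of the query above concrete, here is a minimal Python sketch that treats `*` as "any run of characters" and `?` as "any single character" and requires the whole value to match. It illustrates the pattern syntax only; it says nothing about how the wildcard field is indexed internally, and the helper names are invented for this example.

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a wildcard pattern (* and ?) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern: str, value: str) -> bool:
    """True when the whole value matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None


value = "This string can be quite lengthy"
print(wildcard_match("*quite*lengthy", value))
print(wildcard_match("quite*", value))
```

The second call fails to match because the pattern has no leading `*`, so it would have to match from the first character of the value.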

Numeric is the number family, with the following types:

`long`	A signed 64-bit integer with a minimum value of `-2^63` and a maximum value of `2^63-1`.
`integer`	A signed 32-bit integer with a minimum value of `-2^31` and a maximum value of `2^31-1`.
`short`	A signed 16-bit integer with a minimum value of `-32,768` and a maximum value of `32,767`.
`byte`	A signed 8-bit integer with a minimum value of `-128` and a maximum value of `127`.
`double`	A double-precision 64-bit IEEE 754 floating point number, restricted to finite values.
`float`	A single-precision 32-bit IEEE 754 floating point number, restricted to finite values.
`half_float`	A half-precision 16-bit IEEE 754 floating point number, restricted to finite values.
`scaled_float`	A floating point number that is backed by a `long`, scaled by a fixed `double` scaling factor. The value is multiplied by scaling_factor and stored as a long, then divided by scaling_factor when read back.
`unsigned_long`	An unsigned 64-bit integer with a minimum value of 0 and a maximum value of `2^64-1`.

double, float, and half_float treat -0.0 and +0.0 as different values. As a result, a term query on -0.0 does not match +0.0, and vice versa. The same holds for range queries: if the upper bound is -0.0, +0.0 does not match; if the lower bound is +0.0, -0.0 does not match.

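The following parameters are supported:

`coerce`	Try to convert strings to numbers and truncate fractions for integers. Accepts true (default) and false. Not applicable to unsigned_long. Cannot be set if the script parameter is used.
`boost`	Query-time boost for the field's relevance score. A float, default 1.0.
`dimension`	Whether the field is marked as a time series dimension. Default false. If true, the following constraints apply: the `doc_values` and `index` parameters must be true, and field values cannot be an array or multi-value.
`doc_values`	Whether the field is stored on disk in a column-stride fashion so that it can be used for sorting, aggregations, or scripting. true or false (default true).
`ignore_malformed`	If true, malformed numbers are ignored. If false (default), malformed numbers throw an exception and the whole document is rejected. Cannot be set if the script parameter is used.
`index`	Whether the field is searchable. Default true.
`null_value`	The value substituted for an explicit null. Defaults to null, which means the field is treated as missing. If set, it must be a value of the same type as the field. Cannot be set if the script parameter is used.
`on_script_error`	What to do if the script throws an error at index time. The default, `fail`, rejects the whole document; `continue` ignores the field and carries on indexing. Can only be set if script is also set.
`script`	If set, the field indexes values generated by this script instead of reading them from the input. If the input document already has a value for this field, the document is rejected with an error. Scripts can only be configured on fields of type `long` and `double`.
`store`	Whether the field value is stored and retrievable separately from _source. true or false (default false).
`meta`	Metadata about the field.

scaled_float supports one additional parameter:

scaling_factor

This parameter is required for scaled_float.

Values are multiplied by this factor at index time and rounded to the closest long value. For example, a scaled_float with a scaling_factor of 10 stores 2.34 as 23, and all search-time operations (queries, aggregations, sorting) behave as if the document had a value of 2.3.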
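
The arithmetic behind scaling_factor can be sketched in a few lines of Python. The function names are made up, and the rounding rule used here (round half up) is an assumption standing in for "rounded to the closest long".

```python
import math


def scaled_float_encode(value: float, scaling_factor: float) -> int:
    """Multiply by the scaling factor and round to the closest integer."""
    # floor(x + 0.5) rounds half up; this choice is an assumption of the sketch.
    return math.floor(value * scaling_factor + 0.5)


def scaled_float_decode(stored: int, scaling_factor: float) -> float:
    """Recover the value that searches, aggregations, and sorting will see."""
    return stored / scaling_factor


stored = scaled_float_encode(2.34, 10)
print(stored, scaled_float_decode(stored, 10))
```

A larger scaling_factor keeps more decimal places at the cost of larger stored integers; with a factor of 100 the same input would be stored as 234.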

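Dates is the date family, which contains date and date_nanos.

Date

Since JSON has no date type, dates in ES can be strings containing formatted dates such as "2015-01-01" or "2015/01/01 12:10:30", or a number counting milliseconds or seconds since the epoch (1970).

Dates are always rendered as strings in results, even if they were originally supplied as a long in the JSON document.

Internally, dates are converted to UTC (if a time zone is specified) and stored as a long. Queries on dates are also converted to long for comparison, and the results are formatted back to strings as requested.

If no format is specified, the default "strict_date_optional_time||epoch_millis" is used. Examples of strict_date_optional_time: yyyy-MM-dd'T'HH:mm:ss.SSSZ or yyyy-MM-dd.

All of the following are valid: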

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "date": {
        "type": "date" 
      }
    }
  }
}

PUT my-index-000001/_doc/1
{ "date": "2015-01-01" } 

PUT my-index-000001/_doc/2
{ "date": "2015-01-01T12:10:30Z" } 

PUT my-index-000001/_doc/3
{ "date": 1420070400001 }
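
To see what "stored as a long" means for the three documents above, the following sketch converts each input to milliseconds since the epoch. It only handles the exact shapes used in this example (a date, a UTC timestamp ending in Z, or an integer already in milliseconds); `to_epoch_millis` is a hypothetical helper, not an Elasticsearch API.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(value) -> int:
    """Convert a date string or epoch-millis integer to epoch milliseconds."""
    if isinstance(value, int):
        return value  # already milliseconds since the epoch
    for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d"):
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        # Both formats are interpreted as UTC in this sketch.
        return (parsed.replace(tzinfo=timezone.utc) - EPOCH) // timedelta(milliseconds=1)
    raise ValueError(f"unsupported date value: {value!r}")


for doc_value in ("2015-01-01", "2015-01-01T12:10:30Z", 1420070400001):
    print(doc_value, "->", to_epoch_millis(doc_value))
```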

The following mapping specifies the accepted formats explicitly:

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "date": {
        "type":   "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}

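The following parameters are supported:

`boost`	Query-time boost for the field's relevance score. A float, default 1.0.
`doc_values`	Whether the field is stored on disk in a column-stride fashion so that it can be used for sorting, aggregations, or scripting. true or false (default true).
`format`	The date format(s) that can be parsed. Default `strict_date_optional_time||epoch_millis`.
`locale`	The locale used when parsing dates. Default is the `ROOT` locale.
`ignore_malformed`	If true, malformed dates are ignored. If false (default), malformed dates throw an exception and the whole document is rejected. Cannot be set if the script parameter is used.
`index`	Whether the field is searchable. Default true.
`null_value`	The value substituted for an explicit null. Defaults to null, which means the field is treated as missing. If set, it must be a value of the same type as the field. Cannot be set if the script parameter is used.
`on_script_error`	What to do if the script throws an error at index time. The default, `fail`, rejects the whole document; `continue` ignores the field and carries on indexing. Can only be set if script is also set.
`script`	If set, the field indexes values generated by this script instead of reading them from the input. If the input document already has a value for this field, the document is rejected with an error. The values produced by the script must match the configured format.
`store`	Whether the field value is stored and retrievable separately from _source. true or false (default false).
`meta`	Metadata about the field.

Date nanos complements Date. It stores values at nanosecond resolution, which limits its range to roughly 1970 to 2262.

The default format is "strict_date_optional_time_nanos||epoch_millis".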
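
The 2262 upper bound follows from simple arithmetic if one assumes, as this sketch does, that the value is a signed 64-bit count of nanoseconds since the epoch. The function name is invented for the illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_SIGNED_LONG = 2**63 - 1  # largest value a signed 64-bit integer can hold


def max_date_nanos() -> datetime:
    """Latest instant representable as signed 64-bit nanoseconds since 1970."""
    # datetime only has microsecond resolution, so the last three digits are dropped.
    return EPOCH + timedelta(microseconds=MAX_SIGNED_LONG // 1000)


print(max_date_nanos().isoformat())
```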

Alias defines an alternate name for an existing field.

PUT trips
{
  "mappings": {
    "properties": {
      "distance": {
        "type": "long"
      },
      "route_length_miles": {
        "type": "alias",
        "path": "distance" 
      },
      "transit_mode": {
        "type": "keyword"
      }
    }
  }
}

GET _search
{
  "query": {
    "range" : {
      "route_length_miles" : {
        "gte" : 39
      }
    }
  }
}

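There are a few restrictions on the target of an alias:

The target must be a concrete field, not an object or another field alias.

The target field must exist at the time the alias is created.

If nested objects are defined, a field alias must have the same nested scope as its target.

An alias can only have one target.

Writes to field aliases are not supported: using an alias in an index or update request fails.

Because the alias does not exist in the document source, aliases cannot be used when filtering the source. For example, the following request returns an empty result for _source: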

GET /_search
{
  "query" : {
    "match_all": {}
  },
  "_source": "route_length_miles"
}

Object is the type for inner objects.

PUT my-index-000001
{
  "mappings": {
    "properties": { 
      "region": {
        "type": "keyword"
      },
      "manager": { 
        "properties": {
          "age":  { "type": "integer" },
          "name": { 
            "properties": {
              "first": { "type": "text" },
              "last":  { "type": "text" }
            }
          }
        }
      }
    }
  }
}

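The following parameters are supported:

`dynamic`	Same as the mapping-level dynamic setting. Default true; accepts `true`, `runtime`, `false`, and `strict`.
`enabled`	Whether the JSON value given for the object field is parsed and indexed. Default true; when false, the object is ignored.
`properties`	Same as the mapping-level properties: defines the fields within the object.

To map an array of objects, see Nested.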

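Flattened

The flattened type offers an alternative to object: it maps an entire object as a single field. Given an object, the flattened mapping parses out its leaf values and indexes them into one field as keywords. The object's contents can then be searched through simple queries and aggregations.

This data type is useful for indexing objects with a large or unknown number of unique keys. Only one field mapping is created for the whole JSON object, which helps prevent a mapping explosion from too many distinct field mappings.

On the other hand, flattened object fields are a trade-off in terms of search functionality. Only basic queries are allowed, with no support for numeric range queries or highlighting.

The flattened mapping type should not be used for indexing all document content, because it treats all values as keywords and does not provide full search functionality. The default approach, where each sub-field has its own entry in the mapping, works well in the majority of cases.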

PUT bug_reports
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "labels": {
        "type": "flattened"
      }
    }
  }
}

POST bug_reports/_doc/1
{
  "title": "Results are not sorted correctly.",
  "labels": {
    "priority": "urgent",
    "release": ["v1.2.5", "v1.3.0"],
    "timestamp": {
      "created": 1541458026,
      "closed": 1541457010
    }
  }
}

POST bug_reports/_search
{
  "query": {
    "term": {"labels": "urgent"}
  }
}

POST bug_reports/_search
{
  "query": {
    "term": {"labels.release": "v1.3.0"}
  }
}
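
The two queries above can be pictured with a small Python model of "parse out the leaf values and index them as keywords". This is only a sketch of the idea described earlier, not the actual index layout, and all helper names are hypothetical.

```python
def flatten_leaves(obj, prefix=""):
    """Yield (dotted_key, keyword_value) pairs for every leaf value."""
    if isinstance(obj, dict):
        for key, child in obj.items():
            yield from flatten_leaves(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for item in obj:
            yield from flatten_leaves(item, prefix)
    else:
        # Every leaf is treated as a keyword, so numbers become strings.
        yield prefix, str(obj)


labels = {
    "priority": "urgent",
    "release": ["v1.2.5", "v1.3.0"],
    "timestamp": {"created": 1541458026, "closed": 1541457010},
}
leaves = list(flatten_leaves(labels, "labels"))


def term_on_field(value: str) -> bool:
    """Model of a term query on the top-level field: match any leaf value."""
    return any(leaf_value == value for _, leaf_value in leaves)


def term_on_key(key: str, value: str) -> bool:
    """Model of a term query on a specific key such as labels.release."""
    return (key, value) in leaves


print(term_on_field("urgent"), term_on_key("labels.release", "v1.3.0"))
```

Because the timestamps end up as the strings "1541458026" and "1541457010", comparisons on them are string comparisons, which is why numeric range queries are not available on this type.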

Query types supported by flattened:

term, terms, and terms_set
prefix
range
match and multi_match
query_string and simple_query_string
exists

The following parameters are supported:

`boost`	Mapping field-level query time boosting. Accepts a floating point number, defaults to `1.0`.
`depth_limit`	The maximum allowed depth of the flattened object field, in terms of nested inner objects. If a flattened object field exceeds this limit, then an error will be thrown. Defaults to `20`. Note that `depth_limit` can be updated dynamically through the update mapping API.
`doc_values`	Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts `true` (default) or `false`.
`eager_global_ordinals`	Should global ordinals be loaded eagerly on refresh? Accepts `true` or `false` (default). Enabling this is a good idea on fields that are frequently used for terms aggregations.
`ignore_above`	Leaf values longer than this limit will not be indexed. By default, there is no limit and all values will be indexed. Note that this limit applies to the leaf values within the flattened object field, and not the length of the entire field.
`index`	Determines if the field should be searchable. Accepts `true` (default) or `false`.
`index_options`	What information should be stored in the index for scoring purposes. Defaults to `docs` but can also be set to `freqs` to take term frequency into account when computing scores.
`null_value`	A string value which is substituted for any explicit `null` values within the flattened object field. Defaults to `null`, which means null fields are treated as if they were missing.
`similarity`	Which scoring algorithm or similarity should be used. Defaults to `BM25`.
`split_queries_on_whitespace`	Whether full text queries should split the input on whitespace when building a query for this field. Accepts `true` or `false` (default).

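Nested is a specialized version of the object data type that allows arrays of objects to be indexed so that they can be queried independently of each other.

Consider the following case, which shows what happens with the object type: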

PUT my-index-000001/_doc/1
{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}
Internally, the document is transformed into something like this:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

The association between alice and white is lost, so the following query also returns the document:

GET my-index-000001/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}

After switching to the nested type:

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "user": {
        "type": "nested" 
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}
nested keeps each object independent:

GET my-index-000001/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} 
          ]
        }
      }
    }
  }
}
The query above does not match, because Alice and Smith are not in the same nested object.

GET my-index-000001/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "White" }} 
          ]
        }
      },
      "inner_hits": { 
        "highlight": {
          "fields": {
            "user.first": {}
          }
        }
      }
    }
  }
}
The query above matches.
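
The difference between the two behaviours can be reproduced with plain Python data structures. The sketch below is a simplified model (lowercased exact matching instead of real text analysis), and the function names are invented for this illustration.

```python
users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]


def object_match(objects, first: str, last: str) -> bool:
    """Object-type model: values are flattened into one set per field."""
    firsts = {o["first"].lower() for o in objects}
    lasts = {o["last"].lower() for o in objects}
    # The link between a first name and its last name is gone here.
    return first.lower() in firsts and last.lower() in lasts


def nested_match(objects, first: str, last: str) -> bool:
    """Nested-type model: both conditions must hold within the same object."""
    return any(
        o["first"].lower() == first.lower() and o["last"].lower() == last.lower()
        for o in objects
    )


print(object_match(users, "Alice", "Smith"))  # cross-object match
print(nested_match(users, "Alice", "Smith"))
print(nested_match(users, "Alice", "White"))
```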

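Nested documents can be:

queried with the nested query.
analyzed with the nested and reverse_nested aggregations.
sorted with nested sorting.
retrieved and highlighted with nested inner hits.

nested supports the following parameters:

dynamic: Same as the mapping-level dynamic setting. Default true; accepts true, false, and strict.
properties: Same as the mapping-level properties: defines the fields within the nested object.

include_in_parent: If true, all fields in the nested object are also added to the parent document. Default false.

include_in_root: If true, all fields in the nested object are also added to the root document. Default false.

Each nested object is indexed as a separate Lucene document. Continuing the example above, indexing a single document containing 100 user objects creates 101 Lucene documents: one for the parent document and one for each nested object. Because of this cost, ES provides settings to guard against performance problems:

index.mapping.nested_fields.limit
    The maximum number of nested fields in an index. Default 50.

index.mapping.nested_objects.limit
    The maximum number of nested objects in a single document. Default 10000.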

Join is a field type that creates parent/child relations between documents.

Define the mapping:
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_id": {
        "type": "keyword"
      },
      "my_join_field": { 
        "type": "join",
        "relations": {
          "question": "answer" 
        }
      }
    }
  }
}

Parent 1:

PUT my-index-000001/_doc/1?refresh
{
  "my_id": "1",
  "text": "This is a question",
  "my_join_field": "question" 
}

Parent 2:
PUT my-index-000001/_doc/2?refresh
{
  "my_id": "2",
  "text": "This is another question",
  "my_join_field": "question"
}

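The following are child documents. Because a parent and its children must be on the same shard, the children are indexed with a routing value; by default, routing is based on the document id.

PUT my-index-000001/_doc/3?routing=1&refresh
{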
  "my_id": "3",
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer", 
    "parent": "1" 
  }
}

PUT my-index-000001/_doc/4?routing=1&refresh
{
  "my_id": "4",
  "text": "This is another answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}
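
Why `routing=1` keeps the children next to their parent can be sketched as "hash the routing value, take it modulo the number of shards". The hash below (`zlib.crc32`) is only a stand-in chosen for this illustration and is not the hash Elasticsearch uses; the shard count of 5 and the helper names are likewise assumptions.

```python
import zlib


def routing_value(doc_id: str, routing: "str | None" = None) -> str:
    """Use the explicit routing value when given, else fall back to the document id."""
    return routing if routing is not None else doc_id


def shard_for(routing: str, number_of_shards: int) -> int:
    """Pick a shard from a routing value (stand-in hash, for illustration only)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_shards


NUMBER_OF_SHARDS = 5  # assumed for the example

parent_shard = shard_for(routing_value("1"), NUMBER_OF_SHARDS)
child_3_shard = shard_for(routing_value("3", routing="1"), NUMBER_OF_SHARDS)
child_4_shard = shard_for(routing_value("4", routing="1"), NUMBER_OF_SHARDS)
print(parent_shard, child_3_shard, child_4_shard)
```

All three values are equal because the children are routed by the parent's id rather than their own.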

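A join in ES is not used the way a join is used in a relational database. Using has_child or has_parent adds a significant performance cost. The only scenario where it fits is a one-to-many relationship, for example the relationship between a product model and the individual products.

Restrictions on parent join:

Only one join field is allowed per index.

Parent and child documents must be indexed on the same shard. The same routing value must be provided when getting, deleting, or updating a child document.

An element can have multiple children but only one parent.

A new relation can be added to an existing join field. A child can also be added to an existing element, but only if that element is already a parent.

One parent with multiple children: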

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
          "question": ["answer", "comment"]  
        }
      }
    }
  }
}

Multiple levels of join:

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
          "question": ["answer", "comment"],  
          "answer": "vote" 
        }
      }
    }
  }
}
The relations look like this:

   question
    /    \
   /      \
comment  answer
           |
           |
          vote

PUT my-index-000001/_doc/3?routing=1&refresh 
{
  "text": "This is a vote",
  "my_join_field": {
    "name": "vote",
    "parent": "2" 
  }
}
The routing of the document above must match that of its parent and grand-parent, and the parent value is the id of an answer document.

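Range field types represent a continuous range of values between an upper and a lower bound. For example, a range can represent any date in October, or any integer from 0 to 9. The lower bound is defined with the gt or gte operator and the upper bound with lt or lte. Range fields can be used in queries and have limited support in aggregations: the only supported aggregations are histogram and cardinality.

range is also a family, with the following concrete types:

`integer_range`	A range of signed 32-bit integers with a minimum value of `-2^31` and maximum of `2^31-1`.
`float_range`	A range of single-precision 32-bit IEEE 754 floating point values.
`long_range`	A range of signed 64-bit integers with a minimum value of `-2^63` and maximum of `2^63-1`.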
`double_range`	A range of double-precision 64-bit IEEE 754 floating point values.
`date_range`	A range of `date` values. Date ranges support various date formats through the `format` mapping parameter. Regardless of the format used, date values are parsed into an unsigned 64-bit integer representing milliseconds since the Unix epoch in UTC. Values containing the `now` date math expression are not supported.
`ip_range`	A range of ip values supporting either IPv4 or IPv6 (or mixed) addresses.

Consider the following example:

PUT range_index
{
  "settings": {
    "number_of_shards": 2
  },
  "mappings": {
    "properties": {
      "expected_attendees": {
        "type": "integer_range"
      },
      "time_frame": {
        "type": "date_range", 
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}

PUT range_index/_doc/1?refresh
{
  "expected_attendees" : { 
    "gte" : 10,
    "lt" : 20
  },
  "time_frame" : {
    "gte" : "2015-10-31 12:00:00", 
    "lte" : "2015-11-01"
  }
}
Because 12 falls within the expected_attendees range of 10 to 20, the following term query finds the document:

GET range_index/_search
{
  "query" : {
    "term" : {
      "expected_attendees" : {
        "value": 12
      }
    }
  }
}

Another example, a query on the date range:

GET range_index/_search
{
  "query" : {
    "range" : {
      "time_frame" : { 
        "gte" : "2015-10-31",
        "lte" : "2015-11-01",
        "relation" : "within" 
      }
    }
  }
}
The relation parameter can be WITHIN, CONTAINS, or INTERSECTS (the default).
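
The three relations can be modelled with closed integer intervals. This is a sketch of the set relationships only; the `Range` class and `matches` function are hypothetical names, and the example stores `gte 10, lt 20` as the inclusive interval [10, 19].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    low: int   # inclusive lower bound
    high: int  # inclusive upper bound


def matches(field: Range, query: Range, relation: str = "intersects") -> bool:
    """Check how the indexed field range relates to the query range."""
    relation = relation.lower()
    if relation == "within":
        # The field range lies entirely inside the query range.
        return query.low <= field.low and field.high <= query.high
    if relation == "contains":
        # The field range entirely covers the query range.
        return field.low <= query.low and query.high <= field.high
    if relation == "intersects":
        # The two ranges share at least one value.
        return field.low <= query.high and query.low <= field.high
    raise ValueError(f"unknown relation: {relation!r}")


expected_attendees = Range(10, 19)  # gte 10, lt 20 on integers

print(matches(expected_attendees, Range(12, 12), "contains"))
print(matches(expected_attendees, Range(15, 30), "within"))
print(matches(expected_attendees, Range(15, 30)))
```

The first call mirrors the term query for 12 shown earlier: a single value behaves like a one-point range that the field range must contain.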

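The range family supports the following parameters:

`coerce`	Try to convert strings to numbers and truncate fractions for integers. Accepts true (default) and false.
`boost`	Query-time boost for the field's relevance score. Default 1.0.
`index`	Whether the field is searchable. Default true.
`store`	Whether the field value is stored and retrievable separately from _source. true or false (default false).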

IP: the value can be an IPv4 or IPv6 address.

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "ip_addr": {
        "type": "ip"
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "ip_addr": "192.168.1.1"
}

GET my-index-000001/_search
{
  "query": {
    "term": {
      "ip_addr": "192.168.0.0/16"
    }
  }
}
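
The term query above matches because 192.168.1.1 belongs to the 192.168.0.0/16 block. The same membership test can be checked with Python's standard `ipaddress` module; `ip_term_matches` is a made-up helper for this illustration.

```python
import ipaddress


def ip_term_matches(stored_ip: str, term: str) -> bool:
    """True if the stored address equals the term or falls inside its CIDR block."""
    address = ipaddress.ip_address(stored_ip)
    if "/" in term:
        # CIDR notation: test membership in the network.
        return address in ipaddress.ip_network(term, strict=False)
    return address == ipaddress.ip_address(term)


print(ip_term_matches("192.168.1.1", "192.168.0.0/16"))
print(ip_term_matches("10.0.0.1", "192.168.0.0/16"))
```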

The following parameters are supported:

`boost`	Mapping field-level query time boosting. Accepts a floating point number, defaults to `1.0`.
`dimension`	For internal use by Elastic only. Marks the field as a time series dimension. Accepts `true` or `false` (default). The `index.mapping.dimension_fields.limit` index setting limits the number of dimensions in an index. Dimension fields have the following constraints: The `doc_values` and `index` mapping parameters must be `true`. Field values cannot be an array or multi-value.
`doc_values`	Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts `true` (default) or `false`.
`ignore_malformed`	If `true`, malformed IP addresses are ignored. If `false` (default), malformed IP addresses throw an exception and reject the whole document. Note that this cannot be set if the `script` parameter is used.
`index`	Should the field be searchable? Accepts `true` (default) and `false`.
`null_value`	Accepts an IPv4 or IPv6 value which is substituted for any explicit `null` values. Defaults to `null`, which means the field is treated as missing. Note that this cannot be set if the `script` parameter is used.
`on_script_error`	Defines what to do if the script defined by the `script` parameter throws an error at indexing time. Accepts `fail` (default), which will cause the entire document to be rejected, and `continue`, which will register the field in the document’s `_ignored` metadata field and continue indexing. This parameter can only be set if the `script` field is also set.
`script`	If this parameter is set, then the field will index values generated by this script, rather than reading the values directly from the source. If a value is set for this field on the input document, then the document will be rejected with an error. Scripts are in the same format as their runtime equivalent, and should emit strings containing IPv4 or IPv6 formatted addresses.
`store`	Whether the field value should be stored and retrievable separately from the `_source` field. Accepts `true` or `false` (default).

Version is a specialized keyword type that follows version comparison rules, for example "2.1.0" < "2.4.1" < "2.11.2", and pre-release versions sort before release versions (that is, "1.0.0-alpha" < "1.0.0").
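
The ordering in these examples differs from plain string ordering, where "2.11.2" would sort before "2.4.1". A simplified sort key that reproduces the examples is sketched below; it is not a full implementation of version precedence rules, and `version_key` is a hypothetical name.

```python
def version_key(version: str):
    """Sort key: numeric core parts first, then releases after pre-releases."""
    core, _, prerelease = version.partition("-")
    numbers = tuple(int(part) for part in core.split("."))
    # False sorts before True, so "1.0.0-alpha" comes before "1.0.0".
    return (numbers, prerelease == "", prerelease)


versions = ["2.11.2", "1.0.0", "2.1.0", "1.0.0-alpha", "2.4.1"]
print(sorted(versions, key=version_key))
print(sorted(versions))  # plain string order, for comparison
```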

Supported parameters:

meta

Metadata about the field.

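Murmur3: murmur3 is a high-performance hash algorithm. In ES it is sometimes useful when running cardinality aggregations on high-cardinality and large string fields.

The plugin must first be installed on every node in the cluster, and each node must be restarted: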

sudo bin/elasticsearch-plugin install mapper-murmur3

The node must be stopped before removing the plugin:

sudo bin/elasticsearch-plugin remove mapper-murmur3

aggregate_metric_double is a field type that stores pre-aggregated metrics, which simplifies aggregations.

An example:

PUT stats-index
{
  "mappings": {
    "properties": {
      "agg_metric": {
        "type": "aggregate_metric_double",
        "metrics": [ "min", "max", "sum", "value_count" ],
        "default_metric": "max"
      }
    }
  }
}

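Above, agg_metric is the field name, its type is aggregate_metric_double, metrics lists the metric sub-fields it stores, and default_metric is the metric used by default for agg_metric (for example, when matching a value in a term query).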

PUT stats-index/_doc/1
{
  "agg_metric": {
    "min": -302.50,
    "max": 702.30,
    "sum": 200.0,
    "value_count": 25
  }
}

PUT stats-index/_doc/2
{
  "agg_metric": {
    "min": -93.00,
    "max": 1702.30,
    "sum": 300.00,
    "value_count": 25
  }
}
These requests add 2 documents. Note how the fields of each document correspond to the metrics in the mapping.

POST stats-index/_search?size=0
{
  "aggs": {
    "metric_min": { "min": { "field": "agg_metric" } },
    "metric_max": { "max": { "field": "agg_metric" } },
    "metric_value_count": { "value_count": { "field": "agg_metric" } },
    "metric_sum": { "sum": { "field": "agg_metric" } },
    "metric_avg": { "avg": { "field": "agg_metric" } }
  }
}
The aggregations above read naturally: each one uses the matching pre-aggregated metric.

GET stats-index/_search
{
  "query": {
    "term": {
      "agg_metric": {
        "value": 702.30
      }
    }
  }
}
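The value here is matched against default_metric.

Supported parameters:

metrics: an array that may only contain min, max, sum, and value_count. At least one must be specified.

default_metric: must be one of the entries configured in metrics.

min: aggregations take the minimum of all min values.
max: aggregations take the maximum of all max values.
sum: aggregations add up all sum values.
value_count: aggregations add up all value_count values. This field is a positive integer.
avg: does not need to be configured, since avg = sum / value_count. Using avg requires both sum and value_count.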
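
Applying these rules to the two documents indexed above gives the figures the aggregation request would be expected to return. The sketch below simply follows the rules as stated; `combine` is a hypothetical helper and no cluster is involved.

```python
docs = [
    {"min": -302.50, "max": 702.30, "sum": 200.0, "value_count": 25},
    {"min": -93.00, "max": 1702.30, "sum": 300.00, "value_count": 25},
]


def combine(metric_docs):
    """Combine pre-aggregated metrics following the rules listed above."""
    total_sum = sum(d["sum"] for d in metric_docs)
    total_count = sum(d["value_count"] for d in metric_docs)
    return {
        "min": min(d["min"] for d in metric_docs),
        "max": max(d["max"] for d in metric_docs),
        "sum": total_sum,
        "value_count": total_count,
        "avg": total_sum / total_count,  # avg = sum / value_count
    }


print(combine(docs))
```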

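Histogram is a type for pre-aggregated histogram data.

A histogram holds 2 pieces of data:

values: an array of doubles representing the buckets of the histogram. The values must be in ascending order.

counts: an array of integers giving the count for each bucket. Each count must be >= 0.

values and counts must have the same length, and entries at the same position in the two arrays correspond to each other.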
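
The constraints above are easy to check before indexing. The following sketch validates a histogram and derives a few summary numbers, assuming each bucket value is weighted by its count; that weighting and both function names are assumptions made for this illustration.

```python
def validate_histogram(values, counts):
    """Raise ValueError if the arrays break the constraints described above."""
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    if any(later < earlier for earlier, later in zip(values, values[1:])):
        raise ValueError("values must be in ascending order")
    if any(count < 0 for count in counts):
        raise ValueError("counts must be >= 0")


def histogram_stats(values, counts):
    """Summaries under the assumption that each value is weighted by its count."""
    validate_histogram(values, counts)
    value_count = sum(counts)
    total = sum(value * count for value, count in zip(values, counts))
    return {
        "value_count": value_count,
        "sum": total,
        "avg": total / value_count if value_count else None,
    }


print(histogram_stats([1.0, 2.0, 4.0], [2, 0, 1]))
```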

histogram fields can only be used with:

min aggregation
max aggregation
sum aggregation
value_count aggregation
avg aggregation
percentiles aggregation
percentile ranks aggregation
boxplot aggregation
histogram aggregation
range aggregation
exists query

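Text: the text family contains:

text, for full-text content

match_only_text, a space-optimized variant of text that disables scoring and performs slower on queries that need positions. It is best suited for indexing log messages.

text is a field for indexing full-text values, such as emails or product descriptions.

The string is passed through an analyzer, which converts it into a list of individual terms.

text is not used for sorting and is seldom used for aggregations.

text is best suited for unstructured but human-readable content.

For example: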

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "full_name": {
        "type": "text"
      }
    }
  }
}

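Sometimes a value needs to be mapped as 2 fields, a text field for full-text search and a keyword field for aggregations and sorting. This can be done with multi-fields.

Parameters supported by text:

`analyzer`	The analyzer (tokenizer) applied to the string, used both when the document is indexed and at search time (unless overridden by `search_analyzer`). Defaults to the `standard` analyzer.
`boost`	Query-time boost for the field's relevance score. Default 1.0.
`eager_global_ordinals`	Default false. Whether global ordinals are built eagerly on index refresh, loading the terms into cache so that later terms aggregations run faster. Enabling it affects write performance, because new global ordinals are built on every refresh. To limit the extra cost of rebuilding global ordinals on frequent refreshes, consider increasing refresh_interval.
`fielddata`	Whether the field can use in-memory fielddata for sorting, aggregations, or scripting. Default false.
`fielddata_frequency_filter`	When `fielddata` is true, which values should be loaded into memory. By default all values are loaded.
`fields`	Configures multi-fields, for example a text field for full-text search and a keyword field for aggregations and sorting, or the same value with different `analyzer` settings.
`index`	Whether the field is searchable. Default true.
`index_options`	What information is stored in the index for search and highlighting purposes. Default `positions`.
`index_prefixes`	Speeds up prefix queries, which are useful in cases such as a search box that must show results matching the prefix typed so far while the user is still typing. With `index_prefixes` enabled, ES additionally indexes prefixes between 2 and 5 characters long, which makes prefix matching much more efficient. Example query: GET /_search { "query": { "prefix": { "name": "ac" } } }
`index_phrases`	Combines words pairwise into two-term phrases that are indexed in a separate field, which makes phrase queries more efficient at the cost of a larger index. This works best when stopwords are not removed, because phrases containing stopwords do not use the auxiliary field and fall back to a standard phrase query. Accepts true or false (default).
`norms`	Whether field-length should be taken into account when scoring queries. Accepts `true` (default) or `false`.
`position_increment_gap`	The number of fake term positions inserted between the values of an array of strings, so that phrase queries do not match across values. Consider the document { "names": [ "John Abraham", "Lincoln Smith"] }. With a gap of 0, the query { "query": { "match_phrase": { "names": "Abraham Lincoln" } } } matches, even though the two words come from different values. A gap of 100, which is also the default, avoids this: { "properties": { "names": { "type": "text", "position_increment_gap": 100 } } }
`store`	Whether the field value is stored and retrievable separately from _source. true or false (default false).
`search_analyzer`	The analyzer used at search time. Defaults to the `analyzer` setting.
`search_quote_analyzer`	The analyzer used at search time when a phrase is encountered. Defaults to the `search_analyzer` setting.
`similarity`	Which scoring algorithm or similarity is used. Default BM25. See https://www.elastic.co/guide/en/elasticsearch/reference/7.15/similarity.html
`term_vector`	Whether term vectors should be stored for the field. Defaults to `no`.
`meta`	Metadata about the field.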
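
The effect of `position_increment_gap` on phrase matching can be imitated with a toy positional index. This is a deliberately simplified model (whitespace tokenization, exact adjacency, no slop), and the function names are invented for the sketch.

```python
def index_positions(values, position_increment_gap):
    """Assign a position to every token, adding the gap between array values."""
    positions = {}
    position = 0
    for index, value in enumerate(values):
        if index > 0:
            position += position_increment_gap  # gap between two array entries
        for token in value.lower().split():
            positions.setdefault(token, []).append(position)
            position += 1
    return positions


def phrase_matches(positions, phrase):
    """True if the phrase tokens occur at consecutive positions."""
    tokens = phrase.lower().split()
    return any(
        all(start + offset in positions.get(token, []) for offset, token in enumerate(tokens))
        for start in positions.get(tokens[0], [])
    )


names = ["John Abraham", "Lincoln Smith"]
print(phrase_matches(index_positions(names, 0), "Abraham Lincoln"))
print(phrase_matches(index_positions(names, 100), "Abraham Lincoln"))
```

With a gap of 0 the last token of one value sits directly before the first token of the next, so the phrase spans both values; with a gap of 100 the positions are too far apart for an exact phrase match.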

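match_only_text

For log data it can be used as a drop-in replacement for the text type, with a smaller disk footprint.

To reduce disk space requirements, match_only_text indexes only a subset of the information that a text field indexes. This comes with the following drawbacks:

The relevance score is computed as the number of matching terms. This usually does not matter for logging use cases, because documents are sorted by descending timestamp rather than by relevance score.

span queries are not supported. match_only_text fields support all other query types that text fields support.

Phrase and intervals queries run slower than on text fields, but still much faster than a linear scan. Other query types run as fast as on text fields, or even slightly faster.

annotated-text is experimental.

Completion is used by the completion suggester for auto-completion.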

posted on 2021-10-27 09:14 icodegarden