Elastic Certified Engineer Practice Problems

The following practice problems come from the 铭毅天下 "死磕ElasticSearch" Knowledge Planet (知识星球) community.

Sample 1

Index index_a has several fields. Write a query that:
1) matches documents whose title field contains 'ssas' or 'sasa' (at least one of the two must match);
2) boosts the score of documents whose tags field (an array field) contains 'pingpang'.

PUT index_a/_bulk
{"index":{"_id":1}}
{"title":"ssas is very nb", "tags":["pingpang", "basketball"]}
{"index":{"_id":2}}
{"title":"which is sasa","tags":["football"]}
{"index":{"_id":3}}
{"title":"which is ssas","tags":["basktball","football"]}
{"index":{"_id":4}}
{"title":"just for testing", "tags":["pingpang"]}
{"index":{"_id":5}}
{"title":"just for testing", "tags":["basketball"]}
{"index":{"_id":6}}
{"title":"just for testing", "tags":["football"]}
{"index":{"_id":7}}
{"title":"ssas sasa is very good", "tags":["pingpang"]}

Solution 1: bool query. A match query's default operator is OR, so matching title against "ssas sasa" satisfies the "at least one" requirement, while the should clause adds score when tags contains 'pingpang'.

GET index_a/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "ssas sasa"
          }
        }
      ],
      "should": [
        {
          "match": {
            "tags": {
              "query": "pingpang",
              "boost": 2
            }
          }
        }
      ]
    }
  }
}

Solution 2: function_score. The filter/weight function multiplies the base query's score by 5 for documents whose tags contain 'pingpang'.

GET index_a/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "title": "ssas sasa"
        }
      },
      "functions": [
        {
          "filter": {"match": {"tags": "pingpang"}},
          "weight": 5
        }
      ]
    }
  }
}

Sample 2

A document contains text like "dog & cat". Index the document so that a match_phrase query for either "dog & cat" or "dog and cat" matches it.

Solution 1: use a mapping char_filter that rewrites '&' to 'and' before tokenization

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_mappings_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_mappings_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Solution 2: use a synonym token filter

Note: the tokenizer must be whitespace here, not standard, because the standard tokenizer drops '&' before the synonym filter ever sees it.

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "my_synonym"
          ]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": [
            "& => and"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
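
Either mapping can be sanity-checked before use: both "dog & cat" and "dog and cat" should analyze to the tokens [dog, and, cat], so a match_phrase for either phrase matches. A quick check (document ID 1 is arbitrary):

GET my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "dog & cat"
}

PUT my-index-000001/_doc/1
{
  "message": "dog & cat"
}

GET my-index-000001/_search
{
  "query": {
    "match_phrase": {
      "message": "dog and cat"
    }
  }
}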

Sample 3

Index index_a contains some documents. Create an index index_b and use the Reindex API to copy index_a's documents into it. Add an integer field whose value is the character length of index_a's field_x, and an array field whose value is the set of words in field_y (field_y is a space-separated list of words, e.g. "foo bar", which should become ["foo", "bar"] once indexed into index_b).

Solution 1: use an ingest script processor

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": """
            // integer field: character length of x
            ctx.x_length = ctx.x.length();
            // array field: split y on spaces into a list of words
            String[] ysplit = ctx.y.splitOnToken(" ");
            ArrayList ylist = new ArrayList();
            for (int i = 0; i < ysplit.length; i++) {
              ylist.add(ysplit[i]);
            }
            ctx.y_list = ylist;
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "x": "hello",
        "y": "foo bar"
      }
    }
  ]
}

Solution 2: use an ingest script processor plus a split processor

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": """
            ctx.x_length = ctx.x.length();
          """
        }
      },
      {
        "split": {
          "field": "y",
          "separator": " ",
          "target_field": "y_list"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "x": "hello",
        "y": "foo bar zee"
      }
    }
  ]
}
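
Both simulations only validate the processors. To actually complete the task, store the pipeline and reference it from the reindex; the pipeline ID index_b_pipeline is illustrative, and the field names x/y follow the simulated documents above (substitute field_x/field_y as the task states):

PUT _ingest/pipeline/index_b_pipeline
{
  "processors": [
    {
      "script": {
        "source": "ctx.x_length = ctx.x.length();"
      }
    },
    {
      "split": {
        "field": "y",
        "separator": " ",
        "target_field": "y_list"
      }
    }
  ]
}

POST _reindex
{
  "source": {
    "index": "index_a"
  },
  "dest": {
    "index": "index_b",
    "pipeline": "index_b_pipeline"
  }
}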

Sample 4

Perform a reindex that does two things:

  • trims leading and trailing whitespace from every element of an array field in the source index
  • adds a new field whose value is the concatenation of two of the source index's fields

Solution (verified with the simulate API; as in Sample 3, store the pipeline and reference it from _reindex to complete the task):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "foreach": {
          "field": "x",
          "processor": {
            "trim": {
              "field": "_ingest._value"
            }
          }
        }
      },
      {
        "script": {
          "source": "ctx.yz = ctx.y + ' ' + ctx.z"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "x": ["foo ", " bar"],
        "y": "hello",
        "z": "world"
      }
    }
  ]
}

Sample 5

Query three fields a/b/c for 'xxx', boost field c by 2, and sum the per-field scores.

Solution 1: multi_match with type most_fields, which sums the scores of every matching field (the example reuses index_a's title and tags fields, boosting title):

GET index_a/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "ssas",
      "fields": ["title^2", "tags"]
    }
  }
}

Solution 2: bool query with should clauses (should clause scores are summed):

GET index_a/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "ssas",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "tags": "ssas"
          }
        }
      ]
    }
  }
}

Sample 6

Define a pipeline and use it to update the documents of the earthquakes index:

  • the pipeline ID is earthquakes_pipeline
  • uppercase the value of the magnitude_type field
  • if a document does not contain "batch_number", add the field and set it to 1
  • if batch_number already exists, increment it by 1

Solution: preview the processors with the simulate API first (storing and applying the pipeline follows below):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "uppercase": {
          "field": "magnitude_type"
        }
      },
      {
        "script": {
          "source": """
            if(ctx.batch_number == null){
              ctx.batch_number = 1;
            }else{
              ctx.batch_number++;
            }
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "magnitude_type": "foo"
      }
    },
    {
      "_source": {
        "magnitude_type": "bar",
        "batch_number": 2
      }
    }
  ]
}
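
The simulate call only previews the result. For the task as stated, store the pipeline under the required ID and apply it to the existing documents with update-by-query:

PUT _ingest/pipeline/earthquakes_pipeline
{
  "processors": [
    {
      "uppercase": {
        "field": "magnitude_type"
      }
    },
    {
      "script": {
        "source": """
          if (ctx.batch_number == null) {
            ctx.batch_number = 1;
          } else {
            ctx.batch_number++;
          }
        """
      }
    }
  ]
}

POST earthquakes/_update_by_query?pipeline=earthquakes_pipeline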

Sample 7

The earthquakes index contains earthquake records from the past 11 months. Write a single query that returns:

  • the average magnitude per month over the past 11 months
  • the month with the highest average magnitude, together with that average
  • no documents at all

Solution: a monthly date_histogram with an avg sub-aggregation, plus a max_bucket pipeline aggregation over it; "size": 0 keeps documents out of the response:

GET earthquakes/_search
{
  "size": 0,
  "aggs": {
    "monthly_aggs": {
      "date_histogram": {
        "field": "time",
        "calendar_interval": "month"
      },
      "aggs": {
        "avg_magnitude": {
          "avg": {
            "field": "magnitude"
          }
        }
      }
    },
    "max_avg_monthly_magnitude": {
      "max_bucket": {
        "buckets_path": "monthly_aggs>avg_magnitude"
      }
    }
  }
}

Test data and setup used to verify the query (create the index with the mapping below first, then run the bulk; the final DELETE resets everything):

POST earthquakes/_bulk
{"index":{"_id":1}}
{"time":"2019-01-01T17:00:00", "magnitude":1}
{"index":{"_id":2}}
{"time":"2019-01-01T20:00:00", "magnitude":3}
{"index":{"_id":3}}
{"time":"2019-02-01T17:00:00", "magnitude":4}
{"index":{"_id":3}}
{"time":"2019-02-20T17:00:00", "magnitude":5}
{"index":{"_id":4}}
{"time":"2019-11-01T17:00:00", "magnitude":7}
{"index":{"_id":5}}
{"time":"2019-11-01T17:00:00", "magnitude":8}
{"index":{"_id":6}}
{"time":"2019-11-01T17:00:00", "magnitude":9}

PUT earthquakes
{
  "mappings": {
    "properties": {
      "time": {
        "type": "date"
      },
      "magnitude": {
        "type": "integer"
      }
    }
  }
}
DELETE earthquakes

Sample 8

Install and configure a cluster with a hot & warm architecture:

  • three nodes: node 1 is hot, node 2 is warm, node 3 is cold
  • all three nodes are master-eligible
  • newly created indices write their data to the hot node
  • a single command moves the data from the hot node to the warm node

Solution:

First configure a node attribute: edit elasticsearch.yml on each node and add the matching line:

node.attr.hot_warm_type: hot    # node 1
node.attr.hot_warm_type: warm   # node 2
node.attr.hot_warm_type: cold   # node 3

Then create an index pinned to the hot node (the DELETE just clears any earlier run):

DELETE hotwarm_index
PUT hotwarm_index
{
  "settings": {
      "index.routing.allocation.include.hot_warm_type": "hot",
      "number_of_replicas": 0,
      "number_of_shards": 1
  }
}

PUT hotwarm_index/_bulk
{"index":{"_id":1}}
{"name":"foo"}
{"index":{"_id":2}}
{"name":"bar"}

GET _cat/shards?v

A single settings update moves the index's shards to the warm node:

PUT hotwarm_index/_settings
{
  "index.routing.allocation.include.hot_warm_type": "warm"
}

GET _cat/shards?v
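
To confirm each node's attribute was picked up, list the node attributes:

GET _cat/nodeattrs?v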

Sample 9

ILM + data stream: data first lands on data_hot nodes; roll over after 2 minutes; 5 minutes after the rollover, migrate to data_warm; 3 minutes after that, to data_cold; and 6 minutes later, delete.

Solution (the commands below are in scratch order; to run them from scratch, create the ILM policy first, then the index template, then index a document):

DELETE _data_stream/my-datastream
GET .ds-my-datastream-2022.02.26-000001/_ilm/explain # check the backing index's ILM state
GET _cat/shards/.ds-my-datastream-2022.02.26-000001?v # check the backing index's shard allocation
GET my-datastream
GET _data_stream/my-datastream

# Use POST, or PUT with op_type=create: documents in a data stream must be created, not updated
POST my-datastream/_doc
{
  "message": "a",
  "@timestamp": "2099-05-06T16:21:15.000Z"
}

# The template must include `data_stream: {}`; that is what makes the data stream auto-create on first write
PUT _index_template/my-datastream-template
{
  "index_patterns": [
    "my-datastream*"
  ],
  "data_stream": {},
  "template": {
    "settings": {
      "number_of_replicas": 0,
      "number_of_shards": 1,
      "index.lifecycle.name": "test_policy"
    }
  }
}

PUT _ilm/policy/test_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0m",
        "actions": {
          "rollover": {
            "max_age": "2m"
          }
        }
      },
      "warm": {
        "min_age": "5m",
        "actions": {}
      },
      "cold": {
        "min_age": "8m",
        "actions": {}
      },
      "delete": {
        "min_age": "14m",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
PUT _cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "3s"
  }
}

A few things worth noting about ILM:

  1. Every ILM-managed index has an age (how long it has existed), and every phase has a min_age: the index enters a phase once its age reaches that phase's min_age. So min_age must increase from phase to phase; a later phase's min_age cannot be smaller than an earlier one's. However, if there is a rollover action, the age resets at rollover: the min_age of every phase after hot is measured from the rollover time.
  2. The next phase is only entered after the current phase's actions have completed. If the previous phase takes long enough to overshoot the next phase's min_age, that next phase is passed through almost immediately into the one after it.
  3. A min_age of 0 means the phase is entered right away, which is why the hot phase's min_age is set to 0; the following phase still only runs after the hot phase's actions have completed.
  4. ILM has a cluster setting, indices.lifecycle.poll_interval, which controls how often phase transitions are checked; the default is 10 minutes. If min_age values are small, transitions won't happen on schedule unless poll_interval is lowered accordingly.
  5. Data streams and ILM do not depend on each other; each can be used on its own. ILM automates index management, for data streams and ordinary indices alike, while a data stream without ILM has to be managed by hand, so the two are best used together.
  6. If ILM manages an index + alias setup with rollover, it must be combined with an index template; otherwise the indices that rollover creates will not be managed by ILM. Without rollover, no index template is needed: ILM just manages the single index.
  7. In a hot-warm architecture scheduled through the built-in data tiers, tier routing only takes effect if the generic data role is removed from node.roles; otherwise "index.routing.allocation.include._tier_preference": "data_hot" does nothing. The official explanation: "A node can belong to multiple tiers, but a node that has one of the specialized data roles cannot have the generic data role." (See the yml sketch after this list.)
  8. A primary shard and its replica cannot live on the same node, so when only one node carries a given role, set number_of_replicas to 0.
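
For point 7, a minimal elasticsearch.yml sketch (the exact role mix is an assumption; the point is the absence of the generic data role):

# hot node
node.roles: [ master, data_hot, data_content ]
# warm node
node.roles: [ master, data_warm ]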

If ILM manages indices without a data stream, things are a bit more involved. Example (again in scratch order: create the policy and template first, then the bootstrap index with its write alias):

PUT my-policy-index-000001
{
  "aliases": {
    "test_alias": {
      "is_write_index": true
    }
  }
}

PUT _index_template/my-policy-index_template
{
  "index_patterns": ["my-policy-index-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "index.lifecycle.name": "test_policy",
      "index.lifecycle.rollover_alias": "test_alias",
      "index.routing.allocation.include._tier_preference": "data_hot"
    }
  }
}

PUT _ilm/policy/test_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0m",
        "actions": {
          "rollover": {
            "max_age": "2m"
          }
        }
      },
      "warm": {
        "min_age": "5m",
        "actions": {}
      },
      "cold": {
        "min_age": "8m",
        "actions": {}
      },
      "delete": {
        "min_age": "14m",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
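
With the policy, template, and bootstrap index in place, the rollover chain and phase transitions can be watched with:

GET _cat/indices/my-policy-index-*?v

GET my-policy-index-*/_ilm/explain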

Sample 10

Index task2 has a field field2; a match query for 'the' currently finds many documents. Rebuild task2 into a new index new_task2 so that a match query for 'the' finds nothing. (The examples below use test1/test2/test3 as stand-in index names.)

Solution 1: use the built-in stop analyzer, which strips English stopwords such as 'the':

PUT test1/_doc/1
{
  "message": "you are the best"
}
PUT test2
{
  "mappings" : {
      "properties" : {
        "message" : {
          "type" : "text",
          "analyzer": "stop",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
}
POST _reindex
{
  "source": {
    "index": "test1"
  },
  "dest": {
    "index": "test2"
  }
}

Solution 2: build a custom analyzer from the standard tokenizer plus the stop token filter:

PUT test1/_doc/1
{
  "message": "you are the best"
}
PUT test3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stop_analyzer": {
          "tokenizer": "standard",
          "filter": ["stop"]
        }
      }
    }
  },
  "mappings" : {
      "properties" : {
        "message" : {
          "type" : "text",
          "analyzer": "stop_analyzer",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
}
POST _reindex
{
  "source": {
    "index": "test1"
  },
  "dest": {
    "index": "test2"
  }
}
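
Either variant can be verified: 'the' should be absent from the analyzed tokens, and a match query for 'the' should return no hits (the query string itself analyzes to zero tokens):

GET test3/_analyze
{
  "field": "message",
  "text": "you are the best"
}

GET test3/_search
{
  "query": {
    "match": {
      "message": "the"
    }
  }
}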

Sample 11

In the test index, create a runtime field whose value is field A minus field B, then run a range aggregation over it with three buckets:

  • below 0
  • 0 to 100
  • 100 and above
  • the search must return zero documents

Solution (a range bucket includes its from and excludes its to, so the three buckets don't overlap):

POST test4/_bulk
{"index": {}}
{"A": 100, "B": 200}
{"index": {}}
{"A": 10, "B": 20}
{"index": {}}
{"A": 200, "B": 20}
{"index": {}}
{"A": 100, "B": 20}
{"index": {}}
{"A": 100, "B": 50}

GET test4/_search
{
  "size": 0,
  "runtime_mappings": {
    "C": {
      "type": "long",
      "script": {
        "source": "emit(doc['A'].value-doc['B'].value)"
      }
    }
  },
  "aggs": {
    "caggs": {
      "range": {
        "field": "C",
        "ranges": [
          {
            "to": 0
          },
          {
            "from": 0,
            "to": 100
          },
          {
            "from": 100
          }
        ]
      }
    }
  }
}

Sample 12

Indices testa and testb share a join field x. Build a new index that contains all of testa's documents, each enriched via x with the matching data from testb.

Solution: define an enrich policy on testb, execute it, and apply an enrich processor during the reindex:

PUT testb/_bulk
{"index":{}}
{"b":10,"x":2}
{"index":{}}
{"b":5,"x":5}

PUT testa/_bulk
{"index":{}}
{"a":1,"x":2}
{"index":{}}
{"a":3,"x":2}
{"index":{}}
{"a":5,"x":4}

PUT /_enrich/policy/myenrich-policy
{
  "match": {
    "indices": "testb",
    "match_field": "x",
    "enrich_fields": ["x", "b"]
  }
}

POST /_enrich/policy/myenrich-policy/_execute

PUT _ingest/pipeline/mypipeline
{
  "processors": [
    {
      "enrich": {
        "policy_name": "myenrich-policy",
        "field": "x",
        "target_field": "c"
      }
    }
  ]
}

POST _reindex
{
  "source": {
    "index": "testa"
  },
  "dest": {
    "index": "testc",
    "pipeline": "mypipeline"
  }
}
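
Each document in testc whose x matches a testb document should now carry a c object holding that document's b and x (here, only the docs with x=2):

GET testc/_search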

Sample 13

Write a query against the task9 index on cluster 1 that satisfies all of the following:

  • at least two of the fields 'a', 'b', 'c' match the keyword 'test'
  • results are sorted first by 'a' descending, then by '_score' ascending
  • matches in 'a' are highlighted, with "<h1>" as the pre-tag and "</h1>" as the post-tag

Solution (shown against a test6 index; sorting on a.keyword relies on the keyword sub-field that dynamic mapping adds to text fields):

GET test6/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"a": "test"}},
        {"match": {"b": "test"}},
        {"match": {"c": "test"}}
      ],
      "minimum_should_match": 2
    }
  },
  "highlight": {
    "fields": {
      "a": {}
    },
    "pre_tags": ["<h1>"],
    "post_tags": ["</h1>"]
  },
  "sort": [
    {
      "a.keyword": {
        "order": "desc"
      }
    },
    {
      "_score": {
        "order": "asc"
      }
    }
  ]
}
PUT test6/_bulk
{"index": {}}
{"a": "test", "b": "foo", "c": "bar"}
{"index": {}}
{"a": "test", "b": "test", "c": "bar"}
{"index": {}}
{"a": "test", "b": "foo", "c": "test"}

Sample 14

Diagnose and fix a cluster that has gone red or yellow.

Solution: drill down from cluster level to index level to shard level, then ask Elasticsearch to explain the allocation:

GET _cluster/health                               # overall cluster status

GET _cluster/health?level=indices                 # which indices are not green
GET _cluster/health/my-index-000001?level=shards  # which shards of one index

GET /_cat/shards/my-index-000001?v                # shard states for the index
GET _cat/indices?health=yellow&v                  # list all yellow indices

GET _cluster/allocation/explain                   # why a shard is unassigned
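
allocation/explain names the exact reason a shard is unassigned. A common case on small clusters is a replica that cannot be placed because it may not share a node with its primary; a minimal fix, reusing the index name from above:

PUT my-index-000001/_settings
{
  "index.number_of_replicas": 0
}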

Sample 15

Create a search template named task10 that satisfies the following:

  • field a is searched with the search_string parameter
  • the timestamp field is range-filtered with the start_date and end_date parameters; if end_date is not provided, the range ends now
  • matches in field a are highlighted, wrapped in <strong> and </strong>
  • results are sorted first by field b, then by score

Then write a search against the movie index that uses the task10 template, with search_string set to star.

Solution (the example stores the template as task5_template and exercises it against a task5 index; the template invocation comes first, then the test data, then the stored template itself):

GET task5/_search/template
{
  "id": "task5_template",
  "params": {
    "search_string": "foo",
    "start_date": "2022-01-01"
  }
}

PUT task5/_bulk
{"index": {}}
{"a": "foo", "b": 10, "timestamp": "2022-01-01"}
{"index": {}}
{"a": "foo", "b": 4, "timestamp": "2022-02-01"}
{"index": {}}
{"a": "foo bar", "b": 34, "timestamp": "2022-03-01"}
{"index": {}}
{"a": "bar", "b": 2, "timestamp": "2021-01-01"}

PUT _scripts/task5_template
{
  "script": {
    "lang": "mustache",
    "source": """
    {
      "query": {
        "bool": {
          "filter": [
            {"match": {"a": "{{search_string}}"}},
            {"range": {
              "timestamp": {
                "gte": "{{start_date}}",
                "lte": "{{end_date}}{{^end_date}}now/d{{/end_date}}"
              }
            }}
          ]
        }
      },
      "highlight": {
        "fields": {
          "a": {}
        },
        "pre_tags": ["<strong>"],
        "post_tags": ["</strong>"]
      },
      "sort": [
        {
          "b": {
            "order": "desc"
          }
        },
        {
          "_score": {
            "order": "asc"
          }
        }
      ]
    }
    """
  }
}
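
For the task as literally stated, the same template would be stored under the ID task10 and invoked against the movie index (the start_date value here is illustrative, since the template defines no default for it):

GET movie/_search/template
{
  "id": "task10",
  "params": {
    "search_string": "star",
    "start_date": "2022-01-01"
  }
}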

Note: the template source is one long string and Kibana offers no autocompletion inside it, so writing it by hand is error-prone. Write the query as a normal _search body first, then paste it into the source field. When you paste, keep the outer braces around the query, i.e. "source": """ {"query": {}} """, not "source": """ "query": {} """.

Sample 16

Term-match field a, match-query field b, and weight the score by field c, where c is derived from two other fields.

Solution: a runtime field z = x + y stands in for the derived field c, and a function_score script_score reweights the query score using z:

GET task6/_search
{
  "runtime_mappings": {
    "z": {
      "type": "long",
      "script": {
        "source": "emit(doc['x'].value + doc['y'].value)"
      }
    }
  },
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {"match": {"b": "hello"}},
            {"term": {"a": "foo"}}
          ]
        }
      },
      "script_score": {
        "script": {
          "source": "_score * doc['z'].value"
        }
      }
    }
  }
}

PUT task6/_bulk
{"index": {}}
{"x": 2, "y": 4, "a": "foo", "b": "hello world"}
{"index": {}}
{"x": 100, "y": 50, "a": "bar", "b": "hello world 1"}
{"index": {}}
{"x": 200, "y": 10, "a": "foo", "b": "hello"}

Note what function_score does: it takes a base query and modifies that query's score.
