Elastic Certified Engineer Practice Problems

The following practice problems come from the 铭毅天下 "死磕ElasticSearch" Knowledge Planet (知识星球) community.

Sample 1

Index index_a has several fields. Write a query that:
1) matches documents whose title field contains 'ssas' or 'sasa' (at least one of the two must match);
2) boosts the score of documents whose tags field (an array field) contains 'pingpang'.

PUT index_a/_bulk
{"index":{"_id":1}}
{"title":"ssas is very nb", "tags":["pingpang", "basketball"]}
{"index":{"_id":2}}
{"title":"which is sasa","tags":["football"]}
{"index":{"_id":3}}
{"title":"which is ssas","tags":["basktball","football"]}
{"index":{"_id":4}}
{"title":"just for testing", "tags":["pingpang"]}
{"index":{"_id":5}}
{"title":"just for testing", "tags":["basketball"]}
{"index":{"_id":6}}
{"title":"just for testing", "tags":["football"]}
{"index":{"_id":7}}
{"title":"ssas sasa is very good", "tags":["pingpang"]}

Solution 1: bool query. A match query's default operator is OR, so matching title against "ssas sasa" satisfies the "at least one" requirement, while the should clause adds score when tags contains 'pingpang'.

GET index_a/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "ssas sasa"
          }
        }
      ],
      "should": [
        {
          "match": {
            "tags": {
              "query": "pingpang",
              "boost": 2
            }
          }
        }
      ]
    }
  }
}

Solution 2: function_score. The filter/weight function multiplies the base query's score by 5 for documents whose tags contain 'pingpang'.

GET index_a/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "title": "ssas sasa"
        }
      },
      "functions": [
        {
          "filter": {"match": {"tags": "pingpang"}},
          "weight": 5
        }
      ]
    }
  }
}

Sample 2

A document contains text like "dog & cat". Index the document so that a match_phrase query for either "dog & cat" or "dog and cat" matches it.

Solution 1: use a mapping char_filter that rewrites '&' to 'and' before tokenization

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_mappings_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_mappings_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Solution 2: use a synonym token filter

Note: the tokenizer must be whitespace here, not standard, because the standard tokenizer drops '&' before the synonym filter ever sees it.

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "my_synonym"
          ]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": [
            "& => and"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
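
Either mapping can be sanity-checked before use: both "dog & cat" and "dog and cat" should analyze to the tokens [dog, and, cat], so a match_phrase for either phrase matches. A quick check (document ID 1 is arbitrary):

GET my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "dog & cat"
}

PUT my-index-000001/_doc/1
{
  "message": "dog & cat"
}

GET my-index-000001/_search
{
  "query": {
    "match_phrase": {
      "message": "dog and cat"
    }
  }
}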

Sample 3

Index index_a contains some documents. Create an index index_b and use the Reindex API to copy index_a's documents into it. Add an integer field whose value is the character length of index_a's field_x, and an array field whose value is the set of words in field_y (field_y is a space-separated list of words, e.g. "foo bar", which should become ["foo", "bar"] once indexed into index_b).

Solution 1: use an ingest script processor

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": """
            // integer field: character length of x
            ctx.x_length = ctx.x.length();
            // array field: split y on spaces into a list of words
            String[] ysplit = ctx.y.splitOnToken(" ");
            ArrayList ylist = new ArrayList();
            for (int i = 0; i < ysplit.length; i++) {
              ylist.add(ysplit[i]);
            }
            ctx.y_list = ylist;
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "x": "hello",
        "y": "foo bar"
      }
    }
  ]
}

Solution 2: use an ingest script processor plus a split processor

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": """
            ctx.x_length = ctx.x.length();
          """
        }
      },
      {
        "split": {
          "field": "y",
          "separator": " ",
          "target_field": "y_list"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "x": "hello",
        "y": "foo bar zee"
      }
    }
  ]
}
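
Both simulations only validate the processors. To actually complete the task, store the pipeline and reference it from the reindex; the pipeline ID index_b_pipeline is illustrative, and the field names x/y follow the simulated documents above (substitute field_x/field_y as the task states):

PUT _ingest/pipeline/index_b_pipeline
{
  "processors": [
    {
      "script": {
        "source": "ctx.x_length = ctx.x.length();"
      }
    },
    {
      "split": {
        "field": "y",
        "separator": " ",
        "target_field": "y_list"
      }
    }
  ]
}

POST _reindex
{
  "source": {
    "index": "index_a"
  },
  "dest": {
    "index": "index_b",
    "pipeline": "index_b_pipeline"
  }
}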

Sample 4

Perform a reindex that does two things:

  • trims leading and trailing whitespace from every element of an array field in the source index
  • adds a new field whose value is the concatenation of two of the source index's fields

Solution (verified with the simulate API; as in Sample 3, store the pipeline and reference it from _reindex to complete the task):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "foreach": {
          "field": "x",
          "processor": {
            "trim": {
              "field": "_ingest._value"
            }
          }
        }
      },
      {
        "script": {
          "source": "ctx.yz = ctx.y + ' ' + ctx.z"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "x": ["foo ", " bar"],
        "y": "hello",
        "z": "world"
      }
    }
  ]
}

Sample 5

Query three fields a/b/c for 'xxx', boost field c by 2, and sum the per-field scores.

Solution 1: multi_match with type most_fields, which sums the scores of every matching field (the example reuses index_a's title and tags fields, boosting title):

GET index_a/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "ssas",
      "fields": ["title^2", "tags"]
    }
  }
}

Solution 2: bool query with should clauses (should clause scores are summed):

GET index_a/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "ssas",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "tags": "ssas"
          }
        }
      ]
    }
  }
}

Sample 6

Define a pipeline and use it to update the documents of the earthquakes index:

  • the pipeline ID is earthquakes_pipeline
  • uppercase the value of the magnitude_type field
  • if a document does not contain "batch_number", add the field and set it to 1
  • if batch_number already exists, increment it by 1

Solution: preview the processors with the simulate API first (storing and applying the pipeline follows below):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "uppercase": {
          "field": "magnitude_type"
        }
      },
      {
        "script": {
          "source": """
            if(ctx.batch_number == null){
              ctx.batch_number = 1;
            }else{
              ctx.batch_number++;
            }
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "magnitude_type": "foo"
      }
    },
    {
      "_source": {
        "magnitude_type": "bar",
        "batch_number": 2
      }
    }
  ]
}
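
The simulate call only previews the result. For the task as stated, store the pipeline under the required ID and apply it to the existing documents with update-by-query:

PUT _ingest/pipeline/earthquakes_pipeline
{
  "processors": [
    {
      "uppercase": {
        "field": "magnitude_type"
      }
    },
    {
      "script": {
        "source": """
          if (ctx.batch_number == null) {
            ctx.batch_number = 1;
          } else {
            ctx.batch_number++;
          }
        """
      }
    }
  ]
}

POST earthquakes/_update_by_query?pipeline=earthquakes_pipeline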

Sample 7

The earthquakes index contains earthquake records from the past 11 months. Write a single query that returns:

  • the average magnitude per month over the past 11 months
  • the month with the highest average magnitude, together with that average
  • no documents at all

Solution: a monthly date_histogram with an avg sub-aggregation, plus a max_bucket pipeline aggregation over it; "size": 0 keeps documents out of the response:

GET earthquakes/_search
{
  "size": 0,
  "aggs": {
    "monthly_aggs": {
      "date_histogram": {
        "field": "time",
        "calendar_interval": "month"
      },
      "aggs": {
        "avg_magnitude": {
          "avg": {
            "field": "magnitude"
          }
        }
      }
    },
    "max_avg_monthly_magnitude": {
      "max_bucket": {
        "buckets_path": "monthly_aggs>avg_magnitude"
      }
    }
  }
}

Test data and setup used to verify the query (create the index with the mapping below first, then run the bulk; the final DELETE resets everything):

POST earthquakes/_bulk
{"index":{"_id":1}}
{"time":"2019-01-01T17:00:00", "magnitude":1}
{"index":{"_id":2}}
{"time":"2019-01-01T20:00:00", "magnitude":3}
{"index":{"_id":3}}
{"time":"2019-02-01T17:00:00", "magnitude":4}
{"index":{"_id":3}}
{"time":"2019-02-20T17:00:00", "magnitude":5}
{"index":{"_id":4}}
{"time":"2019-11-01T17:00:00", "magnitude":7}
{"index":{"_id":5}}
{"time":"2019-11-01T17:00:00", "magnitude":8}
{"index":{"_id":6}}
{"time":"2019-11-01T17:00:00", "magnitude":9}

PUT earthquakes
{
  "mappings": {
    "properties": {
      "time": {
        "type": "date"
      },
      "magnitude": {
        "type": "integer"
      }
    }
  }
}
DELETE earthquakes

Sample 8

Install and configure a cluster with a hot & warm architecture:

  • three nodes: node 1 is hot, node 2 is warm, node 3 is cold
  • all three nodes are master-eligible
  • newly created indices write their data to the hot node
  • a single command moves the data from the hot node to the warm node

Solution:

First configure a node attribute: edit elasticsearch.yml on each node and add the matching line:

node.attr.hot_warm_type: hot    # node 1
node.attr.hot_warm_type: warm   # node 2
node.attr.hot_warm_type: cold   # node 3

Then create an index pinned to the hot node (the DELETE just clears any earlier run):

DELETE hotwarm_index
PUT hotwarm_index
{
  "settings": {
      "index.routing.allocation.include.hot_warm_type": "hot",
      "number_of_replicas": 0,
      "number_of_shards": 1
  }
}

PUT hotwarm_index/_bulk
{"index":{"_id":1}}
{"name":"foo"}
{"index":{"_id":2}}
{"name":"bar"}

GET _cat/shards?v

A single settings update moves the index's shards to the warm node:

PUT hotwarm_index/_settings
{
  "index.routing.allocation.include.hot_warm_type": "warm"
}

GET _cat/shards?v
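
To confirm each node's attribute was picked up, list the node attributes:

GET _cat/nodeattrs?v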

Sample 9

ILM + data stream: data first lands on data_hot nodes; roll over after 2 minutes; 5 minutes after the rollover, migrate to data_warm; 3 minutes after that, to data_cold; and 6 minutes later, delete.

Solution (the commands below are in scratch order; to run them from scratch, create the ILM policy first, then the index template, then index a document):

DELETE _data_stream/my-datastream
GET .ds-my-datastream-2022.02.26-000001/_ilm/explain # check the backing index's ILM state
GET _cat/shards/.ds-my-datastream-2022.02.26-000001?v # check the backing index's shard allocation
GET my-datastream
GET _data_stream/my-datastream

# Use POST, or PUT with op_type=create: documents in a data stream must be created, not updated
POST my-datastream/_doc
{
  "message": "a",
  "@timestamp": "2099-05-06T16:21:15.000Z"
}

# The template must include `data_stream: {}`; that is what makes the data stream auto-create on first write
PUT _index_template/my-datastream-template
{
  "index_patterns": [
    "my-datastream*"
  ],
  "data_stream": {},
  "template": {
    "settings": {
      "number_of_replicas": 0,
      "number_of_shards": 1,
      "index.lifecycle.name": "test_policy"
    }
  }
}

PUT _ilm/policy/test_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0m",
        "actions": {
          "rollover": {
            "max_age": "2m"
          }
        }
      },
      "warm": {
        "min_age": "5m",
        "actions": {}
      },
      "cold": {
        "min_age": "8m",
        "actions": {}
      },
      "delete": {
        "min_age": "14m",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
PUT _cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "3s"
  }
}

A few things worth noting about ILM:

  1. Every ILM-managed index has an age (how long it has existed), and every phase has a min_age: the index enters a phase once its age reaches that phase's min_age. So min_age must increase from phase to phase; a later phase's min_age cannot be smaller than an earlier one's. However, if there is a rollover action, the age resets at rollover: the min_age of every phase after hot is measured from the rollover time.
  2. The next phase is only entered after the current phase's actions have completed. If the previous phase takes long enough to overshoot the next phase's min_age, that next phase is passed through almost immediately into the one after it.
  3. A min_age of 0 means the phase is entered right away, which is why the hot phase's min_age is set to 0; the following phase still only runs after the hot phase's actions have completed.
  4. ILM has a cluster setting, indices.lifecycle.poll_interval, which controls how often phase transitions are checked; the default is 10 minutes. If min_age values are small, transitions won't happen on schedule unless poll_interval is lowered accordingly.
  5. Data streams and ILM do not depend on each other; each can be used on its own. ILM automates index management, for data streams and ordinary indices alike, while a data stream without ILM has to be managed by hand, so the two are best used together.
  6. If ILM manages an index + alias setup with rollover, it must be combined with an index template; otherwise the indices that rollover creates will not be managed by ILM. Without rollover, no index template is needed: ILM just manages the single index.
  7. In a hot-warm architecture scheduled through the built-in data tiers, tier routing only takes effect if the generic data role is removed from node.roles; otherwise "index.routing.allocation.include._tier_preference": "data_hot" does nothing. The official explanation: "A node can belong to multiple tiers, but a node that has one of the specialized data roles cannot have the generic data role." (See the yml sketch after this list.)
  8. A primary shard and its replica cannot live on the same node, so when only one node carries a given role, set number_of_replicas to 0.
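
For point 7, a minimal elasticsearch.yml sketch (the exact role mix is an assumption; the point is the absence of the generic data role):

# hot node
node.roles: [ master, data_hot, data_content ]
# warm node
node.roles: [ master, data_warm ]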

If ILM manages indices without a data stream, things are a bit more involved. Example (again in scratch order: create the policy and template first, then the bootstrap index with its write alias):

PUT my-policy-index-000001
{
  "aliases": {
    "test_alias": {
      "is_write_index": true
    }
  }
}

PUT _index_template/my-policy-index_template
{
  "index_patterns": ["my-policy-index-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "index.lifecycle.name": "test_policy",
      "index.lifecycle.rollover_alias": "test_alias",
      "index.routing.allocation.include._tier_preference": "data_hot"
    }
  }
}

PUT _ilm/policy/test_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0m",
        "actions": {
          "rollover": {
            "max_age": "2m"
          }
        }
      },
      "warm": {
        "min_age": "5m",
        "actions": {}
      },
      "cold": {
        "min_age": "8m",
        "actions": {}
      },
      "delete": {
        "min_age": "14m",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
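
With the policy, template, and bootstrap index in place, the rollover chain and phase transitions can be watched with:

GET _cat/indices/my-policy-index-*?v

GET my-policy-index-*/_ilm/explain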

Sample 10

Index task2 has a field field2; a match query for 'the' currently finds many documents. Rebuild task2 into a new index new_task2 so that a match query for 'the' finds nothing. (The examples below use test1/test2/test3 as stand-in index names.)

Solution 1: use the built-in stop analyzer, which strips English stopwords such as 'the':

PUT test1/_doc/1
{
  "message": "you are the best"
}
PUT test2
{
  "mappings" : {
      "properties" : {
        "message" : {
          "type" : "text",
          "analyzer": "stop",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
}
POST _reindex
{
  "source": {
    "index": "test1"
  },
  "dest": {
    "index": "test2"
  }
}

Solution 2: build a custom analyzer from the standard tokenizer plus the stop token filter:

PUT test1/_doc/1
{
  "message": "you are the best"
}
PUT test3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stop_analyzer": {
          "tokenizer": "standard",
          "filter": ["stop"]
        }
      }
    }
  },
  "mappings" : {
      "properties" : {
        "message" : {
          "type" : "text",
          "analyzer": "stop_analyzer",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
}
POST _reindex
{
  "source": {
    "index": "test1"
  },
  "dest": {
    "index": "test2"
  }
}
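
Either variant can be verified: 'the' should be absent from the analyzed tokens, and a match query for 'the' should return no hits (the query string itself analyzes to zero tokens):

GET test3/_analyze
{
  "field": "message",
  "text": "you are the best"
}

GET test3/_search
{
  "query": {
    "match": {
      "message": "the"
    }
  }
}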

Sample 11

In the test index, create a runtime field whose value is field A minus field B, then run a range aggregation over it with three buckets:

  • below 0
  • 0 to 100
  • 100 and above
  • the search must return zero documents

Solution (a range bucket includes its from and excludes its to, so the three buckets don't overlap):

POST test4/_bulk
{"index": {}}
{"A": 100, "B": 200}
{"index": {}}
{"A": 10, "B": 20}
{"index": {}}
{"A": 200, "B": 20}
{"index": {}}
{"A": 100, "B": 20}
{"index": {}}
{"A": 100, "B": 50}

GET test4/_search
{
  "size": 0,
  "runtime_mappings": {
    "C": {
      "type": "long",
      "script": {
        "source": "emit(doc['A'].value-doc['B'].value)"
      }
    }
  },
  "aggs": {
    "caggs": {
      "range": {
        "field": "C",
        "ranges": [
          {
            "to": 0
          },
          {
            "from": 0,
            "to": 100
          },
          {
            "from": 100
          }
        ]
      }
    }
  }
}

Sample 12

Indices testa and testb share a join field x. Build a new index that contains all of testa's documents, each enriched via x with the matching data from testb.

Solution: define an enrich policy on testb, execute it, and apply an enrich processor during the reindex:

PUT testb/_bulk
{"index":{}}
{"b":10,"x":2}
{"index":{}}
{"b":5,"x":5}

PUT testa/_bulk
{"index":{}}
{"a":1,"x":2}
{"index":{}}
{"a":3,"x":2}
{"index":{}}
{"a":5,"x":4}

PUT /_enrich/policy/myenrich-policy
{
  "match": {
    "indices": "testb",
    "match_field": "x",
    "enrich_fields": ["x", "b"]
  }
}

POST /_enrich/policy/myenrich-policy/_execute

PUT _ingest/pipeline/mypipeline
{
  "processors": [
    {
      "enrich": {
        "policy_name": "myenrich-policy",
        "field": "x",
        "target_field": "c"
      }
    }
  ]
}

POST _reindex
{
  "source": {
    "index": "testa"
  },
  "dest": {
    "index": "testc",
    "pipeline": "mypipeline"
  }
}
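
Each document in testc whose x matches a testb document should now carry a c object holding that document's b and x (here, only the docs with x=2):

GET testc/_search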

Sample 13

Write a query against the task9 index on cluster 1 that satisfies all of the following:

  • at least two of the fields 'a', 'b', 'c' match the keyword 'test'
  • results are sorted first by 'a' descending, then by '_score' ascending
  • matches in 'a' are highlighted, with "<h1>" as the pre-tag and "</h1>" as the post-tag

Solution (shown against a test6 index; sorting on a.keyword relies on the keyword sub-field that dynamic mapping adds to text fields):

GET test6/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"a": "test"}},
        {"match": {"b": "test"}},
        {"match": {"c": "test"}}
      ],
      "minimum_should_match": 2
    }
  },
  "highlight": {
    "fields": {
      "a": {}
    },
    "pre_tags": ["<h1>"],
    "post_tags": ["</h1>"]
  },
  "sort": [
    {
      "a.keyword": {
        "order": "desc"
      }
    },
    {
      "_score": {
        "order": "asc"
      }
    }
  ]
}
PUT test6/_bulk
{"index": {}}
{"a": "test", "b": "foo", "c": "bar"}
{"index": {}}
{"a": "test", "b": "test", "c": "bar"}
{"index": {}}
{"a": "test", "b": "foo", "c": "test"}

Sample 14

Diagnose and fix a cluster that has gone red or yellow.

Solution: drill down from cluster level to index level to shard level, then ask Elasticsearch to explain the allocation:

GET _cluster/health                               # overall cluster status

GET _cluster/health?level=indices                 # which indices are not green
GET _cluster/health/my-index-000001?level=shards  # which shards of one index

GET /_cat/shards/my-index-000001?v                # shard states for the index
GET _cat/indices?health=yellow&v                  # list all yellow indices

GET _cluster/allocation/explain                   # why a shard is unassigned
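
allocation/explain names the exact reason a shard is unassigned. A common case on small clusters is a replica that cannot be placed because it may not share a node with its primary; a minimal fix, reusing the index name from above:

PUT my-index-000001/_settings
{
  "index.number_of_replicas": 0
}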

Sample 15

Create a search template named task10 that satisfies the following:

  • field a is searched with the search_string parameter
  • the timestamp field is range-filtered with the start_date and end_date parameters; if end_date is not provided, the range ends now
  • matches in field a are highlighted, wrapped in <strong> and </strong>
  • results are sorted first by field b, then by score

Then write a search against the movie index that uses the task10 template, with search_string set to star.

Solution (the example stores the template as task5_template and exercises it against a task5 index; the template invocation comes first, then the test data, then the stored template itself):

GET task5/_search/template
{
  "id": "task5_template",
  "params": {
    "search_string": "foo",
    "start_date": "2022-01-01"
  }
}

PUT task5/_bulk
{"index": {}}
{"a": "foo", "b": 10, "timestamp": "2022-01-01"}
{"index": {}}
{"a": "foo", "b": 4, "timestamp": "2022-02-01"}
{"index": {}}
{"a": "foo bar", "b": 34, "timestamp": "2022-03-01"}
{"index": {}}
{"a": "bar", "b": 2, "timestamp": "2021-01-01"}

PUT _scripts/task5_template
{
  "script": {
    "lang": "mustache",
    "source": """
    {
      "query": {
        "bool": {
          "filter": [
            {"match": {"a": "{{search_string}}"}},
            {"range": {
              "timestamp": {
                "gte": "{{start_date}}",
                "lte": "{{end_date}}{{^end_date}}now/d{{/end_date}}"
              }
            }}
          ]
        }
      },
      "highlight": {
        "fields": {
          "a": {}
        },
        "pre_tags": ["<strong>"],
        "post_tags": ["</strong>"]
      },
      "sort": [
        {
          "b": {
            "order": "desc"
          }
        },
        {
          "_score": {
            "order": "asc"
          }
        }
      ]
    }
    """
  }
}
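
For the task as literally stated, the same template would be stored under the ID task10 and invoked against the movie index (the start_date value here is illustrative, since the template defines no default for it):

GET movie/_search/template
{
  "id": "task10",
  "params": {
    "search_string": "star",
    "start_date": "2022-01-01"
  }
}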

Note: the template source is one long string and Kibana offers no autocompletion inside it, so writing it by hand is error-prone. Write the query as a normal _search body first, then paste it into the source field. When you paste, keep the outer braces around the query, i.e. "source": """ {"query": {}} """, not "source": """ "query": {} """.

Sample 16

Term-match field a, match-query field b, and weight the score by field c, where c is derived from two other fields.

Solution: a runtime field z = x + y stands in for the derived field c, and a function_score script_score reweights the query score using z:

GET task6/_search
{
  "runtime_mappings": {
    "z": {
      "type": "long",
      "script": {
        "source": "emit(doc['x'].value + doc['y'].value)"
      }
    }
  },
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {"match": {"b": "hello"}},
            {"term": {"a": "foo"}}
          ]
        }
      },
      "script_score": {
        "script": {
          "source": "_score * doc['z'].value"
        }
      }
    }
  }
}

PUT task6/_bulk
{"index": {}}
{"x": 2, "y": 4, "a": "foo", "b": "hello world"}
{"index": {}}
{"x": 100, "y": 50, "a": "bar", "b": "hello world 1"}
{"index": {}}
{"x": 200, "y": 10, "a": "foo", "b": "hello"}

Note what function_score does: it takes a base query and modifies that query's score.
