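Elasticsearch Basics: Retrieving Multiple Documents with _mget and Batch Operations with _bulk

Retrieving Multiple Documents

As fast as Elasticsearch is, it can be faster still. Combining multiple requests into one avoids the network latency and overhead of processing each request individually. If you need to retrieve many documents from Elasticsearch, putting them all in a single request with the multi-get, or mget, API is faster than fetching them one by one.

The mget API expects a docs array, in which each element holds the metadata of a document to retrieve: its _index, _type, and _id. To retrieve only one or more specific fields, name them in the _source parameter: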

GET /_mget
{
   "docs" : [
      {
         "_index" : "website",
         "_type" :  "blog",
         "_id" :    2
      },
      {
         "_index" : "website",
         "_type" :  "pageviews",
         "_id" :    1,
         "_source": "views"
      }
   ]
}

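The response body also contains a docs array, with one entry for each document in the request, in the same order as they were requested. Each entry is identical to the response body of an individual get request: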

   {
       "docs" : [
          {
             "_index" :   "website",
             "_id" :      "2",
             "_type" :    "blog",
             "found" :    true,
             "_source" : {
                "text" : "This is a piece of cake...",
                "title" : "My first external blog entry"
             },
             "_version" : 10
          },
          {
             "_index" :   "website",
             "_id" :      "1",
             "_type" :    "pageviews",
             "found" :    true,
             "_version" : 2,
             "_source" : {
                "views" : 2
             }
          }
       ]
   }
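Since the results come back in request order, a client can match them up by position. Below is a minimal Python sketch of that idea. The helper names build_mget_body and pair_with_results are made up for illustration and are not part of any Elasticsearch client; the sketch only builds and reads plain dictionaries and sends nothing over the network.

```python
def build_mget_body(targets):
    """Build an mget request body from (index, type, id, source_fields) tuples.

    source_fields may be None, in which case the whole _source is requested.
    Hypothetical helper, for illustration only.
    """
    docs = []
    for index, doc_type, doc_id, source_fields in targets:
        doc = {"_index": index, "_type": doc_type, "_id": doc_id}
        if source_fields is not None:
            doc["_source"] = source_fields
        docs.append(doc)
    return {"docs": docs}


def pair_with_results(body, response):
    """Pair each requested doc with its result, relying on mget keeping request order."""
    if len(body["docs"]) != len(response["docs"]):
        raise ValueError("response does not match the request length")
    return list(zip(body["docs"], response["docs"]))


# The same two documents as in the request above.
body = build_mget_body([
    ("website", "blog", 2, None),
    ("website", "pageviews", 1, "views"),
])
```

Passing body together with the parsed JSON response to pair_with_results gives a list of (requested, result) pairs, so each result can be traced back to the request entry that produced it.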

Elasticsearch reindex error: the final mapping would have more than 1 type

The example above stores two types, blog and pageviews, in the same website index, which an index created in Elasticsearch 6.x does not allow. The Elasticsearch documentation states:

Indices created in Elasticsearch 6.0.0 or later may only contain a single mapping type. Indices created in 5.x with multiple mapping types will continue to function as before in Elasticsearch 6.x. Mapping types will be completely removed in Elasticsearch 7.0.0.

https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html#_index_per_document_type

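If all the documents you want to retrieve are in the same _index (or even of the same _type), you can specify a default /_index or a default /_index/_type in the URL.

You can still override these values in the individual requests: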

GET /website/blog/_mget
{
   "docs" : [
      { "_id" : 2 },
      { "_type" : "pageviews", "_id" :   1 }
   ]
}

In fact, if all the documents have the same _index and _type, you can pass just an ids array instead of the full docs array:

GET /website/blog/_mget
{
   "ids" : [ "2", "1" ]
}

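Note that the second document we requested does not exist. We specified type blog, but the document with ID 1 is of type pageviews. The missing document is reported in the response body: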

   {
      "docs" : [
        {
          "_index" :   "website",
          "_type" :    "blog",
          "_id" :      "2",
          "_version" : 10,
          "found" :    true,
          "_source" : {
            "title":   "My first external blog entry",
            "text":    "This is a piece of cake..."
          }
        },
        {
          "_index" :   "website",
          "_type" :    "blog",
          "_id" :      "1",
          "found" :    false
        }
      ]
   }

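The second entry has found set to false: this document was not found.

The fact that the second document wasn't found did not prevent the first one from being retrieved. Each document is retrieved and reported on individually.

The HTTP status code of the request above is still 200, even though one document wasn't found. In fact, it would be 200 even if none of the requested documents were found, because the mget request itself completed successfully. To determine whether an individual document was found, check its found flag.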
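Because the status code says nothing about individual documents, client code has to look at every entry. A minimal Python sketch of that check follows. The function name split_found is made up for illustration, and the response is the parsed JSON from the example above.

```python
def split_found(response):
    """Split an mget response into found documents and the IDs of missing ones."""
    found, missing = [], []
    for doc in response["docs"]:
        # Treat a missing "found" key the same as found == False.
        if doc.get("found"):
            found.append(doc)
        else:
            missing.append(doc["_id"])
    return found, missing


# The response shown above, as a Python dictionary.
response = {
    "docs": [
        {"_index": "website", "_type": "blog", "_id": "2", "_version": 10,
         "found": True,
         "_source": {"title": "My first external blog entry",
                     "text": "This is a piece of cake..."}},
        {"_index": "website", "_type": "blog", "_id": "1", "found": False},
    ]
}

found, missing = split_found(response)
```

Here found holds the entry for document 2 and missing holds the ID "1".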

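_source Filtering

By default, the full _source of each document is returned, but you can filter it with the _source parameter. The URL parameters _source, _source_include, and _source_exclude set defaults that apply to documents carrying no filtering instructions of their own.
For example, index three documents and then fetch each one with a different filter: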

POST _bulk
{ "create":  { "_index": "website", "_type": "blog", "_id": "1" }}
{ "text" :  "This is a piece of cake1", "title" : "My first external blog entry1","username.lastname":"lastname1","username.firstname":"firstname1"}
{ "create":  { "_index": "website", "_type": "blog", "_id": "2" }}
{ "text" :  "This is a piece of cake2", "title" : "My first external blog entry2","username.lastname":"lastname2","username.firstname":"firstname1"}
{ "create":  { "_index": "website", "_type": "blog", "_id": "3" }}
{ "text" :  "This is a piece of cake3", "title" : "My first external blog entry3","username.lastname":"lastname3","username.firstname":"firstname3"}

GET /website/blog/_mget 
{
    "docs" : [
        {
            "_id" : "1",
            "_source" : false
        },
        {
            "_id" : "2",
            "_source" : ["title", "text"]
        },
        {
            "_id" : "3",
            "_source" : {
                "include": ["username"],
                "exclude": ["username.lastname"]
            }
        }
    ]
}

   {
        "docs": [
            {
                "_index": "website",
                "_type": "blog",
                "_id": "1",
                "_version": 1,
                "found": true
            },
            {
                "_index": "website",
                "_type": "blog",
                "_id": "2",
                "_version": 1,
                "found": true,
                "_source": {
                    "text": "This is a piece of cake2",
                    "title": "My first external blog entry2"
                }
            },
            {
                "_index": "website",
                "_type": "blog",
                "_id": "3",
                "_version": 1,
                "found": true,
                "_source": {
                    "username.firstname": "firstname3"
                }
            }
        ]
   }

Routing

Routing also applies to mget requests. You can set a default routing value in the URL and override it for individual documents in the body:

GET /website/blog/_mget?routing=key1 
{
    "docs" : [
        {
            "_id" : "1",
            "_routing" : "key2"
        },
        {
            "_id" : "2"
        }
    ]
}

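In the example above, website/blog/1 is fetched from the shard selected by routing key key2, while website/blog/2 is fetched from the shard selected by routing key key1, the default from the URL.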
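The override rule can be written down in a couple of lines. This Python sketch only mirrors the rule as described here and is not Elasticsearch's own code; effective_routing is a made-up name.

```python
def effective_routing(doc, url_routing=None):
    """Return the routing key used for one mget entry.

    A per-document _routing wins; otherwise the URL default applies.
    """
    return doc.get("_routing", url_routing)


# The two entries from the request above, with routing=key1 in the URL.
docs = [{"_id": "1", "_routing": "key2"}, {"_id": "2"}]
resolved = {doc["_id"]: effective_routing(doc, "key1") for doc in docs}
```

For the request above, resolved maps ID "1" to key2 and ID "2" to key1.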

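Cheaper in Bulk

In the same way that mget lets us retrieve multiple documents at once, the bulk API lets us make multiple create, index, update, or delete requests in a single step. If you need to index a data stream such as log events, you can queue the events up and index them in batches of hundreds or thousands.

The bulk request body has a slightly different format from other requests: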

   { action: { metadata }}
   { request body }
   { action: { metadata }}
   { request body }
   ...

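This format is like a stream of valid one-line JSON documents joined together by newline (\n) characters. Two important points to note:

Every line must end with a newline character (\n), including the last line. These newlines are used as markers to separate the lines.
The lines cannot contain unescaped newline characters, as these would interfere with parsing. This means that the JSON must not be pretty-printed.

Why the bulk API uses this format is explained in "Why the Funny Format?".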

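The action/metadata line specifies what action to perform on which document.

The action must be one of the following:

create

Create a document only if it does not already exist. See "Creating a New Document".

index

Create a new document or replace an existing one. See "Indexing a Document" and "Updating a Whole Document".

update

Do a partial update on a document. See "Partial Updates to Documents".

delete

Delete a document. See "Deleting a Document".

The metadata should specify the _index, _type, and _id of the document to be indexed, created, updated, or deleted.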

For instance, a delete request looks like this:

{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}

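The request body line consists of the document _source itself: the fields and values that the document contains. It is required for index and create operations, which makes sense: you must supply the document to index.

It is also required for update operations and should contain the same request body that you would pass to the update API: doc, upsert, script, and so forth. No request body line is needed for a delete.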

{ "create":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }

If no _id is specified, an ID is generated automatically:

{ "index": { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }

To put it all together, a complete bulk request has this form:

   POST /_bulk
   { "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
   { "create": { "_index": "website", "_type": "blog", "_id": "123" }}
   { "title": "My first blog post" }
   { "index": { "_index": "website", "_type": "blog" }}
   { "title": "My second blog post" }
   { "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
   { "doc" : {"title" : "My updated blog post"} }

Notice that the delete action has no request body; it is followed directly by the next action.

Remember the final newline character.
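The formatting rules are easy to get wrong by hand, so it can help to generate the body. Here is a minimal Python sketch that only uses the standard json module. The function to_bulk_payload and its (action, metadata, body) tuple layout are assumptions made for this illustration, not a client API.

```python
import json

ACTIONS = {"create", "index", "update", "delete"}


def to_bulk_payload(operations):
    """Serialize (action, metadata, body) tuples into a bulk request body.

    body must be None for delete and a dict for every other action.
    """
    lines = []
    for action, metadata, body in operations:
        if action not in ACTIONS:
            raise ValueError("unknown action: %r" % (action,))
        if action == "delete" and body is not None:
            raise ValueError("delete takes no request body")
        if action != "delete" and body is None:
            raise ValueError("%s requires a request body" % action)
        # json.dumps without indent emits one line and escapes any newline
        # inside string values, so no line contains a raw newline.
        lines.append(json.dumps({action: metadata}))
        if body is not None:
            lines.append(json.dumps(body))
    # Every line, including the last one, ends with "\n".
    return "".join(line + "\n" for line in lines)


# The complete bulk request shown above.
payload = to_bulk_payload([
    ("delete", {"_index": "website", "_type": "blog", "_id": "123"}, None),
    ("create", {"_index": "website", "_type": "blog", "_id": "123"},
     {"title": "My first blog post"}),
    ("index", {"_index": "website", "_type": "blog"},
     {"title": "My second blog post"}),
    ("update", {"_index": "website", "_type": "blog", "_id": "123",
                "_retry_on_conflict": 3},
     {"doc": {"title": "My updated blog post"}}),
])
```

The resulting payload string is what would be sent as the body of POST /_bulk.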

The Elasticsearch response contains an items array, which lists the result of each request in the same order as it was requested:

   {
       "took": 4,
       "errors": false,
       "items": [
          { "delete": {
                "_index":   "website",
                "_type":    "blog",
                "_id":      "123",
                "_version": 2,
                "status":   200,
                "found":    true
          }},
          { "create": {
                "_index":   "website",
                "_type":    "blog",
                "_id":      "123",
                "_version": 3,
                "status":   201
          }},
          { "create": {
                "_index":   "website",
                "_type":    "blog",
                "_id":      "EiwfApScQiiy7TIKFxRCTw",
                "_version": 1,
                "status":   201
          }},
          { "update": {
                "_index":   "website",
                "_type":    "blog",
                "_id":      "123",
                "_version": 4,
                "status":   200
          }}
       ]
   }

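All subrequests completed successfully.

Each subrequest is executed independently, so the failure of one subrequest won't affect the success of the others. If any of the subrequests fail, the top-level errors flag is set to true and the error details are reported under the relevant request: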

POST /_bulk
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "Cannot create - it already exists" }
{ "index":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "But we can update it" }

In the response, we can see that creating document 123 failed because it already exists, but the subsequent index request, also on document 123, succeeded:

   {
       "took": 3,
       "errors": true,
       "items": [
          { "create": {
                "_index":   "website",
                "_type":    "blog",
                "_id":      "123",
                "status":   409,
                "error":    "DocumentAlreadyExistsException[[website][4] [blog][123]: document already exists]"
          }},
          { "index": {
                "_index":   "website",
                "_type":    "blog",
                "_id":      "123",
                "_version": 5,
                "status":   200
          }}
       ]
   }

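errors is true: one or more requests failed.
The HTTP status code for the create request is reported as 409 CONFLICT.
The error field holds the message explaining why the request failed.
The second request succeeded, with an HTTP status code of 200 OK.

That also means that bulk requests are not atomic: they cannot be used to implement transactions. Each request is processed separately, so the success or failure of one request does not interfere with the others.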
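Since a bulk request can partly succeed, the caller has to walk the items array to find out what failed. A minimal Python sketch, assuming the response layout shown above; failed_items is a made-up helper name, and a status of 400 or higher is treated as a failure.

```python
def failed_items(response):
    """Return (position, action, id, status) for every failed subrequest."""
    if not response.get("errors"):
        return []  # the top-level flag says nothing failed
    failures = []
    for position, item in enumerate(response["items"]):
        # Each item is a single-key dictionary: {action: result}.
        action, result = next(iter(item.items()))
        if result.get("status", 0) >= 400:
            failures.append((position, action, result.get("_id"), result["status"]))
    return failures


# The partly failed response shown above, as a Python dictionary.
response = {
    "took": 3,
    "errors": True,
    "items": [
        {"create": {"_index": "website", "_type": "blog", "_id": "123",
                    "status": 409,
                    "error": "DocumentAlreadyExistsException[[website][4] "
                             "[blog][123]: document already exists]"}},
        {"index": {"_index": "website", "_type": "blog", "_id": "123",
                   "_version": 5, "status": 200}},
    ],
}

failures = failed_items(response)
```

Only the failed create shows up in failures. The position in the list tells you which line pair of the original request to retry or log.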

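Don't Repeat the Index and Type

Perhaps you are batch-indexing logging data into the same index, with the same type. Specifying the same metadata for every document is a waste. Instead, just as for the mget API, the bulk request accepts a default /_index or /_index/_type in the URL: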

POST /website/_bulk
{ "index": { "_type": "log" }}
{ "event": "User logged in" }

You can still override the _index and _type in the metadata line, but the values from the URL are used as defaults:

POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
{ "index": { "_type": "blog" }}
{ "title": "Overriding the default type" }
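The way defaults and overrides combine can be sketched in Python as follows. This only mirrors the rule described here and is not how Elasticsearch implements it; resolve_target is a made-up name.

```python
def resolve_target(metadata, url_index=None, url_type=None):
    """Return the index and type one bulk action ends up using.

    Values in the metadata line win; the URL supplies the defaults.
    """
    return {
        "_index": metadata.get("_index", url_index),
        "_type": metadata.get("_type", url_type),
    }


# The two actions from POST /website/log/_bulk above.
first = resolve_target({}, url_index="website", url_type="log")
second = resolve_target({"_type": "blog"}, url_index="website", url_type="log")
```

The first action lands in website/log, while the second keeps the default index but goes to type blog.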

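How Big Is Too Big?

The entire bulk request needs to be loaded into memory by the node that receives it, so the bigger the request, the less memory is available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.

Fortunately, it is easy to find this sweet spot: try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.

It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start with is around 5-15 MB.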
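One way to respect both limits at once is to cut a batch as soon as either the document count or the byte size would be exceeded. Below is a minimal Python sketch under these assumptions: chunk_documents is a made-up helper, the default limits are just the starting points mentioned above, and the size estimate counts only the serialized document lines, not the action/metadata lines.

```python
import json


def chunk_documents(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of documents, each within max_docs and roughly max_bytes."""
    batch, batch_bytes = [], 0
    for doc in docs:
        # Serialized size of the request body line, plus its trailing newline.
        size = len(json.dumps(doc).encode("utf-8")) + 1
        # Close the current batch if adding this document would break a limit.
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch


# 2,500 small documents are split by count alone.
batches = list(chunk_documents([{"n": i} for i in range(2500)], max_docs=1000))
```

A single document larger than max_bytes still gets a batch of its own, so nothing is dropped. Time each batch size against your own cluster and adjust the limits from there.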

Posted on 2018-09-11 12:24 by 疯狂的小萝卜头
