Elasticsearch 7.6 Study Notes 1: Getting Started with Elasticsearch

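Preface

The Chinese edition of the Definitive Guide only covers 2.x, but Elasticsearch is already at 7.6, so I installed the latest version to learn with.

Installation

This is an installation for learning; a production installation follows a different process.

Windows

Elasticsearch download: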

https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.0-windows-x86_64.zip

Kibana download:

https://artifacts.elastic.co/downloads/kibana/kibana-7.6.0-windows-x86_64.zip

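The latest official release is currently 7.6.0, but the download is painfully slow. Using Xunlei (Thunder), the download speed can reach xM.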

bin\elasticsearch.bat
bin\kibana.bat

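Double-click each .bat file to start it.

Docker installation

For testing and learning, the official Docker image is faster and more convenient.

Installation steps: https://www.cnblogs.com/woshimrf/p/docker-es7.html

The rest of this post follows: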

https://www.elastic.co/guide/en/elasticsearch/reference/7.6/getting-started.html

Index some documents

These tests use Kibana directly; you can also reach localhost:9200 with curl or Postman.

Open localhost:5601, then click Dev Tools.

Create a customer index

PUT /{index-name}/_doc/{id}

PUT /customer/_doc/1
{
  "name": "John Doe"
}

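PUT is the HTTP method. If the index customer does not exist in Elasticsearch, it is created, and a document is inserted with id 1 and name "John Doe".
If the document already exists, it is updated. Note that the update is a full overwrite: whatever the JSON body contains becomes the final document.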
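The create-or-overwrite behavior described above can be pictured with a toy in-memory model. This is only a sketch of the semantics, not Elasticsearch code; the `store` dictionary and the `put_doc` helper are hypothetical names made up for illustration.

```python
# Toy in-memory model of "PUT /{index}/_doc/{id}" semantics (not Elasticsearch code).
store = {}  # (index, id) -> {"_version": int, "_source": dict}

def put_doc(index, doc_id, body):
    """Create the document if it is missing, otherwise replace it entirely."""
    key = (index, doc_id)
    existing = store.get(key)
    version = existing["_version"] + 1 if existing else 1
    # The stored source is exactly the new body: nothing is merged with the old one.
    store[key] = {"_version": version, "_source": dict(body)}
    return {"_id": doc_id, "_version": version,
            "result": "updated" if existing else "created"}

first = put_doc("customer", "1", {"name": "John Doe", "age": 30})
second = put_doc("customer", "1", {"name": "John Doe"})
print(first["result"], second["result"])    # created updated
print(store[("customer", "1")]["_source"])  # {'name': 'John Doe'}
```

The second call drops the age field because the new body replaces the old document as a whole.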

The response looks like this:

{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 7,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 6,
  "_primary_term" : 1
}

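_index is the index name.
_type is always _doc.
_id is the primary key of the document, i.e. the key of a single record.
_version increases with every write to this _id; here it has reached 7 after repeated updates.
_shards reports the write result per shard copy. Two nodes are deployed here, so the document was written to the primary and one replica, and both succeeded.

In Kibana, Management > Index Management shows the state of an index. For example, this index has one primary shard and one replica shard.

Once a document has been saved, it can be read back immediately: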

GET /customer/_doc/1

{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 15,
  "_seq_no" : 14,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "John Doe"
  }
}

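_source is the content of the document we stored.

Bulk insert

When there are many documents to insert, they can be inserted in bulk. Download the prepared file, then import it into Elasticsearch with an HTTP request.

Create an index named bank. The number of primary shards cannot be changed on an existing index (short of reindexing or using the split/shrink APIs), so it has to be set when the index is created; the number of replicas can still be adjusted later. Here we configure 3 shards and 2 replicas.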

PUT /bank
{
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "2"
    }
  }
}

Data file: https://gitee.com/mirrors/elasticsearch/raw/master/docs/src/test/resources/accounts.json

After downloading it, send the file with curl or Postman:

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"
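The file sent to _bulk is newline-delimited JSON: an action line followed by the document itself, one pair per document. As a rough sketch of how such a body could be produced from a list of records, assuming the account number is used as the _id (the `to_bulk_ndjson` helper and the shortened sample records are hypothetical):

```python
import json

def to_bulk_ndjson(docs, id_field="account_number"):
    """Serialize documents into the newline-delimited body the _bulk API expects."""
    lines = []
    for doc in docs:
        # Action line: index this document under the given _id.
        lines.append(json.dumps({"index": {"_id": str(doc[id_field])}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

sample = [
    {"account_number": 1, "firstname": "Amber", "balance": 39225},
    {"account_number": 2, "firstname": "Roberta", "balance": 28838},
]
body = to_bulk_ndjson(sample)
print(body)
```

Saving `body` to a file would give something that can be posted with the curl command above in place of accounts.json.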

Each record has the following format:

{
  "_index": "bank",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "_score": 0,
  "_source": {
    "account_number": 1,
    "balance": 39225,
    "firstname": "Amber",
    "lastname": "Duke",
    "age": 32,
    "gender": "M",
    "address": "880 Holmes Lane",
    "employer": "Pyrami",
    "email": "amberduke@pyrami.com",
    "city": "Brogan",
    "state": "IL"
  }
}

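In Kibana's monitoring page, choose self monitoring, then find the bank index under Indices to see how the imported data is distributed.

There are 3 shards spread across different nodes, and each has 2 replicas.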
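As a quick piece of arithmetic on these settings: every primary shard gets number_of_replicas additional copies, so the total number of shard copies can be worked out as below. The function name is made up for this sketch.

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Each primary shard has number_of_replicas extra copies."""
    return number_of_shards * (1 + number_of_replicas)

copies = total_shard_copies(3, 2)
print(copies)  # 9
```

With 1 shard and 1 replica the same formula gives 2, which matches the "total" : 2 reported under _shards for the customer index earlier.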

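Start searching

With some data bulk-inserted, we can start learning to search. As seen above, the data is a set of bank account records; let's query all accounts, sorted by account number.

The SQL equivalent would be roughly:

select * from bank order by account_number asc limit 3 offset 2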

Query DSL


GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ],
  "size": 3,
  "from": 2
}

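_search is the search endpoint.
query is the search condition; match_all matches every document.
size is the number of hits returned per request, i.e. the page size. If omitted, it defaults to 10. The hits appear under hits in the response.
from is the zero-based offset of the first hit to return; it defaults to 0.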


{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1000,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "account_number" : 2,
          "balance" : 28838,
          "firstname" : "Roberta",
          "lastname" : "Bender",
          "age" : 22,
          "gender" : "F",
          "address" : "560 Kingsway Place",
          "employer" : "Chillium",
          "email" : "robertabender@chillium.com",
          "city" : "Bennett",
          "state" : "LA"
        },
        "sort" : [
          2
        ]
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : null,
        "_source" : {
          "account_number" : 3,
          "balance" : 44947,
          "firstname" : "Levine",
          "lastname" : "Burks",
          "age" : 26,
          "gender" : "F",
          "address" : "328 Wilson Avenue",
          "employer" : "Amtap",
          "email" : "levineburks@amtap.com",
          "city" : "Cochranville",
          "state" : "HI"
        },
        "sort" : [
          3
        ]
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : null,
        "_source" : {
          "account_number" : 4,
          "balance" : 27658,
          "firstname" : "Rodriquez",
          "lastname" : "Flores",
          "age" : 31,
          "gender" : "F",
          "address" : "986 Wyckoff Avenue",
          "employer" : "Tourmania",
          "email" : "rodriquezflores@tourmania.com",
          "city" : "Eastvale",
          "state" : "HI"
        },
        "sort" : [
          4
        ]
      }
    ]
  }
}

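The response provides the following information:

took is the time Elasticsearch spent running the query, in milliseconds.
timed_out says whether the search timed out.
_shards says how many shards were searched, and how many succeeded, failed, or were skipped. A shard is simply a slice of the data: the documents of one index are split into several pieces, much like sharding a table by id.
max_score is the score of the most relevant document.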
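The from and size parameters in the search request above behave much like slicing a sorted list. A minimal sketch of that idea over plain Python dictionaries, with a hypothetical `search_page` helper and made-up sample data:

```python
def search_page(docs, sort_field, size=10, from_=0):
    """Sort ascending by sort_field, skip from_ hits, return the next size hits."""
    ordered = sorted(docs, key=lambda d: d[sort_field])
    return ordered[from_:from_ + size]

accounts = [{"account_number": n} for n in (5, 3, 0, 4, 1, 2)]
page = search_page(accounts, "account_number", size=3, from_=2)
print([d["account_number"] for d in page])  # [2, 3, 4]
```

This mirrors the response shown earlier, where from 2 and size 3 returned account numbers 2, 3 and 4.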

Next, let's try queries with conditions.

Match query (analyzed terms)

Find addresses that contain mill or lane.

GET /bank/_search
{
  "query": { "match": { "address": "mill lane" } },
  "size": 2
}

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 19,
      "relation" : "eq"
    },
    "max_score" : 9.507477,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "136",
        "_score" : 9.507477,
        "_source" : {
          "account_number" : 136,
          "balance" : 45801,
          "firstname" : "Winnie",
          "lastname" : "Holland",
          "age" : 38,
          "gender" : "M",
          "address" : "198 Mill Lane",
          "employer" : "Neteria",
          "email" : "winnieholland@neteria.com",
          "city" : "Urie",
          "state" : "IL"
        }
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "970",
        "_score" : 5.4032025,
        "_source" : {
          "account_number" : 970,
          "balance" : 19648,
          "firstname" : "Forbes",
          "lastname" : "Wallace",
          "age" : 28,
          "gender" : "M",
          "address" : "990 Mill Road",
          "employer" : "Pheast",
          "email" : "forbeswallace@pheast.com",
          "city" : "Lopezo",
          "state" : "AK"
        }
      }
    ]
  }
}

I limited the response to 2 hits, but 19 documents actually matched.

Phrase match query

GET /bank/_search
{
  "query": { "match_phrase": { "address": "mill lane" } }
}

This time only one document matches the whole phrase:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 9.507477,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "136",
        "_score" : 9.507477,
        "_source" : {
          "account_number" : 136,
          "balance" : 45801,
          "firstname" : "Winnie",
          "lastname" : "Holland",
          "age" : 38,
          "gender" : "M",
          "address" : "198 Mill Lane",
          "employer" : "Neteria",
          "email" : "winnieholland@neteria.com",
          "city" : "Urie",
          "state" : "IL"
        }
      }
    ]
  }
}
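The difference between the two queries can be approximated in a few lines. By default, match splits the query into terms and accepts a document containing any of them, while match_phrase requires the terms to appear next to each other in order. The sketch below is a simplified model: it assumes lowercase, whitespace-split tokens as a stand-in for the real analyzer, ignores scoring, and all function names are hypothetical.

```python
def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def match(field_value, query):
    """True if the field contains at least one query term (OR semantics)."""
    field_terms = set(tokens(field_value))
    return any(term in field_terms for term in tokens(query))

def match_phrase(field_value, query):
    """True if the query terms appear consecutively and in order."""
    field_terms, query_terms = tokens(field_value), tokens(query)
    n = len(query_terms)
    return any(field_terms[i:i + n] == query_terms
               for i in range(len(field_terms) - n + 1))

addresses = ["198 Mill Lane", "990 Mill Road", "880 Holmes Lane", "328 Wilson Avenue"]
print([a for a in addresses if match(a, "mill lane")])
# ['198 Mill Lane', '990 Mill Road', '880 Holmes Lane']
print([a for a in addresses if match_phrase(a, "mill lane")])
# ['198 Mill Lane']
```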

Multi-condition query

Real queries usually combine several conditions:

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}

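bool combines multiple query clauses.
must, should, and must_not are clauses of the bool query. must and should contribute to the relevance score, and results are sorted by score by default.
must_not acts as a filter: it decides whether a document is included in the results but does not affect the score.

You can also explicitly specify arbitrary filters to include or exclude documents based on structured data.

For example, find accounts whose balance is between 20000 and 30000 (inclusive).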

GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}
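When such requests are built from application code, the clauses can be assembled as plain dictionaries. The sketch below combines the two bool queries shown above into one body purely for illustration; `bool_query` is a hypothetical helper, not part of any client library.

```python
import json

def bool_query(must=None, must_not=None, filters=None):
    """Assemble a bool query body, leaving out clauses that were not given."""
    clauses = {"must": must, "must_not": must_not, "filter": filters}
    return {"query": {"bool": {k: v for k, v in clauses.items() if v}}}

body = bool_query(
    must=[{"match": {"age": "40"}}],                # scored clause
    must_not=[{"match": {"state": "ID"}}],          # exclusion, no effect on score
    filters=[{"range": {"balance": {"gte": 20000, "lte": 30000}}}],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET /bank/_search in Dev Tools.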

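Aggregations (group by)

Count accounts per state

In SQL this might be written as:

select state AS group_by_state, count(*) from tbl_bank group by state order by count(*) desc limit 3;

The corresponding Elasticsearch request is: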


GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3
      }
    }
  }
}

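size=0 suppresses the hits: Elasticsearch would otherwise return the matching documents, and we only want the aggregated values.
aggs is the keyword that introduces aggregations.
group_by_state is the name of this aggregation result; the name is user-defined.
terms buckets documents by the exact values of a field; here it is the field to group by.
state.keyword: state is a text field, and string fields used for counting and grouping must be of type keyword, so the keyword sub-field is used.
size=3 limits the number of buckets returned, here the top 3; the default is the top 10. The cluster-wide limit is 10000 buckets and can be changed through search.max_buckets. Note that with multiple shards the counts can be imprecise; more on that later.

Response: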

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1000,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : 26,
      "sum_other_doc_count" : 928,
      "buckets" : [
        {
          "key" : "MD",
          "doc_count" : 28
        },
        {
          "key" : "ID",
          "doc_count" : 23
        },
        {
          "key" : "TX",
          "doc_count" : 21
        }
      ]
    }
  }
}

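hits holds the documents that matched the query; because size=0 was set, it is []. total says this query matched 1000 documents.
aggregations holds the aggregation results.
group_by_state is the name we gave the aggregation in the request.
doc_count_error_upper_bound is an upper bound on the document count of a term that was not returned by this aggregation but might still exist. As the name suggests, it is the worst-case estimate of what was left out of the final result. The larger the value, the more likely the result is inaccurate. A value of 0 means the counts are exact; a non-zero value does not necessarily mean the result is wrong.
sum_other_doc_count is the number of documents that fall into buckets not returned in the response.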
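On a single list of values, the buckets and sum_other_doc_count fields can be reproduced with a counter. This is only a sketch of what the fields mean on one exact data set; it ignores shards entirely, and the `terms_agg` helper and sample values are made up.

```python
from collections import Counter

def terms_agg(values, size=10):
    """Exact top-`size` buckets by count, plus the count of everything else."""
    counts = Counter(values)
    top = counts.most_common(size)
    buckets = [{"key": key, "doc_count": n} for key, n in top]
    sum_other = sum(counts.values()) - sum(n for _, n in top)
    return {"buckets": buckets, "sum_other_doc_count": sum_other}

states = ["TX"] * 4 + ["MD"] * 3 + ["ID"] * 2 + ["HI"]
result = terms_agg(states, size=2)
print(result["buckets"])              # [{'key': 'TX', 'doc_count': 4}, {'key': 'MD', 'doc_count': 3}]
print(result["sum_other_doc_count"])  # 3
```

The same relation holds in the response above: 1000 - (28 + 23 + 21) = 928.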

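It is worth asking whether this top 3 is accurate. doc_count_error_upper_bound is non-zero here, so the counts may well be off, and the top 3 we got have counts 28, 23, and 21. Let's add another parameter and compare the results: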

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3,
        "shard_size":  60
      }
    }
  }
}
-----------------------------------------
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 915,
      "buckets" : [
        {
          "key" : "TX",
          "doc_count" : 30
        },
        {
          "key" : "MD",
          "doc_count" : 28
        },
        {
          "key" : "ID",
          "doc_count" : 27
        }
      ]
    }
  }

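shard_size is the number of buckets each shard computes and returns. A terms aggregation is computed per shard first, and the per-shard results are then merged into the final result. Because data is not evenly distributed across shards, each shard's top N differs, so the merged result can miss part of a term's count, which is why doc_count_error_upper_bound is non-zero. The default shard_size is size*1.5+10, which is 14.5 for size=3 (presumably truncated to 14, since shard_size is an integer); setting shard_size to that value explicitly did return the same result as omitting it. With shard_size set to 60 the error finally drops to 0, which guarantees that these 3 really are the top 3. In other words, for accurate results shard_size should be set generously, for example 20 times size.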
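The effect of shard_size can be reproduced with a small simulation. It is a simplified model under stated assumptions: each shard is a dictionary of per-term counts, each shard reports only its own top shard_size terms, and the coordinator adds up whatever it receives. The function names and the data are invented, and truncating the default to an integer is an assumption of this sketch.

```python
from collections import Counter

def default_shard_size(size):
    """size * 1.5 + 10; truncation to an integer is an assumption of this sketch."""
    return int(size * 1.5 + 10)

def sharded_terms_agg(shards, size, shard_size):
    """Each shard reports only its top shard_size terms; the coordinator sums them."""
    merged = Counter()
    for shard_counts in shards:
        for key, n in Counter(shard_counts).most_common(shard_size):
            merged[key] += n
    return merged.most_common(size)

shards = [
    {"MD": 9, "TX": 5, "ID": 1},
    {"ID": 8, "TX": 5, "MD": 1},
    {"TX": 5, "MD": 1, "ID": 1},
]
print(sharded_terms_agg(shards, size=1, shard_size=1))  # [('MD', 9)]
print(sharded_terms_agg(shards, size=1, shard_size=3))  # [('TX', 15)]
print(default_shard_size(3))                            # 14
```

TX is the true leader with 15 documents, but it never tops the first two shards, so with shard_size=1 it is mostly dropped before the merge and MD wins with an undercounted 9. A larger shard_size lets every shard report enough terms for the merged counts to be exact.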

Count per state and compute the average balance

To see the average balance for each state, the SQL might be:

select 
  state, avg(balance) AS average_balance, count(*) AS group_by_state 
from tbl_bank
group by state
limit 3

In Elasticsearch it can be queried like this:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3,
        "shard_size":  60
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        },
        "sum_balance": {
          "sum": {
            "field": "balance"
          }
        }
      }
    }
  }
}

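The second (nested) aggs computes metrics for each state bucket.
average_balance is a user-defined name; its value is the avg of balance over documents with the same state.
sum_balance is a user-defined name; its value is the sum of balance over documents with the same state.

The result: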

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1000,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 915,
      "buckets" : [
        {
          "key" : "TX",
          "doc_count" : 30,
          "sum_balance" : {
            "value" : 782199.0
          },
          "average_balance" : {
            "value" : 26073.3
          }
        },
        {
          "key" : "MD",
          "doc_count" : 28,
          "sum_balance" : {
            "value" : 732523.0
          },
          "average_balance" : {
            "value" : 26161.535714285714
          }
        },
        {
          "key" : "ID",
          "doc_count" : 27,
          "sum_balance" : {
            "value" : 657957.0
          },
          "average_balance" : {
            "value" : 24368.777777777777
          }
        }
      ]
    }
  }
}
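The nested avg and sum metrics amount to grouping by state and then reducing each group. A minimal sketch over a handful of records taken from the hits shown earlier in this post; the `stats_by_state` helper is a hypothetical name.

```python
from collections import defaultdict

def stats_by_state(accounts):
    """Group accounts by state and compute count, sum and average of balance."""
    groups = defaultdict(list)
    for account in accounts:
        groups[account["state"]].append(account["balance"])
    return {
        state: {
            "doc_count": len(balances),
            "sum_balance": sum(balances),
            "average_balance": sum(balances) / len(balances),
        }
        for state, balances in groups.items()
    }

accounts = [
    {"state": "IL", "balance": 39225},
    {"state": "IL", "balance": 45801},
    {"state": "HI", "balance": 44947},
    {"state": "HI", "balance": 27658},
    {"state": "LA", "balance": 28838},
]
stats = stats_by_state(accounts)
print(stats["HI"])  # {'doc_count': 2, 'sum_balance': 72605, 'average_balance': 36302.5}
```

The same relation can be checked against the response above: for TX, 782199 / 30 = 26073.3.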

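Count per state, ordered by average balance

A terms aggregation sorts buckets by doc_count descending by default. To sort another way, the SQL might look like this: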

select 
  state, avg(balance) AS average_balance, count(*) AS group_by_state 
from tbl_bank
group by state
order by average_balance desc
limit 3

The corresponding Elasticsearch query:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          "average_balance": "desc"
        },
        "size": 3
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

The top 3 in the response is no longer the same as before:

  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : -1,
      "sum_other_doc_count" : 983,
      "buckets" : [
        {
          "key" : "DE",
          "doc_count" : 2,
          "average_balance" : {
            "value" : 39040.5
          }
        },
        {
          "key" : "RI",
          "doc_count" : 5,
          "average_balance" : {
            "value" : 36035.4
          }
        },
        {
          "key" : "NE",
          "doc_count" : 10,
          "average_balance" : {
            "value" : 35648.8
          }
        }
      ]
    }
  }
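Ordering buckets by a metric instead of by doc_count is just a different sort key. The sketch below sorts a few buckets by average_balance in descending order; the bucket values are copied from the two responses above, and the `top_states_by_average` helper is a hypothetical name.

```python
def top_states_by_average(stats, size=3):
    """Order buckets by average_balance descending and keep the first `size`."""
    ordered = sorted(stats.items(),
                     key=lambda item: item[1]["average_balance"],
                     reverse=True)
    return [state for state, _ in ordered[:size]]

# Bucket values copied from the two responses above.
stats = {
    "TX": {"doc_count": 30, "average_balance": 26073.3},
    "MD": {"doc_count": 28, "average_balance": 26161.535714285714},
    "DE": {"doc_count": 2, "average_balance": 39040.5},
    "RI": {"doc_count": 5, "average_balance": 36035.4},
    "NE": {"doc_count": 10, "average_balance": 35648.8},
}
top = top_states_by_average(stats)
print(top)  # ['DE', 'RI', 'NE']
```

States with few accounts, such as DE with 2, can easily lead this ranking, which is why the top 3 differs so much from the count-based one.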

References

Chinese community: https://elasticsearch.cn/
Elasticsearch official docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html
Elasticsearch official docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index.html
Why terms aggregations can be inaccurate: https://www.dongwm.com/post/elasticsearch-terms-agg-is-not-accurate/

posted @ 2020-04-10 18:32 Ryan.Miao
