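Advanced Elasticsearch

DSL search

Elasticsearch provides a rich, flexible query language called the Query DSL, which lets you build more complex and powerful queries.

A DSL (Domain Specific Language) query is expressed as a JSON request body: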

POST /haoke/user/_search
# Request body (match is only one of several query types)
{
  "query": {
    "match": {
      "age": 20
    }
  }
}

Find male users older than 30:

POST /haoke/user/_search
# Request data
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "age": {
            "gt": 30
          }
        }
      },
      "must": {
        "match": {
          "sex": "男"
        }
      }
    }
  }
}

Full-text search on the name field:

POST /haoke/user/_search
# Request data
{
  "query": {
    "match": {
      "name": "张三李四"
    }
  }
}

Highlighting

POST /haoke/user/_search
# Request data
{
  "query": {
    "match": {
      "name": "张三李四"
    }
  },
  "highlight": {
    "fields": {
      "name": {}
    }
  }
}

Aggregations

Elasticsearch supports aggregations, which are similar to GROUP BY in SQL:

POST /haoke/user/_search
{
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "age"
      }
    }
  }
}

Documents

In Elasticsearch, documents are stored as JSON and can have a complex structure, for example:

{
  "_index": "haoke",
  "_type": "user",
  "_id": "1005",
  "_version": 1,
  "_score": 1,
  "_source": {
    "id": 1005,
    "name": "孙七",
    "age": 37,
    "sex": "女",
    "card": {
      "card_number": "123456789"
    }
  }
}

Here, card is a complex field: a nested Card object.

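Metadata

A document is more than its data. It also carries metadata, that is, information about the document. The three required metadata elements are:

_index

An index is similar to a "database" in a relational database: it is where related data is stored and indexed.

Note:

In reality, data is stored and indexed in shards; an index is just a logical namespace that groups one or more shards together. This is an internal detail, however, and our application does not need to care about shards. As far as the application is concerned, documents live in an index, and Elasticsearch takes care of the rest.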

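_type

In applications we use objects to represent "things" such as a user, a blog post, a comment, or an email. Each object belongs to a class, which defines its properties or the data associated with it. Objects of a user class might hold a name, gender, age, and email address.

In a relational database we usually store objects of the same class in one table because they share the same structure. Likewise, in Elasticsearch, documents that represent the same kind of thing share a type, because their data structure is the same.

Each type has its own mapping, or schema definition, much like the columns of a table in a traditional database. Documents of all types are stored in the same index, and the type's mapping tells Elasticsearch how documents of that type should be indexed.

A _type name may be uppercase or lowercase; it may not begin with an underscore and may not contain commas. The examples in this article use user and person as type names. Mapping types were deprecated in Elasticsearch 7.0 and removed in 8.0, so the typed URLs shown here (such as /haoke/user/_search) target older versions.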

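_id

The _id is simply a string. Combined with _index and _type, it uniquely identifies a document in Elasticsearch. When creating a document you can supply your own _id or let Elasticsearch generate one for you (a 20-character, URL-safe string).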

Query responses

pretty

Append the pretty parameter to the query URL to make the returned JSON easier to read.

Selecting response fields

If you do not need every field in the response, you can ask for specific fields only:

GET /haoke/user/1005?_source=id,name
# Response
{
  "_index": "haoke",
  "_type": "user",
  "_id": "1005",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "孙七",
    "id": 1005
  }
}

To return only the original document without any metadata:

GET /haoke/user/1005/_source
GET /haoke/user/1005/_source?_source=id,name

Checking whether a document exists

If you only need to know whether a document exists, without fetching its content, use:

HEAD /haoke/user/1005

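Bulk operations

Batching requests can cut down on network round trips, for example when fetching or inserting many documents at once.

Bulk get

POST /haoke/user/_mget
{
  "ids": [ "1001", "1003" ]
}

Bulk insert (each action line is followed by its document on the next line, and the body must end with a newline):

POST /_bulk
{"create":{"_index":"haoke","_type":"user","_id":2001}}
{"id":2001,"name":"name1","age": 20,"sex": "男"}
{"create":{"_index":"haoke","_type":"user","_id":2002}}
{"id":2002,"name":"name2","age": 20,"sex": "男"}
{"create":{"_index":"haoke","_type":"user","_id":2003}}
{"id":2003,"name":"name3","age": 20,"sex": "男"}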

Bulk delete:

POST /_bulk
{"delete":{"_index":"haoke","_type":"user","_id":2001}}
{"delete":{"_index":"haoke","_type":"user","_id":2002}}
{"delete":{"_index":"haoke","_type":"user","_id":2003}}
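As a rough illustration of the body format used above, the following sketch assembles a bulk-delete body as newline-delimited JSON. The class and method names (BulkBodySketch, deleteBody) are made up for this example, and the code only builds a string; it does not send anything to Elasticsearch.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: builds the text body for a _bulk delete request. */
class BulkBodySketch {

    /** Emits one "delete" action line per id; every line ends with '\n'. */
    static String deleteBody(String index, String type, List<Integer> ids) {
        StringBuilder body = new StringBuilder();
        for (int id : ids) {
            body.append("{\"delete\":{\"_index\":\"").append(index)
                .append("\",\"_type\":\"").append(type)
                .append("\",\"_id\":").append(id)
                .append("}}\n"); // newline terminates each action, including the last
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Same index, type, and ids as the bulk delete example in the text
        System.out.print(deleteBody("haoke", "user", Arrays.asList(2001, 2002, 2003)));
    }
}
```

The newline after the last action line matters: the _bulk endpoint expects the body to end with one.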

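Pagination

Just as SQL uses the LIMIT keyword to return a single page of results, Elasticsearch accepts the from and size parameters:

size: number of results to return, default 10
from: number of initial results to skip, default 0

To show 5 results per page for pages 1 to 3, the requests are:

GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10

Beware of paging too deep or requesting too many results at once. Results are sorted before being returned, and a search request usually spans multiple shards. Each shard produces its own sorted results, which then have to be merged and sorted centrally so that the overall order is correct.

GET /haoke/user/_search?size=1&from=2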

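Deep paging in a cluster

To understand why deep paging is a problem, suppose we search an index with 5 primary shards. When we request the first page (results 1 to 10), each shard produces its own top 10 results and returns them to the requesting node, which then sorts all 50 results to pick the overall top 10.

Now suppose we request results 10,001 to 10,010 (page 1,001 at 10 results per page). Everything works the same way, except that each shard must now produce its top 10,010 results. The requesting node then sorts all 50,050 results and discards 50,040 of them!

In a distributed system the cost of sorting therefore grows with the page depth multiplied by the number of shards. This is why web search engines typically will not return more than 1,000 results for any query.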
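The arithmetic behind from/size and the deep-paging cost can be checked with a few lines of plain Java. This is only a sketch of the numbers discussed above; the class and method names are invented for illustration and nothing here talks to Elasticsearch.

```java
/** Hypothetical helper that reproduces the paging arithmetic from the text. */
class DeepPagingSketch {

    /** The "from" offset for a 1-based page number. */
    static int fromFor(int page, int size) {
        return (page - 1) * size;
    }

    /** Each shard has to produce its own top (from + size) hits. */
    static long perShard(int from, int size) {
        return (long) from + size;
    }

    /** Number of hits the requesting node must merge and sort. */
    static long sortedByCoordinator(int shards, int from, int size) {
        return shards * perShard(from, size);
    }

    /** Hits thrown away once the requested page has been picked. */
    static long discarded(int shards, int from, int size) {
        return sortedByCoordinator(shards, from, size) - size;
    }

    public static void main(String[] args) {
        System.out.println(fromFor(3, 5));                    // 10
        int from = fromFor(1001, 10);                         // results 10,001 to 10,010
        System.out.println(perShard(from, 10));               // 10010
        System.out.println(sortedByCoordinator(5, from, 10)); // 50050
        System.out.println(discarded(5, from, 10));           // 50040
    }
}
```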

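Mapping

So far Elasticsearch has inferred the field types for the indexes we created and the data we inserted. Sometimes we need to declare field types explicitly, because the inferred type does not match what we actually need.

Notes on string types:

The string type was common in older versions of Elasticsearch. Starting with Elasticsearch 5.x it is no longer supported and is replaced by text and keyword.

text: use it for fields that need full-text search, such as an email body or a product description. A text field is analyzed: before the inverted index is built, the analyzer splits the string into individual terms. text fields are not used for sorting and are rarely used for aggregations.

keyword: use it for structured fields such as email addresses, hostnames, status codes, and tags, and whenever a field needs filtering (for example, finding blog posts whose status is published), sorting, or aggregation. A keyword field can only be found by its exact value.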

Create an index with an explicit mapping:

PUT /itcast
{
  "settings": {
    "index": {
      "number_of_shards": "2",
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "person": {
      "properties": {
        "name": {
          "type": "text"
        },
        "age": {
          "type": "integer"
        },
        "mail": {
          "type": "keyword"
        },
        "hobby": {
          "type": "text"
        }
      }
    }
  }
}

View the mapping:

GET /itcast/_mapping

Insert data:

POST /itcast/_bulk
{"index":{"_index":"itcast","_type":"person"}}
{"name":"张三","age": 20,"mail": "111@qq.com","hobby":"羽毛球、乒乓球、足球"}
{"index":{"_index":"itcast","_type":"person"}}
{"name":"李四","age": 21,"mail": "222@qq.com","hobby":"羽毛球、乒乓球、足球、篮球"}
{"index":{"_index":"itcast","_type":"person"}}
{"name":"王五","age": 22,"mail": "333@qq.com","hobby":"羽毛球、篮球、游泳、听音乐"}
{"index":{"_index":"itcast","_type":"person"}}
{"name":"赵六","age": 23,"mail": "444@qq.com","hobby":"跑步、游泳"}
{"index":{"_index":"itcast","_type":"person"}}
{"name":"孙七","age": 24,"mail": "555@qq.com","hobby":"听音乐、看电影"}

Search:

POST /itcast/person/_search
{
  "query": {
    "match": {
      "hobby": "音乐"
    }
  }
}

Structured queries

term query

term is mainly used to match exact values such as numbers, dates, booleans, or not_analyzed strings (text that has not been analyzed):

{ "term": { "age": 26 }}
{ "term": { "date": "2014-09-01" }}
{ "term": { "public": true }}
{ "term": { "tag": "full_text" }}

Example:

POST /itcast/person/_search
{
  "query": {
    "term": {
      "age": 20
    }
  }
}

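terms query

terms is similar to term, but lets you specify several values to match. If several values are given for a field, a document matches when the field contains any one of them:

{
  "terms": {
    "tag": [ "search", "full_text", "nosql" ]
  }
}

Example:

POST /itcast/person/_search
{
  "query": {
    "terms": {
      "age": [20, 21]
    }
  }
}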

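range query

The range query finds documents whose field value falls within a given range:

{
  "range": {
    "age": {
      "gte": 20,
      "lt": 30
    }
  }
}

The range operators are:

gt :: greater than
gte :: greater than or equal to
lt :: less than
lte :: less than or equal to

Example:

POST /itcast/person/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 20,
        "lte": 22
      }
    }
  }
}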

exists query

The exists query finds documents that contain a value for a given field, similar to IS NOT NULL in SQL. Wrapped in must_not, it finds documents that lack the field, similar to IS NULL.

{
  "exists": {
    "field": "title"
  }
}

It is typically used to split an already selected set of documents by whether a field is present.

Example:

POST /haoke/user/_search
# The card field must be present
{
  "query": {
    "exists": {
      "field": "card"
    }
  }
}

match query

The match query is the standard query: whether you need a full-text search or an exact-value search, it is usually the one to reach for.

If you run a match query against a full-text field, the query string is first run through the field's analyzer before the search is executed:

{
  "match": {
    "tweet": "About Search"
  }
}

If you give match an exact value, such as a number, a date, a boolean, or a not_analyzed string, it searches for that exact value:

{ "match": { "age": 26 }}
{ "match": { "date": "2014-09-01" }}
{ "match": { "public": true }}
{ "match": { "tag": "full_text" }}

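bool query

The bool query combines the results of several queries with boolean logic. It supports these clauses:

must :: every clause must match, equivalent to and.
must_not :: no clause may match, equivalent to not.
should :: at least one clause should match, equivalent to or. When a must or filter clause is also present, should clauses become optional and only raise the score of documents that match them.

Each of these parameters accepts either a single query clause or an array of clauses:

{
  "bool": {
    "must": { "term": { "folder": "inbox" }},
    "must_not": { "term": { "tag": "spam" }},
    "should": [
      { "term": { "starred": true }},
      { "term": { "unread": true }}
    ]
  }
}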
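To make the clause semantics concrete, here is a small in-memory sketch that evaluates the same must / must_not / should combination against plain Java maps. It is not how Elasticsearch executes queries; the class BoolQuerySketch, its helpers, and the sample documents are all hypothetical, and should is modeled as optional because a must clause is present.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical in-memory model of the bool example from the text. */
class BoolQuerySketch {

    /** Stand-in for a term clause: the field must equal the value exactly. */
    static Predicate<Map<String, Object>> term(String field, Object value) {
        return doc -> value.equals(doc.get(field));
    }

    static final List<Predicate<Map<String, Object>>> MUST =
            Collections.singletonList(term("folder", "inbox"));
    static final List<Predicate<Map<String, Object>>> MUST_NOT =
            Collections.singletonList(term("tag", "spam"));
    static final List<Predicate<Map<String, Object>>> SHOULD =
            Arrays.asList(term("starred", true), term("unread", true));

    /** A document matches when all must clauses hold and no must_not clause holds. */
    static boolean matches(Map<String, Object> doc) {
        return MUST.stream().allMatch(p -> p.test(doc))
                && MUST_NOT.stream().noneMatch(p -> p.test(doc));
    }

    /** Number of should clauses hit; more hits would mean a higher score. */
    static long shouldHits(Map<String, Object> doc) {
        return SHOULD.stream().filter(p -> p.test(doc)).count();
    }

    static Map<String, Object> doc(String folder, String tag, boolean starred, boolean unread) {
        Map<String, Object> d = new HashMap<>();
        d.put("folder", folder);
        d.put("tag", tag);
        d.put("starred", starred);
        d.put("unread", unread);
        return d;
    }

    public static void main(String[] args) {
        Map<String, Object> starredMail = doc("inbox", "work", true, false);
        Map<String, Object> spamMail = doc("inbox", "spam", true, true);
        System.out.println(matches(starredMail) + " " + shouldHits(starredMail)); // true 1
        System.out.println(matches(spamMail));                                    // false
    }
}
```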

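Filter queries

Besides the structured queries covered above, Elasticsearch also supports running clauses such as term, range, and match as filters.

Example: find users whose age is 20

POST /itcast/person/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "age": 20
        }
      }
    }
  }
}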

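Queries versus filters

A filter asks of every document whether the field value contains a specific value: a yes or no answer.

A query asks how well the field value of every document matches a specific value.

A query computes how relevant each document is, assigns it a relevance score (_score), and sorts the matching documents by that score. This kind of scoring suits full-text search, where there is rarely one exactly right result.

The result of a filter is just a list of matching documents, which is quick to compute and easy to cache in memory at a cost of only 1 bit per document. Reusing these cached filter results in later requests is very efficient.

A query has to find matching documents and also compute the relevance of each one, so queries are generally more expensive than filters, and query results are not cacheable.

Recommendation:

For exact-match searches, prefer filters, because filter results can be cached.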
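One way to picture why filter results are cheap to cache: a yes/no answer per document fits in a single bit, and two cached filters can be combined with a bitwise AND. The sketch below uses java.util.BitSet for this; it is an illustration only, not Elasticsearch's actual cache, and the class name and sample ages are assumptions.

```java
import java.util.BitSet;

/** Hypothetical model of filter results held as one bit per document. */
class FilterCacheSketch {

    /** Sets bit docId when lower <= ages[docId] <= upper. */
    static BitSet ageBetween(int[] ages, int lower, int upper) {
        BitSet bits = new BitSet(ages.length);
        for (int docId = 0; docId < ages.length; docId++) {
            if (ages[docId] >= lower && ages[docId] <= upper) {
                bits.set(docId);
            }
        }
        return bits;
    }

    /** Combines two filter results without modifying either input. */
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        int[] ages = {20, 21, 22, 23, 24};          // one entry per document
        BitSet young = ageBetween(ages, 20, 22);    // documents 0, 1, 2
        BitSet older = ageBetween(ages, 22, 24);    // documents 2, 3, 4
        System.out.println(and(young, older));      // {2}
    }
}
```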

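Full-text search

The two most important aspects of full-text search are:

Relevance

The ability to evaluate how relevant each result is to the query and to rank results by that relevance. The calculation can be based on TF/IDF, geographic proximity, fuzzy similarity, or some other algorithm.

Analysis

The process of converting a block of text into distinct, normalized tokens, in order to build the inverted index and to query it.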

Recreate the itcast index, this time with the IK analyzer on the hobby field (if the itcast index from the mapping section still exists, delete it first):

PUT /itcast
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "person": {
      "properties": {
        "name": {
          "type": "text"
        },
        "age": {
          "type": "integer"
        },
        "mail": {
          "type": "keyword"
        },
        "hobby": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}

Insert data:

POST http://172.16.55.185:9200/itcast/_bulk
{"index":{"_index":"itcast","_type":"person"}}
{"name":"张三","age": 20,"mail": "111@qq.com","hobby":"羽毛球、乒乓球、足球"}
{"index":{"_index":"itcast","_type":"person"}}
{"name":"李四","age": 21,"mail": "222@qq.com","hobby":"羽毛球、乒乓球、足球、篮球"}
{"index":{"_index":"itcast","_type":"person"}}
{"name":"王五","age": 22,"mail": "333@qq.com","hobby":"羽毛球、篮球、游泳、听音乐"}
{"index":{"_index":"itcast","_type":"person"}}
{"name":"赵六","age": 23,"mail": "444@qq.com","hobby":"跑步、游泳、篮球"}
{"index":{"_index":"itcast","_type":"person"}}
{"name":"孙七","age": 24,"mail": "555@qq.com","hobby":"听音乐、看电影、羽毛球"}

Single-word search:

POST /itcast/person/_search
{
  "query": {
    "match": {
      "hobby": "音乐"
    }
  },
  "highlight": {
    "fields": {
      "hobby": {}
    }
  }
}

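How it works:

1. Check the field type.

The hobby field is a text field (with the IK analyzer configured), which means the query string itself should be analyzed too.

2. Analyze the query string.

The query string "音乐" is passed to the IK analyzer, which outputs the single term 音乐. Because there is only one term, the match query runs as a single low-level term query.

3. Find matching documents.

The term query looks up "音乐" in the inverted index and retrieves the set of documents containing that term; in this example, documents 3 and 5.

4. Score each document.

The term query computes a relevance score (_score) for each document by combining term frequency (how often "音乐" appears in the document's hobby field), inverse document frequency (how often "音乐" appears in the hobby field across all documents; the more common the term, the lower its weight), and field length (the shorter the field, the higher the relevance).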
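The three factors in step 4 can be sketched with the classic Lucene TF/IDF formulas: the square root of the term frequency, 1 + ln(numDocs / (docFreq + 1)), and 1 / sqrt(number of terms in the field). Treat this as an assumption about which formula is in play: Elasticsearch 5.0 and later default to BM25, so real _score values will differ. The class name TfIdfSketch and the sample numbers are hypothetical.

```java
/** Hypothetical sketch of the classic TF/IDF factors; not Elasticsearch's actual scorer. */
class TfIdfSketch {

    /** Term frequency: more occurrences in the field means a higher weight. */
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    /** Inverse document frequency: terms found in many documents weigh less. */
    static double idf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Field-length norm: shorter fields weigh more. */
    static double fieldNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Combined weight of one term in one field. */
    static double weight(int termFreq, int numDocs, int docFreq, int numTerms) {
        return tf(termFreq) * idf(numDocs, docFreq) * fieldNorm(numTerms);
    }

    public static void main(String[] args) {
        // Assumed numbers: 5 documents, the term occurs in 2 of them, once per field
        double shortField = weight(1, 5, 2, 4); // field with 4 terms
        double longField = weight(1, 5, 2, 9);  // field with 9 terms
        System.out.println(shortField > longField); // true: the shorter field scores higher
    }
}
```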

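Multi-word search

POST /itcast/person/_search
{
  "query": {
    "match": {
      "hobby": "音乐篮球"
    }
  },
  "highlight": {
    "fields": {
      "hobby": {}
    }
  }
}

The results are not what we wanted, however: we were looking for users whose hobbies include both "音乐" and "篮球", but the terms are clearly combined with "or".

Elasticsearch lets you specify the logical operator between the terms:

POST /itcast/person/_search
{
  "query": {
    "match": {
      "hobby": {
        "query": "音乐篮球",
        "operator": "and"
      }
    }
  },
  "highlight": {
    "fields": {
      "hobby": {}
    }
  }
}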

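Combined search

Searches can also use the bool query introduced earlier to combine clauses. Example:

POST /itcast/person/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "hobby": "篮球"
        }
      },
      "must_not": {
        "match": {
          "hobby": "音乐"
        }
      },
      "should": [
        {
          "match": {
            "hobby": "游泳"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "hobby": {}
    }
  }
}

This search means: results must contain 篮球, must not contain 音乐, and score higher if they also contain 游泳.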

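Weights

Sometimes we want to boost certain terms so that they influence a document's score. In the example below the search keywords are "游泳篮球"; results that also contain "音乐" get a boost of 10, and results that contain "跑步" get a boost of 2.

POST /itcast/person/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "hobby": {
            "query": "游泳篮球",
            "operator": "and"
          }
        }
      },
      "should": [
        {
          "match": {
            "hobby": {
              "query": "音乐",
              "boost": 10
            }
          }
        },
        {
          "match": {
            "hobby": {
              "query": "跑步",
              "boost": 2
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "hobby": {}
    }
  }
}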

Syncing MySQL to Elasticsearch:

Import the dependencies

<properties>
    <elasticsearch.version>7.13.0</elasticsearch.version>
</properties>

<!-- pom dependencies -->
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.13.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.elasticsearch.client</groupId>
            <artifactId>elasticsearch-rest-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>${elasticsearch.version}</version>
</dependency>

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-client</artifactId>
    <version>${elasticsearch.version}</version>
    <exclusions>
        <exclusion>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
</dependency>

Elasticsearch connection configuration:

package com.tm.config;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * @author q请问请问q
 * @createTime 2021-11-25 18:20:00
 */
@Configuration
public class config {

    // Request options shared by every call; the defaults are used unchanged
    public static final RequestOptions COMMON_OPTIONS;

    static {
        RequestOptions.Builder builder = RequestOptions.DEFAULT.toBuilder();
        COMMON_OPTIONS = builder.build();
    }

    /**
     * @title Connect without a username or password
     * @updateTime 2021/11/22 20:53
     */
    @Bean
    public static RestHighLevelClient esRestClient() {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        // Cluster-style configuration: pass one HttpHost per node
                        new HttpHost("192.168.206.133", 19200, "http")));
        return client;
    }
}

Implementation:

package com.tm.service.impl;

import com.alibaba.fastjson.JSONObject;
import com.tm.config.config;
import com.tm.mapper.EsSyncGoodsSpuMapper;
import com.tm.model.entity.EsSyncGoodsSpuEntity;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import javax.annotation.Resource;
import java.io.IOException;
import java.util.List;

/**
 * @author q请问请问q
 * @createTime 2021-11-25 18:20:00
 */
// @EnableScheduling turns on scheduled tasks
// @Component lets Spring discover this class
@EnableScheduling
@Component
public class EsSyncGoodsSpuServiceImpl {

    // Mapper that reads the rows from MySQL
    @Resource
    private EsSyncGoodsSpuMapper esSyncGoodsSpuMapper;

    // Reuse the client bean from config instead of building a new client on every run
    @Resource
    private RestHighLevelClient esRestClient;

    // Runs the aaa method every 5 seconds
    @Scheduled(cron = "0/5 * * * * ?")
    public void aaa() {
        // Query the data from MySQL
        List<EsSyncGoodsSpuEntity> list = esSyncGoodsSpuMapper.aaa();
        if (list == null || list.isEmpty()) {
            return; // an empty bulk request would be rejected
        }

        // One bulk request for the whole batch, so all rows go out in a single call
        BulkRequest request = new BulkRequest();
        list.forEach(a -> {
            // Index request targeting the goods_spu index
            IndexRequest index = new IndexRequest("goods_spu");
            // Use the row's spuId as the Elasticsearch _id instead of a generated one
            index.id(a.getSpuId().toString());
            // Serialize the entity to JSON and use it as the document source
            index.source(JSONObject.toJSONString(a), XContentType.JSON);
            request.add(index);
        });

        try {
            // Send the batch to Elasticsearch through the bulk API
            BulkResponse response = esRestClient.bulk(request, config.COMMON_OPTIONS);
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Print the synced rows
        System.out.println(list);
    }
}

posted @ 2022-07-05 14:05 迈克尔-唐-僧
