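Elasticsearch Search

Structured search

Structured search is about querying data that has inherent structure. Dates, times, and numbers are all structured: they have a precise format on which logical operations can be performed. Common operations include comparing ranges of numbers or dates, or determining which of two values is larger.

Text can be structured too. A packet of colored pens has a discrete set of colors: red, green, blue. A blog post may be tagged with the keywords distributed and search. Products on an e-commerce site have UPCs (Universal Product Codes) or other unique identifiers, all of which must follow a strict, structured format.

With structured search, the answer to a query is always yes or no: a document is either in the set or it is not. Structured search does not care about document relevance or scoring; it simply includes or excludes documents.

This makes sense logically: a number cannot be "more" in a range than another number. It is either in the range or it is not. Likewise, for structured text, a value is either equal or it is not. There is no concept of "more similar".

Source: https://www.elastic.co/guide/cn/elasticsearch/guide/current/structured-search.html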

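Structured data
- Dates, booleans, and numbers
- Text: enumerations, such as the colors red, green, blue
Structured search
- Searching structured data
Structured search in Elasticsearch
- Booleans, times, dates, and numbers are structured data: they have a precise format on which logical operations can be performed, such as comparing ranges of numbers or dates, or determining which of two values is larger.
- Structured text can be matched exactly or partially
  - Term query / Prefix query
  - A structured result has only two values: yes or no
- Depending on the use case, you can decide whether a structured search needs scoring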

# Structured search, exact match
DELETE products
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }

GET products/_mapping

Boolean values

# term query on a boolean value; the hits are scored
POST products/_search
{
  "profile": "true",
  "explain": true,
  "query": {
    "term": {
      "avaliable": true
    }
  }
}

# Boolean value: wrapping the term query in constant_score turns it into a filter, so no relevance score is computed
POST products/_search
{
  "profile": "true",
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "avaliable": true
        }
      }
    }
  }
}
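
The effect of constant_score can be modelled in a few lines of Python. This is only a sketch of the semantics, not of how Elasticsearch executes the query: the `constant_score` function and the in-memory `products` dictionary are stand-ins invented for illustration. Every document that passes the filter receives the same score, equal to the boost (1.0 when none is given).

```python
def constant_score(docs, predicate, boost=1.0):
    """Return (doc_id, score) pairs; every matching document gets the same score."""
    return [(doc_id, boost) for doc_id, doc in docs.items() if predicate(doc)]


# In-memory copy of the relevant fields of the sample products (illustrative only).
products = {
    1: {"price": 10, "avaliable": True},
    2: {"price": 20, "avaliable": True},
    3: {"price": 30, "avaliable": True},
    4: {"price": 30, "avaliable": False},
}

hits = constant_score(products, lambda doc: doc["avaliable"] is True)
print(hits)  # [(1, 1.0), (2, 1.0), (3, 1.0)]
```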

Numeric types

# term query on a numeric field
POST products/_search
{
  "profile": "true",
  "explain": true,
  "query": {
    "term": {
      "price": 30
    }
  }
}

# terms query on a numeric field
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "price": [
            "20",
            "30"
          ]
        }
      }
    }
  }
}

# range query on a numeric field
GET products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "range" : {
                    "price" : {
                        "gte" : 20,
                        "lte"  : 30
                    }
                }
            }
        }
    }
}

gt: greater than
lt: less than
gte: greater than or equal to
lte: less than or equal to
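
As a rough illustration of the four range operators, the Python sketch below treats a range filter as a plain yes/no predicate. The `in_range` helper and the `prices` dictionary are made up for this example; they mirror the sample prices indexed earlier.

```python
import operator

# Map each range keyword to the comparison it stands for.
RANGE_OPS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def in_range(value, **bounds):
    """True when the value satisfies every bound, e.g. in_range(25, gte=20, lte=30)."""
    return all(RANGE_OPS[name](value, bound) for name, bound in bounds.items())


prices = {1: 10, 2: 20, 3: 30, 4: 30}  # doc id -> price, as in the sample data
matches = [doc_id for doc_id, price in prices.items() if in_range(price, gte=20, lte=30)]
print(matches)  # [2, 3, 4]
```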

Dates

# range query on a date field
POST products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "range" : {
                    "date" : {
                      "gte" : "now-1y"
                    }
                }
            }
        }
    }
}

Unit	Meaning
y	year
M	month
w	week
d	day
H/h	hour
m	minute
s	second
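
To make an expression such as now-1y concrete, here is a small Python sketch that resolves it against a fixed "now". It handles only the simple form `now`, `now-<N><unit>` or `now+<N><unit>` with the units from the table; the `resolve` and `shift_months` helpers are hypothetical, and clamping the day at month end is a choice made for this sketch rather than a statement about Elasticsearch internals.

```python
import calendar
import re
from datetime import datetime, timedelta

# Units with a fixed length map directly onto timedelta keyword arguments.
FIXED_UNITS = {"w": "weeks", "d": "days", "h": "hours", "H": "hours",
               "m": "minutes", "s": "seconds"}


def shift_months(moment, months):
    """Move a datetime by whole months, clamping the day to the target month's length."""
    index = moment.year * 12 + (moment.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)


def resolve(expression, now):
    """Resolve 'now', 'now-1y', 'now+2d' and similar against the given 'now'."""
    match = re.fullmatch(r"now(?:([+-])(\d+)([yMwdhHms]))?", expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression}")
    sign, amount, unit = match.groups()
    if sign is None:
        return now
    amount = int(amount) * (1 if sign == "+" else -1)
    if unit == "y":
        return shift_months(now, 12 * amount)
    if unit == "M":
        return shift_months(now, amount)
    return now + timedelta(**{FIXED_UNITS[unit]: amount})


now = datetime(2020, 1, 27, 18, 34)  # fixed "now" so the result is reproducible
cutoff = resolve("now-1y", now)
print(cutoff)  # 2019-01-27 18:34:00
```

With this fixed "now", neither sample date (2018-01-01, 2019-01-01) falls on or after the cutoff, so how many hits the date range query returns depends on when it is run.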

exists query: handling missing values

# exists query
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "date"
        }
      }
    }
  }
}

# must_not + exists: find the documents that have no value for date
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must_not":{
            "exists":{
              "field": "date"
            }
          }
        }
      },
      "boost": 1.2
    }
  }
}
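
The two queries above split the index into documents that have a date and documents that do not. A minimal Python sketch of that split follows; the `exists` helper and the trimmed-down `products` dictionary are illustrative assumptions. The helper treats a missing key, None, and an empty list as "no value".

```python
def exists(doc, field):
    """True when the field holds at least one value (missing, None and [] do not count)."""
    value = doc.get(field)
    return value is not None and value != []


# Only the date field of the sample products is kept here.
products = {
    1: {"date": "2018-01-01"},
    2: {"date": "2019-01-01"},
    3: {},
    4: {},
}

with_date = [doc_id for doc_id, doc in products.items() if exists(doc, "date")]
without_date = [doc_id for doc_id, doc in products.items() if not exists(doc, "date")]
print(with_date, without_date)  # [1, 2] [3, 4]
```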

Finding multiple exact values

# terms query on a numeric field
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "price": [
            "20",
            "30"
          ]
        }
      }
    }
  }
}

# terms query on a keyword (string) field
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "productID.keyword": [
            "QQPX-R-3956-#aD8",
            "JODL-X-1937-#pV7"
          ]
        }
      }
    }
  }
}

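Contains, not equals

Exact values and exact-value lookups on multi-valued fields: a term query tests containment, not full equality. Take particular care when querying multi-valued fields.

# Multi-valued fields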
POST /movies/_bulk
{ "index": { "_id": 1 }}
{ "title" : "Father of the Bridge Part II","year":1995, "genre":"Comedy"}
{ "index": { "_id": 2 }}
{ "title" : "Dave","year":1993,"genre":["Comedy","Romance"] }


# Multi-valued fields: a term query means "contains", not "equals"
POST movies/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "genre.keyword": "Comedy"
        }
      }
    }
  }
}
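
The containment behaviour can be sketched in Python as follows. The `term_matches` helper and the `movies` dictionary are illustrative stand-ins for the index: a single value is treated as a one-element list, and the term matches when the list contains it.

```python
def term_matches(doc, field, value):
    """Containment test: a single value is treated as a one-element list."""
    field_value = doc.get(field)
    values = field_value if isinstance(field_value, list) else [field_value]
    return value in values


movies = {
    1: {"genre": "Comedy"},
    2: {"genre": ["Comedy", "Romance"]},
}

hits = [doc_id for doc_id, movie in movies.items() if term_matches(movie, "genre", "Comedy")]
print(hits)  # [1, 2]: the multi-valued document matches too
```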

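Query & Filtering, and multi-string multi-field queries

Advanced search supports multiple text inputs searched against multiple fields. Elasticsearch has two different contexts, Query and Filter:

Query context: computes a relevance score
Filter context: no scoring (yes or no); results can be cached for better performance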

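Combining conditions:

Suppose you want to search for a movie that meets the following conditions
- The reviews contain Guitar, the user rating is above 3, and the release date is between 1993 and 2000
This search actually contains three pieces of logic, each targeting a different field
- The review field must contain Guitar / the user rating must be greater than 3 / the release date must fall within the given range
How can all three be combined with good performance?
- A compound query: the bool query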

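bool query

A bool query is a combination of one or more query clauses.

There are 4 clause types in total: 2 affect scoring and 2 do not.

Relevance is not only for full-text search. It also applies to yes | no clauses: the more clauses match, the higher the relevance score. When multiple query clauses are combined into a compound query such as bool, the score computed by each clause is merged into the overall relevance score.

Clause	Effect
must	Must match. Contributes to the score
should	Optional match. Contributes to the score
must_not	Filter context clause. Must not match; does not contribute to the score
filter	Filter context clause. Must match, but does not contribute to the score

bool query syntax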

POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }



# Basic syntax
POST /products/_search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "price" : "30" }
      },
      "filter": {
        "term" : { "avaliable" : "true" }
      },
      "must_not" : {
        "range" : {
          "price" : { "lte" : 10 }
        }
      },
      "should" : [
        { "term" : { "productID.keyword" : "JODL-X-1937-#pV7" } },
        { "term" : { "productID.keyword" : "XHDK-A-1293-#fJ3" } }
      ],
      "minimum_should_match" :1
    }
  }
}
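
To see which document the query above selects, the matching rules of the four clause types can be modelled as predicates in Python. This is only a sketch of the matching logic (scoring is left out), and `bool_matches` plus the in-memory `products` dictionary are names invented for this illustration.

```python
products = {
    1: {"price": 10, "avaliable": True, "productID": "XHDK-A-1293-#fJ3"},
    2: {"price": 20, "avaliable": True, "productID": "KDKE-B-9947-#kL5"},
    3: {"price": 30, "avaliable": True, "productID": "JODL-X-1937-#pV7"},
    4: {"price": 30, "avaliable": False, "productID": "QQPX-R-3956-#aD8"},
}


def bool_matches(doc, must=(), filter_=(), must_not=(), should=(), minimum_should_match=0):
    """Matching rules only: must and filter all hold, must_not none, enough should clauses."""
    if not all(clause(doc) for clause in (*must, *filter_)):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    return sum(bool(clause(doc)) for clause in should) >= minimum_should_match


hits = [
    doc_id
    for doc_id, doc in products.items()
    if bool_matches(
        doc,
        must=[lambda d: d["price"] == 30],
        filter_=[lambda d: d["avaliable"] is True],
        must_not=[lambda d: d["price"] <= 10],
        should=[
            lambda d: d["productID"] == "JODL-X-1937-#pV7",
            lambda d: d["productID"] == "XHDK-A-1293-#fJ3",
        ],
        minimum_should_match=1,
    )
]
print(hits)  # [3]
```

Document 4 has the right price but fails the filter, and document 1 matches a should clause but fails must, so only document 3 remains.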

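Solving the structured-query problem: contains, not equals

Add a count field and combine the conditions with a bool query.

From a business perspective, adapt the Elasticsearch data model as needed.

# Change the data model by adding a field, to solve the problem that arrays match by containment rather than exact equality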
POST /newmovies/_bulk
{ "index": { "_id": 1 }}
{ "title" : "Father of the Bridge Part II","year":1995, "genre":"Comedy","genre_count":1 }
{ "index": { "_id": 2 }}
{ "title" : "Dave","year":1993,"genre":["Comedy","Romance"],"genre_count":2 }

Querying with the count field

# must: the hits are scored
POST /newmovies/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"genre.keyword": {"value": "Comedy"}}},
        {"term": {"genre_count": {"value": 1}}}

      ]
    }
  }
}

# filter: no scoring takes place; the _score of every hit is 0
POST /newmovies/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"genre.keyword": {"value": "Comedy"}}},
        {"term": {"genre_count": {"value": 1}}}
        ]

    }
  }
}
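
The count field has to be computed before the documents are indexed. A possible way to do that on the client side is sketched below in Python; `with_genre_count` is a hypothetical helper, not part of any Elasticsearch client. The exact-match check then combines containment with the count, mirroring the two term clauses above.

```python
def with_genre_count(movie):
    """Return a copy of the document with genre_count derived from the genre field."""
    genres = movie["genre"]
    count = len(genres) if isinstance(genres, list) else 1
    return {**movie, "genre_count": count}


raw_movies = {
    1: {"title": "Father of the Bridge Part II", "year": 1995, "genre": "Comedy"},
    2: {"title": "Dave", "year": 1993, "genre": ["Comedy", "Romance"]},
}
newmovies = {doc_id: with_genre_count(movie) for doc_id, movie in raw_movies.items()}


def is_exactly(movie, genre):
    """Contains the genre AND has exactly one genre: the two term clauses combined."""
    genres = movie["genre"] if isinstance(movie["genre"], list) else [movie["genre"]]
    return genre in genres and movie["genre_count"] == 1


exact = [doc_id for doc_id, movie in newmovies.items() if is_exactly(movie, "Comedy")]
print(exact)  # [1]
```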

Filter context does not affect scoring

#Filtering Context
POST _search
{
  "query": {
    "bool" : {

      "filter": {
        "term" : { "avaliable" : "true" }
      },
      "must_not" : {
        "range" : {
          "price" : { "lte" : 10 }
        }
      }
    }
  }
}

Query context affects scoring

Nested bool queries

#Query Context
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }


POST /products/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "productID.keyword": {
              "value": "JODL-X-1937-#pV7"}}
        },
        {"term": {"avaliable": {"value": true}}
        }
      ]
    }
  }
}


# Nesting a bool query to express "should not" logic
POST /products/_search
{
  "query": {
    "bool": {
      "must": {
        "term": {
          "price": "30"
        }
      },
      "should": [
        {
          "bool": {
            "must_not": {
              "term": {
                "avaliable": "false"
              }
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

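The structure of a query affects relevance scoring

Competing clauses at the same level carry the same weight.
Nesting bool queries changes how much each clause contributes to the score.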

POST /animals/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "brown" }},
        { "term": { "text": "red" }},
        { "term": { "text": "quick"   }},
        { "term": { "text": "dog"   }}
      ]
    }
  }
}

POST /animals/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }},
        { "term": { "text": "dog"   }},
        {
          "bool":{
            "should":[
               { "term": { "text": "brown" }},
               { "term": { "text": "red" }}
            ]
          }

        }
      ]
    }
  }
}

Controlling field boosting

DELETE blogs
POST /blogs/_bulk
{ "index": { "_id": 1 }}
{"title":"Apple iPad", "content":"Apple iPad,Apple iPad" }
{ "index": { "_id": 2 }}
{"title":"Apple iPad,Apple iPad", "content":"Apple iPad" }


# The boost value for content is missing in the source; 1 (the default) is assumed here
POST blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {
          "title": {
            "query": "apple,ipad",
            "boost": 1.1
          }
        }},

        {"match": {
          "content": {
            "query": "apple,ipad",
            "boost": 1
          }
        }}
      ]
    }
  }
}

Not Quite Not

DELETE news
POST /news/_bulk
{ "index": { "_id": 1 }}
{ "content":"Apple Mac" }
{ "index": { "_id": 2 }}
{ "content":"Apple iPad" }
{ "index": { "_id": 3 }}
{ "content":"Apple employee like Apple Pie and Apple Juice" }


POST news/_search
{
  "query": {
    "bool": {
      "must": {
        "match":{"content":"apple"}
      }
    }
  }
}

Boosting Query

POST news/_search
{
  "query": {
    "bool": {
      "must": {
        "match":{"content":"apple"}
      },
      "must_not": {
        "match":{"content":"pie"}
      }
    }
  }
}
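
The must_not query above removes the Apple Pie document entirely. A boosting query (see the reference link below) keeps such documents but lowers their score: it takes a positive query, a negative query, and a negative_boost between 0 and 1, and multiplies the score of documents that also match the negative query by negative_boost. The Python sketch below models only that rule. The term-count "score" is a deliberately crude stand-in for real relevance scoring, and all helper names are invented for illustration.

```python
news = {
    1: "Apple Mac",
    2: "Apple iPad",
    3: "Apple employee like Apple Pie and Apple Juice",
}


def toy_score(content, term):
    """Crude stand-in for a relevance score: how often the term occurs."""
    return float(content.lower().split().count(term))


def boosting(docs, positive, negative, negative_boost):
    """Score by the positive term, then demote documents that contain the negative term."""
    results = []
    for doc_id, content in docs.items():
        score = toy_score(content, positive)
        if score == 0:
            continue  # the positive query must match
        if toy_score(content, negative) > 0:
            score *= negative_boost
        results.append((doc_id, score))
    return sorted(results, key=lambda hit: (-hit[1], hit[0]))


ranked = boosting(news, positive="apple", negative="pie", negative_boost=0.25)
print(ranked)  # [(1, 1.0), (2, 1.0), (3, 0.75)]
```

Document 3 stays in the result list but drops to the bottom, which is the "not quite not" behaviour.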

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-boosting-query.html

Single-string, multi-field queries

Single-string queries

Dis Max Query

A single-string query example

PUT /blogs/_doc/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}

PUT /blogs/_doc/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}

POST /blogs/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

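How the score is computed

Run both queries in the should clause
Add the two scores together
Multiply by the number of matching clauses
Divide by the total number of clauses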
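
The four steps can be written out as a small Python function. It follows the steps exactly as listed and is a model of that description, not of the engine's code; the per-field scores are made-up numbers chosen so the arithmetic is easy to follow.

```python
def bool_should_score(clause_scores):
    """clause_scores holds one entry per should clause; None means the clause did not match."""
    matched = [score for score in clause_scores if score is not None]
    if not matched:
        return 0.0
    total = sum(matched)                      # add the scores of the matching clauses
    return total * len(matched) / len(clause_scores)


# Hypothetical per-field scores [title, body] for the query "Brown fox".
doc1 = [0.5, 0.5]    # "brown" appears in both title and body
doc2 = [None, 1.0]   # "brown fox" appears only in body, but as a much better match

print(bool_should_score(doc1))  # 1.0
print(bool_should_score(doc2))  # 0.5
```

Under these assumed numbers, document 1 outranks document 2 even though document 2 contains the best single-field match.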

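Disjunction Max Query

In the example above, title and body compete with each other. Their scores should not simply be added together; instead, the score of the single best-matching field should be used.

A Disjunction Max Query returns any document that matches any of its queries, and uses the score of the best-matching field as the final score.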

POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ]
        }
    }
}

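Tuning with the tie_breaker parameter

In some cases a document that matches both title and body is more relevant than one that matches only a single field, yet a dis_max query simply uses the _score of the single best-matching clause as the overall score. The tie_breaker parameter addresses this: the final score is the best clause's score plus tie_breaker times the scores of the other matching clauses. A value of 0 uses only the best match; a value of 1 treats all clauses equally.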

POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.2
        }
    }
}
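
The way dis_max combines clause scores, with and without tie_breaker, can be sketched in Python as below. The function name and the per-field scores are assumptions made for illustration; only the combination rule (best score plus tie_breaker times the other matching scores) is taken from the dis_max documentation.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best matching clause wins; the others contribute only through tie_breaker."""
    matched = [score for score in clause_scores if score is not None]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)


# Hypothetical per-field scores [title, body]; None means the field did not match.
doc1 = [0.5, 0.5]
doc2 = [None, 1.0]

print(dis_max_score(doc1), dis_max_score(doc2))  # 0.5 1.0
print(dis_max_score(doc1, tie_breaker=0.2))      # about 0.6: the second field now counts a little
```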

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-dis-max-query.html

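Multi-Match

Best fields (best_fields): for fields that compete with each other yet are related, such as title and body. The score comes from the best-matching field.
Most fields (most_fields): a common technique for English content is to stem terms and add synonyms in the main field (English analyzer) so that more documents match, and to index the same text in a subfield (standard analyzer) for more precise matching. The other fields act as signals that raise the relevance of matching documents; the more fields match, the better.
Cross fields (cross_fields): for some entities, such as person names, addresses, or book information, the information has to be identified across multiple fields, and each field is only part of the whole. The goal is to find as many of the query terms as possible in any of the listed fields.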

Multi Match Query

POST blogs/_search
{
  "query": {
    "multi_match": {
      "type": "best_fields",
      "query": "Quick pets",
      "fields": ["title","body"],
      "tie_breaker": 0.2,
      "minimum_should_match": "20%"
    }
  }
}
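
A best_fields multi_match generates one match query per field and wraps them in a dis_max query. The Python sketch below builds that equivalent request body as a plain dictionary; the function name is made up, and minimum_should_match is left out to keep the example short.

```python
import json


def best_fields_as_dis_max(query, fields, tie_breaker=0.0):
    """Build the dis_max body that corresponds to a best_fields multi_match."""
    return {
        "dis_max": {
            "queries": [{"match": {field: query}} for field in fields],
            "tie_breaker": tie_breaker,
        }
    }


body = best_fields_as_dis_max("Quick pets", ["title", "body"], tie_breaker=0.2)
print(json.dumps({"query": body}, indent=2))
```

The printed body has the same shape as the dis_max request with tie_breaker shown in the previous section.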

A multi-field matching example

The English analyzer reduces precision: tense information is lost

PUT /titles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }


GET titles/_search
{
  "query": {
    "match": {
      "title": "barking dogs"
    }
  }
}

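Use the broad-matching field title to include as many documents as possible, which improves recall, while using the title.std field as a signal that pushes the more relevant documents to the top of the results.

The contribution of each field to the final score can be controlled with a custom boost value. For example, boosting the title field (written as title^10 in the fields list) makes it more important and at the same time reduces the effect of the other signal fields.

Solving it with multi-field matching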

DELETE /titles
PUT /titles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {"std": {"type": "text","analyzer": "standard"}}
      }
    }
  }
}

POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }

GET /titles/_search
{
   "query": {
        "multi_match": {
            "query":  "barking dogs",
            "type":   "most_fields",
            "fields": [ "title", "title.std" ]
        }
    }
}
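
Why the extra title.std field helps can be shown with a toy model in Python. Everything here is a simplification made for illustration: `toy_stem` is a crude stand-in for the English analyzer's stemmer, and counting overlapping terms replaces real relevance scoring. The point is only that the stemmed field matches both titles equally, while the unstemmed field adds a signal for the exact wording.

```python
def toy_stem(word):
    """Very rough stand-in for an English stemmer: strips a trailing 'ing' or 's'."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def tokens(text, stem=False):
    """Lowercase, split on whitespace, and optionally stem each word."""
    words = text.lower().split()
    return {toy_stem(word) for word in words} if stem else set(words)


def most_fields_score(title, query):
    """Sum the term overlap on the stemmed field and on the exact (standard) field."""
    stemmed = len(tokens(title, stem=True) & tokens(query, stem=True))
    exact = len(tokens(title) & tokens(query))
    return stemmed + exact


titles = {
    1: "My dog barks",
    2: "I see a lot of barking dogs on the road ",
}
scores = {doc_id: most_fields_score(title, "barking dogs") for doc_id, title in titles.items()}
print(scores)  # {1: 2, 2: 4}
```

On the stemmed field alone both titles overlap with the query on two terms; adding the exact field lifts document 2, which contains the literal words "barking dogs".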

Cross-field search

cross_fields supports the operator parameter. Compared with copy_to, one advantage is that individual fields can be boosted at search time.

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/query-dsl-dis-max-query.html

Posted 2020-01-27 18:34 by 深页