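Elasticsearch Fundamentals

Section 1: Elasticsearch Overview

Elasticsearch is a search server built on Lucene. It provides a distributed, multi-tenant full-text search engine with a RESTful web interface. Elasticsearch is written in Java, is designed for use in cloud computing environments, and delivers near-real-time search.
In short: Elasticsearch is a highly scalable, highly available full-text search tool for real-time data analysis, built on RESTful web conventions.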

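1.1 Basic Concepts of Elasticsearch

Index
Comparable to a database in MySQL.
Type
Comparable to a table in MySQL. In ES you can create a type (table) inside an index and describe its fields with a mapping.
Document
ES stores data as documents: one record is one document, comparable to a row in MySQL. A document can have multiple fields, just as a MySQL row can have multiple columns.
Field
The fields of an ES document correspond to the columns of a MySQL row.
Mapping
Comparable to the schema in MySQL or Solr. ES mappings additionally support dynamic type detection, which looks powerful but is not recommended in production; it is better to define the schema up front.
Indexed
Whether an index is built for a field. In MySQL you usually add indexes to frequently queried columns to speed up queries. In ES every field is indexed by default, unless you explicitly mark a field as not indexed so that it is only stored for display. Which to choose depends on the requirements and the business case.
Query DSL
Comparable to SQL statements in MySQL, except that ES queries are written in JSON. The technical term is Query DSL.
GET/PUT/POST/DELETE
Roughly comparable to SELECT, UPDATE, INSERT and DELETE in MySQL, respectively.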

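1.2 RESTful API

REST is a software architecture and design style, not a standard. It is mainly used for software in which clients interact with servers. Software designed in this style can be simpler, more clearly layered, and easier to equip with mechanisms such as caching.

It uses the standard HTTP methods GET, POST, DELETE and PUT to retrieve, create, modify and delete resources; in other words, HTTP verbs drive the state transfer of resources.

GET: retrieves a resource
POST: creates a resource (can also be used to update one)
PUT: updates a resource
DELETE: deletes a resource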
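
As a minimal sketch of how the four verbs map onto requests against one resource, the following builds (but never sends) HTTP requests with Python's standard library. The host, index name and document body are assumptions chosen for illustration, and build_request is a hypothetical helper, not part of any Elasticsearch client.

```python
import json
import urllib.request

def build_request(method, url, body=None):
    """Build (but do not send) an HTTP request for the given REST verb."""
    data = None
    headers = {}
    if body is not None:
        # A JSON body needs an explicit content type.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)

# Hypothetical resource URL; nothing is sent over the network here.
base = "http://localhost:9200/lib/_doc/1"
create = build_request("PUT", base, {"first_name": "Jane"})
fetch = build_request("GET", base)
remove = build_request("DELETE", base)
print(create.get_method(), fetch.get_method(), remove.get_method())
```

The same URL identifies the resource in every case; only the verb changes what happens to it.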

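1.3 The curl Command

curl issues HTTP requests (GET/POST/DELETE/PUT) from the command line.
Examples:

Fetch a web page

curl www.baidu.com

Save the page content to a file

curl -o tt.html www.baidu.com

Show the response headers

curl -i www.baidu.com

Show the full exchange of an HTTP request

curl -v www.baidu.com

Run GET/POST/DELETE/PUT and other operations with curl

curl -X GET/POST/DELETE/PUT www.baidu.com

curl command help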

[root@localhost tmp]# curl --help
Usage: curl [options...] <url>
Options: (H) means HTTP/HTTPS only, (F) means FTP only
     --anyauth       Pick "any" authentication method (H)
 -a, --append        Append to target file when uploading (F/SFTP)
     --basic         Use HTTP Basic Authentication (H)
     --cacert FILE   CA certificate to verify peer against (SSL)
     --capath DIR    CA directory to verify peer against (SSL)
 -E, --cert CERT[:PASSWD] Client certificate file and password (SSL)
     --cert-type TYPE Certificate file type (DER/PEM/ENG) (SSL)
     --ciphers LIST  SSL ciphers to use (SSL)
     --compressed    Request compressed response (using deflate or gzip)
 -K, --config FILE   Specify which config file to read
     --connect-timeout SECONDS  Maximum time allowed for connection
 -C, --continue-at OFFSET  Resumed transfer offset
 -b, --cookie STRING/FILE  String or file to read cookies from (H)
 -c, --cookie-jar FILE  Write cookies to this file after operation (H)
     --create-dirs   Create necessary local directory hierarchy
     --crlf          Convert LF to CRLF in upload
     --crlfile FILE  Get a CRL list in PEM format from the given file
 -d, --data DATA     HTTP POST data (H)
     --data-ascii DATA  HTTP POST ASCII data (H)
     --data-binary DATA  HTTP POST binary data (H)
     --data-urlencode DATA  HTTP POST data url encoded (H)
     --delegation STRING GSS-API delegation permission
     --digest        Use HTTP Digest Authentication (H)
     --disable-eprt  Inhibit using EPRT or LPRT (F)
     --disable-epsv  Inhibit using EPSV (F)
 -D, --dump-header FILE  Write the headers to this file
     --egd-file FILE  EGD socket path for random data (SSL)
     --engine ENGINGE  Crypto engine (SSL). "--engine list" for list
 -f, --fail          Fail silently (no output at all) on HTTP errors (H)
 -F, --form CONTENT  Specify HTTP multipart POST data (H)
     --form-string STRING  Specify HTTP multipart POST data (H)
     --ftp-account DATA  Account data string (F)
     --ftp-alternative-to-user COMMAND  String to replace "USER [name]" (F)
     --ftp-create-dirs  Create the remote dirs if not present (F)
     --ftp-method [MULTICWD/NOCWD/SINGLECWD] Control CWD usage (F)
     --ftp-pasv      Use PASV/EPSV instead of PORT (F)
 -P, --ftp-port ADR  Use PORT with given address instead of PASV (F)
     --ftp-skip-pasv-ip Skip the IP address for PASV (F)
     --ftp-pret      Send PRET before PASV (for drftpd) (F)
     --ftp-ssl-ccc   Send CCC after authenticating (F)
     --ftp-ssl-ccc-mode ACTIVE/PASSIVE  Set CCC mode (F)
     --ftp-ssl-control Require SSL/TLS for ftp login, clear for transfer (F)
 -G, --get           Send the -d data with a HTTP GET (H)
 -g, --globoff       Disable URL sequences and ranges using {} and []
 -H, --header LINE   Custom header to pass to server (H)
 -I, --head          Show document info only
 -h, --help          This help text
     --hostpubmd5 MD5  Hex encoded MD5 string of the host public key. (SSH)
 -0, --http1.0       Use HTTP 1.0 (H)
     --ignore-content-length  Ignore the HTTP Content-Length header
 -i, --include       Include protocol headers in the output (H/F)
 -k, --insecure      Allow connections to SSL sites without certs (H)
     --interface INTERFACE  Specify network interface/address to use
 -4, --ipv4          Resolve name to IPv4 address
 -6, --ipv6          Resolve name to IPv6 address
 -j, --junk-session-cookies Ignore session cookies read from file (H)
     --keepalive-time SECONDS  Interval between keepalive probes
     --key KEY       Private key file name (SSL/SSH)
     --key-type TYPE Private key file type (DER/PEM/ENG) (SSL)
     --krb LEVEL     Enable Kerberos with specified security level (F)
     --libcurl FILE  Dump libcurl equivalent code of this command line
     --limit-rate RATE  Limit transfer speed to this rate
 -l, --list-only     List only names of an FTP directory (F)
     --local-port RANGE  Force use of these local port numbers
 -L, --location      Follow redirects (H)
     --location-trusted like --location and send auth to other hosts (H)
 -M, --manual        Display the full manual
     --mail-from FROM  Mail from this address
     --mail-rcpt TO  Mail to this receiver(s)
     --mail-auth AUTH  Originator address of the original email
     --max-filesize BYTES  Maximum file size to download (H/F)
     --max-redirs NUM  Maximum number of redirects allowed (H)
 -m, --max-time SECONDS  Maximum time allowed for the transfer
     --metalink      Process given URLs as metalink XML file
     --negotiate     Use HTTP Negotiate Authentication (H)
 -n, --netrc         Must read .netrc for user name and password
     --netrc-optional Use either .netrc or URL; overrides -n
     --netrc-file FILE  Set up the netrc filename to use
 -N, --no-buffer     Disable buffering of the output stream
     --no-keepalive  Disable keepalive use on the connection
     --no-sessionid  Disable SSL session-ID reusing (SSL)
     --noproxy       List of hosts which do not use proxy
     --ntlm          Use HTTP NTLM authentication (H)
 -o, --output FILE   Write output to <file> instead of stdout
     --pass PASS     Pass phrase for the private key (SSL/SSH)
     --post301       Do not switch to GET after following a 301 redirect (H)
     --post302       Do not switch to GET after following a 302 redirect (H)
     --post303       Do not switch to GET after following a 303 redirect (H)
 -#, --progress-bar  Display transfer progress as a progress bar
     --proto PROTOCOLS  Enable/disable specified protocols
     --proto-redir PROTOCOLS  Enable/disable specified protocols on redirect
 -x, --proxy [PROTOCOL://]HOST[:PORT] Use proxy on given port
     --proxy-anyauth Pick "any" proxy authentication method (H)
     --proxy-basic   Use Basic authentication on the proxy (H)
     --proxy-digest  Use Digest authentication on the proxy (H)
     --proxy-negotiate Use Negotiate authentication on the proxy (H)
     --proxy-ntlm    Use NTLM authentication on the proxy (H)
 -U, --proxy-user USER[:PASSWORD]  Proxy user and password
     --proxy1.0 HOST[:PORT]  Use HTTP/1.0 proxy on given port
 -p, --proxytunnel   Operate through a HTTP proxy tunnel (using CONNECT)
     --pubkey KEY    Public key file name (SSH)
 -Q, --quote CMD     Send command(s) to server before transfer (F/SFTP)
     --random-file FILE  File for reading random data from (SSL)
 -r, --range RANGE   Retrieve only the bytes within a range
     --raw           Do HTTP "raw", without any transfer decoding (H)
 -e, --referer       Referer URL (H)
 -J, --remote-header-name Use the header-provided filename (H)
 -O, --remote-name   Write output to a file named as the remote file
     --remote-name-all Use the remote file name for all URLs
 -R, --remote-time   Set the remote file's time on the local output
 -X, --request COMMAND  Specify request command to use
     --resolve HOST:PORT:ADDRESS  Force resolve of HOST:PORT to ADDRESS
     --retry NUM   Retry request NUM times if transient problems occur
     --retry-delay SECONDS When retrying, wait this many seconds between each
     --retry-max-time SECONDS  Retry only within this period
 -S, --show-error    Show error. With -s, make curl show errors when they occur
 -s, --silent        Silent mode. Don't output anything
     --socks4 HOST[:PORT]  SOCKS4 proxy on given host + port
     --socks4a HOST[:PORT]  SOCKS4a proxy on given host + port
     --socks5 HOST[:PORT]  SOCKS5 proxy on given host + port
     --socks5-basic  Enable username/password auth for SOCKS5 proxies
     --socks5-gssapi Enable GSS-API auth for SOCKS5 proxies
     --socks5-hostname HOST[:PORT] SOCKS5 proxy, pass host name to proxy
     --socks5-gssapi-service NAME  SOCKS5 proxy service name for gssapi
     --socks5-gssapi-nec  Compatibility with NEC SOCKS5 server
 -Y, --speed-limit RATE  Stop transfers below speed-limit for 'speed-time' secs
 -y, --speed-time SECONDS  Time for trig speed-limit abort. Defaults to 30
     --ssl           Try SSL/TLS (FTP, IMAP, POP3, SMTP)
     --ssl-reqd      Require SSL/TLS (FTP, IMAP, POP3, SMTP)
 -2, --sslv2         Use SSLv2 (SSL)
 -3, --sslv3         Use SSLv3 (SSL)
     --ssl-allow-beast Allow security flaw to improve interop (SSL)
     --stderr FILE   Where to redirect stderr. - means stdout
     --tcp-nodelay   Use the TCP_NODELAY option
 -t, --telnet-option OPT=VAL  Set telnet option
     --tftp-blksize VALUE  Set TFTP BLKSIZE option (must be >512)
 -z, --time-cond TIME  Transfer based on a time condition
 -1, --tlsv1         Use => TLSv1 (SSL)
     --tlsv1.0       Use TLSv1.0 (SSL)
     --tlsv1.1       Use TLSv1.1 (SSL)
     --tlsv1.2       Use TLSv1.2 (SSL)
     --trace FILE    Write a debug trace to the given file
     --trace-ascii FILE  Like --trace but without the hex output
     --trace-time    Add time stamps to trace/verbose output
     --tr-encoding   Request compressed transfer encoding (H)
 -T, --upload-file FILE  Transfer FILE to destination
     --url URL       URL to work with
 -B, --use-ascii     Use ASCII/text transfer
 -u, --user USER[:PASSWORD]  Server user and password
     --tlsuser USER  TLS username
     --tlspassword STRING TLS password
     --tlsauthtype STRING  TLS authentication type (default SRP)
     --unix-socket FILE    Connect through this UNIX domain socket
 -A, --user-agent STRING  User-Agent to send to server (H)
 -v, --verbose       Make the operation more talkative
 -V, --version       Show version number and quit
 -w, --write-out FORMAT  What to output after completion
     --xattr        Store metadata in extended file attributes
 -q                 If used as the first parameter disables .curlrc

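Section 2: Basic Elasticsearch Operations

2.1 Inverted Index

Elasticsearch uses a structure called an inverted index, which is well suited to fast full-text search. An inverted index consists of a list of all the unique terms that appear in the documents and, for each term, a list of the documents that contain it.
Example:
(1) Suppose the document collection contains five documents, with the content shown in the figure below. The leftmost column of the figure gives each document's ID. The task is to build an inverted index for this collection.

(2) Unlike English, Chinese has no explicit separators between words, so a word segmentation system must first split each document into a sequence of words. Each document thus becomes a stream of words. To simplify later processing, every distinct word is given a unique word ID, and the system records which documents contain it. The result is the simplest form of inverted index.

The "Word ID" column holds each word's ID, the second column holds the word itself, and the third column holds the word's posting list.

(3) The index can record more than this. The figure below also records term frequency (TF), the number of times a word occurs in a given document. It is recorded because term frequency is an important factor when computing the similarity between a query and a document during ranking, so it is stored in the posting list to make later scoring easier.

(4) A posting list can also record the positions at which a word occurs in a document, for example: (1,<11>,1),(2,<7>,1),(3,<3,9>,2). With this index the search engine can answer user queries easily. If a user enters the query term "Facebook", the search system looks it up in the inverted index and reads out the documents that contain it; these documents are the search results. Using term frequency and document frequency, the system can then rank the candidates: it computes the similarity between each document and the query and returns the results from highest to lowest score. This is part of the internal workflow of a search system.

Normalization: when the inverted index is built, each extracted term is processed according to normalization rules, so that later searches are more likely to find related documents.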
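
The steps above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not how Lucene stores its index: whitespace splitting and lowercasing stand in for a real word segmenter, ranking uses raw term frequency only, and the three sample documents are made up for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}; term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Lowercasing is the normalization step; split() is the tokenizer.
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(position)
    return index

def search(index, term):
    """Return (doc_id, tf) pairs for a term, highest term frequency first."""
    postings = index.get(term.lower(), {})
    return sorted(((d, len(p)) for d, p in postings.items()),
                  key=lambda pair: (-pair[1], pair[0]))

# Made-up sample documents.
docs = {
    1: "Google Maps founder leaves Google to join Facebook",
    2: "Facebook hires the founder of Google Maps",
    3: "Facebook Facebook everywhere",
}
index = build_inverted_index(docs)
print(search(index, "Facebook"))  # [(3, 2), (1, 1), (2, 1)]
```

Each posting keeps the positions, so both the term frequency and the position information described in steps (3) and (4) come from the same structure.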

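2.2 Analyzers

Analyzer: splits a piece of text into individual terms and normalizes each term.
It has three parts:

character filter: preprocessing before tokenization, such as stripping HTML tags and converting special characters
tokenizer: splits the text into tokens
token filter: normalizes the tokens

Built-in analyzers:

standard analyzer: the default. It lowercases tokens and removes punctuation (stop-word removal is supported but disabled by default in recent versions). Chinese text is split into single characters.
simple analyzer: first splits the text on non-letter characters, then lowercases the tokens. It discards numeric characters.
whitespace analyzer: splits on whitespace only. It does not lowercase, does not support Chinese, and applies no other normalization to the tokens.
language analyzers: analyzers for specific languages. Chinese is not supported.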
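
To make the three stages concrete, here is a small Python sketch of an analyzer pipeline. It only imitates the idea: the regular expressions, the '&' mapping and the stop-word list are assumptions for illustration and do not reproduce any built-in Elasticsearch analyzer.

```python
import re

def char_filter(text):
    """Character filter: strip HTML tags and map '&' to 'and'."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")

def tokenizer(text):
    """Tokenizer: split into runs of ASCII letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

STOP_WORDS = {"a", "an", "the", "and"}  # illustrative list, not Elasticsearch's

def token_filter(tokens):
    """Token filter: lowercase each token and drop stop words."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOP_WORDS]

def analyze(text):
    """Run the three stages in order: char filter, tokenizer, token filter."""
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<p>The Quick & Brown Foxes</p>"))  # ['quick', 'brown', 'foxes']
```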

Configuring a Chinese analyzer

Download elasticsearch-analysis-ik-master.zip
wget -O elasticsearch-analysis-ik-master.zip https://github.com/medcl/elasticsearch-analysis-ik/archive/master.zip
Unzip elasticsearch-analysis-ik-master.zip
unzip elasticsearch-analysis-ik-master.zip
Enter the extracted directory and build from source
cd elasticsearch-analysis-ik-master
mvn clean install -Dmaven.test.skip=true   # Maven must already be installed and configured
Move the zip file produced by the build into the ES plugins directory, then unzip and rename it

cd target/release/
cp elasticsearch-analysis-ik.zip /usr/local/elasticsearch/plugins
cd /usr/local/elasticsearch/plugins
unzip elasticsearch-analysis-ik.zip
mv elasticsearch-analysis-ik ik
# Restart elasticsearch and check the list of loaded plugins

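2.3 CRUD with the Elasticsearch API

Open http://ip/kibana in a browser. The left-hand navigation has a Dev Tools entry; open it to see the help information and so on.
Requests are entered in the left pane and results appear in the right pane.

Creating an index
This is comparable to creating a new database.
Request, using custom settings: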

PUT /lib/
{
  "settings": {
    "index":{
      "number_of_shards":3,
      "number_of_replicas":0
    }
  }
}

Result shown in the right pane:

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "lib"
}

Creating an index with the default settings:

PUT /lib2/

The result is:

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "lib2"
}

Viewing an index's settings

GET /lib/_settings

{
  "lib" : {
    "settings" : {
      "index" : {
        "creation_date" : "1566617147934",
        "number_of_shards" : "3",
        "number_of_replicas" : "0",
        "uuid" : "6D92TpNWSk-j-gD-nDoxdw",
        "version" : {
          "created" : "7030099"
        },
        "provided_name" : "lib"
      }
    }
  }
}

GET /lib2/_settings

{
  "lib2" : {
    "settings" : {
      "index" : {
        "creation_date" : "1566617313082",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "jw8Lh0n7QM-1lhT6xCqhKw",
        "version" : {
          "created" : "7030099"
        },
        "provided_name" : "lib2"
      }
    }
  }
}

Viewing the settings of all indices

GET /_all/_settings

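Adding a document
This is comparable to creating a table and inserting one row.
To specify the document ID, use PUT. In this example the ID is 1.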

PUT /lib/user/1
{
  "first_name":"Jane",
  "last_name":"Simth",
  "age":32,
  "about":"I like to collect rock albums",
  "interests":["music","video"]
}

#! Deprecation: [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
{
  "_index" : "lib",
  "_type" : "user",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Without specifying an ID, use POST; the ID is generated automatically

POST /lib/user/
{
  "first_name":"Jane",
  "last_name":"Simth",
  "age":32,
  "about":"I like to collect rock albums",
  "interests":["music","video"]
}

#! Deprecation: [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
{
  "_index" : "lib",
  "_type" : "user",
  "_id" : "PRy5wWwBfIGT97PTaZOi",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Viewing a document

GET /lib/user/1

#! Deprecation: [types removal] Specifying types in document get requests is deprecated, use the /{index}/_doc/{id} endpoint instead.
{
  "_index" : "lib",
  "_type" : "user",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "first_name" : "Jane",
    "last_name" : "Simth",
    "age" : 32,
    "about" : "I like to collect rock albums",
    "interests" : [
      "music",
      "video"
    ]
  }
}

Viewing only some fields of a document

GET /lib/user/1?_source=age,about

#! Deprecation: [types removal] Specifying types in document get requests is deprecated, use the /{index}/_doc/{id} endpoint instead.
{
  "_index" : "lib",
  "_type" : "user",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "about" : "I like to collect rock albums",
    "age" : 32
  }
}

Updating a document
Updating with PUT replaces the whole document, so every field must be sent again

PUT /lib/user/1
{
  "first_name":"Jane",
  "last_name":"Simth",
  "age":36,
  "about":"I like to collect rock albums",
  "interests":["music"]
}

Updating with POST: fields that exist are updated, and fields that do not exist are added

POST /lib/user/1/_update
{
  "doc":{
    "age":1111,
    "aa":2222
  }
}

Deleting
Deleting a document

DELETE /lib/user/1

Deleting an index

DELETE /lib

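2.4 Fetching Documents in Batches

The Multi Get API provided by ES returns a set of documents in one request, given the index name, type name and document ID of each. The documents can come from the same index or from different indices.

Using curl: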

curl -XGET "http://192.168.10.102:9200/_mget" -H 'Content-Type: application/json' -d'{  "docs": [    {      "_index": "lib",      "_type": "user",      "_id": 1    },    {      "_index": "lib",      "_type": "user",      "_id": 2    },    {      "_index": "lib",      "_type": "user",      "_id": 3    }  ]}'

Using the client tool provided by Kibana (Dev Tools)
First add three documents

GET /_mget
{
  "docs": [
    {
      "_index": "lib",
      "_type": "user",
      "_id": 1
    },
    {
      "_index": "lib",
      "_type": "user",
      "_id": 2
    },
    {
      "_index": "lib",
      "_type": "user",
      "_id": 3
    }
  ]
}

Specific fields can be requested:

GET /_mget
{
  "docs": [
    {
      "_index": "lib",
      "_type": "user",
      "_id": 1,
      "_source":["age","about"]
    },
    {
      "_index": "lib",
      "_type": "user",
      "_id": 2
    },
    {
      "_index": "lib",
      "_type": "user",
      "_id": 3,
      "_source":"interests"
    }
  ]
}

To fetch different documents from the same index and type, the request can be shortened as follows:

GET /lib/user/_mget
{
  "docs": [
    {
      "_id": 1
    },
    {
      "_type": "user",
      "_id": 2
    }
  ]
}

GET /lib/user/_mget
{
  "ids":["1","2"]
}

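2.5 Batch Operations with Bulk

Bulk format:
{action:{metadata}}\n
{requestbody}\n

action (the operation):

create: creates the document only if it does not already exist
update: updates a document
index: creates a new document or replaces an existing one
delete: deletes a document

metadata: _index, _type, _id

Difference between create and index: if the document already exists, create fails with a message saying the document already exists, whereas index succeeds.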
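
The two-line format can be generated mechanically. The sketch below serializes a list of operations into a bulk request body; bulk_body and its (action, metadata, source) tuple layout are hypothetical helpers invented for this example, and the documents are sample data.

```python
import json

def bulk_body(actions):
    """Serialize (action, metadata, source) triples into an NDJSON bulk body."""
    lines = []
    for action, metadata, source in actions:
        # First line of each pair: the action and its metadata.
        lines.append(json.dumps({action: metadata}))
        if action != "delete":
            # Second line: the request body. delete has none.
            lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"

body = bulk_body([
    ("index", {"_id": "1"}, {"title": "java", "price": 55}),
    ("update", {"_id": "1"}, {"doc": {"price": 60}}),
    ("delete", {"_id": "3"}, None),
])
print(body, end="")
```

Each JSON object must stay on a single line, which is why json.dumps is called without indentation.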

Example:

{"delete":{"_index":"lib","_type":"user","_id":"1"}}

Adding in bulk:
"errors" : false in the right-hand output pane means the documents were added successfully

POST /lib2/books/_bulk
{"index":{"_id":1}}
{"title":"java","price":55}
{"index":{"_id":2}}
{"title":"HTML5","price":35}
{"index":{"_id":3}}
{"title":"python","price":100}

Fetching in bulk:

GET /lib2/books/_mget 
{
    "ids":["1","2","3"]
}

Deleting: there is no request body line

POST /lib2/books/_bulk
{"delete":{"_index":"lib2","_type":"books","_id":3}}

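How much data can one bulk request handle?
A bulk request loads all of the data to be processed into memory, so the amount is limited. The optimal size is not a fixed number: it depends on your hardware, the size and complexity of your documents, and your indexing and search load.
A common recommendation is 1,000 to 5,000 documents or 5 to 15 MB per request. By default a request cannot exceed 100 MB; this limit can be changed in the ES configuration file (the http.max_content_length setting).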
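
One way to stay inside these limits is to split the documents into batches before sending them. The following is a rough sketch, assuming the batch is limited by both document count and serialized size; chunk_documents and its default limits are illustrative choices, and the size estimate ignores the action lines.

```python
import json

def chunk_documents(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield batches of at most max_docs documents and roughly max_bytes of JSON.

    A single document larger than max_bytes is still yielded, alone in its batch.
    """
    batch, batch_bytes = [], 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8")) + 1  # +1 for the newline
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch

# Made-up sample data.
docs = [{"title": f"book-{i}", "price": i} for i in range(2500)]
batches = list(chunk_documents(docs, max_docs=1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

The right limits are workload-specific, so treat the defaults here as a starting point to measure against rather than recommended values.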

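2.6 Version Control

Elasticsearch uses optimistic locking to keep data consistent. When a user operates on a document there is no need to lock and unlock it; the user only specifies the version to operate on. If the version numbers match, Elasticsearch lets the operation proceed; if they conflict, Elasticsearch reports the conflict and throws an exception.

Version numbers range from 1 to 2^63-1.

Internal version control uses _version.
External version control: Elasticsearch handles external version numbers somewhat differently from internal ones. Instead of checking whether _version equals the value given in the request, it checks whether the current _version is less than the given value. If the request succeeds, the external version number is stored as the document's _version.

To keep _version consistent with the externally controlled version, use version_type=external.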

GET /lib/user/2

# Change the value after version= and observe the result
PUT /lib/user/1?version=1&version_type=external
{
  "age" : 44
}
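
The difference between the two checks can be modelled in memory. This Python sketch is a toy model of the rules described above, not Elasticsearch code: Store, put and VersionConflict are hypothetical names, and only the version comparison is simulated.

```python
class VersionConflict(Exception):
    """Raised when the supplied version fails the optimistic-locking check."""

class Store:
    """In-memory stand-in for one index: id -> (version, source)."""

    def __init__(self):
        self.docs = {}

    def put(self, doc_id, source, version=None, external=False):
        current, _ = self.docs.get(doc_id, (0, None))
        if external:
            if version is None:
                raise ValueError("external versioning requires a version")
            # External: the stored version must be less than the supplied one.
            if version <= current:
                raise VersionConflict(f"current {current} >= provided {version}")
            new_version = version
        else:
            # Internal: if a version is given it must equal the stored one.
            if version is not None and version != current:
                raise VersionConflict(f"current {current} != provided {version}")
            new_version = current + 1
        self.docs[doc_id] = (new_version, source)
        return new_version

store = Store()
print(store.put("1", {"age": 32}))                            # 1
print(store.put("1", {"age": 36}, version=1))                 # 2
print(store.put("1", {"age": 44}, version=5, external=True))  # 5
try:
    store.put("1", {"age": 45}, version=5, external=True)     # 5 is not > 5
except VersionConflict as exc:
    print("conflict:", exc)
```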

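2.7 Mapping

ES automatically creates the index, the type, and the type's mapping (dynamic mapping).
The mapping defines the data type of each field in the type, along with related attributes such as how the field is analyzed.

When creating an index you can define field types and attributes in advance, so that date fields are handled as dates, numeric fields as numbers, string fields as strings, and so on.

Supported data types:

(1) Core data types

String: string, comprising text and keyword
text is used to index long text. Before indexing, the text is analyzed into a set of terms, which are then indexed so that ES can search for them. text fields cannot be used for sorting or aggregation.
keyword is not analyzed. It can be used for filtering, sorting and aggregation, and a keyword field can only be searched by its exact value.
Numeric: long, integer, short, byte, double, float (not analyzed)
Date: date (not analyzed)
Boolean: boolean
Binary: binary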

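(2) Complex data types

Array: the element type of an array does not need to be declared separately, for example:
String array: ["one","two"]
Integer array: [1,2]
Array of arrays: [1,[2,3]], equivalent to [1,2,3]
Array of objects: [{"name":"Mary","age":12},{"name":"Tom","age":20}]
Object: object, for a single JSON object
Nested: nested, for arrays of JSON objects

(3) Geo data types

Geo-point: geo_point, for latitude/longitude coordinates
Geo-shape: geo_shape, for complex shapes such as polygons

(4) Specialized types

IP: ip, for IPv4 addresses (recent versions also accept IPv6)
Completion: completion, provides auto-complete suggestions
Token count: token_count, stores the number of tokens in an analyzed string. The count is based on positions, so it is not reduced when the analyzer filters tokens out
mapper-murmur3: with this plugin, the murmur3 type computes a hash of the field's values at index time
Attachment: with the mapper-attachments plugin, the attachment type can index formats such as Microsoft Office, Open Document, ePub and HTML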

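Supported properties (several of the examples use pre-5.x syntax such as string and not_analyzed):

"store":false // Whether the field is stored separately from _source. The default is false: the field can be searched, but its value is not stored on its own
"index":true // Whether the field is indexed. If set to false, the field is not indexed and cannot be searched
"analyzer":"ik" // The analyzer to use. The default is the standard analyzer
"boost":1.23 // Field-level score boost. The default is 1.0
"doc_values":false // Enabled by default for not_analyzed fields and not available for analyzed fields. Greatly improves sorting and aggregation performance and saves memory
"fielddata":{"format":"disabled"} // For analyzed fields; can improve performance when they take part in sorting or aggregation. For non-analyzed fields, doc_values is recommended instead
"fields":{"raw":{"type":"string","index":"not_analyzed"}} // Provides several indexing modes for one field: the same value is indexed once analyzed and once not analyzed
"ignore_above":100 // Text longer than 100 characters is ignored and not indexed
"include_in_all":true // Whether this field is included in the _all field. The default is true, unless index is set to no
"index_options":"docs" // Four options: docs (document IDs), freqs (document IDs + term frequencies), positions (document IDs + term frequencies + positions, typically used for proximity queries), offsets (document IDs + term frequencies + positions + offsets, typically used for highlighting). Analyzed fields default to positions, all others to docs
"norms":{"enable":true,"loading":"lazy"} // The default for analyzed fields; non-analyzed fields default to {"enable":false}. Stores the length factor and index-time boost. Recommended for fields that take part in scoring; uses additional memory
"null_value":"NULL" // A replacement value for missing fields. Only available for string fields; the null value of an analyzed field is analyzed as well
"position_increment_gap":0 // Affects proximity and phrase queries. Can be set on multi-valued or analyzed fields; a slop can be given at query time. The default is 100
"search_analyzer":"ik" // The analyzer used at search time; defaults to the same value as analyzer. For example, index with standard+ngram and search with standard to implement auto-suggest
"similarity":"BM25" // The scoring algorithm for the field; only valid for analyzed string fields. Older versions default to TF/IDF; since 5.0 the default is BM25
"term_vector":"no" // Term vectors are not stored by default. Options: yes (terms), with_positions (terms + positions), with_offsets (terms + offsets), with_positions_offsets (terms + positions + offsets). Speeds up the fast vector highlighter, but enlarges the index, so it is not suitable for large data volumes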

Add three documents

PUT /lib/user/1
{
  "first_name":"Jane",
  "last_name":"Simth",
  "age":36,
  "about":"I like to collect rock albums",
  "interests":["music"]
}

PUT /lib/user/2
{
  "first_name":"Jane",
  "last_name":"Simth",
  "age":36,
  "about":"I like to collect rock albums",
  "interests":["music"]
}


PUT /lib/user/3
{
  "first_name":"Jane",
  "last_name":"Simth",
  "age":36,
  "about":"I like to collect rock albums",
  "interests":["music"],
  "data":"2019-08-24"
}

View one of the documents

GET /lib/user/1

View the mapping

GET /lib/_mapping

Output:

{
  "lib" : {
    "mappings" : {
      "properties" : {
        "about" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "age" : {
          "type" : "long"
        },
        "data" : {
          "type" : "date"
        },
        "first_name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "interests" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "last_name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

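Querying documents

# Returns hits
GET /lib/_search?q=age:36
# Returns hits: text fields are analyzed by default, so the query does not need to match the whole value
GET /lib/_search?q=about:like

# Returns no hits: date fields are not analyzed, so the query must give the exact value
GET /lib/_search?q=data:2019
# Returns hits
GET /lib/_search?q=data:2019-08-24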

The object data type and creating a mapping manually

# Add a document
PUT /lib5/person/1
{
  "name":"tom",
  "age":30,
  "birthday":"1985-12-12",
  "address":{
    "country":"china",
    "province":"guangdong",
    "city":"shenzhen"
  }
}

# View the document
GET /lib5/person/1

# View the mapping
GET /lib5/_mapping

Output:

{
  "lib5" : {
    "mappings" : {
      "properties" : {
        "address" : {
          "properties" : {
            "city" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "country" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "province" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            }
          }
        },
        "age" : {
          "type" : "long"
        },
        "birthday" : {
          "type" : "date"
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

Underlying storage format

{
  "name":["tom"],
  "age":[30],
  "birthday":["1985-12-12"],
  "address.country":["china"],
  "address.province":["guangdong"],
  "address.city":["shenzhen"]
}

A more complex case

{
    "person":[
        {"name":"lisi","age":25},
        {"name":"waqngwu","age":26},
        {"name":"zhangsan","age":30}
    ]
}
# Underlying storage format
{
    "person.name":["lisi","waqngwu","zhangsan"],
    "person.age":[25,26,30]
}
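
The flattening shown above can be reproduced with a short recursive function. This is a sketch of the idea only: flatten is a hypothetical helper, and it handles nested objects and arrays of objects but not arrays nested inside arrays.

```python
def flatten(doc, prefix=""):
    """Flatten nested objects into {dotted.field: [values]} columns."""
    columns = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        # Treat a single value as a one-element array, as in the storage format.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    columns.setdefault(sub_path, []).extend(sub_values)
            else:
                columns.setdefault(path, []).append(item)
    return columns

people = {
    "person": [
        {"name": "lisi", "age": 25},
        {"name": "waqngwu", "age": 26},
        {"name": "zhangsan", "age": 30},
    ]
}
print(flatten(people))
```

After flattening, the link between each name and its age is gone, since the values sit in two independent arrays. That loss is the reason the nested type exists for arrays of objects.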

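Note: Elasticsearch 7.x no longer supports specifying a mapping type by default; the default type is _doc. To change this, set include_type_name=true on the request.
(I have not tested this; it comes from the official documentation. Whether or not it works, I recommend against it, because Elasticsearch 8 no longer provides this parameter.)

The manual mapping below runs without problems on 6.x, but on 7.x it fails with the error: Root mapping definition has unsupported parameters

Creating a mapping manually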

PUT /lib6
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 0
  },
  "mappings": {
    "books":{
      "properties":{
        "title":{"type":"text"},
        "name":{"type":"text","analyzer":"standard"},
        "publish_date":{"type":"date","index":false},
        "price":{"type":"double"},
        "number":{"type":"integer"}
      }
    }
  }
}

In Elasticsearch 7 the index should therefore be created as follows.
Compared with 6.x, one level of nesting (the type name) is gone.

PUT /lib6
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties":{
      "title":{"type":"text"},
      "name":{"type":"text"},
      "publish_date":{"type":"date"},
      "price":{"type":"double"},
      "number":{"type":"integer"}
    }
  }
}

2.8 Basic Queries (Query)

Test data

PUT /lib3/user/1
{
  "name":"zhaoliu",
  "address":"hei long jiang sheng tie ling shi",
  "age":50,
  "birthday":"1970-12-12",
  "interests":"xi huan he jiu,duan lian,lvyou"
}

PUT /lib3/user/2
{
  "name":"lisi",
  "address":"bei jing hai dian qu qing he zhen",
  "age":20,
  "birthday":"1998-12-12",
  "interests":"xi huan he jiu,duan lian,changge"
}

PUT /lib3/user/3
{
  "name":"zhaoming",
  "address":"bei jing hai dian qu qing he zhen",
  "age":23,
  "birthday":"1970-12-12",
  "interests":"xi huan he jiu,duan lian,lvyou,youyong"
}

# _score: the relevance score of a hit for the current search

# Simple queries
GET /lib3/_search?q=name:zhaoming
GET /lib3/_search?q=interests:he jiu&sort=age:desc

query_string query
The query text is analyzed first and then searched

GET /lib3/_search
{
    "query":{
        "query_string":{
            "default_field":"name",
            "query":"zhangsan"
        }
    }
}

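term and terms queries
A term query looks for the exact term in the inverted index; it is unaware of analyzers. This kind of query suits keyword, numeric and date fields.

term: finds documents whose field contains a given term
terms: finds documents whose field contains any of several terms

Controlling the number of results returned
from: the offset of the first document to return
size: the number of documents to return
Returning the version number
"version":true
match queries
A match query is aware of analyzers: it analyzes the query text for the field and then searches.

match_all: matches all documents
multi_match: searches several fields at once
match_phrase: phrase matching. ES first analyzes the query string and builds a phrase query from the analyzed text. This means all terms of the phrase must match, and their relative positions must stay the same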
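
The practical difference between term and match can be shown with a toy index. The sketch below assumes a stand-in analyzer (lowercase plus whitespace split) and made-up documents; term_query and match_query are hypothetical functions that mimic the lookup behaviour only, not scoring.

```python
def analyze(text):
    """Stand-in analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def build_index(docs, field):
    """Index the analyzed tokens of one field: token -> set of doc ids."""
    index = {}
    for doc_id, doc in docs.items():
        for token in analyze(doc[field]):
            index.setdefault(token, set()).add(doc_id)
    return index

def term_query(index, value):
    """term: look the value up as-is, without analysis."""
    return index.get(value, set())

def match_query(index, text):
    """match: analyze the text, then return docs containing any of its tokens."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits

docs = {1: {"address": "Bei Jing Hai Dian"}, 2: {"address": "Hei Long Jiang"}}
index = build_index(docs, "address")
print(term_query(index, "Bei"))        # set(): the index only holds "bei"
print(match_query(index, "Bei Jing"))  # {1}
```

Because the indexed tokens were lowercased, a term lookup for "Bei" finds nothing, while match analyzes the query the same way the field was analyzed and succeeds.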

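Specifying which fields to return
_source
Excluding certain fields
includes, excludes (given inside _source)
Sorting
Use sort: desc for descending, asc for ascending
Prefix matching
match_phrase_prefix
Range queries
range: queries a range of values
Parameters: from, to, include_lower, include_upper, boost (recent versions express the bounds with gt, gte, lt and lte instead)

include_lower: whether the lower bound is included; the default is true
include_upper: whether the upper bound is included; the default is true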
wildcard query
Allows the wildcards * and ? in a query

*: matches zero or more characters
?: matches exactly one character
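
The two wildcards have simple regular-expression equivalents. The following sketch translates a wildcard pattern into a Python regex to show the matching rules; wildcard_to_regex and wildcard_match are hypothetical helpers, and Elasticsearch itself does not evaluate wildcards this way.

```python
import re

def wildcard_to_regex(pattern):
    """Translate * (zero or more chars) and ? (exactly one char) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            # Escape everything else so it matches literally.
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def wildcard_match(pattern, term):
    """True if the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None

names = ["zhaoliu", "zhaoming", "lisi"]
print([n for n in names if wildcard_match("zhao*", n)])  # ['zhaoliu', 'zhaoming']
print([n for n in names if wildcard_match("li?i", n)])   # ['lisi']
```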

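fuzzy query (approximate matching)

value: the term to search for
boost: the weight of the query; the default is 1.0
min_similarity: the minimum similarity required for a match; the default is 0.5. For strings the value is between 0 and 1 inclusive; for numbers it can be greater than 1; for dates it takes values such as 1d (one day) or 1m
prefix_length: the length of the common prefix that matching terms must share; the default is 0
max_expansions: the maximum number of terms the query may expand to; unlimited by default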
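
Fuzzy matching is based on edit distance between terms. The sketch below implements plain Levenshtein distance and a prefix check to illustrate the idea; max_edits is a hypothetical parameter name introduced here, and transpositions are not treated specially in this simplified version.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_match(value, term, max_edits=2, prefix_length=0):
    """True if term shares the required prefix and is within max_edits of value."""
    if value[:prefix_length] != term[:prefix_length]:
        return False
    return edit_distance(value, term) <= max_edits

terms = ["zhaoliu", "zhaoming", "lisi"]
print([t for t in terms if fuzzy_match("zhaolu", t, max_edits=1)])  # ['zhaoliu']
```

A non-zero prefix_length rules out candidates early, since terms that differ in their first characters are skipped before any distance is computed.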

Highlighting search results
"highlight"
filter queries
A filter does not compute relevance and its results can be cached, so a filter is faster than a query
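
Several of the options listed in this section can be combined in one request body. The sketch below assembles such a body as a Python dict and prints it as JSON. build_search_body is a hypothetical helper, the field names come from the lib3 sample data, and the range bounds use the gte/lte form found in recent versions; the body is only built, not sent.

```python
import json

def build_search_body(text, min_age, max_age, page=0, page_size=10):
    """Assemble a search request body as a plain dict."""
    return {
        "from": page * page_size,        # offset of the first hit
        "size": page_size,               # number of hits to return
        "version": True,                 # include _version in each hit
        "_source": {"includes": ["name", "age", "interests"]},
        "query": {
            "bool": {
                # Scored, analyzed full-text clause.
                "must": [{"match": {"interests": text}}],
                # Unscored, cacheable filter clause.
                "filter": [{"range": {"age": {"gte": min_age, "lte": max_age}}}],
            }
        },
        "sort": [{"age": {"order": "desc"}}],
        "highlight": {"fields": {"interests": {}}},
    }

body = build_search_body("duan lian", 20, 30, page=1, page_size=2)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted under GET /lib3/_search in the Kibana Dev Tools console.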

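2.9 Chinese Queries

The IK Chinese analyzer installed earlier provides two analyzers: ik_smart and ik_max_word.
ik_smart performs the coarsest segmentation, while ik_max_word performs the finest-grained segmentation.
Use Postman to test the Chinese segmentation:

GET http://127.0.0.1:9200/_analyze?analyzer=ik_smart&pretty=true&text=测试中文分词器
GET http://127.0.0.1:9200/_analyze?analyzer=ik_max_word&pretty=true&text=测试中文分词器

To use the Chinese analyzer, set it on the relevant fields in the mapping when creating the index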

PUT /lib6
{
    "mappings": {
        "properties":{
            "title":{
                "type":"text",
                "analyzer":"ik_max_word"
            },
            "content":{
                "type":"text",
                "analyzer":"ik_smart"
            }
        }
    }
}

Posted 2019-08-26 10:16 by 哈喽哈喽111111
