Simple Data Modeling with Elasticsearch

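I previously installed the Elasticsearch service with Docker, along with the IK Chinese analysis plugin (see "Setting up ES and Kibana with docker-compose and installing the IK Chinese analysis plugin"). Everything below is based on Elasticsearch 7.3. Related documentation: [elasticsearch 7.3 reference]
Data modeling is a critical part of working with ES. Relying only on the default mappings rarely meets real business needs, and the query results will usually not match what the business expects. We therefore need to model the data structure based on how ES works and on the actual requirements.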

Example 1: E-commerce promotion data


{
  "id": 536600477,
  "name": "黑色外穿打底裤女春秋薄款铅笔裤2019新款高腰九分显瘦紧身小脚裤",
  "image": "http://img.alicdn.com/bao/uploaded/i4/1687728515/O1CN015vKRk22Clv2z9jVKM_!!0-item_pic.jpg",
  "item_url": "http://item.taobao.com/item.htm?id=536600477798",
  "shop_name": "XXX旗舰店",
  "price": 35.00,
  "sales": 12866,
  "contact_info": "XXX旗舰店",
  "short_url": "https://s.click.taobao.com/6dhjX0w",
  "sales_url": "https://s.click.taobao.com/t?e=m%3D2%26s%3DhqNnFErxaS0cQipKwQzePOeEDrYVVa64K7Vc7tFgwiG3bLqV5UHdqSJ215tW5ra7%2Fl0%2B1yuzCtL9CVjm9%2FaTIMEcIrQjme5phH%2FwEhdaGdpwfW9VvJkbiUOLibAxXu8J4DrzI0Q%2Bh5mWydDa%2BK5%2FZ44CXhN9RDLu87eUjW4Ylwlp3E7b2H5imSCyCj9paIOIxiXvDf8DaRs%3D",
  "sales_pass": "￥q6vvYNlY15Y￥",
  "coupon_total_num": 50000,
  "coupon_remaining_num": 49981,
  "coupon_quota": "满35减10",
  "coupon_start_date": "2019-09-20",
  "coupon_end_date": "2019-09-25",
  "coupon_url": "https://uland.taobao.com/coupon/edetail?e=EpEKjA4ejsRt3vqbdXnGlgxMgopp14njlHycenxkSuDwJfMHI%2FfVmw2KFrzHTGtgHv69%2F64THFCtOwU1ltpiC5ZrJ2LltVbgH31ZeQAUzbQ%3D&af=1&pid=mm_226490165_153450382_44990650090",
  "coupon_pass": "￥b0NmYNlbC8t￥",
  "coupon_short_url": "https://s.click.taobao.com/XRkjX0w"
}

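From the data structure above, we can draw the following conclusions:

"id" is an integer, so its type can be "integer" or "long".
"name" is a string that must be searchable and needs to be analyzed. Set its type to "text", use the "ik_max_word" analyzer at index time, and use "ik_smart" as the search analyzer. PS: "text" fields are analyzed (tokenized); "keyword" fields are not.
"image" is the product image link. It is only used for display and never as a search condition, so it needs neither indexing nor aggregation: set "enabled": false (the value is still returned as part of "_source"). Other fields with the same requirement are set the same way.
"item_url" is handled the same way as "image".
"shop_name" must be searchable, but shop names follow no convention and IK's built-in dictionary cannot tokenize them well, so there are two options: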

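1. Add the shop_name values to the IK dictionary and use the "ik_max_word" analyzer. At search time, analyze the query with "ik_smart".

PS: for how to add custom dictionary entries, see "How to add custom tokens".

2. Do not analyze shop_name: set its type to "keyword" and search it with a "term query".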

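"price" is the product price; set its type to "double".
"sales" is the sales volume; set its type to "integer".
"contact_info" is the contact information; it does not need to be analyzed, so set it to "keyword".
"short_url" is the short promotion link for the product; it does not need to be analyzed and is only displayed, so it is handled like "image".
"sales_url" is the promotion link for the product; it does not need to be analyzed and is handled like "image".
"sales_pass" is the product promotion code; it does not need to be analyzed, so set it to "keyword".
"coupon_total_num" is the total number of coupons; set it to "integer".
"coupon_remaining_num" is the number of coupons remaining; set it to "integer".
"coupon_quota" is the coupon amount; set it to "keyword".
"coupon_start_date" is the coupon start date; set it to the "date" type with "format" "yyyy-MM-dd".
"coupon_end_date" is the coupon end date; set it to the "date" type with "format" "yyyy-MM-dd".
"coupon_url" is the coupon link; it is handled like "image".
"coupon_pass" is the coupon promotion code; it does not need to be analyzed, so set it to "keyword".
"coupon_short_url" is the short coupon link; it is handled like "image".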
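
The choice between "integer" and "long" for "id" comes down to range: an Elasticsearch "integer" is a signed 32-bit value. The following is a minimal Python sketch (the helper name `fits_es_integer` is hypothetical, not part of any library) that checks the sample "id" and the longer id that appears in the sample "item_url":

```python
# Bounds of Elasticsearch's "integer" type (signed 32-bit).
ES_INTEGER_MIN = -2**31
ES_INTEGER_MAX = 2**31 - 1


def fits_es_integer(value: int) -> bool:
    """Return True if value can be stored in an "integer" field."""
    return ES_INTEGER_MIN <= value <= ES_INTEGER_MAX


print(fits_es_integer(536600477))     # sample "id" -> True
print(fits_es_integer(536600477798))  # id from the sample "item_url" -> False
```

If ids may take the longer form, "long" is the safer choice.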

Based on the analysis above, the final mapping is:

PUT item_index
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "id": {
        "type": "long"
      },
      "name": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "image": {
        "enabled": false
      },
      "item_url": {
        "enabled": false
      },
      "shop_name": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "price": {
        "type": "double"
      },
      "sales": {
        "type": "integer"
      },
      "contact_info": {
        "type": "keyword"
      },
      "short_url": {
        "enabled": false
      },
      "sales_url": {
        "enabled": false
      },
      "sales_pass": {
        "type": "keyword"
      },
      "coupon_total_num": {
        "type": "integer"
      },
      "coupon_remaining_num": {
        "type": "integer"
      },
      "coupon_quota": {
        "type": "keyword"
      },
      "coupon_start_date": {
        "type": "date",
        "format": "yyyy-MM-dd"
      },
      "coupon_end_date": {
        "type": "date",
        "format": "yyyy-MM-dd"
      },
      "coupon_url": {
        "enabled": false
      },
      "coupon_pass": {
        "type": "keyword"
      },
      "coupon_short_url": {
        "enabled": false
      }
    }
  }
}
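
To illustrate how this mapping might be queried, here is a minimal Python sketch that only builds a search request body for item_index; it does not contact a cluster. The function name `build_item_search` and its parameters are hypothetical. It combines a full-text match on "name", an exact term filter on the "shop_name.keyword" sub-field, and date range filters on the coupon dates.

```python
import json


def build_item_search(keyword, shop_name=None, coupon_valid_on=None):
    """Build a search body for item_index (hypothetical helper)."""
    filters = []
    if shop_name is not None:
        # Exact match against the un-analyzed keyword sub-field.
        filters.append({"term": {"shop_name.keyword": shop_name}})
    if coupon_valid_on is not None:
        # The coupon must have started and not yet ended on the given
        # day, written in the mapping's yyyy-MM-dd format.
        filters.append({"range": {"coupon_start_date": {"lte": coupon_valid_on}}})
        filters.append({"range": {"coupon_end_date": {"gte": coupon_valid_on}}})
    return {
        "query": {
            "bool": {
                # "name" is analyzed with ik_smart at search time, per the mapping.
                "must": [{"match": {"name": keyword}}],
                "filter": filters,
            }
        }
    }


body = build_item_search("打底裤", shop_name="XXX旗舰店", coupon_valid_on="2019-09-21")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be used as the body of a `GET item_index/_search` request. Putting the exact-match conditions under "filter" keeps them out of relevance scoring, so only the "name" match affects ranking.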

Example 2: Server log data

222.67.85.228 - - [14/Nov/2018:14:30:34 +0800] "GET /search?keyword=&hasCoupon=0&pageNum=1&pageSize=100 HTTP/1.1" 200 12268 "-" "Apache-HttpClient/4.5.5 (Java/1.8.0_131)" "-"

After log formatting, each nginx log line is converted into the following data structure:

{
    "ip": "222.67.85.228",
    "username": "-",
    "time": "2018-11-14 14:30:34",
    "request_action": "GET",
    "request_url": "/search?keyword=&hasCoupon=0&pageNum=1&pageSize=100",
    "http_version": "1.1",
    "response_status": 200,
    "bytes": 12268,
    "referrer": "-",
    "agent": "Apache-HttpClient/4.5.5 (Java/1.8.0_131)",
    "http_forward": "-"
}
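
The log formatting step can be sketched in Python as follows. This is one possible approach, not necessarily the tool the author used: the regular expression assumes the line layout shown above (nginx "combined" format followed by a quoted forwarded-for value), and the names `LOG_PATTERN`, `format_time`, and `parse_line` are hypothetical. The size field is emitted as "bytes", the name used in the index mapping.

```python
import re

# Assumed layout: ip, ident, user, [time], "request", status, size,
# "referrer", "agent", "forwarded-for".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<username>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request_action>\S+) (?P<request_url>\S+) HTTP/(?P<http_version>[^"]+)" '
    r'(?P<response_status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)" "(?P<http_forward>[^"]*)"'
)

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def format_time(raw):
    """Turn "14/Nov/2018:14:30:34 +0800" into "2018-11-14 14:30:34".

    The UTC offset is dropped, matching the sample document above.
    """
    stamp = raw.split(" ")[0]
    day, month, rest = stamp.split("/")
    year, clock = rest.split(":", 1)
    return f"{year}-{MONTHS[month]:02d}-{int(day):02d} {clock}"


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    doc["time"] = format_time(doc["time"])
    doc["response_status"] = int(doc["response_status"])
    doc["bytes"] = int(doc["bytes"])
    return doc


sample = ('222.67.85.228 - - [14/Nov/2018:14:30:34 +0800] '
          '"GET /search?keyword=&hasCoupon=0&pageNum=1&pageSize=100 HTTP/1.1" '
          '200 12268 "-" "Apache-HttpClient/4.5.5 (Java/1.8.0_131)" "-"')
print(parse_line(sample))
```

Because the offset is dropped, Elasticsearch will interpret the stored time as UTC; keeping the offset would require a different date format in the mapping.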

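Logs are usually queried along two dimensions: time and response status. For example, we may need to find all requests since January 1, 2019 that returned status 500. Almost none of the log fields need to be analyzed, since they are mostly just displayed, so string fields are generally of type "keyword". For the date field, pay attention to the format.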

PUT nginx_log_index
{
    "mappings": {
        "dynamic": false,
        "properties":  {
            "ip":  {
                "type": "keyword"
            },
            "username":  {
                "type": "keyword"
            },
            "time":  {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss"
            },
            "request_action":  {
                "type": "keyword"
            },
            "request_url":  {
                "enabled": false
            },
            "http_version":  {
                "type": "keyword"
            },
            "response_status":  {
                "type": "integer"
            },
            "bytes":  {
                "type": "long"
            },
            "referrer":  {
                "type": "keyword"
            },
            "agent":  {
                "type": "keyword"
            },
            "http_forward":  {
                "type": "keyword"
            }
        }
    }
}
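
The example query mentioned earlier (status 500 since January 1, 2019) could be built against this mapping as in the following Python sketch. It only constructs the request body; `build_status_query` is a hypothetical helper name, and the date string is assumed to follow the mapping's "yyyy-MM-dd HH:mm:ss" format.

```python
import json


def build_status_query(status, since):
    """Build a search body for nginx_log_index (hypothetical helper).

    since must use the mapping's date format, yyyy-MM-dd HH:mm:ss.
    """
    return {
        "query": {
            "bool": {
                # Filter clauses skip relevance scoring, which suits
                # exact lookups on log fields.
                "filter": [
                    {"term": {"response_status": status}},
                    {"range": {"time": {"gte": since}}},
                ]
            }
        },
        # Newest entries first.
        "sort": [{"time": {"order": "desc"}}],
    }


body = build_status_query(500, "2019-01-01 00:00:00")
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of a `GET nginx_log_index/_search` request.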

Example 3: Blog data

{    
    "id": "1",
    "url": "https://www.xxxx.com/p100000",
    "title": "elasticsearch简单数据建模",
    "author": "李四",
    "content": "elasticsearch。。。。。。。。。。。简单数据建模", 
    "time": "2019.04.10 21:08:21", 
    "word_num": 1000, 
    "read_num": 20, 
    "like_num": 1, 
    "reward_num": 0 
}

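Blog content is large. To avoid carrying the full content in every search response, store each field separately and retrieve only what is needed at query time. To do this, set "enabled": false on "_source" and set "store": true on each field individually. Note that with "_source" disabled, features that depend on it, such as the update API and reindex, are no longer available.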

PUT blog_index
{
    "mappings": {
        "dynamic": false,
        "_source": {
            "enabled": false
        }, 
        "properties":  {
            "id": {
                "type":  "keyword",
                "store":  true
            },
            "url": {
                "type":  "keyword",
                "store":  true,
                "ignore_above":  100,
                "doc_values":  false,
                "norms":  false
            },
            "title": {
                "type":  "text",
                "store":  true,
                "analyzer":  "ik_max_word",
                "search_analyzer": "ik_smart",
                "fields": {
                    "keyword": {
                        "type":  "keyword"
                    }
                }
            },
            "author": {
                "type":  "keyword",
                "store":  true
            },
            "content": {
                "type":  "text",
                "analyzer":  "ik_max_word",
                "search_analyzer": "ik_smart",
                "store":  true
            },
            "time": {
                "type":  "date",
                "format":  "yyyy.MM.dd HH:mm:ss",
                "store":  true
            },
            "word_num": {
                "type":  "integer",
                "store":  true
            },
            "read_num": {
                "type":  "integer",
                "store":  true
            },
            "like_num": {
                "type":  "integer",
                "store":  true
            },
            "reward_num": {
                "type":  "integer",
                "store":  true
            }
        }
    }
}
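
Since "_source" is disabled in this index, a search has to name the stored fields it wants back. The following Python sketch builds such a request body using the search API's "stored_fields" parameter; the helper name `build_blog_search` and its default field list are hypothetical choices for illustration.

```python
import json


def build_blog_search(keyword, fields=("title", "author", "time")):
    """Build a search body for blog_index (hypothetical helper).

    Only the listed stored fields are returned, so the large
    "content" field is left out unless explicitly requested.
    """
    return {
        "stored_fields": list(fields),
        "query": {"match": {"title": keyword}},
    }


body = build_blog_search("数据建模")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be used as the body of a `GET blog_index/_search` request. Each hit then carries a "fields" object instead of "_source", and stored field values come back as arrays.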

Readers can further optimize the examples above based on their own understanding of ES and their actual business requirements.

posted @ 2020-09-02 16:34 wisdomx