Full-text search with match and the IK analyzer

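Create the index and set the analyzer to IK. The default is the standard analyzer, which splits Chinese text into single characters, so search results would be a mess without this step.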

PUT /account
{
  "mappings": {
    "person": {
      "properties": {
        "user": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "desc": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}

This uses the IK analyzer in its finest-grained segmentation mode, ik_max_word.
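
IK also provides a coarser mode, ik_smart. As a quick way to see what "finest-grained" means in practice, the same text can be run through both analyzers with the _analyze API (which needs no documents in the index) and the two token lists compared. The request below is only a sketch; the sample text is the one used later in this post.

```console
# Sketch: tokenize the sample text with IK's coarser mode for comparison.
# Swap "ik_smart" for "ik_max_word" to see the finest-grained result.
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "数据库管理"
}
```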

Write one document

POST /account/person
{
  "user": "张三",
  "title": "工程师",
  "desc": "数据库管理"
}

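When a document is written, its text fields are tokenized by the configured analyzer. You can check how a given string is tokenized as follows: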

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "数据库管理"
}

The result:

{
  "tokens": [
    {
      "token": "数据库",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "数据",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "库",
      "start_offset": 2,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "管理",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 5
    }
  ]
}

At this point, a match search on any one of the tokens above returns this document.

POST /account/person/_search
{
  "query": {
    "match": {
      "desc": "管理"
    }
  }
}
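
The match query analyzes the query string as well, using the field's search_analyzer, and by default a document matches if any one of the resulting tokens matches. As a small sketch of tightening that behaviour, the operator parameter of match can be set to and, so that every token of the query must be present. The query text below is only an illustrative choice.

```console
# Sketch: require all tokens of the analyzed query text to match.
# The query string is an illustrative choice, not from the original post.
POST /account/person/_search
{
  "query": {
    "match": {
      "desc": {
        "query": "数据库管理",
        "operator": "and"
      }
    }
  }
}
```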

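Note, however, that IK did not produce the token "数" here. We can extend the dictionary with a custom word list loaded remotely; in the IK plugin this is configured through the remote_ext_dict entry in IKAnalyzer.cfg.xml, which points at a URL serving one word per line.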

Running the same test again gives:

{
  "tokens": [
    {
      "token": "数据库",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "数据",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "数",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "据",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "库",
      "start_offset": 2,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "管理",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 5
    }
  ]
}

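There is a catch, though: documents already written to ES were not tokenized with "数" when their inverted index entries were built, so old documents cannot be found by a full-text search for "数". A newly added document, on the other hand, is tokenized with the updated dictionary and can be found.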
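
To check this behaviour, one could index a second document after the dictionary update and then search for "数". The sketch below assumes the updated dictionary has already been loaded by the node; the second person (李四, 分析师) is made-up sample data, not from the original post. Following the reasoning above, only the newly indexed document would be expected in the hits.

```console
# Sketch: index a new document after the dictionary update.
# "李四" / "分析师" are hypothetical sample values.
POST /account/person
{
  "user": "李四",
  "title": "分析师",
  "desc": "数据库管理"
}

# Then search for the newly added dictionary word.
POST /account/person/_search
{
  "query": {
    "match": {
      "desc": "数"
    }
  }
}
```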

####How should the old data be handled? Think it over.
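
One possible approach, offered here as a sketch rather than as the answer the post has in mind: re-index the existing documents so that they pass through the analyzer again with the updated dictionary. Elasticsearch's _update_by_query API, called without a script, rewrites each document from its stored _source, so it requires _source to be enabled (the default). conflicts=proceed is an optional choice that keeps the request going if a document changes while it runs.

```console
# Sketch: rewrite every document in the index in place so that the
# text fields are analyzed again with the current IK dictionary.
POST /account/_update_by_query?conflicts=proceed
```

For a large index, or when the mapping itself has to change, copying the data into a fresh index with the _reindex API is the usual alternative.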

posted on 2018-04-30 00:53 by 老曹123
