Elasticsearch built-in tokenizers


Preface

Elasticsearch ships with built-in analyzers, and these analyzers in turn contain tokenizers. A tokenizer, as the name suggests, breaks a text string into small chunks, and these chunks are called tokens.

Standard tokenizer: standard

The standard tokenizer is a grammar-based tokenizer that works well for most European languages. It also handles Unicode text segmentation; the default maximum token length is 255 (the max_token_length setting), and it removes punctuation such as commas and periods.

POST _analyze
{
  "tokenizer": "standard",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "To",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "That",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "莎",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "士",
      "start_offset" : 46,
      "end_offset" : 47,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "比",
      "start_offset" : 47,
      "end_offset" : 48,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "亚",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    }
  ]
}
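
The 255-character limit mentioned above can be changed through the tokenizer's max_token_length setting. Below is a minimal sketch, mirroring the custom-tokenizer pattern used later in this post; the index name standard_test and the tokenizer name my_std_tokenizer are made up for illustration. A token longer than the limit is split at max_token_length intervals.

PUT standard_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_std_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST standard_test/_analyze
{
  "tokenizer": "my_std_tokenizer",
  "text": "question"
}

With max_token_length set to 5, "question" comes back as two tokens: quest and ion.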

Keyword tokenizer: keyword

The keyword tokenizer is a trivial tokenizer: it emits the entire input text as a single token and hands it to the token filters. It is a good choice when you only want to apply token filters without actually splitting the text.

POST _analyze
{
  "tokenizer": "keyword",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "To be or not to be,  That is a question ———— 莎士比亚",
      "start_offset" : 0,
      "end_offset" : 49,
      "type" : "word",
      "position" : 0
    }
  ]
}
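
In practice the keyword tokenizer is almost always combined with token filters. For example, to normalize the whole string to lowercase while still keeping it as a single token, you can pass the built-in lowercase filter along with the tokenizer to _analyze (a minimal sketch):

POST _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

The output is still a single token covering the whole text, just lowercased.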

Letter tokenizer: letter

The letter tokenizer splits text into tokens on any character that is not a letter.

POST _analyze
{
  "tokenizer": "letter",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "To",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "That",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "莎士比亚",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 10
    }
  ]
}

Lowercase tokenizer: lowercase

The lowercase tokenizer combines the behavior of the regular letter tokenizer and the lowercase token filter (which, as you would guess, lowercases every token). The main reason it exists as a single tokenizer is performance: doing both operations in one pass is faster.

POST _analyze
{
  "tokenizer": "lowercase",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "to",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "that",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "莎士比亚",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 10
    }
  ]
}
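
As a sanity check, the output above should be identical to what you get by combining the letter tokenizer with the lowercase token filter yourself:

POST _analyze
{
  "tokenizer": "letter",
  "filter": ["lowercase"],
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

The token list it returns should match the lowercase tokenizer result shown above.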

Whitespace tokenizer: whitespace

The whitespace tokenizer splits tokens on whitespace: spaces, tabs, newlines, and so on. Note, however, that it does not remove any punctuation.

POST _analyze
{
  "tokenizer": "whitespace",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "To",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be,",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "That",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "————",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "莎士比亚",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 11
    }
  ]
}

Pattern tokenizer: pattern

The pattern tokenizer splits text into tokens using an arbitrary regular expression that you specify. The default pattern is \W+, i.e. it splits on any sequence of non-word characters.

POST _analyze
{
  "tokenizer": "pattern",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

Now let's build a custom tokenizer that splits on commas.

PUT pattern_test2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer":{
          "type":"pattern",
          "pattern":","
        }
      }
    }
  }
}

In the example above, the custom analyzer my_analyzer defined under settings uses a custom tokenizer named my_tokenizer; in the tokenizer section, at the same level as the analyzer section, we configure that tokenizer, giving it the type pattern and a comma as its pattern.

POST pattern_test2/_analyze
{
  "tokenizer": "my_tokenizer",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "To be or not to be",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "  That is a question ———— 莎士比亚",
      "start_offset" : 19,
      "end_offset" : 49,
      "type" : "word",
      "position" : 1
    }
  ]
}

As the result shows, the text was split into two parts at the comma. Notice that the second token still begins with the two spaces that followed the comma, because only the comma itself matches the pattern.
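
If you want the pattern to swallow that whitespace as well, and you would rather not create an index just to experiment, the _analyze API also accepts an inline tokenizer definition (a sketch; the regex ,\s* matches the comma together with any whitespace after it, and inline tokenizer objects may not be available on older Elasticsearch versions):

POST _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": ",\\s*"
  },
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

With this pattern the second token starts directly at "That".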

UAX URL email tokenizer: uax_url_email

The standard tokenizer is a very good choice when dealing with ordinary English words, but a lot of real-world text contains URLs and email addresses. Suppose, for example, we have the following text:

作者:张开
来源:未知 
原文:https://www.cnblogs.com/Neeo/articles/10402742.html
邮箱:xxxxxxx@xx.com
版权声明:本文为博主原创文章,转载请附上博文链接!

Let's first run it through the standard tokenizer:

POST _analyze
{
  "tokenizer": "standard",
  "text":"作者:张开来源:未知原文:https://www.cnblogs.com/Neeo/articles/10402742.html邮箱:xxxxxxx@xx.com版权声明:本文为博主原创文章,转载请附上博文链接!"
}

The result is quite long:

{
  "tokens" : [
    {
      "token" : "作",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "者",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "张",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "开",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "来",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "源",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "未",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "知",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "原",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "文",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    },
    {
      "token" : "https",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "www.cnblogs.com",
      "start_offset" : 21,
      "end_offset" : 36,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "Neeo",
      "start_offset" : 37,
      "end_offset" : 41,
      "type" : "<ALPHANUM>",
      "position" : 12
    },
    {
      "token" : "articles",
      "start_offset" : 42,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 13
    },
    {
      "token" : "10402742",
      "start_offset" : 51,
      "end_offset" : 59,
      "type" : "<NUM>",
      "position" : 14
    },
    {
      "token" : "html",
      "start_offset" : 60,
      "end_offset" : 64,
      "type" : "<ALPHANUM>",
      "position" : 15
    },
    {
      "token" : "邮",
      "start_offset" : 64,
      "end_offset" : 65,
      "type" : "<IDEOGRAPHIC>",
      "position" : 16
    },
    {
      "token" : "箱",
      "start_offset" : 65,
      "end_offset" : 66,
      "type" : "<IDEOGRAPHIC>",
      "position" : 17
    },
    {
      "token" : "xxxxxxx",
      "start_offset" : 67,
      "end_offset" : 74,
      "type" : "<ALPHANUM>",
      "position" : 18
    },
    {
      "token" : "xx.com",
      "start_offset" : 75,
      "end_offset" : 81,
      "type" : "<ALPHANUM>",
      "position" : 19
    },
    {
      "token" : "版",
      "start_offset" : 81,
      "end_offset" : 82,
      "type" : "<IDEOGRAPHIC>",
      "position" : 20
    },
    {
      "token" : "权",
      "start_offset" : 82,
      "end_offset" : 83,
      "type" : "<IDEOGRAPHIC>",
      "position" : 21
    },
    {
      "token" : "声",
      "start_offset" : 83,
      "end_offset" : 84,
      "type" : "<IDEOGRAPHIC>",
      "position" : 22
    },
    {
      "token" : "明",
      "start_offset" : 84,
      "end_offset" : 85,
      "type" : "<IDEOGRAPHIC>",
      "position" : 23
    },
    {
      "token" : "本",
      "start_offset" : 86,
      "end_offset" : 87,
      "type" : "<IDEOGRAPHIC>",
      "position" : 24
    },
    {
      "token" : "文",
      "start_offset" : 87,
      "end_offset" : 88,
      "type" : "<IDEOGRAPHIC>",
      "position" : 25
    },
    {
      "token" : "为",
      "start_offset" : 88,
      "end_offset" : 89,
      "type" : "<IDEOGRAPHIC>",
      "position" : 26
    },
    {
      "token" : "博",
      "start_offset" : 89,
      "end_offset" : 90,
      "type" : "<IDEOGRAPHIC>",
      "position" : 27
    },
    {
      "token" : "主",
      "start_offset" : 90,
      "end_offset" : 91,
      "type" : "<IDEOGRAPHIC>",
      "position" : 28
    },
    {
      "token" : "原",
      "start_offset" : 91,
      "end_offset" : 92,
      "type" : "<IDEOGRAPHIC>",
      "position" : 29
    },
    {
      "token" : "创",
      "start_offset" : 92,
      "end_offset" : 93,
      "type" : "<IDEOGRAPHIC>",
      "position" : 30
    },
    {
      "token" : "文",
      "start_offset" : 93,
      "end_offset" : 94,
      "type" : "<IDEOGRAPHIC>",
      "position" : 31
    },
    {
      "token" : "章",
      "start_offset" : 94,
      "end_offset" : 95,
      "type" : "<IDEOGRAPHIC>",
      "position" : 32
    },
    {
      "token" : "转",
      "start_offset" : 96,
      "end_offset" : 97,
      "type" : "<IDEOGRAPHIC>",
      "position" : 33
    },
    {
      "token" : "载",
      "start_offset" : 97,
      "end_offset" : 98,
      "type" : "<IDEOGRAPHIC>",
      "position" : 34
    },
    {
      "token" : "请",
      "start_offset" : 98,
      "end_offset" : 99,
      "type" : "<IDEOGRAPHIC>",
      "position" : 35
    },
    {
      "token" : "附",
      "start_offset" : 99,
      "end_offset" : 100,
      "type" : "<IDEOGRAPHIC>",
      "position" : 36
    },
    {
      "token" : "上",
      "start_offset" : 100,
      "end_offset" : 101,
      "type" : "<IDEOGRAPHIC>",
      "position" : 37
    },
    {
      "token" : "博",
      "start_offset" : 101,
      "end_offset" : 102,
      "type" : "<IDEOGRAPHIC>",
      "position" : 38
    },
    {
      "token" : "文",
      "start_offset" : 102,
      "end_offset" : 103,
      "type" : "<IDEOGRAPHIC>",
      "position" : 39
    },
    {
      "token" : "链",
      "start_offset" : 103,
      "end_offset" : 104,
      "type" : "<IDEOGRAPHIC>",
      "position" : 40
    },
    {
      "token" : "接",
      "start_offset" : 104,
      "end_offset" : 105,
      "type" : "<IDEOGRAPHIC>",
      "position" : 41
    }
  ]
}

This is clearly not what we want: the email address and the URL have been chopped into fragments. For cases like this we should use the UAX URL email tokenizer (uax_url_email), which keeps email addresses and URLs as single tokens.

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text":"作者:张开来源:未知原文:https://www.cnblogs.com/Neeo/articles/10402742.html邮箱:xxxxxxx@xx.com版权声明:本文为博主原创文章,转载请附上博文链接!"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "作",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "者",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "张",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "开",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "来",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "源",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "未",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "知",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "原",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "文",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    },
    {
      "token" : "https://www.cnblogs.com/Neeo/articles/10402742.html",
      "start_offset" : 13,
      "end_offset" : 64,
      "type" : "<URL>",
      "position" : 10
    },
    {
      "token" : "邮",
      "start_offset" : 64,
      "end_offset" : 65,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "箱",
      "start_offset" : 65,
      "end_offset" : 66,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "xxxxxxx@xx.com",
      "start_offset" : 67,
      "end_offset" : 81,
      "type" : "<EMAIL>",
      "position" : 13
    },
    {
      "token" : "版",
      "start_offset" : 81,
      "end_offset" : 82,
      "type" : "<IDEOGRAPHIC>",
      "position" : 14
    },
    {
      "token" : "权",
      "start_offset" : 82,
      "end_offset" : 83,
      "type" : "<IDEOGRAPHIC>",
      "position" : 15
    },
    {
      "token" : "声",
      "start_offset" : 83,
      "end_offset" : 84,
      "type" : "<IDEOGRAPHIC>",
      "position" : 16
    },
    {
      "token" : "明",
      "start_offset" : 84,
      "end_offset" : 85,
      "type" : "<IDEOGRAPHIC>",
      "position" : 17
    },
    {
      "token" : "本",
      "start_offset" : 86,
      "end_offset" : 87,
      "type" : "<IDEOGRAPHIC>",
      "position" : 18
    },
    {
      "token" : "文",
      "start_offset" : 87,
      "end_offset" : 88,
      "type" : "<IDEOGRAPHIC>",
      "position" : 19
    },
    {
      "token" : "为",
      "start_offset" : 88,
      "end_offset" : 89,
      "type" : "<IDEOGRAPHIC>",
      "position" : 20
    },
    {
      "token" : "博",
      "start_offset" : 89,
      "end_offset" : 90,
      "type" : "<IDEOGRAPHIC>",
      "position" : 21
    },
    {
      "token" : "主",
      "start_offset" : 90,
      "end_offset" : 91,
      "type" : "<IDEOGRAPHIC>",
      "position" : 22
    },
    {
      "token" : "原",
      "start_offset" : 91,
      "end_offset" : 92,
      "type" : "<IDEOGRAPHIC>",
      "position" : 23
    },
    {
      "token" : "创",
      "start_offset" : 92,
      "end_offset" : 93,
      "type" : "<IDEOGRAPHIC>",
      "position" : 24
    },
    {
      "token" : "文",
      "start_offset" : 93,
      "end_offset" : 94,
      "type" : "<IDEOGRAPHIC>",
      "position" : 25
    },
    {
      "token" : "章",
      "start_offset" : 94,
      "end_offset" : 95,
      "type" : "<IDEOGRAPHIC>",
      "position" : 26
    },
    {
      "token" : "转",
      "start_offset" : 96,
      "end_offset" : 97,
      "type" : "<IDEOGRAPHIC>",
      "position" : 27
    },
    {
      "token" : "载",
      "start_offset" : 97,
      "end_offset" : 98,
      "type" : "<IDEOGRAPHIC>",
      "position" : 28
    },
    {
      "token" : "请",
      "start_offset" : 98,
      "end_offset" : 99,
      "type" : "<IDEOGRAPHIC>",
      "position" : 29
    },
    {
      "token" : "附",
      "start_offset" : 99,
      "end_offset" : 100,
      "type" : "<IDEOGRAPHIC>",
      "position" : 30
    },
    {
      "token" : "上",
      "start_offset" : 100,
      "end_offset" : 101,
      "type" : "<IDEOGRAPHIC>",
      "position" : 31
    },
    {
      "token" : "博",
      "start_offset" : 101,
      "end_offset" : 102,
      "type" : "<IDEOGRAPHIC>",
      "position" : 32
    },
    {
      "token" : "文",
      "start_offset" : 102,
      "end_offset" : 103,
      "type" : "<IDEOGRAPHIC>",
      "position" : 33
    },
    {
      "token" : "链",
      "start_offset" : 103,
      "end_offset" : 104,
      "type" : "<IDEOGRAPHIC>",
      "position" : 34
    },
    {
      "token" : "接",
      "start_offset" : 104,
      "end_offset" : 105,
      "type" : "<IDEOGRAPHIC>",
      "position" : 35
    }
  ]
}
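
To use this tokenizer on a real field, wrap it in a custom analyzer and reference that analyzer in the mapping. A minimal sketch follows; the index name uax_test, the analyzer name and the field name are made up, and the typeless mapping assumes Elasticsearch 7.x or later:

PUT uax_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "url_email_analyzer": {
          "tokenizer": "uax_url_email",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "url_email_analyzer"
      }
    }
  }
}

Because the URL and the email address are each indexed as a single token, a match query for the full URL or the full address on the content field can find the document.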

Path hierarchy tokenizer: path_hierarchy

The path hierarchy tokenizer (path_hierarchy) lets you index file-system-style paths in such a way that, at search time, files sharing the same path prefix are returned together.

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text":"/usr/local/python/python2.7"
}

The returned result is as follows:

{
  "tokens" : [
    {
      "token" : "/usr",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local/python",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local/python/python2.7",
      "start_offset" : 0,
      "end_offset" : 27,
      "type" : "word",
      "position" : 0
    }
  ]
}
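
The path hierarchy tokenizer also accepts a few options, such as delimiter (defaults to /), replacement (the character emitted in place of the delimiter), reverse and skip. Here is a sketch that splits on - but writes / into the tokens; the index name path_test and the tokenizer name my_path_tokenizer are made up for illustration:

PUT path_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_path_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/"
        }
      }
    }
  }
}

POST path_test/_analyze
{
  "tokenizer": "my_path_tokenizer",
  "text": "one-two-three"
}

This should produce the tokens one, one/two and one/two/three.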

See also: elasticsearch tokenizers
Corrections are welcome. That's all.

 
 
 