ELK Part 8: Analyzers
What actually happens after data is sent to Elasticsearch?
The Elasticsearch analysis process: after a document is sent to Elasticsearch, and before it is added to the inverted index, Elasticsearch runs the document through a series of processing steps:
- Character filtering: transform the characters with character filters. Optional.
- Tokenization: split the text (document) into one or more tokens. Required.
- Token filtering: transform each token with token filters. Optional.
- Token indexing: finally, store the tokens in the Lucene inverted index.
An analyzer can therefore include (a minimal sketch follows this list):
- optional character filters
- exactly one tokenizer
- zero or more token filters
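As a minimal sketch of how these pieces are wired together (the index name my_custom_index and the analyzer name my_custom_analyzer are made up for illustration; html_strip, standard, lowercase, and asciifolding are all built-in components), a custom analyzer is declared in the index settings roughly like this:
PUT my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
Here char_filter holds the optional character filters, tokenizer names the single required tokenizer, and filter lists zero or more token filters; once the index exists you can exercise it with POST my_custom_index/_analyze.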
1. Built-in analyzers
The built-in analyzers include:
- Standard analyzer: standard analyzer
- Simple analyzer: simple analyzer
- Whitespace analyzer: whitespace analyzer
- Stop analyzer: stop analyzer
- Keyword analyzer: keyword analyzer
- Pattern analyzer: pattern analyzer
- Language and multi-language analyzers: e.g. chinese
- Snowball analyzer: snowball analyzer
Standard analyzer (standard analyzer): the default analyzer in Elasticsearch.
It combines defaults that are sensible for most European languages, including the standard tokenizer, the standard token filter, the lowercase token filter, and the stop token filter (whose stop word list is empty by default, which is why words such as to and be survive in the example below).
POST _analyze
{
  "analyzer": "standard",
  "text": "To be or not to be, That is a question ———— 莎士比亚"
}
The tokenization result is as follows:
{ "tokens" : [ { "token" : "to", "start_offset" : 0, "end_offset" : 2, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "be", "start_offset" : 3, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "or", "start_offset" : 6, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "not", "start_offset" : 9, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "to", "start_offset" : 13, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "be", "start_offset" : 16, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 5 }, { "token" : "that", "start_offset" : 21, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 6 }, { "token" : "is", "start_offset" : 26, "end_offset" : 28, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "a", "start_offset" : 29, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "question", "start_offset" : 31, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "莎", "start_offset" : 45, "end_offset" : 46, "type" : "<IDEOGRAPHIC>", "position" : 10 }, { "token" : "士", "start_offset" : 46, "end_offset" : 47, "type" : "<IDEOGRAPHIC>", "position" : 11 }, { "token" : "比", "start_offset" : 47, "end_offset" : 48, "type" : "<IDEOGRAPHIC>", "position" : 12 }, { "token" : "亚", "start_offset" : 48, "end_offset" : 49, "type" : "<IDEOGRAPHIC>", "position" : 13 } ] }
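The standard analyzer is also configurable. As a sketch (the index name standard_test and the analyzer name my_standard are made up for illustration), a standard-type analyzer can be given a real stop word list through the stopwords parameter:
PUT standard_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
Analyzing the same sentence with my_standard should then drop common English words such as to, be, or, not, that, is, and a, leaving essentially question and the single CJK characters.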
Simple analyzer (simple analyzer): uses only the lowercase tokenizer.
This means it splits the text at non-letter characters and lowercases every token. It works poorly for Asian languages, which are not delimited by whitespace, so it is generally used for European languages.
POST _analyze
{
  "analyzer": "simple",
  "text": "To be or not to be, That is a question ———— 莎士比亚"
}
The tokenization result is as follows:
{ "tokens" : [ { "token" : "to", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "be", "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 1 }, { "token" : "or", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 2 }, { "token" : "not", "start_offset" : 9, "end_offset" : 12, "type" : "word", "position" : 3 }, { "token" : "to", "start_offset" : 13, "end_offset" : 15, "type" : "word", "position" : 4 }, { "token" : "be", "start_offset" : 16, "end_offset" : 18, "type" : "word", "position" : 5 }, { "token" : "that", "start_offset" : 21, "end_offset" : 25, "type" : "word", "position" : 6 }, { "token" : "is", "start_offset" : 26, "end_offset" : 28, "type" : "word", "position" : 7 }, { "token" : "a", "start_offset" : 29, "end_offset" : 30, "type" : "word", "position" : 8 }, { "token" : "question", "start_offset" : 31, "end_offset" : 39, "type" : "word", "position" : 9 }, { "token" : "莎士比亚", "start_offset" : 45, "end_offset" : 49, "type" : "word", "position" : 10 } ] }
Whitespace analyzer (whitespace analyzer): simply splits the text into tokens on whitespace.
POST _analyze
{
  "analyzer": "whitespace",
  "text": "To be or not to be, That is a question ———— 莎士比亚"
}
The tokenization result is as follows:
{ "tokens" : [ { "token" : "To", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "be", "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 1 }, { "token" : "or", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 2 }, { "token" : "not", "start_offset" : 9, "end_offset" : 12, "type" : "word", "position" : 3 }, { "token" : "to", "start_offset" : 13, "end_offset" : 15, "type" : "word", "position" : 4 }, { "token" : "be,", "start_offset" : 16, "end_offset" : 19, "type" : "word", "position" : 5 }, { "token" : "That", "start_offset" : 21, "end_offset" : 25, "type" : "word", "position" : 6 }, { "token" : "is", "start_offset" : 26, "end_offset" : 28, "type" : "word", "position" : 7 }, { "token" : "a", "start_offset" : 29, "end_offset" : 30, "type" : "word", "position" : 8 }, { "token" : "question", "start_offset" : 31, "end_offset" : 39, "type" : "word", "position" : 9 }, { "token" : "————", "start_offset" : 40, "end_offset" : 44, "type" : "word", "position" : 10 }, { "token" : "莎士比亚", "start_offset" : 45, "end_offset" : 49, "type" : "word", "position" : 11 } ] }
Stop analyzer (stop analyzer): behaves much like the simple analyzer, except that it additionally filters stop words out of the token stream.
POST _analyze
{
  "analyzer": "stop",
  "text": "To be or not to be, That is a question ———— 莎士比亚"
}
The result is accordingly very simple:
{ "tokens" : [ { "token" : "question", "start_offset" : 31, "end_offset" : 39, "type" : "word", "position" : 9 }, { "token" : "莎士比亚", "start_offset" : 45, "end_offset" : 49, "type" : "word", "position" : 10 } ] }
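The stop word list is configurable. A sketch (the index name stop_test and the analyzer name my_stop are made up for illustration) that supplies its own list instead of the default English one:
PUT stop_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["to", "be", "or", "not"]
        }
      }
    }
  }
}
With this list, analyzing the same sentence should keep that, is, a and question, dropping only the words in the custom list.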
Keyword analyzer (keyword analyzer): treats the entire field as a single token. Unless there is a real need for it, we do not use the keyword analyzer in mappings.
POST _analyze
{
  "analyzer": "keyword",
  "text": "To be or not to be, That is a question ———— 莎士比亚"
}
The result is as follows:
{ "tokens" : [ { "token" : "To be or not to be, That is a question ———— 莎士比亚", "start_offset" : 0, "end_offset" : 49, "type" : "word", "position" : 0 } ] }
Pattern analyzer (pattern analyzer): lets us specify a pattern on which the text is split into tokens.
Usually, though, the better approach is a custom analyzer that combines the existing pattern tokenizer with whatever token filters are needed.
POST _analyze
{
  "analyzer": "pattern",
  "explain": false,
  "text": "To be or not to be, That is a question ———— 莎士比亚"
}
The result is as follows:
{ "tokens" : [ { "token" : "to", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "be", "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 1 }, { "token" : "or", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 2 }, { "token" : "not", "start_offset" : 9, "end_offset" : 12, "type" : "word", "position" : 3 }, { "token" : "to", "start_offset" : 13, "end_offset" : 15, "type" : "word", "position" : 4 }, { "token" : "be", "start_offset" : 16, "end_offset" : 18, "type" : "word", "position" : 5 }, { "token" : "that", "start_offset" : 21, "end_offset" : 25, "type" : "word", "position" : 6 }, { "token" : "is", "start_offset" : 26, "end_offset" : 28, "type" : "word", "position" : 7 }, { "token" : "a", "start_offset" : 29, "end_offset" : 30, "type" : "word", "position" : 8 }, { "token" : "question", "start_offset" : 31, "end_offset" : 39, "type" : "word", "position" : 9 } ] }
Example of a custom pattern analyzer: a regex suited to tokenizing email addresses.
PUT pattern_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}
In the example above, we configure the custom analyzer at index-creation time. Note that inside a JSON string the backslashes of the regular expression have to be escaped.
Now let's analyze some text with the custom analyzer.
POST pattern_test/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
The result is as follows:
{ "tokens" : [ { "token" : "john", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "smith", "start_offset" : 5, "end_offset" : 10, "type" : "word", "position" : 1 }, { "token" : "foo", "start_offset" : 11, "end_offset" : 14, "type" : "word", "position" : 2 }, { "token" : "bar", "start_offset" : 15, "end_offset" : 18, "type" : "word", "position" : 3 }, { "token" : "com", "start_offset" : 19, "end_offset" : 22, "type" : "word", "position" : 4 } ] }
Language and multi-language analyzers: chinese
We can pick a particular language analyzer by specifying one of the supported language names, which must be lowercase. If the language you want to analyze is not in that set, you may also need a corresponding plugin.
POST _analyze
{
  "analyzer": "chinese",
  "text": "To be or not to be, That is a question ———— 莎士比亚"
}
The result is as follows:
{ "tokens" : [ { "token" : "question", "start_offset" : 31, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "莎", "start_offset" : 45, "end_offset" : 46, "type" : "<IDEOGRAPHIC>", "position" : 10 }, { "token" : "士", "start_offset" : 46, "end_offset" : 47, "type" : "<IDEOGRAPHIC>", "position" : 11 }, { "token" : "比", "start_offset" : 47, "end_offset" : 48, "type" : "<IDEOGRAPHIC>", "position" : 12 }, { "token" : "亚", "start_offset" : 48, "end_offset" : 49, "type" : "<IDEOGRAPHIC>", "position" : 13 } ] }
Snowball analyzer (snowball analyzer): like the standard analyzer, it uses the standard tokenizer and token filter together with the lowercase and stop token filters; in addition, it runs the Snowball stemmer over the text to reduce words to their stems.
POST _analyze
{
  "analyzer": "snowball",
  "text": "To be or not to be, That is a question ———— 莎士比亚"
}
The result is as follows:
{ "tokens" : [ { "token" : "question", "start_offset" : 31, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "莎", "start_offset" : 45, "end_offset" : 46, "type" : "<IDEOGRAPHIC>", "position" : 10 }, { "token" : "士", "start_offset" : 46, "end_offset" : 47, "type" : "<IDEOGRAPHIC>", "position" : 11 }, { "token" : "比", "start_offset" : 47, "end_offset" : 48, "type" : "<IDEOGRAPHIC>", "position" : 12 }, { "token" : "亚", "start_offset" : 48, "end_offset" : 49, "type" : "<IDEOGRAPHIC>", "position" : 13 } ] }
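The example above does not really show the stemming, because the surviving tokens happen to be in root form already. A quick way to see the Snowball stemmer at work (the sample sentence is made up for illustration):
POST _analyze
{
  "analyzer": "snowball",
  "text": "jumping jumps jumped"
}
All three words should come back reduced to the same stem, jump.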
2. Built-in character filters
Character filters are defined under the char_filter setting and operate on the character stream. There are three character filters:
- HTML strip character filter (HTML Strip Char Filter)
- Mapping character filter (Mapping Char Filter)
- Pattern replace character filter (Pattern Replace Char Filter)
The HTML strip character filter (HTML Strip Char Filter) removes HTML elements from the text.
POST _analyze
{
"tokenizer": "keyword",
"char_filter": ["html_strip"],
"text":"<p>I'm so <b>happy</b>!</p>"
}
The result is as follows:
{
"tokens" : [
{
"token" : """
I'm so happy!
""",
"start_offset" : 0,
"end_offset" : 32,
"type" : "word",
"position" : 0
}
]
}
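The html_strip character filter can also be told to leave certain tags alone through its escaped_tags parameter. A sketch (the index name htmlstrip_test and the names my_html_analyzer and my_html_filter are made up for illustration):
PUT htmlstrip_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_html_filter"]
        }
      },
      "char_filter": {
        "my_html_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}
With this configuration, analyzing the same snippet should strip the <p> tags but keep the <b> tags in the output.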
The mapping character filter (Mapping Char Filter) takes a map of keys to values; whenever it encounters a string matching a key, it replaces that string with the key's associated value.
PUT pattern_test4
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer":{
"tokenizer":"keyword",
"char_filter":["my_char_filter"]
}
},
"char_filter": {
"my_char_filter":{
"type":"mapping",
"mappings":["苍井空 => 666","武藤兰 => 888"]
}
}
}
}
}
In the example above we define a custom analyzer that uses the keyword tokenizer together with a custom mapping character filter, which replaces 苍井空 with 666 and 武藤兰 with 888 wherever they appear in the text.
POST pattern_test4/_analyze
{
"analyzer": "my_analyzer",
"text": "苍井空热爱武藤兰,可惜后来苍井空结婚了"
}
The result is as follows:
{
"tokens" : [
{
"token" : "666热爱888,可惜后来666结婚了",
"start_offset" : 0,
"end_offset" : 19,
"type" : "word",
"position" : 0
}
]
}
The pattern replace character filter (Pattern Replace Char Filter) uses a regular expression to match and replace characters in the string. Be careful with sloppy regular expressions, though, as they can seriously slow things down.
PUT pattern_test5
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(\\d+)-(?=\\d)",
"replacement": "$1_"
}
}
}
}
}
In the example above we define a custom regex replacement rule: each group of digits that is followed by a hyphen and another digit has that hyphen replaced with an underscore.
POST pattern_test5/_analyze
{
"analyzer": "my_analyzer",
"text": "My credit card is 123-456-789"
}
The result is as follows:
{
"tokens" : [
{
"token" : "My",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "credit",
"start_offset" : 3,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "card",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "is",
"start_offset" : 15,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "123_456_789",
"start_offset" : 18,
"end_offset" : 29,
"type" : "<NUM>",
"position" : 4
}
]
}
In real production use, a third-party plugin such as the ik analyzer is generally preferred over the built-in character filters and analyzers above, especially for Chinese text.
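For example, assuming the elasticsearch-analysis-ik plugin is installed (it is not bundled with Elasticsearch), it provides the ik_max_word and ik_smart analyzers, which split Chinese text into real words instead of single characters:
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}
ik_max_word produces the most fine-grained segmentation, while ik_smart produces a coarser one; which to use depends on your indexing and search requirements.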
3. Built-in tokenizers
The built-in tokenizers include:
- Standard tokenizer: standard tokenizer
- Keyword tokenizer: keyword tokenizer
- Letter tokenizer: letter tokenizer
- Lowercase tokenizer: lowercase tokenizer
- Whitespace tokenizer: whitespace tokenizer
- Pattern tokenizer: pattern tokenizer
- UAX URL email tokenizer: uax_url_email tokenizer
- Path hierarchy tokenizer: path hierarchy tokenizer
Since Elasticsearch ships with built-in analyzers, it also ships with built-in tokenizers. A tokenizer, as the name suggests, breaks a text string into smaller pieces, and these pieces are called tokens.
Standard tokenizer: standard tokenizer
The standard tokenizer (standard tokenizer) is a grammar-based tokenizer that works well for most European languages. It also handles Unicode text segmentation, limits tokens to a default maximum length of 255 characters, and removes punctuation such as commas and periods. A configuration sketch follows the example below.
POST _analyze { "tokenizer": "standard", "text":"To be or not to be, That is a question ———— 莎士比亚" } 结果如下: { "tokens" : [ { "token" : "To", "start_offset" : 0, "end_offset" : 2, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "be", "start_offset" : 3, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "or", "start_offset" : 6, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "not", "start_offset" : 9, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "to", "start_offset" : 13, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "be", "start_offset" : 16, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 5 }, { "token" : "That", "start_offset" : 21, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 6 }, { "token" : "is", "start_offset" : 26, "end_offset" : 28, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "a", "start_offset" : 29, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "question", "start_offset" : 31, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "莎", "start_offset" : 45, "end_offset" : 46, "type" : "<IDEOGRAPHIC>", "position" : 10 }, { "token" : "士", "start_offset" : 46, "end_offset" : 47, "type" : "<IDEOGRAPHIC>", "position" : 11 }, { "token" : "比", "start_offset" : 47, "end_offset" : 48, "type" : "<IDEOGRAPHIC>", "position" : 12 }, { "token" : "亚", "start_offset" : 48, "end_offset" : 49, "type" : "<IDEOGRAPHIC>", "position" : 13 } ] }
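That 255-character limit is the standard tokenizer's max_token_length setting. As a sketch (the index name standard_tok_test and the names my_analyzer and my_std_tokenizer are made up for illustration), it can be lowered so that longer tokens get split:
PUT standard_tok_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_std_tokenizer"
        }
      },
      "tokenizer": {
        "my_std_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}
With max_token_length set to 5, a word such as question should be emitted as two tokens, quest and ion, because anything longer than 5 characters is split at 5-character intervals.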
The keyword tokenizer (keyword tokenizer) is a simple tokenizer that emits the entire text as a single token and hands it to the token filters. It is a good choice when you only want to apply token filters without actually splitting the text (see the sketch after the example below).
POST _analyze
{
  "tokenizer": "keyword",
  "text": "To be or not to be, That is a question ———— 莎士比亚"
}
The result is as follows:
{ "tokens" : [ { "token" : "To be or not to be, That is a question ———— 莎士比亚", "start_offset" : 0, "end_offset" : 49, "type" : "word", "position" : 0 } ] }
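As a sketch of that use case (the index name keyword_test and the analyzer name my_keyword_analyzer are made up for illustration), the keyword tokenizer can be paired with the lowercase and asciifolding token filters so that the whole value is normalized but never split:
PUT keyword_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
Analyzing a value such as Crème Brûlée with my_keyword_analyzer should return the single token creme brulee.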
Letter tokenizer: letter tokenizer
The letter tokenizer (letter tokenizer) splits the text into tokens at non-letter characters.
POST _analyze { "tokenizer": "letter", "text":"To be or not to be, That is a question ———— 莎士比亚" } 结果如下: { "tokens" : [ { "token" : "To", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "be", "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 1 }, { "token" : "or", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 2 }, { "token" : "not", "start_offset" : 9, "end_offset" : 12, "type" : "word", "position" : 3 }, { "token" : "to", "start_offset" : 13, "end_offset" : 15, "type" : "word", "position" : 4 }, { "token" : "be", "start_offset" : 16, "end_offset" : 18, "type" : "word", "position" : 5 }, { "token" : "That", "start_offset" : 21, "end_offset" : 25, "type" : "word", "position" : 6 }, { "token" : "is", "start_offset" : 26, "end_offset" : 28, "type" : "word", "position" : 7 }, { "token" : "a", "start_offset" : 29, "end_offset" : 30, "type" : "word", "position" : 8 }, { "token" : "question", "start_offset" : 31, "end_offset" : 39, "type" : "word", "position" : 9 }, { "token" : "莎士比亚", "start_offset" : 45, "end_offset" : 49, "type" : "word", "position" : 10 } ] }
The lowercase tokenizer (lowercase tokenizer) combines the behavior of the regular letter tokenizer with that of the lowercase token filter (which, as you would expect, lowercases every token). The main reason it exists as a single tokenizer is performance: doing both operations in one pass is faster.
POST _analyze { "tokenizer": "lowercase", "text":"To be or not to be, That is a question ———— 莎士比亚" } 结果如下: { "tokens" : [ { "token" : "to", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "be", "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 1 }, { "token" : "or", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 2 }, { "token" : "not", "start_offset" : 9, "end_offset" : 12, "type" : "word", "position" : 3 }, { "token" : "to", "start_offset" : 13, "end_offset" : 15, "type" : "word", "position" : 4 }, { "token" : "be", "start_offset" : 16, "end_offset" : 18, "type" : "word", "position" : 5 }, { "token" : "that", "start_offset" : 21, "end_offset" : 25, "type" : "word", "position" : 6 }, { "token" : "is", "start_offset" : 26, "end_offset" : 28, "type" : "word", "position" : 7 }, { "token" : "a", "start_offset" : 29, "end_offset" : 30, "type" : "word", "position" : 8 }, { "token" : "question", "start_offset" : 31, "end_offset" : 39, "type" : "word", "position" : 9 }, { "token" : "莎士比亚", "start_offset" : 45, "end_offset" : 49, "type" : "word", "position" : 10 } ] }
The whitespace tokenizer (whitespace tokenizer) splits tokens on whitespace: spaces, tabs, newlines, and so on. Note, however, that it does not remove any punctuation.
POST _analyze { "tokenizer": "whitespace", "text":"To be or not to be, That is a question ———— 莎士比亚" } 结果如下: { "tokens" : [ { "token" : "To", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "be", "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 1 }, { "token" : "or", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 2 }, { "token" : "not", "start_offset" : 9, "end_offset" : 12, "type" : "word", "position" : 3 }, { "token" : "to", "start_offset" : 13, "end_offset" : 15, "type" : "word", "position" : 4 }, { "token" : "be,", "start_offset" : 16, "end_offset" : 19, "type" : "word", "position" : 5 }, { "token" : "That", "start_offset" : 21, "end_offset" : 25, "type" : "word", "position" : 6 }, { "token" : "is", "start_offset" : 26, "end_offset" : 28, "type" : "word", "position" : 7 }, { "token" : "a", "start_offset" : 29, "end_offset" : 30, "type" : "word", "position" : 8 }, { "token" : "question", "start_offset" : 31, "end_offset" : 39, "type" : "word", "position" : 9 }, { "token" : "————", "start_offset" : 40, "end_offset" : 44, "type" : "word", "position" : 10 }, { "token" : "莎士比亚", "start_offset" : 45, "end_offset" : 49, "type" : "word", "position" : 11 } ] }
The pattern tokenizer (pattern tokenizer) lets you specify an arbitrary pattern on which the text is split into tokens.
POST _analyze
{
  "tokenizer": "pattern",
  "text": "To be or not to be, That is a question ———— 莎士比亚"
}
Now let's hand-craft a tokenizer that splits on commas.
PUT pattern_test2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
In the example above, the custom analyzer my_analyzer defined under settings refers to a custom pattern tokenizer named my_tokenizer; at the same level as the analyzer definition we configure that tokenizer, setting its pattern to a comma.
POST pattern_test2/_analyze
{
  "tokenizer": "my_tokenizer",
  "text": "To be or not to be, That is a question ———— 莎士比亚"
}
The result is as follows:
{ "tokens" : [ { "token" : "To be or not to be", "start_offset" : 0, "end_offset" : 18, "type" : "word", "position" : 0 }, { "token" : " That is a question ———— 莎士比亚", "start_offset" : 19, "end_offset" : 49, "type" : "word", "position" : 1 } ] }
As the result shows, the text is split into two parts at the comma.
UAX URL email tokenizer: uax_url_email tokenizer
Take the following sample text, which contains both a URL and an email address:
作者:张开 来源:未知 原文:https://www.cnblogs.com/Neeo/articles/10402742.html 邮箱:xxxxxxx@xx.com 版权声明:本文为博主原创文章,转载请附上博文链接!
First, let's see what the standard tokenizer does with it:
POST _analyze { "tokenizer": "standard", "text":"作者:张开来源:未知原文:https://www.cnblogs.com/Neeo/articles/10402742.html邮箱:xxxxxxx@xx.com版权声明:本文为博主原创文章,转载请附上博文链接!" } 结果很长: { "tokens" : [ { "token" : "作", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 }, { "token" : "者", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1 }, { "token" : "张", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 2 }, { "token" : "开", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 3 }, { "token" : "来", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 4 }, { "token" : "源", "start_offset" : 6, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 5 }, { "token" : "未", "start_offset" : 8, "end_offset" : 9, "type" : "<IDEOGRAPHIC>", "position" : 6 }, { "token" : "知", "start_offset" : 9, "end_offset" : 10, "type" : "<IDEOGRAPHIC>", "position" : 7 }, { "token" : "原", "start_offset" : 10, "end_offset" : 11, "type" : "<IDEOGRAPHIC>", "position" : 8 }, { "token" : "文", "start_offset" : 11, "end_offset" : 12, "type" : "<IDEOGRAPHIC>", "position" : 9 }, { "token" : "https", "start_offset" : 13, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 10 }, { "token" : "www.cnblogs.com", "start_offset" : 21, "end_offset" : 36, "type" : "<ALPHANUM>", "position" : 11 }, { "token" : "Neeo", "start_offset" : 37, "end_offset" : 41, "type" : "<ALPHANUM>", "position" : 12 }, { "token" : "articles", "start_offset" : 42, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 13 }, { "token" : "10402742", "start_offset" : 51, "end_offset" : 59, "type" : "<NUM>", "position" : 14 }, { "token" : "html", "start_offset" : 60, "end_offset" : 64, "type" : "<ALPHANUM>", "position" : 15 }, { "token" : "邮", "start_offset" : 64, "end_offset" : 65, "type" : "<IDEOGRAPHIC>", "position" : 16 }, { "token" : "箱", "start_offset" : 65, "end_offset" : 66, "type" : "<IDEOGRAPHIC>", "position" : 17 }, { "token" : "xxxxxxx", "start_offset" : 67, "end_offset" : 74, "type" : "<ALPHANUM>", "position" : 18 }, { "token" : "xx.com", "start_offset" : 75, "end_offset" : 81, "type" : "<ALPHANUM>", "position" : 19 }, { "token" : "版", "start_offset" : 81, "end_offset" : 82, "type" : "<IDEOGRAPHIC>", "position" : 20 }, { "token" : "权", "start_offset" : 82, "end_offset" : 83, "type" : "<IDEOGRAPHIC>", "position" : 21 }, { "token" : "声", "start_offset" : 83, "end_offset" : 84, "type" : "<IDEOGRAPHIC>", "position" : 22 }, { "token" : "明", "start_offset" : 84, "end_offset" : 85, "type" : "<IDEOGRAPHIC>", "position" : 23 }, { "token" : "本", "start_offset" : 86, "end_offset" : 87, "type" : "<IDEOGRAPHIC>", "position" : 24 }, { "token" : "文", "start_offset" : 87, "end_offset" : 88, "type" : "<IDEOGRAPHIC>", "position" : 25 }, { "token" : "为", "start_offset" : 88, "end_offset" : 89, "type" : "<IDEOGRAPHIC>", "position" : 26 }, { "token" : "博", "start_offset" : 89, "end_offset" : 90, "type" : "<IDEOGRAPHIC>", "position" : 27 }, { "token" : "主", "start_offset" : 90, "end_offset" : 91, "type" : "<IDEOGRAPHIC>", "position" : 28 }, { "token" : "原", "start_offset" : 91, "end_offset" : 92, "type" : "<IDEOGRAPHIC>", "position" : 29 }, { "token" : "创", "start_offset" : 92, "end_offset" : 93, "type" : "<IDEOGRAPHIC>", "position" : 30 }, { "token" : "文", "start_offset" : 93, "end_offset" : 94, "type" : "<IDEOGRAPHIC>", "position" : 31 }, { "token" : "章", "start_offset" : 94, "end_offset" : 95, "type" : "<IDEOGRAPHIC>", "position" : 32 }, { "token" : 
"转", "start_offset" : 96, "end_offset" : 97, "type" : "<IDEOGRAPHIC>", "position" : 33 }, { "token" : "载", "start_offset" : 97, "end_offset" : 98, "type" : "<IDEOGRAPHIC>", "position" : 34 }, { "token" : "请", "start_offset" : 98, "end_offset" : 99, "type" : "<IDEOGRAPHIC>", "position" : 35 }, { "token" : "附", "start_offset" : 99, "end_offset" : 100, "type" : "<IDEOGRAPHIC>", "position" : 36 }, { "token" : "上", "start_offset" : 100, "end_offset" : 101, "type" : "<IDEOGRAPHIC>", "position" : 37 }, { "token" : "博", "start_offset" : 101, "end_offset" : 102, "type" : "<IDEOGRAPHIC>", "position" : 38 }, { "token" : "文", "start_offset" : 102, "end_offset" : 103, "type" : "<IDEOGRAPHIC>", "position" : 39 }, { "token" : "链", "start_offset" : 103, "end_offset" : 104, "type" : "<IDEOGRAPHIC>", "position" : 40 }, { "token" : "接", "start_offset" : 104, "end_offset" : 105, "type" : "<IDEOGRAPHIC>", "position" : 41 } ] }
Either way, this result is not what we want: the email address and the URL have been chopped into pieces. For cases like this we should use the UAX URL email tokenizer (uax_url_email tokenizer), which keeps email addresses and URLs intact as single tokens.
POST _analyze { "tokenizer": "uax_url_email", "text":"作者:张开来源:未知原文:https://www.cnblogs.com/Neeo/articles/10402742.html邮箱:xxxxxxx@xx.com版权声明:本文为博主原创文章,转载请附上博文链接!" } 结果如下: { "tokens" : [ { "token" : "作", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 }, { "token" : "者", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1 }, { "token" : "张", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 2 }, { "token" : "开", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 3 }, { "token" : "来", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 4 }, { "token" : "源", "start_offset" : 6, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 5 }, { "token" : "未", "start_offset" : 8, "end_offset" : 9, "type" : "<IDEOGRAPHIC>", "position" : 6 }, { "token" : "知", "start_offset" : 9, "end_offset" : 10, "type" : "<IDEOGRAPHIC>", "position" : 7 }, { "token" : "原", "start_offset" : 10, "end_offset" : 11, "type" : "<IDEOGRAPHIC>", "position" : 8 }, { "token" : "文", "start_offset" : 11, "end_offset" : 12, "type" : "<IDEOGRAPHIC>", "position" : 9 }, { "token" : "https://www.cnblogs.com/Neeo/articles/10402742.html", "start_offset" : 13, "end_offset" : 64, "type" : "<URL>", "position" : 10 }, { "token" : "邮", "start_offset" : 64, "end_offset" : 65, "type" : "<IDEOGRAPHIC>", "position" : 11 }, { "token" : "箱", "start_offset" : 65, "end_offset" : 66, "type" : "<IDEOGRAPHIC>", "position" : 12 }, { "token" : "xxxxxxx@xx.com", "start_offset" : 67, "end_offset" : 81, "type" : "<EMAIL>", "position" : 13 }, { "token" : "版", "start_offset" : 81, "end_offset" : 82, "type" : "<IDEOGRAPHIC>", "position" : 14 }, { "token" : "权", "start_offset" : 82, "end_offset" : 83, "type" : "<IDEOGRAPHIC>", "position" : 15 }, { "token" : "声", "start_offset" : 83, "end_offset" : 84, "type" : "<IDEOGRAPHIC>", "position" : 16 }, { "token" : "明", "start_offset" : 84, "end_offset" : 85, "type" : "<IDEOGRAPHIC>", "position" : 17 }, { "token" : "本", "start_offset" : 86, "end_offset" : 87, "type" : "<IDEOGRAPHIC>", "position" : 18 }, { "token" : "文", "start_offset" : 87, "end_offset" : 88, "type" : "<IDEOGRAPHIC>", "position" : 19 }, { "token" : "为", "start_offset" : 88, "end_offset" : 89, "type" : "<IDEOGRAPHIC>", "position" : 20 }, { "token" : "博", "start_offset" : 89, "end_offset" : 90, "type" : "<IDEOGRAPHIC>", "position" : 21 }, { "token" : "主", "start_offset" : 90, "end_offset" : 91, "type" : "<IDEOGRAPHIC>", "position" : 22 }, { "token" : "原", "start_offset" : 91, "end_offset" : 92, "type" : "<IDEOGRAPHIC>", "position" : 23 }, { "token" : "创", "start_offset" : 92, "end_offset" : 93, "type" : "<IDEOGRAPHIC>", "position" : 24 }, { "token" : "文", "start_offset" : 93, "end_offset" : 94, "type" : "<IDEOGRAPHIC>", "position" : 25 }, { "token" : "章", "start_offset" : 94, "end_offset" : 95, "type" : "<IDEOGRAPHIC>", "position" : 26 }, { "token" : "转", "start_offset" : 96, "end_offset" : 97, "type" : "<IDEOGRAPHIC>", "position" : 27 }, { "token" : "载", "start_offset" : 97, "end_offset" : 98, "type" : "<IDEOGRAPHIC>", "position" : 28 }, { "token" : "请", "start_offset" : 98, "end_offset" : 99, "type" : "<IDEOGRAPHIC>", "position" : 29 }, { "token" : "附", "start_offset" : 99, "end_offset" : 100, "type" : "<IDEOGRAPHIC>", "position" : 30 }, { "token" : "上", "start_offset" : 100, "end_offset" : 101, "type" : "<IDEOGRAPHIC>", "position" : 31 }, { "token" : "博", "start_offset" : 101, "end_offset" : 102, "type" : 
"<IDEOGRAPHIC>", "position" : 32 }, { "token" : "文", "start_offset" : 102, "end_offset" : 103, "type" : "<IDEOGRAPHIC>", "position" : 33 }, { "token" : "链", "start_offset" : 103, "end_offset" : 104, "type" : "<IDEOGRAPHIC>", "position" : 34 }, { "token" : "接", "start_offset" : 104, "end_offset" : 105, "type" : "<IDEOGRAPHIC>", "position" : 35 } ] }
The path hierarchy tokenizer (path hierarchy tokenizer) lets file-system paths be indexed in a particular way, so that at search time files sharing the same path prefix are also returned as results.
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/python/python2.7"
}
The returned result is as follows:
{ "tokens" : [ { "token" : "/usr", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "/usr/local", "start_offset" : 0, "end_offset" : 10, "type" : "word", "position" : 0 }, { "token" : "/usr/local/python", "start_offset" : 0, "end_offset" : 17, "type" : "word", "position" : 0 }, { "token" : "/usr/local/python/python2.7", "start_offset" : 0, "end_offset" : 27, "type" : "word", "position" : 0 } ] }
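The delimiter is configurable. A sketch (the index name path_test and the names my_analyzer and my_path_tokenizer are made up for illustration) that treats dashes as the path separator and rewrites them to slashes in the emitted tokens:
PUT path_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_path_tokenizer"
        }
      },
      "tokenizer": {
        "my_path_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/"
        }
      }
    }
  }
}
Analyzing a value such as one-two-three with my_analyzer should yield roughly one, one/two, and one/two/three.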