ES Analyzers and Tokenizers
1. Normalization
Normalization is the step in which, after ES has tokenized a text field, each analyzer post-processes the tokens according to its own rules: e.g. was => be (reducing tense), brother's => brother (stripping the possessive), Watch => watch (lowercasing). It may also drop words with little search relevance, such as a, an, and is. Each analyzer performs this step differently.
In short: Normalization applies search-friendly transformations to the token stream, which improves search quality; the exact Normalization pipeline varies from analyzer to analyzer.
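To see Normalization in action, run a short text through the built-in english analyzer. A minimal sketch (the exact output may vary slightly by ES version):
GET _analyze
{
  "analyzer": "english",
  "text": "Watching my brother's watches"
}
The analyzer lowercases the input, strips the possessive, drops the stop word my, and stems both Watching and watches, yielding roughly watch, brother, watch.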
2. Tokenizers (official documentation)
The official distribution ships more than ten tokenizers; the default is standard (per the docs: the standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm; it removes most punctuation and is the best choice for most languages).
2.1 Common tokenizers (two are shown here; see the documentation for the rest)
The standard tokenizer
GET _analyze
{
  "text": "Xiao chao was a good man",
  "tokenizer": "standard"
}
The resulting tokens:
xiao, chao, was, a, good, man
The english analyzer
GET _analyze
{
  "text": "Xiao chao was a good man",
  "analyzer": "english"
}
The resulting tokens:
xiao, chao, good, man
Unlike standard, the english analyzer discards words with low search relevance such as was and a.
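Strictly speaking, english is a built-in analyzer rather than a bare tokenizer, which is why the request above passes analyzer instead of tokenizer. Per the official docs it can be rebuilt as a custom analyzer along these lines (a sketch; english_example and rebuilt_english are illustrative names):
PUT english_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_stemmer"
          ]
        }
      }
    }
  }
}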
2.2 Chinese tokenizers
For Chinese word segmentation, see the companion article ES 中文分词器ik.
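As a quick taste (this assumes the analysis-ik plugin is installed), ik ships with two analyzers, ik_smart (coarse-grained) and ik_max_word (fine-grained):
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}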
3. Custom analyzers
Putting the pieces above together, let's build a custom analyzer:
PUT test_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "custom_char_filter": {
          "type": "mapping",
          "mappings": [
            "&=>and",
            "|=>or",
            "!=>not"
          ]
        },
        "custom_html_strip_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        },
        "custom_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      },
      "filter": {
        "custom_stop_filter": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": ["and", "is", "friend"]
        }
      },
      "tokenizer": {
        "custom_tokenizer": {
          "type": "pattern",
          "pattern": "[ ,!.?]"
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "custom_tokenizer",
          "char_filter": [
            "custom_char_filter",
            "custom_html_strip_filter",
            "custom_pattern_replace_filter"
          ],
          "filter": ["custom_stop_filter"]
        }
      }
    }
  }
}
This analyzer chains all three kinds of character filter with a single token filter:
- custom_char_filter (mapping) rewrites &, | and ! to and, or and not;
- custom_html_strip_filter (html_strip) removes HTML tags, except <a>;
- custom_pattern_replace_filter (pattern_replace) masks the middle four digits of an 11-digit phone number;
- custom_tokenizer (pattern) splits on spaces and the punctuation [ ,!.?];
- custom_stop_filter (stop) drops and, is and friend, case-insensitively.
For more on filters, see ES 字符过滤器&令牌过滤器; for more on tokenizers, see ES 分词器 (the example above uses the pattern tokenizer; see the official documentation).
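Defining the analyzer in settings does not apply it to anything by itself; a field must reference it in the mapping. A minimal sketch (the field name content is hypothetical):
PUT test_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "custom_analyzer"
    }
  }
}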
After creating the index, exercise the analyzer with the following request:
GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "&.|,!?13366666666.You and me is Friend <p>超链接</p>"
}
The result is as follows. Note that and, is and Friend were removed by the stop filter (& had first been mapped to and), the phone number was masked, and the <p> tags were stripped (the newlines around 超链接 come from html_strip):
{ "tokens" : [ { "token" : "or", "start_offset" : 2, "end_offset" : 3, "type" : "word", "position" : 1 }, { "token" : "not", "start_offset" : 4, "end_offset" : 5, "type" : "word", "position" : 2 }, { "token" : "133****6666", "start_offset" : 6, "end_offset" : 17, "type" : "word", "position" : 3 }, { "token" : "You", "start_offset" : 18, "end_offset" : 21, "type" : "word", "position" : 4 }, { "token" : "me", "start_offset" : 26, "end_offset" : 28, "type" : "word", "position" : 6 }, { "token" : """ 超链接 """, "start_offset" : 39, "end_offset" : 49, "type" : "word", "position" : 9 } ] }