7. Elasticsearch Series: Text Analysis (Tokenization)
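1. Analysis and Analyzer
- Analysis: text analysis is the process of converting full text into a series of terms (tokens); it is also called tokenization
- Analysis is carried out by an analyzer. You can use one of Elasticsearch's built-in analyzers or define a custom analyzer as needed
- Analysis does not only apply when terms are produced at write (index) time: at search time the query string must be analyzed with the same analyzer, so that the query terms match the indexed terms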
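Why the query must go through the same analysis as the documents can be illustrated with a toy inverted index. This is only a minimal sketch in plain Python: the `analyze`, `build_index`, and `search` helpers are hypothetical names invented for the illustration and have nothing to do with Elasticsearch's internals.

```python
from collections import defaultdict


def analyze(text: str) -> list[str]:
    """Toy analyzer: split on whitespace and lowercase every term."""
    return [word.lower() for word in text.split()]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Build an inverted index that maps each analyzed term to document ids."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str, analyzed: bool = True) -> set[int]:
    """Return ids of documents that contain every query term."""
    # With analyzed=False the raw query words are looked up as-is,
    # which simulates using a different analyzer at search time.
    terms = analyze(query) if analyzed else query.split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {1: "I Love China", 2: "I love Elasticsearch"}
index = build_index(docs)

print(search(index, "LOVE China"))                  # query analyzed like the documents
print(search(index, "LOVE China", analyzed=False))  # raw query terms find nothing
```

The index only contains lowercased terms, so the unanalyzed query term `LOVE` has nothing to match, while the analyzed query finds document 1.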
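2. Components of an Analyzer
An analyzer consists of three parts, applied in this order
- Character Filters (preprocess the raw text, for example stripping HTML)
- Tokenizer (splits the text into tokens according to rules)
- Token Filters (post-process the tokens: lowercasing, removing stopwords, adding synonyms)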
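The three stages can be sketched as a small pipeline in plain Python. This is a simplified illustration under stated assumptions, not how Elasticsearch implements them: the function names and the tiny stopword set are made up for the example.

```python
import html
import re

# Illustrative stopword subset, chosen only for this example.
STOPWORDS = {"a", "an", "the", "is"}


def strip_html(text: str) -> str:
    """Character filter: replace tags with spaces and decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def tokenize(text: str) -> list[str]:
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def filter_tokens(tokens: list[str]) -> list[str]:
    """Token filters: lowercase each token, then drop stopwords."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in STOPWORDS]


def analyze(text: str) -> list[str]:
    """Run the stages in order: character filter, tokenizer, token filters."""
    return filter_tokens(tokenize(strip_html(text)))


print(analyze("<p>The Quick &amp; Brown Fox</p>"))
```

The order matters: the character filter sees the raw string, the tokenizer sees cleaned text, and the token filters only ever see individual tokens.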
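3. Elasticsearch Built-in Analyzers
- Standard Analyzer - the default analyzer; splits text on word boundaries and lowercases the tokens
- Keyword Analyzer - does not tokenize; emits the input unchanged as a single token
- Custom Analyzer - a user-defined analyzer
- Simple Analyzer / Stop Analyzer / Whitespace Analyzer / Pattern Analyzer / Language Analyzers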
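To get a feel for how these analyzers differ, here is a rough approximation of each in plain Python. These are simplified stand-ins written for illustration only: the real Standard Analyzer follows Unicode text segmentation rules, the real Stop Analyzer uses a fuller English stopword list, and the Pattern Analyzer is shown with its default pattern `\W+`.

```python
import re

# Small illustrative subset, not Elasticsearch's actual stopword list.
ENGLISH_STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to"}


def standard(text: str) -> list[str]:
    """Approximation: runs of word characters, lowercased."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def keyword(text: str) -> list[str]:
    """No tokenization: the whole input becomes a single token."""
    return [text]


def whitespace(text: str) -> list[str]:
    """Split on whitespace only; case is preserved."""
    return text.split()


def simple(text: str) -> list[str]:
    """Split at any non-letter character and lowercase."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def stop(text: str) -> list[str]:
    """Like simple, but also removes stopwords."""
    return [token for token in simple(text) if token not in ENGLISH_STOPWORDS]


def pattern(text: str) -> list[str]:
    """Split on the regular expression \\W+ and lowercase."""
    return [token.lower() for token in re.split(r"\W+", text) if token]


text = "The 2 Quick-Brown foxes"
for analyzer in (standard, keyword, whitespace, simple, stop, pattern):
    print(f"{analyzer.__name__:<10} {analyzer(text)}")
```

On this input the digit `2` survives in the standard and pattern approximations but is dropped by simple and stop, and only whitespace keeps `Quick-Brown` together with its original case.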
4. Using the _analyze API
# Test an analyzer directly
GET _analyze
{
  "analyzer": "standard",
  "text": "I Love China"
}
# Test using the analyzer of a field in a specific index
POST books/_analyze
{
  "field": "title",
  "text": "I Love China"
}
# Test an ad hoc combination of tokenizer and token filters
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "I Love China"
}
# Result
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "love",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "china",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
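The start_offset, end_offset, and position fields in the response can be reproduced by hand for this simple input. The following is a minimal sketch in plain Python, assuming a tokenizer that splits on runs of word characters followed by lowercasing; `analyze_with_offsets` is a hypothetical helper, and the `type` field is left out.

```python
import re


def analyze_with_offsets(text: str) -> list[dict]:
    """Tokenize on word characters and record offsets and positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),  # lowercase token filter
            "start_offset": match.start(),   # index of the first character
            "end_offset": match.end(),       # index just past the last character
            "position": position,            # ordinal of the token in the stream
        })
    return tokens


tokens = analyze_with_offsets("I Love China")
for token in tokens:
    print(token)
```

Offsets always refer to the original text, which is why they stay valid even though the emitted tokens have been lowercased.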
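5. Installing and Using the IK Chinese Analyzer
Download the analyzer release that matches your Elasticsearch version from https://github.com/medcl/elasticsearch-analysis-ik/releases
Unzip it into the plugins directory under the ES installation directory, rename the folder to ik, and restart ES.
For a docker-compose deployment, see https://gitee.com/SJshenjian/blog-code/tree/master/src/main/java/online/shenjian/es, mount the plugins directory as shown in its docker-compose.yml, and restart.
Verify the IK Chinese analyzer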
# Coarsest-grained segmentation
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我爱中华人民共和国"
}
# Result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "爱",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中华人民共和国",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
# Finest-grained segmentation
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我爱中华人民共和国"
}
# Result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "爱",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中华人民共和国",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中华人民",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "中华",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "华人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "人民共和国",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "人民",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "共和国",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "共和",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "国",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 10
    }
  ]
}
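The difference between the two granularities can be imitated with a toy dictionary segmenter. To be clear, this is not IK's actual algorithm: it is a simplified sketch that uses greedy longest matching for the coarse mode and emits every dictionary word for the fine mode. The `DICTIONARY` below is a hypothetical word list hand-picked for this sentence.

```python
# Hypothetical mini dictionary, hand-picked for this example only.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国",
}
TEXT = "我爱中华人民共和国"


def coarse_grained(text: str, dictionary: set[str]) -> list[str]:
    """Greedy forward matching: take the longest dictionary word at each step."""
    max_len = max(map(len, dictionary))
    tokens = []
    start = 0
    while start < len(text):
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end] in dictionary:
                break
        else:
            end = start + 1  # no dictionary word here: emit a single character
        tokens.append(text[start:end])
        start = end
    return tokens


def fine_grained(text: str, dictionary: set[str]) -> list[str]:
    """Emit every dictionary word found at every start position."""
    max_len = max(map(len, dictionary))
    tokens = []
    covered_until = 0  # end of the furthest dictionary word seen so far
    for start in range(len(text)):
        found = False
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
                covered_until = max(covered_until, end)
                found = True
        # Keep characters that no dictionary word covers as single tokens.
        if not found and start >= covered_until:
            tokens.append(text[start])
    return tokens


print(coarse_grained(TEXT, DICTIONARY))
print(fine_grained(TEXT, DICTIONARY))
```

With this hand-picked dictionary the coarse mode yields three tokens and the fine mode yields eleven overlapping ones, which mirrors the shape of the ik_smart and ik_max_word results above. A common reason to care about the difference is that finer segmentation at index time tends to improve recall, while coarser segmentation of the query tends to improve precision.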