Elasticsearch Custom Analysis: Pinyin Tokenization and Phone-Number Tokenization
I. Requirements
This article summarizes a requirement from a work project: implementing fuzzy, type-ahead search with Elasticsearch. The concrete requirement is to search financial data such as funds, A-shares, and Hong Kong stocks, with suggestions driven by the name itself, by pinyin initials, and by partial full pinyin. Note that the names of financial instruments may contain not only Chinese characters but also English letters, digits, and special characters.
II. Design
For performance reasons the official documentation advises using Elasticsearch's fuzzy-style queries with caution, although compared with other search tools their performance is still quite good. The usual fuzzy-matching queries such as wildcard, fuzzy, and query_string do not fully fit the business requirement here, so the problem is approached from a different angle: use more flexible analyzers to solve multi-condition fuzzy search.
Compared with the traditional standard analyzer or the ik analyzer, the ngram tokenizer has the advantage of producing tokens for special characters as well, so it is used for searching the field text itself; for full-pinyin and pinyin-initial search, a custom analyzer that combines the keyword tokenizer with the pinyin filter is used.
III. Custom Analysis Basics
An analyzer, whether built-in or custom, is simply a package of three building blocks: character filters, a tokenizer, and token filters.
The role of each building block:
- character filters: preprocessing before tokenization, e.g. stripping useless characters
- tokenizer: splits the text into tokens
- token filters: stop words, stemming, lowercasing, synonyms, filler-word handling, and so on
All three stages can be exercised together in a single _analyze call, as sketched below.
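A minimal sketch using only built-in components (the html_strip character filter, the standard tokenizer, and the lowercase token filter), just to show how the three stages chain together:
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Hello WORLD</p>"
}
The response contains the two lowercased tokens hello and world, with the <p> tags already removed before tokenization.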
1. Character filters
Character filters are used to preprocess the stream of characters before it is passed to the tokenizer.
A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For instance, a character filter could be used to convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or to strip HTML elements like <b> from the stream.
In other words, character filters process the raw text before the tokenizer runs, adding, removing, or replacing characters.
| Character filter | Purpose |
|---|---|
| HTML Strip | Strips HTML tags and decodes HTML entities |
| Mapping | Replaces configured strings |
| Pattern Replace | Regex match and replace |
(1) HTML Strip (official docs)
Strips HTML tags; the main parameter, escaped_tags, lists the tags that should be kept. Example:
PUT test_index
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer":{
//tokenizer to use
"tokenizer":"keyword",
//character filter applied by this analyzer
"char_filter":"custom_char_filter"
}
},
//character filter definitions
"char_filter": {
"custom_char_filter":{
//type of this character filter
"type":"html_strip",
//HTML tags to leave untouched
"escaped_tags": [
"a"
]
}
}
}
}
}
Test the filter:
GET test_index/_analyze
{
"analyzer": "custom_analyzer",
"text": ["this is address of baidu<a>baidu</a><p>baidu content</p>"]
}
Result:
{
"tokens" : [
{
"token" : """this is address of baidu<a>baidu</a>
baidu content
""",
"start_offset" : 0,
"end_offset" : 56,
"type" : "word",
"position" : 0
}
]
}
As the result shows, every HTML tag except <a> has been stripped.
(2) Mapping (official docs)
Commonly used for masking sensitive words. Example:
PUT test_index
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer":{
"tokenizer":"keyword",
"char_filter":["custom_char_filter","custom_mapping_filter"]
}
},
"char_filter": {
"custom_char_filter":{
"type":"html_strip",
"escaped_tags": [
"a"
]
},
"custom_mapping_filter":{
"type": "mapping",
//replace every occurrence of "baidu" or "is" with **
"mappings": [
"baidu=>**",
"is=>**"
]
}
}
}
}
}
Run the analyze request:
GET test_index/_analyze
{
"analyzer": "custom_analyzer",
"text": ["this is address of baidu<a>baidu</a><p>baidu content</p>"]
}
Result:
{
"tokens" : [
{
"token" : """th** ** address of **<a>**</a>
** content
""",
"start_offset" : 0,
"end_offset" : 56,
"type" : "word",
"position" : 0
}
]
}
(3) Pattern Replace (official docs)
Mainly used to replace structured content, i.e. anything that can be matched with a regular expression. Example:
PUT test_index
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer":{
"tokenizer":"keyword",
"char_filter":["custom_char_filter","custom_mapping_filter","custom_pattern_replace_filter"]
}
},
"char_filter": {
"custom_char_filter":{
"type":"html_strip",
"escaped_tags": [
"a"
]
},
"custom_mapping_filter":{
"type": "mapping",
"mappings": [
"baidu=>**",
"is=>**"
]
},
"custom_pattern_replace_filter":{
"type":"pattern_replace",
"pattern": "(\\d{3})\\d{4}(\\d{4})",
"replacement": "$1****$2"
}
}
}
}
}
On top of (1) and (2) this adds custom_pattern_replace_filter for regex-based replacement, used here to mask phone numbers.
Analyze request:
GET test_index/_analyze
{
"analyzer": "custom_analyzer",
"text": ["this is address of baidu<a>baidu</a><p>baidu content</p>telphone:13311112222"]
}
Result:
{
"tokens" : [
{
"token" : """th** ** address of **<a>**</a>
** content
telphone:133****2222""",
"start_offset" : 0,
"end_offset" : 76,
"type" : "word",
"position" : 0
}
]
}
The phone number 13311112222 has been masked as 133****2222.
2. Tokenizer
A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For instance, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into the terms [Quick, brown, fox!].
Word Oriented Tokenizers
The following tokenizers are usually used for tokenizing full text into individual words:
- standard: divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols and is the best choice for most languages.
- letter: divides text into terms whenever it encounters a character which is not a letter.
- lowercase: like the letter tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.
- whitespace: divides text into terms whenever it encounters any whitespace character.
- uax_url_email: like the standard tokenizer, except that it recognises URLs and email addresses as single tokens.
- classic: a grammar based tokenizer for the English language.
- thai: segments Thai text into words.
Partial Word Tokenizers
These tokenizers break up text or words into small fragments, for partial word matching:
- ngram: breaks text up into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then returns n-grams of each word: a sliding window of continuous letters, e.g. quick → [qu, ui, ic, ck].
- edge_ngram: breaks text up into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then returns n-grams of each word anchored to the start of the word, e.g. quick → [q, qu, qui, quic, quick].
Structured Text Tokenizers
The following tokenizers are usually used with structured text like identifiers, email addresses, zip codes, and paths, rather than with full text:
- keyword: a "noop" tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters like lowercase to normalise the analysed terms.
- pattern: uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.
- simple_pattern: uses a regular expression to capture matching text as terms. It uses a restricted subset of regular expression features and is generally faster than the pattern tokenizer.
- char_group: configurable through sets of characters to split on, which is usually less expensive than running regular expressions.
- simple_pattern_split: uses the same restricted regular expression subset as the simple_pattern tokenizer, but splits the input at matches rather than returning the matches as terms.
- path_hierarchy: takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree, e.g. /foo/bar/baz → [/foo, /foo/bar, /foo/bar/baz].
3. Token filters
Token filters accept a stream of tokens from a tokenizer and can modify tokens (eg lowercasing), delete tokens (eg remove stopwords) or add tokens (eg synonyms).
In other words, token filters add, remove, or modify the terms emitted by the tokenizer.
| Token filter | Purpose |
|---|---|
| Lowercase | Converts all terms to lowercase |
| Stop | Removes stop words |
| NGram / Edge NGram | Splits terms into (edge) n-grams |
| Synonym | Adds synonym terms |
There are far too many token filters to cover them all here (see the official docs); below is a commonly used one.
(1) Stop words: stop (official docs)
Stop words listed in the settings are not added to the inverted index.
PUT test_index
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer":{
"tokenizer":"ik_max_word",
"filter":["custom_stop_filter"]
}
},
"filter": {
"custom_stop_filter":{
"type": "stop",
"ignore_case": true,
"stopwords": [ "and", "is","friend" ]
}
}
}
}
}
Create the index with the settings above, then run the following analyze request:
GET test_index/_analyze
{
"analyzer": "custom_analyzer",
"text":"You and me IS friend"
}
Result:
{
"tokens" : [
{
"token" : "you",
"start_offset" : 0,
"end_offset" : 3,
"type" : "ENGLISH",
"position" : 0
},
{
"token" : "me",
"start_offset" : 8,
"end_offset" : 10,
"type" : "ENGLISH",
"position" : 1
}
]
}
Note: instead of listing the words inline, you can also point the filter at a stop-word file (the stopwords_path parameter), similar to how the ik analyzer loads its dictionaries; see the official docs.
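A minimal sketch of that variant, assuming a hypothetical file stopwords/custom_stop.txt under the Elasticsearch config directory with one stop word per line:
PUT test_index_file_stop
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "standard",
          "filter": ["custom_stop_filter"]
        }
      },
      "filter": {
        "custom_stop_filter": {
          "type": "stop",
          "ignore_case": true,
          // path resolved relative to the config directory
          "stopwords_path": "stopwords/custom_stop.txt"
        }
      }
    }
  }
}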
4. Custom analyzer
PUT test_index
{
"settings": {
"analysis": {
"char_filter": {
"custom_char_filter":{
"type":"mapping",
"mappings":[
"&=>and",
"|=>or",
"!=>not"
]
},
"custom_html_strip_filter":{
"type":"html_strip",
"escaped_tags":["a"]
},
"custom_pattern_replace_filter":{
"type":"pattern_replace",
"pattern": "(\\d{3})\\d{4}(\\d{4})",
"replacement": "$1****$2"
}
},
"filter": {
"custom_stop_filter":{
"type": "stop",
"ignore_case": true,
"stopwords": [ "and", "is","friend" ]
}
},
"tokenizer": {
"custom_tokenizer":{
"type":"pattern",
"pattern":"[ ,!.?]"
}
},
"analyzer": {
"custom_analyzer":{
"type":"custom",
"tokenizer":"custom_tokenizer",
"char_filter":["custom_char_filter","custom_html_strip_filter","custom_pattern_replace_filter"],
"filter":["custom_stop_filter"]
}
}
}
}
}
This custom analyzer combines character filters (all three types shown above) with a token filter (only the stop filter here).
For the filters, see the character filter and token filter sections above; for the tokenizer, see the tokenizer section (this example uses the pattern tokenizer; see its official docs).
After creating the index, run the following analyze request:
GET test_index/_analyze
{
"analyzer": "custom_analyzer",
"text":"&.|,!?13366666666.You and me is Friend <p>超链接</p>"
}
Result:
{
"tokens" : [
{
"token" : "or",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "not",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 2
},
{
"token" : "133****6666",
"start_offset" : 6,
"end_offset" : 17,
"type" : "word",
"position" : 3
},
{
"token" : "You",
"start_offset" : 18,
"end_offset" : 21,
"type" : "word",
"position" : 4
},
{
"token" : "me",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 6
},
{
"token" : """
超链接
""",
"start_offset" : 39,
"end_offset" : 49,
"type" : "word",
"position" : 9
}
]
}
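To use such an analyzer at index time it has to be referenced in the mapping. A minimal sketch, assuming a hypothetical text field named content added to the test_index created above:
PUT test_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      // index-time analyzer defined in the settings above
      "analyzer": "custom_analyzer"
    }
  }
}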
IV. Built-in Analyzers
An analyzer is an Elasticsearch component that, loosely speaking, splits a piece of text into words according to certain rules and normalizes those words. ES runs every text field through its analyzer and builds an inverted index from the resulting terms, which is precisely what makes ES queries so fast.
ES ships with a number of built-in analyzers; their behaviour is summarized below:
| Analyzer | Behaviour |
|---|---|
| Standard | The ES default; splits on word boundaries and lowercases |
| Simple | Splits on non-letters, drops them, and lowercases |
| Stop | Like Simple, but also removes stop words such as the, a, is |
| Whitespace | Splits on whitespace |
| Language | A family of analyzers for 30+ languages |
| Pattern | Splits on a regular expression, \W+ (non-word characters) by default |
| Keyword | No tokenization; the whole input becomes a single term |
For words and letters these analyzers cover just about everything. For Chinese, however, several characters combine into words and carry meaning together, which the analyzers above cannot handle. There are a number of Chinese analyzers, e.g. IK, jieba, and THULAC, of which the IK analyzer is the most widely used.
Besides Chinese characters we also constantly use pinyin: input methods, Baidu's search box, and so on all support pinyin-based suggestions. So once the data is stored in ES, how do we search it by pinyin? That is where the pinyin analyzer plugin helps; it was written by the same author as the ik analyzer.
Comparison of analyzer output
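The outputs below can be reproduced with _analyze requests along these lines (a sketch; it assumes the sample text is 白兔万岁a, that the ik and pinyin plugins are installed, and that ik_max_word is the ik mode being shown):
GET _analyze
{
  "analyzer": "standard",
  "text": "白兔万岁a"
}
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "白兔万岁a"
}
GET _analyze
{
  "analyzer": "pinyin",
  "text": "白兔万岁a"
}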
- standard analyzer: the ES default; Chinese text is split into single characters, and special characters are ignored
{
"tokens": [
{
"token": "白",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "兔",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "万",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "岁",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "a",
"start_offset": 4,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 4
}
]
}
- ik analyzer: suited to finding content by whole Chinese words; it likewise ignores special characters as well as the English characters here
{
"tokens": [
{
"token": "白兔",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "万岁",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "万",
"start_offset": 2,
"end_offset": 3,
"type": "TYPE_CNUM",
"position": 2
},
{
"token": "岁",
"start_offset": 3,
"end_offset": 4,
"type": "COUNT",
"position": 3
}
]
}
- pinyin analyzer: suited to finding the field by pinyin, again ignoring special characters
{
"tokens": [
{
"token": "bai",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "btwsa",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "tu",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "wan",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
},
{
"token": "sui",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 3
},
{
"token": "a",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 4
}
]
}
Chinese analyzer (IK)
The IK analyzer has two modes: ik_max_word and ik_smart.
- ik_max_word splits the text at the finest granularity, e.g. 华为手机 becomes 华为, 手, 手机
- ik_smart is coarse-grained, splitting 华为手机 into 华为, 手机.
# default standard analyzer
GET _analyze
{
"analyzer": "standard",
"text": ["我爱北京天安门!","it is so beautiful?"]
}
# ik analyzer, coarse-grained
GET _analyze
{
"analyzer": "ik_smart",
"text": ["我爱北京天安门!","it is so beautiful?"]
}
# ik analyzer, fine-grained
GET _analyze
{
"analyzer": "ik_max_word",
"text": ["我爱北京天安门!","it is so beautiful?"]
}
NGram tokenizers
edge_ngram and ngram are two tokenizers that ship with Elasticsearch and are usually configured when the index mapping is defined: once the gram sizes are set, the tokenizer can be assigned directly to an analyzer.
Note that from ES 7 onward the difference between min_gram and max_gram may not exceed 1 by default; with both set to 1, as in the examples below, the text is split one character at a time. If you need a larger spread, you must set index.max_ngram_diff to at least the difference between the two values, otherwise index creation fails.
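A minimal sketch of that setting (the index name is hypothetical); without the index.max_ngram_diff line, creating this index with min_gram 1 and max_gram 3 would be rejected:
PUT test_ngram_diff
{
  "settings": {
    // allow max_gram - min_gram up to 2
    "index.max_ngram_diff": 2,
    "analysis": {
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 3
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram"
        }
      }
    }
  }
}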
Effect of the gram sizes, taking 我是中国人 as the example text:
With gram size 1 (min_gram = max_gram = 1), the ngram tokenizer produces:
我 是 中 国 人
With min_gram 1 and max_gram 3, the ngram tokenizer produces:
我 我是 我是中 是 是中 是中国 中 中国 中国人 国 国人 人
The main difference between the two tokenizers is that edge_ngram only emits grams anchored at the first character, while ngram slides over every position. With the same gram sizes (1 to 3) and the text 我是中国人:
edge_ngram output
我 我是 我是中 (every gram starts at the first character 我 and grows one character at a time)
ngram output
我 我是 我是中 是 是中 是中国 中 中国 中国人 国 国人 人 (grams start at every character and grow one character at a time)
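A quick way to check the edge_ngram behaviour is to pass an inline tokenizer definition to _analyze (a sketch; the ngram variant with the same gram sizes is easier to test inside an index whose index.max_ngram_diff has been raised, as above):
GET _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    // grams of length 1 to 3, all anchored at the first character
    "min_gram": 1,
    "max_gram": 3
  },
  "text": "我是中国人"
}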
V. Inspecting Tokenization Results
Inspect the tokens of a stored document, e.g. the address_name field of document 55655083 in the user_addresses index:
GET user_addresses/_termvectors/55655083?fields=address_name
Inspect how the address_name field of user_addresses tokenizes the text 山东省青岛市黄岛区 (i.e. test with that field's analyzer):
GET user_addresses/_analyze
{
"field": "address_name",
"text": "山东省青岛市黄岛区"
}
Inspect the output of a named analyzer (GET or POST both work):
GET _analyze
{
"analyzer": "english",
"text": "Eating an apple a day keeps docker away"
}
Inspect the output of a given tokenizer plus filters (GET or POST both work):
GET _analyze
{
"tokenizer": "standard",
"filter": ["lowercase"],
"text":"Hello WORLD"
}
# result
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "world",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
VI. Specifying Analyzers
The index-time analyzer must be specified in the mapping and cannot be changed once set; changing it requires creating a new index.
The search-time analyzer defaults to the index-time analyzer. Using the same analyzer for reading and writing gives the best chance that query terms actually match the indexed terms.
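A minimal sketch of where the two are configured, using a hypothetical title field; search_analyzer only needs to be set when it should differ from the index-time analyzer:
PUT test_index_analyzers
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        // index-time analyzer, fixed once documents have been indexed
        "analyzer": "ik_max_word",
        // search-time analyzer; defaults to "analyzer" when omitted
        "search_analyzer": "ik_smart"
      }
    }
  }
}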
VII. Custom Pinyin Scheme
settings:
{
"analysis":{
"analyzer":{
"my_ngram_analyzer":{
"tokenizer":"my_ngram_tokenizer"
},
"my_pinyin_analyzer":{
"tokenizer":"keyword",
"filter":"py"
}
},
"tokenizer":{
"my_ngram_tokenizer":{
"type":"ngram",
"min_ngram":1,
"max_ngram":1
}
},
"filter":{
"py":{
"type":"pinyin",
"first_letter":"prefix",
// when true, the first letter of each character is emitted as a separate token
"keep_separate_first_letter":true,
"keep_full_pinyin":true,
"keep_joined_full_pinyin":true,
"keep_original":true,
// maximum length of the joined first-letter token
"limit_first_letter_length":16,
"lowercase":true,
"remove_duplicated_term":true
}
}
}
}
mapping:
{
  "properties":{
    "name":{
      "type":"text",
      "analyzer":"my_ngram_analyzer",
      "fields":{
        "PY":{
          "type":"text",
          "analyzer":"my_pinyin_analyzer",
          "term_vector":"with_positions_offsets",
          "boost":10.0
        }
      }
    }
  }
}
Taking text = "恒生电子" as an example, the custom pinyin analyzer my_pinyin_analyzer produces the result shown below.
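The output can be reproduced with a request along these lines (a sketch; test_pinyin_index is a hypothetical index created with the settings and mapping above):
GET test_pinyin_index/_analyze
{
  "analyzer": "my_pinyin_analyzer",
  "text": "恒生电子"
}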
{
"tokens": [
{
"token": "h",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "heng",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "恒生电子",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "hengshengdianzi",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "hsdz",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "s",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "sheng",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "d",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "dian",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "z",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 3
},
{
"token": "zi",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 3
}
]
}
On the application side, to balance precision against preserving the order of the user's input, match_phrase was chosen over match and match_phrase_prefix, and field search is kept separate from pinyin-initial search: Chinese input queries only the name field, while non-Chinese input queries only name.PY. The Java code is adjusted roughly as follows:
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
// imageStr is the user's input string
BoolQueryBuilder boolQueryBuilderKeyWord = QueryBuilders.boolQuery();
if (!imageStr.matches("(.*)[\u4e00-\u9fa5](.*)")) {
    // no Chinese characters: query the pinyin sub-field only
    boolQueryBuilderKeyWord.must(QueryBuilders.matchPhraseQuery("name.PY", imageStr));
} else {
    boolQueryBuilderKeyWord.must(QueryBuilders.matchPhraseQuery("name", imageStr));
}
VIII. Custom Phone-Number Scheme
PUT test_index
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 2,
"index.max_ngram_diff": 11,
"analysis": {
"analyzer": {
"ngram_analyzer": {
"tokenizer": "ngram_tokenizer"
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": "1",
"max_gram": "11"
}
}
}
},
"mappings": {
"properties": {
"mobile": {
"type": "text",
"fields": {
"split": {
"type": "text",
"analyzer": "ngram_analyzer"
}
}
}
}
}
}
Analyze request:
GET test_index/_analyze
{
"field": "mobile.split",
"text": "15154227089"
}
Result:
{
"tokens" : [
{
"token" : "1",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "15",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "151",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 2
},
{
"token" : "1515",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 3
},
{
"token" : "15154",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 4
},
{
"token" : "151542",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 5
},
{
"token" : "1515422",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 6
},
{
"token" : "15154227",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 7
},
{
"token" : "151542270",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 8
},
{
"token" : "1515422708",
"start_offset" : 0,
"end_offset" : 10,
"type" : "word",
"position" : 9
},
{
"token" : "15154227089",
"start_offset" : 0,
"end_offset" : 11,
"type" : "word",
"position" : 10
},
{
"token" : "5",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 11
},
{
"token" : "51",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 12
},
{
"token" : "515",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 13
},
{
"token" : "5154",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 14
},
{
"token" : "51542",
"start_offset" : 1,
"end_offset" : 6,
"type" : "word",
"position" : 15
},
{
"token" : "515422",
"start_offset" : 1,
"end_offset" : 7,
"type" : "word",
"position" : 16
},
{
"token" : "5154227",
"start_offset" : 1,
"end_offset" : 8,
"type" : "word",
"position" : 17
},
{
"token" : "51542270",
"start_offset" : 1,
"end_offset" : 9,
"type" : "word",
"position" : 18
},
{
"token" : "515422708",
"start_offset" : 1,
"end_offset" : 10,
"type" : "word",
"position" : 19
},
{
"token" : "5154227089",
"start_offset" : 1,
"end_offset" : 11,
"type" : "word",
"position" : 20
},
{
"token" : "1",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 21
},
{
"token" : "15",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 22
},
{
"token" : "154",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 23
},
{
"token" : "1542",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 24
},
{
"token" : "15422",
"start_offset" : 2,
"end_offset" : 7,
"type" : "word",
"position" : 25
},
{
"token" : "154227",
"start_offset" : 2,
"end_offset" : 8,
"type" : "word",
"position" : 26
},
{
"token" : "1542270",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 27
},
{
"token" : "15422708",
"start_offset" : 2,
"end_offset" : 10,
"type" : "word",
"position" : 28
},
{
"token" : "154227089",
"start_offset" : 2,
"end_offset" : 11,
"type" : "word",
"position" : 29
},
{
"token" : "5",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 30
},
{
"token" : "54",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 31
},
{
"token" : "542",
"start_offset" : 3,
"end_offset" : 6,
"type" : "word",
"position" : 32
},
{
"token" : "5422",
"start_offset" : 3,
"end_offset" : 7,
"type" : "word",
"position" : 33
},
{
"token" : "54227",
"start_offset" : 3,
"end_offset" : 8,
"type" : "word",
"position" : 34
},
{
"token" : "542270",
"start_offset" : 3,
"end_offset" : 9,
"type" : "word",
"position" : 35
},
{
"token" : "5422708",
"start_offset" : 3,
"end_offset" : 10,
"type" : "word",
"position" : 36
},
{
"token" : "54227089",
"start_offset" : 3,
"end_offset" : 11,
"type" : "word",
"position" : 37
},
{
"token" : "4",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 38
},
{
"token" : "42",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 39
},
{
"token" : "422",
"start_offset" : 4,
"end_offset" : 7,
"type" : "word",
"position" : 40
},
{
"token" : "4227",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 41
},
{
"token" : "42270",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 42
},
{
"token" : "422708",
"start_offset" : 4,
"end_offset" : 10,
"type" : "word",
"position" : 43
},
{
"token" : "4227089",
"start_offset" : 4,
"end_offset" : 11,
"type" : "word",
"position" : 44
},
{
"token" : "2",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 45
},
{
"token" : "22",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 46
},
{
"token" : "227",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 47
},
{
"token" : "2270",
"start_offset" : 5,
"end_offset" : 9,
"type" : "word",
"position" : 48
},
{
"token" : "22708",
"start_offset" : 5,
"end_offset" : 10,
"type" : "word",
"position" : 49
},
{
"token" : "227089",
"start_offset" : 5,
"end_offset" : 11,
"type" : "word",
"position" : 50
},
{
"token" : "2",
"start_offset" : 6,
"end_offset" : 7,
"type" : "word",
"position" : 51
},
{
"token" : "27",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 52
},
{
"token" : "270",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 53
},
{
"token" : "2708",
"start_offset" : 6,
"end_offset" : 10,
"type" : "word",
"position" : 54
},
{
"token" : "27089",
"start_offset" : 6,
"end_offset" : 11,
"type" : "word",
"position" : 55
},
{
"token" : "7",
"start_offset" : 7,
"end_offset" : 8,
"type" : "word",
"position" : 56
},
{
"token" : "70",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 57
},
{
"token" : "708",
"start_offset" : 7,
"end_offset" : 10,
"type" : "word",
"position" : 58
},
{
"token" : "7089",
"start_offset" : 7,
"end_offset" : 11,
"type" : "word",
"position" : 59
},
{
"token" : "0",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 60
},
{
"token" : "08",
"start_offset" : 8,
"end_offset" : 10,
"type" : "word",
"position" : 61
},
{
"token" : "089",
"start_offset" : 8,
"end_offset" : 11,
"type" : "word",
"position" : 62
},
{
"token" : "8",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 63
},
{
"token" : "89",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 64
},
{
"token" : "9",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 65
}
]
}
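With this index in place, a partial phone number can then be matched against the ngram sub-field. One possible form of the query (a sketch; match with operator and requires every gram of the input to be present, while term or match_phrase variants behave differently):
GET test_index/_search
{
  "query": {
    "match": {
      "mobile.split": {
        "query": "4227",
        // every gram produced from "4227" must be present in the document
        "operator": "and"
      }
    }
  }
}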
