ES custom analysis: pinyin tokenization and phone-number tokenization

I. Requirement Description

This article summarizes a requirement I ran into at work and solved with ES fuzzy search. The concrete requirement: search financial data such as funds, A-shares, and Hong Kong stocks, with suggest-style queries by pinyin initials, partial pinyin, and full pinyin of a field. Note that the names of financial instruments may contain not only Chinese characters but also English letters, digits, and special characters.

II. Solution Design

ES's built-in fuzzy-style queries come with performance caveats and the official documentation recommends using them carefully, although compared with other search tools their performance is still reasonable. The common fuzzy-search queries such as wildcard, fuzzy, and query_string do not fully fit this business requirement, so instead of forcing them to fit, we approach the problem from another angle and use more flexible analyzers to solve the multi-condition fuzzy-search problem.

Compared with the traditional standard or ik tokenizers, the advantage of the ngram tokenizer is that it also emits special characters as tokens, so it can be used for literal substring search on the field; for full-pinyin and pinyin-initial search, a custom analyzer combining the keyword tokenizer with the pinyin filter can be used, as sketched below.
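
As a preview, here is a minimal sketch of the combined mapping this article builds toward (the index and field names fund_index, name, and PY are illustrative, and it assumes the analysis-pinyin plugin is installed); the details are developed in sections VII and VIII:

PUT fund_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        // literal substring search over Chinese, English, digits, special characters
        "my_ngram_analyzer": { "tokenizer": "my_ngram_tokenizer" },
        // full-pinyin and pinyin-initial search
        "my_pinyin_analyzer": { "tokenizer": "keyword", "filter": "py" }
      },
      "tokenizer": {
        "my_ngram_tokenizer": { "type": "ngram", "min_gram": 1, "max_gram": 1 }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_full_pinyin": true,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "lowercase": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_ngram_analyzer",
        "fields": {
          "PY": { "type": "text", "analyzer": "my_pinyin_analyzer" }
        }
      }
    }
  }
}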

III. Custom Analysis Basics

An analyzer, whether built-in or custom, is simply a package of three building blocks: character filters, a tokenizer, and token filters.

The roles of these three building blocks are:

character filters: preprocess the character stream before tokenization, e.g. strip useless characters

tokenizer: splits the text into tokens

token filters: handle stop words, tense and case normalization, lowercasing, synonym expansion, filler words, and so on

1. Character filters

Character filters are used to preprocess the stream of characters before it is passed to the tokenizer.

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For instance, a character filter could be used to convert Hindu-Arabic numerals (٠‎١٢٣٤٥٦٧٨‎٩‎) into their Arabic-Latin equivalents (0123456789), or to strip HTML elements like <b> from the stream.

Character filters process the original text before the tokenizer runs, e.g. by adding, removing, or replacing characters.

Character filter    Purpose
HTML Strip          Strips HTML tags and decodes HTML entities
Mapping             Replaces strings according to a mapping
Pattern Replace     Regex-based replacement

(1) HTML Strip (see the official documentation)

Filters out HTML tags; its main parameter escaped_tags lists the HTML tags to keep. Example:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer":{
          // tokenizer to use
          "tokenizer":"keyword",
          // character filter used by this analyzer
          "char_filter":"custom_char_filter"
        }
      },
      // character filter definitions
      "char_filter": {
        "custom_char_filter":{
          // type of character filter
          "type":"html_strip",
          // HTML tags to leave untouched
          "escaped_tags": [
            "a"
          ]
        }
      }
    }
  }
}

Test the filter:

GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["this is address of baidu<a>baidu</a><p>baidu content</p>"]
}

The result is:

{
  "tokens" : [
    {
      "token" : """this is address of baidu<a>baidu</a>
baidu content
""",
      "start_offset" : 0,
      "end_offset" : 56,
      "type" : "word",
      "position" : 0
    }
  ]
}

As the result shows, all HTML tags except <a> were stripped.

(2) Mapping (see the official documentation)

Commonly used for sensitive-word filtering. Example:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["custom_char_filter","custom_mapping_filter"]
        }
      },
      "char_filter": {
        "custom_char_filter":{
          "type":"html_strip",
          "escaped_tags": [
            "a"
          ]
        },
        "custom_mapping_filter":{
          "type": "mapping",
          // replace every occurrence of baidu or is with **
          "mappings": [
            "baidu=>**",
            "is=>**"
          ]
        }
      }
    }
  }
}

Run the following analyze request:

GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["this is address of baidu<a>baidu</a><p>baidu content</p>"]
}

The result is:

{
  "tokens" : [
    {
      "token" : """th** ** address of **<a>**</a>
** content
""",
      "start_offset" : 0,
      "end_offset" : 56,
      "type" : "word",
      "position" : 0
    }
  ]
}

(3) Pattern Replace (see the official documentation)

Mainly used to replace structured content, i.e. anything that can be matched with a regular expression. Example:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["custom_char_filter","custom_mapping_filter","custom_pattern_replace_filter"]
        }
      },
      "char_filter": {
        "custom_char_filter":{
          "type":"html_strip",
          "escaped_tags": [
            "a"
          ]
        },
        "custom_mapping_filter":{
          "type": "mapping",
          "mappings": [
            "baidu=>**",
            "is=>**"
          ]
        },
        "custom_pattern_replace_filter":{
          "type":"pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      }
    }
  }
}

On top of (1) and (2), this adds custom_pattern_replace_filter for regex replacement, here used to mask phone numbers.

Analyze request:

GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["this is address of baidu<a>baidu</a><p>baidu content</p>telphone:13311112222"]
}

The result is:

{
  "tokens" : [
    {
      "token" : """th** ** address of **<a>**</a>
** content
telphone:133****2222""",
      "start_offset" : 0,
      "end_offset" : 76,
      "type" : "word",
      "position" : 0
    }
  ]
}

The phone number 13311112222 was replaced with 133****2222.

2. Tokenizer

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For instance, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into the terms [Quick, brown, fox!].

Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into individual words:

  • Standard Tokenizer

    The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.

  • Letter Tokenizer

    The letter tokenizer divides text into terms whenever it encounters a character which is not a letter.

  • Lowercase Tokenizer

    The lowercase tokenizer, like the letter tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.

  • Whitespace Tokenizer

    The whitespace tokenizer divides text into terms whenever it encounters any whitespace character.

  • UAX URL Email Tokenizer

    The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.

  • Classic Tokenizer

    The classic tokenizer is a grammar based tokenizer for the English Language.

  • Thai Tokenizer

    The thai tokenizer segments Thai text into words.

Partial Word Tokenizers

These tokenizers break up text or words into small fragments, for partial word matching:

  • N-Gram Tokenizer

    The ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word: a sliding window of continuous letters, e.g. quick → [qu, ui, ic, ck].

  • Edge N-Gram Tokenizer

    The edge_ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word which are anchored to the start of the word, e.g. quick → [q, qu, qui, quic, quick].

Structured Text Tokenizers

The following tokenizers are usually used with structured text like identifiers, email addresses, zip codes, and paths, rather than with full text:

  • Keyword Tokenizer

    The keyword tokenizer is a “noop” tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters like lowercase to normalise the analysed terms.

  • Pattern Tokenizer

    The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.

  • Simple Pattern Tokenizer

    The simple_pattern tokenizer uses a regular expression to capture matching text as terms. It uses a restricted subset of regular expression features and is generally faster than the pattern tokenizer.

  • Char Group Tokenizer

    The char_group tokenizer is configurable through sets of characters to split on, which is usually less expensive than running regular expressions.

  • Simple Pattern Split Tokenizer

    The simple_pattern_split tokenizer uses the same restricted regular expression subset as the simple_pattern tokenizer, but splits the input at matches rather than returning the matches as terms.

  • Path Tokenizer

    The path_hierarchy tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree, e.g. /foo/bar/baz → [/foo, /foo/bar, /foo/bar/baz] (demonstrated below).
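
To try one of these quickly, the _analyze API accepts a built-in tokenizer name directly (a small illustration added here, not part of the original write-up):

GET _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/foo/bar/baz"
}

This returns the terms /foo, /foo/bar, and /foo/bar/baz, matching the description above.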

3. Token filters

Token filters accept a stream of tokens from a tokenizer and can modify tokens (eg lowercasing), delete tokens (eg remove stopwords) or add tokens (eg synonyms).

Token filters add, remove, or modify the terms emitted by the tokenizer.

Token filter          Purpose
Lowercase             Lowercases all terms
Stop                  Removes stop words
NGram / Edge NGram    Splits terms into (edge) n-grams
Synonym               Adds synonym terms

There are too many token filters to cover them all here (see the official documentation); below are a few commonly used ones.

(1) Stop words: stop (see the official documentation)

Stop words listed in the settings are not added to the inverted index.

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer":{
          "tokenizer":"ik_max_word",
          "filter":["custom_stop_filter"]
        }
      },
      "filter": {
        "custom_stop_filter":{
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "and", "is","friend" ]
        }
      }
    }
  }
}

Create the index with the settings above, then run the following analyze request:

GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text":"You and me IS friend"
}

The result is:

{
  "tokens" : [
    {
      "token" : "you",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "me",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "ENGLISH",
      "position" : 1
    }
  ]
}

Note: a stop-word file path can also be specified, similar to the ik tokenizer; see the official documentation.
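
Although only stop is shown above, other common token filters can be tested the same way. As a hedged illustration of the synonym filter, defined inline in _analyze (the synonym rule here is made up for the example):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": [ "es, elasticsearch" ]
    }
  ],
  "text": "ES is fast"
}

Here es and elasticsearch are emitted at the same position, so a search for either term can match.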

4. Custom analyzer

PUT test_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "custom_char_filter":{
          "type":"mapping",
          "mappings":[
            "&=>and",
            "|=>or",
            "!=>not"
            ]
        },
        "custom_html_strip_filter":{
          "type":"html_strip",
          "escaped_tags":["a"]
        },
        "custom_pattern_replace_filter":{
          "type":"pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      },
      "filter": {
        "custom_stop_filter":{
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "and", "is","friend" ]
        }
      },
      "tokenizer": {
        "custom_tokenizer":{
          "type":"pattern",
          "pattern":"[ ,!.?]"
        }
      }, 
      "analyzer": {
        "custom_analyzer":{
          "type":"custom",
          "tokenizer":"custom_tokenizer",
          "char_filter":["custom_char_filter","custom_html_strip_filter","custom_pattern_replace_filter"],
          "filter":["custom_stop_filter"]
        }
      }
    }
  }
}

This custom analyzer uses character filters (all three kinds covered above) and a token filter (only the stop filter here).

For filters, see the character filter and token filter sections above; for tokenizers, see the tokenizer section (this example uses the pattern tokenizer, see its official documentation).

After creating the index with the settings above, run the following analyze request:

GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text":"&.|,!?13366666666.You and me is Friend <p>超链接</p>"
}

The result is:

{
  "tokens" : [
    {
      "token" : "or",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "not",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "133****6666",
      "start_offset" : 6,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "You",
      "start_offset" : 18,
      "end_offset" : 21,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "me",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : """
超链接
""",
      "start_offset" : 39,
      "end_offset" : 49,
      "type" : "word",
      "position" : 9
    }
  ]
}

IV. Built-in Analyzers

An analyzer is an ES component that, roughly speaking, splits a piece of text into terms according to some rules and normalizes those terms. ES analyzes text fields with their configured analyzer and builds an inverted index from the resulting terms; this is a large part of why ES queries are so fast.

ES ships with a number of built-in analyzers; their characteristics are summarized below:

Analyzer     Behavior
Standard     The ES default; splits on word boundaries and lowercases
Simple       Splits on non-letters, discards them, and lowercases
Stop         Like Simple, but also removes stop words such as the, a, is
Whitespace   Splits on whitespace
Language     Language-specific analyzers for more than 30 common languages
Pattern      Splits on a regular expression, default \W+ (non-word characters)
Keyword      No tokenization; the whole input becomes a single term

These analyzers handle words and letters, and for those cases the coverage is quite complete. For Chinese, however, characters combine into words, and several characters together often express a single meaning, so the analyzers above are not enough. There are several Chinese analyzers, such as IK, jieba, and THULAC, with IK being the most widely used.
   Besides Chinese characters we also use pinyin all the time: input methods and search boxes such as Baidu's support pinyin suggestions. So once the data is stored in ES, how do we search it by pinyin? This is where the pinyin analyzer plugin helps; its author is also the creator of the ik analyzer.

Comparing the output of different analyzers
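
For reference, results like the three below can be reproduced with _analyze requests of this form (the sample text 白兔万岁a is inferred from the token output, and the ik and pinyin plugins must be installed):

GET _analyze
{
  "analyzer": "standard",
  "text": "白兔万岁a"
}

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "白兔万岁a"
}

GET _analyze
{
  "analyzer": "pinyin",
  "text": "白兔万岁a"
}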

  • standard analyzer — the ES default; Chinese text is split into individual characters, and special characters are ignored
{
    "tokens": [
        {
            "token": "白",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "兔",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "万",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "岁",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "a",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 4
        }
    ]
}
  • ik analyzer — suited to finding content by the Chinese words it contains; special characters and English letters are likewise ignored here
{
    "tokens": [
        {
            "token": "白兔",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "万岁",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "万",
            "start_offset": 2,
            "end_offset": 3,
            "type": "TYPE_CNUM",
            "position": 2
        },
        {
            "token": "岁",
            "start_offset": 3,
            "end_offset": 4,
            "type": "COUNT",
            "position": 3
        }
    ]
}
  • pinyin analyzer — suited to finding a field by its pinyin; special characters are ignored
{
    "tokens": [
        {
            "token": "bai",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "btwsa",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "tu",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 1
        },
        {
            "token": "wan",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 2
        },
        {
            "token": "sui",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 3
        },
        {
            "token": "a",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 4
        }
    ]
}

The IK Chinese analyzer

The IK plugin provides two analysis modes: ik_max_word and ik_smart (see the requests below).

  • ik_max_word splits the text at the finest granularity, e.g. 华为手机 into 华为, 手, 手机
  • ik_smart splits at a coarser granularity, e.g. 华为手机 into 华为, 手机.
# default standard analyzer
GET _analyze
{
  "analyzer": "standard",
  "text": ["我爱北京天安门!","it is so beautiful?"]
}

# ik analyzer, coarse-grained
GET _analyze
{
  "analyzer": "ik_smart",
  "text": ["我爱北京天安门!","it is so beautiful?"]
}

# ik analyzer, fine-grained
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": ["我爱北京天安门!","it is so beautiful?"]
}

NGram tokenizers

edge_ngram and ngram are two tokenizers that ship with Elasticsearch. They are usually configured in the index settings when defining the mapping: set the gram lengths, then reference the tokenizer from the analyzer's tokenizer property.

Note that since ES 7 the difference between max_gram and min_gram may not exceed index.max_ngram_diff, which defaults to 1. If you need a larger spread, set index.max_ngram_diff to at least that difference, otherwise creating the index fails (see the sketch below).
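
A minimal sketch of such settings (the index and analyzer names are illustrative), allowing grams of 1 to 3 characters by raising index.max_ngram_diff to 2; this index is reused in the comparison further down:

PUT ngram_demo
{
  "settings": {
    "index.max_ngram_diff": 2,
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer":      { "tokenizer": "my_ngram_tokenizer" },
        "my_edge_ngram_analyzer": { "tokenizer": "my_edge_ngram_tokenizer" }
      },
      "tokenizer": {
        "my_ngram_tokenizer":      { "type": "ngram",      "min_gram": 1, "max_gram": 3 },
        "my_edge_ngram_tokenizer": { "type": "edge_ngram", "min_gram": 1, "max_gram": 3 }
      }
    }
  }
}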

The effect of the gram length, using the text 我是中国人 as an example:
With a gram length of 1 (min_gram = max_gram = 1), the ngram tokenizer produces:
我 是 中 国 人
With min_gram = 1 and max_gram = 3, it produces:
我 我是 我是中 是 是中 是中国 中 中国 中国人 国 国人 人

The main difference is that edge_ngram only emits grams anchored at the first character, while ngram emits grams starting at every character. With both tokenizers set to min_gram = 1 and max_gram = 3 and the text 我是中国人 (demonstrated below):
edge_ngram:
我 我是 我是中 (every gram starts at the first character 我 and grows one character at a time up to the maximum length)
ngram:
我 我是 我是中 是 是中 是中国 中 中国 中国人 国 国人 人 (grams start at every character and grow one character at a time up to the maximum length)
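
Using the ngram_demo index sketched above, the two outputs can be compared directly:

GET ngram_demo/_analyze
{
  "analyzer": "my_edge_ngram_analyzer",
  "text": "我是中国人"
}

GET ngram_demo/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "我是中国人"
}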

V. Inspecting Analysis Results

View the analysis of a specific document: the terms of the address_name field of the document with id 55655083 in the user_addresses index

GET user_addresses/_termvectors/55655083?fields=address_name

View how the address_name field of the user_addresses index analyzes the text 山东省青岛市黄岛区 (this uses the analyzer configured on that field):

GET user_addresses/_analyze
{
  "field": "address_name",
  "text": "山东省青岛市黄岛区"
}

View the output of a specific analyzer (GET or POST both work):

GET _analyze
{
  "analyzer": "english",
  "text": "Eating an apple a day keeps docker away"
}

View the output of a specific tokenizer plus filters (GET or POST both work):

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text":"Hello WORLD"
}
# result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

VI. Specifying Analyzers

The index-time (write-time) analyzer is specified in the mapping and cannot be changed once set; changing it requires creating a new index.

The search-time (read-time) analyzer defaults to the index-time analyzer. Keeping the two consistent is the best way to ensure query terms can actually match the indexed terms, as sketched below.
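
A hedged sketch of where the two analyzers are declared (index and field names are illustrative; assumes the ik plugin is installed); search_analyzer only needs to be set when it should differ from the index-time analyzer:

PUT article_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        // analyzer used when indexing documents
        "analyzer": "ik_max_word",
        // analyzer applied to query strings at search time
        "search_analyzer": "ik_smart"
      }
    }
  }
}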

VII. Custom Pinyin Scheme

settings:
{
    "analysis":{
        "analyzer":{
            "my_ngram_analyzer":{
                "tokenizer":"my_ngram_tokenizer"
            },
            "my_pinyin_analyzer":{
                "tokenizer":"keyword",
                "filter":"py"
            }
        },
        "tokenizer":{
            "my_ngram_tokenizer":{
                "type":"ngram",
                "min_ngram":1,
                "max_ngram":1
            }
        },
        "filter":{
            "py":{
                "type":"pinyin",
                "first_letter":"prefix",
                # when true, the first letter of each character is kept as a separate token
                "keep_separate_first_letter":true,
                "keep_full_pinyin":true,
                "keep_joined_full_pinyin":true,
                "keep_original":true,
                # max length of the joined first-letter token
                "limit_first_letter_length":16,
                "lowercase":true,
                "remove_duplicated_term":true
            }
        }
    }
}

mapping:
{
    "properties":{
        "name":{
            "type":"text",
            "analyzer":"my_ngram_analyzer",
            "fields":{
                "PY":{
                    "type":"text",
                    "analyzer":"my_pinyin_analyzer",
                    "term_vector":"with_positions_offsets",
                    "boost":10.0
                }
            }
        }
    }
}

Taking text = "恒生电子" as an example, the custom pinyin analyzer my_pinyin_analyzer produces:

{
    "tokens": [
        {
            "token": "h",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "heng",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "恒生电子",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "hengshengdianzi",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "hsdz",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "s",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 1
        },
        {
            "token": "sheng",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 1
        },
        {
            "token": "d",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 2
        },
        {
            "token": "dian",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 2
        },
        {
            "token": "z",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 3
        },
        {
            "token": "zi",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 3
        }
    ]
}

On the application side, considering both relevance precision and word order, match_phrase was chosen over match and match_phrase_prefix, and field search is kept separate from pinyin search: a query containing Chinese characters only searches the name field, while a non-Chinese query only searches name.PY. The Java code looks roughly like this:

// route the phrase query to the pinyin sub-field or the original field
BoolQueryBuilder boolQueryBuilderKeyWord = QueryBuilders.boolQuery();
if (!imageStr.matches("(.*)[\u4e00-\u9fa5](.*)")) {
    // no Chinese characters in the input: search by pinyin
    boolQueryBuilderKeyWord.must(QueryBuilders.matchPhraseQuery("name.PY", imageStr));
} else {
    // input contains Chinese characters: search the original field
    boolQueryBuilderKeyWord.must(QueryBuilders.matchPhraseQuery("name", imageStr));
}

VIII. Custom Digit (Phone Number) Scheme

PUT test_index
{
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2,
        "index.max_ngram_diff": 11,
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "tokenizer": "ngram_tokenizer"
                }
            },
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": "1",
                    "max_gram": "11"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "mobile": {
                "type": "text",
                "fields": {
                    "split": {
                        "type": "text",
                        "analyzer": "ngram_analyzer"
                    }
                }
            }
        }
    }
}

Analyze request:

GET test_index/_analyze
{
  "field": "mobile.split",
  "text": "15154227089"
}

Result:

{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "15",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "151",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "1515",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "15154",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "151542",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "1515422",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "15154227",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "151542270",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "1515422708",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "15154227089",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "5",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "51",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "515",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "word",
      "position" : 13
    },
    {
      "token" : "5154",
      "start_offset" : 1,
      "end_offset" : 5,
      "type" : "word",
      "position" : 14
    },
    {
      "token" : "51542",
      "start_offset" : 1,
      "end_offset" : 6,
      "type" : "word",
      "position" : 15
    },
    {
      "token" : "515422",
      "start_offset" : 1,
      "end_offset" : 7,
      "type" : "word",
      "position" : 16
    },
    {
      "token" : "5154227",
      "start_offset" : 1,
      "end_offset" : 8,
      "type" : "word",
      "position" : 17
    },
    {
      "token" : "51542270",
      "start_offset" : 1,
      "end_offset" : 9,
      "type" : "word",
      "position" : 18
    },
    {
      "token" : "515422708",
      "start_offset" : 1,
      "end_offset" : 10,
      "type" : "word",
      "position" : 19
    },
    {
      "token" : "5154227089",
      "start_offset" : 1,
      "end_offset" : 11,
      "type" : "word",
      "position" : 20
    },
    {
      "token" : "1",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 21
    },
    {
      "token" : "15",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 22
    },
    {
      "token" : "154",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 23
    },
    {
      "token" : "1542",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 24
    },
    {
      "token" : "15422",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "word",
      "position" : 25
    },
    {
      "token" : "154227",
      "start_offset" : 2,
      "end_offset" : 8,
      "type" : "word",
      "position" : 26
    },
    {
      "token" : "1542270",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 27
    },
    {
      "token" : "15422708",
      "start_offset" : 2,
      "end_offset" : 10,
      "type" : "word",
      "position" : 28
    },
    {
      "token" : "154227089",
      "start_offset" : 2,
      "end_offset" : 11,
      "type" : "word",
      "position" : 29
    },
    {
      "token" : "5",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 30
    },
    {
      "token" : "54",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 31
    },
    {
      "token" : "542",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 32
    },
    {
      "token" : "5422",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "word",
      "position" : 33
    },
    {
      "token" : "54227",
      "start_offset" : 3,
      "end_offset" : 8,
      "type" : "word",
      "position" : 34
    },
    {
      "token" : "542270",
      "start_offset" : 3,
      "end_offset" : 9,
      "type" : "word",
      "position" : 35
    },
    {
      "token" : "5422708",
      "start_offset" : 3,
      "end_offset" : 10,
      "type" : "word",
      "position" : 36
    },
    {
      "token" : "54227089",
      "start_offset" : 3,
      "end_offset" : 11,
      "type" : "word",
      "position" : 37
    },
    {
      "token" : "4",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 38
    },
    {
      "token" : "42",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 39
    },
    {
      "token" : "422",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 40
    },
    {
      "token" : "4227",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 41
    },
    {
      "token" : "42270",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 42
    },
    {
      "token" : "422708",
      "start_offset" : 4,
      "end_offset" : 10,
      "type" : "word",
      "position" : 43
    },
    {
      "token" : "4227089",
      "start_offset" : 4,
      "end_offset" : 11,
      "type" : "word",
      "position" : 44
    },
    {
      "token" : "2",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 45
    },
    {
      "token" : "22",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 46
    },
    {
      "token" : "227",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "word",
      "position" : 47
    },
    {
      "token" : "2270",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "word",
      "position" : 48
    },
    {
      "token" : "22708",
      "start_offset" : 5,
      "end_offset" : 10,
      "type" : "word",
      "position" : 49
    },
    {
      "token" : "227089",
      "start_offset" : 5,
      "end_offset" : 11,
      "type" : "word",
      "position" : 50
    },
    {
      "token" : "2",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 51
    },
    {
      "token" : "27",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 52
    },
    {
      "token" : "270",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "word",
      "position" : 53
    },
    {
      "token" : "2708",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "word",
      "position" : 54
    },
    {
      "token" : "27089",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 55
    },
    {
      "token" : "7",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 56
    },
    {
      "token" : "70",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 57
    },
    {
      "token" : "708",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 58
    },
    {
      "token" : "7089",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 59
    },
    {
      "token" : "0",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 60
    },
    {
      "token" : "08",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "word",
      "position" : 61
    },
    {
      "token" : "089",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 62
    },
    {
      "token" : "8",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 63
    },
    {
      "token" : "89",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "word",
      "position" : 64
    },
    {
      "token" : "9",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 65
    }
  ]
}
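
Because every digit substring of up to 11 characters is indexed as its own term, a partial phone number can be found with a term query on the sub-field. A hedged sketch (it assumes a document such as {"mobile": "15154227089"} has already been indexed; term skips analysis, so the input digits are matched literally against the indexed grams):

GET test_index/_search
{
  "query": {
    "term": {
      "mobile.split": "4227"
    }
  }
}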

