1. Trie

Two implementations: one indexes children with an array, the other with a hash map.

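// Array-based node: constant-time child lookup, but every node allocates 65536 slots.
class TrieNode {
    public Character value;
    public TrieNode[] next = new TrieNode[65536]; // 65536 = 2^16, one slot per Java char
}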

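import java.util.HashMap;
import java.util.Map;

// Map-based node: stores only the children that actually exist.
class TrieNode {
    public Character value;
    public Map<Character, TrieNode> next = new HashMap<Character, TrieNode>();
}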
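As a rough illustration of how the map-based layout is used, here is a minimal sketch of insertion and lookup. The class name `MapTrie`, the inner `Node` type, and the `isWord` flag are assumptions made for this example; they are not part of the node classes above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class: a trie built on the HashMap-based node layout,
// with an explicit end-of-word flag added for this sketch.
class MapTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isWord; // true if some key ends at this node
    }

    private final Node root = new Node();

    // Walk down from the root, creating missing children along the way.
    void insert(String key) {
        Node node = root;
        for (char c : key.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // A key is present only if the walk succeeds and ends on a word node.
    boolean contains(String key) {
        Node node = root;
        for (char c : key.toCharArray()) {
            node = node.next.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        MapTrie trie = new MapTrie();
        trie.insert("badge");
        trie.insert("baby");
        System.out.println(trie.contains("badge")); // true
        System.out.println(trie.contains("bad"));   // false: only a prefix
    }
}
```

The array-based layout would replace `next.get(c)` with `next[c]`, trading memory for a direct index.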

Double-array implementation

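The double-array combines the fast lookup of an array with the space efficiency of a list, using two arrays, base and check. A trie is equivalent to an automaton whose states are the node numbers and whose edges are characters; the goto function $g(r,c) = s$ then means that state r transitions to state s on character c. The base array is an array implementation of the goto function, and the check array verifies that a transition is valid. The two arrays satisfy the following transition equations: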

base[r] + c = s
check[s] = r
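The two equations can be sketched as a single transition step. The class and method names below are hypothetical, and characters are assumed to be already mapped to integer codes; the sample values (base[1] = 4, code 3, target state 7) only illustrate the rule.

```java
// Hypothetical sketch of one double-array transition.
class DoubleArrayStep {
    // Returns s = base[r] + c when check[s] == r, otherwise -1 (no such edge).
    static int transition(int[] base, int[] check, int r, int c) {
        int s = base[r] + c;
        if (s < 0 || s >= check.length || check[s] != r) {
            return -1;
        }
        return s;
    }

    public static void main(String[] args) {
        int[] base = new int[8];
        int[] check = new int[8];
        base[1] = 4;   // state 1 is the root
        check[7] = 1;  // state 7 was reached from state 1
        System.out.println(transition(base, check, 1, 3)); // 7
        System.out.println(transition(base, check, 1, 2)); // -1, check[6] != 1
    }
}
```

The check test is what allows several states to share the same region of the base array: a slot only counts as a child of r if it records r as its parent.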

If the tail [b1..bh] of a string has no common prefix with any other key, and the corresponding state is m, then the tail is stored in a separate tail array:
base[m] < 0
p = -base[m], tail[p] = b1, tail[p+1] = b2, ..., tail[p+h-1] = bh

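Example: looking up the key "badge#", where the root is state 1 and the character codes are a = 2, b = 3, d = 5.
// root -> b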
base[1] + 'b' = 4 + 3 = 7
// root -> b -> a
base[7] + 'a' = 1 + 2 = 3
// root -> b -> a -> d
base[3] + 'd' = 1 + 5 = 6
base[6] = -12
tail[12..14] = 'ge#'
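The lookup above can be replayed in code. This is a minimal sketch that hard-codes only the array entries appearing in the example; the check values follow from check[s] = r. The class name, the array sizes, and the code assigned to '#' are assumptions made for the illustration.

```java
// Hypothetical sketch: double-array lookup with a tail array,
// populated only with the entries from the "badge#" example.
class TailLookup {
    static final int[] BASE = new int[16];
    static final int[] CHECK = new int[16];
    static final char[] TAIL = new char[16];

    static {
        BASE[1] = 4;   CHECK[7] = 1;  // root -> b
        BASE[7] = 1;   CHECK[3] = 7;  // b -> a
        BASE[3] = 1;   CHECK[6] = 3;  // a -> d
        BASE[6] = -12;                // the rest of the key lives in TAIL[12..]
        TAIL[12] = 'g'; TAIL[13] = 'e'; TAIL[14] = '#';
    }

    // a = 2, b = 3, ... as in the example; '#' = 1 is an assumption.
    static int code(char c) {
        return c == '#' ? 1 : c - 'a' + 2;
    }

    // Keys are expected to end with the terminator '#'.
    static boolean contains(String key) {
        int state = 1;
        int i = 0;
        while (i < key.length()) {
            if (BASE[state] < 0) {
                // Negative base: compare the unread part of the key with the tail.
                int p = -BASE[state];
                String rest = key.substring(i);
                for (int k = 0; k < rest.length(); k++) {
                    if (p + k >= TAIL.length || TAIL[p + k] != rest.charAt(k)) {
                        return false;
                    }
                }
                return rest.endsWith("#");
            }
            int s = BASE[state] + code(key.charAt(i));
            if (s < 0 || s >= CHECK.length || CHECK[s] != state) {
                return false;
            }
            state = s;
            i++;
        }
        return false; // key ended before reaching a tail (simplification)
    }

    public static void main(String[] args) {
        System.out.println(contains("badge#")); // true
        System.out.println(contains("bad#"));   // false
    }
}
```

A complete implementation would also handle keys that end exactly at a branching state; this sketch covers only the path shown in the example.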

2. Applications of DAT

Dictionary

Ansj's core.dic provides a DAT implementation of a Chinese dictionary:

249952
37  %   65536   -1  3   {q=1}
39  '   65536   -1  4   {en=1}
46  .   65536   -1  5   {nb=1}
...
21360   印   92338   -1  2   {j=24, n=1, ng=2, nr=0, v=32}
24230   度   89338   -1  2   {k=0, ng=2, q=28, v=7, vg=2}
27827   河   142597  -1  2   {n=29, q=0}
...
116568  印度  71557   21360   2   {ns=51}
99384   印度河 65536   116568  3   {ns=0}
116553  振臂一 94926   129740  1   null
116566  捅娄子 65536   116571  3   {v=0}
65333   Ｕ   65536   -1  4   {en=1}
...

The columns are: index   name    base    check   status  {part of speech -> word frequency}

index['印度'] = 116568 = base['印'] + index['度'] = 92338 + 24230
check['印度'] = 21360 = index['印']
index['印度河'] = 99384 = base['印度'] + index['河'] = 71557 + 27827
check['印度河'] = 116568 = index['印度']
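The arithmetic above can be checked mechanically. The sketch below copies the five relevant entries (index, base, check) from the core.dic excerpt into a map and walks a word character by character. The class and method names are hypothetical, and the code does not use Ansj itself. Non-ASCII characters are written as Unicode escapes: U+5370 U+5EA6 U+6CB3 is 印度河.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: replaying the core.dic index arithmetic.
class CoreDicWalk {
    // index -> {base, check}, copied from the core.dic excerpt
    static final Map<Integer, int[]> ITEMS = new HashMap<>();

    static {
        ITEMS.put(21360, new int[] {92338, -1});      // U+5370
        ITEMS.put(24230, new int[] {89338, -1});      // U+5EA6
        ITEMS.put(27827, new int[] {142597, -1});     // U+6CB3
        ITEMS.put(116568, new int[] {71557, 21360});  // U+5370 U+5EA6
        ITEMS.put(99384, new int[] {65536, 116568});  // U+5370 U+5EA6 U+6CB3
    }

    // Returns the DAT index of word, or -1 if some transition fails.
    static int indexOf(String word) {
        if (word.isEmpty()) {
            return -1;
        }
        int index = word.charAt(0); // a single character's index is its char code
        if (!ITEMS.containsKey(index)) {
            return -1;
        }
        for (int j = 1; j < word.length(); j++) {
            int next = ITEMS.get(index)[0] + word.charAt(j); // base + char code
            int[] item = ITEMS.get(next);
            if (item == null || item[1] != index) {          // check must point back
                return -1;
            }
            index = next;
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("\u5370\u5ea6"));       // 116568
        System.out.println(indexOf("\u5370\u5ea6\u6cb3")); // 99384
        System.out.println(indexOf("\u5ea6\u6cb3"));       // -1
    }
}
```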

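The status field takes the following values:
• 1: the part-of-speech entry is null; name cannot stand alone as a word and must be continued, e.g. "振臂一";
• 2: name can be a word on its own and can also combine with further characters to form a new word, e.g. "印度";
• 3: end of word; name is a word and is not continued, e.g. "捅娄子";
• 4: English letters (including full-width) plus the character ', 105 (26*4+1) characters in total;
• 5: digits (including full-width) plus the decimal point, 21 (10*2+1) characters in total.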

Word segmentation

import org.ansj.library.DATDictionary
import scala.collection.mutable.ArrayBuffer

// max-matching algorithm for CWS
def maxMatching(sentence: String): Array[String] = {
  val segmented = ArrayBuffer.empty[String]
  val chars = sentence.toCharArray
  var i = 0
  while (i < chars.length) {
    DATDictionary.status(chars(i)) match {
      // not in core.dic or word-end or last char
      case t if t == 0 || t == 3 || i == chars.length - 1 =>
        i = singleCharWord(chars, i, segmented)
      // word-start
      case t if t == 1 || t == 2 =>
        i = goOnWord(chars, i, segmented)
      // English character or number
      case _ =>
        i = goOnEnNum(chars, i, segmented)
    }
  }
  segmented.toArray
}

// a single character segment
private def singleCharWord(chars: Array[Char], start: Int, arr: ArrayBuffer[String]): Int = {
  arr += chars(start).toString
  start + 1
}

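// word segment which is in core.dic
private def goOnWord(chars: Array[Char], start: Int, arr: ArrayBuffer[String]): Int = {
  // a single character's DAT index is its char code
  var nextIndex: Int = chars(start).toInt
  for (j <- start + 1 until chars.length) {
    val preIndex = nextIndex
    // DAT transition: base[r] + c = s, valid only if check[s] = r
    nextIndex = DATDictionary.getItem(nextIndex).getBase + chars(j).toInt
    if (DATDictionary.getItem(nextIndex).getCheck != preIndex) {
      arr += new String(chars, start, j - start)
      return j
    }
  }
  // the match ran to the end of the sentence: emit the remaining characters
  arr += new String(chars, start, chars.length - start)
  chars.length
}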

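// English chars and numbers compose a word
private def goOnEnNum(chars: Array[Char], start: Int, arr: ArrayBuffer[String]): Int = {
  for (j <- start + 1 until chars.length) {
    val status = DATDictionary.status(chars(j))
    // status 4 = English letter, status 5 = digit; anything else ends the run
    if (status != 4 && status != 5) {
      arr += new String(chars, start, j - start)
      return j
    }
  }
  // the run reached the end of the sentence: emit the remaining characters
  arr += new String(chars, start, chars.length - start)
  chars.length
}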

val sentence = "非农一触即发，现货原油扑朔迷离，伦敦金回暖已定"
println(maxMatching(sentence).mkString("/"))
// 非农/一触即发/，/现货/原油/扑朔迷离/，/伦敦/金/回暖/已/定
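For comparison, here is a dictionary-agnostic sketch of forward maximum matching. It does not use Ansj or a DAT: the dictionary is a plain HashSet, and the class name, the `maxLen` parameter, and the toy ASCII vocabulary are assumptions made for the illustration. At each position it tries the longest candidate first and shrinks it until it hits a dictionary word or a single character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: forward maximum matching over a HashSet dictionary.
class ForwardMaxMatch {
    static List<String> segment(String sentence, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sentence.length()) {
            int end = Math.min(sentence.length(), i + maxLen);
            // Shrink the candidate until it is a dictionary word or one character.
            while (end > i + 1 && !dict.contains(sentence.substring(i, end))) {
                end--;
            }
            out.add(sentence.substring(i, end));
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("ab", "abc", "cd"));
        // "abc" wins over "ab" because the longest match is tried first.
        System.out.println(segment("abcd", dict, 3)); // [abc, d]
    }
}
```

The set-based version re-hashes a substring for every candidate length, whereas the DAT walk above extends the match one character at a time with two array reads per step.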


posted @ 2017-01-09 14:49 Treant