Introduction to the SharpICTCLAS Word Segmentation System (1): Reading the Dictionary Files

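The overall ICTCLAS segmentation pipeline has five steps: 1) preliminary segmentation; 2) part-of-speech (POS) tagging; 3) person-name and place-name recognition; 4) re-segmentation; 5) POS re-tagging. The first step, preliminary segmentation, is itself divided into three sub-steps: 1) atom segmentation; 2) finding every possible way to combine the atoms into words; 3) coarse segmentation of Chinese words with the N-shortest-paths method.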

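Of everything in the system, reading the dictionary files is the most basic function. ICTCLAS stores its dictionaries in the Data directory. The commonly used ones are coreDict.dct (the core word dictionary), BigramDict.dct (word-to-word association data), nr.dct (person names), ns.dct (place names), and tr.dct (transliterated person names). They all share exactly the same file format and are all parsed by the CDictionary class. For a deeper look at the ICTCLAS dictionary structure, see sinboy's article 《ICTCLAS分词系统研究（二）--词典结构》 ("A Study of the ICTCLAS Segmentation System (2): Dictionary Structure"), which describes it in detail. Here I only present the SharpICTCLAS implementation.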

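We start with the definitions of the basic elements. SharpICTCLAS adjusts some of the original names so that they are more meaningful and follow C# conventions. The code is as follows: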

WordDictionaryElement.cs

using System;
using System.Collections.Generic;
using System.Text;

namespace SharpICTCLAS
{
   //==================================================
   // Originally defined in the DynamicArray.h file
   //==================================================
   public class ArrayChainItem
   {
      public int col, row;//column and row
      public double value;//The value of the array element
      //The possible POS of the word related to the segmentation graph
      public int nPOS;
      public int nWordLen;
      public string sWord;
      //The next item in the chain
      public ArrayChainItem next;
   }

   public class WordResult
   {
      //The word
      public string sWord;

      //the POS of the word
      public int nPOS;

      //The -log(frequency/MAX)
      public double dValue;
   }

   //--------------------------------------------------
   // data structure for word item
   //--------------------------------------------------
   public class WordItem
   {
      public int nWordLen;

      //The word
      public string sWord;

      //the POS of the word
      public int nPOS;

      //The number of times the word occurs
      public int nFrequency;
   }

   //--------------------------------------------------
   //data structure for dictionary index table item
   //--------------------------------------------------
   public class IndexTableItem
   {
      //The number of words that start with this slot's character
      public int nCount;

      //The head of word items
      public WordItem[] WordItems;
   }

   //--------------------------------------------------
   //data structure for word item chain
   //--------------------------------------------------
   public class WordChain
   {
      public WordItem data;
      public WordChain next;
   }

   //--------------------------------------------------
   //data structure for dictionary modify table item
   //--------------------------------------------------
   public class ModifyTableItem
   {
      //The number of words that start with this slot's character
      public int nCount;

      //The number of deleted items in the index table
      public int nDelete;

      //The head of word items
      public WordChain pWordItemHead;
   }
}

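ModifyTableItem is the element type of ModifyTable. During actual segmentation, however, the dictionary is usually read-only, so ModifyTable, which exists to modify the dictionary, plays little role. I therefore leave out the ModifyTable code in what follows.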

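With the basic elements defined, we can define the dictionary class. In the original C++ code every class name starts with a capital "C", and the dictionary class is called CDictionary. In SharpICTCLAS I dropped the leading "C" and, to avoid a clash with the framework's Dictionary class, named it WordDictionary. This class is mainly responsible for reading, writing, and searching the dictionary. Here is how it reads a dictionary file: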

Reading the dictionary file:

using System;
using System.IO;
using System.Text;

public class WordDictionary
{
   public bool bReleased = true;

   public IndexTableItem[] indexTable;
   public ModifyTableItem[] modifyTable;

   public bool Load(string sFilename)
   {
      return Load(sFilename, false);
   }

   public bool Load(string sFilename, bool bReset)
   {
      int frequency, wordLength, pos;   // frequency, word length, part of speech
      bool isSuccess = true;
      FileStream fileStream = null;
      BinaryReader binReader = null;

      try
      {
         fileStream = new FileStream(sFilename, FileMode.Open, FileAccess.Read);
         if (fileStream == null)
            return false;

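         // Note: on .NET Core / .NET 5+, "gb2312" is only available after calling
         // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
         binReader = new BinaryReader(fileStream, Encoding.GetEncoding("gb2312"));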

         indexTable = new IndexTableItem[Predefine.CC_NUM];

         bReleased = false;
         for (int i = 0; i < Predefine.CC_NUM; i++)
         {
            // Read how many words start with this character
            indexTable[i] = new IndexTableItem();
            indexTable[i].nCount = binReader.ReadInt32();

            if (indexTable[i].nCount <= 0)
               continue;

            indexTable[i].WordItems = new WordItem[indexTable[i].nCount];

            for (int j = 0; j < indexTable[i].nCount; j++)
            {
               indexTable[i].WordItems[j] = new WordItem();

               frequency = binReader.ReadInt32();   // frequency
               wordLength = binReader.ReadInt32();  // word length in bytes
               pos = binReader.ReadInt32();         // part of speech

               if (wordLength > 0)
                  indexTable[i].WordItems[j].sWord = Utility.ByteArray2String(binReader.ReadBytes(wordLength));
               else
                  indexTable[i].WordItems[j].sWord = "";

               //Reset the frequency
               if (bReset)
                  indexTable[i].WordItems[j].nFrequency = 0;
               else
                  indexTable[i].WordItems[j].nFrequency = frequency;

               indexTable[i].WordItems[j].nWordLen = wordLength;
               indexTable[i].WordItems[j].nPOS = pos;
            }
         }
      }
      catch (Exception e)
      {
         Console.WriteLine(e.Message);
         isSuccess = false;
      }
      finally
      {
         if (binReader != null)
            binReader.Close();

         if (fileStream != null)
            fileStream.Close();
      }
      return isSuccess;
   }
   //......
}
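
To make the record layout that Load expects easier to see, here is a minimal, self-contained sketch that writes one index slot to memory and reads it back in the same field order (count, then frequency, length, POS, and raw bytes for each word). The names DictLayoutDemo, Entry, WriteSlot, and ReadSlot are hypothetical, and the POS codes and byte payload are made-up placeholders rather than real dictionary data.

```csharp
using System;
using System.IO;
using System.Text;

public static class DictLayoutDemo
{
    // Hypothetical record type for this sketch; it mirrors the fields of WordItem.
    public sealed class Entry
    {
        public int Frequency;
        public int Pos;
        public byte[] WordBytes = new byte[0];
    }

    // Writes one index slot: the word count, then (frequency, length, POS, bytes) per word.
    public static void WriteSlot(BinaryWriter writer, Entry[] entries)
    {
        writer.Write(entries.Length);
        foreach (Entry e in entries)
        {
            writer.Write(e.Frequency);
            writer.Write(e.WordBytes.Length);
            writer.Write(e.Pos);
            writer.Write(e.WordBytes);
        }
    }

    // Reads one slot back in the same field order that Load uses.
    public static Entry[] ReadSlot(BinaryReader reader)
    {
        int count = reader.ReadInt32();
        Entry[] entries = new Entry[Math.Max(count, 0)];
        for (int j = 0; j < entries.Length; j++)
        {
            Entry e = new Entry();
            e.Frequency = reader.ReadInt32();
            int length = reader.ReadInt32();
            e.Pos = reader.ReadInt32();
            e.WordBytes = length > 0 ? reader.ReadBytes(length) : new byte[0];
            entries[j] = e;
        }
        return entries;
    }

    public static void Main()
    {
        // Two made-up entries; the POS codes and the payload bytes are placeholders.
        Entry[] slot =
        {
            new Entry { Frequency = 128, Pos = 1, WordBytes = new byte[0] },
            new Entry { Frequency = 4, Pos = 2, WordBytes = new byte[] { 0xB0, 0xF7 } },
        };

        using (MemoryStream stream = new MemoryStream())
        {
            // leaveOpen: true keeps the stream usable after the writer is disposed.
            using (BinaryWriter writer = new BinaryWriter(stream, Encoding.ASCII, true))
                WriteSlot(writer, slot);

            stream.Position = 0;
            using (BinaryReader reader = new BinaryReader(stream))
            {
                Entry[] roundTrip = ReadSlot(reader);
                Console.WriteLine(roundTrip.Length + " entries, " + stream.Length + " bytes");
            }
        }
    }
}
```

Each word costs three 4-byte integers plus its payload, so the slot above occupies 4 + 12 + (12 + 2) = 30 bytes. An entry with length 0 carries no payload at all, which is how the single-character word for the slot itself is stored.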

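The excerpt below comes from the units with CCID 2, 3, 4, and 5. CCID values cover 6768 character slots. The code indexes them from zero (indexTable has Predefine.CC_NUM entries, and 埃, the third character in GB2312 order, has ID 2), so the range is 0 to 6767. Every word that begins with a given character is recorded in that character's unit. The stored words omit their first character (I have added it back in parentheses); the first character is simply the one the unit corresponds to. For each word the dictionary records its length (in bytes, two per Chinese character, not counting the first character), its frequency, its part of speech, and the word itself.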

Note in particular that within a unit the words are sorted by CCID. This matters a great deal for the analysis later on.

Excerpt from the ICTCLAS dictionary

Character: 埃, ID: 2

Length  Freq  POS  Word
    0   128    h   (埃)
    0     0    j   (埃)
    2     4    n   (埃)镑
    2    28    ns (埃)镑
    4     4    n   (埃)菲尔
    2   511    ns (埃)及
    4     4    ns (埃)克森
    6     2    ns (埃)拉特湾
    4     4    nr (埃)里温
    6     2    nz (埃)默鲁市
    2    27    n   (埃)塞
    8    64    ns (埃)塞俄比亚
   22     2    ns (埃)塞俄比亚联邦民主共和国
    4     3    ns (埃)塞萨
    4     4    ns (埃)舍德
    6     2    nr (埃)斯特角
    4     2    ns (埃)松省
    4     3    nr (埃)特纳
    6     2    nz (埃)因霍温
====================================
Character: 挨, ID: 3

Length  Freq  POS  Word
    0    56    h   (挨)
    2     1    j   (挨)次
    2    19    n   (挨)打
    2     3    ns (挨)冻
    2     1    n   (挨)斗
    2     9    ns (挨)饿
    2     4    ns (挨)个
    4     2    ns (挨)个儿
    6    17    nr (挨)家挨户
    2     1    nz (挨)近
    2     0    n   (挨)骂
    6     1    ns (挨)门挨户
    2     1    ns (挨)批
    2     0    ns (挨)整
    2    12    ns (挨)着
    2     0    nr (挨)揍
====================================
Character: 哎, ID: 4

Length  Freq  POS  Word
    0    10    h   (哎)
    2     3    j   (哎)呀
    2     2    n   (哎)哟
====================================
Character: 唉, ID: 5

Length  Freq  POS  Word
    0     9    h   (唉)
    6     4    j   (唉)声叹气
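
The IDs 2 to 5 in the excerpt are consistent with a row-major index over the GB2312 Chinese character area, where the lead byte starts at 0xB0 and each row holds 94 trail bytes starting at 0xA1. The sketch below works under that assumption; the formula and the name Ccid are illustrative and are not taken from the SharpICTCLAS code shown in this article.

```csharp
using System;

public static class CcidDemo
{
    // Assumed mapping: rows start at lead byte 0xB0 (176), and each row holds
    // 94 characters whose trail bytes start at 0xA1 (161).
    public static int Ccid(byte lead, byte trail)
    {
        return (lead - 176) * 94 + (trail - 161);
    }

    public static void Main()
    {
        // GB2312 byte pairs for the four characters in the excerpt.
        byte[][] samples =
        {
            new byte[] { 0xB0, 0xA3 }, // 埃
            new byte[] { 0xB0, 0xA4 }, // 挨
            new byte[] { 0xB0, 0xA5 }, // 哎
            new byte[] { 0xB0, 0xA6 }, // 唉
        };

        foreach (byte[] s in samples)
            Console.WriteLine(Ccid(s[0], s[1]));   // prints 2, 3, 4, 5 on separate lines
    }
}
```

Under the same assumption, the last slot (lead byte 0xF7, trail byte 0xFE) maps to 71 * 94 + 93 = 6767, which matches a table of 6768 zero-based entries.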

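Note also that a word can have more than one part of speech, so the same word may appear in the dictionary several times with different POS values. To identify a dictionary entry uniquely, you must specify both the word and its part of speech.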

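The operation used most often in WordDictionary is word lookup, implemented by the FindInOriginalTable method. In the original ICTCLAS code this method serves several lookup needs at once, which makes its structure and code fairly complicated. In SharpICTCLAS I overloaded the method, writing a separate FindInOriginalTable for each lookup purpose, which simplifies both the interface and the code. One overload is shown below; it checks whether a word with a given part of speech exists. Passing -1 for nPOS matches any part of speech.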

One overload of the FindInOriginalTable method

private bool FindInOriginalTable(int nInnerCode, string sWord, int nPOS)
{
   WordItem[] pItems = indexTable[nInnerCode].WordItems;

   int nStart = 0, nEnd = indexTable[nInnerCode].nCount - 1;
   int nMid = (nStart + nEnd) / 2, nCmpValue;

   //Binary search
   while (nStart <= nEnd)
   {
      nCmpValue = Utility.CCStringCompare(pItems[nMid].sWord, sWord);

      if (nCmpValue == 0 && (pItems[nMid].nPOS == nPOS || nPOS == -1))
         return true;//find it
      else if (nCmpValue < 0 || (nCmpValue == 0 && pItems[nMid].nPOS < nPOS && nPOS != -1))
         nStart = nMid + 1;
      else if (nCmpValue > 0 || (nCmpValue == 0 && pItems[nMid].nPOS > nPOS && nPOS != -1))
         nEnd = nMid - 1;

      nMid = (nStart + nEnd) / 2;
   }
   return false;
}
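
The same lookup idea can be tried in isolation with the sketch below: a binary search over entries sorted by (word, POS), where -1 acts as a POS wildcard. It is only an approximation of the method above. LookupDemo, Entry, and Contains are hypothetical names, the POS codes are made up, and string.CompareOrdinal stands in for Utility.CCStringCompare, whose exact ordering is not shown in this article.

```csharp
using System;

public static class LookupDemo
{
    public struct Entry
    {
        public string Word;
        public int Pos;
        public Entry(string word, int pos) { Word = word; Pos = pos; }
    }

    // Binary search over entries sorted by (Word, Pos). pos == -1 matches any POS.
    public static bool Contains(Entry[] items, string word, int pos)
    {
        int start = 0, end = items.Length - 1;
        while (start <= end)
        {
            int mid = start + (end - start) / 2;

            // Stand-in comparison; the real code calls Utility.CCStringCompare.
            int cmp = string.CompareOrdinal(items[mid].Word, word);

            // Equal words are ordered by POS unless the caller passed the wildcard.
            if (cmp == 0 && pos != -1)
                cmp = items[mid].Pos.CompareTo(pos);

            if (cmp == 0) return true;
            if (cmp < 0) start = mid + 1;
            else end = mid - 1;
        }
        return false;
    }

    public static void Main()
    {
        // Made-up entries with made-up numeric POS codes, already in sorted order.
        Entry[] items =
        {
            new Entry("a", 1), new Entry("b", 1), new Entry("b", 2), new Entry("c", 5),
        };
        Console.WriteLine(Contains(items, "b", 2));   // True
        Console.WriteLine(Contains(items, "b", 3));   // False
        Console.WriteLine(Contains(items, "b", -1));  // True
        Console.WriteLine(Contains(items, "d", -1));  // False
    }
}
```

Folding the POS comparison into a single cmp value keeps the loop to three branches. The search only works because entries with the same word are stored next to each other and ordered by POS, which is why the sort order inside each unit matters.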

I won't cover the remaining functionality here.

Summary

1. The WordDictionary class implements reading, writing, modifying, and searching the dictionary.

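2. The dictionary groups words under the 6768 character slots they start with and records each word's part of speech and frequency. Its concrete structure is worth understanding.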

Posted on 2007-03-08 14:25 by 吕震宇
