sunpinyin "secondary development"
This post introduces sunpinyin. Its user dictionary is ~/.sunpinyin/userdict, and the program below adds a word to it:
// -I/usr/include/sunpinyin-2.0 add-word.cpp -lsunpinyin
#include <clocale>   // setlocale
#include <cstdlib>   // mbstowcs
#include <ime-core/userdict.h>
#include <pinyin/pinyin_data.h>

TSyllable (*py2i)(const char*) = CPinyinData::encodeSyllable;

int main(int argc, char** argv) {
    if (argc != 2) return 0;
    CUserDict ud;
    ud.load(argv[1]);  // the file need not exist
    CSyllables slbs;   // typedef vector<TSyllable> CSyllables
    slbs.resize(2);
    slbs[0] = py2i("yi");
    slbs[1] = py2i("er");
    // If locale is an empty string, ""... is set according to the environment variables
    setlocale(LC_CTYPE, "");
    unsigned wcs[3] = {};  // zero-initialized, so the result stays 0-terminated
    // The cast assumes a 32-bit wchar_t, as on Linux
    int n = mbstowcs((wchar_t*)wcs, "①②", 2);
    if (n != 2) return 1;  // conversion failed, e.g. the locale is not UTF-8
    ud.addWord(slbs, wcs);
    return 0;
}
apt install libsunpinyin-dev
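Adding the same word several times does not produce multiple entries.
userdict is an SQLite database containing exactly one table, dict. Some commonly used sqlite commands: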
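The add-word program relies on setlocale and mbstowcs, so the conversion of "①②" only works when the environment provides a UTF-8 locale. As a minimal sketch of a locale-independent alternative, the hypothetical helper utf8_to_ucs4 below (not part of sunpinyin) decodes UTF-8 into 0-terminated 32-bit code points by hand. It performs only basic structural checks and does not reject overlong encodings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of sunpinyin): decode a UTF-8 string into
// 32-bit code points followed by a 0 terminator, without consulting the locale.
// Returns an empty vector if the input is structurally invalid UTF-8.
std::vector<unsigned> utf8_to_ucs4(const std::string& s) {
    std::vector<unsigned> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        unsigned cp;
        std::size_t len;
        if (c < 0x80)                { cp = c;        len = 1; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; len = 2; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; len = 3; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; len = 4; }
        else return {};                     // stray continuation or invalid lead byte
        if (i + len > s.size()) return {};  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return {};  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    out.push_back(0);  // terminator
    return out;
}
```

Assuming the word argument of addWord accepts a 0-terminated array of unsigned, as the program above suggests, the result's data() pointer could be passed in place of wcs.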
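Import and export:
sqlite3 file.db .dump >dump.sql
sqlite3 file.db <dump.sql
.tables - list all table names
.schema TABLE - show the structure of a given table
select * from dict ;
You do not need to compile its source code; ready-made programs can also train the data it uses online. They are in the sunpinyin-utils package: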
- genpyt - generate the PINYIN lexicon
- getwordfreq - print word freq information from language model
- idngram_merge - merge idngram file into one
- ids2ngram - generate n-gram data file from ids file
- mmseg - maximum matching segment Chinese text
- slmbuild - generate language model from idngram file
- slminfo - get information of a back-off language model
- slmpack - convert the ARPA format of SunPinyin back-off language model to its binary representation
- slmprune - prune the back-off language model to a reasonable size
- slmseg - maximum matching segment Chinese text
- slmthread - threads the language model. slmthread adds a back-off state for each slm node in the primitive_slm. It also compresses 32-bit floats into a 16-bit representation. This processing speeds up lookups. The primitive_slm is always generated by slmprune, and the threaded_slm can be fed to slmseg as a reference for segmenting Chinese text.
- tslmendian - change the byte-order of sunpinyin's threaded back-off language model
- tslminfo - get information of a threaded back-off language model
Not all of these programs are needed.
phrase-pinyin-data-master.zip and chinese-dictionary-main.zip can be downloaded from gitee.
Line counts:
106666 cc_cedict.txt
6850 di.txt
411960 large_pinyin.txt
872 overwrite.txt
47115 pinyin.txt
348513 zdic_cibs.txt
32633 zdic_cybs.txt
954609 total
A sample entry:
㹴犬: gěng quǎn
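Entries such as "㹴犬: gěng quǎn" carry tone marks, while the add-word program passes toneless syllables ("yi", "er") to encodeSyllable. The sketch below shows one way such a line might be split into a word and toneless syllables. Entry, strip_tones and parse_line are hypothetical names. The "word: pinyin" layout and the '#' comment syntax are assumptions based on the sample entry, and writing ü as "v" is an assumption about sunpinyin's spelling that should be checked. Only precomposed accented letters are handled.

```cpp
#include <cstddef>
#include <cstring>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: replace tone-marked letters (UTF-8, precomposed) with ASCII.
// Mapping ü to 'v' is an assumption about the spelling sunpinyin expects.
std::string strip_tones(const std::string& s) {
    static const std::pair<const char*, char> table[] = {
        {"ā", 'a'}, {"á", 'a'}, {"ǎ", 'a'}, {"à", 'a'},
        {"ē", 'e'}, {"é", 'e'}, {"ě", 'e'}, {"è", 'e'},
        {"ī", 'i'}, {"í", 'i'}, {"ǐ", 'i'}, {"ì", 'i'},
        {"ō", 'o'}, {"ó", 'o'}, {"ǒ", 'o'}, {"ò", 'o'},
        {"ū", 'u'}, {"ú", 'u'}, {"ǔ", 'u'}, {"ù", 'u'},
        {"ü", 'v'}, {"ǖ", 'v'}, {"ǘ", 'v'}, {"ǚ", 'v'}, {"ǜ", 'v'},
        {"ń", 'n'}, {"ň", 'n'}, {"ǹ", 'n'}, {"ḿ", 'm'},
    };
    std::string out;
    std::size_t i = 0;
    while (i < s.size()) {
        bool matched = false;
        for (const auto& e : table) {
            std::size_t n = std::strlen(e.first);
            if (s.compare(i, n, e.first) == 0) {  // table key found at position i
                out += e.second;
                i += n;
                matched = true;
                break;
            }
        }
        if (!matched) out += s[i++];  // copy any other byte unchanged
    }
    return out;
}

// Hypothetical record for one dictionary line.
struct Entry {
    std::string word;
    std::vector<std::string> syllables;
};

// Parse "word: syl syl ..." into an Entry; text after '#' is treated as a comment.
// Returns false for blank, comment-only, or malformed lines.
bool parse_line(const std::string& raw, Entry& e) {
    std::string line = raw.substr(0, raw.find('#'));
    std::size_t colon = line.find(": ");
    if (colon == std::string::npos) return false;
    e.word = line.substr(0, colon);
    e.syllables.clear();
    std::istringstream iss(strip_tones(line.substr(colon + 2)));
    std::string syl;
    while (iss >> syl) e.syllables.push_back(syl);
    return !e.word.empty() && !e.syllables.empty();
}
```

For the sample entry this yields the word 㹴犬 with the syllables geng and quan, which could then be passed one by one to py2i in the add-word program.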
iconv can convert file encodings.