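Converting Sogou cell dictionaries (.scel) to text files with Python or C
Xuetian Weng (翁学天, CSSlayer) wrote scel2org, which is part of fcitx-tools.
I pulled his .c program out so that nothing else needs to be installed; gcc scel2org.c is enough to compile it:
〔here〕 you can download scel2org.c, utarray.h, uthash.h and utils.h
Addendum: uthash.h and utils.h are not strictly required. It also works if you rename the file to .cpp, replace boolean with bool, and drop the #include of utils.h.
Then I wrote a Python version (29 lines):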
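import sys

f = open(sys.argv[1], 'rb')
# 12-byte file header
if f.read(12) != b'\x40\x15\0\0\x44\x43\x53\x01\x01\0\0\0': raise ValueError()
# 4-byte marker that precedes the pinyin table
f.seek(0x1540)
if f.read(4) != b'\x9d\x01\0\0': raise ValueError()

all_py = []
to_uint16 = lambda bs: int.from_bytes(bs, byteorder='little', signed=False)
# 'utf-16-le' is explicit; plain 'utf-16' without a BOM falls back to the native byte order
dec_utf16 = lambda bs: bs.decode('utf-16-le')

# Pinyin table: 2-byte index (skipped), 2-byte length, then the syllable; "zuo" is the last entry
while True:
    f.read(2)
    pinyin = dec_utf16(f.read(to_uint16(f.read(2))))
    all_py.append(pinyin)
    if pinyin == "zuo": break

# Word table: each group shares one pinyin sequence
while True:
    bs = f.read(2)
    if len(bs) == 0: break
    symcnt = to_uint16(bs)      # number of words in this group
    cnt = to_uint16(f.read(2))  # byte length of the pinyin index list
    bs = f.read(cnt)
    pyidx = [to_uint16(bs[i:i+2]) for i in range(0, len(bs), 2)]
    for i in range(symcnt):
        b = f.read(to_uint16(f.read(2)))
        print(dec_utf16(b), ' '.join([all_py[k] for k in pyidx]))
        f.read(to_uint16(f.read(2)))  # skip the extension data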
Python makes this easy: fread simply becomes f.read :-)
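The pattern that the script repeats, a 2-byte little-endian length followed by that many bytes of UTF-16 text, can be isolated in a small helper. This is only a sketch: read_lp_utf16 is a hypothetical name, the sample bytes are hand-built rather than taken from a real .scel file, and the decoder assumes the text is little-endian.

```python
import io

def read_lp_utf16(f):
    """Read one length-prefixed UTF-16LE string from a binary stream.

    Returns None when the stream has no complete 2-byte length left.
    """
    head = f.read(2)
    if len(head) < 2:
        return None
    size = int.from_bytes(head, byteorder='little', signed=False)
    return f.read(size).decode('utf-16-le')

# Hand-built sample: length 6, then "zuo" encoded as UTF-16LE (2 bytes per letter).
sample = io.BytesIO(b'\x06\x00' + 'zuo'.encode('utf-16-le'))
print(read_lp_utf16(sample))  # the string
print(read_lp_utf16(sample))  # None: end of stream
```

Returning None at end of stream plays the same role as the len(bs) == 0 check in the main loop.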
Adding support for files whose 12-byte header is b'\x40\x15\0\0\x45\x43\x53\x01\x01\0\0\0':
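import sys

def do_file(f):
    # The 5th header byte distinguishes the two file types
    bs = f.read(12)
    if bs == b'\x40\x15\0\0\x44\x43\x53\x01\x01\0\0\0': flag = 1
    elif bs == b'\x40\x15\0\0\x45\x43\x53\x01\x01\0\0\0': flag = 2
    else: print(bs, file=sys.stderr); raise ValueError()

    to_uint16 = lambda bs: int.from_bytes(bs, byteorder='little', signed=False)
    # 'utf-16-le' is explicit; plain 'utf-16' without a BOM falls back to the native byte order
    dec_utf16 = lambda bs: bs.decode('utf-16-le')

    # Pinyin table entries start at 0x1544; "zuo" is the last entry
    all_py = []
    f.seek(0x1544)
    while True:
        f.read(2)
        pinyin = dec_utf16(f.read(to_uint16(f.read(2))))
        all_py.append(pinyin)
        if pinyin == "zuo": break

    # In the second file type the word table starts at a different offset
    if flag == 2: f.seek(0x26de)
    while True:
        bs = f.read(2)
        if len(bs) == 0: break
        symcnt = to_uint16(bs)      # number of words in this group
        cnt = to_uint16(f.read(2))  # byte length of the pinyin index list
        bs = f.read(cnt)
        pyidx = [to_uint16(bs[i:i+2]) for i in range(0, len(bs), 2)]
        for i in range(symcnt):
            b = f.read(to_uint16(f.read(2)))
            # Skip entries that fail to decode or that reference an unknown pinyin index
            try: print(dec_utf16(b), ' '.join([all_py[k] for k in pyidx]))
            except Exception: pass
            f.read(to_uint16(f.read(2)))  # skip the extension data

# Process every file named on the command line; report the ones that fail
for fn in sys.argv[1:]:
    try:
        with open(fn, 'rb') as f: do_file(f)
    except Exception as e: print(fn, file=sys.stderr)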
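Xuetian Weng is very capable; he is now the main developer of fcitx. Some people call him CS Slayer (nothing to do with CSS).
Fcitx (小企鹅输入法, the "Little Penguin Input Method") was originally developed by Yuking under the name gWuBi. It was renamed Fcitx as of version 1.7 and contributed significantly to the adoption of Linux in China.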
XIM (X Input Method) is a protocol used in the X Window System that facilitates complex text input, particularly for languages requiring special characters or symbols.
Fcitx originally stood for "Free Chinese Input Tool of X"; it now stands for "Flexible Input Method Framework".
〔https://zh.opensuse.org/Fcitx〕
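The claim that "the fcitx-libpinyin algorithm is more advanced than sunpinyin's" is wrong. The former uses a word-level 2-gram model, the latter a word-level 3-gram model.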
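To make the 2-gram versus 3-gram distinction concrete, here is a minimal sketch of what the two models look at. It only enumerates the word windows; it is not how either engine implements its language model, and ngrams and the sample sentence are made up for illustration.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical, already-segmented sentence.
sentence = ['我', '用', '拼音', '输入法']

print(ngrams(sentence, 2))  # pairs: each word is seen with one preceding word
print(ngrams(sentence, 3))  # triples: each word is seen with two preceding words
```

A 3-gram model conditions each word on two words of context instead of one, so it can capture longer dependencies, at the price of needing more training data and a larger model.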
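On Linux, the C version just loops over its arguments: for (int i = 1; i < argc; i++) do_file(argv[i]);
scel2org *.scel then converts files in batch. The shell expands the wildcard before the program starts, and a command line can be very long.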
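Where the shell does not expand wildcards (cmd.exe on Windows, for example), a Python script can do the expansion itself with the standard glob module. A minimal sketch, assuming that an argument matching nothing should be passed through unchanged; expand_args is a hypothetical helper name.

```python
import glob
import sys

def expand_args(args):
    """Expand wildcard patterns; keep arguments that match nothing unchanged."""
    files = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        files.extend(matches if matches else [arg])
    return files

if __name__ == '__main__':
    # List the files a batch run would process.
    for fn in expand_args(sys.argv[1:]):
        print(fn)
```

On Linux this is harmless for ordinary names, since the shell has already expanded the pattern; note that a literal file name containing characters such as [ or ] would still be treated as a pattern by glob.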
