[Background]
Corpus: wiki_zh, 1.2 GB
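Dictionary: sysdic, 74,001 lines, of which about 17,000 are single characters, taken from googlepinyin. Not every GB18030 Hanzi maps to a regular Unicode code point (older versions of the standard mapped some characters to the Private Use Area). GB18030 is a mixed-width encoding that uses one, two, or four bytes per character.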
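As a quick illustration of the mixed-width encoding, here is a minimal sketch using Python's built-in `gb18030` codec. The sample characters and the helper name `gb18030_width` are chosen for illustration only; they are not taken from the dictionary.

```python
# Byte widths under GB18030, using Python's built-in codec.
samples = {
    "A": "ASCII letter",
    "中": "common Hanzi in the two-byte range",
    "\U00020000": "CJK Extension B character, outside the BMP",
}

def gb18030_width(ch: str) -> int:
    """Return how many bytes one character occupies in GB18030."""
    return len(ch.encode("gb18030"))

for ch, note in samples.items():
    print(f"U+{ord(ch):04X}  {gb18030_width(ch)} byte(s)  {note}")
```

The three samples come out as one, two, and four bytes respectively.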
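The words are googlepinyin + sunpinyin, merged and deduplicated. Sample entry: 喜羊羊与灰太狼之兔年顶呱呱 74083 xi'yang'yang'yu'hui'tai'lang'zhi'tu'nian'ding'gua'gua
Only the single characters were used for mmseg (segmentation).
Resulting model: 118M lm_sc.t3g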
This is a 3-gram back-off model, using -log(pr)
1 items in 0-level
10876 items in 1-level
1945235 items in 2-level # 1945235 ** 0.5 ≈ 1395
12444533 items in 3-level # 12444533 ** (1/3) ≈ 232. Can I say it has 12 million words?
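To make "3-gram back-off model, using -log(pr)" concrete, here is a toy sketch of a back-off lookup where every probability is stored as a cost, -log(pr), so that costs add instead of multiply. This is not the code that reads lm_sc.t3g; the tables, the numbers, and the names `COST`, `BACKOFF`, and `UNKNOWN_COST` are all made up for illustration.

```python
import math

# Toy tables: keys are n-gram tuples, values are -log(pr) costs.
# All numbers are invented for illustration.
COST = {
    ("输入",): -math.log(0.02),
    ("自然",): -math.log(0.01),
    ("流畅",): -math.log(0.005),
    ("输入", "自然"): -math.log(0.1),
    ("自然", "流畅"): -math.log(0.2),
}
# Back-off penalties keyed by history, also stored as -log(weight).
BACKOFF = {
    ("输入", "自然"): -math.log(0.5),
    ("自然",): -math.log(0.4),
}
UNKNOWN_COST = -math.log(1e-7)  # assumed floor for unseen words

def cost(history: tuple, word: str) -> float:
    """Return -log P(word | history), backing off to shorter histories."""
    ngram = history + (word,)
    if ngram in COST:
        return COST[ngram]
    if not history:
        return UNKNOWN_COST
    # Pay the back-off penalty for this history (0 if none is stored),
    # then retry with the oldest history word dropped.
    return BACKOFF.get(history, 0.0) + cost(history[1:], word)

def sentence_cost(words: list) -> float:
    """Sum the costs of each word given up to two preceding words."""
    total = 0.0
    for i, w in enumerate(words):
        total += cost(tuple(words[max(0, i - 2):i]), w)
    return total
```

With these toy tables, `sentence_cost(["输入", "自然", "流畅"])` has no stored trigram for the last word, so it pays the back-off penalty for the history (输入, 自然) and falls back to the bigram (自然, 流畅). A lower total cost means a more likely sentence, which is how a candidate ends up ranked first.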
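Training time: under half an hour.
Result: typing is natural and fluent. Typing shuruzilanliuchang gives 输入自然流畅 ("typing is natural and fluent") as the first candidate.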
TODO:
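- Adding words to userdict was too slow because it kept copying the database. Changed it so that I back up the file myself before adding words, and nothing is copied while adding. [job done]
- Remove readings such as 县 being readable as xuan. Split sysdic into zi, zi.多音 (polyphonic characters), and words; concatenating them back together works fine. There are 533 polyphonic characters. The IME is on fire: practically anything you type is in there.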
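The first TODO item (back up once, then add words without copying) can be sketched as follows. This assumes the user dictionary is a single SQLite file; the table `userdict_demo`, its columns, and the function names are hypothetical and do not reflect the IME's real schema.

```python
import shutil
import sqlite3

def backup_once(db_path: str) -> str:
    """Copy the dictionary file to <name>.bak one time, before any edits."""
    bak = db_path + ".bak"
    shutil.copy2(db_path, bak)
    return bak

def add_words(db_path: str, entries) -> int:
    """Insert all (word, pinyin) pairs in one transaction, with no per-word copy."""
    con = sqlite3.connect(db_path)
    try:
        with con:  # commits once at the end, rolls back on error
            con.execute(
                "CREATE TABLE IF NOT EXISTS userdict_demo "
                "(word TEXT PRIMARY KEY, pinyin TEXT)"
            )
            con.executemany(
                "INSERT OR IGNORE INTO userdict_demo VALUES (?, ?)", entries
            )
        # Return the number of rows now stored in the hypothetical table.
        return con.execute("SELECT COUNT(*) FROM userdict_demo").fetchone()[0]
    finally:
        con.close()
```

The point of the design is that the expensive file copy happens once, outside the loop, while the inserts themselves share a single commit.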
My extremely messy program:
#!/usr/bin/env python3
# Each callback takes (f, s): f is the whitespace-split line, s is the raw line.
# Only one do_(...) pipeline is active at a time; the rest are parked in the
# string literal at the bottom.

d = {}
def merge_all_w(f, s):
    d[f[0]] = ' '.join(f[1:])

def get_all_g_w(f, s):
    w = f[0]
    if len(w) <= 1: return
    f[1:] = f[3:]
    print(' '.join(f))

def get_all_s_w(f, s):
    w = f[0]
    if len(w) <= 1: return
    f[1:] = f[2:]
    print(' '.join(f))

d = {}
def get_g_zi(f, s):
    w = f[0]
    if len(w) > 1: return
    d.setdefault(w, []).append(f[3])

d = {}
def sort_g_by_freq(f, s):  # sort by word frequency, descending
    w = f[0]
    if len(w) <= 1: return
    freq = int(float(f[1]))
    f[1:] = f[3:]
    d.setdefault(freq, []).append(' '.join(f))

wid = 16563
def get_g_23(f, s):  # two- and three-character words among the high-frequency words
    global wid
    xx = len(f[0])
    if xx != 2 and xx != 3: return
    f[1] = str(wid); wid += 1
    f[2:] = f[3:]
    print(' '.join(f))

def get_s_zi(f, s):  # single characters
    if int(f[1]) >= 100 and len(f[0]) == 1:
        f[1:] = f[2:]; print(' '.join(f))

def all_minus_sys(f, s):
    if f[0] not in st: print(s)

def sys_dic_pie(f, s):
    if int(f[1]) > 100 and len(f[0]) > 1:
        print(f[0], f[1], "'".join(f[2:]))
    else:
        print(s)

wid = 58005
def usr_dic_pie(f, s):
    global wid
    print(f[0], wid, "'".join(f[1:]))
    wid += 1

def do_(cb):
    # Feed every stdin line to the callback until EOF.
    try:
        while True:
            s = input(); cb(s.split(), s)
    except EOFError: pass
    except Exception as e: print('ERROR:', e)

do_(usr_dic_pie)

'''
do_(sys_dic_pie)

st = set()
for s in open('/t/sysdic', 'r'): st.add(s.split()[0])
do_(all_minus_sys)

do_(merge_all_w)
for k,v in d.items(): print(f'{k} {v}')

do_(get_all_s_w)
do_(get_all_g_w)
do_(get_g_23)

do_(get_g_zi)
n = 100
for k,v in d.items(): print(f'{k} {n}', ' '.join(v)); n += 1
#噷 16562 hm

do_(sort_g_by_freq)
for k in sorted(d.keys(), reverse=True): print('\n'.join(d[k]))

do_(get_s_zi)
'''
# grep -v '%' polyphonic characters
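For readers who want to see what the active `usr_dic_pie` step does without piping a file through stdin, here is a small self-contained sketch of the same transformation. The function name `to_userdict_lines` and the sample input lines are hypothetical. Unlike the script above, it skips lines with fewer than two fields instead of stopping on them.

```python
def to_userdict_lines(lines, first_id=58005):
    """Turn 'word py1 py2 ...' lines into "word id py1'py2'..." lines."""
    out = []
    wid = first_id
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and words without pinyin
        # Join the syllables with apostrophes, as in the sysdic sample entry.
        out.append(f"{fields[0]} {wid} " + "'".join(fields[1:]))
        wid += 1
    return out

if __name__ == "__main__":
    for row in to_userdict_lines(["喜羊羊 xi yang yang", "灰太狼 hui tai lang"]):
        print(row)
```

With the two sample lines, this prints `喜羊羊 58005 xi'yang'yang` and `灰太狼 58006 hui'tai'lang`: each word gets the next sequential id, starting at 58005.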
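In the SConstruct, cflags = '-g -Wall' is passed as CFLAGS=cflags, with CXXFLAGS='-std=c++11', but .cpp files are compiled with CXXFLAGS, so -g -Wall never reach the C++ sources. I hacked it into: env.MergeFlags(['-pipe -O -DHAVE_CONFIG_H',
scons -c is like make clean.
The SConstruct is just a Python program, so there is no need to learn autoconf and automake.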
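A minimal SConstruct sketch of the flag issue described above, assuming a stock SCons setup. In SCons, CFLAGS applies to C sources only, CXXFLAGS to C++ sources only, and CCFLAGS to both, so flags meant for every compile belong in CCFLAGS. The target name `demo` and the source `demo.cpp` are hypothetical, and this is not the project's actual build file.

```python
# SConstruct (illustrative fragment; run with `scons`, clean with `scons -c`)
env = Environment()

# CCFLAGS is passed to both the C and the C++ compiler.
env.Append(CCFLAGS=['-g', '-Wall'])

# CXXFLAGS is passed to the C++ compiler only.
env.Append(CXXFLAGS=['-std=c++11'])

# Hypothetical target and source, for illustration.
env.Program('demo', ['demo.cpp'])
```

Because the file is ordinary Python, the flag lists can be built with normal string and list operations before being handed to `env.Append`.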
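I had originally planned to get a computer with 16 GB of RAM; that is no longer necessary.
On a certain "Computer Market" board, people are even discussing how to attach NVMe SSDs through PCIe adapter cards. Consumption really has been downgraded.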