Gibberish Detection (Random-String Recognition)
A simple approach:
1. From a large corpus of normal text, build a character-to-character conditional probability matrix.
(The example uses the 26 English letters plus the space character, 27 base characters in total, giving a 27x27 transition matrix.)
2. Split the string under test into consecutive 2-grams; for each 2-gram, look up the probability of the second character appearing after the first.
3. Average the conditional probabilities over all 2-grams in the string. If the average falls below a threshold, the ordering of the characters deviates from the normal distribution, and the string was likely randomly generated.
(The example sets the threshold to "(minimum probability of the normal class + maximum probability of the gibberish class) / 2".)
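One subtlety in step 3: the implementation below averages *log* probabilities and then exponentiates, which yields the geometric mean of the bigram probabilities rather than the arithmetic mean. A minimal sketch of the difference, with made-up bigram probabilities:

```python
import math

# Hypothetical bigram probabilities for a 5-character string (4 bigrams).
probs = [0.2, 0.05, 0.1, 0.25]

# Naive arithmetic mean of the raw probabilities.
arith = sum(probs) / len(probs)  # ≈ 0.15

# What the model actually computes: exp(mean(log p)), i.e. the geometric
# mean. Multiplying raw probabilities would underflow for long strings;
# summing their logs does not.
geo = math.exp(sum(math.log(p) for p in probs) / len(probs))

print(arith, geo)
```

The geometric mean is also more sensitive to a single very unlikely bigram, which is exactly what a gibberish detector wants.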
The scheme stops short of computing a multi-hop steady-state matrix, so rather than a genuine Markov chain it is really just simple conditional probability :)
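To illustrate the distinction: each normalized row of the 27x27 table (before taking logs) is one row of a one-step Markov transition matrix, and the "multi-hop" probabilities the note refers to would come from matrix powers of it. A toy sketch with a hypothetical 2-state chain, not part of the detector itself:

```python
import numpy as np

# Hypothetical one-step transition matrix P; each row sums to 1,
# like a normalized row of the 27x27 bigram table.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Two-hop transition probabilities: P squared.
P2 = np.linalg.matrix_power(P, 2)

# Steady state: the rows of P^n converge as n grows; for this P the
# stationary distribution is (5/6, 1/6).
Pn = np.linalg.matrix_power(P, 50)
print(Pn)
```

The detector never needs any of this: it only looks up one-step probabilities and averages them.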
```python
# coding=utf8

import math
import pickle

accepted_chars = 'abcdefghijklmnopqrstuvwxyz '

pos = dict([(char, idx) for idx, char in enumerate(accepted_chars)])

def normalize(line):
    """ Return only the subset of chars from accepted_chars.
    This helps keep the model relatively small by ignoring punctuation,
    infrequent symbols, etc. """
    return [c.lower() for c in line if c.lower() in accepted_chars]

def ngram(n, l):
    """ Return all n grams from l after normalizing """
    filtered = normalize(l)
    for start in range(0, len(filtered) - n + 1):
        yield ''.join(filtered[start:start + n])

def train():
    """ Write a simple model as a pickle file """
    k = len(accepted_chars)
    # Assume we have seen 10 of each character pair. This acts as a kind of
    # prior or smoothing factor. This way, if we see a character transition
    # live that we've never observed in the past, we won't assume the entire
    # string has 0 probability.
    counts = [[10 for i in range(k)] for i in range(k)]

    # Count transitions from big text file, taken
    # from http://norvig.com/spell-correct.html
    # (count character-pair occurrences in a large corpus of normal text)
    for line in open('big.txt'):
        for a, b in ngram(2, line):
            counts[pos[a]][pos[b]] += 1

    # Normalize the counts so that they become log probabilities.
    # We use log probabilities rather than straight probabilities to avoid
    # numeric underflow issues with long texts.
    # This contains a justification:
    # http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/
    # (normalize each row, i.e. by leading character)
    for i, row in enumerate(counts):
        s = float(sum(row))
        for j in range(len(row)):
            row[j] = math.log(row[j] / s)

    # Find the probability of generating a few arbitrarily chosen good and
    # bad phrases.
    # (compute the average transition probability of the good/bad classes
    # separately, to determine the threshold)
    good_probs = [avg_transition_prob(l, counts) for l in open('good.txt')]
    bad_probs = [avg_transition_prob(l, counts) for l in open('bad.txt')]

    # Assert that we actually are capable of detecting the junk.
    assert min(good_probs) > max(bad_probs)

    # And pick a threshold halfway between the worst good and best bad inputs.
    thresh = (min(good_probs) + max(bad_probs)) / 2
    pickle.dump({'mat': counts, 'thresh': thresh}, open('gib_model.pki', 'wb'))

def avg_transition_prob(line, log_prob_mat):
    """ Return the average transition prob from l through log_prob_mat. """
    log_prob = 0.0
    transition_ct = 0
    for a, b in ngram(2, line):
        log_prob += log_prob_mat[pos[a]][pos[b]]
        transition_ct += 1
    # The exponentiation translates from log probs to probs.
    return math.exp(log_prob / (transition_ct or 1))


if __name__ == '__main__':
    train()
```
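For reference, a self-contained sketch of the same pipeline with the input files (big.txt / good.txt / bad.txt) replaced by a tiny in-memory corpus. The corpus, smoothing prior, and test strings here are made up for illustration, so the absolute scores are not meaningful:

```python
import math

accepted_chars = 'abcdefghijklmnopqrstuvwxyz '
pos = {c: i for i, c in enumerate(accepted_chars)}

def bigrams(text):
    """Yield consecutive character pairs after filtering to accepted_chars."""
    filtered = [c.lower() for c in text if c.lower() in accepted_chars]
    for i in range(len(filtered) - 1):
        yield filtered[i], filtered[i + 1]

def train(corpus, prior=1):
    """Build the row-normalized log-probability matrix from a corpus."""
    k = len(accepted_chars)
    counts = [[prior] * k for _ in range(k)]
    for line in corpus:
        for a, b in bigrams(line):
            counts[pos[a]][pos[b]] += 1
    for row in counts:
        s = float(sum(row))
        for j in range(k):
            row[j] = math.log(row[j] / s)
    return counts

def avg_transition_prob(line, mat):
    """Geometric mean of the bigram transition probabilities of line."""
    log_p, n = 0.0, 0
    for a, b in bigrams(line):
        log_p += mat[pos[a]][pos[b]]
        n += 1
    return math.exp(log_p / (n or 1))

# Tiny stand-in corpus; real training should use a large text like big.txt.
corpus = ['the quick brown fox jumps over the lazy dog',
          'she sells sea shells by the sea shore'] * 200
mat = train(corpus)

print(avg_transition_prob('the sea shore', mat))  # relatively high
print(avg_transition_prob('zxcq wvjkf', mat))     # relatively low
```

Strings whose bigrams were frequent in the corpus score much higher than strings of rare transitions, which is the whole trick.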
Reference:
https://github.com/rrenaud/Gibberish-Detector
