Gibberish Detection (Random String Recognition)

 

A simple approach:

1. From a large corpus of normal text, estimate a square matrix of "character -> character" conditional (transition) probabilities.

  (The example uses the 26 English letters plus the space character, 27 base characters in total, so the transition matrix is 27x27.)

 

2. Split the text to be classified into consecutive 2-grams; for each pair, look up the probability of the second character appearing after the first.

 

3. Average the log conditional probabilities over all 2-grams in the string and exponentiate (i.e. take the geometric mean of the transition probabilities). If the result falls below a threshold, the ordering of characters deviates from the normal distribution and the string is likely randomly generated (see the scoring sketch after this list).

  (The example sets the threshold to "(minimum probability of the good class + maximum probability of the bad class) / 2".)
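
To make steps 2-3 concrete, here is a minimal scoring sketch. The 3-character alphabet and its log-probability matrix are made up purely for illustration; the real model trained below uses the full 27x27 matrix.

import math

chars = 'ab '                        # toy alphabet: 'a', 'b', space
pos = {c: i for i, c in enumerate(chars)}

# Hypothetical row-normalized transition probabilities P(next | current),
# stored as logs, mirroring the trained model's matrix.
log_p = [[math.log(p) for p in row] for row in [
    [0.2, 0.5, 0.3],    # transitions out of 'a'
    [0.6, 0.1, 0.3],    # transitions out of 'b'
    [0.5, 0.5, 1e-6],   # transitions out of ' ' (double space is rare)
]]

def score(s):
    pairs = list(zip(s, s[1:]))      # consecutive 2-grams
    total = sum(log_p[pos[a]][pos[b]] for a, b in pairs)
    # exp of the mean log prob == geometric mean of the transition probs
    return math.exp(total / (len(pairs) or 1))

print(score('ab ab'))   # common transitions -> higher score (~0.44)
print(score('aa  b'))   # rare double space  -> lower score (~0.01)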

This scheme never goes on to compute a multi-hop steady-state matrix, so rather than a Markov chain (MC) it is really just simple conditional probability :)
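
For contrast, here is what that "further step" would look like: raising the one-step transition matrix to a power gives multi-step probabilities, and high powers converge to the chain's stationary distribution. The 2x2 matrix below is hypothetical, chosen only to illustrate the idea.

import numpy as np

# Toy one-step transition matrix P, where P[i][j] = P(next = j | current = i).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

P3 = np.linalg.matrix_power(P, 3)      # 3-step transition probabilities
P100 = np.linalg.matrix_power(P, 100)  # rows converge to the stationary distribution
print(P3)
print(P100)    # each row approaches [5/6, 1/6]

The detector below stops at the one-step matrix, which is exactly why it amounts to plain conditional probability.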

#coding=utf8

import math
import pickle

accepted_chars = 'abcdefghijklmnopqrstuvwxyz '

pos = dict([(char, idx) for idx, char in enumerate(accepted_chars)])

def normalize(line):
    """ Return only the subset of chars from accepted_chars.
    This helps keep the model relatively small by ignoring punctuation,
    infrequent symbols, etc. """
    return [c.lower() for c in line if c.lower() in accepted_chars]

def ngram(n, l):
    """ Return all n-grams from l after normalizing """
    filtered = normalize(l)
    for start in range(0, len(filtered) - n + 1):
        yield ''.join(filtered[start:start + n])

def train():
    """ Write a simple model as a pickle file """
    k = len(accepted_chars)
    # Assume we have seen 10 of each character pair.  This acts as a kind of
    # prior or smoothing factor.  This way, if we see a character transition
    # live that we've never observed in the past, we won't assume the entire
    # string has 0 probability.
    counts = [[10 for i in range(k)] for i in range(k)]

    # Count transitions from big text file, taken
    # from http://norvig.com/spell-correct.html
    # (i.e. count how often each character pair occurs in normal running text)
    for line in open('big.txt'):
        for a, b in ngram(2, line):
            counts[pos[a]][pos[b]] += 1

    # Normalize the counts so that they become log probabilities.
    # We use log probabilities rather than straight probabilities to avoid
    # numeric underflow issues with long texts.
    # This contains a justification:
    # http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/
    # (each row is normalized, i.e. conditioned on its leading character)
    for i, row in enumerate(counts):
        s = float(sum(row))
        for j in range(len(row)):
            row[j] = math.log(row[j] / s)

    # Find the probability of generating a few arbitrarily chosen good and
    # bad phrases.
    # (compute each class's average transition probability to set the threshold)
    good_probs = [avg_transition_prob(l, counts) for l in open('good.txt')]
    bad_probs = [avg_transition_prob(l, counts) for l in open('bad.txt')]

    # Assert that we actually are capable of detecting the junk.
    assert min(good_probs) > max(bad_probs)

    # And pick a threshold halfway between the worst good and best bad inputs.
    thresh = (min(good_probs) + max(bad_probs)) / 2
    pickle.dump({'mat': counts, 'thresh': thresh}, open('gib_model.pki', 'wb'))

def avg_transition_prob(line, log_prob_mat):
    """ Return the average transition prob from line through log_prob_mat. """
    log_prob = 0.0
    transition_ct = 0
    for a, b in ngram(2, line):
        log_prob += log_prob_mat[pos[a]][pos[b]]
        transition_ct += 1
    # The exponentiation translates from log probs to probs.
    return math.exp(log_prob / (transition_ct or 1))


if __name__ == '__main__':
    train()
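
The listing above only trains the model. For completeness, a minimal inference sketch (not part of the original script): it assumes train() has already written gib_model.pki and reuses pos and avg_transition_prob defined above.

import pickle

with open('gib_model.pki', 'rb') as f:
    model = pickle.load(f)
mat, thresh = model['mat'], model['thresh']

# Example strings; anything scoring above the trained threshold is
# treated as normal text rather than gibberish.
for s in ['my name is rob', 'qwertyuiop asdf', 'zxcvwerjasc']:
    p = avg_transition_prob(s, mat)
    print(s, round(p, 4), p > thresh)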

References:

https://github.com/rrenaud/Gibberish-Detector

Using machine learning to identify randomly generated C&C domain names (用机器学习识别随机生成的C&C域名)
Gibberish detection (乱语识别)

 
