A Brief Introduction to Keras Text Preprocessing Functions

text_to_word_sequence

Function signature:

from keras.preprocessing.text import text_to_word_sequence
text_to_word_sequence(text,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',lower=True,split=" ")

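This function splits a sentence into a list of individual words, that is, a word sequence. Its parameters:

text: the text to split.
filters: a list of characters or a string; every character in it is filtered out of the text (punctuation, for example). The default is the string shown above.
lower: a boolean; with lower=True the text is converted to lowercase.
split: a string used as the word separator. The default is a space, so the text is split wherever a space occurs. For Chinese text, a convenient approach is therefore to segment it first with a tool such as jieba, join the tokens with a separator such as a space, and then use this function to turn the result into a list.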

Example:

from keras.preprocessing.text import text_to_word_sequence
sentence = 'Near is a good name, you should always be near to someone to save'
seq = text_to_word_sequence(sentence)
print(seq)  # ['near', 'is', 'a', 'good', 'name', 'you', 'should', 'always', 'be', 'near', 'to', 'someone', 'to', 'save']
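
To make the lower, filters and split parameters concrete, here is a minimal pure-Python sketch of the same idea. It is not the Keras source: simple_word_sequence is a hypothetical helper, and the Chinese tokens stand in for the output of a segmenter such as jieba.

```python
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Hypothetical stand-in that mimics the behaviour described above."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the separator.
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Split on the separator and drop the empty strings left by repeats.
    return [token for token in text.split(split) if token]


sentence = 'Near is a good name, you should always be near to someone to save'
words = simple_word_sequence(sentence)
print(words)

# Assumed output of a Chinese word segmenter, joined with spaces first.
segmented = ['我', '喜欢', '自然语言', '处理']
joined = ' '.join(segmented)
chinese_words = simple_word_sequence(joined)
print(chinese_words)
```

Joining pre-segmented tokens with the same separator that split expects is what lets a whitespace-based splitter handle Chinese text.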

one_hot

Function signature:

from keras.preprocessing.text import one_hot
one_hot(text,n, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ")

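This function encodes a text string as a list of indices, where each index is the word's position in a vocabulary. The vocabulary size is set by n. Because the indices come from hashing, different words can end up with the same index. The other parameters are the same as above and are not described again.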

Example:

from keras.preprocessing.text import one_hot
sentence = 'Near is a good name, you should always be near to someone to save'
seq = one_hot(sentence, n=20)
print(seq)  # e.g. [6, 13, 14, 17, 4, 8, 7, 3, 18, 6, 2, 17, 2, 4]; the exact values depend on the hash and can differ between runs

This is also convenient: given text that has already been segmented into words, it encodes every word and returns the corresponding indices.
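
The index here is a hash bucket rather than a slot in a learned dictionary, which a short sketch can show. The code below assumes each word is mapped to hash(word) % (n - 1) + 1, which is my reading of the Keras implementation; sketch_one_hot is a hypothetical name, not a Keras function.

```python
def sketch_one_hot(words, n, hash_function=hash):
    """Map each word to a bucket in 1 .. n-1 (index 0 is never produced)."""
    return [hash_function(w) % (n - 1) + 1 for w in words]


words = ['near', 'is', 'a', 'good', 'name', 'you', 'should', 'always',
         'be', 'near', 'to', 'someone', 'to', 'save']

# With n=5 there are only 4 buckets for 12 distinct words,
# so several words are forced to share an index.
indices = sketch_one_hot(words, n=5)
unique_words = set(words)
unique_indices = set(indices)
print(indices)
print(len(unique_words), 'distinct words ->', len(unique_indices), 'distinct indices')
```

Choosing n well above the expected vocabulary size reduces collisions but does not rule them out.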

hashing_trick

Function signature:

hashing_trick(text,n,hash_function=None,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',lower=True,split=' ')

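The one_hot function above is really a wrapper around hashing_trick, or a special case of it: the case where hash_function is the default hash. The remaining parameters are the same. With hashing_trick you can choose the hash function yourself, for example md5.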

Example:

from keras.preprocessing.text import hashing_trick
sentence = 'Near is a good name, you should always be near to someone to save'
seq = hashing_trick(sentence, n=20, hash_function='md5')
print(seq)  # [5, 19, 14, 15, 15, 3, 13, 12, 7, 5, 6, 16, 6, 11]
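
A practical reason to pick md5 is stability: Python 3 randomizes the built-in hash of strings per process, while an md5 digest is the same on every run. The sketch below uses only hashlib and assumes the index is computed as the digest modulo (n - 1), plus 1; md5_index is a hypothetical helper, not part of Keras.

```python
import hashlib


def md5_index(word, n):
    """Stable bucket in 1 .. n-1 derived from the md5 digest of the word."""
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % (n - 1) + 1


words = ['near', 'is', 'a', 'good', 'name', 'you', 'should', 'always',
         'be', 'near', 'to', 'someone', 'to', 'save']
md5_indices = [md5_index(w, 20) for w in words]
print(md5_indices)
```

Repeated words always land on the same index, and rerunning the script gives the same list, which matters if the encoded data is saved and reused later.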

Tokenizer

Signature:

keras.preprocessing.text.Tokenizer(num_words=None,
                                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                                   lower=True,
                                   split=" ",
                                   char_level=False,
                                   oov_token=None)

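This is a class for vectorizing text. Some of its parameters were already covered above. The additional num_words parameter, passed when creating a Tokenizer, caps the size of the vocabulary: when it is set, only the most frequent words are kept (the Keras documentation specifies the top num_words - 1) and the remaining low-frequency words are dropped. If char_level is True, every character is treated as a token. oov_token stands for out-of-vocabulary: if a string is given, it is added to word_index (the word-to-index mapping) and is used in place of words that are not in the vocabulary.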

Let's look at an example:

from keras.preprocessing.text import Tokenizer

sentence1 = 'I am kira'
sentence2 = 'I am the Lord of the new world'
text = [sentence1,sentence2]

tok = Tokenizer(num_words=None)
tok.fit_on_texts(text)
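# print tok attributes
print(tok.word_counts)  # how many times each word occurs
print(tok.word_docs)  # how many documents each word appears in
print(tok.word_index)  # index of each word (the word-to-index mapping)
print(tok.document_count)  # total number of documents
# print vectorized text
print(tok.texts_to_matrix(text))  # matrix of shape (number of documents, num_words); with num_words=None the width is len(word_index) + 1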

Result:

OrderedDict([('i', 2), ('am', 2), ('kira', 1), ('the', 2), ('lord', 1), ('of', 1), ('new', 1), ('world', 1)])
{'i': 2, 'of': 1, 'am': 2, 'lord': 1, 'new': 1, 'world': 1, 'the': 1, 'kira': 1}
{'world': 8, 'i': 1, 'of': 6, 'am': 2, 'new': 7, 'lord': 5, 'the': 3, 'kira': 4}
2
[[ 0.  1.  1. ...,  0.  0.  0.]
 [ 0.  1.  1. ...,  1.  1.  1.]]

fit_on_texts is a method of Tokenizer. Attributes such as word_counts only exist after the tokenizer has been fitted on some texts.
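
As a rough illustration of what fitting computes, the following pure-Python sketch rebuilds the same attributes for the two example sentences. It is an approximation rather than the Keras code: it skips the filters step (the sentences contain no punctuation) and assumes that ties in frequency keep their order of first appearance.

```python
from collections import Counter

texts = ['I am kira', 'I am the Lord of the new world']

word_counts = Counter()  # total occurrences of each word
word_docs = Counter()    # number of documents containing each word
for text in texts:
    words = text.lower().split()
    word_counts.update(words)
    word_docs.update(set(words))

# Rank by frequency, most frequent first; sorted() is stable, so words
# with equal counts stay in order of first appearance. Indices start at 1.
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
word_index = {word: i + 1 for i, (word, _) in enumerate(ranked)}
document_count = len(texts)

print(dict(word_counts))
print(dict(word_docs))
print(word_index)
print(document_count)
```

Note that 'the' occurs twice but only in one document, which is why its word_counts and word_docs values differ.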

Other methods include texts_to_sequences(texts), texts_to_sequences_generator(texts), texts_to_matrix(texts), fit_on_sequences(sequences), and sequences_to_matrix(sequences). As the names suggest, each one maps its input to the named output.
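
The text-to-sequence and sequence-to-matrix steps can be sketched in plain Python as well. The helpers to_sequences and to_binary_matrix are hypothetical names for illustration; the sketch covers only the binary matrix mode, ignores oov_token, and assumes that words whose index is num_words or higher are skipped.

```python
word_index = {'i': 1, 'am': 2, 'the': 3, 'kira': 4,
              'lord': 5, 'of': 6, 'new': 7, 'world': 8}
texts = ['I am kira', 'I am the Lord of the new world']


def to_sequences(texts, word_index, num_words=None):
    """Replace each known word with its index; unknown words are skipped."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue  # not in the vocabulary
            if num_words is not None and idx >= num_words:
                continue  # outside the num_words cap (assumed rule)
            seq.append(idx)
        sequences.append(seq)
    return sequences


def to_binary_matrix(sequences, width):
    """One row per document; column j is 1.0 if index j occurs in it."""
    matrix = [[0.0] * width for _ in sequences]
    for row, seq in zip(matrix, sequences):
        for idx in seq:
            row[idx] = 1.0
    return matrix


sequences = to_sequences(texts, word_index)
matrix = to_binary_matrix(sequences, width=len(word_index) + 1)
capped = to_sequences(texts, word_index, num_words=4)
print(sequences)
print(matrix)
print(capped)
```

Column 0 stays empty because word indices start at 1, which is why the matrix is one column wider than the vocabulary.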

2018-02-28 00:32:52

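"Every thread has a life of its own. It grows over the years and matures day by day, until the fabric finally reveals the beauty it had long kept hidden." (Yohji Yamamoto, designer)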

posted @ 2018-02-28 00:34 毛利小九郎
