5. Feature Extraction

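Many feature extraction techniques can be applied to text data, but before diving in, consider what a feature actually is. Why do we need features, and how do they help? A dataset usually contains many data points. Typically each row is an observation and each column is a feature or attribute of the dataset, with specific values for every row. In machine learning terminology, a feature is a unique, measurable attribute or property of each observation or data point in a dataset. Features are usually numeric in nature: they can be absolute values, or categorical features in which each category in a list is encoded as a binary feature, a process known as one-hot encoding. Extracting and selecting features is both a science and an art, and this process is called feature extraction or feature engineering.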
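To make the one-hot encoding mentioned above concrete, here is a minimal sketch in plain Python. The `one_hot` helper and the colour categories are illustrative assumptions and do not come from the original text.

```python
def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]

# Hypothetical categorical feature with three possible values.
categories = ['blue', 'green', 'red']
encoded = [one_hot(value, categories) for value in ['red', 'blue']]
print(encoded)  # [[0, 0, 1], [1, 0, 0]]
```

Each category gets its own binary column, so a categorical value becomes a numeric vector that a learning algorithm can consume.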

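Usually, the extracted features are fed into machine learning algorithms, which learn patterns that can then be applied to the features of new data to gain insights. Because the core of every such algorithm is a mathematical optimization that minimizes error and loss while learning patterns from the observations, these algorithms generally expect features in the form of numeric vectors. Text data therefore brings an added challenge: working out how to transform the text and extract numeric features from it.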

Let us now look at some feature extraction concepts and techniques for text data.

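The vector space model is a very useful concept and model for working with text data, and it is widely used in information retrieval and document ranking. Also called the term vector model, it is defined as a mathematical or algebraic model for transforming and representing text documents as numeric vectors of specific terms, where the terms form the dimensions of the vectors. Mathematically it is defined as follows. Suppose we have a document D in a document vector space VS. The number of dimensions (columns) of every document is the total number of distinct terms or words across all documents in the vector space.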

The vector space can therefore be written as:

VS = {W₁, W₂, ..., W_n}

where n is the number of distinct words across all documents. Document D can now be represented in this vector space as:

D = {w_D1, w_D2, ..., w_Dn}

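where w_Dn is the weight of the n-th word in document D. This weight is a numeric value and can represent anything: the frequency of the word in the document, its average frequency of occurrence, or its TF-IDF weight.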
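The definition above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: documents are split on whitespace, the weight is the raw term count, and the `to_vector` helper is a hypothetical name.

```python
from collections import Counter

documents = ['the sky is blue', 'i love blue cheese']

# The vector space VS: every distinct word across all documents, sorted.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def to_vector(doc, vocabulary):
    """Represent `doc` as raw term counts, one dimension per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc, vocabulary) for doc in documents]
print(vocabulary)  # ['blue', 'cheese', 'i', 'is', 'love', 'sky', 'the']
print(vectors)     # [[1, 0, 0, 1, 0, 1, 1], [1, 1, 1, 0, 1, 0, 0]]
```

Every document ends up with the same number of dimensions, n, regardless of its length.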

We will cover and implement the following feature extraction techniques:

Bag of Words model.
TF-IDF model.
Advanced word vectorization models.

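A key point to remember about feature extraction is that once you have built a feature extractor from some transformations and mathematical operations, you must reuse that same fitted process when extracting features from new documents, rather than rebuilding the whole algorithm on the new documents. Each technique is illustrated with an example. Note that the examples use libraries such as nltk, gensim, and scikit-learn.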

The feature extraction implementation is split into two modules.

feature_extractors.py

# -*- coding: utf-8 -*-
"""
Created on Sat Aug 27 04:03:12 2016
@author: DIP
"""
 
from sklearn.feature_extraction.text import CountVectorizer
 
def bow_extractor(corpus, ngram_range=(1,1)):
     
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features
     
     
from sklearn.feature_extraction.text import TfidfTransformer
 
def tfidf_transformer(bow_matrix):
     
    transformer = TfidfTransformer(norm='l2',
                                   smooth_idf=True,
                                   use_idf=True)
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix
     
     
from sklearn.feature_extraction.text import TfidfVectorizer
 
def tfidf_extractor(corpus, ngram_range=(1,1)):
     
    vectorizer = TfidfVectorizer(min_df=1,
                                 norm='l2',
                                 smooth_idf=True,
                                 use_idf=True,
                                 ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features
     
 
import numpy as np   
     
def average_word_vectors(words, model, vocabulary, num_features):
     
    feature_vector = np.zeros((num_features,),dtype="float64")
    nwords = 0.
     
    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            feature_vector = np.add(feature_vector, model[word])
     
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
         
    return feature_vector
     
    
def averaged_word_vectorizer(corpus, model, num_features):
    vocabulary = set(model.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)
     
     
def tfidf_wtd_avg_word_vectors(words, tfidf_vector, tfidf_vocabulary, model, num_features):
     
    # use a membership test: column index 0 is valid but falsy
    word_tfidfs = [tfidf_vector[0, tfidf_vocabulary[word]]
                   if word in tfidf_vocabulary
                   else 0 for word in words]
    word_tfidf_map = {word:tfidf_val for word, tfidf_val in zip(words, word_tfidfs)}
     
    feature_vector = np.zeros((num_features,),dtype="float64")
    vocabulary = set(model.index2word)
    wts = 0.
    for word in words:
        if word in vocabulary:
            word_vector = model[word]
            weighted_word_vector = word_tfidf_map[word] * word_vector
            wts = wts + word_tfidf_map[word]
            feature_vector = np.add(feature_vector, weighted_word_vector)
    if wts:
        feature_vector = np.divide(feature_vector, wts)
         
    return feature_vector
     
def tfidf_weighted_averaged_word_vectorizer(corpus, tfidf_vectors,
                                   tfidf_vocabulary, model, num_features):
                                        
    docs_tfidfs = [(doc, doc_tfidf)
                   for doc, doc_tfidf
                   in zip(corpus, tfidf_vectors)]
    features = [tfidf_wtd_avg_word_vectors(tokenized_sentence, tfidf, tfidf_vocabulary,
                                   model, num_features)
                    for tokenized_sentence, tfidf in docs_tfidfs]
    return np.array(features)

feature_extractors.py contains the generic functions, which are also used later when building classifiers.

feature_extraction_demo.py

# -*- coding: utf-8 -*-
"""
Created on Thu Aug 25 00:09:56 2016
@author: DIP
"""
 
CORPUS = [
'the sky is blue',
'sky is blue and sky is beautiful',
'the beautiful sky is so blue',
'i love blue cheese'
]
 
new_doc = ['loving this blue sky today']
 
import pandas as pd
 
def display_features(features, feature_names):
    df = pd.DataFrame(data=features,
                      columns=feature_names)
    print(df)
 
 
from feature_extractors import bow_extractor   
     
bow_vectorizer, bow_features = bow_extractor(CORPUS)
features = bow_features.todense()
print(features)
 
new_doc_features = bow_vectorizer.transform(new_doc)
new_doc_features = new_doc_features.todense()
print(new_doc_features)
 
feature_names = bow_vectorizer.get_feature_names()
print(feature_names)
 
display_features(features, feature_names)
display_features(new_doc_features, feature_names)
 
 
import numpy as np
from feature_extractors import tfidf_transformer
feature_names = bow_vectorizer.get_feature_names()
     
tfidf_trans, tdidf_features = tfidf_transformer(bow_features)
features = np.round(tdidf_features.todense(), 2)
display_features(features, feature_names)
 
nd_tfidf = tfidf_trans.transform(new_doc_features)
nd_features = np.round(nd_tfidf.todense(), 2)
display_features(nd_features, feature_names)
 
 
 
import scipy.sparse as sp
from numpy.linalg import norm
feature_names = bow_vectorizer.get_feature_names()
 
# compute term frequency
tf = bow_features.todense()
tf = np.array(tf, dtype='float64')
 
# show term frequencies
display_features(tf, feature_names)
 
# build the document frequency matrix
df = np.diff(sp.csc_matrix(bow_features, copy=True).indptr)
df = 1 + df # to smoothen idf later
 
# show document frequencies
display_features([df], feature_names)
 
# compute inverse document frequencies
total_docs = 1 + len(CORPUS)
idf = 1.0 + np.log(float(total_docs) / df)
 
# show inverse document frequencies
display_features([np.round(idf, 2)], feature_names)
 
# compute idf diagonal matrix 
total_features = bow_features.shape[1]
idf_diag = sp.spdiags(idf, diags=0, m=total_features, n=total_features)
idf = idf_diag.todense()
 
# print the idf diagonal matrix
print(np.round(idf, 2))
 
# compute tfidf feature matrix
tfidf = tf * idf
 
# show tfidf feature matrix
display_features(np.round(tfidf, 2), feature_names)
 
# compute L2 norms
norms = norm(tfidf, axis=1)
 
# print norms for each document
print(np.round(norms, 2))
 
# compute normalized tfidf
norm_tfidf = tfidf / norms[:, None]
 
# show final tfidf feature matrix
display_features(np.round(norm_tfidf, 2), feature_names)
  
 
# compute new doc term freqs from bow freqs
nd_tf = new_doc_features
nd_tf = np.array(nd_tf, dtype='float64')
 
# compute tfidf using idf matrix from train corpus
nd_tfidf = nd_tf*idf
nd_norms = norm(nd_tfidf, axis=1)
norm_nd_tfidf = nd_tfidf / nd_norms[:, None]
 
# show new_doc tfidf feature vector
display_features(np.round(norm_nd_tfidf, 2), feature_names)
 
 
from feature_extractors import tfidf_extractor
     
tfidf_vectorizer, tdidf_features = tfidf_extractor(CORPUS)
display_features(np.round(tdidf_features.todense(), 2), feature_names)
 
nd_tfidf = tfidf_vectorizer.transform(new_doc)
display_features(np.round(nd_tfidf.todense(), 2), feature_names)   
 
 
import gensim
import nltk
 
TOKENIZED_CORPUS = [nltk.word_tokenize(sentence)
                    for sentence in CORPUS]
tokenized_new_doc = [nltk.word_tokenize(sentence)
                    for sentence in new_doc]                       
 
model = gensim.models.Word2Vec(TOKENIZED_CORPUS,
                               size=10,
                               window=10,
                               min_count=2,
                               sample=1e-3)
 
 
from feature_extractors import averaged_word_vectorizer
 
 
avg_word_vec_features = averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
                                                 model=model,
                                                 num_features=10)
print(np.round(avg_word_vec_features, 3))
 
nd_avg_word_vec_features = averaged_word_vectorizer(corpus=tokenized_new_doc,
                                                    model=model,
                                                    num_features=10)
print(np.round(nd_avg_word_vec_features, 3))
 
               
from feature_extractors import tfidf_weighted_averaged_word_vectorizer
 
corpus_tfidf = tdidf_features
vocab = tfidf_vectorizer.vocabulary_
wt_tfidf_word_vec_features = tfidf_weighted_averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
                                                                     tfidf_vectors=corpus_tfidf,
                                                                     tfidf_vocabulary=vocab,
                                                                     model=model,
                                                                     num_features=10)
print(np.round(wt_tfidf_word_vec_features, 3))
 
nd_wt_tfidf_word_vec_features = tfidf_weighted_averaged_word_vectorizer(corpus=tokenized_new_doc,
                                                                     tfidf_vectors=nd_tfidf,
                                                                     tfidf_vocabulary=vocab,
                                                                     model=model,
                                                                     num_features=10)
print(np.round(nd_wt_tfidf_word_vec_features, 3))

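feature_extraction_demo.py applies the same functions to some practical examples to show how each technique works. We extract features from the documents in the CORPUS variable and build several vectorization models from them. To show how features are extracted from a new document (as part of a test dataset), the snippets below use the separate document in the new_doc variable.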

CORPUS = [
'the sky is blue',
'sky is blue and sky is beautiful',
'the beautiful sky is so blue',
'i love blue cheese'
]
 
new_doc = ['loving this blue sky today']

Bag of Words Model

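The Bag of Words model is perhaps the simplest yet one of the most effective techniques for extracting features from text documents. Its essence is to convert each text document into a vector that represents how often every distinct word in the document space occurs in that document. In terms of the mathematical notation introduced earlier, the example vector is D, and the weight of each word equals its frequency in the document.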

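Interestingly, the same model can be built for n-gram frequencies instead of single-word frequencies. This is the n-gram Bag of Words model, which counts how often each distinct n-gram occurs in each document.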
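As a quick illustration of what an n-gram is, the sketch below lists the bigrams and trigrams of one sentence. The `ngrams` helper is a hypothetical name, and whitespace tokenization is an assumption made for brevity.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'the sky is blue'.split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)   # ['the sky', 'sky is', 'is blue']
print(trigrams)  # ['the sky is', 'sky is blue']
```

In an n-gram Bag of Words model, each of these sequences becomes its own feature column alongside the single words.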

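The following snippet defines a function that implements Bag of Words feature extraction. It also accepts an ngram_range parameter so that n-grams can be used as features.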

from sklearn.feature_extraction.text import CountVectorizer
 
def bow_extractor(corpus, ngram_range=(1,1)):
     
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

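The function above uses the CountVectorizer class, whose detailed API documentation is at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html. The class has a range of parameters that you can tune depending on the kind of features you want to extract. We use its default configuration, which is sufficient for most scenarios; setting min_df to 1 means that every term that occurs in at least one document of the corpus is considered. You can set ngram_range to other values, such as (1, 3), which builds a vector space containing all unigrams, bigrams, and trigrams. The following snippet shows the function in action on the sample corpus of four training documents and one test document.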

In [57]: bow_vectorizer, bow_features = bow_extractor(CORPUS)
 
In [58]: features = bow_features.todense()
 
In [59]: print(features)
[[0 0 1 0 1 0 1 0 1]
 [1 1 1 0 2 0 2 0 0]
 [0 1 1 0 1 0 1 1 1]
 [0 0 1 1 0 1 0 0 0]]

In [62]: new_doc_features = bow_vectorizer.transform(new_doc)
 
In [63]: new_doc_features = new_doc_features.todense()
 
In [64]: print(new_doc_features)
[[0 0 1 0 0 0 1 0 0]]

In [65]: feature_names = bow_vectorizer.get_feature_names()
 
In [66]: print(feature_names)
['and', 'beautiful', 'blue', 'cheese', 'is', 'love', 'sky', 'so', 'the']

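The output above shows how each document is converted to a vector. Each row represents one document, and we do the same for both the training corpus and the new document. We build the vectorizer from the documents in CORPUS, use it to extract their features, and then reuse that same fitted vectorizer to extract features from a completely new document. Each column corresponds to a word listed in feature_names, and each value is the frequency of that word in the document. (In scikit-learn 1.0 and later this method is called get_feature_names_out(); get_feature_names() was removed in 1.2.) The raw matrix can be hard to read at first, so the following function helps make the feature vectors easier to understand.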

import pandas as pd
 
def display_features(features, feature_names):
    df = pd.DataFrame(data=features,
                      columns=feature_names)
    print(df)

We can now pass the feature names and vectors to this function and view the feature matrix in a more readable form:

In [71]: display_features(features, feature_names)
   and  beautiful  blue  cheese  is  love  sky  so  the
0    0          0     1       0   1     0    1   0    1
1    1          1     1       0   2     0    2   0    0
2    0          1     1       0   1     0    1   1    1
3    0          0     1       1   0     1    0   0    0
  
In [72]: display_features(new_doc_features, feature_names)
   and  beautiful  blue  cheese  is  love  sky  so  the
0    0          0     1       0   0     0    1   0    0

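This makes things much clearer. Consider the second document of CORPUS, shown in row 1 of the first table. For the sentence 'sky is blue and sky is beautiful', the feature sky has the value 2, beautiful has the value 1, and so on. Words that do not occur in the document have the value 0. Note that for the new document new_doc there are no features for the words today, this, or loving, because these words are not in the training vocabulary. As mentioned earlier, the feature extraction process, the model, and the vocabulary are always based on the training data and are not changed or influenced by new documents, which is what we rely on later when predicting on test data or other corpora. You may have guessed the reason: a model is always trained on the training data and is not affected by new documents unless it is rebuilt. The features of this model are therefore always limited to the document vector space of the training corpus.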
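The behaviour described above can be imitated in plain Python. This is a rough sketch, not the scikit-learn implementation: it assumes that tokens are lowercase runs of two or more word characters (which is why the single-letter word i gets no column), and the `tokenize` and `transform` helpers are hypothetical names.

```python
import re
from collections import Counter

TOKEN = re.compile(r'\b\w\w+\b')  # assumed rule: tokens of 2+ word characters

def tokenize(doc):
    return TOKEN.findall(doc.lower())

corpus = ['the sky is blue',
          'sky is blue and sky is beautiful',
          'the beautiful sky is so blue',
          'i love blue cheese']

# The vocabulary is fixed once, from the training corpus only.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

def transform(doc):
    """Count only the tokens that already exist in the training vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]

new_vector = transform('loving this blue sky today')
print(vocabulary)
print(new_vector)  # [0, 0, 1, 0, 0, 0, 1, 0, 0]
```

The words loving, this, and today are silently dropped because no column exists for them; only blue and sky contribute to the new document's vector.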

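We have now seen how to extract meaningful vector-based features from text data, something that may have seemed impossible before. Try the function above with ngram_range set to (1, 3) and look at the output.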

In [87]: bow_vectorizer, bow_features = bow_extractor(CORPUS, (1,3))
 
In [88]: features = bow_features.todense()
 
In [89]: new_doc_features = bow_vectorizer.transform(new_doc)
 
In [90]: new_doc_features = new_doc_features.todense()
 
In [91]: feature_names = bow_vectorizer.get_feature_names()
 
In [92]: display_features(features, feature_names)
    and  and sky  and sky is  beautiful  beautiful sky  beautiful sky is     ...      so blue  the  the beautiful  the beautiful sky  the sky  the sky is
0    0        0           0          0              0                 0     ...            0    1              0                  0        1           1
1    1        1           1          1              0                 0     ...            0    0              0                  0        0           0
2    0        0           0          1              1                 1     ...            1    1              1                  1        0           0
3    0        0           0          0              0                 0     ...            0    0              0                  0        0           0
 
[4 rows x 32 columns]
 
 
In [93]: display_features(new_doc_features, feature_names)
   and  and sky  and sky is  beautiful  beautiful sky  beautiful sky is     ...      so blue  the  the beautiful  the beautiful sky  the sky  the sky is
0    0        0           0          0              0                 0     ...            0    0              0                  0        0           0
 
[1 rows x 32 columns]

TF-IDF Model

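The Bag of Words model works well, but its vectors depend entirely on the absolute frequencies of words. This has a potential drawback: words that occur frequently across all documents in the corpus get high counts and tend to overshadow words that occur less often but are more meaningful and effective for telling documents apart. This is where TF-IDF comes in. TF-IDF stands for Term Frequency-Inverse Document Frequency and combines two metrics: term frequency and inverse document frequency. The technique was originally developed as a metric for ranking functions that order search engine results for user queries, and it has since become a standard part of information retrieval and text feature extraction.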

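Let us now define TF-IDF formally and look at its mathematical representation before implementing it. Mathematically, TF-IDF is the product of two metrics and can be written as tfidf = tf × idf, where term frequency (tf) and inverse document frequency (idf) are the two metrics.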

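Term frequency, denoted tf, is what the Bag of Words model computes. The term frequency of a word in a document is the raw number of times the word occurs in that document. Mathematically, tf(w, D) = f_wD, where f_wD is the frequency of word w in document D. Other representations of term frequency exist, for example a binary indicator of whether the term occurs in the document at all. Sometimes the raw frequency is also normalized with a logarithm or by averaging. Our implementation uses the raw frequency.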

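Inverse document frequency, denoted idf, is the inverse of the document frequency of each term. It is computed by dividing the total number of documents in the corpus by the document frequency of the term and then applying a logarithm to scale the result. In this implementation we add 1 to the document frequency of every term, as if each term in the vocabulary occurred in at least one more document; this smooths the inverse document frequencies and prevents division by zero. The total document count is increased by 1 for the same reason. We also add 1 to the result of the idf computation so that terms whose idf would otherwise be 0 are not ignored. Mathematically, this implementation of idf is:

idf(t) = 1 + log((1 + C) / (1 + df(t)))

where idf(t) is the idf of term t, C is the total number of documents in the corpus, and df(t) is the number of documents that contain term t.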
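As a worked example, the sketch below evaluates the smoothed form idf(t) = 1 + log((1 + C) / (1 + df(t))), which is the form the implementation later in this section uses, for a few terms of the sample corpus. The `smoothed_idf` helper is a hypothetical name, and the document frequencies are counted by hand from CORPUS.

```python
import math

total_docs = 4  # C: number of documents in CORPUS

# Number of documents containing each term, counted by hand from CORPUS.
doc_freq = {'blue': 4, 'sky': 3, 'the': 2, 'cheese': 1}

def smoothed_idf(df_t, total_docs):
    """idf(t) = 1 + log((1 + C) / (1 + df(t))), using the natural logarithm."""
    return 1.0 + math.log((1.0 + total_docs) / (1.0 + df_t))

idf = {term: round(smoothed_idf(df_t, total_docs), 2)
       for term, df_t in doc_freq.items()}
print(idf)  # {'blue': 1.0, 'sky': 1.22, 'the': 1.51, 'cheese': 1.92}
```

A term that appears in every document (blue) gets the minimum weight of 1.0, while a rare term (cheese) gets the highest weight.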

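The term frequency-inverse document frequency is therefore obtained by multiplying the two metrics. The final TF-IDF metric we use is a normalized version of the tfidf matrix, which is the product of tf and idf. We normalize by dividing each document's tfidf vector by its L2 norm, also known as the Euclidean norm, which is the square root of the sum of the squared tfidf weights of the terms. Mathematically, the final tfidf feature vector is:

tfidf = tfidf / ||tfidf||

where ||tfidf|| is the Euclidean (L2) norm of the tfidf matrix.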
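To see the normalization on real numbers, the following sketch normalizes the tfidf weights of the first sample document, 'the sky is blue'. It assumes the smoothed idf form 1 + log((1 + C) / (1 + df(t))) with C = 4 and a term frequency of 1 for each of the four words.

```python
import math

# Unnormalized tfidf weights of 'the sky is blue' (tf = 1 for every term).
tfidf = {
    'blue': 1.0 + math.log(5 / 5),  # appears in 4 of 4 documents
    'is': 1.0 + math.log(5 / 4),    # appears in 3 of 4 documents
    'sky': 1.0 + math.log(5 / 4),   # appears in 3 of 4 documents
    'the': 1.0 + math.log(5 / 3),   # appears in 2 of 4 documents
}

# L2 (Euclidean) norm: square root of the sum of squared weights.
l2_norm = math.sqrt(sum(weight * weight for weight in tfidf.values()))
normalized = {term: round(weight / l2_norm, 2) for term, weight in tfidf.items()}

print(round(l2_norm, 2))  # 2.5
print(normalized)         # {'blue': 0.4, 'is': 0.49, 'sky': 0.49, 'the': 0.6}
```

After normalization every document vector has unit length, so long and short documents become directly comparable.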

The following snippet obtains tfidf-based feature vectors, assuming we already have the Bag of Words feature vectors from the previous section:

from sklearn.feature_extraction.text import TfidfTransformer
 
def tfidf_transformer(bow_matrix):
     
    transformer = TfidfTransformer(norm='l2',
                                   smooth_idf=True,
                                   use_idf=True)
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix

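As you can see, we use the L2 norm option in the parameters and smooth the idf values by adding weight to them, so that terms whose idf might otherwise be 0 are not ignored. The following snippet shows the function in action: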

import numpy as np
from feature_extractors import tfidf_transformer
feature_names = bow_vectorizer.get_feature_names()

In [100]: tfidf_trans, tdidf_features = tfidf_transformer(bow_features)
 
In [101]: features = np.round(tdidf_features.todense(), 2)
 
In [102]: display_features(features, feature_names)
    and  beautiful  blue  cheese    is  love   sky    so   the
0  0.00       0.00  0.40    0.00  0.49  0.00  0.49  0.00  0.60
1  0.44       0.35  0.23    0.00  0.56  0.00  0.56  0.00  0.00
2  0.00       0.43  0.29    0.00  0.35  0.00  0.35  0.55  0.43
3  0.00       0.00  0.35    0.66  0.00  0.66  0.00  0.00  0.00

In [103]: nd_tfidf = tfidf_trans.transform(new_doc_features)
 
In [104]: nd_features = np.round(nd_tfidf.todense(), 2)
 
In [105]: display_features(nd_features, feature_names)
   and  beautiful  blue  cheese   is  love   sky   so  the
0  0.0        0.0  0.63     0.0  0.0   0.0  0.77  0.0  0.0

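The output above shows the tfidf feature vectors for all the sample documents. The TfidfTransformer class computes the tfidf values for each document according to the equations described earlier.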

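Now let us look at how this class works internally and see how the equations described earlier can be implemented to compute tfidf-based feature vectors. We load the necessary dependencies and compute the term frequencies (TF) of the sample corpus by reusing the Bag of Words features, which serve as the term frequencies of the training corpus CORPUS.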

import scipy.sparse as sp
from numpy.linalg import norm
feature_names = bow_vectorizer.get_feature_names()

In [113]: tf = bow_features.todense()
 
In [114]: tf = np.array(tf, dtype='float64')
 
In [115]: display_features(tf, feature_names)
   and  beautiful  blue  cheese   is  love  sky   so  the
0  0.0        0.0   1.0     0.0  1.0   0.0  1.0  0.0  1.0
1  1.0        1.0   1.0     0.0  2.0   0.0  2.0  0.0  0.0
2  0.0        1.0   1.0     0.0  1.0   0.0  1.0  1.0  1.0
3  0.0        0.0   1.0     1.0  0.0   1.0  0.0  0.0  0.0

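Next we compute the document frequency (DF) of each term, based on the number of documents in which the term occurs. The following snippet shows how to obtain the DF from the Bag of Words feature matrix.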

In [120]: df = np.diff(sp.csc_matrix(bow_features, copy=True).indptr)
 
In [121]: df = 1 + df
 
In [122]: display_features([df], feature_names)
   and  beautiful  blue  cheese  is  love  sky  so  the
0    2          3     5       2   4     2    4   2    3

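The output above shows the document frequency (DF) of each term, which you can verify against the documents in CORPUS. Remember that we added 1 to each document frequency, as if there were one extra document containing every term once, to smooth the idf values and avoid division by zero. So if you check CORPUS, you will see that blue occurs in 4 (+1) documents and sky in 3 (+1) documents, where the (+1) comes from the smoothing.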
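The document frequencies can be double-checked without scipy by counting, for each term, the documents that contain it. This is a minimal sketch that assumes whitespace tokenization and copies the vocabulary from the feature names printed earlier.

```python
corpus = ['the sky is blue',
          'sky is blue and sky is beautiful',
          'the beautiful sky is so blue',
          'i love blue cheese']

vocabulary = ['and', 'beautiful', 'blue', 'cheese', 'is', 'love', 'sky', 'so', 'the']

# Document frequency: number of documents that contain the term at least once.
df = [sum(term in doc.split() for doc in corpus) for term in vocabulary]

# Add 1 to every count, mirroring the smoothing step above.
smoothed_df = [count + 1 for count in df]
print(smoothed_df)  # [2, 3, 5, 2, 4, 2, 4, 2, 3]
```

Note that sky occurs twice in the second document but still counts only once there, because document frequency counts documents rather than occurrences.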

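Now that we have the document frequencies, we can compute the inverse document frequency (idf) with the formula given earlier. Remember to add 1 to the total number of documents in the corpus, because the smoothing assumed one extra document that contains every term once.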

In [123]: total_docs = 1 + len(CORPUS)
 
In [124]: idf = 1.0 + np.log(float(total_docs) / df)
 
In [125]: display_features([np.round(idf, 2)], feature_names)
    and  beautiful  blue  cheese    is  love   sky    so   the
0  1.92       1.51   1.0    1.92  1.22  1.92  1.22  1.92  1.51

In [132]: total_features = bow_features.shape[1]
 
In [133]: idf_diag = sp.spdiags(idf, diags=0, m=total_features, n=total_features)
 
In [134]: idf = idf_diag.todense()
 
In [135]: print(np.round(idf, 2))
[[1.92 0.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.   1.51 0.   0.   0.   0.   0.   0.   0.  ]
 [0.   0.   1.   0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   1.92 0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.   1.22 0.   0.   0.   0.  ]
 [0.   0.   0.   0.   0.   1.92 0.   0.   0.  ]
 [0.   0.   0.   0.   0.   0.   1.22 0.   0.  ]
 [0.   0.   0.   0.   0.   0.   0.   1.92 0.  ]
 [0.   0.   0.   0.   0.   0.   0.   0.   1.51]]

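We now have the idf values computed from the formula and have converted them into a diagonal matrix, which is very useful when multiplying with the term frequency matrix.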
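The reason a diagonal matrix is convenient is that multiplying a term frequency matrix by it scales each column by the matching idf value. The sketch below checks this with small hypothetical numbers; `matmul` is an illustrative helper, not a library function.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

tf = [[1.0, 2.0, 0.0],
      [0.0, 1.0, 1.0]]   # hypothetical term frequencies (2 docs x 3 terms)
idf = [1.0, 1.5, 2.0]    # hypothetical idf value per term

# Put the idf values on the diagonal of an otherwise zero matrix.
idf_diag = [[idf[i] if i == j else 0.0 for j in range(len(idf))]
            for i in range(len(idf))]

via_matmul = matmul(tf, idf_diag)
via_scaling = [[value * weight for value, weight in zip(row, idf)] for row in tf]

print(via_matmul)  # [[1.0, 3.0, 0.0], [0.0, 1.5, 2.0]]
```

Both routes give the same matrix, so the matrix product is just a compact way of weighting every term count by its idf.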

Now that we have tf and idf, we can compute the tfidf feature matrix with a matrix multiplication, as the following code shows:

In [136]: tfidf = tf * idf
 
In [137]: display_features(np.round(tfidf, 2), feature_names)
    and  beautiful  blue  cheese    is  love   sky    so   the
0  0.00       0.00   1.0    0.00  1.22  0.00  1.22  0.00  1.51
1  1.92       1.51   1.0    0.00  2.45  0.00  2.45  0.00  0.00
2  0.00       1.51   1.0    0.00  1.22  0.00  1.22  1.92  1.51
3  0.00       0.00   1.0    1.92  0.00  1.92  0.00  0.00  0.00

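We now have the tfidf feature matrix, but we are not done yet. If you recall the equation described earlier, we still need to divide by the L2 norm. The following snippet computes the tfidf norm of each document and divides the tfidf weights by it to obtain the final tfidf matrix: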

In [138]: norms = norm(tfidf, axis=1)
 
In [139]: print(np.round(norms, 2))
[2.5  4.35 3.5  2.89]
 
In [140]: norm_tfidf = tfidf / norms[:, None]
 
In [141]: display_features(np.round(norm_tfidf, 2), feature_names)
    and  beautiful  blue  cheese    is  love   sky    so   the
0  0.00       0.00  0.40    0.00  0.49  0.00  0.49  0.00  0.60
1  0.44       0.35  0.23    0.00  0.56  0.00  0.56  0.00  0.00
2  0.00       0.43  0.29    0.00  0.35  0.00  0.35  0.55  0.43
3  0.00       0.00  0.35    0.66  0.00  0.66  0.00  0.00  0.00

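Compare the tfidf feature matrix obtained above for the documents in CORPUS with the one obtained earlier using TfidfTransformer. They are identical, which verifies that our mathematical implementation is correct. In fact, scikit-learn's TfidfTransformer uses the same mathematics, with some optimizations behind the scenes. Now suppose we want to compute the tfidf-based feature vector of the new document new_doc. We can do so with the following snippet.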

We reuse the Bag of Words vector new_doc_features as the term frequencies:

In [197]: # compute new doc term freqs from bow freqs
 
In [198]: nd_tf = new_doc_features
 
In [199]: nd_tf = np.array(nd_tf, dtype='float64')
 
In [200]: # compute tfidf using idf matrix from train corpus
 
In [201]: nd_tfidf = nd_tf*idf
 
In [202]: nd_norms = norm(nd_tfidf, axis=1)
 
In [203]: norm_nd_tfidf = nd_tfidf / nd_norms[:, None]
 
In [204]: # show new_doc tfidf feature vector
 
In [205]: display_features(np.round(norm_nd_tfidf, 2), feature_names)
   and  beautiful  blue  cheese   is  love   sky   so  the
0  0.0        0.0  0.63     0.0  0.0   0.0  0.77  0.0  0.0

The output above shows the tfidf-based feature vector of new_doc, which is the same as the one computed by TfidfTransformer.
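The two non-zero values in that vector can be reproduced by hand. The sketch below assumes the smoothed idf form 1 + log((1 + C) / (1 + df(t))) with the document frequencies of the training corpus; the key point is that the idf values come from the training data, not from the new document.

```python
import math

# idf values learned from the training corpus (C = 4 documents).
train_idf = {'blue': 1.0 + math.log(5 / 5),  # blue is in 4 training documents
             'sky': 1.0 + math.log(5 / 4)}   # sky is in 3 training documents

# Term counts of 'loving this blue sky today'; the other three words
# are not in the training vocabulary and contribute nothing.
new_doc_tf = {'blue': 1, 'sky': 1}

weights = {term: new_doc_tf[term] * train_idf[term] for term in new_doc_tf}
norm = math.sqrt(sum(weight * weight for weight in weights.values()))
new_doc_tfidf = {term: round(weight / norm, 2) for term, weight in weights.items()}
print(new_doc_tfidf)  # {'blue': 0.63, 'sky': 0.77}
```

Only the normalization is specific to the new document; the vocabulary and the idf weights are reused unchanged from training.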

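Now that we understand how the tfidf computation works internally, we implement a generic function that computes tfidf-based feature vectors directly from raw documents. The following code does exactly that.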

from sklearn.feature_extraction.text import TfidfVectorizer
 
def tfidf_extractor(corpus, ngram_range=(1,1)):
    vectorizer = TfidfVectorizer(min_df=1,
                                 norm='l2',
                                 smooth_idf=True,
                                 use_idf=True,
                                 ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

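The function above uses TfidfVectorizer directly. It takes the raw documents as input, computes term frequencies and inverse document frequencies internally, and produces the tfidf vectors in one step, so there is no need to compute Bag of Words term frequencies with CountVectorizer first. It also supports adding n-grams to the feature vectors. The following snippet shows the function in action: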

In [212]: tfidf_vectorizer, tdidf_features = tfidf_extractor(CORPUS)
 
In [213]: display_features(np.round(tdidf_features.todense(), 2), feature_names)
    and  beautiful  blue  cheese    is  love   sky    so   the
0  0.00       0.00  0.40    0.00  0.49  0.00  0.49  0.00  0.60
1  0.44       0.35  0.23    0.00  0.56  0.00  0.56  0.00  0.00
2  0.00       0.43  0.29    0.00  0.35  0.00  0.35  0.55  0.43
3  0.00       0.00  0.35    0.66  0.00  0.66  0.00  0.00  0.00

In [217]: nd_tfidf = tfidf_vectorizer.transform(new_doc)
 
In [218]: display_features(np.round(nd_tfidf.todense(), 2), feature_names)
   and  beautiful  blue  cheese   is  love   sky   so  the
0  0.0        0.0  0.63     0.0  0.0   0.0  0.77  0.0  0.0

The output shows that the tfidf feature vectors match the ones obtained earlier. This concludes our discussion of feature extraction with tfidf.

Advanced Word Vectorization Models

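There are various ways of building more advanced word vectorization models for extracting features from documents. Here we discuss some that are based on Google's popular word2vec algorithm. The word2vec model, released by Google in 2013, is a neural-network-based implementation that learns distributed vector representations of words using two architectures: CBOW (Continuous Bag of Words) and skip-gram. Compared with other neural network implementations it runs faster, and it does not need manually created labels to learn meaningful representations of words.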

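Our implementation uses the gensim library, which provides a Python implementation of word2vec with several high-level interfaces that make model building very easy. The basic idea is to provide a corpus of documents as input and get word vector features as output. Internally, the model builds a vocabulary from the input documents and learns vector representations of the words using the techniques mentioned above. Once training is complete, we have a model that can be used to obtain the vector of any word in the vocabulary. Word vectors can then be combined into document vectors with methods such as averaging or tfidf weighting. More detailed information about the gensim interfaces is available at http://radimrehurek.com/gensim/models/word2vec.html.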

When building the model on the training corpus, we focus mainly on the following parameters.

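size: sets the dimensionality of the word vectors, which can range from tens to thousands. Try different dimensionalities to find what works best.
window: sets the context or window size, that is, the length of the window of words the algorithm treats as context during training.
min_count: specifies the minimum number of times a word must occur in the corpus to be included in the vocabulary. It helps remove unimportant words that occur only rarely in the documents.
sample: used to downsample frequently occurring words. Values between 0.01 and 0.0001 are usually ideal.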

After building the model, we define and implement two techniques for combining word vectors into document vectors with some weighting scheme:

Averaged word vectors.
TF-IDF weighted averaged word vectors.

Before implementing them, we build the word2vec model on the training corpus to start the feature extraction process. The following code shows how:

import gensim
import nltk
 
TOKENIZED_CORPUS = [nltk.word_tokenize(sentence)
                    for sentence in CORPUS]
tokenized_new_doc = [nltk.word_tokenize(sentence)
                    for sentence in new_doc]                       
 
model = gensim.models.Word2Vec(TOKENIZED_CORPUS,
                               size=10,
                               window=10,
                               min_count=2,
                               sample=1e-3)

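As you can see, we built the model with the parameters described above. You can experiment with these parameters, and with others described in the documentation such as the architecture type or the number of workers, to train different models. Now that we have a model, we can start implementing the feature extraction techniques.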

Averaged Word Vectors

The model above creates a vector representation for every word in the vocabulary. You can inspect these vectors with the following code.

In [5]: print(model['sky'])
/usr/local/bin/ipython:1: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
  #!/usr/local/bin/python
[ 0.0146156  -0.03722425  0.02307606  0.03794555  0.04357427  0.02248405
 -0.04318777 -0.0192292  -0.00593164  0.01981338]
 
In [6]: print(model['blue'])
/usr/local/bin/ipython:1: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
  #!/usr/local/bin/python
[-0.04851234 -0.03277206  0.02654879  0.04666627 -0.0460027   0.01473446
 -0.01312309  0.02642985  0.0062849  -0.04868636]

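Based on the size parameter specified earlier, each word vector has length 10. When we work with sentences and documents, however, we must perform some combining and aggregation so that the final features have the same dimensionality regardless of text length, number of words, and so on. In this technique we use an averaged word vector: for each document we take all its words, look up the word vector of every word that is in the vocabulary, add these vectors together, and divide by the number of matched words to obtain the averaged word vector of the document. This can be expressed mathematically as:

AVW(D) = (wv(w₁) + wv(w₂) + ... + wv(w_n)) / n

where AVW(D) is the averaged word vector of document D, which contains the words w₁, w₂, ..., w_n, and wv(w) is the word vector of word w.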
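The averaging step can be tried out without a trained model. The sketch below uses a hypothetical two-dimensional lookup table in place of word2vec vectors; `toy_vectors` and `average_vector` are illustrative names, not part of the original code.

```python
# Hypothetical 2-d word vectors standing in for a trained word2vec model.
toy_vectors = {'sky': [0.5, 1.0], 'blue': [1.5, 0.0]}

def average_vector(words, vectors, num_features):
    """Average the vectors of the words that have an entry in `vectors`."""
    total = [0.0] * num_features
    matched = 0
    for word in words:
        if word in vectors:
            total = [t + v for t, v in zip(total, vectors[word])]
            matched += 1
    # A document with no matching words keeps the all-zero vector.
    return [t / matched for t in total] if matched else total

doc_vector = average_vector(['loving', 'this', 'blue', 'sky', 'today'],
                            toy_vectors, num_features=2)
print(doc_vector)  # [1.0, 0.5]
```

Whatever the length of the document, the result always has num_features dimensions, which is exactly what a downstream classifier needs.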

The pseudocode for this algorithm is as follows:

model := the word2vec model we built
vocabulary := unique_words(model)
document := [words]
matched_word_count := 0
vector := []
 
 
for word in words:
  if word in vocabulary:
    vector := vector + model[word]
    matched_word_count := matched_word_count +1
averaged_word_vector := vector / matched_word_count

This pseudocode shows the flow of operations in a clear, easy-to-understand way. We now implement the algorithm in Python with the following code.

import numpy as np   
     
def average_word_vectors(words, model, vocabulary, num_features):
    feature_vector = np.zeros((num_features,),dtype="float64")
    nwords = 0.
    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            feature_vector = np.add(feature_vector, model[word])
     
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
    return feature_vector
     
    
def averaged_word_vectorizer(corpus, model, num_features):
    vocabulary = set(model.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

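The average_word_vectors() function should look familiar: it is a complete implementation of the algorithm given in pseudocode above. We also create a generic function, averaged_word_vectorizer(), which computes averaged word vectors for multiple documents in a corpus. The following code shows this function in action on the sample corpus.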

from feature_extractors import averaged_word_vectorizer
 
avg_word_vec_features = averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
                                                 model=model,
                                                 num_features=10)

In [33]: print(np.round(avg_word_vec_features, 3))
[[ 0.005 -0.034  0.011  0.024  0.012  0.009 -0.034  0.007 -0.016 -0.012]
 [ 0.009 -0.033  0.023  0.011  0.012  0.02  -0.028  0.002 -0.021 -0.015]
 [ 0.012 -0.034  0.017  0.022  0.004  0.001 -0.029  0.008 -0.022 -0.011]
 [-0.049 -0.033  0.027  0.047 -0.046  0.015 -0.013  0.026  0.006 -0.049]]

Warning

If you get the following error:

...
AttributeError: 'Word2Vec' object has no attribute 'index2word'

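the following workaround resolved it, although the cause was not known at the time of writing. A likely explanation, consistent with the deprecation warning shown earlier, is that newer gensim releases expose the vocabulary and the vectors through model.wv rather than on the model itself. (In gensim 4.0 and later, index2word was in turn renamed to index_to_key and the size parameter to vector_size.)

# save the model
model_name = "300features_40minwords_10context"
model.save(model_name)

# load the model
model = gensim.models.Word2Vec.load(model_name)

Then modify the code: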

def averaged_word_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

Then continue running the code above.

nd_avg_word_vec_features = averaged_word_vectorizer(corpus=tokenized_new_doc,
                                                     model=model,
                                                     num_features=10)

In [38]: print(nd_avg_word_vec_features)
[[-0.01694837 -0.03499815  0.02481242  0.04230591 -0.00121422  0.01860925
  -0.02815543  0.00360032  0.00017663 -0.01443649]]

TF-IDF Weighted Averaged Word Vectors

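The previous vectorizer simply sums the word vectors of the words in a document that are in the model vocabulary and divides by the number of matched words to compute a simple average. Here we instead weight each matched word vector by the TF-IDF score of the word, sum the weighted vectors, and divide by the number of matched words in the document. This gives a TF-IDF weighted averaged word vector for each document.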

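This can be expressed mathematically as:

TWA(D) = (wv(w₁) × tfidf(w₁) + wv(w₂) × tfidf(w₂) + ... + wv(w_n) × tfidf(w_n)) / n

where TWA(D) is the TF-IDF weighted averaged word vector of document D, which contains the words w₁, w₂, ..., w_n, wv(w) is the word vector of word w, and tfidf(w) is the TF-IDF weight of w. The following code implements this algorithm in Python so that we can use it to extract features. Note that the code divides by the sum of the matched TF-IDF weights rather than by the word count n, which makes the result a true weighted average.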
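To show the effect of the weighting, the sketch below computes a weighted average over a hypothetical two-dimensional lookup table. It follows the normalization used by the Python implementation in this section, dividing by the sum of the weights; the vectors, the weights, and the `weighted_average_vector` helper are all illustrative assumptions.

```python
# Hypothetical 2-d word vectors and TF-IDF weights for one document.
toy_vectors = {'sky': [0.5, 1.0], 'blue': [1.5, 0.0]}
toy_tfidf = {'sky': 3.0, 'blue': 1.0}

def weighted_average_vector(words, vectors, weights, num_features):
    """Weight each matched word vector, then divide by the total weight."""
    total = [0.0] * num_features
    weight_sum = 0.0
    for word in words:
        if word in vectors:
            weight = weights.get(word, 0.0)
            total = [t + weight * v for t, v in zip(total, vectors[word])]
            weight_sum += weight
    # Leave the zero vector untouched when no weight was accumulated.
    return [t / weight_sum for t in total] if weight_sum else total

doc_vector = weighted_average_vector(['blue', 'sky'], toy_vectors,
                                     toy_tfidf, num_features=2)
print(doc_vector)  # [0.75, 0.75]
```

Because sky carries three times the weight of blue, the result sits closer to the sky vector than the plain average [1.0, 0.5] would.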

def tfidf_wtd_avg_word_vectors(words, tfidf_vector, tfidf_vocabulary, model, num_features):
    # use a membership test: column index 0 is valid but falsy
    word_tfidfs = [tfidf_vector[0, tfidf_vocabulary[word]]
                   if word in tfidf_vocabulary
                   else 0 for word in words]
    word_tfidf_map = {word:tfidf_val for word, tfidf_val in zip(words, word_tfidfs)}
    feature_vector = np.zeros((num_features,),dtype="float64")
    vocabulary = set(model.index2word)
    wts = 0.
    for word in words:
        if word in vocabulary:
            word_vector = model[word]
            weighted_word_vector = word_tfidf_map[word] * word_vector
            wts = wts + word_tfidf_map[word]
            feature_vector = np.add(feature_vector, weighted_word_vector)
    if wts:
        feature_vector = np.divide(feature_vector, wts)
    return feature_vector
     
def tfidf_weighted_averaged_word_vectorizer(corpus, tfidf_vectors,
                                   tfidf_vocabulary, model, num_features):
    docs_tfidfs = [(doc, doc_tfidf)
                   for doc, doc_tfidf
                   in zip(corpus, tfidf_vectors)]
    features = [tfidf_wtd_avg_word_vectors(tokenized_sentence, tfidf, tfidf_vocabulary,
                                   model, num_features)
                    for tokenized_sentence, tfidf in docs_tfidfs]
    return np.array(features)

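The tfidf_wtd_avg_word_vectors() function gives us the TF-IDF weighted averaged word vector of a single document. We also create a function, tfidf_weighted_averaged_word_vectorizer(), which computes TF-IDF weighted averaged word vectors for multiple documents in a corpus. The following code shows this function in action on the sample corpus: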

corpus_tfidf = tdidf_features
vocab = tfidf_vectorizer.vocabulary_
wt_tfidf_word_vec_features = tfidf_weighted_averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
    tfidf_vectors=corpus_tfidf,
    tfidf_vocabulary=vocab,
    model=model,
    num_features=10)

In [65]: print(np.round(wt_tfidf_word_vec_features, 3))
[[ 0.009 -0.034  0.009  0.024  0.015  0.006 -0.035  0.007 -0.018 -0.009]
 [ 0.014 -0.033  0.021  0.007  0.022  0.025 -0.031 -0.002 -0.022 -0.012]
 [ 0.016 -0.034  0.016  0.022  0.005 -0.003 -0.029  0.008 -0.024 -0.009]
 [-0.049 -0.033  0.027  0.047 -0.046  0.015 -0.013  0.026  0.006 -0.049]]

nd_wt_tfidf_word_vec_features = tfidf_weighted_averaged_word_vectorizer(corpus=tokenized_new_doc,
    tfidf_vectors=nd_tfidf,
    tfidf_vocabulary=vocab,
    model=model,
    num_features=10)

In [69]: print(np.round(nd_wt_tfidf_word_vec_features, 3))
[[-0.014 -0.035  0.025  0.042  0.003  0.019 -0.03   0.001 -0.    -0.011]]

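The results above show how each document is converted into a TF-IDF weighted averaged word vector. This TF-IDF based document feature extraction also reuses the TF-IDF weights and vocabulary obtained earlier.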

posted @ 2019-08-14 18:37 翡翠嫩白菜
