TF-IDF Notes (library call and hand-written version)
First of all, TF-IDF stands for term frequency–inverse document frequency; it is a weighting technique commonly used in information retrieval and data mining.
TF is term frequency and IDF is inverse document frequency.
The above is what a Baidu search gives you.
My own understanding is that it is used for feature selection: it tells you which words make good features.
Term frequency (TF): the number of times a word occurs in a document, divided by that document's total word count. (occurrences of the word in the document / total words in the document)
Inverse document frequency (IDF): the base-10 logarithm of the total number of documents divided by the number of documents containing the word. A bit convoluted, haha. lg(total documents / documents containing the word)
TF-IDF = TF * IDF
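To make the formulas concrete, here is a tiny worked example (the three-document corpus and all the numbers are made up, just for illustration):

import math

# Toy corpus: 3 documents (hypothetical, for illustration only)
docs = ["cat dog", "cat cat fish", "dog fish"]

# TF of "cat" in the second document: 2 occurrences out of 3 words
tf = 2 / 3
# IDF of "cat": it appears in 2 of the 3 documents
idf = math.log10(3 / 2)  # lg(total docs / docs containing the word) ≈ 0.176
print(tf * idf)          # TF-IDF ≈ 0.117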
First, the library version:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# CountVectorizer converts the texts into a term-count matrix
vectorizer = CountVectorizer(max_features=1200, min_df=12)
# TfidfTransformer computes the TF-IDF value of every word counted by vectorizer
tf_idf_transformer = TfidfTransformer()
# vectorizer.fit_transform() counts the occurrences of each word
# tf_idf_transformer.fit_transform() turns the count matrix into TF-IDF values
tf_idf = tf_idf_transformer.fit_transform(vectorizer.fit_transform(train_features['features'].values.astype('U')))  # .values.astype('U') casts the column to unicode strings
x_train_weight = tf_idf.toarray()  # TF-IDF weight matrix of the training set
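As a side note, sklearn also provides TfidfVectorizer, which fuses the two steps above into a single call; a minimal sketch, assuming the same train_features DataFrame as above:

from sklearn.feature_extraction.text import TfidfVectorizer

# One-step equivalent of CountVectorizer + TfidfTransformer
tfidf_vectorizer = TfidfVectorizer(max_features=1200, min_df=12)
x_train_weight = tfidf_vectorizer.fit_transform(train_features['features'].values.astype('U')).toarray()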
Now my hand-written version:
The argument format is ["word1 word2 word3", "word1 word2 word3", "word1 word2 word3"]:
a list of strings, with the words in each string separated by spaces.
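For example (a made-up input, just to show the shape):

features = ["the cat sat", "the dog ran", "cat and dog chase"]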
import math
import numpy as np

print("-" * 5 + "building the tf-idf weight matrix" + "-" * 5)

def get_tf_idf(list_words):
    # Build the vocabulary
    wordSet = list(set(" ".join(list_words).split()))

    # Count how many times each vocabulary word occurs in one document
    def count_(words):
        wordDict = dict.fromkeys(wordSet, 0)
        for i in words:
            wordDict[i] += 1
        return wordDict

    # Compute TF: word count / total words in the document
    def computeTF(words):
        cnt_dic = count_(words)
        tfDict = {}
        nbowCount = len(words)
        for word, count in cnt_dic.items():
            tfDict[word] = count / nbowCount
        return tfDict

    # Compute IDF: lg(total documents / documents containing the word);
    # the +1 inside the log keeps the weight non-zero even for words
    # that appear in every document
    def get_idf():
        filecont = dict.fromkeys(wordSet, 0)
        for i in wordSet:
            for j in list_words:
                if i in j.split():
                    filecont[i] += 1
        idfDict = dict.fromkeys(wordSet, 0)
        le = len(list_words)
        for word, cont in filecont.items():
            idfDict[word] = math.log10(le / cont + 1)
        return idfDict

    # Compute TF * IDF for every word of every document
    def compute_tf_idf(list_words):
        idf_dic = get_idf()
        ret = []
        for words in list_words:
            tf_dic = computeTF(words.split())
            tf_idf_dic = {}
            temp = []
            for word, tf in tf_dic.items():
                tf_idf_dic[word] = tf * idf_dic[word]
            # Emit the weights in a fixed vocabulary order
            for word in wordSet:
                temp.append(tf_idf_dic.get(word, 0))
            ret.append(temp)
        return ret

    return np.array(compute_tf_idf(list_words))
The tf-idf matrix:
word_tf_idf = get_tf_idf(features)
Slow as anything, hahahaha.
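The slowness mostly comes from get_idf() looping over every vocabulary word for every document, plus all the per-word dictionary lookups. A minimal vectorized sketch with numpy (my own rewrite using the same lg(N/df + 1) formula, not part of the original notes):

import numpy as np

def get_tf_idf_fast(list_words):
    vocab = sorted(set(" ".join(list_words).split()))
    index = {w: i for i, w in enumerate(vocab)}
    # Count matrix: one row per document, one column per vocabulary word
    counts = np.zeros((len(list_words), len(vocab)))
    for row, doc in enumerate(list_words):
        for w in doc.split():
            counts[row, index[w]] += 1
    tf = counts / counts.sum(axis=1, keepdims=True)  # TF per document
    df = (counts > 0).sum(axis=0)                    # documents containing each word
    idf = np.log10(len(list_words) / df + 1)         # same smoothed IDF as above
    return tf * idf

The column order differs from the hand-written version (sorted vocabulary instead of arbitrary set order), but the value computed for each word is the same.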
