# Word Sense Disambiguation NLP Project Experiment


• Lesk algorithms

• Original Lesk (Lesk, 1986)
• Adapted/Extended Lesk (Banerjee and Pedersen, 2002/2003)
• Simple Lesk (with definition, example(s) and hyper+hyponyms)
• Cosine Lesk (uses cosine similarity to calculate overlaps instead of raw counts)

• Path similarity (Wu and Palmer, 1994; Leacock and Chodorow, 1998)
• Information Content (Resnik, 1995; Jiang and Conrath, 1997; Lin, 1998)
• Baselines

• Random sense
• First NLTK sense
• Highest lemma counts
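The three baselines above need nothing beyond a sense inventory with frequency counts. A minimal sketch of all three, using a hand-made toy inventory in place of NLTK's WordNet (the words, senses, and counts below are invented purely for illustration):

```python
import random

# Toy sense inventory: word -> list of (sense, lemma_count), ordered the way
# an NLTK-style sense list would be. All entries are made up for illustration.
inventory = {
    'bank': [('bank.n.01 (river bank)', 25), ('bank.n.02 (financial bank)', 59)],
}

def random_sense(word):
    """Baseline 1: pick a sense uniformly at random."""
    return random.choice(inventory[word])[0]

def first_sense(word):
    """Baseline 2: pick the first listed sense (like wn.synsets(word)[0])."""
    return inventory[word][0][0]

def highest_lemma_count(word):
    """Baseline 3: pick the sense whose lemmas occur most often in the corpus."""
    return max(inventory[word], key=lambda s: s[1])[0]

print(first_sense('bank'))          # bank.n.01 (river bank)
print(highest_lemma_count('bank'))  # bank.n.02 (financial bank)
```

With real WordNet, the first-sense baseline is surprisingly strong because WordNet orders senses roughly by frequency.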

```
pip install -U nltk
pip install -U pywsd
```

```python
from pywsd.lesk import simple_lesk               # pywsd's Lesk implementations

sent = 'I went to the bank to deposit my money'  # sentence containing the ambiguous word
ambiguous = 'bank'                               # the ambiguous word

answer = simple_lesk(sent, ambiguous, pos='n')   # returns the chosen WordNet Synset
print(answer, answer.definition())
```


This Lesk variant is a frequency-based scoring method with TF-IDF weights. The main steps are:

• Remove stop words from the sentence.
• Compute the TF-IDF value of every remaining word (other than the target word) against each sense's corpus.
• Sum these values per sense and compare; the sense with the largest sum is taken as the word's meaning in the sentence.
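The scoring step can be seen on a tiny self-contained example before the full implementation below. The English toy corpora here stand in for the per-sense corpus files; all names and data are illustrative only:

```python
from math import log2

# Toy per-sense corpora (stand-ins for the real corpus files)
sense_corpora = {
    'bank_river':   ['the river bank was muddy', 'fish swim near the bank'],
    'bank_finance': ['deposit money at the bank', 'the bank approved the loan'],
}

# Context words from the sentence to disambiguate, stop words already removed
context = ['deposit', 'money']

total_docs = sum(len(sents) for sents in sense_corpora.values())  # 4 sentences

scores = {}
for sense, sents in sense_corpora.items():
    score = 0.0
    for word in context:
        # term frequency of `word` in this sense's corpus
        tf = sum(sent.split().count(word) for sent in sents)
        # document frequency of `word` across all corpora
        df = sum(word in sent for ss in sense_corpora.values() for sent in ss)
        if tf:
            score += tf * log2(total_docs / (1 + df))
    scores[sense] = score

best = max(scores, key=scores.get)
print(best)  # bank_finance
```

Here 'deposit' and 'money' each appear once, and only in the finance corpus, so `bank_finance` scores 2.0 while `bank_river` scores 0 and the financial sense wins.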

```python
import os
import jieba
from math import log2

# Read the example sentences for one sense from a corpus file
def read_file(path):
    with open(path, 'r', encoding='utf-8') as f:
        lines = [line.strip() for line in f.readlines()]
    return lines

# Tokenise the example sentence
sent = '赛季初的时候，火箭是众望所归的西部决赛球队。'
wsd_word = '火箭'

sent_words = list(jieba.cut(sent, cut_all=False))

# Remove stop words (including the target word itself)
stopwords = [wsd_word, '我', '你', '它', '他', '她', '了', '是', '的', '啊', '谁', '什么', '都',
             '很', '个', '之', '人', '在', '上', '下', '左', '右', '。', '，', '！', '？']

sent_cut = []
for word in sent_words:
    if word not in stopwords:
        sent_cut.append(word)

print(sent_cut)

# Load the corpus for each sense; the corpus files in the current directory
# are assumed to be named '<word>_<sense>.txt', e.g. '火箭_NBA球队名.txt'
wsd_dict = {}
for file in os.listdir('.'):
    if wsd_word in file:
        wsd_dict[file.replace('.txt', '')] = read_file(file)

# Term frequency: how often each remaining word occurs in each sense's corpus
tf_dict = {}
for meaning, sents in wsd_dict.items():
    tf_dict[meaning] = []
    for word in sent_cut:
        word_count = 0
        for sent in sents:
            example = list(jieba.cut(sent, cut_all=False))
            word_count += example.count(word)
        if word_count:
            tf_dict[meaning].append((word, word_count))

# Document frequency: number of corpus sentences containing each word
idf_dict = {}
for word in sent_cut:
    document_count = 0
    for meaning, sents in wsd_dict.items():
        for sent in sents:
            if word in sent:
                document_count += 1
    idf_dict[word] = document_count

# Total number of sentences across all sense corpora
total_document = 0
for meaning, sents in wsd_dict.items():
    total_document += len(sents)

# Compute the TF-IDF sum for each sense
mean_tf_idf = []
for k, v in tf_dict.items():
    print(k + ':')
    tf_idf_sum = 0
    for item in v:
        word = item[0]
        tf = item[1]
        tf_idf = tf * log2(total_document / (1 + idf_dict[word]))
        tf_idf_sum += tf_idf
        print('%s, count: %s, TF-IDF: %s' % (word, tf, tf_idf))
    mean_tf_idf.append((k, tf_idf_sum))

# The sense with the highest TF-IDF sum wins
sort_array = sorted(mean_tf_idf, key=lambda x: x[1], reverse=True)
true_meaning = sort_array[0][0].split('_')[1]
print('\nAfter disambiguation, %s in this sentence means "%s".' % (wsd_word, true_meaning))
```

Tokens after stop-word removal, followed by the final result:

```
['赛季', '初', '时候', '众望所归', '西部', '决赛', '球队']

After disambiguation, 火箭 in this sentence means "NBA球队名".
```

The same pipeline on a second example sentence, in which 火箭 refers to an actual rocket (a launch base in the Gobi desert), yields these tokens:

```
['三十多年', '前', '战士', '们', '戈壁滩', '白手起家', '建起', '我国', '发射', '基地']
```

posted @ 2019-09-25 09:53  WelkinChan