第一次个人编程作业

一、Github

二、思路

作业要求是对比两个文本的相似度，然后我在百度上搜索了挺久的，也看了一下大神们的代码，发现余弦相似度来写相对还是比较简单的。

余弦函数在三角形中的计算公式为：

在直角坐标系中，向量表示的三角形的余弦函数是怎么样的呢？下图中向量a用坐标(x1,y1)表示，向量b用坐标(x2,y2)表示。向量a和向量b在直角坐标中的长度为向量a和向量b之间的距离我们用向量c表示，就是上图中的黄色直线，那么向量c在直角坐标系中的长度为，将a，b，c带入三角函数的公式中得到如下的公式：

这是2维空间中余弦函数的公式，那么多维空间余弦函数的公式就是：

余弦相似度算法：一个向量空间中两个向量夹角间的余弦值作为衡量两个个体之间差异的大小，余弦值接近1，夹角趋于0，表明两个向量越相似，余弦值接近于0，夹角趋于90度，表明两个向量越不相似。

余弦相似度
余弦相似度量：计算个体间的相似度。
相似度越小，距离越大。相似度越大，距离越小。

余弦相似度算法：一个向量空间中两个向量夹角间的余弦值作为衡量两个个体之间差异的大小，余弦值接近1，夹角趋于0，表明两个向量越相似，余弦值接近于0，夹角趋于90度，表明两个向量越不相似。
那么我们就可以用这个思路来计算两个文本的相似度：
如：
1.今天是星期天，天气晴，今天晚上我要去看电影。
2.今天是周天，天气晴朗，我晚上要去看电影。
1.分词：
使用结巴分词将文本分词
list1=['今天','是','星期天','天气','晴','今天','晚上','我','要','去','看电影']
list2=[,今天','是','周天','天气','晴朗','我',晚上',',要','去','看电影']
2、列出所有词，将listA和listB放在一个set中，得到：
set={'看电影','星期天','周天','今天', '天气','晴', '晴朗', '晚上', '我', '要', '去', '是'}
将上述set转换为dict，key为set中的词，value为set中词出现的位置，即‘今天’:1这样的形式
dict1={'看电影'：0,'今天'：1,'星期天'：2,'周天'：3, '天气'：4,'晴'：5, '晴朗'：6, '晚上'：7, '我'：8, '要'：9, '去'：10, '是'：11}
list1code=[1,11,2,4,5,1,7,8,9,10,0]
list2code=[1,11,3,4,6,8,7,9,10,0]
我们来分析list1code，结合dict1，可以看到8对应的字是“我”，4对应的字是“天气”，9对应的字是“要”，就是句子1和句子2转换为用数字来表示。
4、对listAcode和listBcode进行oneHot编码，就是计算每个分词出现的次数。oneHot编号后得到的结果如下：
list1codeOneHot = [1, 2, 1, 0, 1, 1, 0, 1, 1, 1, 1]
list2codeOneHot = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]
由此得出两个句子的词频向量之后，就变成了求两个向量夹角的余弦值，值越大相似度约高。
cos(a)=0.70
即有：
大致流程就是：文件读入——>分词列出所有词——>计算词频并计算稀疏向量——>计算余弦相似度得到结果——>将结果写入新文件

代码

  #1将文本字符串化并去掉符号
  def string(file):
      with open(file, encoding='utf-8') as File:
          # 读取
          lines = File.readlines()
          line = ''.join(lines)
          # 去特殊符号
          character_string = re.sub(r"[%s]+" % ',$%^*(+)]+|[+——()?【】“”！，。？、~@#￥%……&*（）：]+', "", line)
  return character_string

  #2计算词频并计算稀疏向量
  def get_word(original, content):
      dictionary = {}
      return_dic = {}
      key_word = jieba.cut_for_search(content)
      for x in key_word:
          if x in dictionary:
              dictionary[x] += 1
      else:
              dictionary[x] = 1
      topK = 30
   tfidf = jieba.analyse.extract_tags(content, topK=topK, withWeight=True)
   stop_keyword = [line.strip() for line in original]
       for word_weight in tfidf:
              if word_weight in stop_keyword:
              continue
       word_frequency = dictionary.get(word_weight[0], 'not found')
       return_dic[word_weight[0]] = word_frequency
   return return_dic
        
   #3计算余弦相似度得到结果
   def similar(all_keys, original_document_dic, original_document_test_dic):
       str1_vector = []
       str2_vector = []
       # 计算词频向量
       for i in all_keys:
       str1_count = original_document_dic.get(i, 0)
       str1_vector.append(str1_count)
       str2_count = original_document_test_dic.get(i, 0)
       str2_vector.append(str2_count)

      # 计算各自平方和
   str1_map = map(lambda x: x * x, str1_vector)
   str2_map = map(lambda x: x * x, str2_vector)

   str1_mod = reduce(lambda x, y: x + y, str1_map)
   str2_mod = reduce(lambda x, y: x + y, str2_map)
   # 计算平方根
   str1_mod = math.sqrt(str1_mod)
   str2_mod = math.sqrt(str2_mod)
   # 计算向量积
   vector_multi = reduce(lambda x, y: x + y, map(lambda x, y: x * y, str1_vector, str2_vector))
   # 计算余弦值
   cosine = float(vector_multi) / (str1_mod * str2_mod)
   return cosine

三、计算模块接口部分的性能改进

覆盖率：

菜鸟教程(runoob.com)

四、单元测试结果

单元测试代码：

import re import jieba import jieba.analyse import math from functools import reduce def string(file): with open(file, encoding='utf-8') as File: # 读取 lines = File.readlines() line = ''.join(lines) # 去特殊符号 character_string = re.sub(r"[%s]+" % ',$%^*(+)]+|[+——()?【】“”！，。？、~@#￥%……&*（）：]+', "", line) return character_string def get_word(original, content): dictionary = {} return_dic = {} # 分词 key_word = jieba.cut_for_search(content) for x in key_word: if x in dictionary: dictionary[x] += 1 else: dictionary[x] = 1 topK = 30 # 关键词比率 tfidf = jieba.analyse.extract_tags(content, topK=topK, withWeight=True) stop_keyword = [line.strip() for line in original] for word_weight in tfidf: if word_weight in stop_keyword: continue word_frequency = dictionary.get(word_weight[0], 'not found') return_dic[word_weight[0]] = word_frequency return return_dic def similar(all_keys, original_document_dic, original_document_test_dic): str1_vector = [] str2_vector = [] # 计算词频向量 for i in all_keys: str1_count = original_document_dic.get(i, 0) str1_vector.append(str1_count) str2_count = original_document_test_dic.get(i, 0) str2_vector.append(str2_count) # 计算各自平方和 str1_map = map(lambda x: x * x, str1_vector) str2_map = map(lambda x: x * x, str2_vector) str1_mod = reduce(lambda x, y: x + y, str1_map) str2_mod = reduce(lambda x, y: x + y, str2_map) # 计算平方根 str1_mod = math.sqrt(str1_mod) str2_mod = math.sqrt(str2_mod) # 计算向量积 vector_multi = reduce(lambda x, y: x + y, map(lambda x, y: x * y, str1_vector, str2_vector)) # 计算余弦值 cosine = float(vector_multi) / (str1_mod * str2_mod) return cosine def test(doc_name): test_file = "C:/Users/Administrator/sim_0.8/"+doc_name original_document_test = test_file all_key = set() original_document = "C:/Users/Administrator/sim_0.8/orig.txt" str_Original_document = string(original_document) str_Original_document_test = string(original_document_test) original_document_dic1 = get_word(original_document, str_Original_document) for k, v in original_document_dic1.items(): all_key.add(k) original_document_dic2 = get_word(original_document, str_Original_document_test) for k, v in original_document_dic2.items(): all_key.add(k) cos = similar(all_key, original_document_dic1, original_document_dic2) print("%s 的相似度 = %.2f" % (doc_name, cos)) test("orig_0.8_add.txt") test("orig_0.8_del.txt") test("orig_0.8_dis_1.txt") test("orig_0.8_dis_3.txt") test("orig_0.8_dis_7.txt") test("orig_0.8_dis_10.txt") test("orig_0.8_dis_15.txt") test("orig_0.8_mix.txt") test("orig_0.8_rep.txt")

输出结果：

orig_0.8_add.txt 的相似度 = 0.95 orig_0.8_del.txt 的相似度 = 0.90 orig_0.8_dis_1.txt 的相似度 = 0.97 orig_0.8_dis_3.txt 的相似度 = 0.91 orig_0.8_dis_7.txt 的相似度 = 0.85 orig_0.8_dis_10.txt 的相似度 = 0.84 orig_0.8_dis_15.txt 的相似度 = 0.83 orig_0.8_mix.txt 的相似度 = 0.91 orig_0.8_rep.txt 的相似度 = 0.91

异常处理

except Exception as e: print(e)

PSP表格
<html> <head> <meta charset="utf-8"> <title>菜鸟教程(runoob.com)</title> <style> table,th,td { border:1px solid black; } </style> </head> <body> <table> <tr> <th>PSP2.1</th> <th>Personal Software Process Stages</th> <th>预估耗时（分钟）</th> <th>实际耗时（分钟）</th> </tr> <tr> <td>Plannning</td> <td>计划</td> <td>60</td> <td>70</td> </tr> <tr> <td>Estimate</td> <td>估计这个任务需要多少时间</td> <td>60</td> <td>60</td> </tr> <tr> <td>Development</td> <td>开发</td> <td>50</td> <td>60</td> </tr> <tr> <td>Analysis</td> <td>需求分析 (包括学习新技术)</td> <td>400</td> <td>450</td> </tr> <tr> <td>Design Spec</td> <td>生成设计文档</td> <td>50</td> <td>50</td> </tr> <tr> <td>Design Review</td> <td>设计复审</td> <td>30</td> <td>45</td> </tr> <tr> <td>Coding Standard</td> <td>代码规范 (为目前的开发制定合适的规范)</td> <td>60</td> <td>60</td> </tr> <tr> <td>Design</td> <td>具体设计</td> <td>60</td> <td>60</td> </tr> <tr> <td>Coding</td> <td>具体编码</td> <td>240</td> <td>300</td> </tr> <tr> <td>Code Review</td> <td>代码复审</td> <td>120</td> <td>120</td> </tr> <tr> <td>Test</td> <td>测试（自我测试，修改代码，提交修改）</td> <td>120</td> <td>120</td> </tr> <tr> <td>Reporting</td> <td>报告</td> <td>80</td> <td>80</td> </tr> <tr> <td>Test Repor</td> <td>测试报告</td> <td>50</td> <td>50</td> </tr> <tr> <td>Size Measurement</td> <td>计算工作量</td> <td>20</td> <td>30</td> </tr> <tr> <td>Postmortem & Process Improvement Plan</td> <td>事后总结, 并提出过程改进计划</td> <td>80</td> <td>100</td> </tr> <tr> <td></td> <td>合计</td> <td>1330</td> <td>1475</td> </tr> </table> </body> </html>
总结

看到这个题目就觉得很难很难，然后试着1去网上搜了一下，发现可以用高中所学的知识去完成这个任务，就觉得哎，我高中是不是白学了，忘记好多知识，还有就是完成代码，并且能运行代码只是第一步，我们还得去构思如何简化，如何优化这个代码。
嗨，加油吧，对以后的项目更有兴趣了，希望下个作业能做更好！

posted @ 2020-09-17 21:35 我是WiFi靠近我！阅读(164) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

我是WiFi靠近我！