First Individual Programming Assignment

Course this assignment belongs to	https://edu.cnblogs.com/campus/gdgy/CSGrade21-12
Assignment requirements	https://edu.cnblogs.com/campus/gdgy/CSGrade21-12/homework/13014
Goal of this assignment	Individual project

GitHub link

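PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning	140	60
.Estimate	.Estimate how long the task will take	120	120
Development	Development	240	200
.Analysis	.Requirements analysis	160	200
Design Spec	Write the design document	30	30
Design Review	Design review	15	10
Coding Standard	Coding standard (define suitable conventions for the current development)	20	10
Design	Detailed design	10	10
Coding	Coding	180	200
Code Review	Code review	20	10
Test	Testing (self-test, fix code, commit changes)	30	10
Reporting	Reporting	30	30
Test Report	Test report	40	40
Size Measurement	Measure the amount of work	5	5
Postmortem & Process Improvement Plan	Postmortem and process improvement plan	5	5
	Total	975	880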

Code Implementation

Libraries used

import jieba
import gensim
import re
import difflib
import os

Open a file and read its contents

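def get_file_contents(path):
    # Return the whole file as one string, or None if the path does not exist.
    if not os.path.exists(path):
        print("File path does not exist. Please check!")
        return None
    # The with block closes the file even if reading fails;
    # the result is no longer stored in a variable named str, which shadowed the built-in.
    with open(path, 'r', encoding='UTF-8') as f:
        return f.read()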

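if __name__ == '__main__':
    # Raw strings keep the backslashes literal; otherwise "\t" in "\test_text" becomes a tab.
    path1 = r"D:\Cprogram\papercheak\test_text\orig.txt"  # original text
    path2 = r"D:\Cprogram\papercheak\test_text\orig_0.8_add.txt"  # plagiarized version
    main(path1, path2)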
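
Windows paths like the ones above contain backslashes, and in an ordinary Python string literal some backslash sequences are escapes. The sketch below shows the effect with a made-up path (`D:\test_text` is only an illustration, not a path from this project): `\t` is read as a tab unless the literal is a raw string.

```python
# Hypothetical path used only to illustrate the escape problem.
plain_path = "D:\test_text"    # "\t" is interpreted as a TAB character
raw_path = r"D:\test_text"     # raw string: backslash and "t" stay separate

# The plain literal is one character shorter because "\t" collapsed into a tab.
print(repr(plain_path))
print(repr(raw_path))
print(len(plain_path), len(raw_path))
```

Forward slashes or `os.path.join` would avoid the problem as well; raw strings are simply the smallest change to the existing code.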

Tokenize the file contents with jieba and filter out punctuation and other special characters

def filter(text):
    # Tokenize the file contents with jieba first
    words = jieba.lcut(text)
    result = []
    # Keep only tokens that start with a letter, a digit, or a Chinese character
    for word in words:
        if re.match("[a-zA-Z0-9\u4e00-\u9fa5]", word):
            result.append(word)
    return result
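
To make the filtering step concrete without needing jieba installed, here is a minimal sketch that applies the same character class to a hand-written token list. The list stands in for `jieba.lcut` output and is an assumption, as are the names `TOKEN_PATTERN` and `keep_word_tokens`.

```python
import re

# Same character class as in filter(): ASCII letters, digits, or CJK ideographs.
TOKEN_PATTERN = re.compile("[a-zA-Z0-9\u4e00-\u9fa5]")

def keep_word_tokens(tokens):
    # re.match only checks the first character of each token.
    return [t for t in tokens if TOKEN_PATTERN.match(t)]

# Hand-written tokens standing in for jieba.lcut output (an assumption).
sample_tokens = ["今天", "是", "星期天", "，", "天气", "晴", "。", " ", "OK", "!"]
kept = keep_word_tokens(sample_tokens)
print(kept)
```

Because `re.match` anchors at the start of the token, a token that begins with a letter but ends with punctuation would still be kept. That is usually harmless here, since jieba tends to split punctuation into separate tokens.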

Pass in the filtered tokens and compute the similarity with the gensim and difflib libraries separately

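def gen_sim(text1, text2):
    texts = [text1, text2]
    # Map every distinct token to an integer id
    dictionary = gensim.corpora.Dictionary(texts)
    # Convert each token list into a bag-of-words vector
    corpus = [dictionary.doc2bow(text) for text in texts]
    # Build a similarity index over both documents
    similarity = gensim.similarities.Similarity('-Similarity-index', corpus, num_features=len(dictionary))
    test_corpus_1 = dictionary.doc2bow(text1)
    # Index 1 is the similarity between text1 and the second document (text2)
    sim = similarity[test_corpus_1][1]
    return sim

def diff_sim(text1, text2):
    # Ratio of matching tokens, taking token order into account
    sim = difflib.SequenceMatcher(None, text1, text2).ratio()
    return sim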
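
The two measures behave differently, and a small pure-Python sketch may help show why. gensim's similarity index compares bag-of-words vectors by cosine similarity, which ignores word order, while `difflib.SequenceMatcher` looks for matching runs in order. The helper `bow_cosine` below is a hand-written stand-in for the gensim computation (an assumption made so the example runs with the standard library only), and the two token lists are made up.

```python
import difflib
import math
from collections import Counter

def bow_cosine(tokens_a, tokens_b):
    # Cosine similarity between raw term-count vectors; word order is ignored.
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def sequence_ratio(tokens_a, tokens_b):
    # Same call as diff_sim(): 2 * matches / total length, order-sensitive.
    return difflib.SequenceMatcher(None, tokens_a, tokens_b).ratio()

# Made-up example: the same five tokens, with the last two moved to the front.
original = ["今天", "天气", "晴朗", "适合", "出门"]
shuffled = ["适合", "出门", "今天", "天气", "晴朗"]

cos = bow_cosine(original, shuffled)
ratio = sequence_ratio(original, shuffled)
print("cosine: %.4f, difflib ratio: %.4f" % (cos, ratio))
```

Reordering leaves the cosine score at its maximum but lowers the difflib ratio, which is one plausible reason the gensim figure can come out higher than the difflib figure on rearranged or padded text.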

The main function

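def main(path1, path2):
    str1 = get_file_contents(path1)
    str2 = get_file_contents(path2)
    # Stop early if either file could not be read
    if str1 is None or str2 is None:
        return
    text1 = filter(str1)
    text2 = filter(str2)
    # Named gensim_sim so that the local variable does not shadow the gensim module
    gensim_sim = gen_sim(text1, text2)
    difflib_sim = diff_sim(text1, text2)
    print("Similarity using difflib: %.4f" % difflib_sim)
    print("Similarity using gensim: %.4f" % gensim_sim)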

Results

The program ran in only 0.422 seconds, and the similarity reported by gensim is higher than the one reported by difflib.
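
The post does not say how the run time was measured. One simple way to get a comparable number is to wrap the call in `time.perf_counter`, as in the sketch below. The helper `timed` and the workload `demo_workload` are hypothetical; in the real program the timed call would be `main(path1, path2)`.

```python
import time

def timed(func, *args):
    # Run func(*args) and return its result together with the elapsed seconds.
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

def demo_workload(n):
    # Hypothetical stand-in for main(path1, path2).
    return sum(i * i for i in range(n))

value, seconds = timed(demo_workload, 1000)
print("elapsed: %.6f seconds" % seconds)
```

A single run includes one-off costs such as jieba loading its dictionary, so repeated runs would give a steadier figure.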

Code coverage:

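This result comes from the coverage plugin in PyCharm. Code coverage reached 96%; on inspection, the only code that was not executed is the branch that runs when the file path does not exist. Performance is also good, as the run time above shows.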
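
The uncovered branch could be exercised with a small check that passes a path known not to exist. The sketch below redeclares a `get_file_contents` equivalent so that it runs on its own; the helper names `check_missing_path` and `check_existing_path` and the file names are hypothetical.

```python
import os
import tempfile

def get_file_contents(path):
    # Redeclared here so the sketch is self-contained.
    if not os.path.exists(path):
        print("File path does not exist. Please check!")
        return None
    with open(path, 'r', encoding='UTF-8') as f:
        return f.read()

def check_missing_path():
    # A fresh temporary directory is empty, so this file cannot exist.
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "no_such_file.txt")
        return get_file_contents(missing)

def check_existing_path():
    # Write a small file, then read it back through the same function.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, 'w', encoding='UTF-8') as f:
            f.write("第一行\n第二行\n")
        return get_file_contents(path)

print(check_missing_path())
print(check_existing_path())
```

Running a check like the first one under the coverage tool should mark the missing-path branch as executed.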

posted @ 2023-09-16 14:49 黄皓坤
