First Individual Programming Assignment

GitHub link: Ender39831/3123004694 (https://github.com/Ender39831/3123004694)

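Course this assignment belongs to	https://edu.cnblogs.com/campus/gdgy/Class12Grade23ComputerScience
Assignment requirements	https://edu.cnblogs.com/campus/gdgy/Class12Grade23ComputerScience/homework/13468
Goal of this assignment	Experience the full software development process and record information along the way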

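PSP table:

*PSP2.1*	*Personal Software Process Stages*	*Estimated time (minutes)*	*Actual time (minutes)*
Planning	Planning	10	10
· Estimate	· Estimate how long the task will take	10	10
Development	Development	340	460
· Analysis	· Requirements analysis (including learning new technologies)	20	30
· Design Spec	· Write the design document	30	35
· Design Review	· Design review	20	15
· Coding Standard	· Coding standard (define suitable conventions for the current development)	10	10
· Design	· Detailed design	30	45
· Coding	· Coding	150	210
· Code Review	· Code review	10	15
· Test	· Testing (self-testing, fixing code, committing changes)	60	100
Reporting	Reporting	60	45
· Test Report	· Test report	20	15
· Size Measurement	· Measure the workload	20	20
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	20	10
	· Total	410	515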

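I. Modules and Their Functions

Module	Function	Purpose
Core	main()	Controls the overall flow
Reading	string readFile(const string& filePath)	Reads the contents of the given file
Preprocessing	string preprocessText(const string& text)	Removes punctuation while correctly handling Chinese characters, letters, and digits
Similarity calculation	double calculateCosineSimilarity(const map<string, int>& origFreq, const map<string, int>& copyFreq)	Computes the final similarity from the intermediate values computed earlier
File writing	void writeResult(const string& filePath, double similarityPercent)	Writes the result to the given file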

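II. Design of the Computation Module Interface

1. Interface design:

Each feature is split into its own function, and each function's output is passed to the next one as a parameter. The signatures are listed with map, as in the initial version; the optimized source code in subsection 3 uses unordered_map. The program calls the functions in this order:

Preprocessing: string preprocessText(const string& text)
Tokenization: vector<string> splitText(const string& text)
Word frequency counting: map<string, int> calculateWordFrequency(const vector<string>& words)
Similarity calculation: double calculateCosineSimilarity(const map<string, int>& origFreq, const map<string, int>& copyFreq)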
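
The call order above can be sketched as a single helper. This is only a minimal illustration: `similarityOf` is a hypothetical name, and `preprocessText` and `splitText` here are simplified stand-ins (ASCII punctuation only, splitting on spaces only), not the author's implementations.

```cpp
#include <cctype>
#include <cmath>
#include <map>
#include <string>
#include <vector>

using namespace std;

// Simplified stand-in: replace ASCII punctuation with spaces, keep all other bytes.
string preprocessText(const string& text) {
    string out;
    for (unsigned char c : text) {
        out += (c < 0x80 && ispunct(c)) ? ' ' : static_cast<char>(c);
    }
    return out;
}

// Simplified stand-in: split on spaces only (no per-character handling of Chinese).
vector<string> splitText(const string& text) {
    vector<string> tokens;
    string current;
    for (char c : text) {
        if (c == ' ') {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current += c;
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

map<string, int> calculateWordFrequency(const vector<string>& words) {
    map<string, int> freq;
    for (const string& word : words) freq[word]++;
    return freq;
}

double calculateCosineSimilarity(const map<string, int>& origFreq,
                                 const map<string, int>& copyFreq) {
    double dot = 0.0, origSq = 0.0, copySq = 0.0;
    for (const auto& entry : origFreq) {
        origSq += static_cast<double>(entry.second) * entry.second;
        auto it = copyFreq.find(entry.first);
        if (it != copyFreq.end()) dot += static_cast<double>(entry.second) * it->second;
    }
    for (const auto& entry : copyFreq) {
        copySq += static_cast<double>(entry.second) * entry.second;
    }
    if (origSq == 0.0 || copySq == 0.0) return 0.0;  // an empty text has no direction
    return dot / (sqrt(origSq) * sqrt(copySq));
}

// Hypothetical helper that chains the four steps in the order listed above.
double similarityOf(const string& origText, const string& copyText) {
    auto origFreq = calculateWordFrequency(splitText(preprocessText(origText)));
    auto copyFreq = calculateWordFrequency(splitText(preprocessText(copyText)));
    return calculateCosineSimilarity(origFreq, copyFreq);
}
```

Because every step takes its input by parameter and returns a value, each function can be tested on its own and the whole chain collapses into two nested call expressions.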

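2. Distinctive features:

Uses the cosine similarity algorithm

The core idea is to convert each text into a word-frequency vector and measure similarity by the angle between the two vectors. The steps are:

Build a unified word set (so that both vectors have the same dimensions)
Compute the dot product and the magnitudes exactly (the dot product reflects how much the vectors overlap, and each magnitude reflects a vector's own characteristics)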
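
For reference, these two steps correspond to the standard cosine similarity formula. The symbols are introduced here for illustration only: n is the size of the unified word set, and a_i and b_i are the counts of the i-th word in the original and in the copy.

```latex
\cos\theta = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
```

A word that appears in only one text contributes 0 to the numerator but still increases one of the magnitudes, which is why extra or missing words lower the score.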

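Intermediate results of the calculation can be printed to the command line to help with debugging and locating problems (because of the console output limit when processing large texts, this output is commented out by default)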

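3. Computation module source code:

Tokenization module:

// Tokenization (requires <string> and <vector>)
vector<string> splitText(const string& text) {
    vector<string> tokens;
    string currentWord;

    for (size_t i = 0; i < text.size();) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        // A lead byte in 0xE0..0xEF starts a 3-byte UTF-8 sequence, which covers common Chinese characters
        if (c >= 0xE0 && c <= 0xEF && i + 2 < text.size()) {
            // Emit any pending English word or number first so it is not merged across the Chinese character
            if (!currentWord.empty()) {
                tokens.push_back(currentWord);
                currentWord.clear();
            }
            tokens.push_back(text.substr(i, 3));
            i += 3;
        }
        // A space separates English words
        else if (text[i] == ' ') {
            if (!currentWord.empty()) {
                tokens.push_back(currentWord);
                currentWord.clear();
            }
            i++;
        }
        // Letters, digits, and any other bytes accumulate into the current word
        else {
            currentWord += text[i];
            i++;
        }
    }

    // Add the last word
    if (!currentWord.empty()) {
        tokens.push_back(currentWord);
    }

    return tokens;
}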

Word frequency module:

// Count word frequencies (the optimized version uses unordered_map instead of map)
unordered_map<string, int> calculateWordFrequency(const vector<string>& words) {
    unordered_map<string, int> freq;
    for (const string& word : words) {
        freq[word]++;
    }
    return freq;
}

Overall calculation module:
// Overall calculation (requires <cmath>, <string>, <unordered_map>, and <unordered_set>)
double calculateCosineSimilarity(const unordered_map<string, int>& origFreq,
                                 const unordered_map<string, int>& copyFreq) {
    // Collect every word from both texts in a hash set
    unordered_set<string> allWords;
    for (const auto& pair : origFreq) allWords.insert(pair.first);
    for (const auto& pair : copyFreq) allWords.insert(pair.first);

    // Numerator: dot product
    double dotProduct = 0.0;
    // Squared magnitudes of the two vectors (the square roots are taken at the end)
    double origMagnitude = 0.0, copyMagnitude = 0.0;

    for (const string& word : allWords) {
        int origCount = 0, copyCount = 0;

        auto origIt = origFreq.find(word);
        if (origIt != origFreq.end()) origCount = origIt->second;

        auto copyIt = copyFreq.find(word);
        if (copyIt != copyFreq.end()) copyCount = copyIt->second;

        // Multiply as double so large counts cannot overflow int
        dotProduct += static_cast<double>(origCount) * copyCount;
        origMagnitude += static_cast<double>(origCount) * origCount;
        copyMagnitude += static_cast<double>(copyCount) * copyCount;
    }

    // Guard against division by zero
    if (origMagnitude == 0 || copyMagnitude == 0) {
        return 0.0;
    }

    // Compute the cosine similarity
    return dotProduct / (sqrt(origMagnitude) * sqrt(copyMagnitude));
}

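III. Performance Improvements to the Computation Module Interface

Approach:

The original program used map, which is typically implemented as a red-black tree, so insertion and lookup take O(log n) time. unordered_map is implemented as a hash table with average O(1) insertion and lookup, so it has a clear efficiency advantage on large texts. The trade-off is that a hash table usually uses more memory: a classic case of trading memory for speed.

Time before	Time after	Improvement
58ms	46ms	26.09%

The 26.09% figure is (58 − 46) / 46, the gain relative to the improved time. Measured against the original time, the runtime dropped by (58 − 46) / 58 ≈ 20.69%.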
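
A comparison like the one in the table could be reproduced with a small timing harness. The sketch below is an assumption about how one might measure it, since the author's benchmark setup is not shown; `countWith` and `timeMs` are hypothetical names.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Count word frequencies with any map-like container
// (for example std::map or std::unordered_map).
template <typename MapType>
MapType countWith(const std::vector<std::string>& words) {
    MapType freq;
    for (const std::string& word : words) {
        ++freq[word];
    }
    return freq;
}

// Run `fn` once and return the elapsed wall-clock time in milliseconds.
template <typename Fn>
double timeMs(Fn fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Possible usage, given a vector<string> named words:
//   double treeMs = timeMs([&] { countWith<std::map<std::string, int>>(words); });
//   double hashMs = timeMs([&] { countWith<std::unordered_map<std::string, int>>(words); });
```

A single run is noisy, so averaging several runs on the same input gives a more trustworthy comparison than one measurement per container.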

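IV. Computation Module Tests

Identical texts
- Original: 今天天气暴雨，不适合外出。 ("Heavy rain today, not suitable for going out.")
- Plagiarized version: 今天天气暴雨，不适合外出。
- Expected similarity: 100%
- Computed similarity: 100%
Partially modified texts
- Original: 图形计算需要大量计算资源。 ("Graphics computing requires a lot of computing resources.")
- Plagiarized version: 图形渲染需要大量 GPU 资源。 ("Graphics rendering requires a lot of GPU resources.")
- Expected similarity: about 70%
- Computed similarity: 63.01%
Completely different texts
- Original: 读书学习看文章是必要的。 ("Reading, studying, and reading articles are necessary.")
- Plagiarized version: 游戏真好玩。 ("Games are really fun.")
- Expected similarity: 0%
- Computed similarity: 11.79%
Provided test files
- Original file name: orig
- Plagiarized file name: orig_0.8_del
- Computed similarity: 99.36%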
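
As a hedged sanity check, the second and third results can be reproduced by hand under one assumption: that the full-width period 。 survives preprocessing and is tokenized like any other 3-byte character (preprocessText is not listed, so this is an inference from the numbers, not a confirmed behavior). In the partially modified case the two texts then share 9 tokens (图, 形, 需, 要, 大, 量, 资, 源, 。), the original has a squared magnitude of 17 (计 and 算 each occur twice), and the copy has 12. In the completely different case only 。 is shared, and the texts contain 12 and 6 distinct tokens.

```latex
\cos\theta_{\text{partial}} = \frac{9}{\sqrt{17}\,\sqrt{12}} = \frac{9}{\sqrt{204}} \approx 0.6301,
\qquad
\cos\theta_{\text{different}} = \frac{1}{\sqrt{12}\,\sqrt{6}} = \frac{1}{\sqrt{72}} \approx 0.1179
```

Both values match the reported 63.01% and 11.79%. If the assumption holds, stripping full-width punctuation during preprocessing would bring the completely different case down to the expected 0%.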

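V. Exception Handling

1. Invalid input arguments

Scenario: the user's command-line arguments do not meet the requirements

Handling code:

if (argc != 4) {
    cerr << "Argument error: please provide 3 file path arguments" << endl;
    cerr << "Usage: <original file path> <plagiarized file path> <output file path>" << endl;
    return 1;
}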

2. File cannot be opened

Scenario: a file cannot be opened because the user entered a wrong path or for some other reason
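
A minimal sketch of how this case could be detected, assuming the reader function reports the failure by throwing. `readFileOrThrow` is a hypothetical name used for illustration and is not the author's readFile.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical variant of readFile: returns the whole file as a string,
// or throws std::runtime_error if the file cannot be opened.
std::string readFileOrThrow(const std::string& filePath) {
    std::ifstream file(filePath, std::ios::binary);
    if (!file.is_open()) {
        throw std::runtime_error("Cannot open file: " + filePath);
    }
    std::ostringstream buffer;
    buffer << file.rdbuf();  // copy the entire stream into the buffer
    return buffer.str();
}
```

The caller, for example main(), could catch the exception, print the message to cerr, and return a non-zero exit code, in the same way as the argument check.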