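First Individual Programming Assignment

Course this assignment belongs to	Course link
Where the assignment requirements are	Assignment requirements
Goal of this assignment	Complete a simple individual project, learn to use code analysis and testing tools, and build good development habits

GitHub link for the assignment: the assignment is here

PSP Table (Estimated Time per Development Module)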

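PSP2.1	Personal Software Process Stages	Estimated time (minutes)
Planning	Planning	25
· Estimate	· Estimate how long the task will take	657
Development	Development	537
· Analysis	· Requirements analysis (including learning new technologies)	15
· Design Spec	· Write the design document	20
· Design Review	· Design review	12
· Coding Standard	· Coding standard (define suitable conventions for the current development)	10
· Design	· Detailed design	60
· Coding	· Coding	270
· Code Review	· Code review	30
· Test	· Testing (self-testing, fixing code, committing changes)	120
Reporting	Reporting	120
· Test Report	· Test report	40
· Size Measurement	· Measure the workload	40
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	40
	· Total	657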

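Class Structure Design

Class	Responsibility	Core methods
SimHash	Core computation module	generateSimHash(), getSimilarity(), hammingDistance()
TextUtils	File I/O handling	readFile(), writeResult()
PlagiarismCheckerMain	Main flow control	execute(), validateInput()

Class Relationship Diagram

Key Function Flowchart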

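Key Algorithm Implementation

Word Segmentation and Weight Calculation

Use HanLP for Chinese word segmentation
Term frequency serves as the weight (the more often a word occurs, the higher its weight)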
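The weighting step can be sketched as follows. This is only a minimal illustration that assumes the text has already been segmented into tokens (for example by HanLP); the class name `TermWeights` and the method `countWeights` are hypothetical and do not come from the project.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns already-segmented tokens into term-frequency weights.
class TermWeights {

    // Counts how often each token occurs; the count is used as the token's weight.
    static Map<String, Integer> countWeights(List<String> tokens) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (String token : tokens) {
            weights.merge(token, 1, Integer::sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Assumed segmentation output; a real run would obtain it from the segmenter.
        List<String> tokens = Arrays.asList("文本", "相似", "文本", "检测");
        System.out.println(countWeights(tokens)); // prints {文本=2, 相似=1, 检测=1}
    }
}
```

A `LinkedHashMap` is used here only so that the printed order matches the order of first occurrence; any map would do for the weighting itself.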

Hash Conversion

# Algorithm (FNV-1a)
hash = offset_basis
for each byte in input:
    hash = hash XOR byte
    hash = hash * FNV_prime
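In Java, the pseudocode above might look like the sketch below. It assumes the 64-bit FNV-1a variant (to match the 64-bit fingerprint) with the standard 64-bit offset basis and prime, and UTF-8 input bytes; the class name `Fnv1a64` is illustrative, not the project's.

```java
import java.nio.charset.StandardCharsets;

// Sketch of 64-bit FNV-1a; class and method names are illustrative.
class Fnv1a64 {
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    static long hash(String input) {
        long hash = OFFSET_BASIS;
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff); // XOR the unsigned byte value into the hash first
            hash *= FNV_PRIME;  // then multiply; long overflow wraps modulo 2^64
        }
        return hash;
    }

    public static void main(String[] args) {
        // With no input bytes the loop never runs, so the offset basis is returned.
        System.out.println(Long.toHexString(hash(""))); // prints cbf29ce484222325
        System.out.println(Long.toHexString(hash("文本")));
    }
}
```

Masking with `0xff` matters because Java bytes are signed; without it, bytes above 0x7f would be sign-extended and flip the upper bits of the hash.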

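Feature Vector Construction

Initialize a 64-dimensional vector (one component per bit of the 64-bit hash)
Iterate over each word's hash value: for every bit position, add the word's weight if the bit is 1 and subtract it if the bit is 0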
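The accumulation described above can be sketched as follows. The token hashes and weights are made-up values, and the names `FeatureVector` and `accumulate` are assumptions for illustration only.

```java
// Sketch of the weighted per-bit accumulation; names are illustrative.
class FeatureVector {
    static final int BITS = 64;

    // For every token hash, add the weight at position i when bit i is 1,
    // and subtract it when bit i is 0.
    static int[] accumulate(long[] hashes, int[] weights) {
        int[] vector = new int[BITS];
        for (int t = 0; t < hashes.length; t++) {
            for (int i = 0; i < BITS; i++) {
                boolean bitSet = ((hashes[t] >>> i) & 1L) == 1L;
                vector[i] += bitSet ? weights[t] : -weights[t];
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        // Two made-up token hashes (binary 01 and 11) with weights 3 and 1.
        int[] v = accumulate(new long[] {0b01L, 0b11L}, new int[] {3, 1});
        System.out.println(v[0] + " " + v[1] + " " + v[2]); // prints 4 -2 -4
    }
}
```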

Fingerprint Generation Rule

for each component v[i] in 64-dim vector:
    if v[i] > 0: fingerprint.set_bit(i, 1)
    else: fingerprint.set_bit(i, 0)
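A possible Java rendering of this rule is shown below. Storing the fingerprint in a `long` is an assumption made for brevity (the post later mentions `BigInteger`), and the names `Fingerprint` and `fromVector` are hypothetical.

```java
// Sketch of collapsing the weighted vector into a 64-bit fingerprint.
class Fingerprint {

    // Sets bit i when component i of the vector is positive; otherwise the bit stays 0.
    static long fromVector(int[] vector) {
        long fingerprint = 0L;
        for (int i = 0; i < vector.length && i < 64; i++) {
            if (vector[i] > 0) {
                fingerprint |= (1L << i);
            }
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        int[] vector = new int[64];
        vector[0] = 4;   // positive, so bit 0 is set
        vector[1] = -2;  // negative, so bit 1 stays 0
        vector[63] = 1;  // positive, so bit 63 is set
        System.out.println(Long.toHexString(fromVector(vector))); // prints 8000000000000001
    }
}
```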

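Similarity Calculation Optimization

Faster Hamming distance: use BigInteger.bitCount() instead of comparing bit by bit
Normalization: similarity = 1 - Hamming distance / 64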
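The two points above fit in a few lines of Java. This sketch assumes the fingerprints are held as non-negative `BigInteger` values; the class and method names are illustrative rather than the project's `SimHash` API.

```java
import java.math.BigInteger;

// Sketch of the Hamming distance and normalized similarity; names are illustrative.
class HammingSimilarity {
    static final int BITS = 64;

    // XOR leaves a 1 wherever the two fingerprints differ; bitCount() counts those 1 bits.
    static int hammingDistance(BigInteger a, BigInteger b) {
        return a.xor(b).bitCount();
    }

    // similarity = 1 - Hamming distance / 64
    static double similarity(BigInteger a, BigInteger b) {
        return 1.0 - (double) hammingDistance(a, b) / BITS;
    }

    public static void main(String[] args) {
        BigInteger a = new BigInteger("1011", 2);
        BigInteger b = new BigInteger("0001", 2);
        System.out.println(hammingDistance(a, b)); // prints 2
        System.out.println(similarity(a, b));      // prints 0.96875
    }
}
```

Keeping the values non-negative matters: for a negative `BigInteger`, `bitCount()` counts the bits that differ from the sign bit, which would not be the Hamming distance.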

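Distinctive Features

Chinese-Specific Optimizations

Stop-word filtering: after segmentation, automatically filter out words that carry little meaning, such as "的" and "是"
Synonym merging: merge similar words using the Tongyici Cilin synonym thesaurus (同义词词林), for example "电脑" and "计算机" (both meaning "computer")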
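Both steps can be illustrated with a small token normalizer. The stop-word set and synonym map below are tiny hand-written stand-ins, not the real stop-word list or the Tongyici Cilin data, and `TokenNormalizer` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of stop-word filtering followed by synonym merging; tables are illustrative.
class TokenNormalizer {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("的", "是"));
    static final Map<String, String> SYNONYMS = new HashMap<>();

    static {
        SYNONYMS.put("电脑", "计算机"); // map "电脑" onto the canonical form "计算机"
    }

    // Drops stop words, then replaces each remaining token with its canonical synonym if one exists.
    static List<String> normalize(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            result.add(SYNONYMS.getOrDefault(token, token));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("电脑", "的", "性能", "是", "关键");
        System.out.println(normalize(tokens)); // prints [计算机, 性能, 关键]
    }
}
```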

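Anti-Interference Design

Locality-sensitive hashing: the fingerprint is insensitive to small insertions and deletions
Weight damping: in long texts the weight of a repeated word grows sublinearly (√n instead of n)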
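The damping rule amounts to taking the square root of the raw count. The sketch below assumes the weight is kept as a `double`; how the project rounds or stores it is not stated in the post, and the names are hypothetical.

```java
// Sketch of square-root weight damping; names are illustrative.
class WeightDamping {

    // A word that occurs n times contributes sqrt(n) instead of n.
    static double dampedWeight(int count) {
        return Math.sqrt(count);
    }

    public static void main(String[] args) {
        for (int count : new int[] {1, 4, 100}) {
            System.out.println(count + " -> " + dampedWeight(count));
        }
        // prints:
        // 1 -> 1.0
        // 4 -> 2.0
        // 100 -> 10.0
    }
}
```

With this rule a word repeated 100 times weighs 10 times as much as a word seen once, rather than 100 times, so a single frequent word is less able to dominate the fingerprint.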

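Performance Improvement

Approach: replace the original MD5 algorithm with the FNV-1a hash algorithm
Performance profile with MD5:

The most expensive part of the program is the main function of the main class, but the algorithm replacement here mainly affects the SimHash class

Performance profile with FNV-1a: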
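For comparison, here is one way a 64-bit token hash can be derived from MD5 with the JDK's `MessageDigest`. The project's original MD5 code is not shown in this post, so taking the first 8 bytes of the digest is purely an assumption, as is the class name `Md5Hash64`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of an MD5-based 64-bit token hash; not the project's actual code.
class Md5Hash64 {

    static long hash(String token) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(token.getBytes(StandardCharsets.UTF_8));
            // Keep the first 8 of the 16 digest bytes, read as a big-endian long.
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        // MD5 of the empty string is d41d8cd98f00b204e9800998ecf8427e.
        System.out.println(Long.toHexString(hash(""))); // prints d41d8cd98f00b204
    }
}
```

MD5 computes a full 128-bit cryptographic digest for every token, whereas FNV-1a needs only one XOR and one multiplication per input byte, which is the motivation for the swap.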

Selected Unit Tests

Coverage Screenshots for Each Unit Test

Selected Unit Test Code

// PlagiarismCheckerMainTest.java
package com.changlu.simhash;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import static org.junit.jupiter.api.Assertions.*;

/**
 * Unit tests for PlagiarismCheckerMain.
 * Written to stay compatible with Java 8.
 */
class PlagiarismCheckerMainTest {

    @TempDir
    Path tempDir;

    // Java 8 compatible file-writing helper
    private void writeFile(Path filePath, String content) throws IOException {
        Files.write(filePath, content.getBytes(StandardCharsets.UTF_8));
    }

    // Java 8 compatible file-reading helper
    private String readFile(Path filePath) throws IOException {
        byte[] bytes = Files.readAllBytes(filePath);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Test
    void testMainWithCorrectArgs() throws IOException {
        // Prepare the test files
        Path origFile = tempDir.resolve("orig.txt");
        Path copiedFile = tempDir.resolve("copied.txt");
        Path outputFile = tempDir.resolve("result.txt");

        writeFile(origFile, "这是原始文本内容");
        writeFile(copiedFile, "这是原始文本内容"); // identical content

        // Run the checker
        String[] args = {origFile.toString(), copiedFile.toString(), outputFile.toString()};
        int exitCode = PlagiarismCheckerMain.execute(args);

        // Verify the result
        assertEquals(0, exitCode, "should run successfully");
        assertTrue(Files.exists(outputFile), "should create the result file");

        String result = readFile(outputFile);
        assertTrue(Double.parseDouble(result) > 0.99, "identical texts should be close to 100% similar");
    }

    @Test
    void testMainWithIncorrectArgCount() {
        String[] args = {"only_one_arg"}; // too few arguments
        int exitCode = PlagiarismCheckerMain.execute(args);

        assertEquals(1, exitCode, "too few arguments should return an error code");
    }

    @Test
    void testMainWithNonExistentFile() {
        String[] args = {"nonexistent.txt", "nonexistent2.txt", "output.txt"};
        int exitCode = PlagiarismCheckerMain.execute(args);

        assertEquals(1, exitCode, "a missing file should return an error code");
    }

    @Test
    void testMainWithEmptyFiles() throws IOException {
        // Prepare empty files
        Path emptyFile1 = tempDir.resolve("empty1.txt");
        Path emptyFile2 = tempDir.resolve("empty2.txt");
        Path outputFile = tempDir.resolve("result.txt");

        writeFile(emptyFile1, "");
        writeFile(emptyFile2, "");

        String[] args = {emptyFile1.toString(), emptyFile2.toString(), outputFile.toString()};
        int exitCode = PlagiarismCheckerMain.execute(args);

        assertEquals(0, exitCode, "empty files should be processed successfully");
        assertTrue(Files.exists(outputFile), "should create the result file");
    }

    @Test
    void testMainWithDifferentTexts() throws IOException {
        Path origFile = tempDir.resolve("orig.txt");
        Path copiedFile = tempDir.resolve("copied.txt");
        Path outputFile = tempDir.resolve("result.txt");

        writeFile(origFile, "这是原始文本内容");
        writeFile(copiedFile, "这是完全不同的文本内容");

        String[] args = {origFile.toString(), copiedFile.toString(), outputFile.toString()};
        int exitCode = PlagiarismCheckerMain.execute(args);

        assertEquals(0, exitCode, "different texts should run successfully");

        String result = readFile(outputFile);
        double similarity = Double.parseDouble(result);
        assertTrue(similarity < 0.8, "completely different texts should have low similarity");
    }

    @Test
    void testMainWithSpecialCharacters() throws IOException {
        Path origFile = tempDir.resolve("orig.txt");
        Path copiedFile = tempDir.resolve("copied.txt");
        Path outputFile = tempDir.resolve("result.txt");

        String textWithSpecialChars = "Hello! 你好！ 123 @#$% 特殊字符测试";
        writeFile(origFile, textWithSpecialChars);
        writeFile(copiedFile, textWithSpecialChars);

        String[] args = {origFile.toString(), copiedFile.toString(), outputFile.toString()};
        int exitCode = PlagiarismCheckerMain.execute(args);

        assertEquals(0, exitCode, "text with special characters should be processed successfully");
    }

    @Test
    void testMainWithIdenticalFiles() throws IOException {
        Path origFile = tempDir.resolve("orig.txt");
        Path copiedFile = tempDir.resolve("copied.txt");
        Path outputFile = tempDir.resolve("result.txt");

        String content = "重复文本内容重复文本内容重复文本内容";
        writeFile(origFile, content);
        writeFile(copiedFile, content);

        String[] args = {origFile.toString(), copiedFile.toString(), outputFile.toString()};
        int exitCode = PlagiarismCheckerMain.execute(args);

        assertEquals(0, exitCode, "identical files should be processed successfully");

        String result = readFile(outputFile);
        double similarity = Double.parseDouble(result);
        assertEquals(1.0, similarity, 0.01, "identical files should be 100% similar");
    }

    @Test
    void testMainWithCompletelyDifferentFiles() throws IOException {
        Path origFile = tempDir.resolve("orig.txt");
        Path copiedFile = tempDir.resolve("copied.txt");
        Path outputFile = tempDir.resolve("result.txt");

        writeFile(origFile, "这是第一段完全不同的文本内容");
        writeFile(copiedFile, "这是第二段毫无关联的文本内容");

        String[] args = {origFile.toString(), copiedFile.toString(), outputFile.toString()};
        int exitCode = PlagiarismCheckerMain.execute(args);

        assertEquals(0, exitCode, "completely different files should be processed successfully");

        String result = readFile(outputFile);
        double similarity = Double.parseDouble(result);
        assertTrue(similarity < 0.8, "completely different files should have low similarity");
    }
}

Notes on Selected Exception Handling

Implicit Exceptions in the Special-Character Test

@Test
public void testSpecialCharacters() {
    String text1 = "文本包含特殊符号：@#￥%……&*";
    String text2 = "文本包含特殊符号：@#￥%……&*";
    SimHash hash1 = new SimHash(text1);
    SimHash hash2 = new SimHash(text2);
    assertEquals(1.0, hash1.getSimilarity(hash2), 0.01, "special symbols should be handled correctly");
}

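Encoding compatibility: ensure that special symbols do not cause encoding exceptions
Hash stability: identical text containing special characters should produce identical hashes

Special-Character Encoding Exceptions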

// Requires java.io.PrintStream and java.io.UnsupportedEncodingException
static {
    try {
        // Force standard output and standard error to UTF-8
        System.setOut(new PrintStream(System.out, true, "UTF-8"));
        System.setErr(new PrintStream(System.err, true, "UTF-8"));
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
}

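Encoding compatibility: ensure that UTF-8 output correctly handles Chinese text, emoji, and special symbols
Non-fatal handling: the exception is caught and printed inside the static block, so class loading does not fail

File Operation Exceptions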

// The file-writing helper declares that it throws IOException
private void writeFile(Path filePath, String content) throws IOException {
    Files.write(filePath, content.getBytes(StandardCharsets.UTF_8));
}

// The file-reading helper declares that it throws IOException
private String readFile(Path filePath) throws IOException {
    byte[] bytes = Files.readAllBytes(filePath);
    return new String(bytes, StandardCharsets.UTF_8);
}

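File-system isolation: file read and write errors originate only in these utility methods
Error propagation: callers are notified of a failed file operation through the exception

Argument Validation Exceptions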

@Test
void testMainWithIncorrectArgCount() {
    String[] args = {"only_one_arg"}; 
    int exitCode = PlagiarismCheckerMain.execute(args);
    assertEquals(1, exitCode); // an error code is expected
}

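Input validation: ensure that the number of command-line arguments is correct
Fail fast: reject invalid argument combinations immediately

File-Not-Found Exceptions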

@Test
void testMainWithNonExistentFile() {
    String[] args = {"nonexistent.txt", "nonexistent2.txt", "output.txt"};
    int exitCode = PlagiarismCheckerMain.execute(args);
    assertEquals(1, exitCode); // an error code is expected
}

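Resource existence check: avoid processing files that do not exist
Graceful degradation: return an error code instead of throwing an exception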
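The two error paths tested above (wrong argument count and missing input files) could be handled by an entry point shaped roughly like the sketch below. Only the exit codes 0 and 1 and the three-argument call come from the tests; the class name, the usage message, and the use of `Files.isRegularFile` are assumptions, and the real `PlagiarismCheckerMain.execute` may differ.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch of an execute-style entry point that reports errors through exit codes.
class ExitCodeSketch {
    static final int OK = 0;
    static final int ERROR = 1;

    // Returns ERROR for a wrong argument count or a missing input file
    // instead of letting an exception escape.
    static int execute(String[] args) {
        if (args == null || args.length != 3) {
            System.err.println("Usage: <original file> <copied file> <output file>");
            return ERROR;
        }
        if (!Files.isRegularFile(Paths.get(args[0])) || !Files.isRegularFile(Paths.get(args[1]))) {
            System.err.println("Input file not found");
            return ERROR;
        }
        // The similarity computation and result writing would go here.
        return OK;
    }

    public static void main(String[] args) {
        System.out.println(execute(new String[] {"only_one_arg"})); // prints 1
    }
}
```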

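Empty File Handling

@Test
void testMainWithEmptyFiles() throws IOException {
    // emptyFile1, emptyFile2 and args are set up as in the full test class above
    writeFile(emptyFile1, ""); // write empty content
    writeFile(emptyFile2, "");
    // Empty files are expected to be processed normally and to produce a result
    assertEquals(0, PlagiarismCheckerMain.execute(args));
}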

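Boundary handling: empty input files are accepted and still produce a reasonable result
Flow continuity: the program does not abort because of empty files

PSP Table (Actual Time per Development Module)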

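PSP2.1	Personal Software Process Stages	Actual time (minutes)
Planning	Planning	30
· Estimate	· Estimate how long the task will take	892
Development	Development	622
· Analysis	· Requirements analysis (including learning new technologies)	20
· Design Spec	· Write the design document	20
· Design Review	· Design review	12
· Coding Standard	· Coding standard (define suitable conventions for the current development)	10
· Design	· Detailed design	60
· Coding	· Coding	270
· Code Review	· Code review	30
· Test	· Testing (self-testing, fixing code, committing changes)	200
Reporting	Reporting	270
· Test Report	· Test report	80
· Size Measurement	· Measure the workload	80
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	110
	· Total	892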

posted @ 2025-09-22 20:19 常陆812
