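First Individual Programming Assignment

Course this assignment belongs to	Introduction to Engineering
Where the assignment requirements are	Assignment requirements
Goals of this assignment	Learn the basic workflow for setting up a project, learn to use GitHub, and design a paper plagiarism-checking algorithm and program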

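1. PSP Table

PSP2.1	Personal Software Process Stages	Estimated time (min)	Actual time (min)
Planning	Planning
· Estimate	· Estimate how much time the task will take	10	15
Development	Development
· Analysis	· Requirements analysis (including learning new technologies)	30	40
· Design Spec	· Write the design document	20	25
· Design Review	· Design review	10	10
· Coding Standard	· Coding standard (define suitable conventions for the current development)	10	10
· Design	· Detailed design	15	20
· Coding	· Implementation	160	170
· Code Review	· Code review	20	25
· Test	· Testing (self-testing, fixing code, committing changes)	30	35
Reporting	Reporting
· Test Report	· Test report	15	20
· Size Measurement	· Measure the amount of work	10	10
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	5	5
	· Total	335	385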

2. Development Environment

Operating system: 64-bit Windows 10
Language: Java
JDK version: jdk-16.0.1
IDE: IntelliJ IDEA 2023.2.2

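3. Basic Principles

Text preprocessing:
- Read the original paper text file.
- Remove stop words, punctuation, and special symbols.
- Convert the text to lowercase and split it into words or terms.
Feature extraction:
- Use the TF-IDF algorithm to compute the importance of each term in the text.
- Build a feature vector for each paper, in which each dimension corresponds to one term and its value is that term's importance in the paper.
Similarity calculation:
- For each pair of papers, compute the similarity of their feature vectors.
- Cosine similarity or another similarity measure can be used for this.
Result display:
- Based on the similarity values, group or sort the papers and find those whose similarity exceeds a given threshold.
- Present the results to the user, showing each similar pair of papers and its similarity value.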
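
The feature-extraction step above can be sketched without any external library. The following is a minimal illustration, assuming the documents are already preprocessed into whitespace-separated tokens and using the plain weighting tf × ln(N / df); the class name TfIdfSketch, the method tfIdfVectors, and the sample documents are hypothetical and are not taken from the original project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of the original project.
class TfIdfSketch {

    // Builds one TF-IDF vector per document over a shared, alphabetically sorted vocabulary.
    // Assumes each document is already preprocessed into whitespace-separated tokens.
    static List<double[]> tfIdfVectors(List<String> documents) {
        List<Map<String, Integer>> termCounts = new ArrayList<>();
        TreeSet<String> vocabulary = new TreeSet<>();
        for (String document : documents) {
            Map<String, Integer> counts = new HashMap<>();
            for (String token : document.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
            termCounts.add(counts);
            vocabulary.addAll(counts.keySet());
        }
        List<String> terms = new ArrayList<>(vocabulary);

        // Inverse document frequency: ln(N / df). A term found in every document gets weight 0.
        double[] idf = new double[terms.size()];
        for (int t = 0; t < terms.size(); t++) {
            int documentFrequency = 0;
            for (Map<String, Integer> counts : termCounts) {
                if (counts.containsKey(terms.get(t))) {
                    documentFrequency++;
                }
            }
            idf[t] = Math.log((double) documents.size() / documentFrequency);
        }

        // Term frequency (count / document length) multiplied by IDF.
        List<double[]> vectors = new ArrayList<>();
        for (Map<String, Integer> counts : termCounts) {
            int totalTokens = counts.values().stream().mapToInt(Integer::intValue).sum();
            double[] vector = new double[terms.size()];
            for (int t = 0; t < terms.size(); t++) {
                int count = counts.getOrDefault(terms.get(t), 0);
                vector[t] = totalTokens == 0 ? 0.0 : ((double) count / totalTokens) * idf[t];
            }
            vectors.add(vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Sample input (assumed to be already lowercased and stripped of punctuation).
        List<String> documents = Arrays.asList(
                "this is the first document",
                "this document is the second document",
                "and this is the third one");
        for (double[] vector : tfIdfVectors(documents)) {
            System.out.println(Arrays.toString(vector));
        }
    }
}
```

Each dimension of a vector corresponds to one term of the vocabulary, as described above. Other TF-IDF variants (smoothed IDF, length normalization) would produce different weights.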

4. Code

Text preprocessing

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TextPreprocessing {
    public static void main(String[] args) {
        String filePath = "path/to/your/file.txt"; // Replace with the actual file path

        try {
            String text = readTextFile(filePath);
            String processedText = preprocessText(text);
            System.out.println("Processed Text: ");
            System.out.println(processedText);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static String readTextFile(String filePath) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append("\n");
            }
        }
        return sb.toString();
    }

    public static String preprocessText(String text) {
        // Punctuation marks and stop words to remove
        String[] punctuations = { ".", ",", "!", "?" }; // Add other punctuation marks as needed
        String[] stopwords = { "the", "and", "to", "of", "a" }; // Add other stop words as needed

        // Convert the text to lowercase
        String lowercaseText = text.toLowerCase();

        // Remove punctuation
        for (String punctuation : punctuations) {
            lowercaseText = lowercaseText.replace(punctuation, "");
        }

        // Split the text into words on whitespace
        List<String> words = new ArrayList<>(Arrays.asList(lowercaseText.split("\\s+")));

        // Remove stop words
        words.removeAll(Arrays.asList(stopwords));

        // Return the processed text
        return String.join(" ", words);
    }
}

Feature extraction

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FeatureExtraction {
    public static void main(String[] args) {
        List<String> documents = new ArrayList<>();
        documents.add("This is the first document.");
        documents.add("This document is the second document.");
        documents.add("And this is the third one.");
        documents.add("Is this the first document?");

        try {
            List<double[]> featureVectors = extractFeatureVectors(documents);
            System.out.println("Feature Vectors: ");
            for (double[] vector : featureVectors) {
                System.out.println(Arrays.toString(vector));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static List<double[]> extractFeatureVectors(List<String> documents) throws IOException {
        // In-memory index (ByteBuffersDirectory replaces the removed RAMDirectory)
        Directory indexDirectory = new ByteBuffersDirectory();
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

        // Build the index: one Lucene document per input text
        try (IndexWriter writer = new IndexWriter(indexDirectory, config)) {
            for (String document : documents) {
                Document luceneDocument = new Document();
                luceneDocument.add(new TextField("content", document, Field.Store.YES));
                writer.addDocument(luceneDocument);
            }
        }

        List<double[]> featureVectors = new ArrayList<>();
        try (IndexReader reader = DirectoryReader.open(indexDirectory);
             Analyzer analyzer = new StandardAnalyzer()) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // Extract feature vectors. Each vector has one dimension per indexed document:
            // the score of that document against a query built from this document's terms.
            // Scores come from the searcher's default similarity (BM25 in Lucene 6 and later),
            // so they approximate, but are not identical to, classic TF-IDF weights.
            for (String document : documents) {
                double[] featureVector = new double[reader.maxDoc()];

                // The index stores analyzed tokens, so the query must be built from tokens too;
                // a single TermQuery on the whole text would match nothing.
                // Note: Lucene limits the number of clauses per query (1024 by default).
                BooleanQuery.Builder queryBuilder = new BooleanQuery.Builder();
                Set<String> seenTerms = new HashSet<>();
                try (TokenStream tokens = analyzer.tokenStream("content", document)) {
                    CharTermAttribute termText = tokens.addAttribute(CharTermAttribute.class);
                    tokens.reset();
                    while (tokens.incrementToken()) {
                        String term = termText.toString();
                        if (seenTerms.add(term)) {
                            queryBuilder.add(new TermQuery(new Term("content", term)),
                                    BooleanClause.Occur.SHOULD);
                        }
                    }
                    tokens.end();
                }

                TopDocs topDocs = searcher.search(queryBuilder.build(), reader.maxDoc());
                for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                    featureVector[scoreDoc.doc] = scoreDoc.score;
                }

                featureVectors.add(featureVector);
            }
        }

        return featureVectors;
    }
}

Similarity calculation

import java.util.Random;
import java.util.Scanner;

public class RockPaperScissors {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        Random random = new Random();

        String[] choices = { "rock", "paper", "scissors" };

        System.out.println("Welcome to the Rock-Paper-Scissors game!");
        System.out.println("Enter your choice (rock[R], paper[P], or scissors[S]): ");
        String playerChoice = scanner.nextLine().toUpperCase();

        if (playerChoice.equals("R") || playerChoice.equals("P") || playerChoice.equals("S")) {
            String[] abbreviations = { "R", "P", "S" };
            int randomIndex = random.nextInt(choices.length);
            int playerChoiceIndex = -1;

            for (int i = 0; i < abbreviations.length; i++) {
                if (playerChoice.equals(abbreviations[i])) {
                    playerChoiceIndex = i;
                    break;
                }
            }

            if (playerChoiceIndex != -1) {
                String computerChoice = choices[randomIndex];

                System.out.println("Computer choice: " + computerChoice);
                System.out.println("Your choice: " + choices[playerChoiceIndex]);

                if (choices[playerChoiceIndex].equals(computerChoice)) {
                    System.out.println("It's a tie!");
                } else if ((choices[playerChoiceIndex].equals("rock") && computerChoice.equals("scissors"))
                        || (choices[playerChoiceIndex].equals("paper") && computerChoice.equals("rock"))
                        || (choices[playerChoiceIndex].equals("scissors") && computerChoice.equals("paper"))) {
                    System.out.println("You win!");
                } else {
                    System.out.println("Computer wins!");
                }
            } else {
                System.out.println("Invalid choice. Please enter R (rock), P (paper), or S (scissors).");
            }
        } else {
            System.out.println("Invalid choice. Please enter R (rock), P (paper), or S (scissors).");
        }

        scanner.close();
    }
}
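
The listing above is a rock-paper-scissors game and does not compute any similarity, so the cosine-similarity step named in the basic principles is not shown in this post. A minimal sketch of that step, assuming every document has already been turned into a double[] feature vector of the same length, might look like the following. The class name CosineSketch, the method cosine, and the sample vectors are hypothetical; they are not the author's code.

```java
// Hypothetical helper for illustration only; not part of the original project.
class CosineSketch {

    // Cosine similarity: dot(a, b) / (|a| * |b|).
    // By convention in this sketch, a zero vector is treated as having similarity 0.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up sample vectors; real input would come from the feature-extraction step.
        double[][] vectors = {
                { 1.0, 2.0, 0.0 },
                { 2.0, 1.0, 1.0 },
                { 0.0, 0.0, 3.0 }
        };
        // Compare every pair of documents once.
        for (int i = 0; i < vectors.length; i++) {
            for (int j = i + 1; j < vectors.length; j++) {
                System.out.println("Similarity between document " + (i + 1)
                        + " and document " + (j + 1) + ": " + cosine(vectors[i], vectors[j]));
            }
        }
    }
}
```

For non-negative feature vectors the result lies between 0 (no shared terms) and 1 (same direction), which makes it convenient to compare against a fixed threshold.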

5. Testing

Feature Vectors: 
Similarity between document 1 and document 2: 0.6363636363636364
Similarity between document 1 and document 3: 0.23570226039551587
Similarity between document 1 and document 4: 0.7905694150420949
Similarity between document 2 and document 3: 0.6123724356957945
Similarity between document 2 and document 4: 0.3785708287545029
Similarity between document 3 and document 4: 0.4082482904638631
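
To connect these scores to the result-display step, here is a minimal sketch that reports only the pairs whose similarity reaches a threshold. The similarity values are copied from the output above; the threshold of 0.6, the class name ThresholdReportSketch, and the method pairsAtOrAbove are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the original project.
class ThresholdReportSketch {

    // Returns the 1-based document pairs (i, j), with i < j, whose similarity is >= threshold.
    // Only the upper triangle of the matrix is read.
    static List<int[]> pairsAtOrAbove(double[][] similarity, double threshold) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < similarity.length; i++) {
            for (int j = i + 1; j < similarity.length; j++) {
                if (similarity[i][j] >= threshold) {
                    pairs.add(new int[] { i + 1, j + 1 });
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Upper triangle filled with the values from the test output above.
        double[][] similarity = {
                { 1.0, 0.6363636363636364, 0.23570226039551587, 0.7905694150420949 },
                { 0.0, 1.0, 0.6123724356957945, 0.3785708287545029 },
                { 0.0, 0.0, 1.0, 0.4082482904638631 },
                { 0.0, 0.0, 0.0, 1.0 }
        };
        double threshold = 0.6; // assumed value; choose one that suits the data
        for (int[] pair : pairsAtOrAbove(similarity, threshold)) {
            System.out.println("Document " + pair[0] + " and document " + pair[1]
                    + " are similar: " + similarity[pair[0] - 1][pair[1] - 1]);
        }
    }
}
```

With the assumed threshold of 0.6, the pairs reported are documents 1 and 2, 1 and 4, and 2 and 3.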

posted @ 2023-09-19 08:18 毅啊啊
