WordCount - LllIKT

Code repository: https://gitee.com/li_kuntai/WordCount

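Course this assignment belongs to	<https://edu.cnblogs.com/campus/nue/SE202010/>
Assignment requirements	<https://edu.cnblogs.com/campus/nue/SE202010/homework/11481>
Goal of this assignment	<Implement a command-line program that takes a file as input and outputs the number of characters, words, symbols, etc. in that file>
Student ID	<2000306>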

I. PSP Table

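PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning	20	30
· Estimate	· Estimate how long the task will take	20	30
Development	Development	750	960
· Analysis	· Requirements analysis (including learning new technologies)	150	200
· Design Spec	· Write the design document	50	60
· Design Review	· Design review	30	20
· Coding Standard	· Coding standard (set suitable conventions for the current development)	20	20
· Design	· Detailed design	50	100
· Coding	· Coding	360	450
· Code Review	· Code review	40	30
· Test	· Testing (self-testing, fixing code, committing changes)	50	80
Reporting	Reporting	40	40
· Test Report	· Test report	20	20
· Size Measurement	· Work-size measurement	10	10
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	10	10
	Total	810	1030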

II. Design and Implementation

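This project wraps its interface in a class. The class declaration and the function prototypes are in stdafx.h, and the implementations are in stdafx.cpp.
For this assignment the main design is the Word class, which performs the word-frequency analysis and stores its results.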

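The Word class contains 5 functions:

   Word();                                          // constructor, used for initialization
   int Countcharacters(char *argv);                 // counts characters
   int Countlines(char *argv);                      // line counting: detects blank lines
   int Countwords(char *argv);                      // counts words
   vector<pair<string,int>> Counttop10(char *argv); // finds the ten most frequent words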

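The Countcharacters function counts characters. Following the assignment statement, it traverses the file contents once and checks whether each character falls in the range 0 to 127, that is, the ASCII range.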
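The Countlines function handles the line count, which comes down to detecting blank lines. The first step is to pin down what counts as a whitespace character: mainly tab, space and carriage return. Reading the file line by line and checking whether a line contains any character other than these three is enough to decide whether it is blank. In the test file shown later, input.txt has 14 lines, one of them whitespace-only, and the reported value is lines: 13, which is consistent with counting the non-blank lines.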
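The Countwords function counts words. It first converts every uppercase letter that could appear in a valid word to lowercase, then uses ans as a counter to check whether a word starts with 4 consecutive letters. Each valid word found is stored in a map.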
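A compact way to express the same rule is to split each line into tokens first and validate each token afterwards. The sketch below does that; it is not the author's implementation. The names isValidWord and tallyWords are hypothetical, and treating every non-alphanumeric character as a separator is an assumption drawn from the code in Section IV.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper: a token is a valid word when its first four
// characters are letters; later characters may be letters or digits.
bool isValidWord(const std::string& token)
{
    if (token.size() < 4)
        return false;
    for (int k = 0; k < 4; ++k)
        if (!std::isalpha(static_cast<unsigned char>(token[k])))
            return false;
    return true;
}

// Hypothetical helper: splits a line on non-alphanumeric characters,
// lower-cases each token and tallies the valid words in freq.
void tallyWords(const std::string& line, std::map<std::string, int>& freq)
{
    std::string token;
    for (std::size_t i = 0; i <= line.size(); ++i)
    {
        // A virtual separator after the last character flushes the final token.
        unsigned char c = i < line.size() ? static_cast<unsigned char>(line[i]) : ' ';
        if (std::isalnum(c))
        {
            token += static_cast<char>(std::tolower(c));
        }
        else
        {
            if (isValidWord(token))
                ++freq[token];
            token.clear();
        }
    }
}
```

With this split, a token such as 123acasv is rejected as a whole, and File, file and FILE are tallied under the same key.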
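Counttop10 finds the ten most frequent words and is the core function of this word-frequency assignment. As in Countwords, valid words are collected in a map, which keeps them in lexicographic order. The entries are then copied into a multimap whose key is the word's frequency, so that they end up sorted by frequency (the code negates the count so that the ascending key order gives descending frequency). A map requires unique keys, whereas a multimap allows several entries under the same key, which is why I use a multimap for the sorting step. Finally, the ten most frequent words are stored in a vector of pairs. The core code is shown in Section IV.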
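The sorting step on its own can be sketched as follows, assuming the word counts are already in a std::map. The function name topN and its parameter n are hypothetical; the negated key mirrors what the code in Section IV does.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the n most frequent words of freq.
std::vector<std::pair<std::string, int>>
topN(const std::map<std::string, int>& freq, std::size_t n)
{
    // Key = negated count, so the multimap's ascending order puts the
    // most frequent words first. Duplicate keys are allowed in a multimap.
    std::multimap<int, std::string> byCount;
    for (const auto& entry : freq)
        byCount.insert(std::make_pair(-entry.second, entry.first));

    std::vector<std::pair<std::string, int>> top;
    for (const auto& entry : byCount)
    {
        if (top.size() == n)
            break;
        top.push_back(std::make_pair(entry.second, -entry.first));
    }
    return top;
}
```

Since C++11, entries with equal keys keep their insertion order in a multimap. The entries are inserted while walking the map in lexicographic order, so words with the same frequency come out alphabetically.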

III. Performance Analysis

Contents of the test file input.txt:

vasdvs
bsbsdb.vasdvs
;casv[vdav/vv
vas.vsv+v
casvsa2000
 casvsa1998
as*casvsa2001
123acasv;;;;;
12sav
    
    vasdvs
vdv++
12sav;;fas
b

Results for the test file:

characters: 136
words: 9
lines: 13
<vasdvs>: 3
<bsbsdb>: 1
<casv>: 1
<casvsa1998>: 1
<casvsa2000>: 1
<casvsa2001>: 1
<vdav>: 1
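For reference, output in the layout shown above could be produced along these lines. This is only a sketch: the function formatReport and its parameters are hypothetical and do not come from the repository.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: renders the three totals followed by one
// "<word>: count" line per entry of top.
std::string formatReport(int characters, int words, int lines,
                         const std::vector<std::pair<std::string, int>>& top)
{
    std::ostringstream out;
    out << "characters: " << characters << '\n'
        << "words: " << words << '\n'
        << "lines: " << lines << '\n';
    for (const auto& entry : top)
        out << '<' << entry.first << ">: " << entry.second << '\n';
    return out.str();
}
```

Returning a string rather than printing directly makes the format easy to check in a unit test before writing it to a file or to the console.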

Added on 11.21

Unit test results on an English novel:

IV. Key Code

The Counttop10 function:

vector<pair<string, int>> Word::Counttop10(char *argv)
{
    // Sort the words by frequency with a multimap, then store the top ten in a vector
    mapword.clear();                            // reset the member map first
    map<string, int>::iterator iter;         // iterator over the word map
    multimap<int, string> mapint;
    multimap<int, string>::iterator iter2;
    string name, word;            
    long ans, num, i, j, wordpos;
    vector<pair<string, int>> top10;
    ifstream Fileread;                            // input file stream
    Fileread.open(argv, std::ios::in);
    if (Fileread.fail())                             // bail out if the file cannot be opened
    {
        printf("file doesn't exist\n");
        return top10;
    }
    while (getline(Fileread, name))        // read line by line; the loop ends once no line is left
    {
        ans = 0; wordpos = 0;
        num = name.size();
        for (i = 0; i<num; i++)
        {
            if ('A' <= name[i] && name[i] <= 'Z') name[i] += 'a' - 'A'; // convert uppercase to lowercase
            // a letter extends the current run of word characters
            if ('a' <= name[i] && name[i] <= 'z')
            {
                ans++;
                continue;
            }
            if ('0' <= name[i] && name[i] <= '9')
            {
                if (ans >= 4)
                {
                    continue;
                }
                else
                {
                    for (j = i; j<num; j++)
                    {
                        if ('0' <= name[j] && name[j] <= '9')
                            continue;
                        else if (('a' <= name[j] && name[j] <= 'z') || ('A' <= name[j] && name[j] <= 'Z'))
                            continue;
                        else
                        {
                            // name[j] is a separator, so the next candidate word starts
                            // right after it; a digit there sends the token back through
                            // this branch, which rejects it (e.g. "ab1;12abcd" yields no word)
                            wordpos = j + 1;
                            i = j;
                            break;
                        }
                    }    // scan ahead to the next separator
                    if (j == num)
                    {
                        break;
                    }    // no separator left on this line
                    ans = 0;
                }
            }
            else
            {
                if (ans >= 4)
                {
                    // record the word
                    word = name.substr(wordpos, i - wordpos);
                    iter = mapword.find(string(word));
                    if (iter != mapword.end())
                        iter->second += 1;
                    else
                        mapword.insert(pair<string, int>(word, 1));
                    
                    //    cout<<"word:"<<word<<endl;
                }// end of word extraction
                // The next candidate word starts right after this separator; if a digit
                // follows, the digit branch above rejects that whole token
                // (e.g. ";12abcd" yields no word).
                wordpos = i + 1;
                ans = 0;
            }
        }
        // handle a line that ends with a valid word (no trailing separator)
        if (ans >= 4)
        {
            word = name.substr(wordpos, i - wordpos);
            iter = mapword.find(string(word));
            if (iter != mapword.end())
                iter->second += 1;
            else
            {
                mapword.insert(pair<string, int>(word, 1));
            }
        }
    }
    num = 0;
    iter = mapword.begin();
    for (; iter != mapword.end(); iter++)
    {
        mapint.insert(pair<int, string>(-iter->second, iter->first));
    }
    for (iter2 = mapint.begin(); iter2 != mapint.end(); iter2++)
    {
        num++;
        top10.push_back(make_pair(iter2->second.c_str(), -(iter2->first)));
        if (num == 10)
            break;
    }
    Fileread.close();
    return top10;    // return the vector of (word, count) pairs
}

V. Summary

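Through this assignment I learned how the design process of a program works and how to lay out a PSP table. Because my own programming skills are weak, I studied and borrowed some code from other people along the way. I also learned how to use vector and multimap from the STL, and how to encapsulate functionality in a class.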

posted on 2020-11-11 19:36 LllIKT
