[Software Engineering] A Program for Counting the Words in a Text and Finding Its Most Frequent Words

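What the code does:

It counts the total number of words in any given passage and ranks the words that occur most often.

The goal is to accept text of any length, although the listing below stores at most 200 words of up to 19 characters each. The ten most frequent words are reported.

Programming language: C

The code: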

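// 字数统计_1.cpp : Defines the entry point for the console application.
//

#include <stdio.h>
#include <string.h>

#define MAX_WORDS 200   // capacity of the word table
#define WORD_LEN  20    // longest stored word, including the terminating '\0'

FILE *fp;

// Structure: a word and the number of times it occurs
struct wordnum
{
    char str1[WORD_LEN];
    int k;
};


// Main function
int main(int argc, char* argv[])
{
    int str;            // int, not char, so that EOF can be told apart from real data
    int num=0,i=-1;
    struct wordnum word[MAX_WORDS];

    if((fp=fopen("字数统计.txt","r"))==NULL)
    {
        printf("Cannot open the file!");
    }
    else
    {
        str=fgetc(fp);
        while(str!=EOF && num<MAX_WORDS)
        {
            if(str==' ')
            {
                // a space ends the current word
                word[num].str1[i+1]='\0';
                word[num].k=1;
                num++;
                i=-1;
            }
            else if(i<WORD_LEN-2)   // leave room for '\0'; extra characters are dropped
            {
                word[num].str1[++i]=(char)str;
            }

            str=fgetc(fp);
        }
        fclose(fp);

        // num counts the words closed by a space, so the file has to end with a space
        printf("Total number of words in the text: %d\n\n\n",num);

        // Find the highest occurrence count for each word

        for(int r1=0;r1<num-1;r1++)
        {
            for(int r2=r1+1;r2<num;r2++)
            {
                if(!strcmp(word[r1].str1,word[r2].str1))
                {
                    word[r1].k=word[r1].k+1;
                }
            }
        }

        // Reset every repeated occurrence to 1 so that only the first one keeps the full count
        for(int r1=num-1;r1>0;r1--)
        {
            for(int r2=r1-1;r2>=0;r2--)
            {
                if(!strcmp(word[r1].str1,word[r2].str1))
                {
                    word[r1].k=1;
                }
            }
        }

        // Every word now has its count, so all that remains is sorting; bubble sort is used.

        struct wordnum temp;
        for(int r4=0;r4<num-1;r4++)
        {
            // stop at num-1-r4 so that word[r3+1] stays inside the counted words
            for(int r3=0;r3<num-1-r4;r3++)
            {
                if(word[r3].k<word[r3+1].k)
                {
                    temp=word[r3];
                    word[r3]=word[r3+1];
                    word[r3+1]=temp;
                }
            }
        }
        // Punctuation attached to a word is not detected yet.


        printf("==Word=======Count====\n");
        for(int j=0;j<10 && j<num;j++)
        {
            printf("%s     %d\n",word[j].str1,word[j].k);
        }
    }
    printf("\n");
    return 0;
}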

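How it works in practice:

To make debugging easier, the passage is read from a file. First create 字数统计.txt in the project directory and put the passage to be tested into it.

Here is an English passage I picked at random from my computer: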

Of the forces shaping higher education none is more sweeping than the movement across borders. Over the past three decades the number of students leaving home each year to study abroad has grown at an annual rate of 3.9 percent, from 800,000 in 1975 to 2.5 million in 2004. Most travel from one developed nation to another, but the flow from developing to developed countries is growing rapidly. The reverse flow, from developed to developing countries, is on the rise, too. Today foreign students earn 30 percent of the doctoral degrees awarded in the United States and 38 percent of those in the United Kingdom. And the number crossing borders for undergraduate study is growing as well, to 8 percent of the undergraduates at America’s best institutions and 10 percent of all undergraduates in the U.K. In the United States, 20 percent of the newly hired professors in science and engineering are foreign-born, and in China many newly hired faculty members at the top research universities received their graduate education abroad

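With that in place, compile and run the test.

As the output shows, the program reports the total word count and the ten most frequent words.

Implementation approach: the first step is counting the words in the passage. I rely on spaces for this, because each space normally marks the end of one word. That is not always true, though. Punctuation also has to be considered, since a sentence ends with a punctuation mark rather than a space. For now the words are separated by spaces only, and punctuation is dealt with later. Once that works, the next step is counting how often each word occurs. A struct is the natural way to represent a word: it holds the word itself and its occurrence count.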

// Structure: a word and the number of times it occurs
struct wordnum
{
    char str1[20];
    int k;
};

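To collect the words, I read one character at a time and decide, based on whether it is a space, whether to keep appending it to the current struct.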

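while(str!=EOF)
{
    if(str==' ')
    {
        // a space ends the current word: terminate it and move to the next slot
        word[num].str1[i+1]='\0';
        num++;
        i=-1;
    }
    else
    {
        // any other character is appended to the current word
        word[num].str1[++i]=str;
        word[num].k=1;
    }

    str=fgetc(fp);
}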

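With the code above, the passage is split into individual words. Printing at this point gives one word per line, each with a count of 1. Since some words are repeated, the next question is how to count the occurrences of each word.

Starting from the first word, we compare it with every word after it. Each time the two are identical, the k field of the word being compared is incremented by 1. This is repeated for every word in turn. As a result, the first occurrence of a word, having been compared with all later occurrences, holds the complete count.

For example (ignoring punctuation for now): my name is ren guo qing. what is your name?

Here name occurs twice. After counting with the method above, the struct for the first name has k=2 and the second has k=1. The same goes for is.

This seems to find the maximum count, but a new problem appears.

For example: my name is ren guo qing. what is your name? his name is zhang san.

Here the k values of name, from left to right, are 3, 2 and 1. After sorting, the name entry with k=2 ends up ranked second, so the same word shows up again. Only the single name entry with k=3 is wanted; the others should not appear. So for every word, the k of all occurrences other than the first has to be changed. We can compare in reverse order and set them to 1. That leaves only one maximum per word, so the top of the sorted list no longer contains duplicates.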

// Find the highest occurrence count for each word

for(int r1=0;r1<num-1;r1++)
{
    for(int r2=r1+1;r2<num;r2++)
    {
        if(!strcmp(word[r1].str1,word[r2].str1))
        {
            word[r1].k=word[r1].k+1;
        }
    }
}

// r1 is declared again: in standard C and C++ the r1 of the first loop is out of scope here
for(int r1=num-1;r1>0;r1--)
{
    for(int r2=r1-1;r2>=0;r2--)
    {
        if(!strcmp(word[r1].str1,word[r2].str1))
        {
            word[r1].k=1;
        }
    }
}
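
The two passes can be tried in isolation on a small in-memory word list. The following is only a sketch: the function names `count_forward` and `reset_repeats` are made up for illustration, and it assumes the punctuation has already been removed from the words, which the space-based reader does not do.

```c
#include <string.h>

struct wordnum {
    char str1[20];
    int k;
};

/* Pass 1: each entry counts itself plus every identical entry after it. */
void count_forward(struct wordnum *word, int num)
{
    for (int r1 = 0; r1 < num; r1++) {
        word[r1].k = 1;
        for (int r2 = r1 + 1; r2 < num; r2++) {
            if (strcmp(word[r1].str1, word[r2].str1) == 0)
                word[r1].k++;
        }
    }
}

/* Pass 2: any entry that has an identical entry before it is reset to 1. */
void reset_repeats(struct wordnum *word, int num)
{
    for (int r1 = num - 1; r1 > 0; r1--) {
        for (int r2 = r1 - 1; r2 >= 0; r2--) {
            if (strcmp(word[r1].str1, word[r2].str1) == 0) {
                word[r1].k = 1;
                break;  /* one earlier match is enough */
            }
        }
    }
}
```

For the word list my, name, is, what, is, your, name, his, name, is, the first pass gives the three name entries k = 3, 2, 1, and the second pass leaves 3, 1, 1.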

The next step is to pick a sorting method, sort by the value of k, and print the ten most frequent words. Bubble sort is used here.

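// Every word now has its count, so all that remains is sorting; bubble sort is used.

struct wordnum temp;
for(int r4=0;r4<num-1;r4++)
{
    // stop at num-1-r4 so that word[r3+1] stays inside the counted words
    for(int r3=0;r3<num-1-r4;r3++)
    {
        if(word[r3].k<word[r3+1].k)
        {
            temp=word[r3];
            word[r3]=word[r3+1];
            word[r3+1]=temp;
        }
    }
}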
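
As an alternative to the hand-written bubble sort, the standard library function `qsort` can do the same ordering. This is a sketch of a different technique, not what the post uses; `by_count_desc` and `sort_by_count` are hypothetical names. Note that `qsort` does not promise to keep words with equal counts in their original order.

```c
#include <stdlib.h>

struct wordnum {
    char str1[20];
    int k;
};

/* Comparison callback for qsort: the entry with the larger k sorts first. */
static int by_count_desc(const void *a, const void *b)
{
    const struct wordnum *x = a;
    const struct wordnum *y = b;
    return (y->k > x->k) - (y->k < x->k);
}

/* Hypothetical wrapper: sorts word[0..num-1] by descending count. */
void sort_by_count(struct wordnum *word, int num)
{
    qsort(word, (size_t)num, sizeof word[0], by_count_desc);
}
```

Calling `sort_by_count(word, num)` would take the place of the two nested loops above.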

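At this point the job looks done, but some shortcomings remain.

① Splitting on spaces ignores punctuation, which introduces errors: a word with a punctuation mark attached is treated as a different word, and several consecutive spaces produce empty words.

② The file must end with an extra space for the last word to be read correctly, which is not what users expect.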
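
Both shortcomings are easy to reproduce with a small sketch that applies the same space-only rule to a string instead of a file. The helper name `split_on_spaces` and the constants are assumptions made for this illustration.

```c
#define MAX_WORDS 200
#define WORD_LEN  20

/* Hypothetical helper: splits text with the space-only rule.
   Returns how many words were closed by a space. */
int split_on_spaces(const char *text, char words[][WORD_LEN], int max_words)
{
    int num = 0;   /* index of the word being built */
    int i = 0;     /* next free position inside that word */

    for (; *text != '\0' && num < max_words; text++) {
        if (*text == ' ') {
            words[num][i] = '\0';     /* a space always closes the current word */
            num++;
            i = 0;
        } else if (i < WORD_LEN - 1) {
            words[num][i++] = *text;  /* punctuation is stored like any letter */
        }
    }
    return num;
}
```

With a `char words[MAX_WORDS][WORD_LEN]` buffer, the input "what is your name? " yields four words, the last of which is "name?" with the question mark attached. The input "a  b " (two spaces) yields three words, one of them empty. The input "my name" yields only one word, because nothing closes the second.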

=====================2014/2/26 19:50=========================

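To take punctuation into account, every character of a word has to be checked individually. So as each character is read, we decide whether it should be added to the word.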

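while(str!=EOF)
{
    // character literals instead of the ASCII codes 48-57, 65-90 and 97-122
    if((str>='0'&&str<='9')||(str>='A'&&str<='Z')||(str>='a'&&str<='z'))
    {
        word[num].str1[++i]=str;
        word[num].k=1;
    }
    else
    {
        // anything else ends the current word
        word[num].str1[i+1]='\0';
        num++;
        i=-1;
    }

    str=fgetc(fp);
}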

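Here only the characters 0-9, a-z and A-Z are accepted. Any other character is not part of a word, and it marks the point where a complete word has been read. This also solves the problem with the last character: a text normally ends with a punctuation mark such as a question mark, exclamation mark or full stop, so a trailing space no longer has to be added to the input.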
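
The same test can also be written with `isalnum` from `<ctype.h>`, which in the default "C" locale accepts exactly these three ranges. The helper name `is_word_char` is an assumption for this sketch, not something from the program above.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: nonzero if c may be part of a word.
   c is a value as returned by fgetc, so it may also be EOF. */
int is_word_char(int c)
{
    /* the cast keeps isalnum away from negative values other than EOF */
    return c != EOF && isalnum((unsigned char)c);
}
```

The condition in the reading loop would then become `if(is_word_char(str))`.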

The printed output was also laid out again: words are left-aligned and counts are right-aligned.

The revised code:

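// 字数统计_1.cpp : Defines the entry point for the console application.
//

#include <stdio.h>
#include <string.h>

#define MAX_WORDS 200   // capacity of the word table
#define WORD_LEN  20    // longest stored word, including the terminating '\0'

FILE *fp;

// Structure: a word and the number of times it occurs
struct wordnum
{
    char str1[WORD_LEN];
    int k;
};


// Main function
int main(int argc, char* argv[])
{
    int str;            // int, not char, so that EOF can be told apart from real data
    int num=0,i=-1;
    struct wordnum word[MAX_WORDS];

    if((fp=fopen("字数统计.txt","r"))==NULL)
    {
        printf("Cannot open the file!");
    }
    else
    {
        str=fgetc(fp);
        while(str!=EOF && num<MAX_WORDS)
        {
            if((str>='0'&&str<='9')||(str>='A'&&str<='Z')||(str>='a'&&str<='z'))
            {
                if(i<WORD_LEN-2)    // leave room for '\0'; extra characters are dropped
                {
                    word[num].str1[++i]=(char)str;
                }
            }
            else
            {
                // any character that is not a letter or digit ends the current word
                word[num].str1[i+1]='\0';
                word[num].k=1;
                num++;
                i=-1;
            }

            str=fgetc(fp);
        }
        fclose(fp);

        // num counts the words closed by a non-word character,
        // so the text is expected to end with a punctuation mark
        printf("Total number of words in the text: %d\n\n\n",num);


        // Find the highest occurrence count for each word

        for(int r1=0;r1<num-1;r1++)
        {
            for(int r2=r1+1;r2<num;r2++)
            {
                if(!strcmp(word[r1].str1,word[r2].str1))
                {
                    word[r1].k=word[r1].k+1;
                }
            }
        }

        // Reset every repeated occurrence to 1 so that only the first one keeps the full count
        for(int r1=num-1;r1>0;r1--)
        {
            for(int r2=r1-1;r2>=0;r2--)
            {
                if(!strcmp(word[r1].str1,word[r2].str1))
                {
                    word[r1].k=1;
                }
            }
        }

        // Every word now has its count, so all that remains is sorting; bubble sort is used.

        struct wordnum temp;
        for(int r4=0;r4<num-1;r4++)
        {
            // stop at num-1-r4 so that word[r3+1] stays inside the counted words
            for(int r3=0;r3<num-1-r4;r3++)
            {
                if(word[r3].k<word[r3+1].k)
                {
                    temp=word[r3];
                    word[r3]=word[r3+1];
                    word[r3+1]=temp;
                }
            }
        }

        // Several punctuation marks in a row are not handled yet.


        printf("=Word=================Count====\n");
        for(int j=0;j<10 && j<num;j++)
        {
            printf("%-10s     %10d\n",word[j].str1,word[j].k);
        }
    }
    printf("\n");
    return 0;
}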

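Printed result:

The problem now is that several consecutive symbols make the word count wrong.

For example: my name is ren guo qing,what is your name???

The three question marks are not proper punctuation, but if they appear in a text this program cannot detect them. Other characters behave the same way.

Remaining problems:

1. With several consecutive punctuation marks, every mark after the first is recorded as a word, which distorts the total.

2. Numbers are misread. For example, 3,000 is read as two words.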

========================2014.2.27 15:50================================

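To handle several consecutive symbols we can use a flag, so that the extra punctuation marks following the first one can be recognized and skipped.

Define a flag: it is set to 1 whenever a letter or digit is read, and reset to 0 once the first non-word character after it has closed the word. While flag is 0, further non-word characters are ignored.

Code: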

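// flag has to be declared before the loop: int flag=0;
while(str!=EOF)
{
    if((str>='0'&&str<='9')||(str>='A'&&str<='Z')||(str>='a'&&str<='z'))
    {
        word[num].str1[++i]=str;
        word[num].k=1;
        flag=1;     // we are inside a word
    }
    else
    {
        // only the first separator after a word closes it
        if(flag==1)
        {
            word[num].str1[i+1]='\0';
            num++;
            i=-1;
            flag=0;
        }
    }

    str=fgetc(fp);
}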

This solves the problem of several symbols in a row. With the extra symbols filtered out, we get the true number of words.
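
To check the flag logic without a file, the same idea can be applied to a string. This is a minimal sketch; `count_words` is a hypothetical name, and counting a final word that has nothing after it is an addition that the loop above does not make.

```c
#include <ctype.h>

/* Hypothetical helper: counts the words in a string with the flag idea. */
int count_words(const char *text)
{
    int num = 0;
    int flag = 0;   /* 1 while we are inside a word */

    for (; *text != '\0'; text++) {
        if (isalnum((unsigned char)*text)) {
            flag = 1;
        } else if (flag == 1) {
            num++;      /* the first separator after a word closes it */
            flag = 0;   /* further separators are ignored */
        }
    }
    if (flag == 1)
        num++;          /* assumption: also count a last word with nothing after it */
    return num;
}
```

For "my name is ren guo qing,what is your name???" this returns 10, while "3,000" still counts as 2.
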
Only one problem is left: numbers. In English text a number may contain punctuation, and it should not be counted as two words.

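For example: Over the past three decades the number of students leaving home each year to study abroad has grown at an annual rate of 3.9 percent, from 800,000 in 1975 to 2.5 million in 2004.

In this sentence 3.9, 800,000 and 2.5 each contain a punctuation mark, so the program splits each of them into two words when reading. A way to handle numbers still has to be found.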

=========================2014.2.28 15:53===========================

How can the number problem be solved? (⊙o⊙)…
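
One possible direction, which is not from the original post, is to treat a '.' or ',' as part of the token whenever it sits directly between two digits. The sketch below works on a string so that it can look one character ahead; the names `joins_digits` and `count_tokens` are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical rule: '.' or ',' belongs to a token only when it sits
   directly between two digits, as in 3.9 or 800,000. */
static int joins_digits(const char *text, size_t pos)
{
    if (text[pos] != '.' && text[pos] != ',')
        return 0;
    if (pos == 0)
        return 0;
    /* text[pos + 1] is at worst the terminating '\0', which is not a digit */
    return isdigit((unsigned char)text[pos - 1]) &&
           isdigit((unsigned char)text[pos + 1]);
}

/* Counts tokens with the flag idea, keeping numbers such as 2.5 in one piece. */
int count_tokens(const char *text)
{
    int num = 0;
    int flag = 0;   /* 1 while we are inside a token */

    for (size_t pos = 0; text[pos] != '\0'; pos++) {
        if (isalnum((unsigned char)text[pos]) || joins_digits(text, pos)) {
            flag = 1;
        } else if (flag == 1) {
            num++;
            flag = 0;
        }
    }
    if (flag == 1)
        num++;
    return num;
}
```

With this rule, "3.9 percent, from 800,000 in 1975 to 2.5 million in 2004." counts as 11 tokens. The rule has limits: a list written without spaces, such as "1975,2004", would be merged into one token, and abbreviations like "U.K." are still split. When reading from a file with fgetc, the look-ahead would need something like `ungetc`, which is not shown here.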

=========================2014.3.2 22.25============================

posted @ 2014-02-26 19:50 任国庆
