Assignment 2: Word Frequency Count

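When I saw this assignment I decided to do it in C#. I had never learned the language, but there is a first time for everything, so I borrowed a book and, with help from the almighty Baidu, happily (?) started teaching myself.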

Feature 1, key points:

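My understanding of Feature 1: the user types a file name at the keyboard, the program reads that file (a hard part), and then reports the total number of distinct words in it (a repeated word counts once) and how many times each word occurs (another hard part).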

First, reading a file requires its path, so I needed a method that returns the path for a given file name. I searched Baidu for one, see the screenshot below:

Then I looked up how to use Console, including the difference between ReadLine and Read:

With that I could read a file whose name is typed in. Part of the code:

class Program
    {
        static void Main(string[] args)
        {
            Console.Write(">type ");
            string filename = Console.ReadLine();
            string path = Path.GetFullPath(filename);
            StreamReader sr = new StreamReader(path);
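As a small illustration of the two lookups above, the following sketch shows what Path.GetFullPath returns for a bare file name and how Console.Read differs from Console.ReadLine. The class name InputDemo and the sample input are made up for this example, and Console.SetIn is used only so the demo does not need a real keyboard.

```csharp
using System;
using System.IO;

static class InputDemo
{
    static void Main()
    {
        // GetFullPath does not search the disk; it only combines a relative
        // name with the current working directory.
        string full = Path.GetFullPath("test.txt");
        Console.WriteLine(Path.IsPathRooted(full));   // True
        Console.WriteLine(Path.GetFileName(full));    // test.txt

        // Feed fixed text to the console instead of typing it (demo only).
        Console.SetIn(new StringReader("ab\ncd\n"));

        int first = Console.Read();         // one character, returned as an int
        string rest = Console.ReadLine();   // remainder of the current line
        string next = Console.ReadLine();   // the following line

        Console.WriteLine((char)first);     // a
        Console.WriteLine(rest);            // b
        Console.WriteLine(next);            // cd
    }
}
```

Because GetFullPath only resolves against the working directory, the file has to sit where the program is started from; otherwise the StreamReader constructor throws FileNotFoundException.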

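To preprocess the text, the program reads the English document, converts everything to lowercase, treats punctuation and special symbols like spaces, and splits the text into words that are stored in an array. (The code below does the last two steps at once by passing the punctuation characters to Split as separators.) I looked up how to convert character case, see the screenshot below: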

Part of the code:

            string document = sr.ReadLine();
            Console.WriteLine(document);
            document = document.ToLower();
            char[] s = { ' ', ',', '.', '?', '!', ':', ';', '\'', '\"' };
            string[] S = document.Split(s);
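The preprocessing step can be tried in isolation. Below is a minimal sketch, not the code from the post: the Tokenizer class and Tokenize method are hypothetical names, the whitespace separators are an addition, and StringSplitOptions.RemoveEmptyEntries is used so that adjacent separators such as ", " do not produce empty strings.

```csharp
using System;

static class Tokenizer
{
    // Same punctuation as in the post, plus whitespace characters that
    // appear once the input has more than one line (added for this sketch).
    static readonly char[] Separators =
        { ' ', ',', '.', '?', '!', ':', ';', '\'', '"', '\r', '\n', '\t' };

    public static string[] Tokenize(string text)
    {
        // Lowercase first, then split and drop the empty pieces.
        return text.ToLower()
                   .Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        string[] words = Tokenize("Hello, world! Hello again.");
        Console.WriteLine(string.Join("|", words)); // hello|world|hello|again
    }
}
```

Without RemoveEmptyEntries, Split returns an empty string between every pair of adjacent separators, which is why the printing loop further down has to skip keys equal to "".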

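I consider using the hash table one of the hard parts; that part is partly based on code I found through Baidu: http://www.bubuko.com/infodetail-1444663.html. The Hashtable stores each word as a Key and its number of occurrences as the Value. The keys and values are then copied into arrays, sorted by Value with Array.Sort, and the result is output as text.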

Part of the code:

            Hashtable ha = new Hashtable();
            for (int j = 0; j < S.Length; j++)
            {
                if (ha.ContainsKey(S[j]))
                {
                    ha[S[j]] = (int)ha[S[j]] + 1;
                }
                else
                {
                    ha.Add(S[j], 1);
                }
            }
            string[] haKey = new string[ha.Count];
            int[] haValue = new int[ha.Count];
            ha.Keys.CopyTo(haKey, 0);
            ha.Values.CopyTo(haValue, 0);
            Console.WriteLine();
            Console.WriteLine(">wf -s test.txt");
            Console.WriteLine("total " + ha.Count);
            Console.WriteLine();
            Array.Sort(haValue, haKey);
            for (int j = haKey.Length - 1; j >= 0; j--)
            {
                if ((string)haKey[j] != "")
                {
                    Console.Write(haKey[j].ToString() + " ");
                    Console.WriteLine(haValue[j].ToString());
                }
            }
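For comparison, the same counting idea can be written with the generic Dictionary<string, int> and LINQ instead of Hashtable and Array.Sort. This is an alternative sketch rather than the approach used in the post; WordCounter, Count and ByFrequency are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class WordCounter
{
    public static Dictionary<string, int> Count(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>();
        foreach (string w in words)
        {
            int current;
            counts.TryGetValue(w, out current); // current is 0 for a new word
            counts[w] = current + 1;
        }
        return counts;
    }

    // Highest count first; ties are broken alphabetically so the order is repeatable.
    public static List<KeyValuePair<string, int>> ByFrequency(Dictionary<string, int> counts)
    {
        return counts.OrderByDescending(p => p.Value)
                     .ThenBy(p => p.Key, StringComparer.Ordinal)
                     .ToList();
    }

    static void Main()
    {
        var counts = Count(new[] { "b", "a", "b", "c", "a", "b" });
        Console.WriteLine("total " + counts.Count);      // total 3
        foreach (var p in ByFrequency(counts))
            Console.WriteLine(p.Key + " " + p.Value);    // b 3, then a 2, then c 1
    }
}
```

The generic dictionary avoids the (int) casts that Hashtable needs. Array.Sort is not a stable sort, so in the Hashtable version words with equal counts can come out in any order; the ThenBy call here is one way to pin that order down.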

After a round of fiddling and debugging, the final code:

using System;
using System.Collections;
using System.IO;

namespace consoleApplication2
{
    class Program
    {
        static void Main(string[] args)
        {
            // Prompt for a file name and resolve it against the current directory.
            Console.Write(">type ");
            string filename = Console.ReadLine();
            string path = Path.GetFullPath(filename);

            // ReadLine reads only the first line of the file.
            string document;
            using (StreamReader sr = new StreamReader(path))
            {
                document = sr.ReadLine() ?? "";
            }
            Console.WriteLine(document);

            // Lowercase, then split on spaces and punctuation, dropping empty pieces
            // so that "" is never counted as a word.
            document = document.ToLower();
            char[] s = { ' ', ',', '.', '?', '!', ':', ';', '\'', '\"' };
            string[] S = document.Split(s, StringSplitOptions.RemoveEmptyEntries);

            // Count occurrences: key = word, value = number of times seen.
            Hashtable ha = new Hashtable();
            for (int j = 0; j < S.Length; j++)
            {
                if (ha.ContainsKey(S[j]))
                {
                    ha[S[j]] = (int)ha[S[j]] + 1;
                }
                else
                {
                    ha.Add(S[j], 1);
                }
            }

            // Keys and Values are returned in matching order.
            string[] haKey = new string[ha.Count];
            int[] haValue = new int[ha.Count];
            ha.Keys.CopyTo(haKey, 0);
            ha.Values.CopyTo(haValue, 0);
            Console.WriteLine();
            Console.WriteLine(">wf -s " + filename);
            Console.WriteLine("total " + ha.Count);
            Console.WriteLine();

            // Sort ascending by count (haKey is reordered along with haValue),
            // then print from the end to get descending order.
            Array.Sort(haValue, haKey);
            for (int j = haKey.Length - 1; j >= 0; j--)
            {
                Console.Write(haKey[j] + " ");
                Console.WriteLine(haValue[j]);
            }
        }
    }
}

Screenshot of the run:

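At this point I only thought Feature 1 was finished: I had overlooked aligning the output. I found a way to align the output only while implementing Feature 2.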

Feature 1 revision:

 Console.Write(haKey[j].ToString().PadRight(10, ' ') + " ");
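To see what PadRight does to a single output row, here is a small sketch. AlignDemo and FormatRow are hypothetical names, and the column width of 10 is taken from the line above.

```csharp
using System;

static class AlignDemo
{
    public static string FormatRow(string word, int count)
    {
        // PadRight(10) appends spaces until the string is 10 characters long.
        // A longer word is returned unchanged, so it would still push its
        // count out of the column.
        return word.PadRight(10) + " " + count;
    }

    static void Main()
    {
        Console.WriteLine(FormatRow("the", 12));
        Console.WriteLine(FormatRow("frequency", 3));
    }
}
```

The composite format string "{0,-10} {1}" passed to string.Format produces the same left-aligned column; which of the two to use is mostly a matter of taste.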

Screenshot of the run after the change:

Feature 1 took quite a long time, but with it as a foundation Feature 2 should be easier to implement.

Feature 2:

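I wrote the Feature 2 code by modifying Feature 1. The first change was replacing ReadLine with ReadToEnd, which is what guarantees that the whole file is read. Part of the code: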

class Program
    {
        static void Main(string[] args)
        {
            Console.Write(">wf ");
            string filename = Console.ReadLine();
            string path = Path.GetFullPath(filename);
            StreamReader sr = new StreamReader(path);
            string document = sr.ReadToEnd();
            document = document.ToLower();
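The difference between the two calls is easy to check without a file, because StringReader offers the same ReadLine and ReadToEnd methods as StreamReader. A minimal sketch with made-up sample text (ReadDemo is a hypothetical name):

```csharp
using System;
using System.IO;

static class ReadDemo
{
    static void Main()
    {
        string text = "first line\nsecond line\n";

        using (TextReader r = new StringReader(text))
        {
            // Stops at the first line break and does not include it.
            Console.WriteLine(r.ReadLine());            // first line
        }

        using (TextReader r = new StringReader(text))
        {
            // Returns everything that is left, line breaks included.
            string all = r.ReadToEnd();
            Console.WriteLine(all.Contains("second"));  // True
            Console.WriteLine(all.Contains("\n"));      // True
        }
    }
}
```

One consequence worth noting: since ReadToEnd keeps the line breaks, '\n' and '\r' need to be among the Split separators, otherwise the last word of one line and the first word of the next are counted as a single word.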

At the end I added a piece of code so that, as the assignment requires, only the 10 most frequent words are printed. Part of the code:

            Array.Sort(haValue, haKey);
            int n = 0;
            for (int j = haKey.Length - 1; j >= 0; j--)
            {
                if ((string)haKey[j] != "")
                {
                    if (n < 10)
                    {
                        Console.Write(haKey[j].ToString().PadRight(10, ' '));
                        Console.WriteLine(haValue[j]);
                        n++;
                    }
                }
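The "first 10" logic can also be expressed with LINQ's Take, which stops by itself when fewer than n words exist. This is a sketch of an alternative, not the post's code; TopWords and Top are hypothetical names and the sample counts are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TopWords
{
    public static List<KeyValuePair<string, int>> Top(
        IDictionary<string, int> counts, int n)
    {
        // Sort by count, highest first, and keep at most n entries.
        return counts.OrderByDescending(p => p.Value)
                     .Take(n)
                     .ToList();
    }

    static void Main()
    {
        var counts = new Dictionary<string, int>
        {
            { "the", 5 }, { "a", 3 }, { "cat", 1 }, { "sat", 2 }
        };
        foreach (var p in Top(counts, 2))
            Console.WriteLine(p.Key.PadRight(10) + p.Value);
    }
}
```

With the sample data, Top(counts, 2) yields "the" and then "a"; asking for 10 simply returns all four entries.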

How to use .PadRight is also something Baidu taught me, see the screenshot below:

Final code:

using System;
using System.Collections;
using System.IO;

namespace consoleApplication2
{
    class Program
    {
        static void Main(string[] args)
        {
            // Prompt for a file name and resolve it against the current directory.
            Console.Write(">wf ");
            string filename = Console.ReadLine();
            string path = Path.GetFullPath(filename);

            // ReadToEnd reads the whole file, line breaks included.
            string document;
            using (StreamReader sr = new StreamReader(path))
            {
                document = sr.ReadToEnd();
            }

            // Lowercase, then split on whitespace and punctuation. Line breaks are
            // separators too, and empty pieces are dropped so "" is never counted.
            document = document.ToLower();
            char[] s = { ' ', '\r', '\n', '\t', ',', '.', '?', '!', ':', ';', '\'', '\"' };
            string[] S = document.Split(s, StringSplitOptions.RemoveEmptyEntries);

            // Count occurrences: key = word, value = number of times seen.
            Hashtable ha = new Hashtable();
            for (int j = 0; j < S.Length; j++)
            {
                if (ha.ContainsKey(S[j]))
                {
                    ha[S[j]] = (int)ha[S[j]] + 1;
                }
                else
                {
                    ha.Add(S[j], 1);
                }
            }

            // Keys and Values are returned in matching order.
            string[] haKey = new string[ha.Count];
            int[] haValue = new int[ha.Count];
            ha.Keys.CopyTo(haKey, 0);
            ha.Values.CopyTo(haValue, 0);
            Console.WriteLine("total " + ha.Count + " words");
            Console.WriteLine();

            // Sort ascending by count, then print at most 10 entries from the end.
            Array.Sort(haValue, haKey);
            int n = 0;
            for (int j = haKey.Length - 1; j >= 0 && n < 10; j--, n++)
            {
                Console.Write(haKey[j].PadRight(10, ' '));
                Console.WriteLine(haValue[j]);
            }
        }
    }
}

Screenshot of the run:

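As for Features 3 and 4: my ability is limited and I have not fully figured out multithreading yet, so they are not implemented for now. The code I wrote for them still has a pile of bugs, so I am not posting it.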

This week's PSP

Code and version control: https://coding.net/u/rensijia/p/count-words/git

Posted on 2017-09-18 00:51 by rrrsssjjj
