PTA Solution: jmu-Java&Python-Count the Words in a Text and Sort Them by Occurrence Count

Problem Description
Problem Analysis
Removing Punctuation
Counting Word Frequencies
Sorting the Words

Problem Description

Problem Statement

Test Data 1

Sample Input

failure is probably the fortification in your pole

it is like a peek your wallet as the thief when you
are thinking how to spend several hard-won lepta

when you are wondering whether new money it has laid
background because of you then at the heart of the

most lax alert and most low awareness and left it

godsend failed
!!!!!

Sample Output

46
the=4
it=3
you=3
and=2
are=2
is=2
most=2
of=2
when=2
your=2

Test Data 2

Sample Input

Failure is probably The fortification in your pole!

It is like a peek your wallet as the thief when You
are thinking how to. spend several hard-won lepta.

when yoU are? wondering whether new money it has laid
background Because of: yOu?, then at the heart of the
Tom say: Who is the best? No one dare to say yes.
most lax alert and! most low awareness and* left it

godsend failed
!!!!!

Sample Output

54
the=5
is=3
it=3
you=3
and=2
are=2
most=2
of=2
say=2
to=2

Problem Analysis

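This problem is quite comprehensive: it touches several pieces of basic Java syntax and library usage, so working through it exercises several topics at once. The problems to solve are: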

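Reading words and removing punctuation: the program has to read an unknown number of lines and split each line into words. The lines contain many punctuation marks and spaces, and the punctuation may be attached directly to a word, so simply splitting the string is not enough.
Counting word frequencies: each distinct word in the input needs its own occurrence count.
Sorting words by frequency and dictionary order: the counting result is sorted by the rule "sort by count in descending order; if the counts are equal, sort by key in ascending alphabetical order" and then printed.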

Removing Punctuation

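The first step is to strip punctuation from the input words. Note that even after a line has been split on spaces, the punctuation marks are still attached to the words. Removing them therefore means walking through the string and dropping every mark the problem asks us to remove. A StringBuilder works well here: append each character that is kept, call the StringBuilder's toString() method to produce a String, and then call String's toLowerCase() method to convert all uppercase letters to lowercase. This is wrapped in a method for main to call: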

public static String removePunctuation(String str) {
      StringBuilder strbld = new StringBuilder();
      for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            // Keep every character except the six marks ! . , : * ?
            if (c != '!' && c != '.' && c != ',' && c != ':' && c != '*' && c != '?') {
                  strbld.append(c);
            }
      }
      // Lower-case the result so that "You" and "you" count as the same word
      return strbld.toString().toLowerCase();
}
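As an alternative to the character loop above, the same six marks can probably be stripped with a single regular expression. The following is only a sketch: the class name PunctuationSketch is hypothetical, the character class covers exactly the marks the loop checks (! . , : * ?), and Locale.ROOT is an added assumption to keep lower-casing independent of the machine's locale. Hyphens, as in hard-won, are left in place, just as in the loop version.

```java
import java.util.Locale;

class PunctuationSketch {
    // Removes the marks ! . , : * ? and lower-cases what is left.
    // Inside a character class, '.', '*' and '?' are literal characters.
    static String removePunctuation(String str) {
        return str.replaceAll("[!.,:*?]", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(removePunctuation("yOu?,"));
        System.out.println(removePunctuation("hard-won"));
    }
}
```

The loop version avoids regular expressions entirely; the regex version is shorter but compiles the pattern on every call, which should not matter for input of this size.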

Counting Word Frequencies

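According to the problem, the program reads an unknown number of lines, one whole line at a time, and splits each line into words. String's split() method can split a line on spaces, and removePunctuation() then strips the punctuation from each piece. Empty input lines, and pieces that become empty strings after splitting or after punctuation removal, are special cases that need an extra check.
One way to count frequencies is to define a word class with two fields, the word and its count, and store the objects in a sufficiently large array or a List. Another is to use a Map from String to Integer. Here a HashMap does the counting, and its put() method updates a word's count: if containsKey() shows the word has not been counted yet, add a new mapping with a count of 1; if the word already exists, fetch the current count with get(), add one, and call put() again.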

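// Requires: import java.util.HashMap; import java.util.Map; import java.util.Scanner;
Scanner sc = new Scanner(System.in);
Map<String, Integer> fre_map = new HashMap<String, Integer>();
while (sc.hasNextLine()) {    // also stops cleanly if the input ends without the "!!!!!" line
      String str = sc.nextLine();
      if (str.equals("!!!!!")) {    // end-of-input marker
            break;
      }
      if (str.isEmpty()) {    // skip empty lines
            continue;
      }
      String[] words = str.split(" ");    // split the line into single words
      for (int i = 0; i < words.length; i++) {
            String a_word = Main.removePunctuation(words[i]);    // one word with punctuation removed
            if (a_word.length() == 0) {    // nothing left after splitting or punctuation removal
                  continue;
            }
            if (!fre_map.containsKey(a_word)) {    // word not counted yet: create a new mapping
                  fre_map.put(a_word, 1);
            }
            else {    // word already counted: update its count
                  int num = fre_map.get(a_word) + 1;
                  fre_map.put(a_word, num);
            }
      }
}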
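The containsKey()/get()/put() sequence above can likely be condensed with Map.merge(), available since Java 8. The sketch below is not the author's code: the class FrequencySketch and the method countWords are hypothetical names, the lines are passed in as an array instead of being read from a Scanner so the snippet can run on its own, and it splits on any run of whitespace ("\\s+") rather than on a single space.

```java
import java.util.HashMap;
import java.util.Map;

class FrequencySketch {
    // Counts the words in the given lines (hypothetical helper, not from the original post).
    static Map<String, Integer> countWords(String[] lines) {
        Map<String, Integer> freq = new HashMap<>();
        for (String line : lines) {
            // Assumption: split on runs of whitespace instead of a single space
            for (String token : line.trim().split("\\s+")) {
                // Same six marks as removePunctuation(): ! . , : * ?
                String word = token.replaceAll("[!.,:*?]", "").toLowerCase();
                if (word.isEmpty()) {
                    continue;    // empty line, or a token that was only punctuation
                }
                // Insert 1 for a new word, otherwise add 1 to the stored count
                freq.merge(word, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = countWords(new String[] {"The the THE!", "", "and* you?, You"});
        System.out.println("the=" + freq.get("the"));
        System.out.println("you=" + freq.get("you"));
    }
}
```

merge() does the "new word or existing word" check in a single call, so the explicit if/else branch is no longer needed.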

Sorting the Words

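At the end, the program has to print the 10 most frequent words, sorted by count in descending order and, when counts are equal, by key in ascending alphabetical order. HashMap's entrySet() method returns a view of the mappings contained in the Map; copy it into a List<Map.Entry<String, Integer>> and sort that list with Collections.sort().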

// Copy the view of all mappings in fre_map into an ArrayList
List<Map.Entry<String, Integer>> fre_list = new ArrayList<Map.Entry<String, Integer>>(fre_map.entrySet());
Collections.sort(fre_list, new WordComparator());

System.out.println(fre_list.size());    // number of distinct words
int num = 0;
for (Map.Entry<String, Integer> e : fre_list) {
      System.out.println(e.getKey() + "=" + e.getValue());
      if (++num == 10) {    // print at most 10 entries
            break;
      }
}

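For Collections.sort() to do the sorting, we need a WordComparator class that implements the Comparator interface; its compare() method defines the ordering. When two words have the same count, String's compareTo() method gives dictionary order; otherwise Integer's compareTo() is called with the arguments swapped, so that larger counts come first.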

class WordComparator implements Comparator<Map.Entry<String, Integer>> {

      @Override
      public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
            if (o1.getValue().equals(o2.getValue())) {    // same count: sort in dictionary order
                  return o1.getKey().compareTo(o2.getKey());
            }
            else {    // different counts: sort by count, descending
                  return o2.getValue().compareTo(o1.getValue());
            }
      }
}
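The same ordering can also be expressed without a separate comparator class by chaining the comparators that Map.Entry and Comparator provide since Java 8. This is a sketch under assumptions rather than the author's solution: SortSketch and topEntries are hypothetical names, and the frequency map in main is filled by hand with a few counts from Test Data 1.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SortSketch {
    // Returns at most `limit` entries: count descending, then word ascending.
    static List<Map.Entry<String, Integer>> topEntries(Map<String, Integer> freq, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        // Primary key: the count, in reverse (descending) order
        Comparator<Map.Entry<String, Integer>> byCountDesc =
                Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder());
        // Tie-breaker: the word itself, in natural (dictionary) order
        entries.sort(byCountDesc.thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries.subList(0, Math.min(limit, entries.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();    // hand-filled sample counts
        freq.put("you", 3);
        freq.put("the", 4);
        freq.put("and", 2);
        freq.put("it", 3);
        System.out.println(freq.size());
        for (Map.Entry<String, Integer> e : topEntries(freq, 10)) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

Taking a subList bounded by Math.min(limit, entries.size()) replaces the manual counter and break, and it also covers inputs with fewer than 10 distinct words.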

posted @ 2021-01-15 02:02 乌漆WhiteMoon
