jmu-Java-PTA Solution (6.4 - Collections Framework (Map) - Count the words in a text and sort them by frequency), Chen Zhuo (陈卓), Cybersecurity Class 2312

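Problem statement

Count the number of words in several passages of English text, and also count how many times each word occurs.

Note 1: Words are separated by spaces (one or more).
Note 2: Ignore empty lines and lines that contain only spaces.

Basic version:
Counting is case-sensitive, and the specified punctuation marks are not removed.

Advanced version:

Before counting, remove the specified punctuation marks ! . , : * ? from the text. Note: "remove" here means replacing each such character with one space.
Ignore letter case when counting words.

Input format:

Several lines of English text, terminated by a line containing !!!!!.

Output format:

The number of words (that is, the number of distinct words).
The 10 most frequent words and their counts (sorted by count in descending order; if counts are equal, by key in ascending alphabetical order).

Sample input: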

failure is probably the fortification in your pole

it is like a peek your wallet as the thief when you
are thinking how to spend several hard-won lepta
          
when you are wondering whether new money it has laid
background because of you then at the heart of the
     
most lax alert and most low awareness and left it

godsend failed
!!!!!

Failure is probably The fortification in your pole!

It is like a peek your wallet as the thief when You
are thinking how to. spend several hard-won lepta.

when yoU are? wondering whether new money it has laid
background Because of: yOu?, then at the heart of the
Tom say: Who is the best? No one dare to say yes.
most lax alert and! most low awareness and* left it

godsend failed
!!!!!

Sample output:

46
the=4
it=3
you=3
and=2
are=2
is=2
most=2
of=2
when=2
your=2

54
the=5
is=3
it=3
you=3
and=2
are=2
most=2
of=2
say=2
to=2

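Key points

Text preprocessing: replace the punctuation marks in the input and convert it to a single letter case, so that words can be counted accurately.
Word counting and storage: use a suitable data structure to store each word and its number of occurrences for later analysis.
Sorting and output: sort the counts, then print the number of words and the 10 most frequent words with their counts, as required.

Solution steps

Step 1: Read the input and initialize.

Create a TreeMap named map to store each word and its count. It is initially ordered in reverse (the entries are re-sorted later, so this initial order does not affect the output). Read the first line of input into s, and define spots, the regular expression that matches the specified punctuation marks.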

Scanner in = new Scanner(System.in);
Map<String, Integer> map = new TreeMap<>(Comparator.reverseOrder());
String s = in.nextLine();
String spots = "[!.,:*?]";
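To see what the spots pattern does on its own, here is a minimal sketch. The class name PunctuationDemo and the helper stripPunctuation are hypothetical names introduced only for this illustration; the character class is the one from the solution. Inside square brackets the characters . * ? are treated as literals, so they need no escaping.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the submitted solution.
class PunctuationDemo {
    // Same character class as in the solution. Inside [...] the
    // characters . * ? are literals rather than regex operators.
    static final String SPOTS = "[!.,:*?]";

    // Replaces every listed punctuation mark with a single space.
    static String stripPunctuation(String line) {
        return line.replaceAll(SPOTS, " ");
    }

    public static void main(String[] args) {
        String cleaned = stripPunctuation("background Because of: yOu?, then");
        // Consecutive punctuation leaves runs of spaces behind,
        // which is why the later split uses "\\s+" and not " ".
        System.out.println(Arrays.toString(cleaned.trim().split("\\s+")));
    }
}
```

Because each mark becomes a space rather than being deleted, a sequence such as "how to. spend" keeps its word boundary.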

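Step 2: Text preprocessing and word counting.

For each line, until the terminator !!!!! is read, convert the line to lowercase, replace the specified punctuation marks with spaces, and split the result on runs of whitespace to get the words array. Then iterate over words, skipping empty strings: if a word is not yet in map, store it with a count of 1; if it is already there, increase its count by 1.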

while(!s.equals("!!!!!")){
    String ss = s.toLowerCase();
    ss = ss.replaceAll(spots," ");
    String[] words = ss.split("\\s+");
    for(String e:words){
        if(e.isEmpty()){
            continue;
        }
        if(!map.containsKey(e)){
            map.put(e,1);
        }else{
            map.put(e,map.get(e)+1);
        }
    }
    s = in.nextLine();
}
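The containsKey/put pair above can also be written with Map.merge, which has been in the standard library since Java 8. The following is only a sketch of that alternative: the class WordCounter and the method countLine are hypothetical names, and the preprocessing is copied from the loop above.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the submitted solution.
class WordCounter {
    // Normalizes one input line and adds its words to counts.
    static void countLine(String line, Map<String, Integer> counts) {
        String cleaned = line.toLowerCase().replaceAll("[!.,:*?]", " ");
        for (String word : cleaned.split("\\s+")) {
            if (word.isEmpty()) {
                continue; // produced by blank lines or leading spaces
            }
            // Inserts 1 for a new word, otherwise adds 1 to the old count.
            counts.merge(word, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        countLine("Tom say: Who is the best? No one dare to say yes.", counts);
        System.out.println(counts.get("say"));
    }
}
```

The behavior is the same as the if/else version; merge just does the lookup and the update in one call.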

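Step 3: Sort and output the results.

Print the size of map as the word count. Then copy the entries of map into entryList, a List<Map.Entry<String, Integer>>, and sort it with Collections.sort: first by count in descending order and, when counts are equal, by word in ascending alphabetical order. Finally, print at most the first 10 entries.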

int count = 0;
System.out.println(map.size());
List<Map.Entry<String, Integer>> entryList = new ArrayList<>(map.entrySet());
Collections.sort(entryList, new Comparator<Map.Entry<String, Integer>>() {
    public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
        if((o1.getValue()-o2.getValue())!=0){
            return -(o1.getValue()-o2.getValue());
        }else{
            return o1.getKey().compareTo(o2.getKey());
        }
    }
});
for (Map.Entry<String, Integer> entry : entryList) {
    if(count != 10){
        System.out.println(entry.getKey() + "=" + entry.getValue());
        count++;
    }else{
        break;
    }
}
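The same ordering can be expressed with the comparator factory methods on Map.Entry instead of an anonymous class. This is a sketch under that assumption, not the code that was submitted; TopWords and topN are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the submitted solution.
class TopWords {
    // Returns at most n entries: count descending, then word ascending.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(
            Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.comparingByKey()));
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("you", 3);
        counts.put("the", 4);
        counts.put("it", 3);
        counts.put("and", 2);
        for (Map.Entry<String, Integer> e : topN(counts, 3)) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

Comparing with comparingByValue also avoids subtracting the two counts, which is harmless for word counts but can overflow for arbitrary int values.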

[Figure: overall flowchart]

Complete code:

import java.util.*;

public class Main{
    public static void main(String[] args){
        Scanner in = new Scanner(System.in);
        // word -> number of occurrences; the initial order is irrelevant
        // because the entries are re-sorted before printing
        Map<String, Integer> map = new TreeMap<>(Comparator.reverseOrder());
        String s = in.nextLine();
        // punctuation marks to be replaced by a space
        String spots = "[!.,:*?]";
        // read lines until the terminator "!!!!!"
        while(!s.equals("!!!!!")){
            String ss = s.toLowerCase();
            ss = ss.replaceAll(spots," ");
            String[] words = ss.split("\\s+");
            for(String e:words){
                // skip empty tokens from blank lines or leading spaces
                if(e.isEmpty()){
                    continue;
                }
                if(!map.containsKey(e)){
                    map.put(e,1);
                }else{
                    map.put(e,map.get(e)+1);
                }
            }
            s = in.nextLine();
        }
        int count = 0;
        // number of distinct words
        System.out.println(map.size());
        List<Map.Entry<String, Integer>> entryList = new ArrayList<>(map.entrySet());
        // sort by count descending, then by word ascending
        Collections.sort(entryList, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                if((o1.getValue()-o2.getValue())!=0){
                    return -(o1.getValue()-o2.getValue());
                }else{
                    return o1.getKey().compareTo(o2.getKey());
                }
            }
        });
        // print at most the first 10 entries
        for (Map.Entry<String, Integer> entry : entryList) {
            if(count != 10){
                System.out.println(entry.getKey() + "=" + entry.getValue());
                count++;
            }else{
                break;
            }
        }
    }
}

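Reflection: This solution stores the words and their counts in a TreeMap, which keeps its keys sorted, but insertion and lookup each cost O(log n). Since the entries are re-sorted in a list before printing anyway, for large texts one could count with a HashMap instead, whose insertion and lookup are O(1) on average, and convert its entries to a List for sorting only at the end, which should improve overall performance. Beyond that, a distributed computing framework such as Apache Hadoop or Spark could process large volumes of text in parallel on a cluster. The problem could also be extended, for example by recording where each word appears in the text or by classifying texts according to word frequency, to handle more complex scenarios.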
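As a rough illustration of the HashMap idea, here is a sketch that counts with a HashMap and sorts only once at the end. It is not a measured or submitted implementation: the class HashMapWordCount and the method report are hypothetical names, and report takes the lines as a list and returns the output lines so that it can be tried without standard input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical variant, not the submitted solution.
class HashMapWordCount {
    // Returns the output lines: the distinct word count, then the top 10.
    static List<String> report(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            if (line.equals("!!!!!")) {
                break; // terminator line ends the input
            }
            String cleaned = line.toLowerCase().replaceAll("[!.,:*?]", " ");
            for (String word : cleaned.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        // Single sort at the end: count descending, then word ascending.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        out.add(String.valueOf(counts.size()));
        for (int i = 0; i < Math.min(10, entries.size()); i++) {
            out.add(entries.get(i).getKey() + "=" + entries.get(i).getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("The cat and the dog.", "", "A cat!", "!!!!!");
        for (String line : report(lines)) {
            System.out.println(line);
        }
    }
}
```

Whether this is noticeably faster than the TreeMap version depends on the input size; for the small inputs of this exercise the difference is unlikely to matter.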

posted @ 2025-04-16 10:44 取名字比写博客还难