LeetCode 192. Word Frequency: Working Out a Solution

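Problem

Solution

The problem asks us to count how many times each word occurs in words.txt and to print the counts in descending order of frequency. The sample words.txt contains: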

the day is sunny the

the the sunny is is

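Approach 1

Idea

If we iterate over the words in the file and store each word as a key in a map, with its number of occurrences as the value, then printing the map afterwards solves the problem.

Practice

To store the words as map keys, we first need to split the text into words.

Script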

#!/bin/bash

while read line
do
        for xx in $line
        do
                echo $xx
        done
done < words.txt

Output

# bash shell_192.sh
the
day
is
sunny
the
the
the
sunny
is
is
#

Store the words in a map and sort the result

#!/bin/bash

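# Declare the map (an associative array, available in bash 4 and later)
declare -A dict

# Walk through the input and fill the map
while read -r line
do
        for xx in $line
        do
                # An unseen word has no entry yet, so default its count to 0
                dict[$xx]=$(( ${dict[$xx]:-0} + 1 ))
        done
done < words.txt

# Print the map, then sort by the count column in descending order
for key in "${!dict[@]}"
do
        echo "$key ${dict[$key]}"
done | sort -k 2 -rn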

Result

# bash shell_192.sh
the 4
is 3
sunny 2
day 1
#

Approach 2

Idea

Since we have already split the text into words above, can the uniq command do the counting for us?

Practice

First, sort the words

while read line; do for xx in $line; do echo $xx; done; done < words.txt | sort

Result

day
is
is
is
sunny
sunny
the
the
the
the

Then count the occurrences of each word with uniq

while read line; do for xx in $line; do echo $xx; done; done < words.txt | sort | uniq -c

Result

1 day
3 is
2 sunny
4 the

The required output has the word first and the count second, so we still need awk to swap the two columns

while read line; do for xx in $line; do echo $xx; done; done < words.txt | sort | uniq -c | awk '{print $2 " " $1}'

Result

day 1
is 3
sunny 2
the 4

Then sort by the second column

while read line; do for xx in $line; do echo $xx; done; done < words.txt | sort | uniq -c | awk '{print $2 " " $1}' | sort -k 2 -rn

Result

the 4
is 3
sunny 2
day 1

Can the loop while read line; do for xx in $line; do echo $xx; done; done < words.txt be replaced with a single command?

sed 's# #\n#g' words.txt |  sort | uniq -c | awk '{print $2 " " $1}' | sort -k 2 -rn

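But the submission failed...

Failing test case

Hmm, it turns out that some words are separated by more than one space, so every extra space becomes an empty line. Let's fix that.

Filtering out the empty lines right after sed is enough

sed 's# #\n#g' words.txt | grep -v '^$' |  sort | uniq -c | awk '{print $2 " " $1}' | sort -k 2 -rn

Submitted again, and this time it passed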
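As an alternative that the submission above does not use, tr -s can split on spaces and squeeze the resulting runs of newlines in one step, which would make the grep filter unnecessary. The following is only a sketch under the assumption that the input has no leading spaces; word_freq is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read text on stdin, print "word count" pairs,
# most frequent first.
word_freq() {
        # tr turns every space into a newline; -s then squeezes each run of
        # newlines into one, so repeated spaces leave no empty lines between words
        tr -s ' ' '\n' | sort | uniq -c | awk '{print $2 " " $1}' | sort -k 2 -rn
}

# Sample input with repeated spaces between some words
printf 'the  day is   sunny the\nthe the sunny is is\n' | word_freq
```

With the sample input this prints the same four lines as the sed version: the 4, is 3, sunny 2, day 1.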

Approach 3

Idea

Since a bash script can be written to pass the tests, could an awk script do the job as well?

Use awk to split the text into words

gawk '
{
        for (i=1;i<=NF;i++) {
                print ($i)
        }
}' words.txt

Result

the
day
is
sunny
the
the
the
sunny
is
is

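Use an array to count the occurrences. This is much the same as the script in Approach 1, except that awk keeps the counts in one of its own arrays

Script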

gawk '
{
        for (i=1;i<=NF;i++) {
                S[$i]=S[$i]+1
        }
}

END {
        for (a in S) {
                print (a,S[a])
        }
}
' words.txt | sort -rn -k 2

Result

the 4
is 3
sunny 2
day 1

Digging into the technical details

Using the words.txt file as the example

Reading a file line by line

Method 1

cat words.txt | while read line
do
  ...
done

Method 2

while read line
do
  ...
done < words.txt

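Drawback of Method 1: in bash, each command of a pipeline runs in a subshell by default, so the while loop after | is executed in a subshell. As a result, any variable changed inside the loop in Method 1 reverts to its previous value once the loop ends.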
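The difference can be seen with a small counter experiment. This is a sketch assuming bash with its default settings (the lastpipe option turned off); the variable names count_pipe and count_redirect are made up for the demonstration.

```shell
#!/bin/bash
# Method 1: the loop sits on the right side of a pipe, so it runs in a subshell
count_pipe=0
printf 'a\nb\n' | while read -r line
do
        count_pipe=$((count_pipe + 1))
done

# Method 2: the loop reads from a redirection and runs in the current shell
count_redirect=0
while read -r line
do
        count_redirect=$((count_redirect + 1))
done <<EOF
a
b
EOF

# The increment made inside the pipeline is lost; the other one survives
echo "pipe: $count_pipe, redirect: $count_redirect"
```

Under those assumptions the script prints pipe: 0, redirect: 2, which is why the scripts in this post feed the loop with done < words.txt.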

bash map

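Unlike an ordinary indexed array, a map (associative array) must be declared before it is used

Declare it with declare -A

Get the value for a single key

${mapName[key]}

Get all keys

${!mapName[@]}

Get all values

${mapName[@]}

Get the number of entries in the map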

${#mapName[@]}
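The expansions above can be tried together in a few lines. This is a minimal sketch assuming bash 4 or later; the map name count and the sample words are chosen only for illustration.

```shell
#!/bin/bash
# Declare the map before first use
declare -A count

# Insert or update entries; ${count[$w]:-0} falls back to 0 for a new key
for w in the day is the
do
        count[$w]=$(( ${count[$w]:-0} + 1 ))
done

echo "value for the: ${count[the]}"
echo "number of keys: ${#count[@]}"

# Key order is unspecified, so sort the lines for a stable listing
for key in "${!count[@]}"
do
        echo "$key ${count[$key]}"
done | sort
```

Quoting "${!count[@]}" keeps each key as a single word even if it contains spaces, which the unquoted form would not.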

awk

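The built-in variable NF holds the number of fields in the current line

Arrays can be used directly, without being declared first

Other commands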
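A quick way to see what NF holds is to print it for a few lines. This is a small sketch; fields_per_line is a hypothetical helper name. With awk's default field separator, a run of spaces counts as a single separator, so Approach 3 needs no extra step for repeated spaces.

```shell
#!/bin/bash
# Hypothetical helper: print the number of fields on each input line
fields_per_line() {
        awk '{ print NF }'
}

# Line 1 has five words separated by uneven spacing, line 2 is empty,
# line 3 has two words
printf 'the  day is   sunny the\n\nis is\n' | fields_per_line
```

This prints 5, 0 and 2 on separate lines.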

grep -v '^$'

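-v: invert the match, printing only the lines that do not match

'^$': ^ anchors the start of the line and $ anchors the end; with nothing between them the pattern matches an empty line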
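The filter can be checked on its own with a few hand-made lines. A minimal sketch; drop_empty_lines is a hypothetical name used only here.

```shell
#!/bin/bash
# Hypothetical helper: drop lines that contain no characters at all
drop_empty_lines() {
        grep -v '^$'
}

# The two empty lines between the words are removed
printf 'the\n\n\nday\n' | drop_empty_lines
```

Note that a line holding only a space is not empty as far as '^$' is concerned, so this filter would keep it.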

uniq -c

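-c: prefix each line with the number of times it occurs; uniq only merges adjacent duplicate lines, so the input has to be sorted first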
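Why the sort step has to come before uniq -c shows up on a three-line input. This is a sketch; counts is a hypothetical helper that only normalises the leading padding of the uniq -c output.

```shell
#!/bin/bash
# Hypothetical helper: run uniq -c and print "count word" with single spaces
counts() {
        uniq -c | awk '{print $1 " " $2}'
}

# Without sort: the two "is" lines are not adjacent, so each is counted once
unsorted=$(printf 'is\nthe\nis\n' | counts)

# With sort: the duplicates become adjacent and are merged into one line
sorted=$(printf 'is\nthe\nis\n' | sort | counts)

printf '%s\n---\n%s\n' "$unsorted" "$sorted"
```

The unsorted run reports 1 is, 1 the, 1 is, while the sorted run reports 2 is, 1 the.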

sort -k 2 -rn

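-k: choose the column (field) to sort by; -k 2 starts the sort key at the second column

-r: reverse the order, giving a descending sort

-n: compare by numeric value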
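The three options can be tried on a small input where a count has two digits, which is the case where -n matters. A minimal sketch; by_count_desc is a hypothetical name and the counts are made up.

```shell
#!/bin/bash
# Hypothetical helper: sort "word count" lines by count, largest first
by_count_desc() {
        sort -k 2 -rn
}

# 10 has to come before 3, which only a numeric comparison guarantees
printf 'day 1\nthe 10\nis 3\n' | by_count_desc
```

This prints the 10, is 3, day 1. Without -n the counts would be compared as text, and 10 would no longer be treated as the largest value.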
