Assignment 4: Word Frequency Statistics

I. Basic Information

1. Assignment page: https://edu.cnblogs.com/campus/ntu/Embedded_Application/homework/2088

2. Project Git repository: https://gitee.com/ntucs/PairProg/tree/SE018_019

3. Development environment: PyCharm 2018, Python 3.7

4. Pair members: 1613072018 邱子良, 1613072019 李想

II. Project (Method) Analysis

Task 1: Basic task

(1) Reading the file

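def process_file(dst):
    # Return the full text of the file at dst, or None if it cannot be read.
    try:
        f = open(dst, "r")
    except IOError as s:
        print(s)
        return None
    with f:  # closes the file even if read() fails
        try:
            return f.read()
        except (IOError, UnicodeDecodeError):
            print("Read File Error!")
            return None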

(2) Counting lines

def process_line(dst):
    # Count the lines that are not empty; the file is closed automatically.
    count = 0
    with open(dst, 'r') as f:
        for line in f:
            if line != '' and line != '\n':
                count += 1
    return 'lines:', count
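
The check above still counts a line that holds only spaces or tabs. If such whitespace-only lines should also be skipped (an assumption about the intent, not something stated in this post), a minimal sketch could look like the following. The name `count_nonblank` and the sample text are hypothetical.

```python
import io

def count_nonblank(lines):
    # Count lines that contain at least one non-whitespace character.
    return sum(1 for line in lines if line.strip())

# io.StringIO stands in for an opened text file in this illustration.
sample = io.StringIO("first line\n\n   \nsecond line\n")
print(count_nonblank(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly.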

(3) Filtering words with a stop-word list

import re
from string import punctuation

def process_buffer(bvffer):
    if bvffer:
        word_freq = {}
        bvffer = bvffer.lower()
        # Replace each punctuation character with a space, then split on whitespace.
        table = str.maketrans(punctuation, ' ' * len(punctuation))
        words = bvffer.translate(table).split()
        # Keep only words that start with at least four letters.
        regex_word = r"^[a-z]{4}(\w)*"
        with open("stopwords.txt", 'r') as f:
            stopWords = [line.replace('\n', '') for line in f]

        for word in words:
            if word not in stopWords and re.match(regex_word, word):
                word_freq[word] = word_freq.get(word, 0) + 1
        # Returns the frequency table together with the total word count.
        return word_freq, len(words)
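
One detail of the punctuation step is worth isolating. `str.replace` treats its first argument as one contiguous substring, so passing the whole `string.punctuation` constant leaves ordinary text unchanged. Replacing each punctuation character individually needs something like `str.translate`. The sketch below shows the difference; the helper name `tokenize` and the sample sentence are made up for illustration.

```python
from string import punctuation

# Map every punctuation character to a space (one table, built once).
_PUNCT_TO_SPACE = str.maketrans(punctuation, " " * len(punctuation))

def tokenize(text):
    # Hypothetical helper: lower-case, blank out punctuation, split on whitespace.
    return text.lower().translate(_PUNCT_TO_SPACE).split()

sample = "Hello, world! Hello again."
print(tokenize(sample))

# replace() finds no occurrence of the full punctuation string, so nothing changes.
print(sample.replace(punctuation, " ") == sample)
```

Calling `split()` with no argument also splits on newlines and collapses runs of spaces, so no empty strings end up in the word list.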

(4) Output

def output_result(word_freq):
    if word_freq:
        sorted_word_freq = sorted(word_freq.items(), key=lambda v: v[1], reverse=True)
        # Open result.txt once so that all ten lines are kept, not only the last one.
        with open("result.txt", 'w') as f:
            for item in sorted_word_freq[:10]:
                print("<%s>:%d " % (item[0], item[1]))
                print("<%s>:%d " % (item[0], item[1]), file=f)
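
As an alternative sketch of the same top-ten selection, the standard library's `collections.Counter` can do the ordering through `most_common`. This is not the approach used in the project. The function name `format_top` and the sample counts are hypothetical.

```python
from collections import Counter

def format_top(word_freq, limit=10):
    """Return '<word>:count' strings for the `limit` most frequent words."""
    top = Counter(word_freq).most_common(limit)
    return ["<%s>:%d" % (word, count) for word, count in top]

counts = {"that": 7, "with": 3, "this": 5}
for line in format_top(counts, limit=2):
    print(line)
```

Building the lines first and printing or writing them afterwards keeps the formatting separate from the file handling.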

(5) Main function

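import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('dst')  # path of the text file to analyse
    args = parser.parse_args()
    dst = args.dst
    bvffer = process_file(dst)
    if not bvffer:
        return
    # process_buffer returns (word_freq, total word count); only the table is printed.
    word_freq, _ = process_buffer(bvffer)
    output_result(word_freq)

if __name__ == "__main__":
    main()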

III. Program Run Screenshots

1. Performance analysis
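
The post does not say which tool produced the profile. One way such a measurement could be gathered is with the standard library's `cProfile` and `pstats`, sketched below. The functions `count_words` and `profile_report` are hypothetical stand-ins; in the real program the profiled call would be the word-counting code itself.

```python
import cProfile
import io
import pstats

def count_words(text):
    # Stand-in workload for this illustration only.
    freq = {}
    for word in text.split():
        freq[word] = freq.get(word, 0) + 1
    return freq

def profile_report(func, *args, top=5):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()

result, report = profile_report(count_words, "that this that " * 1000)
print(report)
```

Sorting by cumulative time puts the functions that dominate the run, including the time spent in their callees, at the top of the report.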

2. Run results

3. Time complexity: O(N + n)
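
The post does not define N and n. One possible reading, stated here purely as an assumption, is that N is the total number of words and n the number of distinct words. Under that reading, counting takes O(N). A full `sorted` call over the table costs O(n log n), whereas picking only the ten most frequent entries with a bounded heap stays close to O(n). A minimal sketch of the heap variant follows; `top_k` is a hypothetical name.

```python
import heapq

def top_k(word_freq, k=10):
    # heapq.nlargest keeps at most k candidates, so the cost is O(n log k)
    # over n distinct words, which is linear in n for a fixed k.
    return heapq.nlargest(k, word_freq.items(), key=lambda item: item[1])

freq = {}
for word in "that this that with that this".split():  # O(N) counting pass
    freq[word] = freq.get(word, 0) + 1

print(top_k(freq, k=2))
```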

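IV. Other

(1) Time spent on pair programming: a little over a week. Neither of us had a strong programming background or had used Python before, so most of the time went to looking up material on Python program structure and terminal commands.

(2) Pair programming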

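V. Retrospective and Summary

(1) Because both of us had a weak programming background, we spent a lot of time looking up how Python programs are structured, commented, and run.

(2) 邱子良 on 李想: 李想 is eager to learn and actively searched for reference material and tutorial videos to help me finish the assignment.

李想 on 邱子良: 邱子良 is hands-on, willing to try out ideas in practice, and wrote up the blog post.

(3) Working together made both of us aware of each other's weaknesses and strengths. We expect our next collaboration to be more efficient, and both of us still need to put in more effort.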

Posted on 2018-10-21 19:57 by 邱子良
