Week 1 Assignment, Continued

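First, thank you to Teacher Zou for the attention and the reply. Based on the replies from Teacher Zou and Teacher Yang, I have updated the program from Saturday.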

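I understand Teacher Zou's questions as covering two points:

1. How should the program be changed to support stop words?

2. Which part of the program takes the most time, and how can that be explained?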

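I understand Teacher Yang's questions as covering the following points:

Compare the performance of hash and array.
List the features and requirements, and think about the core interfaces.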

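Following the teachers' suggestions, I have made these changes to the program so far:

Moved the file reading and writing into a separate class, MyFile.java.
Changed how the source file is read, replacing String with StringBuffer. When I timed version 1 of the program, reading the file took a long time. Looking at the code, the cause was that the file contents were accumulated in a String, which hurt performance.
Restructured the program so that each piece of functionality is handled by its own function.
In response to Teacher Zou's question, added a new function that loads stoplist.txt, and added a stop-word check to the word frequency function.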
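
The String problem described above comes from String being immutable: every += in a read loop copies everything accumulated so far, so the total work grows roughly quadratically with the file size. The sketch below illustrates the difference on synthetic data. ConcatDemo and its method names are hypothetical and not part of the program, and it uses StringBuilder, the unsynchronized counterpart of StringBuffer, which is usually sufficient in single-threaded code.

```java
import java.util.Arrays;

public class ConcatDemo {
    // Repeated String concatenation: each += copies the whole
    // accumulated text into a new String.
    static String joinWithString(String[] lines) {
        String result = "";
        for (String line : lines) {
            result += line + " ";
        }
        return result;
    }

    // StringBuilder appends into a growable buffer instead of copying
    // the full text on every step.
    static String joinWithBuilder(String[] lines) {
        StringBuilder result = new StringBuilder();
        for (String line : lines) {
            result.append(line).append(' ');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Synthetic stand-in for the lines of a source file.
        String[] lines = new String[5000];
        Arrays.fill(lines, "all happy families are alike");

        long t0 = System.nanoTime();
        String a = joinWithString(lines);
        long t1 = System.nanoTime();
        String b = joinWithBuilder(lines);
        long t2 = System.nanoTime();

        System.out.println("same output: " + a.equals(b));
        System.out.println("String +=\t" + (t1 - t0) / 1_000_000 + "ms");
        System.out.println("StringBuilder\t" + (t2 - t1) / 1_000_000 + "ms");
    }
}
```

The absolute numbers depend on the machine and JVM; only the gap between the two variants is of interest.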

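Note: I did not know the term stop word before, so I looked it up online. In my own words: an English book contains many words such as am, is and are, and counting how often they occur tells us very little. Such words can therefore be put in a stop list, and during counting they are either counted only once or ignored altogether. I guess this is the reason Teacher Zou raised the question. Is that right?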
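
As a small illustration of the "ignore altogether" option, the sketch below keeps the stop list in a HashSet and skips any word found in it. StopWordDemo, countWithoutStopWords and the sample sentence are made up for this example; the real stop list comes from stoplist.txt.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StopWordDemo {
    // Counts every word that is not in the stop list.
    static Map<String, Integer> countWithoutStopWords(String[] words, Set<String> stopList) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : words) {
            if (word.isEmpty() || stopList.contains(word)) {
                continue; // skip empty tokens and stop words
            }
            freq.merge(word, 1, Integer::sum); // insert 1 or add 1 to the old count
        }
        return freq;
    }

    public static void main(String[] args) {
        Set<String> stopList = new HashSet<>(Arrays.asList("am", "is", "are", "the"));
        String[] words = "the sky is blue and the sea is blue".split("\\s+");
        // Prints counts for sky, blue, and, sea (HashMap order is unspecified).
        System.out.println(countWithoutStopWords(words, stopList));
    }
}
```
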
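After these changes, the main function simply calls the functions in the order of the algorithm's flowchart:
1. Load the stop list
2. Read the contents of the source file
3. Preprocess the source contents
4. Count word frequencies
5. Write the results to the result file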

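Thoughts:

On Teacher Zou's question 2: running the program shows that the preprocessing function takes the most time. On my computer, the running times of the functions are as follows: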

getStopList	2ms
getSourceContent	28ms
pretreatmentContent	447ms
getFreq	53ms
writeFile	90ms
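
The post does not show how these times were measured. One common way, sketched here as an assumption, is to wrap each stage in a small helper based on System.nanoTime. StepTimer and timeMillis are hypothetical names, and the workload in main is a synthetic stand-in for the real stages.

```java
public class StepTimer {
    // Runs one step and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable step) {
        long start = System.nanoTime();
        step.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Synthetic text; in the real program each call would wrap one stage,
        // for example () -> words = pretreatmentContent(sourceContent).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append("Word, ");
        }
        String text = sb.toString();

        long ms = timeMillis(() -> text.toLowerCase().replaceAll("[^a-z]", " ").split("\\s+"));
        System.out.println("pretreatment stand-in\t" + ms + "ms");
    }
}
```

A single run like this also includes class loading and JIT warm-up, so repeating the measurement a few times gives a steadier picture.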

As for Teacher Yang's suggestion to compare the performance of hash and array, I will come back to it when I have time.
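
As a possible starting point for that comparison, the sketch below counts the same words twice: once with a HashMap and once with a plain list that is scanned linearly on every lookup. This is only one reading of "hash versus array", and HashVsArrayDemo, its method names and the synthetic input are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashVsArrayDemo {
    // Hash-based counting: each lookup takes expected constant time.
    static Map<String, Integer> countWithHash(String[] words) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum);
        }
        return freq;
    }

    // Array-based counting: each lookup scans the list of known words,
    // so its cost grows with the number of distinct words.
    static Map<String, Integer> countWithArray(String[] words) {
        List<String> keys = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        for (String w : words) {
            int idx = keys.indexOf(w); // linear scan
            if (idx < 0) {
                keys.add(w);
                counts.add(1);
            } else {
                counts.set(idx, counts.get(idx) + 1);
            }
        }
        // Copy into a map only so both variants return the same type.
        Map<String, Integer> freq = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            freq.put(keys.get(i), counts.get(i));
        }
        return freq;
    }

    public static void main(String[] args) {
        // Synthetic input: 100,000 words drawn from 2,000 distinct values.
        String[] words = new String[100_000];
        for (int i = 0; i < words.length; i++) {
            words[i] = "w" + (i % 2000);
        }

        long t0 = System.nanoTime();
        Map<String, Integer> byHash = countWithHash(words);
        long t1 = System.nanoTime();
        Map<String, Integer> byArray = countWithArray(words);
        long t2 = System.nanoTime();

        System.out.println("same counts: " + byHash.equals(byArray));
        System.out.println("hash\t" + (t1 - t0) / 1_000_000 + "ms");
        System.out.println("array\t" + (t2 - t1) / 1_000_000 + "ms");
    }
}
```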

Finally, the program is attached:

MyFile.java

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import java.util.Map;

// File helpers: reading a text file and writing the sorted result.
public class MyFile {

    // Reads the whole file into one string, replacing each line break with a space.
    public static String readFile(String path) {
        StringBuffer result = new StringBuffer();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader(new File(path)))) {
            String tmp;
            while ((tmp = br.readLine()) != null) {
                result.append(tmp);
                result.append(" ");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return result.toString();
    }

    // Writes one "word:count" line per entry to result.txt in the working directory.
    public static void writeFile(List<Map.Entry<String, Integer>> lst) {
        File out = new File(System.getProperty("user.dir"), "result.txt");
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(out))) {
            for (Map.Entry<String, Integer> entry : lst) {
                bw.append(entry.getKey() + ":" + entry.getValue() + "\n");
            }
        } catch (IOException e) {
            // report the failure instead of ignoring it
            e.printStackTrace();
        }
    }
}

WordFreqStatistics.java

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

public class WordFreqStatistics {

    public static String sourceFilePath = System.getProperty("user.dir") + "//anna.txt";
    public static String stopWordFilePath = System.getProperty("user.dir") + "//stoplist.txt";

    public static Map<String, Integer> mp = new HashMap<String, Integer>();
    public static Set<String> stop = new HashSet<String>();
    public static String words[] = null;
    public static String sourceContent = null;

    // Loads the whitespace-separated stop words into the stop set.
    public static void getStopList() {
        String stopContent = MyFile.readFile(stopWordFilePath);
        String stopWords[] = stopContent.split("\\s+");
        for (String word : stopWords) {
            stop.add(word);
        }
    }

    public static String getSourceContent(String filepath) {
        return MyFile.readFile(filepath);
    }

    // Lowercases the text, turns every non-letter into a space and splits into words.
    public static String[] pretreatmentContent(String content) {
        content = content.toLowerCase();
        content = content.replaceAll("[^A-Za-z]", " ");
        content = content.replaceAll("\\s+", " ");
        return content.split("\\s+");
    }

    // Counts every word that is not a stop word.
    public static void getFreq(String[] words) {
        for (int i = 0; i < words.length; i++) {
            // split() yields an empty first token when the text starts with a separator
            if (words[i].isEmpty() || stop.contains(words[i])) {
                continue;
            }
            Integer count = mp.get(words[i]);
            mp.put(words[i], count == null ? 1 : count + 1);
        }
    }

    // Returns the entries sorted by count, highest first.
    public static List<Map.Entry<String, Integer>> sort() {
        List<Entry<String, Integer>> lst = new ArrayList<Entry<String, Integer>>(mp.entrySet());
        Collections.sort(lst, new Comparator<Entry<String, Integer>>() {
            public int compare(Entry<String, Integer> e1, Entry<String, Integer> e2) {
                return e2.getValue().compareTo(e1.getValue());
            }
        });
        return lst;
    }

    public static void main(String[] args) {
        getStopList();
        sourceContent = getSourceContent(sourceFilePath);
        words = pretreatmentContent(sourceContent);
        getFreq(words);
        MyFile.writeFile(sort());
    }
}
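
One plausible reason pretreatmentContent dominates the running time is that it passes over the entire text four times (toLowerCase, two regex replaceAll calls and a regex split), each time producing a new full-size copy. This is an assumption and was not measured in the post. The sketch below shows a single-pass alternative for ASCII text; SinglePassTokenizer and tokenize are hypothetical names, and unlike the listing above it treats every non-ASCII character as a separator.

```java
import java.util.ArrayList;
import java.util.List;

public class SinglePassTokenizer {
    // Splits text into lowercase ASCII-letter words in one pass, treating
    // every other character as a separator. Never yields empty tokens.
    static List<String> tokenize(String content) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c >= 'A' && c <= 'Z') {
                current.append((char) (c + ('a' - 'A'))); // lowercase on the fly
            } else if (c >= 'a' && c <= 'z') {
                current.append(c);
            } else if (current.length() > 0) {
                words.add(current.toString()); // a separator ends the current word
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString()); // last word, if the text ends with a letter
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [anna, karenina, chapter]
        System.out.println(tokenize("\"Anna Karenina\" -- Chapter 1."));
    }
}
```

Whether this is actually faster on anna.txt would need to be timed in the same way as the other functions.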

To be continued

posted @ 2016-03-07 13:35 巴格里斯


巴格里斯

End of the road is still the way if you want to go
