A problem I ran into using Vector and HashMap in Java

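Define a Map variable: Map<String,Vector<Double>> docMatrix = new HashMap<String,Vector<Double>>();

Define a Vector variable: Vector<Double> tempVSM = new Vector<Double>();

After tempVSM is initialized it holds a series of double values. I want to put tempVSM into docMatrix and then call tempVSM.clear() to empty tempVSM. After doing this, however, the element that was added to docMatrix turns out to be empty as well. So my question is: is a Vector passed by reference rather than by value? As far as I can tell, calling clear on a vector in C++ does not cause this problem.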
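A minimal sketch of what is probably happening, assuming the very same tempVSM object is put into the map and then cleared. Java collections store references to objects, so docMatrix and tempVSM end up pointing at one Vector, whereas inserting a std::vector into a C++ std::map copies it. The class name ReferenceDemo and its two helper methods are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Vector;

public class ReferenceDemo {
    // Puts tempVSM itself into the map, then clears it.
    static int sizeAfterClearShared() {
        Map<String, Vector<Double>> docMatrix = new HashMap<String, Vector<Double>>();
        Vector<Double> tempVSM = new Vector<Double>();
        tempVSM.add(0.1);
        tempVSM.add(0.2);
        docMatrix.put("doc1", tempVSM); // the map holds a reference to the same object
        tempVSM.clear();                // empties the one shared Vector
        return docMatrix.get("doc1").size();
    }

    // Puts a copy of tempVSM into the map, then clears the original.
    static int sizeAfterClearCopied() {
        Map<String, Vector<Double>> docMatrix = new HashMap<String, Vector<Double>>();
        Vector<Double> tempVSM = new Vector<Double>();
        tempVSM.add(0.1);
        tempVSM.add(0.2);
        docMatrix.put("doc1", new Vector<Double>(tempVSM)); // independent copy
        tempVSM.clear();                                    // only the original is emptied
        return docMatrix.get("doc1").size();
    }

    public static void main(String[] args) {
        System.out.println(sizeAfterClearShared()); // prints 0
        System.out.println(sizeAfterClearCopied()); // prints 2
    }
}
```

Copying costs one extra allocation per document, the same as creating a fresh Vector in each iteration, so reusing a single buffer does not by itself reduce the memory that docMatrix has to retain.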

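My application is text classification, where I build a vector space model (Vector Space Model, VSM) for a document collection. docMatrix stores the feature vector of each document, so while building the feature vector for a document I first store it in tempVSM and then add it to docMatrix. I create a new tempVSM object in every loop iteration, and after processing a little over 3,000 documents I get a stack overflow or out-of-memory error. If I do not create a new tempVSM object in each iteration, how should I implement this?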
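One possible direction, sketched under assumptions: since docMatrix has to keep one vector per document anyway, the per-document allocation cannot be avoided, but each vector can be stored more compactly as a primitive double[] instead of a Vector of boxed Double objects. The class DenseVectorDemo, the build method, and the placeholder weights below are hypothetical and not part of the original program.

```java
import java.util.HashMap;
import java.util.Map;

public class DenseVectorDemo {
    // Hypothetical stand-in for the document loop: one double[] per document.
    static Map<String, double[]> build(int documentCount, int keywordCount) {
        Map<String, double[]> docMatrix = new HashMap<String, double[]>();
        for (int doc = 0; doc < documentCount; doc++) {
            // One allocation per document; no per-keyword Map or boxed Double.
            double[] vector = new double[keywordCount];
            for (int i = 0; i < keywordCount; i++) {
                vector[i] = doc + i * 0.1; // placeholder for the real tf-idf weight
            }
            // The map keeps this array, so a new one is created for the next document.
            docMatrix.put("doc" + doc, vector);
        }
        return docMatrix;
    }

    public static void main(String[] args) {
        Map<String, double[]> docMatrix = build(3, 4);
        System.out.println(docMatrix.size() + " documents, "
                + docMatrix.get("doc0").length + " weights each");
    }
}
```

If the retained vectors still do not fit, raising the JVM heap limit with the -Xmx option is another thing to try; whether that is enough depends on the number of feature words, which the post does not state.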

The code is below. The lines that may be causing the problem are flagged in the comments:

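/**
 * Builds the VSM model for the training set and stores it in docMatrix; the dictionary model is read from dictionary.
 * @param filename name of the file that stores the feature words
 * @param root root directory of the corpus
 */
public void construnctTrainingVSM(String filename,String root){
    long start,end;//timer for measuring the program's running time, in milliseconds
    start = System.currentTimeMillis();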
    Vector<String> keywords = getFinalKeyWords(filename);//read the feature words
    if(keywords.isEmpty()){
        System.out.println("construnctTrainingVSM: unable to load the feature word list! error!");
        return;
    }
    Vector<Map<Integer,Integer>> maxTFandDF = getFinalKeysMaxTFDF(filename);//get the max TF and the DF
    if(maxTFandDF.isEmpty()){
        System.out.println("construnctTrainingVSM: unable to load the max term frequency and document frequency table! error!");
        return;
    }
    Vector<File> filelists= new Vector<File>();//holds every file in the corpus
    recursion(root,filelists);

    Vector<Map<String,Integer>> articleTF = new Vector<Map<String,Integer>>();
    Map<Integer,Integer> TFandDF = new HashMap<Integer,Integer>();

    for(File file:filelists){
        long startfile,endfile;
        startfile = System.currentTimeMillis();
        //start building the feature vector of one document
        String parent = file.getParent();//parent directory of the file
        String fileName = file.getName()+"_"+parent.substring(parent.lastIndexOf(File.separator)+1);//file names in the inverted index look like 756.txt_C000024
        Vector<Map<Integer,Double>> tempVSM = new Vector<Map<Integer,Double>>();//Vector<Map<feature word index, feature word weight>>; these two lines were moved outside the loop to reduce memory use
        Vector<Double> tempVSM2 = new Vector<Double>();
        for(int i=0;i<keywords.size();i++){
            //compute the weight of one feature word in this document
            String keyword = keywords.get(i);

            double TF = 0;
            double maxTF = 0;
            double DF = 0;
            double IDF = 0;
            double tfidf = 0;
            //get the term frequency of this feature word in this document
            articleTF = dictionary.get(keyword);
            for(Map<String,Integer> tf:articleTF){
                if(tf.containsKey(fileName)){
                    TF = tf.get(fileName);
                    break;
                }
            }

            //get the max term frequency and the document frequency (DF) of this feature word
            TFandDF = maxTFandDF.get(i);
            for(Map.Entry<Integer,Integer> e:TFandDF.entrySet()){//take the key and the value out of the Map
                maxTF = e.getKey();
                DF = e.getValue();
            }


            TF = 0.5+TF/maxTF;
            IDF = Math.log((double)corpus_N/DF);
            //weight of this feature word in this document: tf*idf
            tfidf = TF*IDF;
            Map<Integer,Double> tempKeyWeight = new HashMap<Integer,Double>();//I rather suspect that creating a new object like this on every iteration is what causes the OutOfMemory error

            tempKeyWeight.put(i, tfidf);
            tempVSM.add(tempKeyWeight);//add this feature word's id and its weight in this document to the vector; loop over all feature words until every weight is computed and added
        }
        //the feature vector of one document is complete; now add it to the VSM model
        if(!tempVSM.isEmpty()){
            //normalize the document vector
            tempVSM = normalizationVSM(tempVSM);
            //copy the document vector into tempVSM2
            for(Map<Integer,Double> weight:tempVSM){
                Map.Entry<Integer,Double> e = weight.entrySet().iterator().next();
                tempVSM2.add(e.getValue());
            }
            //add this document's feature vector to docMatrix
            docMatrix.put(fileName, tempVSM2);
        }
//       tempVSM.clear();//if these two statements are called here, every value in the printed docMatrix is empty
//       tempVSM2.clear();
        endfile = System.currentTimeMillis();
        System.out.println(fileName+": feature vector built! Time taken: "+(endfile-startfile)+"ms");
    }
    end = System.currentTimeMillis();
    System.out.println("Building the VSM model for the training corpus took "+(end-start)+"ms");
}
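The weighting step inside the inner loop can be isolated as a small function, which makes it easier to check by hand. This is only a sketch: the class TfIdfDemo and the method weight are hypothetical names, and the guard against a zero maxTf or df is an addition that the code above does not have.

```java
public class TfIdfDemo {
    // Same formula as the loop above: (0.5 + tf / maxTf) * ln(corpusN / df).
    static double weight(double tf, double maxTf, double df, double corpusN) {
        if (maxTf <= 0 || df <= 0) {
            return 0.0; // added guard (assumption): avoids division by zero
        }
        double tfPart = 0.5 + tf / maxTf;    // term-frequency component
        double idf = Math.log(corpusN / df); // natural-log inverse document frequency
        return tfPart * idf;
    }

    public static void main(String[] args) {
        // tf = 2, maxTf = 4, df = 10, corpus of 100 documents: (0.5 + 0.5) * ln(10)
        System.out.println(weight(2, 4, 10, 100));
        // A word absent from the document (tf = 0) still gets 0.5 * ln(10)
        System.out.println(weight(0, 4, 10, 100));
    }
}
```

Because the term-frequency part is 0.5 even when tf is 0, every feature word receives a non-zero weight whenever df is smaller than corpus_N, so each document vector is dense.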

posted on 2011-11-28 20:39 jackzuo Views(1061) Comments(0)
