The simhash web-page deduplication algorithm for crawlers ----- adapting the project for use ----- the java.lang.IllegalStateException: TokenStream contract violation: reset()/close() problem

simhash project URL: https://github.com/CreekLou/simhash.git

This project cannot be used as-is; it has to be modified, usually in two places:

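1. The version of the Lucene core jar does not match the version used by the other packages and by the IK Analyzer tokenizer (ikanalyzer); check this carefully.

    In this project the core package version was changed from 3.6.1 to 4.7.2.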
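As a rough illustration of keeping the Lucene versions aligned, the fragment below pins every Lucene artifact to one version through a Maven property. The property name `lucene.version` and the second artifact, `lucene-analyzers-common`, are assumptions made for illustration; the artifacts actually declared in the simhash project's pom.xml may differ.

```xml
<properties>
    <!-- Assumed property name: a single place that keeps all Lucene artifacts on the same version -->
    <lucene.version>4.7.2</lucene.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>${lucene.version}</version>
    </dependency>
    <!-- Assumed example of a second Lucene artifact that has to match lucene-core -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>${lucene.version}</version>
    </dependency>
</dependencies>
```

Running `mvn dependency:tree` afterwards shows which Lucene versions actually end up on the classpath, including those pulled in transitively by the tokenizer.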

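------------------------------Once this change is made, the project can be debugged with the test class TEST. Before debugging, the affected methods must declare IOException (throws IOException) ------- there are several such places, not listed one by one here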

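2. Once the exceptions are declared, running the program reports java.lang.IllegalStateException: TokenStream contract violation: reset()/close(). The cause is that the object returned by analyzer.tokenStream must be reset after its attributes have been added: reset() has to be called before the first incrementToken() call.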

Follow the order shown in the screenshots to locate the spot:

-----------------------Find point G

 After that, the program runs normally.

Testing in a Spring Boot project:

1. Set up the Spring Boot environment

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.0.2.RELEASE</version>
    </parent>
    <groupId>com.mwq</groupId>
    <artifactId>com-mwq-crawler-webmagic</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <java.version>1.8</java.version>
    </properties>
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
    </dependency>
    <!-- https://mvnrepository.com/artifact/us.codecraft/webmagic-core -->
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <!-- https://mvnrepository.com/artifact/us.codecraft/webmagic-extension -->
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <!-- Dependency for the Bloom filter -->
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>16.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.7</version>
    </dependency>
    <dependency>
        <groupId>com.lou</groupId>
        <artifactId>simhasher</artifactId>
        <version>0.0.1-SNAPSHOT</version>
    </dependency>
</dependencies>
</project>

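simhash is a separate project, so it can be imported like this:

 Then click Next, Next until the wizard finishes. After the module is imported, set its SDK. Once the packages have been fixed as described above, install it into the local Maven repository:

 Then declare the dependency in pom.xml.

Then test: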

package com.mwq.job.task;

import com.lou.simhasher.SimHasher;
import org.apache.commons.io.IOUtils;
import org.springframework.stereotype.Component;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

@Component
public class Test {
    // @Scheduled(cron = "0/5 * * * * *")
    public void testDistance() throws IOException {
        // Signature of the first document
        String str1 = readAllFile("D:/test/testin2.txt");
        SimHasher hash1 = new SimHasher(str1);
        System.out.println(hash1.getSignature());
        System.out.println("============================");

        // Signature of the second document
        String str2 = readAllFile("D:/test/testin.txt");
        SimHasher hash2 = new SimHasher(str2);
        System.out.println(hash2.getSignature());
        System.out.println("============================");

        // Hamming distance between the two signatures
        System.out.println(hash1.getHammingDistance(hash2.getSignature()));
    }

    /**
     * Test helper: reads a whole file into a string.
     * @param filename path of the file to read
     * @return the file contents, decoded with the platform default charset
     * @throws IOException if the file cannot be opened or read
     */
    public static String readAllFile(String filename) throws IOException {
        // try-with-resources closes the stream even when reading fails
        try (InputStream inputStream = new FileInputStream(filename)) {
            return IOUtils.toString(inputStream, Charset.defaultCharset());
        }
    }
}
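
The last line of the test prints the Hamming distance between the two signatures. As a minimal sketch of what that number means, the class below counts the differing bits of two fingerprints. It assumes 64-bit signatures held in a `long`; the class name, the method names, and the threshold parameter are hypothetical and are not part of the simhash project, whose own signature type may differ.

```java
public class HammingDistanceSketch {

    // Number of bit positions in which two 64-bit fingerprints differ.
    public static int hammingDistance(long a, long b) {
        // XOR leaves a 1 exactly where the bits differ; bitCount counts those 1s.
        return Long.bitCount(a ^ b);
    }

    // Hypothetical helper: treats two pages as near-duplicates when their
    // fingerprints differ in at most maxDistance bits. The threshold is an
    // assumption to be tuned per corpus, not a value taken from this post.
    public static boolean isNearDuplicate(long a, long b, int maxDistance) {
        return hammingDistance(a, b) <= maxDistance;
    }

    public static void main(String[] args) {
        long sigA = 0b1011_0110L;
        long sigB = 0b1001_0111L;
        // The two values differ in two bit positions.
        System.out.println(hammingDistance(sigA, sigB)); // prints 2
        System.out.println(isNearDuplicate(sigA, sigB, 3)); // prints true
    }
}
```

A small distance means the two texts share most of their weighted features, so a crawler can skip a page whose fingerprint falls within the chosen threshold of one it has already stored.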
