Part 1: Introduction to Lucene and Its Usage

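Lucene is an Apache Software Foundation project and a full-text search engine library. It provides Java-based indexing and search technology, as well as spell checking, hit highlighting, and advanced analysis/tokenization capabilities.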

This article is based on Lucene 7.2.1; the code in it has been tested against that version.

The development tools are IntelliJ IDEA and JDK 8. The project is a Maven project whose pom.xml is configured as follows.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.ants</groupId>
    <artifactId>dc_search</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <lucene.version>7.2.1</lucene.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-smartcn</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- IK Analyzer, a Chinese analyzer; installing it locally is covered later
        <dependency>
            <groupId>com.chulung</groupId>
            <artifactId>IK-Analyzer</artifactId>
            <version>1.0-SNAPSHOT</version>
        </dependency>-->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
    </dependencies>
</project>

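1. The Lucene full-text search workflow

In the workflow diagram from the original post, green marks the index-building process and red marks the index-query process.

Index store: holds two parts, the documents (the constructed Document objects, not the original source documents) and the index.

Index-building process: acquire documents --> build Document objects --> analyze documents --> create the index

Index-query process: user query interface --> create query --> execute query --> render results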

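1.1 Index creation, step by step

① Original documents: the content to be indexed and searched, including files on disk, web content, and so on.

Lucene does not include a crawler; web data can be collected with tools such as Nutch, jsoup, or Heritrix.

② Build Document objects: before indexing, each original document must be turned into a Document. A Document contains a number of fields. A field name is like a column name in a database and is a fixed value; a field value is all or part of the content of the original document.

One document can have multiple fields, different documents can have different fields, and a single document can contain several fields with the same name.

Each document has a unique number (doc ID) that Lucene assigns automatically, incrementing by 1 for each document added; the application cannot modify it.

③ Analyze documents: extract words from the original content, lowercase them, strip punctuation, and remove stop words (words that are not indexed) to produce terms.

For example, "Hello world,this is a cat" is split into three distinct terms: "hello", "world", and "cat".

Terms produced from different fields are different terms; for example, the same word extracted from the file name and from the file content yields two terms.

Terms are stored in the index store as the index.

Same term: two terms are the same only if they come from the same field and have the same text.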
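The analysis step described above can be sketched in plain Java. This is only an illustration of the idea (split, lowercase, drop punctuation and stop words), not how Lucene's analyzers are implemented; the class name `AnalysisSketch` and its three-word stop list are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalysisSketch {
    // A tiny stop-word list chosen for this example only (an assumption, not Lucene's list).
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("this", "is", "a"));

    // Turns raw text into a list of terms: split, lowercase, drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Splitting on anything that is not a letter or digit also removes punctuation.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [hello, world, cat]
        System.out.println(analyze("Hello world,this is a cat"));
    }
}
```

A real analyzer additionally records positions and offsets for each term, which is what the TokenStream attributes in the Lucene code expose.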

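// Imports required at the top of the test class
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.junit.Test;

// Analyze a document: "test" is an arbitrary field name; the second argument is the original text to split into terms
@Test
    public void testTokenStream() throws Exception{
        // Create a standard analyzer
        Analyzer analyzer = new StandardAnalyzer();
        // Obtain a TokenStream (field name: arbitrary, text to analyze)
        TokenStream tokenStream = analyzer.tokenStream("test","Hello world,this is a cat cat cat cat");
        // Attribute that exposes the text of each term
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // Attribute that records the start and end offsets of each term
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // Move to the start of the stream; must be called before incrementToken
        tokenStream.reset();
        // Iterate over the terms; incrementToken returns false when the stream is exhausted
        while (tokenStream.incrementToken()){
            // Start offset of the term
            System.out.println("start-->"+offsetAttribute.startOffset());
            // The term itself
            System.out.println(charTermAttribute);
            // End offset of the term
            System.out.println("end-->"+offsetAttribute.endOffset());
        }
        // Signal end of stream, then close it
        tokenStream.end();
        tokenStream.close();
    }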

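④ Create the index: the purpose of the index is to find Documents (the constructed ones, not the original source documents) by term.

Inverted index: the extracted terms serve as the index that points to documents, that is, documents are found by word, which is fast.

Forward index: the search keyword is matched against the content of each file, that is, words are found by document, which is slow.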
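To make the difference concrete, here is a minimal inverted index in plain Java: a map from each term to the set of document IDs that contain it. It is a conceptual sketch only; Lucene's on-disk index is far more elaborate, and the class and method names (`InvertedIndexSketch`, `add`, `search`, `demo`) are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> IDs of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Splits the text into lowercase terms and records docId under each term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, key -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A lookup is one map access, no matter how many documents were indexed.
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }

    // Builds a small sample index used by main.
    static InvertedIndexSketch demo() {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Hello world");
        index.add(1, "hello cat");
        index.add(2, "cat cat cat");
        return index;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = demo();
        System.out.println(index.search("hello")); // prints [0, 1]
        System.out.println(index.search("cat"));   // prints [1, 2]
    }
}
```

A forward scan would instead have to read every document's text for each query, which is why it slows down as the collection grows.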

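1.2 Index creation in detail

1.2.1 Create an IndexWriter object;

1.2.2.1 Specify the Directory object for the index store;

1.2.2.2 Specify an analyzer; this example uses the standard analyzer;

1.2.3 Create a Document object;

1.2.4 Create Field objects and add them to the Document object;

1.2.5 Use the IndexWriter object to write the Document object to the index store; this step builds the index and stores both the index and the Document in the index store;

1.2.6 Close the IndexWriter;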

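// Imports required at the top of the test class
import java.io.File;
import java.nio.file.Paths;
import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

    // Location of the index store
    private String directoryPath = "F:\\search\\index";
    // Location of the source files
    private String srcFilePath ="E:\\workspace";

    // Create the index
    @Test
    public void testName() throws Exception {
        // 1 Create the IndexWriter object
        // 2.1 Specify the Directory for the index store (file system, i.e. an on-disk index)
        Directory directory = FSDirectory.open(Paths.get(directoryPath));
        // 2.2 Specify an analyzer for the documents; here the standard analyzer
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        IndexWriter indexWriter = new IndexWriter(directory,indexWriterConfig);
        // 3 Create the Document object
        Document document=null;
        // 4 Create Field objects and add them to the Document
        File file = new File(srcFilePath);
        File[] files = file.listFiles();
        for (File f: files){
            if (!f.isFile()) continue; // skip subdirectories, which cannot be read as text
            document = new Document();
            document.add(new TextField("fileName",f.getName(), Field.Store.YES));
            // Point fields (LongPoint, IntPoint, DoublePoint, etc.) index a number for exact and range queries
            document.add(new LongPoint("fileSize", f.length()));
            // Store the size so it can be returned with search results
            document.add(new StoredField("fileSize", f.length()));
            // Also add doc values so results can be sorted by size
            document.add(new NumericDocValuesField("fileSize",f.length()));
            document.add(new StringField("filePath",f.getPath(), Field.Store.YES));
            document.add(new TextField("fileContent",FileUtils.readFileToString(f, "utf-8"), Field.Store.YES));
            // 5 Write the Document to the index store with the IndexWriter; this builds the index and stores the Document
            indexWriter.addDocument(document);
        }
        // 6 Close the IndexWriter
        indexWriter.close();
    }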
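The indexing code adds the file size three times under the same field name, once each for querying, retrieval, and sorting. The sketch below shows what each of those fields is used for at search time. It assumes Lucene 7.2.1 on the classpath and uses an in-memory `RAMDirectory` so it needs no files on disk; the class name `FileSizeQuerySketch`, the method names, and the sample file names and sizes are invented for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class FileSizeQuerySketch {
    // Builds a document with the same three "fileSize" fields as the indexing code above.
    private static Document fileDocument(String name, long size) {
        Document document = new Document();
        document.add(new StringField("fileName", name, Field.Store.YES));
        document.add(new LongPoint("fileSize", size));             // enables range and exact queries
        document.add(new StoredField("fileSize", size));           // value returned with hits
        document.add(new NumericDocValuesField("fileSize", size)); // enables sorting
        return document;
    }

    // Returns the names of files whose size lies in [min, max], smallest first.
    static List<String> namesInSizeRange(long min, long max) {
        try (Directory directory = new RAMDirectory()) {
            try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(new StandardAnalyzer()))) {
                // Hypothetical sample data
                writer.addDocument(fileDocument("big.txt", 5000L));
                writer.addDocument(fileDocument("small.txt", 100L));
                writer.addDocument(fileDocument("medium.txt", 2000L));
            }
            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = LongPoint.newRangeQuery("fileSize", min, max);            // uses the LongPoint field
                Sort bySize = new Sort(new SortField("fileSize", SortField.Type.LONG)); // uses the doc values field
                List<String> names = new ArrayList<>();
                for (ScoreDoc hit : searcher.search(query, 10, bySize).scoreDocs) {
                    names.add(searcher.doc(hit.doc).get("fileName"));
                }
                return names;
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(namesInSizeRange(0L, 3000L)); // prints [small.txt, medium.txt]
    }
}
```

Leaving out the `LongPoint` would make the range query match nothing, and leaving out the `NumericDocValuesField` would make the sort fail, which is why the indexing code adds all three.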

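1.3 Querying in detail

1.3.1 Create a Directory that points to the location of the index store;

1.3.2 Create an IndexReader object from the Directory object;

1.3.3 Create an IndexSearcher object from the IndexReader object;

1.3.4 Create a Query object, specifying the field to search and the keyword;

1.3.5 Execute the query;

1.3.6 Iterate over the returned results and print them;

1.3.7 Close the IndexReader;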

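// Imports required at the top of the test class
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

@Test
    public void testSearch() throws IOException {
        // 1 Create a Directory pointing to the location of the index store (file system)
        Directory directory = FSDirectory.open(Paths.get(directoryPath));
        // 2 Create the IndexReader from the Directory
        IndexReader indexReader = DirectoryReader.open(directory);
        // 3 Create the IndexSearcher from the IndexReader
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // 4 Create a TermQuery, specifying the field to search and the keyword
        Query query =new TermQuery(new Term("fileContent","class"));
        // 5 Execute the query; the second argument n returns the n highest-scoring hits (here n = 1)
        TopDocs topDocs = indexSearcher.search(query,1);
        // 6 Iterate over the results and print them
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc s:scoreDocs){
            // Internal document ID of the hit
            int doc = s.doc;
            Document doc1 = indexSearcher.doc(doc);
            System.out.println(doc1.get("fileName"));
            System.out.println(doc1.get("fileSize"));
            System.out.println(doc1.get("filePath"));
            System.out.println(doc1.get("fileContent"));
            System.out.println("-------------------------");
        }
        // 7 Close the IndexReader
        indexReader.close();
    }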
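One detail of the search code is worth illustrating. A TermQuery is not analyzed, so its keyword must match the indexed term exactly, which is why the query uses the lowercase "class". The sketch below contrasts that with QueryParser from the lucene-queryparser dependency in the pom, which runs the keyword through an analyzer first. It assumes Lucene 7.2.1 and an in-memory `RAMDirectory`; the class name `QueryAnalysisSketch`, its helper methods, and the sample text are hypothetical.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class QueryAnalysisSketch {
    private static final String FIELD = "fileContent";

    // Indexes one sample document in memory and counts the hits for the query.
    private static int countHits(Query query) throws IOException {
        try (Directory directory = new RAMDirectory()) {
            try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document document = new Document();
                // StandardAnalyzer lowercases, so the indexed terms are "public", "class", "demo".
                document.add(new TextField(FIELD, "public Class Demo", Field.Store.YES));
                writer.addDocument(document);
            }
            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                return new IndexSearcher(reader).count(query);
            }
        }
    }

    // TermQuery: the text is used as-is, with no analysis.
    static int termHits(String text) {
        try {
            return countHits(new TermQuery(new Term(FIELD, text)));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // QueryParser: the text is analyzed with the same analyzer used for indexing.
    static int parsedHits(String text) {
        try {
            return countHits(new QueryParser(FIELD, new StandardAnalyzer()).parse(text));
        } catch (IOException | ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(termHits("class"));   // prints 1
        System.out.println(termHits("Class"));   // prints 0, because the indexed term is lowercase
        System.out.println(parsedHits("Class")); // prints 1, because the parser lowercases the keyword
    }
}
```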

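1.4 Types of (Chinese) analyzers

        Analyzer analyzer = new StandardAnalyzer();// standard analyzer
        Analyzer analyzer = new CJKAnalyzer();// CJK (Chinese, Japanese, Korean) bigram analyzer; splits text into two-character pairs
        Analyzer analyzer = new SmartChineseAnalyzer();// works well for Chinese, but is hard to extend
        Analyzer analyzer = new IKAnalyzer();// IK analyzer, highly extensible

Of these, the IK analyzer is the most practical choice for Chinese, because it supports custom stop words and extension words.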

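1.4.1 First download the IK analyzer source code from GitHub and install it locally with mvn or IDEA.

GitHub repository: https://github.com/silentwolfyh/IK-Analyzer

1.4.2 Under resources, create ext.dic, ext_stopwords.dic, and IKAnalyzer.cfg.xml.

Extension words: new words, or words you want kept whole rather than split, go in ext.dic;

Stop words: words you do not want produced by segmentation, or do not want to appear in the index, go in ext_stopwords.dic;

IKAnalyzer.cfg.xml: the configuration file that points to the extension-word and stop-word files.

The names ext.dic and ext_stopwords.dic are arbitrary; they simply hold the extension words and stop words. Each line of a file is one extension word or stop word.

For example:

ext.dic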

空谈误国
实干兴邦

ext_stopwords.dic

空谈
误国
实干
兴邦

IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionary here -->
    <entry key="ext_dict">ext.dic</entry>

    <!-- Configure your own extension stop-word dictionary here -->
    <entry key="ext_stopwords">ext_stopwords.dic</entry>
</properties>

1.4.3 Usage: the analyzer used when building the index must be the same one used when querying.

Analyzer analyzer = new IKAnalyzer();

TokenStream tokenStream = analyzer.tokenStream("test","空谈误国，实干兴邦");

1.4.4 The output is as expected.

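Summary:

This article gave a brief overview of Lucene and a detailed look at index creation, querying, and analyzer usage. It uses Lucene 7.x, the latest version at the time of writing, in which the field types used for indexing changed considerably from earlier versions, for example in how numbers are stored.

The next article will cover adding, deleting, updating, and querying Lucene indexes.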

posted @ 2019-01-09 17:56 i孤独行者