Integrating the jieba (结巴) Chinese word segmenter into Solr
Why choose jieba
- Segmentation is fast and efficient
- The corpus was built with jieba (the Python version), so using the Java port keeps tokenization consistent between the corpus and the index
The Java port of jieba
Download

```
git clone https://github.com/huaban/jieba-analysis
```

Build

```
cd jieba-analysis
mvn install
```
- Note: if your Maven version is fairly new, you need to edit pom.xml and add an extra block before the plugins section
Solr tokenizer wrappers (download)
- https://github.com/sing1ee/analyzer-solr (Solr 5)
- https://github.com/sing1ee/jieba-solr.git (Solr 4)
Supporting Solr 6, 7, or newer
If your Solr is a recent version, as mine is, the code needs a few small changes, but the changes are minor: just fix whatever errors the compiler reports during the build.
Diff of build.gradle:

```diff
diff --git a/build.gradle b/build.gradle
index 2a87525..06c5cc3 100644
--- a/build.gradle
+++ b/build.gradle
@@ -1,4 +1,4 @@
-group = 'analyzer.solr5'
+group = 'analyzer.solr7'
 version = '1.0'
 apply plugin: 'java'
 apply plugin: "eclipse"
@@ -14,15 +14,14 @@ repositories {
 dependencies {
     testCompile group: 'junit', name: 'junit', version: '4.11'
-    compile("org.apache.lucene:lucene-core:5.0.0")
-    compile("org.apache.lucene:lucene-queryparser:5.0.0")
-    compile("org.apache.lucene:lucene-analyzers-common:5.0.0")
-    compile('com.huaban:jieba-analysis:1.0.0')
-//    compile("org.fnlp:fnlp-core:2.0-SNAPSHOT")
+    compile("org.apache.lucene:lucene-core:7.1.0")
+    compile("org.apache.lucene:lucene-queryparser:7.1.0")
+    compile("org.apache.lucene:lucene-analyzers-common:7.1.0")
+    compile files('libs/jieba-analysis-1.0.3.jar')
     compile("edu.stanford.nlp:stanford-corenlp:3.5.1")
 }
 task "create-dirs" << {
     sourceSets*.java.srcDirs*.each { it.mkdirs() }
     sourceSets*.resources.srcDirs*.each { it.mkdirs() }
-}
\ No newline at end of file
+}
```
For reference, here is build.gradle configured for Solr 6:

```groovy
group = 'analyzer.solr6'
version = '1.0'

apply plugin: 'java'
apply plugin: "eclipse"
apply plugin: "idea"

sourceCompatibility = 1.5
version = '1.0'

repositories {
    mavenCentral()
}

dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.11'
    compile("org.apache.lucene:lucene-core:6.1.0")
    compile("org.apache.lucene:lucene-queryparser:6.1.0")
    compile("org.apache.lucene:lucene-analyzers-common:6.1.0")
    compile('com.huaban:jieba-analysis:1.0.0')
//    compile("org.fnlp:fnlp-core:2.0-SNAPSHOT")
    compile("edu.stanford.nlp:stanford-corenlp:3.5.1")
}

task "create-dirs" << {
    sourceSets*.java.srcDirs*.each { it.mkdirs() }
    sourceSets*.resources.srcDirs*.each { it.mkdirs() }
}
```
Run gradle build:

```
[root@bogon analyzer-solr-master]# gradle build
Starting a Gradle Daemon (subsequent builds will be faster)

> Task :compileJava
warning: [options] bootstrap class path not set in conjunction with -source 1.5
warning: [options] source value 1.5 is obsolete and will be removed in a future release
warning: [options] target value 1.5 is obsolete and will be removed in a future release
warning: [options] To suppress warnings about obsolete options, use -Xlint:-options.
4 warnings

Deprecated Gradle features were used in this build, making it incompatible with Gradle 5.0.
See https://docs.gradle.org/4.7/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 9s
2 actionable tasks: 1 executed, 1 up-to-date
[root@bogon analyzer-solr-master]#
```
Where the jar is generated
The built jar is placed under build/libs/, the default output directory of Gradle's java plugin.
Integrating into Solr
Copy the built jar into Solr's webapp lib directory, server/solr-webapp/webapp/WEB-INF/lib. Since the Solr 7 build references jieba-analysis as a local file dependency (libs/jieba-analysis-1.0.3.jar), copy that jar as well, then restart Solr so it picks up the new jars. Next, declare a field type in your schema:
```xml
<!-- jieba word segmentation -->
<fieldType name="text_jieba" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="analyzer.solr6.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="analyzer.solr6.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```
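A field type is not used until some field references it. As a sketch (the field name and the indexed/stored flags below are my assumptions, not from the original setup), a field using text_jieba could be declared in the same schema like this:

```xml
<!-- Hypothetical field using the jieba field type; adjust the name and flags to your schema -->
<field name="content" type="text_jieba" indexed="true" stored="true"/>
```

After editing the schema, reload the core (or restart Solr) for the change to take effect.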
The result after jieba segmentation, using this sample sentence:

如果生命没有遗憾,没有波澜