Integrating the jieba (结巴) Chinese word segmenter into Solr
Why choose jieba
- Segmentation is fast and efficient
- The corpus was built with jieba (the Python version), so using the Java port keeps tokenization consistent between the corpus and the index
The Java port of jieba
Download

```
git clone https://github.com/huaban/jieba-analysis
```

Build

```
cd jieba-analysis
mvn install
```
- Note: if your Maven version is fairly new, you need to edit pom.xml and add an extra block before the plugins section
Solr tokenizer wrappers (download)
- https://github.com/sing1ee/analyzer-solr (Solr 5)
- https://github.com/sing1ee/jieba-solr.git (Solr 4)
Supporting Solr 6, 7, or newer
If your Solr is a recent version, as mine is, the code needs a few small changes, but the changes are minor: just fix whatever errors the compiler reports during the build.
Diff of build.gradle:

```diff
diff --git a/build.gradle b/build.gradle
index 2a87525..06c5cc3 100644
--- a/build.gradle
+++ b/build.gradle
@@ -1,4 +1,4 @@
-group = 'analyzer.solr5'
+group = 'analyzer.solr7'
 version = '1.0'
 apply plugin: 'java'
 apply plugin: "eclipse"
@@ -14,15 +14,14 @@ repositories {
 dependencies {
     testCompile group: 'junit', name: 'junit', version: '4.11'
-    compile("org.apache.lucene:lucene-core:5.0.0")
-    compile("org.apache.lucene:lucene-queryparser:5.0.0")
-    compile("org.apache.lucene:lucene-analyzers-common:5.0.0")
-    compile('com.huaban:jieba-analysis:1.0.0')
-//    compile("org.fnlp:fnlp-core:2.0-SNAPSHOT")
+    compile("org.apache.lucene:lucene-core:7.1.0")
+    compile("org.apache.lucene:lucene-queryparser:7.1.0")
+    compile("org.apache.lucene:lucene-analyzers-common:7.1.0")
+    compile files('libs/jieba-analysis-1.0.3.jar')
     compile("edu.stanford.nlp:stanford-corenlp:3.5.1")
 }
 task "create-dirs" << {
     sourceSets*.java.srcDirs*.each { it.mkdirs() }
     sourceSets*.resources.srcDirs*.each { it.mkdirs() }
-}
\ No newline at end of file
+}
```
For reference, here is build.gradle configured for Solr 6:

```groovy
group = 'analyzer.solr6'
version = '1.0'

apply plugin: 'java'
apply plugin: "eclipse"
apply plugin: "idea"

sourceCompatibility = 1.5
version = '1.0'

repositories {
    mavenCentral()
}

dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.11'
    compile("org.apache.lucene:lucene-core:6.1.0")
    compile("org.apache.lucene:lucene-queryparser:6.1.0")
    compile("org.apache.lucene:lucene-analyzers-common:6.1.0")
    compile('com.huaban:jieba-analysis:1.0.0')
//    compile("org.fnlp:fnlp-core:2.0-SNAPSHOT")
    compile("edu.stanford.nlp:stanford-corenlp:3.5.1")
}

task "create-dirs" << {
    sourceSets*.java.srcDirs*.each { it.mkdirs() }
    sourceSets*.resources.srcDirs*.each { it.mkdirs() }
}
```
Run gradle build:

```
[root@bogon analyzer-solr-master]# gradle build
Starting a Gradle Daemon (subsequent builds will be faster)

> Task :compileJava
warning: [options] bootstrap class path not set in conjunction with -source 1.5
warning: [options] source value 1.5 is obsolete and will be removed in a future release
warning: [options] target value 1.5 is obsolete and will be removed in a future release
warning: [options] To suppress warnings about obsolete options, use -Xlint:-options.
4 warnings

Deprecated Gradle features were used in this build, making it incompatible with Gradle 5.0.
See https://docs.gradle.org/4.7/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 9s
2 actionable tasks: 1 executed, 1 up-to-date
[root@bogon analyzer-solr-master]#
```
Where the jar is generated
The built jar is placed under build/libs/, the default output directory of Gradle's java plugin.
Integrating into Solr
Copy the built jar into Solr's webapp lib directory, server/solr-webapp/webapp/WEB-INF/lib. Since the Solr 7 build references jieba-analysis as a local file dependency (libs/jieba-analysis-1.0.3.jar), copy that jar as well, then restart Solr so it picks up the new jars. Next, declare a field type in your schema:
```xml
<!-- jieba word segmentation -->
<fieldType name="text_jieba" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="analyzer.solr6.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="analyzer.solr6.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```
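A field type is not used until some field references it. As a sketch (the field name and the indexed/stored flags below are my assumptions, not from the original setup), a field using text_jieba could be declared in the same schema like this:

```xml
<!-- Hypothetical field using the jieba field type; adjust the name and flags to your schema -->
<field name="content" type="text_jieba" indexed="true" stored="true"/>
```

After editing the schema, reload the core (or restart Solr) for the change to take effect.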
The result after jieba segmentation, using this sample sentence:

如果生命没有遗憾,没有波澜