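Dictionary Hot Update
As a user of the search service, I want the system to provide flexible, UI-based dictionary management for custom hot words, stop words, and synonyms, so that I can add terms that fit my own business scenario and thereby improve search accuracy.
Implementation Plan
- Modify the elasticsearch-analysis-ik plugin to store hot words and stop words in a relational database.
- Modify the elasticsearch-analysis-dynamic-synonym plugin to store synonyms in a relational database.
- Add a term management feature that lets users edit or import hot words, stop words, and synonyms for their business through the UI.
Modifying the elasticsearch-analysis-ik Plugin
Modify the source code of the ES IK plugin so that it periodically pulls dictionary updates from MySQL tables.
Table Structure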
    CREATE TABLE `es_extra_mainword` (
      `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'Unique identifier',
      `main_word` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'Hot word',
      `is_deleted` tinyint(1) NOT NULL DEFAULT '0' COMMENT 'Whether the row is deleted',
      `create_user` varchar(64) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'Created by',
      `create_time` datetime(0) NULL DEFAULT NULL COMMENT 'Creation time',
      `update_user` varchar(64) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'Updated by',
      `update_time` datetime(0) NULL DEFAULT NULL COMMENT 'Update time',
      PRIMARY KEY (`id`)
    ) ENGINE = InnoDB AUTO_INCREMENT = 25 CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = 'Extra main dictionary' ROW_FORMAT = Dynamic;

    CREATE TABLE `es_extra_stopword` (
      `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'Unique identifier',
      `stop_word` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'Stop word',
      `is_deleted` tinyint(1) NOT NULL DEFAULT '0' COMMENT 'Whether the row is deleted',
      `create_user` varchar(64) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'Created by',
      `create_time` datetime(0) NULL DEFAULT NULL COMMENT 'Creation time',
      `update_user` varchar(64) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'Updated by',
      `update_time` datetime(0) NULL DEFAULT NULL COMMENT 'Update time',
      PRIMARY KEY (`id`)
    ) ENGINE = InnoDB AUTO_INCREMENT = 25 CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = 'Extra stop word dictionary' ROW_FORMAT = Dynamic;
Configuration Changes
Add a new configuration file, jdbc.properties. The main dictionary query reads from the es_extra_mainword table defined above.
    jdbc.url=jdbc:mysql://localhost:3306/test?useAffectedRows=true&characterEncoding=UTF-8&autoReconnect=true&zeroDateTimeBehavior=convertToNull&useUnicode=true&serverTimezone=GMT%2B8&allowMultiQueries=true
    jdbc.username=root
    jdbc.password=root
    jdbc.driver=com.mysql.cj.jdbc.Driver
    jdbc.update.main.dic.sql=SELECT * FROM `es_extra_mainword` WHERE update_time > ? order by update_time asc
    jdbc.update.stopword.sql=SELECT * FROM `es_extra_stopword` WHERE update_time > ? order by update_time asc
    jdbc.update.interval=10
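Modify the POM file to add the database driver dependency. The driver must match the database configured in jdbc.properties: the example below declares the PostgreSQL driver, whereas a MySQL URL with com.mysql.cj.jdbc.Driver requires the MySQL driver (mysql:mysql-connector-java) instead.
    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>42.2.18</version>
    </dependency>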
Modify src/main/assemblies/plugin.xml to include the driver dependency; otherwise the packaged zip will not contain the driver jar.
    <dependencySets>
      <dependencySet>
        <outputDirectory/>
        <useProjectArtifact>true</useProjectArtifact>
        <useTransitiveFiltering>true</useTransitiveFiltering>
        <excludes>
          <exclude>org.elasticsearch:elasticsearch</exclude>
        </excludes>
      </dependencySet>
      <dependencySet>
        <outputDirectory/>
        <useProjectArtifact>true</useProjectArtifact>
        <useTransitiveFiltering>true</useTransitiveFiltering>
        <includes>
          <include>org.apache.httpcomponents:httpclient</include>
          <!-- add the driver here -->
          <include>org.postgresql:postgresql</include>
        </includes>
      </dependencySet>
    </dependencySets>
Modify src/main/resources/plugin-security.policy and add permission java.lang.RuntimePermission "setContextClassLoader"; otherwise the plugin throws a permission-related exception at runtime. The policy file after the change:
    grant {
      // needed because of the hot reload functionality
      permission java.net.SocketPermission "*", "connect,resolve";
      permission java.lang.RuntimePermission "setContextClassLoader";
    };
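Code Changes
Modify Dictionary
- Load the jdbc.properties file in the constructor.
- Change getProperty() to public.
- Add several methods for adding and removing dictionary entries.
- In initial(), start the custom database monitoring thread.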
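The configuration-loading part of these changes can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the class name JdbcConfig, the fromString factory, and the fallback default are hypothetical, and the unit of jdbc.update.interval is not stated in this document, so the sketch treats it as an opaque number.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical helper; in the plugin this logic would live inside IK's Dictionary class. */
class JdbcConfig {
    private final Properties props = new Properties();

    /** Loads key=value pairs; the plugin would pass a reader over its jdbc.properties file. */
    JdbcConfig(Reader source) throws IOException {
        props.load(source);
    }

    /** Public accessor, mirroring the getProperty() visibility change. */
    public String getProperty(String key) {
        return props.getProperty(key);
    }

    /** Reads jdbc.update.interval, falling back to a default when it is absent or malformed. */
    public long updateInterval(long defaultValue) {
        String raw = props.getProperty("jdbc.update.interval");
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    /** Convenience factory for examples: parses properties from an in-memory string. */
    static JdbcConfig fromString(String text) {
        try {
            return new JdbcConfig(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Falling back to a default keeps a missing or mistyped interval from preventing the node from starting; whether the real plugin should fail fast instead is a design decision left open here.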
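Add DatabaseMonitor
- lastUpdateTimeOfMainDic and lastUpdateTimeOfStopword record the update_time of the last row processed in the previous run.
- Query the rows added or deleted since the previous run.
- Loop over the rows and check the is_deleted field: if it is true, remove the entry from the dictionary; if it is false, add the entry.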
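The incremental update loop can be sketched as below. This is an illustration under stated assumptions rather than the actual DatabaseMonitor: WordRow and WordSync are hypothetical names, the in-memory Set stands in for IK's dictionary, and the rows are assumed to be the result of the jdbc.update.*.sql query, already ordered by update_time ascending. A row with is_deleted set is treated as a removal.

```java
import java.time.LocalDateTime;
import java.util.List;
import java.util.Set;

/** One database row, reduced to the fields the monitor needs (hypothetical type). */
class WordRow {
    final String word;
    final boolean deleted;          // maps to the is_deleted column
    final LocalDateTime updateTime; // maps to the update_time column

    WordRow(String word, boolean deleted, LocalDateTime updateTime) {
        this.word = word;
        this.deleted = deleted;
        this.updateTime = updateTime;
    }
}

class WordSync {
    /**
     * Applies the rows changed since lastUpdateTime to the in-memory dictionary
     * and returns the new watermark to use for the next poll.
     */
    static LocalDateTime apply(Set<String> dictionary, List<WordRow> rows, LocalDateTime lastUpdateTime) {
        LocalDateTime watermark = lastUpdateTime;
        for (WordRow row : rows) {
            if (row.deleted) {
                dictionary.remove(row.word); // soft-deleted row: drop the entry
            } else {
                dictionary.add(row.word);    // new or restored row: add the entry
            }
            if (row.updateTime.isAfter(watermark)) {
                watermark = row.updateTime;  // remember the newest row seen so far
            }
        }
        return watermark;
    }
}
```

Because the query uses update_time > ?, rows that share the exact watermark timestamp but are committed after a poll would be missed; comparing with >= and tolerating repeated rows is one way to handle that, since adding or removing the same word twice is harmless.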
Packaging and Testing
Run mvn package, then find the elasticsearch-analysis-ik-6.7.2.zip archive in the elasticsearch-analysis-ik/target/releases directory and extract it into the plugins directory of the ES installation.
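Modifying the elasticsearch-analysis-dynamic-synonym Plugin
Add a database table to store synonyms, and modify the plugin source code so that it loads synonyms from the database dynamically.
Table Structure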
    CREATE TABLE `es_extra_synonymword` (
      `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'Unique identifier',
      `synonym_word` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'Synonym',
      `is_deleted` tinyint(1) NOT NULL DEFAULT '0' COMMENT 'Whether the row is deleted',
      `create_user` varchar(64) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'Created by',
      `create_time` datetime(0) NULL DEFAULT NULL COMMENT 'Creation time',
      `update_user` varchar(64) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'Updated by',
      `update_time` datetime(0) NULL DEFAULT NULL COMMENT 'Update time',
      PRIMARY KEY (`id`)
    ) ENGINE = InnoDB AUTO_INCREMENT = 25 CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = 'Extra synonym dictionary' ROW_FORMAT = Dynamic;
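Configuration Changes
Add a new configuration file, jdbc-reload.properties. The synonym query below reads from the es_extra_synonymword table defined above; the version query reads from a separate switch table (gw_swith) whose definition is not included in this document.
    jdbc.url=jdbc:mysql://127.0.0.1:13306/test?serverTimezone=GMT&autoReconnect=true&useUnicode=true&characterEncoding=utf8&zeroDateTimeBehavior=convertToNull&useAffectedRows=true&useSSL=false
    jdbc.user=root
    jdbc.password=123456
    # Query the synonym entries
    jdbc.reload.synonym.sql=select synonym_word as words from es_extra_synonymword where is_deleted = 0;
    # Query the version number of the synonyms stored in the database
    jdbc.reload.swith.synonym.version=SELECT swith_state FROM gw_swith where swith_code = 'synonym_doc'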
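Modify the pom file to add the database driver dependency. As with the IK plugin, the driver must match the database behind jdbc.url: the example below declares the PostgreSQL driver, whereas a MySQL URL requires the MySQL driver.
    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>42.2.18</version>
    </dependency>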
Modify plugin.xml
    <?xml version="1.0"?>
    <assembly>
      <id>-</id>
      <formats>
        <format>zip</format>
      </formats>
      <includeBaseDirectory>false</includeBaseDirectory>
      <fileSets>
        <fileSet>
          <directory>${project.basedir}/config</directory>
          <outputDirectory>config</outputDirectory>
        </fileSet>
      </fileSets>
      <files>
        <file>
          <source>${project.basedir}/src/main/resources/plugin-descriptor.properties</source>
          <outputDirectory>/</outputDirectory>
          <filtered>true</filtered>
        </file>
        <file>
          <source>${project.basedir}/src/main/resources/plugin-security.policy</source>
          <outputDirectory>/</outputDirectory>
          <filtered>true</filtered>
        </file>
      </files>
      ... (omitted) ...
    </assembly>
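Code Changes
Add a new file, MySqlRemoteSynonymFile.
Modify the getSynonymFile(Analyzer analyzer) method of the DynamicSynonymTokenFilterFactory class: add a custom location type, "fromMySql", that triggers loading synonyms from the database.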
    SynonymFile getSynonymFile(Analyzer analyzer) {
        try {
            SynonymFile synonymFile;
            if ("fromMySql".equals(location)) {
                // Custom location type: load synonyms from the database
                synonymFile = new MySqlRemoteSynonymFile(
                        environment, analyzer, expand, lenient, format, location);
            } else if (location.startsWith("http://") || location.startsWith("https://")) {
                synonymFile = new RemoteSynonymFile(
                        environment, analyzer, expand, lenient, format, location);
            } else {
                synonymFile = new LocalSynonymFile(
                        environment, analyzer, expand, lenient, format, location);
            }
            if (scheduledFuture == null) {
                // Schedule the periodic reload check only once
                scheduledFuture = pool.scheduleAtFixedRate(
                        new Monitor(synonymFile), interval, interval, TimeUnit.SECONDS);
            }
            return synonymFile;
        } catch (Exception e) {
            logger.error("failed to get synonyms: " + location, e);
            throw new IllegalArgumentException("failed to get synonyms : " + location, e);
        }
    }
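What MySqlRemoteSynonymFile has to do on each scheduled check can be sketched roughly as follows. This is a self-contained illustration, not the plugin's class: SynonymReloadSketch, needsReload, toRules, and asReader are hypothetical names, the LongSupplier stands in for running the jdbc.reload.swith.synonym.version query, and the row list stands in for the result of jdbc.reload.synonym.sql. It assumes each row already holds one comma-separated synonym rule.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the reload logic of a database-backed synonym file. */
class SynonymReloadSketch {
    private final LongSupplier versionQuery; // would execute the version SQL against the database
    private long lastVersion = -1;           // sentinel so that the first check always reloads

    SynonymReloadSketch(LongSupplier versionQuery) {
        this.versionQuery = versionQuery;
    }

    /** Returns true when the stored version differs from the one seen on the previous check. */
    boolean needsReload() {
        long current = versionQuery.getAsLong();
        if (current != lastVersion) {
            lastVersion = current;
            return true;
        }
        return false;
    }

    /** Joins the rows into one rule per line, skipping null and blank rows. */
    static String toRules(List<String> rows) {
        StringBuilder sb = new StringBuilder();
        for (String row : rows) {
            if (row == null || row.trim().isEmpty()) {
                continue;
            }
            sb.append(row.trim()).append('\n');
        }
        return sb.toString();
    }

    /** Wraps the rules in a Reader, the form a line-based synonym parser consumes. */
    static Reader asReader(List<String> rows) {
        return new StringReader(toRules(rows));
    }
}
```

Checking a single version value on every tick keeps the periodic query cheap; the full synonym table is read only when the version changes. The sketch assumes that whatever edits the synonyms also bumps that version.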
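Packaging and Testing
Build from source by running the Maven clean, compile, and package goals in order, then find the plugin installation package (a .zip file) in the target/releases directory.
Copy it to the \plugins\dynamic-synonym directory under the ES installation directory, extract it, and delete the archive; then copy the JDBC driver jar into the same directory.
Finally, define a custom analyzer that uses the plugin and test with it.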
References
https://blog.csdn.net/qq_20919883/article/details/110502496
https://zhuanlan.zhihu.com/p/381936025
https://gitee.com/ykos/elasticsearch-analysis-ik/commits/master
https://github.com/YRREALLYCUTE/elasticsearch-analysis-dynamic-synonym-mysql