Learning the Lucene full-text search toolkit and the enterprise search application Solr (Part 2): Query Functions used with edismax in Solr, and a Java extension

Scenario

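In short: under the default sort order, the results returned by a position (job) search must satisfy the requirements below. The searched fields are mainly the position title (position_title) and the position requirements (requirement), plus others such as work address and company. For simplicity, the examples below consider only the position title.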

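Give priority to positions that were updated or edited within the past year.
For positions updated within the past month, if the keyword matches the entire title, those positions rank first.
For positions older than one month, any title match is sorted by time in descending order.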

Analysis

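First, results whose title matches the keyword must score high, and the larger the matched share of the title, the higher the score. Second, within the set of title matches, more recent positions score higher. Finally, a position updated within the past month whose title is fully matched scores highest.

The title keyword score can be set through keyword boosts, but the more complex time calculation and the title match ratio cannot be achieved with simple boosts. Solr does, however, offer rich ways to extend scoring inside a query, and edismax is especially helpful here.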

Solution

Assume the current keyword is 技术总监 (technical director).

Make sure results whose title matches the keyword get the highest keyword score (the second-tier score)

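position_title:"技术总监"~10^6000000

This is only an additional scoring clause, not the whole query, because the keyword has to be searched in both the position title and the requirement fields.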

The actual keyword query:

(keyword:("技术总监"~10000 OR position_title:"技术总监"~10000^6000000 OR (location:"技术总监"~10000) OR (address:"技术总监"~10000) OR (requirement:"技术总监"~10000) OR (full_company_name:"技术总监"~10000^100) OR (position_title:技术^6000000) OR (position_title:总监^3500000) OR (keyword:总监^35000)) AND (keyword:"技术总监"~10000))

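Use the edismax bf parameter so that, among title matches, the most recently updated positions score higher (the third-tier score)

My system stores the latest modification time as a Long, so the Solr built-in functions that compute the influence of time on ranking look as follows, denoted delt_days: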

sub(360,div(sub(current_time_mills,d_long_modify_time),86400000))

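Subtracting the modification time from the current time and dividing by the number of milliseconds in a day gives the number of days since the last modification. Subtracting that number from 360 yields a value that can serve as the ranking score, optionally with an extra weight.

The scenario only considers the effect of time and keywords for positions updated within one year, so Solr's built-in map function is used as an if/else selection, denoted time_score: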

map(delt_days,0,360,delt_days,0)

This means that for positions older than 360 days the modification time contributes nothing to the score.
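To make the arithmetic concrete, here is a minimal sketch that reproduces delt_days and time_score in plain Java. The class and method names are mine, not part of Solr. Solr evaluates these functions with floating-point values and, as far as I know, treats the min and max of map as inclusive, which is what the sketch assumes.

```java
/** Numeric sketch of the delt_days / time_score function queries (names are illustrative). */
final class TimeScoreSketch {
    static final double DAY_MILLIS = 86_400_000d;

    /** Mirrors sub(360, div(sub(now, modified), 86400000)). */
    static double deltDays(long nowMillis, long modifiedMillis) {
        return 360d - (nowMillis - modifiedMillis) / DAY_MILLIS;
    }

    /** Mirrors map(x, min, max, target, default): target if min <= x <= max, else default. */
    static double map(double x, double min, double max, double target, double defaultValue) {
        return (x >= min && x <= max) ? target : defaultValue;
    }

    /** Mirrors map(delt_days, 0, 360, delt_days, 0). */
    static double timeScore(long nowMillis, long modifiedMillis) {
        double d = deltDays(nowMillis, modifiedMillis);
        return map(d, 0, 360, d, 0);
    }

    public static void main(String[] args) {
        long now = 1_441_870_709_262L; // arbitrary sample timestamp in milliseconds
        long day = 86_400_000L;
        System.out.println(timeScore(now, now - 30 * day));  // position updated 30 days ago
        System.out.println(timeScore(now, now - 400 * day)); // position updated 400 days ago
    }
}
```

A position updated 30 days ago gets 360 - 30 = 330, while one updated 400 days ago falls outside the [0, 360] range and gets 0.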

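At this point the influence of the modification time on ranking is in place. A boost or a product can be used to tune its tier relative to the keyword score from the first part: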

product(time_score,100)

or

time_score^100

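Make sure positions updated within one month whose title is fully matched score highest (the first-tier score)

How can results that were updated within one month and whose title contains the keyword "技术总监" be boosted further? One idea is to store the title length with each position and add an extra scoring condition at query time, so that positions whose title length equals the keyword length move up another tier. Yes, it is that simple. This is denoted title_long_weight: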

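map(div(sub(current_time_mills,d_long_modify_time),86400000),0,30,map(d_long_position_length,4,4,10,0),0)^200000

Here 4 is the length of the keyword 技术总监. The inner map yields 10 when the title length is exactly 4, and the outer map applies it only when the position was modified within the last 30 days.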

With these three conditions in place, the final ranking-related parameters are:

q:position_title:"技术总监"~10^6000000

bf:map(div(sub(current_time_mills,d_long_modify_time),86400000),0,30,map(d_long_position_length,4,4,10,0),0)^200000 product(map(sub(360,div(sub(current_time_mills,d_long_modify_time),86400000)),0,360,sub(360,div(sub(current_time_mills,d_long_modify_time),86400000)),0),100)^1
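The bf value above is easier to follow when it is assembled from its parts. The sketch below does that with plain string concatenation. The helper names are my own, and it assumes that current_time_mills is a placeholder that the application replaces with the actual timestamp before sending the query.

```java
/** Sketch: assembles the bf value shown above from its parts (helper names are illustrative). */
final class BfSketch {
    /** div(sub(now,timeField),86400000): days since the last modification. */
    static String days(long nowMillis, String timeField) {
        return "div(sub(" + nowMillis + "," + timeField + "),86400000)";
    }

    /** First tier: 10 when the title is exactly as long as the keyword, only within 30 days. */
    static String titleBonus(long nowMillis, String timeField, String lengthField, int keywordLength) {
        String exactLength = "map(" + lengthField + "," + keywordLength + "," + keywordLength + ",10,0)";
        return "map(" + days(nowMillis, timeField) + ",0,30," + exactLength + ",0)^200000";
    }

    /** Third tier: 360 minus the age in days, counted only inside [0, 360], times 100. */
    static String timeScore(long nowMillis, String timeField) {
        String deltDays = "sub(360," + days(nowMillis, timeField) + ")";
        return "product(map(" + deltDays + ",0,360," + deltDays + ",0),100)^1";
    }

    /** Joins both functions with a space, as the bf parameter expects. */
    static String bf(long nowMillis, int keywordLength) {
        return titleBonus(nowMillis, "d_long_modify_time", "d_long_position_length", keywordLength)
                + " " + timeScore(nowMillis, "d_long_modify_time");
    }

    public static void main(String[] args) {
        int keywordLength = 4; // the sample keyword in this post has four characters
        System.out.println(bf(System.currentTimeMillis(), keywordLength));
    }
}
```

Passing the keyword length as a parameter avoids hard-coding 4, so the same helper works for keywords of other lengths.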

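Notes:

Remove the effect of inverse document frequency on title matches: the keyword scoring model of Lucene/Solr is the vector space model, in which tf/idf moderates how keyword hits affect the score. The more documents a keyword appears in, the more that keyword's score is scaled down. For the keywords in this scenario a title match is what matters, so this adjustment is not needed. The fix is to override the public float idf(long docFreq, long numDocs) method of org.apache.lucene.search.similarities.DefaultSimilarity: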

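import org.apache.lucene.search.similarities.DefaultSimilarity;

public class HunterOnSimilarity extends DefaultSimilarity {

	// ..... (other members elided in the original)
	@Override
	public float idf(long docFreq, long numDocs) {
		// Return a constant so that document frequency no longer scales the score
		return 1f;
	}
}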
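To see what the override removes, the following sketch compares a constant idf with the classic formula, 1 + ln(numDocs / (docFreq + 1)). As far as I recall this is what DefaultSimilarity computes in the Lucene 4.x line, but treat the formula as an assumption and check it against your Lucene version. The class name is illustrative.

```java
/** Sketch comparing the classic idf curve with the flat idf used in this post. */
final class IdfSketch {
    /** Classic TF-IDF idf (assumed to match DefaultSimilarity in Lucene 4.x). */
    static float classicIdf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** The override from this post: every term gets the same idf. */
    static float flatIdf(long docFreq, long numDocs) {
        return 1f;
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000L; // hypothetical index size
        for (long docFreq : new long[] {10L, 1_000L, 100_000L}) {
            System.out.println(docFreq + " docs: classic=" + classicIdf(docFreq, numDocs)
                    + ", flat=" + flatIdf(docFreq, numDocs));
        }
    }
}
```

With the classic formula a common title word such as one found in 100,000 documents scores well below a rare one. With the flat version both contribute equally, so only the boosts decide.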

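If a single input contains several tokens after word segmentation, partial title matches do not get a score of their own. For example, the input "java工程师" is clearly segmented into "java" and "工程师", and our keyword clause is position_title:"java工程师"~10^6000000. Titles that match both tokens come first in the results, which is fine. But some positions have titles that match only "java" and not "工程师", and those results receive no keyword boost at all. The right approach is to segment first and then add the tokens one by one as separate clauses. This also lets the order of the keywords influence the ranking of the result set.

With edismax, the default operator configured in schema.xml no longer takes effect, so the operators have to be spelled out manually.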

     <solrQueryParser defaultOperator="AND"/>
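The last two notes can be sketched together: one clause per token with its own boost, and operators written out explicitly. This is only a sketch under assumptions. The tokens are taken as given (the analyzer call is not shown), halving the boost per token is an arbitrary choice of mine, and special characters in the tokens are not escaped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: per-token title clauses with explicit operators (names and boosts are illustrative). */
final class KeywordClauseSketch {
    /** Phrase clause plus one boosted clause per token, joined with an explicit OR. */
    static String titleClauses(String keyword, List<String> tokens, long topBoost) {
        List<String> clauses = new ArrayList<>();
        clauses.add("position_title:\"" + keyword + "\"~10^" + topBoost);
        long boost = topBoost;
        for (String token : tokens) {
            clauses.add("position_title:" + token + "^" + boost);
            boost = Math.max(1, boost / 2); // earlier tokens weigh more than later ones
        }
        return "(" + String.join(" OR ", clauses) + ")";
    }

    /** Joins clauses with an explicit AND instead of relying on defaultOperator. */
    static String allOf(String... clauses) {
        return "(" + String.join(" AND ", clauses) + ")";
    }

    public static void main(String[] args) {
        // The escapes spell the Chinese word for "engineer" from the example above.
        String engineer = "\u5de5\u7a0b\u5e08";
        String title = titleClauses("java" + engineer, Arrays.asList("java", engineer), 6000000L);
        System.out.println(allOf(title, "keyword:\"java" + engineer + "\"~10000"));
    }
}
```

Because each token has its own clause, a title that matches only "java" still picks up a boost, and the first token counts for more than the second.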

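Optimization

At this point the complex ranking requirements from the scenario at the top of this post are solved, and bf can express even more complex ranking.

However, bf code like this: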

map(div(sub(current_time_mills,d_long_modify_time),86400000),0,30,map(d_long_position_length,4,4,10,0),0)^200000 product(map(sub(360,div(sub(current_time_mills,d_long_modify_time),86400000)),0,360,sub(360,div(sub(current_time_mills,d_long_modify_time),86400000)),0),100)^1

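is very hard to read and very hard to maintain, and it really needs formatting, so I wrote a very simple Java wrapper for it.

With the wrapper, readability does not improve much, but the amount of string concatenation in the code drops sharply, code reuse becomes very easy, and maintenance gets much easier.

The following wraps some Solr built-in functions that are commonly used in the project: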

package com.hunteron.solr.func;
/**
 * A node in a Solr function query that can render itself as a string.
 * @author Smile.Wu
 * @version 2015-9-7
 */
public interface IFunction {

    public String build();
}

package com.hunteron.solr.func.math;

import com.hunteron.solr.func.IFunction;

/**
 * Base class for Solr functions of the form operator(arg1,arg2,...).
 * @author Smile.Wu
 * @version 2015-9-7
 */
public abstract class SolrDefaultFunc implements IFunction {


    private IFunction arg1;
    private IFunction arg2;
    private IFunction[] args = null;
    
    public SolrDefaultFunc(IFunction arg1, IFunction arg2, IFunction... args) {
        this.arg1 = arg1;
        this.arg2 = arg2;
        this.args = args;
    }
    
    public SolrDefaultFunc(int arg1, int arg2, IFunction... args) {
        this.arg1 = new SolrJavaLang(arg1);
        this.arg2 = new SolrJavaLang(arg2);
        this.args = args;
    }
    
    // Name of the Solr function, e.g. "map" or "sub"
    public abstract String getOperator();

    @Override
    public String build() {
        String operator = getOperator();
        if(args == null || args.length < 1) {
            // One or two arguments only
            if(arg2 == null) {
                return operator + "(" + arg1.build() + ")";
            }
            return operator + "(" + arg1.build() + "," + arg2.build() + ")";
        } else {
            // Two fixed arguments followed by the variable ones
            StringBuilder query = new StringBuilder();
            query.append(operator).append("(").append(arg1.build());
            query.append(",").append(arg2.build());
            for(IFunction fun : args) {
                query.append(",").append(fun.build());
            }
            query.append(")");
            return query.toString();
        }
    }
    
}

package com.hunteron.solr.func.math;

import com.hunteron.solr.func.IFunction;

/**
 * Solr map function: map(x,min,max,target) or map(x,min,max,target,default).
 * @author Smile.Wu
 * @version 2015-9-7
 */
public class SolrMap extends SolrDefaultFunc {

    public SolrMap(IFunction arg1, IFunction arg2, IFunction arg3, IFunction arg4) {
        super(arg1, arg2, arg3, arg4);
    }
    public SolrMap(IFunction arg1, IFunction arg2, IFunction arg3, IFunction arg4, IFunction arg5) {
        super(arg1, arg2, arg3, arg4, arg5);
    }
    @Override
    public String getOperator() {
        return "map";
    }
}
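The post does not show SolrJavaLang, SolrSub or SolrDiv. The self-contained sketch below is my guess at the idea behind them, not the original classes: Literal stands in for SolrJavaLang (a field name or a number written out verbatim), and Func is a generic operator(arg1,arg2,...) node. All names here are hypothetical.

```java
/** Self-contained sketch of the wrapper idea; all types here are hypothetical stand-ins. */
final class FuncSketch {
    interface IFunction {
        String build();
    }

    /** Stand-in for SolrJavaLang: emits a field name or a number verbatim. */
    static final class Literal implements IFunction {
        private final String text;

        Literal(String fieldName) { this.text = fieldName; }
        Literal(long number) { this.text = Long.toString(number); }

        @Override
        public String build() { return text; }
    }

    /** Generic function node rendered as operator(arg1,arg2,...). */
    static final class Func implements IFunction {
        private final String operator;
        private final IFunction[] args;

        Func(String operator, IFunction... args) {
            this.operator = operator;
            this.args = args;
        }

        @Override
        public String build() {
            StringBuilder query = new StringBuilder(operator).append("(");
            for (int i = 0; i < args.length; i++) {
                if (i > 0) {
                    query.append(",");
                }
                query.append(args[i].build());
            }
            return query.append(")").toString();
        }
    }

    static IFunction sub(IFunction a, IFunction b) { return new Func("sub", a, b); }
    static IFunction div(IFunction a, IFunction b) { return new Func("div", a, b); }

    public static void main(String[] args) {
        IFunction days = div(
                sub(new Literal(1441870709262L), new Literal("d_long_modify_time")),
                new Literal(86400000L));
        System.out.println(days.build());
    }
}
```

The one-class-per-function layout used in the post gives each Solr function its own type and constructor arity, while a generic node like Func trades that checking for less code.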

Some very frequently used functions are wrapped in SolrFunctionUtil.java:

package com.hunteron.solr.func.tools;

import com.hunteron.solr.func.IFunction;
import com.hunteron.solr.func.math.SolrDiv;
import com.hunteron.solr.func.math.SolrJavaLang;
import com.hunteron.solr.func.math.SolrMap;
import com.hunteron.solr.func.math.SolrProduct;
import com.hunteron.solr.func.math.SolrSub;
import com.hunteron.solr.func.math.SolrSum;

/**
 *
 * @author Smile.Wu
 * @version 2015-9-10
 */
public class SolrFunctionUtil {
	public static final long DAY_MILLS = 86400000L;
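	/**
	 * Computes the number of days between the given field's time and the current time
	 * @param field the field that stores the timestamp
	 * @return a function of the form
	 * div(sub(1441870709262,field),86400000)
	 */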
	public static IFunction daysFromFieldTime(String field) {
		long currentTime = System.currentTimeMillis();
		return daysFromFieldTime(currentTime, field);
	}
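	/**
	 * Computes the number of days between the given field's time and the current time
	 * @param currentTime the current time, System.currentTimeMillis()
	 * @param field the field that stores the timestamp
	 * @return a function of the form
	 * div(sub(currentTime,field),86400000)
	 */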
	public static IFunction daysFromFieldTime(long currentTime, String field) {
		
		return new SolrDiv(new SolrSub(new SolrJavaLang(currentTime), new SolrJavaLang(field)), new SolrJavaLang(DAY_MILLS));
	}
	
	/**
	 * Evaluates condition: if its value lies between left and right, the result is condition itself, otherwise other
	 * @param condition
	 * @param left
	 * @param right
	 * @param other
	 * @return
	 */
	public static IFunction mapSelectDefaultCondition(IFunction condition, long left, long right, IFunction other) {
		
		return mapSelect(condition, left, right, condition, other);
	}	
	/**
	 * Evaluates condition: if its value lies between left and right, the result is condition itself, otherwise other
	 * @param condition
	 * @param left
	 * @param right
	 * @param other
	 * @return
	 */
	public static IFunction mapSelectDefaultCondition(IFunction condition, long left, long right, long other) {
		
		return mapSelect(condition, left, right, condition, new SolrJavaLang(other));
	}
	/**
	 * Evaluates condition: if its value lies between left and right, the result is yesValue, otherwise other
	 * @param condition
	 * @param left
	 * @param right
	 * @param yesValue
	 * @param other
	 * @return
	 */
	public static IFunction mapSelect(IFunction condition, long left, long right, IFunction yesValue, IFunction other) {
		
		return new SolrMap(condition, new SolrJavaLang(left), new SolrJavaLang(right), yesValue, other);
	}
	/**
	 * Evaluates condition: if its value lies between left and right, the result is yesValue, otherwise other
	 * @param condition
	 * @param left
	 * @param right
	 * @param yesValue
	 * @param other
	 * @return
	 */
	public static IFunction mapSelect(IFunction condition, long left, long right, IFunction yesValue, long other) {
		
		return new SolrMap(condition, new SolrJavaLang(left), new SolrJavaLang(right), yesValue, new SolrJavaLang(other));
	}
	/**
	 * Evaluates field: if its value lies between left and right, the result is yesValue, otherwise other
	 * @param field
	 * @param left
	 * @param right
	 * @param yesValue
	 * @param other
	 * @return
	 */
	public static IFunction mapSelect(String field, long left, long right, long yesValue, long other) {
		
		return new SolrMap(new SolrJavaLang(field), new SolrJavaLang(left), new SolrJavaLang(right), new SolrJavaLang(yesValue), new SolrJavaLang(other));
	}
	
	/**
	 * Solr function for the composite sort (time part)
	 * @param current
	 * @param filed
	 * @return
	 */
	public static IFunction getCompositeTimeSortFunction(long current, String filed) {
		return new SolrProduct(
				SolrFunctionUtil.mapSelectDefaultCondition(
						new SolrSub(new SolrJavaLang(360), daysFromFieldTime(current, filed)), 0, 360, 0),
						new SolrJavaLang(100));
	}
	/**
	 * Solr function for descending commission order within the composite sort
	 * @param currentTime
	 * @param timeField
	 * @param rewordField
	 * @return
	 */
	public static IFunction getCompositeAmountSortFunction(long currentTime, String timeField, String rewordField) {
		
		// Days from the field's value (a millisecond timestamp) to the current time
		IFunction deltDaysFunction = daysFromFieldTime(currentTime, timeField);
		
		IFunction rewordFunction = new SolrDiv(new SolrJavaLang(rewordField), new SolrJavaLang(1000));
		
		IFunction outerMap = mapSelect(deltDaysFunction, 0, 7, new SolrJavaLang(30), mapSelect(deltDaysFunction, 7, 37, new SolrJavaLang(20), 10));
		
		return new SolrSum(new SolrProduct(outerMap, new SolrJavaLang(100000)), 
				new SolrProduct(mapSelectDefaultCondition(rewordFunction, 0, 10000, 10000), new SolrJavaLang(60)));
	}
	
	
	public static void main(String[] args) {
		IFunction deltDays = daysFromFieldTime("d_long_modify_time");
		
		IFunction deltDaysFromYear = new SolrSub(new SolrJavaLang(360), deltDays);
		
		IFunction map = mapSelectDefaultCondition(deltDaysFromYear, 0L, 30L, 0L);
		System.out.println(map.build());
		
		System.out.println(mapSelect("priority", 2, 2, 2, 0).build());
	}
}

Then add a Builder that makes it convenient to set weights and concatenate functions:

package com.hunteron.solr.func.tools;

import java.util.ArrayList;
import java.util.List;

import com.hunteron.solr.func.IFunction;

/**
 *
 * @author Smile.Wu
 * @version 2015-9-10
 * Combines Solr query functions
 */
public class SolrFunctionBuilder {

	private List<SolrFunctionWeight> functions = new ArrayList<SolrFunctionWeight>();
	
	public SolrFunctionBuilder() {
	}
	public SolrFunctionBuilder(SolrFunctionWeight function) {
		appendFunction(function);
	}
	
	public void appendFunction(SolrFunctionWeight function) {
		if(function != null) {
			functions.add(function);
		}
	}
	public void appendFunction(IFunction fun, int weight) {
		appendFunction(new SolrFunctionWeight(fun, weight));
	}
	/**
	 * Merges the query functions and applies their weights
	 * @return
	 */
	public String build() {
		StringBuilder functionQuery = new StringBuilder("");
		if(functions.size() > 0) {
			for(SolrFunctionWeight function : functions) {
				String query = function.build();
				if(query.length() > 0) {
					functionQuery.append(query).append(" ");
				}
			}
		}
		return functionQuery.toString();
	}
}
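SolrFunctionWeight is used by the Builder but is not shown in the post. A plausible minimal version, which is an assumption on my part, pairs a function with an integer weight and renders it as function^weight. The sketch below is self-contained, so it declares its own tiny IFunction, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a weighted function and the space-separated join used for bf (names hypothetical). */
final class WeightSketch {
    interface IFunction {
        String build();
    }

    /** Hypothetical stand-in for SolrFunctionWeight: renders function^weight. */
    static final class FunctionWeight {
        private final IFunction function;
        private final int weight;

        FunctionWeight(IFunction function, int weight) {
            this.function = function;
            this.weight = weight;
        }

        String build() {
            String query = function.build();
            return query.isEmpty() ? "" : query + "^" + weight;
        }
    }

    /** Joins the non-empty weighted functions with single spaces, without a trailing space. */
    static String join(List<FunctionWeight> functions) {
        StringBuilder bf = new StringBuilder();
        for (FunctionWeight function : functions) {
            String query = function.build();
            if (!query.isEmpty()) {
                if (bf.length() > 0) {
                    bf.append(' ');
                }
                bf.append(query);
            }
        }
        return bf.toString();
    }

    public static void main(String[] args) {
        List<FunctionWeight> functions = new ArrayList<>();
        functions.add(new FunctionWeight(() -> "map(priority,2,2,2,0)", 200000));
        functions.add(new FunctionWeight(() -> "product(time_score,100)", 1));
        System.out.println(join(functions));
    }
}
```

Unlike the Builder above, this join does not leave a trailing space after the last function. That difference is cosmetic as far as I can tell.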

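The final test code looks like this. It is still hard to read and only looks a bit tidier. The common functions are not wrapped here, so it is very long and repetitive, and in fact much of it could be reused: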

package com.hunteron.solr.func;

import com.hunteron.solr.func.math.SolrDiv;
import com.hunteron.solr.func.math.SolrJavaLang;
import com.hunteron.solr.func.math.SolrMap;
import com.hunteron.solr.func.math.SolrProduct;
import com.hunteron.solr.func.math.SolrSub;

/**
 *
 * @author Smile.Wu
 * @version 2015-9-7
 */
public class TestSolrFunc {

	public static void main(String[] args) {
		
		IFunction finalFunc = 
				new SolrSub(
						new SolrProduct(
								new SolrMap(
										new SolrDiv(
												new SolrSub(new SolrJavaLang("current"), new SolrJavaLang("d_long_modify_time")), 
												new SolrJavaLang(86400000L)
											), 
										new SolrJavaLang(0), 
										new SolrJavaLang(7), 
										new SolrJavaLang(30), 
										new SolrMap(
												new SolrDiv(
														new SolrSub(new SolrJavaLang("current"), new SolrJavaLang("d_long_modify_time")), 
														new SolrJavaLang(86400000L)
													), 
												new SolrJavaLang(7), 
												new SolrJavaLang(30), 
												new SolrJavaLang(20), 
												new SolrJavaLang(10)
											)
									), 
									new SolrJavaLang(10000)
							), 
							new SolrMap(
									new SolrDiv(new SolrJavaLang("amountField"), new SolrJavaLang(1000)), 
									new SolrJavaLang(0), 
									new SolrJavaLang(10000), 
									new SolrDiv(new SolrJavaLang("amountField"), new SolrJavaLang(1000)),
									new SolrJavaLang(10000))
					);
		long b = System.currentTimeMillis();
		System.out.println(finalFunc.build());
		System.out.println("cost : " + (System.currentTimeMillis() - b));
	}
}

Posted 2015-10-12 09:15 by 极品健健
