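Mahout Recommendation 7: Handling Data Without Preference Values
Sometimes users and items are associated, but nothing describes how strong the association is, for example when a user simply browses an article.
In-memory implementation for data without preference values:
The important parts are the DataModel and the DataModelBuilder implementations.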
package mahout;

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.DataModelBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import org.apache.mahout.common.RandomUtils;

/**
 * @author Administrator
 */
public class TestRecommenderEvaluator {

    public static void main(String[] args) throws Exception {
        // Force the same random values on every run so results are repeatable
        // RandomUtils.useTestSeed();

        // Load the data and drop the preference values
        DataModel dataModel = new GenericBooleanPrefDataModel(
                GenericBooleanPrefDataModel.toDataMap(
                        new FileDataModel(new File("data/ua.base"))));

        // Evaluator based on the average absolute difference
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        // Evaluator based on the root mean square error
        // RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();

        // Builder that creates the recommender, same as in the previous example
        RecommenderBuilder builder = new RecommenderBuilder() {
            @Override
            public Recommender buildRecommender(DataModel model) throws TasteException {
                // User similarity; several implementations are available
                UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                // User neighborhood: the 10 nearest users
                UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
                // The recommender itself
                return new GenericUserBasedRecommender(model, neighborhood, similarity);
            }
        };

        // Builder that makes the evaluator use a boolean-preference DataModel for its training data
        DataModelBuilder modelBuilder = new DataModelBuilder() {
            @Override
            public DataModel buildDataModel(FastByIDMap<PreferenceArray> trainingData) {
                return new GenericBooleanPrefDataModel(
                        GenericBooleanPrefDataModel.toDataMap(trainingData));
            }
        };

        // Evaluation score (average difference): train on 90% of the data, test on 10%.
        // "Mahout in Action" uses 0.7, but that produced NaN here.
        double score = evaluator.evaluate(builder, modelBuilder, dataModel, 0.9, 1.0);
        System.out.println(score);
    }
}
Result:
14/08/04 11:33:26 INFO file.FileDataModel: Creating FileDataModel for file data\ua.base
14/08/04 11:33:26 INFO file.FileDataModel: Reading file info...
14/08/04 11:33:27 INFO file.FileDataModel: Read lines: 90570
14/08/04 11:33:27 INFO file.FileDataModel: Reading file info...
14/08/04 11:33:27 INFO file.FileDataModel: Read lines: 0
14/08/04 11:33:27 INFO model.GenericDataModel: Processed 943 users
14/08/04 11:33:27 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation using 0.9 of GenericBooleanPrefDataModel[users:1,2,3...]
Exception in thread "main" java.lang.IllegalArgumentException: DataModel doesn't have preference values
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
    at org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity.<init>(PearsonCorrelationSimilarity.java:74)
    at org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity.<init>(PearsonCorrelationSimilarity.java:66)
    at mahout.TestRecommenderEvaluator$1.buildRecommender(TestRecommenderEvaluator.java:45)
    at org.apache.mahout.cf.taste.impl.eval.AbstractDifferenceRecommenderEvaluator.evaluate(AbstractDifferenceRecommenderEvaluator.java:125)
    at mahout.TestRecommenderEvaluator.main(TestRecommenderEvaluator.java:60)
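An exception: an illegal argument, because the DataModel doesn't have preference values. But data without preference values is exactly what we are using, so why are preference values required???
The culprit is PearsonCorrelationSimilarity, the user similarity metric. When preference values are missing, a metric such as Euclidean distance refuses to work, and the Pearson correlation coefficient is undefined, so both of these user similarity computations depend on preference values. In other words, we chose the wrong similarity metric. Replace PearsonCorrelationSimilarity with LogLikelihoodSimilarity.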
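To make the "undefined" part concrete, here is a minimal sketch in plain Java, not Mahout code. The class and method names (BooleanSimilarityDemo, pearson, tanimoto) are made up for illustration. It computes a textbook Pearson correlation on two users whose preferences are all 1.0, which ends in 0/0, and then a Tanimoto (Jaccard) coefficient as a simple stand-in for a similarity that only looks at which items two users share. LogLikelihoodSimilarity uses a different formula that is not reproduced here; the stand-in only shows that such a similarity needs no preference values.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only; hypothetical names, not part of Mahout. */
class BooleanSimilarityDemo {

    /** Textbook Pearson correlation of two equally long preference vectors. */
    static double pearson(double[] x, double[] y) {
        double meanX = Arrays.stream(x).average().orElse(Double.NaN);
        double meanY = Arrays.stream(y).average().orElse(Double.NaN);
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // With constant vectors both variances are 0, so this is 0.0 / 0.0 = NaN
        return cov / Math.sqrt(varX * varY);
    }

    /** Tanimoto (Jaccard) coefficient: size of intersection / size of union. */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? Double.NaN : (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        // Two users whose every "preference" is just 1.0 (item was seen)
        double[] user1 = {1.0, 1.0, 1.0};
        double[] user2 = {1.0, 1.0, 1.0};
        System.out.println(pearson(user1, user2)); // NaN

        // The same kind of data viewed as sets of item IDs
        Set<Long> items1 = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> items2 = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        System.out.println(tanimoto(items1, items2)); // 0.5
    }
}
```

The set-based measure still separates similar users from dissimilar ones, while the value-based one has nothing to work with once every value is identical.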
Result:
14/08/04 11:40:38 INFO file.FileDataModel: Creating FileDataModel for file data\ua.base
14/08/04 11:40:38 INFO file.FileDataModel: Reading file info...
14/08/04 11:40:39 INFO file.FileDataModel: Read lines: 90570
14/08/04 11:40:39 INFO file.FileDataModel: Reading file info...
14/08/04 11:40:39 INFO file.FileDataModel: Read lines: 0
14/08/04 11:40:39 INFO model.GenericDataModel: Processed 943 users
14/08/04 11:40:39 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation using 0.9 of GenericBooleanPrefDataModel[users:1,2,3...]
14/08/04 11:40:39 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation of 882 users
14/08/04 11:40:39 INFO eval.AbstractDifferenceRecommenderEvaluator: Starting timing of 882 tasks in 4 threads
14/08/04 11:40:39 INFO eval.StatsCallable: Average time per recommendation: 70ms
14/08/04 11:40:39 INFO eval.StatsCallable: Approximate memory used: 25MB / 112MB
14/08/04 11:40:39 INFO eval.StatsCallable: Unable to recommend in 30 cases
14/08/04 11:40:47 INFO eval.AbstractDifferenceRecommenderEvaluator: Evaluation result: 0.0
0.0
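The result is 0, a perfect match. Isn't that a little too good!!!
Unfortunately, it is. This score is the average difference between estimated and actual preferences, and when every preference value is 1 that difference is naturally 0. The test is meaningless here, because 0 is the only value it can output.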
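A small plain-Java sketch of why the score cannot be anything but 0. It assumes, as a simplification, that an estimate is a similarity-weighted average of the neighbors' preference values; the names BooleanEvalDemo, estimate and averageAbsoluteDifference are hypothetical and are not Mahout API.

```java
/** Illustrative only; hypothetical names, not part of Mahout. */
class BooleanEvalDemo {

    /** Similarity-weighted average of the neighbors' preference values. */
    static double estimate(double[] similarities, double[] neighborPrefs) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            weightedSum += similarities[i] * neighborPrefs[i];
            totalWeight += similarities[i];
        }
        return weightedSum / totalWeight;
    }

    /** Mean of |estimated - actual| over all held-out preferences. */
    static double averageAbsoluteDifference(double[] estimated, double[] actual) {
        double total = 0.0;
        for (int i = 0; i < estimated.length; i++) {
            total += Math.abs(estimated[i] - actual[i]);
        }
        return total / estimated.length;
    }

    public static void main(String[] args) {
        // Every neighbor's preference is 1.0, whatever the similarity weights are
        double[] allOnes = {1.0, 1.0, 1.0};
        double[] estimated = {
            estimate(new double[] {0.9, 0.4, 0.1}, allOnes),
            estimate(new double[] {0.2, 0.7, 0.3}, allOnes)
        };
        // The held-out "actual" preferences are also all 1.0
        double[] actual = {1.0, 1.0};
        System.out.println(averageAbsoluteDifference(estimated, actual)); // 0.0
    }
}
```

Changing the similarity weights changes nothing: a weighted average of ones is always 1, so the difference from the actual value 1 is always 0.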
.......................
Precision and recall, however, are still valid measures here; see the next article.