Mahout Recommenders 11: Exploring User Neighborhoods
1. Fixed-size user neighborhood
package mahout;

import java.io.File;
import java.io.IOException;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.LoadEvaluator;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import org.apache.mahout.cf.taste.similarity.precompute.example.GroupLensDataModel;

public class GroupLensDataModelTest {

    public static void main(String[] args) throws Exception {
        // Data set
        DataModel dataModel = new GroupLensDataModel(new File("data/ratings.dat"));
        // Evaluator based on the average absolute difference
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        // Builder for the recommender under evaluation
        RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
            @Override
            public Recommender buildRecommender(DataModel dataModel) throws TasteException {
                // User similarity
                UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
                // User neighborhood with a fixed size of 100
                UserNeighborhood userNeighborhood =
                        new NearestNUserNeighborhood(100, userSimilarity, dataModel);
                // User-based recommender
                return new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);
            }
        };
        // Evaluation score
        double score = evaluator.evaluate(recommenderBuilder, null, dataModel, 0.95, 0.05);
        System.out.println(score);
    }

    private static void recommend() throws IOException, TasteException {
        // Use the custom GroupLensDataModel when the data set has not been converted to CSV
        DataModel dataModel = new GroupLensDataModel(new File("data/ratings.dat"));
        // Pearson correlation as the measure of user similarity
        UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
        // User neighborhood of 100 users
        UserNeighborhood userNeighborhood =
                new NearestNUserNeighborhood(100, userSimilarity, dataModel);
        // Recommender
        Recommender recommender =
                new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);
        // Run the load test
        LoadEvaluator.runLoad(recommender);
    }
}
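Here the user neighborhood size is 100. The last two arguments of evaluate control how the data is split. 0.95 is the training percentage: for each evaluated user, 95% of that user's preferences are used to build the model and the remaining 5% are held out for testing. The final 0.05 is the evaluation percentage: only about 5% of the users take part in the evaluation, which keeps the run time manageable.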
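The two percentages are easy to mix up, so here is a minimal sketch of the idea in plain Java. It is not Mahout's code: the class HoldOutSplitSketch and its methods are hypothetical names made up for illustration, and the random draws are only an assumption about how such a split can be done.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: illustrates a per-user hold-out split.
final class HoldOutSplitSketch {

    /** One user's ratings, divided into a training part and a test part. */
    static final class Split {
        final List<Double> training = new ArrayList<>();
        final List<Double> test = new ArrayList<>();
    }

    /** Each rating lands in the training part with probability trainingPercentage. */
    static Split splitUser(List<Double> ratings, double trainingPercentage, Random random) {
        Split split = new Split();
        for (double rating : ratings) {
            if (random.nextDouble() < trainingPercentage) {
                split.training.add(rating);
            } else {
                split.test.add(rating);
            }
        }
        return split;
    }

    /** A user takes part in the evaluation with probability evaluationPercentage. */
    static boolean includeUser(double evaluationPercentage, Random random) {
        return random.nextDouble() < evaluationPercentage;
    }

    public static void main(String[] args) {
        Random random = new Random(42L);
        List<Double> ratings = Arrays.asList(5.0, 3.0, 4.0, 1.0, 2.0, 4.0, 3.0, 5.0);
        if (includeUser(0.05, random)) {
            Split split = splitUser(ratings, 0.95, random);
            System.out.println("training=" + split.training + " test=" + split.test);
        } else {
            System.out.println("user skipped in this evaluation run");
        }
    }
}
```

With 0.95 and 0.05, most users are skipped entirely, and the users that are kept lose only a small share of their ratings to the test part.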
Output: 0.8316465777957014
14/08/05 10:14:20 INFO file.FileDataModel: Creating FileDataModel for file C:\Users\ADMINI~1\AppData\Local\Temp\ratings.txt
14/08/05 10:14:20 INFO file.FileDataModel: Reading file info...
14/08/05 10:14:21 INFO file.FileDataModel: Processed 1000000 lines
14/08/05 10:14:23 INFO file.FileDataModel: Processed 2000000 lines
14/08/05 10:14:24 INFO file.FileDataModel: Processed 3000000 lines
14/08/05 10:14:25 INFO file.FileDataModel: Processed 4000000 lines
14/08/05 10:14:27 INFO file.FileDataModel: Processed 5000000 lines
14/08/05 10:14:28 INFO file.FileDataModel: Processed 6000000 lines
14/08/05 10:14:29 INFO file.FileDataModel: Processed 7000000 lines
14/08/05 10:14:30 INFO file.FileDataModel: Processed 8000000 lines
14/08/05 10:14:31 INFO file.FileDataModel: Processed 9000000 lines
14/08/05 10:14:34 INFO file.FileDataModel: Processed 10000000 lines
14/08/05 10:14:34 INFO file.FileDataModel: Read lines: 10000054
14/08/05 10:14:34 INFO model.GenericDataModel: Processed 10000 users
14/08/05 10:14:35 INFO model.GenericDataModel: Processed 20000 users
14/08/05 10:14:35 INFO model.GenericDataModel: Processed 30000 users
14/08/05 10:14:35 INFO model.GenericDataModel: Processed 40000 users
14/08/05 10:14:36 INFO model.GenericDataModel: Processed 50000 users
14/08/05 10:14:39 INFO model.GenericDataModel: Processed 60000 users
14/08/05 10:14:40 INFO model.GenericDataModel: Processed 69878 users
14/08/05 10:14:41 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation using 0.95 of GroupLensDataModel
14/08/05 10:14:41 INFO model.GenericDataModel: Processed 3410 users
14/08/05 10:14:42 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation of 3120 users
14/08/05 10:14:42 INFO eval.AbstractDifferenceRecommenderEvaluator: Starting timing of 3120 tasks in 4 threads
14/08/05 10:14:42 INFO eval.StatsCallable: Average time per recommendation: 46ms
14/08/05 10:14:42 INFO eval.StatsCallable: Approximate memory used: 311MB / 755MB
14/08/05 10:14:42 INFO eval.StatsCallable: Unable to recommend in 0 cases
14/08/05 10:15:04 INFO eval.StatsCallable: Average time per recommendation: 86ms
14/08/05 10:15:04 INFO eval.StatsCallable: Approximate memory used: 279MB / 639MB
14/08/05 10:15:04 INFO eval.StatsCallable: Unable to recommend in 4540 cases
14/08/05 10:15:25 INFO eval.StatsCallable: Average time per recommendation: 87ms
14/08/05 10:15:25 INFO eval.StatsCallable: Approximate memory used: 303MB / 641MB
14/08/05 10:15:25 INFO eval.StatsCallable: Unable to recommend in 9001 cases
14/08/05 10:15:46 INFO eval.StatsCallable: Average time per recommendation: 86ms
14/08/05 10:15:46 INFO eval.StatsCallable: Approximate memory used: 332MB / 641MB
14/08/05 10:15:46 INFO eval.StatsCallable: Unable to recommend in 13655 cases
14/08/05 10:15:49 INFO eval.AbstractDifferenceRecommenderEvaluator: Evaluation result: 0.8316465777957014
0.8316465777957014
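Replacing the fixed neighborhood size of 100 with 10 (see userNeighborhood) gives 0.8582162139625891. The score went up, and since a larger score means worse estimates, this is the wrong direction.
With 500 the result is 0.7558864506703784, the best of the three.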
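Why a larger score is worse becomes clear once the score is read as an average absolute difference between held-out ratings and the estimates for them. The following is a minimal plain-Java sketch of that calculation; the class name and the sample numbers are made up for illustration and it is not the evaluator's actual source.

```java
// Hypothetical helper, not part of Mahout: average absolute difference between
// the real ratings and the ratings a recommender estimated for the same items.
final class AverageAbsoluteDifferenceSketch {

    static double averageAbsoluteDifference(double[] actual, double[] estimated) {
        if (actual.length == 0 || actual.length != estimated.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            total += Math.abs(actual[i] - estimated[i]);
        }
        return total / actual.length;
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0};    // held-out ratings (illustrative values)
        double[] estimated = {3.5, 3.0, 4.0}; // estimates for the same items (illustrative values)
        // (0.5 + 0.0 + 1.0) / 3
        System.out.println(averageAbsoluteDifference(actual, estimated)); // prints 0.5
    }
}
```

Read this way, a score of about 0.76 means the estimates are off by roughly three quarters of a rating point on average, and 0 would mean perfect estimates.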
Running experiments like this on real data is essential when tuning a recommender.
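2. Threshold-based neighborhood
This is like drawing a circle with me at the center: everyone inside the circle is my neighbor. The radius is a similarity threshold rather than a head count.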
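The difference between the two kinds of neighborhood can be sketched in a few lines of plain Java. This is only an illustration under assumptions: NeighborhoodSketch, nearestN and aboveThreshold are hypothetical names, the similarity values are invented, and Mahout's own classes may treat edge cases such as undefined similarities differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: picks neighbors of one target user
// from a map of "other user ID -> similarity to the target user".
final class NeighborhoodSketch {

    /** Fixed-size neighborhood: the n most similar users, most similar first. */
    static List<Long> nearestN(Map<Long, Double> similarityToOthers, int n) {
        List<Map.Entry<Long, Double>> entries = new ArrayList<>(similarityToOthers.entrySet());
        entries.sort(Map.Entry.<Long, Double>comparingByValue().reversed());
        int size = Math.max(0, Math.min(n, entries.size()));
        List<Long> neighbors = new ArrayList<>();
        for (Map.Entry<Long, Double> entry : entries.subList(0, size)) {
            neighbors.add(entry.getKey());
        }
        return neighbors;
    }

    /** Threshold neighborhood: every user whose similarity is at least the threshold. */
    static List<Long> aboveThreshold(Map<Long, Double> similarityToOthers, double threshold) {
        List<Long> neighbors = new ArrayList<>();
        for (Map.Entry<Long, Double> entry : similarityToOthers.entrySet()) {
            if (entry.getValue() >= threshold) {
                neighbors.add(entry.getKey());
            }
        }
        return neighbors;
    }

    public static void main(String[] args) {
        Map<Long, Double> similarities = new LinkedHashMap<>();
        similarities.put(4L, 0.4);
        similarities.put(2L, 0.9);
        similarities.put(5L, 0.1);
        similarities.put(3L, 0.75);
        System.out.println(nearestN(similarities, 2));         // prints [2, 3]
        System.out.println(aboveThreshold(similarities, 0.7)); // prints [2, 3]
        System.out.println(aboveThreshold(similarities, 0.3)); // prints [4, 2, 3]
    }
}
```

A fixed size always returns the same number of neighbors no matter how dissimilar they are, while a threshold returns however many users happen to be close enough, which can be very few or very many.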
In buildRecommender, replace the NearestNUserNeighborhood with a threshold-based one:
new ThresholdUserNeighborhood(0.7, userSimilarity, dataModel);
Score: 0.7949910282938843
14/08/05 10:27:18 INFO file.FileDataModel: Creating FileDataModel for file C:\Users\ADMINI~1\AppData\Local\Temp\ratings.txt
14/08/05 10:27:19 INFO file.FileDataModel: Reading file info...
14/08/05 10:27:20 INFO file.FileDataModel: Processed 1000000 lines
14/08/05 10:27:21 INFO file.FileDataModel: Processed 2000000 lines
14/08/05 10:27:22 INFO file.FileDataModel: Processed 3000000 lines
14/08/05 10:27:23 INFO file.FileDataModel: Processed 4000000 lines
14/08/05 10:27:24 INFO file.FileDataModel: Processed 5000000 lines
14/08/05 10:27:26 INFO file.FileDataModel: Processed 6000000 lines
14/08/05 10:27:27 INFO file.FileDataModel: Processed 7000000 lines
14/08/05 10:27:27 INFO file.FileDataModel: Processed 8000000 lines
14/08/05 10:27:28 INFO file.FileDataModel: Processed 9000000 lines
14/08/05 10:27:29 INFO file.FileDataModel: Processed 10000000 lines
14/08/05 10:27:29 INFO file.FileDataModel: Read lines: 10000054
14/08/05 10:27:32 INFO model.GenericDataModel: Processed 10000 users
14/08/05 10:27:33 INFO model.GenericDataModel: Processed 20000 users
14/08/05 10:27:33 INFO model.GenericDataModel: Processed 30000 users
14/08/05 10:27:34 INFO model.GenericDataModel: Processed 40000 users
14/08/05 10:27:34 INFO model.GenericDataModel: Processed 50000 users
14/08/05 10:27:34 INFO model.GenericDataModel: Processed 60000 users
14/08/05 10:27:35 INFO model.GenericDataModel: Processed 69878 users
14/08/05 10:27:39 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation using 0.95 of GroupLensDataModel
14/08/05 10:27:39 INFO model.GenericDataModel: Processed 3530 users
14/08/05 10:27:39 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation of 3234 users
14/08/05 10:27:39 INFO eval.AbstractDifferenceRecommenderEvaluator: Starting timing of 3234 tasks in 4 threads
14/08/05 10:27:40 INFO eval.StatsCallable: Average time per recommendation: 94ms
14/08/05 10:27:40 INFO eval.StatsCallable: Approximate memory used: 625MB / 855MB
14/08/05 10:27:40 INFO eval.StatsCallable: Unable to recommend in 134 cases
14/08/05 10:28:03 INFO eval.StatsCallable: Average time per recommendation: 93ms
14/08/05 10:28:03 INFO eval.StatsCallable: Approximate memory used: 374MB / 781MB
14/08/05 10:28:03 INFO eval.StatsCallable: Unable to recommend in 3816 cases
14/08/05 10:28:28 INFO eval.StatsCallable: Average time per recommendation: 96ms
14/08/05 10:28:28 INFO eval.StatsCallable: Approximate memory used: 336MB / 781MB
14/08/05 10:28:28 INFO eval.StatsCallable: Unable to recommend in 7943 cases
14/08/05 10:28:51 INFO eval.StatsCallable: Average time per recommendation: 94ms
14/08/05 10:28:51 INFO eval.StatsCallable: Approximate memory used: 368MB / 664MB
14/08/05 10:28:51 INFO eval.StatsCallable: Unable to recommend in 11574 cases
14/08/05 10:28:57 INFO eval.AbstractDifferenceRecommenderEvaluator: Evaluation result: 0.7949910282938843
0.7949910282938843
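With a threshold of 0.9 the result is 0.8474542269736061. Getting this number kept the machine's CPU fully loaded.
With a threshold of 0.5 the result is 0.7409341920894663.
The smaller the score, the better the accuracy.
-- Be careful with your machine when you run this.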