Mahout Recommendation 11: Exploring User Neighborhoods

1. Fixed-size user neighborhoods

package mahout;

import java.io.File;
import java.io.IOException;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.LoadEvaluator;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import org.apache.mahout.cf.taste.similarity.precompute.example.GroupLensDataModel;

public class GroupLensDataModelTest {

	public static void main(String[] args) throws Exception {
		// Data set
		DataModel dataModel = new GroupLensDataModel(new File("data/ratings.dat"));
		// Evaluator that scores by average absolute difference
		RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
		// Builder for the recommender under evaluation
		RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {

			@Override
			public Recommender buildRecommender(DataModel dataModel) throws TasteException {
				// User similarity
				UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);

				// User neighborhood with a fixed size of 100
				UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(100, userSimilarity, dataModel);
				// User-based recommender
				return new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);
			}
		};
		// Evaluation score
		double score = evaluator.evaluate(recommenderBuilder, null, dataModel, 0.95, 0.05);
		System.out.println(score);
	}

	private static void recommend() throws IOException, TasteException {
		// Use the custom GroupLensDataModel when the data set has not been converted to CSV format
		DataModel dataModel = new GroupLensDataModel(new File(
				"data/ratings.dat"));
		// Pearson correlation coefficient, used to measure user similarity
		UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(
				dataModel);
		// Build a user neighborhood of 100 users
		UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(100,
				userSimilarity, dataModel);
		// Recommender
		Recommender recommender = new GenericUserBasedRecommender(dataModel,
				userNeighborhood, userSimilarity);
		// Run the load test
		LoadEvaluator.runLoad(recommender);
	}
}

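Here the user neighborhood size is 100. The last argument to evaluate, 0.05, is the evaluation percentage: only about 5% of the users take part in the evaluation, which is why the log below reports 3410 users after loading 69878. The 0.95 before it is the training percentage: for those users, 95% of each user's preferences are used to build the model under evaluation and the remaining 5% are held out for testing.

Output of the evaluation run: 0.8316465777957014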

14/08/05 10:14:20 INFO file.FileDataModel: Creating FileDataModel for file C:\Users\ADMINI~1\AppData\Local\Temp\ratings.txt
14/08/05 10:14:20 INFO file.FileDataModel: Reading file info...
14/08/05 10:14:21 INFO file.FileDataModel: Processed 1000000 lines
14/08/05 10:14:23 INFO file.FileDataModel: Processed 2000000 lines
14/08/05 10:14:24 INFO file.FileDataModel: Processed 3000000 lines
14/08/05 10:14:25 INFO file.FileDataModel: Processed 4000000 lines
14/08/05 10:14:27 INFO file.FileDataModel: Processed 5000000 lines
14/08/05 10:14:28 INFO file.FileDataModel: Processed 6000000 lines
14/08/05 10:14:29 INFO file.FileDataModel: Processed 7000000 lines
14/08/05 10:14:30 INFO file.FileDataModel: Processed 8000000 lines
14/08/05 10:14:31 INFO file.FileDataModel: Processed 9000000 lines
14/08/05 10:14:34 INFO file.FileDataModel: Processed 10000000 lines
14/08/05 10:14:34 INFO file.FileDataModel: Read lines: 10000054
14/08/05 10:14:34 INFO model.GenericDataModel: Processed 10000 users
14/08/05 10:14:35 INFO model.GenericDataModel: Processed 20000 users
14/08/05 10:14:35 INFO model.GenericDataModel: Processed 30000 users
14/08/05 10:14:35 INFO model.GenericDataModel: Processed 40000 users
14/08/05 10:14:36 INFO model.GenericDataModel: Processed 50000 users
14/08/05 10:14:39 INFO model.GenericDataModel: Processed 60000 users
14/08/05 10:14:40 INFO model.GenericDataModel: Processed 69878 users
14/08/05 10:14:41 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation using 0.95 of GroupLensDataModel
14/08/05 10:14:41 INFO model.GenericDataModel: Processed 3410 users
14/08/05 10:14:42 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation of 3120 users
14/08/05 10:14:42 INFO eval.AbstractDifferenceRecommenderEvaluator: Starting timing of 3120 tasks in 4 threads
14/08/05 10:14:42 INFO eval.StatsCallable: Average time per recommendation: 46ms
14/08/05 10:14:42 INFO eval.StatsCallable: Approximate memory used: 311MB / 755MB
14/08/05 10:14:42 INFO eval.StatsCallable: Unable to recommend in 0 cases
14/08/05 10:15:04 INFO eval.StatsCallable: Average time per recommendation: 86ms
14/08/05 10:15:04 INFO eval.StatsCallable: Approximate memory used: 279MB / 639MB
14/08/05 10:15:04 INFO eval.StatsCallable: Unable to recommend in 4540 cases
14/08/05 10:15:25 INFO eval.StatsCallable: Average time per recommendation: 87ms
14/08/05 10:15:25 INFO eval.StatsCallable: Approximate memory used: 303MB / 641MB
14/08/05 10:15:25 INFO eval.StatsCallable: Unable to recommend in 9001 cases
14/08/05 10:15:46 INFO eval.StatsCallable: Average time per recommendation: 86ms
14/08/05 10:15:46 INFO eval.StatsCallable: Approximate memory used: 332MB / 641MB
14/08/05 10:15:46 INFO eval.StatsCallable: Unable to recommend in 13655 cases
14/08/05 10:15:49 INFO eval.AbstractDifferenceRecommenderEvaluator: Evaluation result: 0.8316465777957014
0.8316465777957014
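
The score above is an average absolute difference: for every held-out preference, the evaluator compares the estimated rating with the actual one and averages the absolute gaps. A minimal sketch of that arithmetic in plain Java follows. It does not use Mahout, and the class name, method name, and the ratings in main are made up for illustration.

```java
class AverageAbsoluteDifferenceSketch {

    // Mean of |estimated - actual| over the held-out ratings; lower is better.
    static double averageAbsoluteDifference(double[] actual, double[] estimated) {
        if (actual.length != estimated.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            total += Math.abs(estimated[i] - actual[i]);
        }
        return total / actual.length;
    }

    public static void main(String[] args) {
        // Hypothetical held-out ratings and the recommender's estimates for them
        double[] actual = {4.0, 3.0, 5.0, 2.0};
        double[] estimated = {3.5, 3.0, 4.0, 3.0};
        System.out.println(averageAbsoluteDifference(actual, estimated)); // prints 0.625
    }
}
```

For these four made-up ratings the gaps are 0.5, 0, 1 and 1, so the result is 0.625. Read the same way, a score of 0.83 means the estimates are off by about 0.83 rating points on average.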

 

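Replacing the fixed neighborhood size of 100 with 10 (see userNeighborhood) gives an evaluation result of 0.8582162139625891. The score went up, and for this evaluator a larger score is worse, so this is a move in the wrong direction.

With 500 the result is 0.7558864506703784, which is better than both.

Running experiments like this on real data is essential when tuning a recommender.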
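
To make the effect of the size parameter concrete, here is a rough sketch of what a fixed-size neighborhood does: rank the other users by similarity to the target user and keep the top N. This is plain Java, not Mahout's NearestNUserNeighborhood implementation, and the class, method, user IDs and similarity values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NearestNSketch {

    // Returns the IDs of the n users most similar to the target, most similar first.
    static List<Long> nearestN(Map<Long, Double> similarityToTarget, int n) {
        List<Map.Entry<Long, Double>> ranked = new ArrayList<>(similarityToTarget.entrySet());
        // Sort by similarity, highest first
        ranked.sort(Map.Entry.<Long, Double>comparingByValue().reversed());
        List<Long> neighbors = new ArrayList<>();
        for (Map.Entry<Long, Double> entry : ranked) {
            if (neighbors.size() >= n) {
                break;
            }
            neighbors.add(entry.getKey());
        }
        return neighbors;
    }

    public static void main(String[] args) {
        // Hypothetical similarities between the target user and four other users
        Map<Long, Double> sims = new LinkedHashMap<>();
        sims.put(2L, 0.9);
        sims.put(3L, 0.4);
        sims.put(4L, 0.75);
        sims.put(5L, -0.2);
        System.out.println(nearestN(sims, 2)); // prints [2, 4]
    }
}
```

With a small N only the closest users contribute to the estimate; with a larger N weaker matches are included as well. Which one scores better depends on the data, as the runs with 10, 100 and 500 show.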

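2. Threshold-based neighborhoods

Think of drawing a circle centered on me: everyone inside it is my neighbor. Instead of a fixed count, the neighborhood contains every user whose similarity to me is at least the threshold.

new ThresholdUserNeighborhood(0.7, userSimilarity, dataModel);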
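
As a rough illustration of the circle idea, the sketch below keeps every user whose similarity to the target is at least the threshold, so the neighborhood size varies from user to user. It is plain Java rather than Mahout's ThresholdUserNeighborhood, and the class, method, user IDs and similarity values are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ThresholdNeighborhoodSketch {

    // Returns the IDs of all users whose similarity to the target is >= threshold.
    static List<Long> withinThreshold(Map<Long, Double> similarityToTarget, double threshold) {
        List<Long> neighbors = new ArrayList<>();
        for (Map.Entry<Long, Double> entry : similarityToTarget.entrySet()) {
            if (entry.getValue() >= threshold) {
                neighbors.add(entry.getKey());
            }
        }
        return neighbors;
    }

    public static void main(String[] args) {
        // Hypothetical similarities; a TreeMap keeps the output ordered by user ID
        Map<Long, Double> sims = new TreeMap<>();
        sims.put(2L, 0.95);
        sims.put(3L, 0.4);
        sims.put(4L, 0.75);
        sims.put(5L, 0.6);
        for (double threshold : new double[] {0.5, 0.7, 0.9}) {
            System.out.println(threshold + " -> " + withinThreshold(sims, threshold));
        }
    }
}
```

Raising the threshold shrinks the neighborhood: on this toy data 0.5 keeps three users, 0.7 keeps two and 0.9 keeps one.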

With the 0.7 threshold, the evaluation score is 0.7949910282938843

14/08/05 10:27:18 INFO file.FileDataModel: Creating FileDataModel for file C:\Users\ADMINI~1\AppData\Local\Temp\ratings.txt
14/08/05 10:27:19 INFO file.FileDataModel: Reading file info...
14/08/05 10:27:20 INFO file.FileDataModel: Processed 1000000 lines
14/08/05 10:27:21 INFO file.FileDataModel: Processed 2000000 lines
14/08/05 10:27:22 INFO file.FileDataModel: Processed 3000000 lines
14/08/05 10:27:23 INFO file.FileDataModel: Processed 4000000 lines
14/08/05 10:27:24 INFO file.FileDataModel: Processed 5000000 lines
14/08/05 10:27:26 INFO file.FileDataModel: Processed 6000000 lines
14/08/05 10:27:27 INFO file.FileDataModel: Processed 7000000 lines
14/08/05 10:27:27 INFO file.FileDataModel: Processed 8000000 lines
14/08/05 10:27:28 INFO file.FileDataModel: Processed 9000000 lines
14/08/05 10:27:29 INFO file.FileDataModel: Processed 10000000 lines
14/08/05 10:27:29 INFO file.FileDataModel: Read lines: 10000054
14/08/05 10:27:32 INFO model.GenericDataModel: Processed 10000 users
14/08/05 10:27:33 INFO model.GenericDataModel: Processed 20000 users
14/08/05 10:27:33 INFO model.GenericDataModel: Processed 30000 users
14/08/05 10:27:34 INFO model.GenericDataModel: Processed 40000 users
14/08/05 10:27:34 INFO model.GenericDataModel: Processed 50000 users
14/08/05 10:27:34 INFO model.GenericDataModel: Processed 60000 users
14/08/05 10:27:35 INFO model.GenericDataModel: Processed 69878 users
14/08/05 10:27:39 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation using 0.95 of GroupLensDataModel
14/08/05 10:27:39 INFO model.GenericDataModel: Processed 3530 users
14/08/05 10:27:39 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation of 3234 users
14/08/05 10:27:39 INFO eval.AbstractDifferenceRecommenderEvaluator: Starting timing of 3234 tasks in 4 threads
14/08/05 10:27:40 INFO eval.StatsCallable: Average time per recommendation: 94ms
14/08/05 10:27:40 INFO eval.StatsCallable: Approximate memory used: 625MB / 855MB
14/08/05 10:27:40 INFO eval.StatsCallable: Unable to recommend in 134 cases
14/08/05 10:28:03 INFO eval.StatsCallable: Average time per recommendation: 93ms
14/08/05 10:28:03 INFO eval.StatsCallable: Approximate memory used: 374MB / 781MB
14/08/05 10:28:03 INFO eval.StatsCallable: Unable to recommend in 3816 cases
14/08/05 10:28:28 INFO eval.StatsCallable: Average time per recommendation: 96ms
14/08/05 10:28:28 INFO eval.StatsCallable: Approximate memory used: 336MB / 781MB
14/08/05 10:28:28 INFO eval.StatsCallable: Unable to recommend in 7943 cases
14/08/05 10:28:51 INFO eval.StatsCallable: Average time per recommendation: 94ms
14/08/05 10:28:51 INFO eval.StatsCallable: Approximate memory used: 368MB / 664MB
14/08/05 10:28:51 INFO eval.StatsCallable: Unable to recommend in 11574 cases
14/08/05 10:28:57 INFO eval.AbstractDifferenceRecommenderEvaluator: Evaluation result: 0.7949910282938843
0.7949910282938843

 

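With a threshold of 0.9 the result is 0.8474542269736061. Getting this value ran the machine's CPU at full load.

With 0.5 the result is 0.7409341920894663.

The lower the evaluation score, the more accurate the recommender.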

 

-- Be careful with your machine when you run this.

posted @ 2014-08-04 14:04  jseven