总访问量: PV

DataScience && DataMining && BigData

推荐系统之最小二乘法ALS的Spark实现

1.ALS算法流程:

    初始化数据集和Spark环境---->

                   切分测试机和检验集------>

                          训练ALS模型------------>

                                  验证结果----------------->

                                        检验满足结果---->直接推荐商品,否则继续训练ALS模型

2.数据集的含义

     Rating是固定的ALS输入格式,要求是一个元组类型的数据,其中数值分别是如下的[Int,Int,Double],在建立数据集的时候,用户名和物品名需要采用数值代替

1 /**
2  * A more compact class to represent a rating than Tuple3[Int, Int, Double].
3  */
4 @Since("0.8.0")
5 case class Rating @Since("0.8.0") (
6     @Since("0.8.0") user: Int,
7     @Since("0.8.0") product: Int,
8     @Since("0.8.0") rating: Double)

  如下:第一列位用户编号,第二列位产品编号,第三列的评分Rating为Double类型

          

3.ALS的测试数据集源代码解读

     3.1ALS类的所有字段如下

@Since("0.8.0")
class ALS private (
    private var numUserBlocks: Int,
    private var numProductBlocks: Int,
    private var rank: Int,
    private var iterations: Int,
    private var lambda: Double,
    private var implicitPrefs: Boolean,  使用显式反馈ALS变量或隐式反馈
    private var alpha: Double,    ALS隐式反馈变化率用于控制每次拟合修正的幅度
    private var seed: Long = System.nanoTime()
  ) extends Serializable with Logging {

     3.2 ALS.train方法

 1 /**
 2    * Train a matrix factorization model given an RDD of ratings given by users to some products,
 3    * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
 4    * product of two lower-rank matrices of a given rank (number of features). To solve for these
 5    * features, we run a given number of iterations of ALS. This is done using a level of
 6    * parallelism given by `blocks`.
 7    *
 8    * @param ratings    RDD of (userID, productID, rating) pairs
 9    * @param rank       number of features to use  
10    * @param iterations number of iterations of ALS (recommended: 10-20)
11    * @param lambda     regularization factor (recommended: 0.01)
12    * @param blocks     level of parallelism to split computation into  将并行度分解为等级
13    * @param seed       random seed  随机种子
14    */
15   @Since("0.9.1")
16   def train(
17       ratings: RDD[Rating], //RDD序列由用户ID 产品ID和评分组成
18       rank: Int,    //模型中的隐藏因子数目
19       iterations: Int,  //算法迭代次数
20       lambda: Double,  //ALS正则化参数
21       blocks: Int,   //块
22       seed: Long
23     ): MatrixFactorizationModel = {
24     new ALS(blocks, blocks, rank, iterations, lambda, false, 1.0, seed).run(ratings)
25   }

     3.3 基于ALS算法的协同过滤推荐

 1 package com.bigdata.demo
 2 
 3 import org.apache.spark.{SparkContext, SparkConf}
 4 import org.apache.spark.mllib.recommendation.ALS
 5 import org.apache.spark.mllib.recommendation.Rating
 6 
 7 /**
 8   * Created by SimonsZhao on 3/30/2017.
 9   * ALS最小二乘法
10   */
11 object CollaborativeFilter {
12 
13   def main(args: Array[String]) {
14     //设置环境变量
15      val conf=new SparkConf().setMaster("local").setAppName("CollaborativeFilter ")
16     //实例化环境
17      val sc = new SparkContext(conf)
18     //设置数据集
19      val data =sc.textFile("E:/scala/spark/testdata/ALSTest.txt")
20     //处理数据
21      val ratings=data.map(_.split(' ') match{
22       //数据集的转换
23       case Array(user,item,rate) =>
24         //将数据集转化为专用的Rating
25         Rating(user.toInt,item.toInt,rate.toDouble)
26     })
27     //设置隐藏因子
28      val rank=2
29     //设置迭代次数
30      val numIterations=2
31     //进行模型训练
32      val model =ALS.train(ratings,rank,numIterations,0.01)
33     //为用户2推荐一个商品
34      val rs=model.recommendProducts(2,1)
35     //打印结果
36      rs.foreach(println)
37   }
38 
39 }

     展开代码可复制

 1 package com.bigdata.demo
 2 
 3 import org.apache.spark.{SparkContext, SparkConf}
 4 import org.apache.spark.mllib.recommendation.ALS
 5 import org.apache.spark.mllib.recommendation.Rating
 6 
 7 /**
 8   * Created by SimonsZhao on 3/30/2017.
 9   * ALS最小二乘法
10   */
11 object CollaborativeFilter {
12 
13   def main(args: Array[String]) {
14     //设置环境变量
15      val conf=new SparkConf().setMaster("local").setAppName("CollaborativeFilter ")
16     //实例化环境
17      val sc = new SparkContext(conf)
18     //设置数据集
19      val data =sc.textFile("E:/scala/spark/testdata/ALSTest.txt")
20     //处理数据
21      val ratings=data.map(_.split(' ') match{
22       //数据集的转换
23       case Array(user,item,rate) =>
24         //将数据集转化为专用的Rating
25         Rating(user.toInt,item.toInt,rate.toDouble)
26     })
27     //设置隐藏因子
28      val rank=2
29     //设置迭代次数
30      val numIterations=2
31     //进行模型训练
32      val model =ALS.train(ratings,rank,numIterations,0.01)
33     //为用户2推荐一个商品
34      val rs=model.recommendProducts(2,1)
35     //打印结果
36      rs.foreach(println)
37   }
38 
39 }
点击+复制代码

4.测试及分析

    根据结果分析为第2个用户推荐了编号为15的商品,预测评分为3.99

5.基于用户的推荐源代码(mllib)

注释的部分翻译:

  用户向用户推荐产品

  num返回多少产品。 返回的数字可能少于此值。

[[评分]]对象,每个对象包含给定的用户ID,产品ID和
  评分字段中的“得分”。 每个代表一个推荐的产品,并且它们被排序
  按分数,减少。 第一个返回的是预测最强的一个
  推荐给用户。 分数是一个不透明的值,表示强列推荐的产品。

 1   /**
 2    * Recommends products to a user.
 3    *
 4    * @param user the user to recommend products to
 5    * @param num how many products to return. The number returned may be less than this.
 6    * @return [[Rating]] objects, each of which contains the given user ID, a product ID, and a
 7    *  "score" in the rating field. Each represents one recommended product, and they are sorted
 8    *  by score, decreasing. The first returned is the one predicted to be most strongly
 9    *  recommended to the user. The score is an opaque value that indicates how strongly
10    *  recommended the product is.
11    */
12   @Since("1.1.0")
13   def recommendProducts(user: Int, num: Int): Array[Rating] =
14     MatrixFactorizationModel.recommend(userFeatures.lookup(user).head, productFeatures, num)
15       .map(t => Rating(user, t._1, t._2))

6.基于物品的推荐源代码(mllib)

注释的部分翻译:

  推荐用户使用产品,也就是说,这将返回最有可能的用户对产品感兴趣

  每个都包含用户ID,给定的产品ID和评分字段中的“得分”。

  每个代表一个推荐的用户,并且它们被排序按得分,减少。

  第一个返回的是预测最强的一个推荐给产品。

   分数是一个不透明的值,表示强烈推荐给用户。

 1   /**
 2    * Recommends users to a product. That is, this returns users who are most likely to be
 3    * interested in a product.
 4    *
 5    * @param product the product to recommend users to   给用户推荐的产品
 6    * @param num how many users to return. The number returned may be less than this. 返回个用户的个数
 7    * @return [[Rating]] objects, each of which contains a user ID, the given product ID, and a
 8    *  "score" in the rating field. Each represents one recommended user, and they are sorted
 9    *  by score, decreasing. The first returned is the one predicted to be most strongly
10    *  recommended to the product. The score is an opaque value that indicates how strongly
11    *  recommended the user is.
12    */
13   @Since("1.1.0")
14   def recommendUsers(product: Int, num: Int): Array[Rating] =
15     MatrixFactorizationModel.recommend(productFeatures.lookup(product).head, userFeatures, num)
16       .map(t => Rating(t._1, product, t._2))

 

END~

posted @ 2017-03-30 20:15  CJZhaoSimons  阅读(5457)  评论(0编辑  收藏  举报