# Spark0.9.0机器学习包MLlib-Optimization代码阅读

 1 package org.apache.spark.mllib.optimization
2
3 import org.jblas.DoubleMatrix
4
5 /**
6
7  * Class used to compute the gradient for a loss function, given a single data point.
8
9  */
10
11   abstract class Gradient extends Serializable {
12
13   /**
14
15    * Compute the gradient and loss given the features of a single data point.
16
17    * @param data - Feature values for one data point. Column matrix of size dx1
18
19    * where d is the number of features.
20
21    * @param label - Label for this data item.
22
23    * @param weights - Column matrix containing weights for every feature.
24
25    * @return A tuple of 2 elements. The first element is a column matrix containing the computed
26
27    * gradient and the second element is the loss computed at this data point.
28
29    */
30
31   def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix):
32
33       (DoubleMatrix, Double)
34
35 }

可以从上面的注释上看出compute的参数data是一个样本的特征(d*1维度)，label就是一个double型变量，该数据点(a single data point)的标签，weights就是特征变量的回归系数也是d*1维度，该函数返回2个东西，第1个是该样本点下计算的梯度，第2个该样本点下的损失loss

 1 /**
2
3  * Compute gradient and loss for a logistic loss function, as used in binary classification.
4
6
7  */
8
10
11   override def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix):
12
13       (DoubleMatrix, Double) = {
14
15     val margin: Double = -1.0 * data.dot(weights)
16
17     val gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label
18
20
21     val loss =
22
23       if (label > 0) {
24
25         math.log(1 + math.exp(margin))
26
27       } else {
28
29         math.log(1 + math.exp(margin)) - margin
30
31       }
32
34
35   }
36
37 }

我们知道对于log-loss的表达式loss=-[y*log(g(wx))+(1-y)*log(1-g(wx))]， 其中g(wx)=1/(1+exp(-wx))，二分类(0,1)，对这个loss进行求w偏导，d(loss)/d(w)=[g(wx)-y] * x  (为书写方便，用d代表偏导的符号了)，具体的表达式推导请移步http://www.cnblogs.com/kobedeshow/p/3340240.html

 1 /**
2
3  * Compute gradient and loss for a Least-squared loss function, as used in linear regression.
4
5  * This is correct for the averaged least squares loss function (mean squared error)
6
7  * L = 1/n ||A weights-y||^2
8
10
11  */
12
14
15   override def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix):
16
17       (DoubleMatrix, Double) = {
18
19     val diff: Double = data.dot(weights) - label
20
21     val loss = diff * diff
22
23     val gradient = data.mul(2.0 * diff)
24
27   }
28
29 }

leastsquares-loss的表达式 如注释所示：L = 1/n ||A weights-y||^2，这里n=1，文中代码的变量diff，就是f(wx)-y的值，损失loss=diff*diff，梯度就是data.mul(2.0 * diff)，注意.mul是DoubleMatrix(jblas)的一个方法，是元素跟矩阵的点乘，.mull是矩阵跟矩阵的乘法，.dot是向量的内积

 1 /**
2
3  * Compute gradient and loss for a Hinge loss function, as used in SVM binary classification.
4
6
7  * NOTE: This assumes that the labels are {0,1}
8
9  */
10
12
13   override def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix):
14
15       (DoubleMatrix, Double) = {
16
17     val dotProduct = data.dot(weights)
18
19     // Our loss function with {0, 1} labels is max(0, 1 - (2y – 1) (f_w(x)))
20
21     // Therefore the gradient is -(2y - 1)*x
22
23     val labelScaled = 2 * label - 1.0
24
25     if (1.0 > labelScaled * dotProduct) {
26
27       (data.mul(-labelScaled), 1.0 - labelScaled * dotProduct)
28
29     } else {
30
31       (DoubleMatrix.zeros(1, weights.length), 0.0)
32
33     }
35   }
37 }

hinge-loss的二分类(-1,1)的表达式是max(0，1- y * f(x))，代码中映射到(0,1)，变成max(0, 1 - (2y – 1) (f(x)))，这时候当样本错分的时候(也就是labelScaled * dotProduct<1)，梯度是data.mul(-labelScaled)，损失是1-labelScaled * dotProduct

Updater.scala文件

 1 /**
2
3  * Class used to perform steps (weight update) using Gradient Descent methods.
4
5  * For general minimization problems, or for regularized problems of the form
6
7  * min L(w) + regParam * R(w),
8
9  * the compute function performs the actual update step, when given some
10
11  * (e.g. stochastic) gradient direction for the loss L(w),
12
13  * and a desired step-size (learning rate).
14
15  *
16
17  * The updater is responsible to also perform the update coming from the
18
19  * regularization term R(w) (if any regularization is used).
20
21  */
22
23 abstract class Updater extends Serializable {
24
25   /**
26
27    * Compute an updated value for weights given the gradient, stepSize, iteration number and
28
29    * regularization parameter. Also returns the regularization value regParam * R(w)
30
31    * computed using the *updated* weights.
32
33    * @param weightsOld - Column matrix of size dx1 where d is the number of features.
34
35    * @param gradient - Column matrix of size dx1 where d is the number of features.
36
37    * @param stepSize - step size across iterations
38
39    * @param iter - Iteration number
40
41    * @param regParam - Regularization parameter
42
43    *
44
45    * @return A tuple of 2 elements. The first element is a column matrix containing updated weights,
46
47    * and the second element is the regularization value computed using updated weights.
48
49    */
50
51   def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix, stepSize: Double, iter: Int,
52
53       regParam: Double): (DoubleMatrix, Double)
54
55 }

compute的参数weightsOld是更新前的变量回归系数(d*1维)gradient是根据指定的损失函数计算出的当前梯度stepSize 是步长也就是学习速率，iter 迭代次数，regParam 是正则参数值，该函数返回2个东西，第1个是更新后的回归系数，第2个是更新后的regParam * R(w) 值。

 1 /**
2
3  * A simple updater for gradient descent *without* any regularization.
4
5  * Uses a step-size decreasing with the square root of the number of iterations.
6
7  */
8
9 class SimpleUpdater extends Updater {
10
11   override def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix,
12
13       stepSize: Double, iter: Int, regParam: Double): (DoubleMatrix, Double) = {
14
15     val thisIterStepSize = stepSize / math.sqrt(iter)
16
18
19     (weightsOld.sub(step), 0)
20
21   }
22
23 }

 1 /**
2
3  * Updater for L1 regularized problems.
4
5  * R(w) = ||w||_1
6
7  * Uses a step-size decreasing with the square root of the number of iterations.
8
9  * Instead of subgradient of the regularizer, the proximal operator for the
10
11  * L1 regularization is applied after the gradient step. This is known to
12
13  * result in better sparsity of the intermediate solution.
14
15  * The corresponding proximal operator for the L1 norm is the soft-thresholding
16
17  * function. That is, each weight component is shrunk towards 0 by shrinkageVal.
18
19  * If w > shrinkageVal, set weight component to w-shrinkageVal.
20
21  * If w < -shrinkageVal, set weight component to w+shrinkageVal.
22
23  * If -shrinkageVal < w < shrinkageVal, set weight component to 0.
24
25  * Equivalently, set weight component to signum(w) * max(0.0, abs(w) - shrinkageVal)
26
27  */
28
29 class L1Updater extends Updater {
30
31   override def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix,
32
33       stepSize: Double, iter: Int, regParam: Double): (DoubleMatrix, Double) = {
34
35     val thisIterStepSize = stepSize / math.sqrt(iter)
36
38
40
41     val newWeights = weightsOld.sub(step)
42
43     // Apply proximal operator (soft thresholding)
44
45     val shrinkageVal = regParam * thisIterStepSize
46
47     (0 until newWeights.length).foreach { i =>
48
49       val wi = newWeights.get(i)
50
51       newWeights.put(i, signum(wi) * max(0.0, abs(wi) - shrinkageVal))
52
53     }
54
55     (newWeights, newWeights.norm1 * regParam)
56
57   }
58
59 }

加了正则项之后，前几步都一样，然后关键是对后面的处理（后面的理论暂时还不太理解，可以参考http://freemind.pluskid.org/machine-learning/sparsity-and-some-basics-of-l1-regularization/），还是说代码步骤吧，变量shrinkageVal =regParam * thisIterStepSize(注意：要*thisIterStepSize，因为w -= a*gradient  里面的gradient包括L（w）还包括正则的R(w))，然后对加正则前更新的newWeights，上遍历每一个元素，直接对该元素赋值newWeights.put(i, signum(wi) * max(0.0, abs(wi) - shrinkageVal))，对应着代码注释的红体部分。

 1 /**
2
3  * Updater for L2 regularized problems.
4
5  * R(w) = 1/2 ||w||^2
6
7  * Uses a step-size decreasing with the square root of the number of iterations.
8
9  */
10
11 class SquaredL2Updater extends Updater {
12
13   override def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix,
14
15       stepSize: Double, iter: Int, regParam: Double): (DoubleMatrix, Double) = {
16
17     val thisIterStepSize = stepSize / math.sqrt(iter)
18
20
21     // add up both updates from the gradient of the loss (= step) as well as
22
23     // the gradient of the regularizer (= regParam * weightsOld)
24
25     val newWeights = weightsOld.mul(1.0 - thisIterStepSize * regParam).sub(step)
26
27     (newWeights, 0.5 * pow(newWeights.norm2, 2.0) * regParam)
28
29   }
30
31 }

L2正则项加入后，损失函数变为loss1=loss+1/2 *regParam* ||w||^2，按梯度下降的更新公式:w=w-学习速率 * (d(loss1)/d(w));后面的d(loss1)=d(loss1)/d(w) + d(1/2*regParam*||w||^2) / d(w)了，那么更新公式变成了w=w-学习速率*d(loss)/d(w)-学习速率*d(1/2*regParam*||w|| ^2)/d(w)=(1-学习速率*regParam)*w-学习速率*d(loss)/d(w)，这个也就对应了第25行代码的意思

  1 package org.apache.spark.mllib.optimization
2
3 import org.apache.spark.Logging
4
5 import org.apache.spark.rdd.RDD
6
7 import org.jblas.DoubleMatrix
8
9 import scala.collection.mutable.ArrayBuffer
10
11 /**
12
13  * Class used to solve an optimization problem using Gradient Descent.
14
16
17  * @param updater Updater to be used to update weights after every iteration.
18
19  */
20
22
23   extends Optimizer with Logging
24
25 {
26
27   private var stepSize: Double = 1.0
28
29   private var numIterations: Int = 100
30
31   private var regParam: Double = 0.0
32
33   private var miniBatchFraction: Double = 1.0
34
35   /**
36
37    * Set the initial step size of SGD for the first step. Default 1.0.
38
39    * In subsequent steps, the step size will decrease with stepSize/sqrt(t)
40
41    */
42
43   def setStepSize(step: Double): this.type = {
44
45     this.stepSize = step
46
47     this
48
49   }
50
51   /**
52
53    * Set fraction of data to be used for each SGD iteration.
54
55    * Default 1.0 (corresponding to deterministic/classical gradient descent)
56
57    */
58
59   def setMiniBatchFraction(fraction: Double): this.type = {
60
61     this.miniBatchFraction = fraction
62
63     this
64
65   }

66
67   /**
68
69    * Set the number of iterations for SGD. Default 100.
70
71    */
72
73   def setNumIterations(iters: Int): this.type = {
74
75     this.numIterations = iters
76
77     this
78
79   }
80
81   /**
82
83    * Set the regularization parameter. Default 0.0.
84
85    */
86
87   def setRegParam(regParam: Double): this.type = {
88
89     this.regParam = regParam
90
91     this
92
93   }
94
95   /**
96
97    * Set the gradient function (of the loss function of one single data example)
98
99    * to be used for SGD.
100
101    */
102
104
106
107     this
108
109   }
110
111   /**
112
113    * Set the updater function to actually perform a gradient step in a given direction.
114
115    * The updater is responsible to perform the update from the regularization term as well,
116
117    * and therefore determines what kind or regularization is used, if any.
118
119    */
120
121   def setUpdater(updater: Updater): this.type = {
122
123     this.updater = updater
124
125     this
126
127   }
128
129   def optimize(data: RDD[(Double, Array[Double])], initialWeights: Array[Double])
130
131     : Array[Double] = {
132
133     val (weights, stochasticLossHistory) = GradientDescent.runMiniBatchSGD(
134
135         data,
136
138
139         updater,
140
141         stepSize,
142
143         numIterations,
144
145         regParam,
146
147         miniBatchFraction,
148
149         initialWeights)
150
151     weights
152
153   }
154
155 }

  1 // Top-level method to run gradient descent.
2
3 object GradientDescent extends Logging {
4
5   /**
6
7    * Run stochastic gradient descent (SGD) in parallel using mini batches.
8
9    * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
10
11    * in order to compute a gradient estimate.
12
13    * Sampling, and averaging the subgradients over this subset is performed using one standard
14
15    * spark map-reduce in each iteration.
16
17    *
18
19    * @param data - Input data for SGD. RDD of the set of data examples, each of
20
21    * the form (label, [feature values]).
22
23    * @param gradient - Gradient object (used to compute the gradient of the loss function of
24
25    * one single data example)
26
27    * @param updater - Updater function to actually perform a gradient step in a given direction.
28
29    * @param stepSize - initial step size for the first step
30
31    * @param numIterations - number of iterations that SGD should be run.
32
33    * @param regParam - regularization parameter
34
35    * @param miniBatchFraction - fraction of the input data set that should be used for
36
37    * one iteration of SGD. Default value 1.0.
38
39    *
40
41    * @return A tuple containing two elements. The first element is a column matrix containing
42
43    * weights for every feature, and the second element is an array containing the
44
45    * stochastic loss computed for every iteration.
46
47    */
48
49   def runMiniBatchSGD(
50
51     data: RDD[(Double, Array[Double])],
52
54
55     updater: Updater,
56
57     stepSize: Double,
58
59     numIterations: Int,
60
61     regParam: Double,
62
63     miniBatchFraction: Double,
64
65     initialWeights: Array[Double]) : (Array[Double], Array[Double]) = {
66
67     val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
68
69     val nexamples: Long = data.count()
70
71     val miniBatchSize = nexamples * miniBatchFraction
72
73     // Initialize weights as a column vector
74
75     var weights = new DoubleMatrix(initialWeights.length, 1, initialWeights:_*)
76
77     var regVal = 0.0
78
79     for (i <- 1 to numIterations) {
80
81       // Sample a subset (fraction miniBatchFraction) of the total data
82
83       // compute and sum up the subgradients on this subset (this is one map-reduce)
84
85       val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i).map {
86
87         case (y, features) =>
88
89           val featuresCol = new DoubleMatrix(features.length, 1, features:_*)
90
92
94
95       }.reduce((a, b) => (a._1.addi(b._1), a._2 + b._2))
96
97       /**
98
99        * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
100
101        * and regVal is the regularization value computed in the previous iteration as well.
102
103        */
104
105       stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
106
107       val update = updater.compute(
108
109         weights, gradientSum.div(miniBatchSize), stepSize, i, regParam)
110
111       weights = update._1
112
113       regVal = update._2
114
115     }
116
125 }