# Overview

## The Linear Regression Model


1. Choose a hypothesis function.
2. To pick the best hypothesis, a reasonable evaluation criterion is needed; a loss function usually serves this role.
3. Derive the objective function from the loss function.
4. The problem then becomes finding the optimum of the objective function, i.e., optimizing it.
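The four steps above can be sketched in plain Scala. This is a toy illustration with made-up data; `hypothesis`, `loss`, and `objective` are illustrative names, not MLlib APIs:

```scala
object LinearRegressionSketch {
  // 1. Hypothesis: h(x) = w . x + b
  def hypothesis(w: Array[Double], b: Double, x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum + b

  // 2. Loss for a single example: squared error
  def loss(pred: Double, label: Double): Double = {
    val d = pred - label
    d * d
  }

  // 3. Objective: mean of the per-example losses over the data set
  def objective(w: Array[Double], b: Double,
                data: Seq[(Array[Double], Double)]): Double =
    data.map { case (x, y) => loss(hypothesis(w, b, x), y) }.sum / data.size

  def main(args: Array[String]): Unit = {
    // Both points lie exactly on y = 2*x1 + 1*x2 + 1, so the objective is 0
    val data = Seq((Array(1.0, 2.0), 5.0), (Array(2.0, 0.0), 5.0))
    println(objective(Array(2.0, 1.0), 1.0, data)) // prints 0.0
  }
}
```

Step 4, minimizing this objective, is what the gradient descent code below does.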

# Gradient Descent

## Regularization

How can these problems be addressed? One option is shrinkage methods, also known as regularization, chiefly ridge regression and the lasso. By adding a penalty constraint to the least-squares estimate, coefficient estimates are shrunk toward zero; the lasso in particular can drive some of them exactly to zero.
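The two penalized objectives can be sketched as follows (plain Scala, illustrative names; `regParam` plays the role of the penalty weight λ):

```scala
object ShrinkageSketch {
  // Residual sum of squares for weights w over (features, label) pairs
  def rss(w: Array[Double], data: Seq[(Array[Double], Double)]): Double =
    data.map { case (x, y) =>
      val d = w.zip(x).map { case (wi, xi) => wi * xi }.sum - y
      d * d
    }.sum

  // Ridge: RSS + regParam * ||w||_2^2  (shrinks coefficients toward 0)
  def ridgeObjective(w: Array[Double], data: Seq[(Array[Double], Double)],
                     regParam: Double): Double =
    rss(w, data) + regParam * w.map(wi => wi * wi).sum

  // Lasso: RSS + regParam * ||w||_1  (can drive coefficients exactly to 0)
  def lassoObjective(w: Array[Double], data: Seq[(Array[Double], Double)],
                     regParam: Double): Double =
    rss(w, data) + regParam * w.map(math.abs).sum
}
```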

# Implementing Linear Regression in Code

1. Hypothesis function: Y = A*X + B
2. Stochastic gradient descent
3. Ridge regression, lasso, or no regularization at all

`train` calls `run`; the logic of `run` is:

1. Use the optimization algorithm to find the optimal solution: `optimizer.optimize`
2. Build the corresponding regression model from that solution: `createModel`
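The two steps above can be sketched like this (illustrative types and signatures, not MLlib's actual ones):

```scala
object TrainRunSketch {
  trait Optimizer {
    def optimize(data: Seq[(Double, Array[Double])],
                 initialWeights: Array[Double]): Array[Double]
  }

  case class Model(weights: Array[Double], intercept: Double)

  // Step 2: wrap the optimum in a regression model
  def createModel(weights: Array[Double], intercept: Double): Model =
    Model(weights, intercept)

  // run: step 1 finds the optimum, step 2 builds the model from it
  def run(data: Seq[(Double, Array[Double])], optimizer: Optimizer,
          initialWeights: Array[Double]): Model = {
    val weights = optimizer.optimize(data, initialWeights)
    createModel(weights, intercept = 0.0)
  }
}
```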

```scala
def runMiniBatchSGD(
    data: RDD[(Double, Vector)],
    gradient: Gradient,
    updater: Updater,
    stepSize: Double,
    numIterations: Int,
    regParam: Double,
    miniBatchFraction: Double,
    initialWeights: Vector): (Vector, Array[Double]) = {

  val stochasticLossHistory = new ArrayBuffer[Double](numIterations)

  val numExamples = data.count()
  val miniBatchSize = numExamples * miniBatchFraction

  // if no data, return initial weights to avoid NaNs
  if (numExamples == 0) {
    return (initialWeights, stochasticLossHistory.toArray)
  }

  // Initialize weights as a column vector
  var weights = Vectors.dense(initialWeights.toArray)
  val n = weights.size

  /**
   * For the first iteration, the regVal will be initialized as sum of weight squares
   * if it's L2 updater; for L1 updater, the same logic is followed.
   */
  var regVal = updater.compute(
    weights, Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2

  for (i <- 1 to numIterations) {
    // Sample a subset (fraction miniBatchFraction) of the total data,
    // then compute and sum up the subgradients on this subset (one map-reduce)
    val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
      .aggregate((BDV.zeros[Double](n), 0.0))(
        seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
          val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
          (grad, loss + l)
        },
        combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
          (grad1 += grad2, loss1 + loss2)
        })

    /**
     * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
     * and regVal is the regularization value computed in the previous iteration as well.
     */
    stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
    val update = updater.compute(
      weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize, i, regParam)
    weights = update._1
    regVal = update._2
  }

  logInfo("GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses %s".format(
    stochasticLossHistory.takeRight(10).mkString(", ")))

  (weights, stochasticLossHistory.toArray)
}
```


The part of the code above that deserves the most attention is the use of the `aggregate` function. Start with its definition:

```scala
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = {
  // Clone the zero value since we will also be serializing it as part of tasks
  var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
  val cleanSeqOp = sc.clean(seqOp)
  val cleanCombOp = sc.clean(combOp)
  val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
  val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
  sc.runJob(this, aggregatePartition, mergeResult)
  jobResult
}
```


The `aggregate` function takes three arguments: an initial value `zeroValue`, a `seqOp`, and a `combOp`.

```scala
val z = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
z.aggregate(0)(math.max(_, _), _ + _)
// the result is 9
res0: Int = 9
```
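The same seqOp/combOp semantics can be shown without Spark: `seqOp` folds the elements within each partition starting from the zero value, and `combOp` merges the per-partition results. A plain-Scala sketch mimicking the two-partition example above:

```scala
object AggregateSketch {
  // Simulate a 2-partition RDD: each inner List is one partition
  val partitions = List(List(1, 2, 3), List(4, 5, 6))

  // seqOp (math.max) runs within each partition, starting from the zero value 0
  val perPartition = partitions.map(_.foldLeft(0)(math.max)) // List(3, 6)

  // combOp (+) then merges the per-partition results
  val result = perPartition.foldLeft(0)(_ + _)

  def main(args: Array[String]): Unit = println(result) // prints 9
}
```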


In `runMiniBatchSGD`, the per-example gradients come from a `Gradient` implementation; for linear regression this is `LeastSquaresGradient`:

```scala
class LeastSquaresGradient extends Gradient {
  override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
    val brzData = data.toBreeze
    val brzWeights = weights.toBreeze
    val diff = brzWeights.dot(brzData) - label
    val loss = diff * diff
    val gradient = brzData * (2.0 * diff)
    (Vectors.fromBreeze(gradient), loss)
  }

  override def compute(
      data: Vector,
      label: Double,
      weights: Vector,
      cumGradient: Vector): Double = {
    val brzData = data.toBreeze
    val brzWeights = weights.toBreeze
    // dot is the inner (dot) product: a binary operation taking two real vectors
    // and returning a real scalar -- the standard inner product of Euclidean space,
    // written a . b; the result is also called the scalar product
    val diff = brzWeights.dot(brzData) - label
    // the following line performs y += a * x (axpy), accumulating into cumGradient
    brzAxpy(2.0 * diff, brzData, cumGradient.toBreeze)
    diff * diff
  }
}
```


Breeze, Epic, and Puck are the three pillar projects of ScalaNLP; see www.scalanlp.org for details.
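The gradient used above follows from differentiating the squared-error loss: for loss = (w·x − y)², the derivative with respect to w is 2(w·x − y)·x. A plain-Scala check (no Breeze; illustrative names):

```scala
object GradientSketch {
  // loss(w) = (w . x - y)^2 ; gradient(w) = 2 * (w . x - y) * x
  def compute(x: Array[Double], y: Double,
              w: Array[Double]): (Array[Double], Double) = {
    val diff = w.zip(x).map { case (wi, xi) => wi * xi }.sum - y
    val gradient = x.map(_ * 2.0 * diff)
    (gradient, diff * diff)
  }

  def main(args: Array[String]): Unit = {
    val (g, l) = compute(Array(1.0, 2.0), 1.0, Array(1.0, 1.0))
    // w . x = 3, diff = 2, so loss = 4 and gradient = (4, 8)
    println((g.toList, l))
  }
}
```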

## The Regularization Step

```scala
val update = updater.compute(
  weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize, i, regParam)
```


```scala
class SquaredL2Updater extends Updater {
  override def compute(
      weightsOld: Vector,
      gradient: Vector,
      stepSize: Double,
      iter: Int,
      regParam: Double): (Vector, Double) = {
    // add up both updates from the gradient of the loss (= step) as well as
    // the gradient of the regularizer (= regParam * weightsOld)
    // w' = w - thisIterStepSize * (gradient + regParam * w)
    // w' = (1 - thisIterStepSize * regParam) * w - thisIterStepSize * gradient
    val thisIterStepSize = stepSize / math.sqrt(iter)
    val brzWeights: BV[Double] = weightsOld.toBreeze.toDenseVector
    brzWeights :*= (1.0 - thisIterStepSize * regParam)
    brzAxpy(-thisIterStepSize, gradient.toBreeze, brzWeights)
    val norm = brzNorm(brzWeights, 2.0)

    (Vectors.fromBreeze(brzWeights), 0.5 * regParam * norm * norm)
  }
}
```
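The update formula in the comments, w' = (1 − stepSize/√iter · regParam)·w − stepSize/√iter · gradient, can be checked without Breeze (plain Scala, illustrative names):

```scala
object L2UpdaterSketch {
  // One SGD step with an L2 penalty; returns (new weights, regularization value)
  def compute(weightsOld: Array[Double], gradient: Array[Double],
              stepSize: Double, iter: Int,
              regParam: Double): (Array[Double], Double) = {
    val thisIterStepSize = stepSize / math.sqrt(iter)
    // w' = (1 - thisIterStepSize * regParam) * w - thisIterStepSize * gradient
    val weights = weightsOld.zip(gradient).map { case (w, g) =>
      (1.0 - thisIterStepSize * regParam) * w - thisIterStepSize * g
    }
    // regVal = 0.5 * regParam * ||w'||_2^2
    val norm2 = weights.map(w => w * w).sum
    (weights, 0.5 * regParam * norm2)
  }
}
```

With `regParam = 0` this degenerates to a plain gradient step, matching the `SimpleUpdater` behavior described in the MLlib docs.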


## Prediction

```scala
class LinearRegressionModel (
    override val weights: Vector,
    override val intercept: Double)
  extends GeneralizedLinearModel(weights, intercept) with RegressionModel with Serializable {

  override protected def predictPoint(
      dataMatrix: Vector,
      weightMatrix: Vector,
      intercept: Double): Double = {
    weightMatrix.toBreeze.dot(dataMatrix.toBreeze) + intercept
  }
}
```


Note that the `LinearRegressionModel` constructor takes the weights and the intercept as arguments; predictions for new data points are made by calling `predictPoint`.
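The prediction itself is just a dot product plus the intercept, which can be sketched in plain Scala:

```scala
object PredictSketch {
  def predictPoint(features: Array[Double], weights: Array[Double],
                   intercept: Double): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept

  def main(args: Array[String]): Unit =
    // weights = (2, -1), intercept = 0.5, features = (3, 4) -> 2*3 - 1*4 + 0.5 = 2.5
    println(predictPoint(Array(3.0, 4.0), Array(2.0, -1.0), 0.5)) // prints 2.5
}
```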

# A Complete Example

```scala
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}

// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)
```


# Summary

posted @ 2014-08-15 20:04 徽沪一郎