kmeans

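If I had to write k-means myself, how would I do it?

First, the steps of the k-means algorithm are:

Randomly pick k points as the initial cluster centers, then compute the distance from every point to every center and assign each point to its nearest center.

Next, average the points that share the same center to obtain the next center of that cluster.

Then keep iterating until the centers converge.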
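The steps above can be sketched in plain Scala with no Spark involved. This is only a minimal illustration, not MLlib's implementation: the names `KMeansSketch`, `step`, and `run`, the `epsilon` convergence threshold, and the fixed `seed` are all assumptions made for the example.

```scala
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.indices.map { i => val d = a(i) - b(i); d * d }.sum

  // Index of the center closest to point p.
  def closest(p: Point, centers: Array[Point]): Int =
    centers.indices.minBy(i => squaredDistance(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point = {
    val dim = points.head.length
    val sum = new Array[Double](dim)
    for (p <- points; i <- 0 until dim) sum(i) += p(i)
    sum.map(_ / points.size)
  }

  // One iteration: assign every point to its nearest center, then average each cluster.
  // A center that attracts no points is kept unchanged (an assumption of this sketch).
  def step(points: Seq[Point], centers: Array[Point]): Array[Point] = {
    val clusters = points.groupBy(p => closest(p, centers))
    centers.indices.map(i => clusters.get(i).map(mean).getOrElse(centers(i))).toArray
  }

  // Iterate until no center moves by more than epsilon (squared distance)
  // or maxIterations is reached.
  def run(points: Seq[Point], k: Int, maxIterations: Int,
          epsilon: Double = 1e-8, seed: Long = 42L): Array[Point] = {
    val rng = new scala.util.Random(seed)
    var centers = rng.shuffle(points).take(k).map(_.clone()).toArray
    var converged = false
    var iter = 0
    while (!converged && iter < maxIterations) {
      val next = step(points, centers)
      converged = centers.indices.forall(i => squaredDistance(centers(i), next(i)) <= epsilon)
      centers = next
      iter += 1
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val points = Seq(Array(0.0, 0.0), Array(0.0, 1.0), Array(1.0, 0.0),
                     Array(10.0, 10.0), Array(10.0, 11.0), Array(11.0, 10.0))
    run(points, k = 2, maxIterations = 20).foreach(c => println(c.mkString("(", ", ", ")")))
  }
}
```

The two expensive parts are visible here: `closest` is called once per point per iteration, and `mean` touches every point once more.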

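So which steps can be computed in parallel?

There are two main pieces of computation here.

The first is computing the distance from every point to every center and picking the nearest center as the point's own.

The second is computing the mean of each cluster to obtain the centers for the next iteration.

What I have come up with so far:

Suppose there are 1,000,000 records split into 10 partitions, to be grouped into 5 clusters. First, distribute the k (here 5) centers to all 10 partitions. Then, within each partition, find the nearest of the 5 centers for every point. Finally, use reduceByKey to compute the next centers (reduceByKey first combines the values that share a key inside each partition, before anything is shuffled).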
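A rough way to picture this plan without a cluster is to model each partition as an ordinary Scala collection. The sketch below is not Spark code: `PartitionedStep`, `localSums`, and `nextCenters` are hypothetical names, and the per-partition pre-aggregation only imitates what reduceByKey does with its map-side combine. Since a mean cannot be merged directly, each cluster carries a (sum, count) pair instead.

```scala
object PartitionedStep {
  type Point = Array[Double]
  // (sum of points, number of points) accumulated for one cluster.
  type Acc = (Array[Double], Long)

  def squaredDistance(a: Point, b: Point): Double =
    a.indices.map { i => val d = a(i) - b(i); d * d }.sum

  def closest(p: Point, centers: Array[Point]): Int =
    centers.indices.minBy(i => squaredDistance(p, centers(i)))

  // Merge two partial results for the same cluster.
  // This plays the role of the function one would pass to reduceByKey.
  def merge(x: Acc, y: Acc): Acc =
    (x._1.zip(y._1).map { case (a, b) => a + b }, x._2 + y._2)

  // Work done inside one partition: key each point by its nearest center,
  // then pre-aggregate locally so only one Acc per cluster leaves the partition.
  def localSums(partition: Seq[Point], centers: Array[Point]): Map[Int, Acc] =
    partition
      .map(p => (closest(p, centers), (p, 1L)))
      .groupBy(_._1)
      .map { case (cluster, pairs) => cluster -> pairs.map(_._2).reduce(merge) }

  // Combine the per-partition results and divide sum by count to get the next centers.
  // A cluster that received no points keeps its old center.
  def nextCenters(partitions: Seq[Seq[Point]], centers: Array[Point]): Array[Point] = {
    val totals = partitions
      .flatMap(part => localSums(part, centers).toSeq)
      .groupBy(_._1)
      .map { case (cluster, pairs) => cluster -> pairs.map(_._2).reduce(merge) }
    centers.indices.map { i =>
      totals.get(i).map { case (sum, n) => sum.map(_ / n) }.getOrElse(centers(i))
    }.toArray
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq(Array(0.0, 0.0), Array(10.0, 10.0), Array(0.0, 1.0)),
      Seq(Array(1.0, 0.0), Array(10.0, 11.0), Array(11.0, 10.0)))
    val centers = Array(Array(0.0, 0.0), Array(10.0, 10.0))
    nextCenters(partitions, centers).foreach(c => println(c.mkString("(", ", ", ")")))
  }
}
```

With 5 clusters and 10 partitions, at most 50 small (sum, count) pairs have to be merged per iteration in this model, regardless of how many records there are.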

OK, now let's see how Spark does it.

First, KMeans calls the train method:

  def train(
      data: RDD[Vector],
      k: Int,
      maxIterations: Int,
      runs: Int,
      initializationMode: String): KMeansModel = {
    new KMeans().setK(k)
      .setMaxIterations(maxIterations)
      .setRuns(runs)
      .setInitializationMode(initializationMode)
      .run(data)
  }

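So this returns a KMeansModel. train mainly sets the number of clusters and the maximum number of iterations. setRuns sets how many times the algorithm is run (the runs execute in parallel and the best result is kept), and setInitializationMode chooses how the initial centers are picked.

The most important part here is the run method.

Let's look at run next: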

  def run(data: RDD[Vector]): KMeansModel = {
    if (data.getStorageLevel == StorageLevel.NONE) {
      logWarning("The input data is not directly cached, which may hurt performance if its"
        + " parent RDDs are also uncached.")
    }
    // Compute squared norms and cache them.
    // (note) this computes the 2-norm of each vector
    val norms = data.map(Vectors.norm(_, 2.0))
    norms.persist()
    // (note) zip each vector together with its 2-norm
    val zippedData = data.zip(norms).map { case (v, norm) =>
      new VectorWithNorm(v, norm)
    }
    // (note) this is where most of the work happens
    val model = runAlgorithm(zippedData)
    // (note) so you can unpersist explicitly; learned something new
    norms.unpersist()
    // Warn at the end of the run as well, for increased visibility.
    if (data.getStorageLevel == StorageLevel.NONE) {
      logWarning("The input data was not directly cached, which may hurt performance if its"
        + " parent RDDs are also uncached.")
    }
    model
  }
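Why bother computing and caching the norms up front? To my knowledge MLlib uses them to speed up distance computations, based on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b. The sketch below only illustrates that identity; `NormDistance` and its methods are hypothetical names, and this is not MLlib's actual code.

```scala
object NormDistance {
  // 2-norm: square root of the sum of squares.
  def norm2(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  // Dot product of two vectors of equal dimension.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Squared distance from precomputed norms:
  // ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b)
  def squaredDistance(a: Array[Double], normA: Double,
                      b: Array[Double], normB: Double): Double =
    normA * normA + normB * normB - 2.0 * dot(a, b)

  // Direct computation, for comparison.
  def direct(a: Array[Double], b: Array[Double]): Double =
    a.indices.map { i => val d = a(i) - b(i); d * d }.sum

  def main(args: Array[String]): Unit = {
    val a = Array(1.0, 2.0, 3.0)
    val b = Array(4.0, 6.0, 8.0)
    println(squaredDistance(a, norm2(a), b, norm2(b)))
    println(direct(a, b))
  }
}
```

The norm of each point never changes between iterations, so it only has to be computed once; per distance, just the dot product remains. One caveat: when two points are very close, the subtraction in this formula can lose precision compared with the direct computation.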

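Let me explain what Vectors.norm(_, 2.0) does.

It computes the 2-norm. How is a norm computed?

The general p-norm is ||x||_p = (|x_1|^p + ... + |x_n|^p)^(1/p).

So the 2-norm here is simply the square root of the sum of the squared values across all dimensions.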
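As a quick check of the definition, here is a small standalone version of the p-norm. It is a simplified sketch, not the MLlib source: `PNorm.pNorm` is a hypothetical helper that takes a plain `Array[Double]`, and only the infinity norm is special-cased.

```scala
object PNorm {
  // p-norm of a vector: (sum_i |x_i|^p)^(1/p), with p = infinity meaning max_i |x_i|.
  def pNorm(values: Array[Double], p: Double): Double = {
    require(p >= 1.0, s"p must be >= 1, got $p")
    if (p == Double.PositiveInfinity)
      values.foldLeft(0.0)((m, x) => math.max(m, math.abs(x)))
    else
      math.pow(values.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
  }

  def main(args: Array[String]): Unit = {
    val v = Array(3.0, -4.0)
    println(pNorm(v, 1.0))
    println(pNorm(v, 2.0))
    println(pNorm(v, Double.PositiveInfinity))
  }
}
```

For the vector (3, -4), the 1-norm is |3| + |-4| = 7, the 2-norm is sqrt(9 + 16) = 5, and the infinity norm is max(3, 4) = 4.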

While we are at it, here is the source of norm:

  def norm(vector: Vector, p: Double): Double = {
    require(p >= 1.0, "To compute the p-norm of the vector, we require that you specify a p>=1. " +
      s"You specified p=$p.")
    val values = vector match {
      case DenseVector(vs) => vs
      case SparseVector(n, ids, vs) => vs
      case v => throw new IllegalArgumentException("Do not support vector type " + v.getClass)
    }
    val size = values.length
    if (p == 1) {
      var sum = 0.0
      var i = 0
      while (i < size) {
        sum += math.abs(values(i))
        i += 1
      }
      sum
    } else if (p == 2) {
      var sum = 0.0
      var i = 0
      while (i < size) {
        sum += values(i) * values(i)
        i += 1
      }
      math.sqrt(sum)
    } else if (p == Double.PositiveInfinity) {
      var max = 0.0
      var i = 0
      while (i < size) {
        val value = math.abs(values(i))
        if (value > max) max = value
        i += 1
      }
      max
    } else {
      var sum = 0.0
      var i = 0
      while (i < size) {
        sum += math.pow(math.abs(values(i)), p)
        i += 1
      }
      math.pow(sum, 1.0 / p)
    }
  }

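Well, there isn't much to say here. The 1-norm, 2-norm, and infinity norm are the ones used most often, so the code handles these three as special cases.

Next, the main thing to analyze is the runAlgorithm function (the name is a bit crude, by the way; even runKmeans would have been better).

What this function does is essentially what I described above, except that it adds a few things I don't fully understand yet.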

posted on 2017-03-05 12:11 sunrye