# 1. Algorithms

• Combine multiple binary base classifiers using the One-vs-Rest (or another decomposition) strategy;

## One-vs-Rest

The AdaBoost.MH algorithm was proposed by Schapire (an author of AdaBoost) and Singer. Its basic idea mirrors AdaBoost: adaptively adjust a weight distribution, here over sample-label pairs. Given training samples $\langle (x_1, Y_1), \cdots, (x_m, Y_m) \rangle$, where each instance $x_i \in \mathcal{X}$ has a label set $Y_i \subseteq \mathcal{Y}$, the algorithm works as follows:

$Y[\ell] = \begin{cases} +1 & \ell \in Y \\ -1 & \ell \notin Y \end{cases}$
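As a small sketch of this encoding (the label names and label sets below are made up for illustration), the ±1 matrix $Y_i[\ell]$ can be built directly with numpy:

```python
import numpy as np

labels = ["sports", "politics", "tech"]          # hypothetical label space Y
label_sets = [{"sports"}, {"politics", "tech"}]  # label sets Y_1, Y_2

# Y[i, l] = +1 if label l is in Y_i, else -1
Y = np.array([[1 if l in s else -1 for l in labels] for s in label_sets])
print(Y)  # [[ 1 -1 -1], [-1  1  1]]
```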

$Z_t$ is the normalization factor of each iteration, chosen so that all weights in the distribution matrix $D$ sum to 1:

$Z_t = \sum_{i=1}^{m} \sum_{\ell \in \mathcal{Y}} D_t(i, \ell) \exp\left(-\alpha_t Y_i[\ell]\, h_t(x_i, \ell)\right)$
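A minimal sketch of one round of this weight update, assuming a precomputed ±1 weak-hypothesis matrix `h` and a fixed `alpha` (all variable names here are illustrative, not from any library):

```python
import numpy as np

m, L = 4, 3                             # number of samples, number of labels
rng = np.random.default_rng(0)
Y = rng.choice([-1, 1], size=(m, L))    # +/-1 label matrix Y_i[l]
h = rng.choice([-1, 1], size=(m, L))    # weak hypothesis values h_t(x_i, l)
alpha = 0.5

D = np.full((m, L), 1.0 / (m * L))      # uniform initial distribution over sample-label pairs
# AdaBoost.MH update: D_{t+1}(i,l) = D_t(i,l) * exp(-alpha * Y[i,l] * h[i,l]) / Z_t
unnorm = D * np.exp(-alpha * Y * h)
Z = unnorm.sum()                        # normalization factor Z_t
D = unnorm / Z
print(D.sum())                          # 1.0 (weights sum to one again)
```

Correctly classified pairs ($Y_i[\ell]\,h_t(x_i,\ell) = +1$) get their weight shrunk, misclassified pairs get it boosted, exactly as in binary AdaBoost.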

## ML-KNN

ML-KNN (multi-label k-nearest neighbor) builds on the KNN algorithm: given the label information of an instance $t$'s k nearest neighbors, it uses Maximum A Posteriori (MAP) estimation to decide whether $t$ should receive label $\ell$. Here $H_1^{\ell}$ ($H_0^{\ell}$) is the event that $t$ does (does not) have label $\ell$, and $E_{C_t(\ell)}^{\ell}$ is the event that exactly $C_t(\ell)$ of the k neighbors carry label $\ell$:

$y_t(\ell) = \mathop{\arg\max}_{b \in \{0,1\}} P(H_b^{\ell} \mid E_{C_t(\ell)}^{\ell})$

By Bayes' rule, and since the evidence term $P(E_{C_t(\ell)}^{\ell})$ does not depend on $b$, this is equivalent to

$y_t(\ell) = \mathop{\arg\max}_{b \in \{0,1\}} P(H_b^{\ell})\, P(E_{C_t(\ell)}^{\ell} \mid H_b^{\ell})$
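A minimal sketch of this MAP decision for a single label $\ell$ with $k = 3$; the prior and likelihood tables below are made-up numbers (real ML-KNN estimates them from training data with Laplace smoothing):

```python
# hypothetical estimated quantities for one label l
prior = {1: 0.3, 0: 0.7}            # P(H_1), P(H_0)
# likelihood[b][c] = P(exactly c of the k=3 neighbors have label l | H_b)
likelihood = {
    1: [0.10, 0.20, 0.30, 0.40],
    0: [0.50, 0.30, 0.15, 0.05],
}

def map_decision(c):
    """Return argmax over b of P(H_b) * P(E_c | H_b)."""
    return max((0, 1), key=lambda b: prior[b] * likelihood[b][c])

print(map_decision(0))  # 0: few neighbors carry the label
print(map_decision(3))  # 1: most neighbors carry the label
```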

# 2. Experiments

| Method | Hamming loss | Precision | Recall | F1 |
|--------|--------------|-----------|--------|----|
| LR+OvR | 0.0569 | 0.6252 | 0.5586 | 0.5563 |
| ML-KNN | 0.0652 | 0.6204 | 0.6535 | 0.5977 |

(Precision, recall, and F1 are sample-averaged, matching `average='samples'` in the evaluation code.)

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train, y_train = load_svmlight_file('tmc2007_train.svm', dtype=np.float64, multilabel=True)
X_test, y_test = load_svmlight_file('tmc2007_test.svm', dtype=np.float64, multilabel=True)

# convert multi-label tuples to a binary indicator matrix
mb = MultiLabelBinarizer()
y_train = mb.fit_transform(y_train)
y_test = mb.transform(y_test)  # reuse the binarizer fitted on the training labels

# LR + OvR
clf = OneVsRestClassifier(LogisticRegression(), n_jobs=10)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# multi-label classification metrics
loss = metrics.hamming_loss(y_test, y_pred)
prf = metrics.precision_recall_fscore_support(y_test, y_pred, average='samples')
```
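For intuition about the Hamming loss metric, a toy example on a hand-made prediction matrix (numbers unrelated to the results above):

```python
import numpy as np

y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 1, 1],
                   [0, 1, 0]])

# Hamming loss: fraction of label entries that disagree (here 1 of 6)
loss = (y_true != y_pred).mean()
print(loss)  # 0.1666...
```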

"""
ML-KNN for multilabel classification
"""

clf = MLkNN(k=15)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

```scala
// AdaBoost.MH for multi-label classification
// (snippet: assumes `sc` (SparkContext), `learner`, `params`, `predLabels`,
// `goldLabels`, and `DataUtils` are defined earlier in the program)
import org.apache.spark.mllib.evaluation.MultilabelMetrics

val labels0Based = true
val binaryProblem = false

learner.setNumIterations(params.numIterations) // 500 iterations
learner.setNumDocumentsPartitions(params.numDocumentsPartitions)
learner.setNumFeaturesPartitions(params.numFeaturesPartitions)
learner.setNumLabelsPartitions(params.numLabelsPartitions)
val classifier = learner.buildModel(params.input, labels0Based, binaryProblem)

val testPath = "./tmc2007_test.svm"
val numRows = DataUtils.getNumRowsFromLibSvmFile(sc, testPath)
val testRdd = DataUtils.loadLibSvmFileFormatDataAsList(sc, testPath, labels0Based, binaryProblem, 0, numRows, -1)
val results = classifier.classifyWithResults(sc, testRdd, 20)

// pair predicted and gold label arrays, then hand them to Spark's evaluator
val predAndLabels = sc.parallelize(predLabels.zip(goldLabels)
  .map(t => (t._1.map(_.toDouble), t._2.map(_.toDouble))))
val metrics = new MultilabelMetrics(predAndLabels)
```


# 3. References

[1] Schapire, Robert E., and Yoram Singer. "BoosTexter: A boosting-based system for text categorization." Machine learning 39.2-3 (2000): 135-168.
[2] Zhang, Min-Ling, and Zhi-Hua Zhou. "ML-KNN: A lazy learning approach to multi-label learning." Pattern recognition 40.7 (2007): 2038-2048.
[3] Building a Recommendation Engine Based on PredictionIO, and Exploring Large-Scale Multi-Label Classification.

posted @ 2018-10-17 17:29 Treant