# Paper Reading: A Brief Introduction to Weakly Supervised Learning

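incomplete supervision: only part of the training data is labeled; the aim is to use the unlabeled data to help training

inexact supervision: only coarse-grained labels are given, e.g., spam classification

inaccurate supervision: the given labels are noisy, e.g., labels collected by crowdsourcing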

#### Incomplete Supervision

Training data set: $D=\{(x_1,y_1),\cdots,(x_l,y_l),x_{l+1},\cdots,x_m\}$, where the first $l$ instances are labeled and the remaining $m-l$ are unlabeled.

active learning (with human intervention)

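Assuming the labeling cost depends only on the number of queries, the goal is to query as few labels as possible. Two criteria are used to select which instance to query: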

1. informativeness: how well an unlabeled instance helps reduce the uncertainty of a statistical model.

1.1 uncertainty sampling: train a single learner and query the unlabeled instance on which it is least confident.

1.2 query-by-committee: train multiple learners and query the unlabeled instance on which they disagree the most.

2. representativeness: how well an instance helps represent the structure of the input patterns.

2.1 these approaches aim to exploit the cluster structure of the unlabeled data.
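The two informativeness criteria above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the probability table and committee votes are hypothetical inputs, and vote entropy is just one assumed way to measure committee disagreement.

```python
from collections import Counter
import math


def least_confidence_query(probabilities):
    """Uncertainty sampling: pick the instance whose top class
    probability is lowest, i.e. where the single learner is least confident."""
    confidences = [max(p) for p in probabilities]
    return min(range(len(confidences)), key=confidences.__getitem__)


def vote_entropy(votes):
    """Entropy of the committee's label votes for one instance
    (an assumed disagreement measure; 0 means full agreement)."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def committee_query(committee_votes):
    """Query-by-committee: pick the instance the learners disagree on most."""
    scores = [vote_entropy(v) for v in committee_votes]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical class probabilities predicted by a single learner
probs = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
# Hypothetical label votes from a committee of three learners
votes = [["a", "a", "a"], ["a", "b", "a"], ["b", "b", "b"]]

uncertain_idx = least_confidence_query(probs)  # instance to send to the human labeler
disputed_idx = committee_query(votes)
```

In both cases the selected index is the one whose label would be requested from the human oracle before retraining.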

semi-supervised learning (no human intervention is assumed)

Although the unlabeled data points carry no explicit label information, they implicitly convey information about the data distribution, which can help predictive modelling.

Two basic assumptions: the cluster assumption (the data have an inherent cluster structure) and the manifold assumption (the data lie on a manifold).

1. generative methods

The labels of unlabeled instances are treated as missing values of the model parameters and estimated with approaches such as the EM (expectation-maximization) algorithm.

To get good performance, one usually needs domain knowledge to choose an adequate generative model.

2. graph based methods

Performance depends heavily on how the graph is constructed.

3. low-density separation methods

S3VMs (semi-supervised support vector machines) try to identify a classification boundary that passes through low-density regions while keeping the labeled data correctly classified.

4. disagreement-based methods

Generate multiple learners and let them collaborate to exploit the unlabeled data.
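As an illustration of the graph-based methods (item 2 above), here is a minimal label-propagation sketch. It is one simple variant under stated assumptions, not a method taken from the paper: the graph is a dense RBF similarity graph, and the 1-D toy points, `sigma`, and the function names are all hypothetical.

```python
import math


def rbf_weights(points, sigma):
    """Build a dense similarity graph: w_ij = exp(-(x_i - x_j)^2 / (2 sigma^2))."""
    n = len(points)
    return [
        [
            0.0 if i == j
            else math.exp(-((points[i] - points[j]) ** 2) / (2 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]


def propagate(weights, labels, n_iter=100):
    """Repeatedly replace each unlabeled node's score with the weighted
    average of its neighbors' scores; labeled nodes stay clamped.

    labels: +1 / -1 for labeled nodes, None for unlabeled nodes.
    Returns one real-valued score per node; its sign is the predicted label.
    """
    scores = [0.0 if y is None else float(y) for y in labels]
    for _ in range(n_iter):
        new_scores = []
        for i, y in enumerate(labels):
            if y is not None:
                new_scores.append(float(y))  # keep the known label fixed
                continue
            total = sum(weights[i])
            new_scores.append(
                sum(w * s for w, s in zip(weights[i], scores)) / total
            )
        scores = new_scores
    return scores


# Toy 1-D data: two clusters with one labeled point in each (hypothetical values)
points = [0.0, 0.2, 0.4, 3.0, 3.2, 3.4]
labels = [1, None, None, None, None, -1]

scores = propagate(rbf_weights(points, sigma=0.5), labels)
predicted = [1 if s > 0 else -1 for s in scores]
```

The choice of `sigma` decides which points count as neighbors, so it is one concrete place where the graph construction affects the result.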

#### Inexact Supervision

Multi-instance learning: predict the labels of unseen bags. A bag $X_i$ is positive if it contains at least one positive instance $x_{ip}$, where the index $p$ is unknown.
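The bag-labeling rule above can be written down directly. This sketch only shows how a bag label follows from instance labels at prediction time; the threshold classifier `is_positive` and the toy bags are hypothetical, and learning the instance classifier from bag labels alone is not covered here.

```python
def bag_label(instance_labels):
    """A bag is positive if at least one of its instances is positive."""
    return any(instance_labels)


def predict_bag(bag, instance_classifier):
    """Label an unseen bag by applying an instance-level classifier
    to every instance and combining the results with the rule above."""
    return bag_label(instance_classifier(x) for x in bag)


def is_positive(x):
    """Hypothetical instance classifier: positive if the value exceeds 0.5."""
    return x > 0.5


positive_bag = [0.1, 0.9, 0.3]  # contains one positive instance
negative_bag = [0.1, 0.2, 0.3]  # contains none
```

Note that the rule never needs to know which instance made the bag positive, which matches the unknown index $p$.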

#### Inaccurate Supervision

In machine learning, crowdsourcing is commonly used as a cost-saving way to collect labels for training data.
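Because several workers may label the same item and some of them will be wrong, the collected labels need to be aggregated. Majority voting is assumed here as the simplest possible aggregation rule; the notes above do not prescribe one, and the items and labels are hypothetical.

```python
from collections import Counter


def majority_vote(worker_labels):
    """Return the most frequent label among the workers;
    ties go to the label that appears first."""
    return Counter(worker_labels).most_common(1)[0][0]


# Hypothetical labels from three crowd workers per item
crowd_labels = {
    "item_1": ["spam", "spam", "ham"],
    "item_2": ["ham", "ham", "ham"],
    "item_3": ["ham", "spam", "spam"],
}

aggregated = {item: majority_vote(v) for item, v in crowd_labels.items()}
```

Majority voting treats all workers as equally reliable, so it is only a baseline when worker quality varies.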

