- the input may be preprocessed
- e.g. transforming images to a fixed size
- also called feature extraction
- speeds up computation
- makes the problem easier to solve
- supervised learning
- classification - discrete output
- regression - continuous output
- unsupervised learning
- visualization
- clustering
- density estimation
...
- reinforcement learning: the technique of finding suitable actions to take in a given situation in order to maximize the reward
- functions that are linear in the unknown parameters are called linear models
- the polynomial model \(y(x, \mathbf{\beta}) = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_M x^M\) is nonlinear in the input x but linear in the unknown parameters \(\mathbf{\beta}\)
- implementation issues:
- the values of the coefficients/parameters/weights are determined by fitting the data, i.e. by minimizing an error (cost) function
- choosing the order M is a problem of model selection/comparison
- overfitting
- FACT: as M increases, the magnitude of the coefficients typically gets larger
- regularization to fight overfitting
- involves adding a penalty term to the error function (1.2) in order to discourage the coefficients from reaching large values
- add \(\lambda / 2 \, \| \mathbf{\beta} \|^2\), where \(\lambda\) controls the importance of the regularization term
- with a quadratic regularizer this is called ridge regression
- known as weight decay in neural networks
- \(\lambda\) can suppress over-fitting by shrinking the weights, but if \(\lambda\) is too large the weights are driven towards 0, leading to a poor fit
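A minimal numpy sketch of such a regularized fit (the toy data, the order M, and the value of \(\lambda\) below are made-up illustration values, not taken from the text): minimizing the sum-of-squares error plus the quadratic penalty \(\lambda/2 \, \|\mathbf{\beta}\|^2\) has a closed-form ridge solution.

```python
import numpy as np

# Toy data: noisy samples of sin(2*pi*x), an assumed example for illustration
rng = np.random.default_rng(0)
N = 10
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

M = 9          # polynomial order
lam = 1e-3     # regularization strength lambda

# Design matrix: Phi[n, j] = x_n^j for j = 0..M
Phi = np.vander(x, M + 1, increasing=True)

# Minimize sum_n (y_n - Phi_n . beta)^2 / 2 + lam/2 * ||beta||^2
# Closed-form (ridge) solution: beta = (Phi^T Phi + lam * I)^(-1) Phi^T y
beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y)
print("fitted coefficients:", beta)

# lam = 0 recovers the unregularized least-squares fit, whose coefficients tend to
# blow up in magnitude for large M; increasing lam shrinks them towards 0.
```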
probability theory#
- probability is introduced in order to make the treatment of uncertainty more scientific/rigorous
- expectation
- def: the weighted average of a function
- if we are given a finite number N of points drawn from the probability distribution or probability density, the expectation can be approximated as
- \(E[f] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n)\)
- consider the expectation of functions of several variables, e.g. f(x,y)
- the expectation carries a subscript indicating which variable it is taken with respect to:
- \(E_x[f(x,y)]\) and \(E_y[f(x,y)]\)
- variance: \(Var[x] = E[x^2] - E[x]^2\)
- for functions of several variables, e.g. two random variables (or vectors) x and y
- covariance: \(\mathrm{cov}[\mathbf{x}, \mathbf{y}] = E_{\mathbf{x},\mathbf{y}}[\{\mathbf{x} - E[\mathbf{x}]\}\{\mathbf{y}^T - E[\mathbf{y}^T]\}]\)
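A short numpy sketch of these finite-sample estimates (the standard-normal samples and the toy function f below are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
x = rng.normal(size=N)            # N points drawn from a distribution (here standard normal)

def f(t):
    return t ** 2                 # some function f whose expectation we want

# E[f] ~ (1/N) * sum_n f(x_n)
E_f = np.mean(f(x))                                 # ~ 1 for f(x) = x^2 under N(0, 1)

# Var[x] = E[x^2] - E[x]^2
var_x = np.mean(x ** 2) - np.mean(x) ** 2           # ~ 1

# Scalar covariance: cov[x, y] = E[xy] - E[x]E[y]
y = 0.5 * x + rng.normal(size=N)
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)   # ~ 0.5 by construction

print(E_f, var_x, cov_xy)
```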
interpretation of probabilities#
- the popular view: the classical or frequentist interpretation
- the probability P(A) of an uncertain event A is defined by the frequency of that event based on previous observations
- another point of view: the Bayesian view - probabilities provide a quantification of uncertainty
- for a future event we have no historical database, so we cannot count frequencies
- but we can measure the belief in a statement \(a\) based on some 'knowledge' K, denoted P(a|K); different K can give different P(a|K), and even the same K can give different P(a|K) -- the belief is subjective
- Bayes rule
- consider conditional probabilities
- \(P(A|B) = P(B|A) P(A)/P(B)\)
- interpretation: updating our belief about a hypothesis A in the light of new evidence B
- in terms of the likelihood, it is the belief about the output y (A) given the input values and parameters (B)
- P(A|B): posterior belief
- P(A): prior belief
- P(B|A): likelihood, i.e. how probable the evidence B (the observed data) is if the hypothesis A is true
- P(B) is computed by marginalisation: \(P(B) = \sum_{i} P(B|A_i) P(A_i)\)
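A tiny worked example of this update rule (the two hypotheses and all the numbers below are made up purely for illustration):

```python
# Two competing hypotheses A1, A2 with prior beliefs, and evidence B with known likelihoods
prior = {"A1": 0.7, "A2": 0.3}        # P(A_i)
likelihood = {"A1": 0.2, "A2": 0.9}   # P(B | A_i)

# P(B) by marginalisation: sum_i P(B | A_i) * P(A_i)
p_B = sum(likelihood[a] * prior[a] for a in prior)

# Posterior P(A_i | B) = P(B | A_i) * P(A_i) / P(B)
posterior = {a: likelihood[a] * prior[a] / p_B for a in prior}
print(posterior)   # the prior beliefs updated in the light of the evidence B
```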
- in machine learning, Bayes' theorem is used to convert a prior probability \(P(A) = P(\mathbf{\beta})\) into a posterior probability \(P(A|B) = P(\mathbf{\beta}|\mathbf{y})\) by incorporating the evidence provided by the observed data
- for \(\mathbf{\beta}\) in the polynomial curve fitting model, we can take an approach based on Bayes' theorem:
- \(P(\mathbf{\beta} | \mathbf{y}) = \frac{P(\mathbf{y}|\mathbf{\beta}) P(\mathbf{\beta})}{P(\mathbf{y})}\)
- given the data \(\{y_1, y_2, \dots\}\) we want to know \(\mathbf{\beta}\), but we cannot obtain it directly; \(P(\mathbf{\beta} | \mathbf{y})\) := posterior probability
- \(P(\beta)\):= prior probability; our assumption of \(\beta\)
- \({P(\mathbf{y})}\):= normalization constant since the given data is fixed
- \(P(\mathbf{y}|\mathbf{\beta})\):= likelihood function;
- can be viewed as a function of the parameters \(\beta\)
- not a probability distribution over \(\beta\), so its integral w.r.t. \(\beta\) need not equal 1
- state Bayes' theorem as: posterior \(\propto\) likelihood × prior, where all of these are viewed as functions of the parameters \(\beta\)
- integrating both sides with respect to \(\beta\): \(P(\mathbf{y}) = \int P(\mathbf{y}|\mathbf{\beta}) P(\mathbf{\beta}) \, d\mathbf{\beta}\)
- issue: particularly the need to marginalize (sum or integrate) over the whole of parameter space
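To make the marginalization concrete, here is a brute-force grid sketch of the posterior for a single parameter \(\beta\) in a toy model \(y = \beta x + \text{noise}\) (the model, the Gaussian prior, and all the numbers are illustrative assumptions, not the polynomial setup above):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 20)
y = 2.0 * x + rng.normal(scale=0.2, size=x.size)   # data generated with true beta = 2

beta_grid = np.linspace(-1, 5, 1001)               # discretize parameter space
sigma = 0.2                                        # assumed known noise level

# Prior P(beta): Gaussian with mean 0 and std 2 (an illustrative assumption)
prior = np.exp(-beta_grid ** 2 / (2 * 2.0 ** 2))

# Likelihood P(y | beta) under Gaussian noise, evaluated for every beta on the grid
resid = y[None, :] - beta_grid[:, None] * x[None, :]
log_lik = -np.sum(resid ** 2, axis=1) / (2 * sigma ** 2)
likelihood = np.exp(log_lik - log_lik.max())       # rescaled for numerical stability

# posterior is proportional to likelihood * prior; P(y) is the integral over beta,
# approximated here by a sum over the grid
unnorm = likelihood * prior
d_beta = beta_grid[1] - beta_grid[0]
posterior = unnorm / (unnorm.sum() * d_beta)

print("approximate posterior mean of beta:", np.sum(beta_grid * posterior) * d_beta)
```

Even with a single parameter this requires sweeping the whole grid to compute \(P(\mathbf{y})\); with many parameters the cost of this marginalization grows rapidly, which is the practical issue noted above.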
different views of the likelihood function#
- likelihood function: \(P(\mathbf{y}|\mathbf{\beta})\)
- from the frequentist interpretation:
- the parameter \(\beta\) is a fixed quantity whose value is determined by some 'estimator'
- a widely used frequentist estimator is maximum likelihood, in which \(\beta\) is set to the value that maximizes the likelihood function
- ie. choosing \(\beta\) s.t. probability of the observed data is maximized
- in practice we use the negative log of the likelihood function as the error function; since the negative log is monotonically decreasing, maximizing the likelihood is equivalent to minimizing this error
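- as a sketch of why this works (assuming Gaussian observation noise with variance \(\sigma^2\), which these notes do not state explicitly): if \(y_n = y(x_n, \mathbf{\beta}) + \epsilon_n\) with \(\epsilon_n \sim \mathcal{N}(0, \sigma^2)\), then
- \(-\ln P(\mathbf{y} | \mathbf{\beta}) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \{y_n - y(x_n, \mathbf{\beta})\}^2 + \frac{N}{2} \ln \sigma^2 + \frac{N}{2} \ln(2\pi)\)
- so maximizing the likelihood with respect to \(\beta\) is equivalent to minimizing the sum-of-squares error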
- one approach to determining frequentist error bars is the bootstrap:
- s1: from the existing dataset (of size N), randomly create L datasets (each also of size N) by drawing points from the existing dataset with replacement, so some points may be drawn repeatedly while others may not be drawn at all
- s2: look at the variability of predictions between the different bootstrap datasets, and use it to evaluate the accuracy of the parameter estimates
- drawback: may lead to extreme conclusions if the dataset is bad, e.g. a fair-looking coin is tossed three times and lands heads each time; in this case maximum likelihood will choose the parameter \(\beta\) so that P(lands heads) = 1
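A minimal sketch of this bootstrap procedure (the toy data, polynomial order, and number of bootstrap sets L are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, L = 30, 3, 1000
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)   # toy dataset of size N
Phi = np.vander(x, M + 1, increasing=True)                  # polynomial design matrix

betas = np.empty((L, M + 1))
for l in range(L):
    # s1: draw N points with replacement from the original dataset
    idx = rng.integers(0, N, size=N)
    # s2: refit the model on each bootstrap dataset (here an ordinary least-squares fit)
    betas[l], *_ = np.linalg.lstsq(Phi[idx], y[idx], rcond=None)

# The spread of the estimates across the L bootstrap fits gives the error bars
print("coefficient std. dev. across bootstrap fits:", betas.std(axis=0))
```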
- from the Bayesian viewpoint: