Machine Learning for Machine Vision
Learning and Inference
Computer vision models
Given observed measurement data \(\rvx\), we want to infer the state of the world \(\rvw\). Because the measurements are noisy, the best we can do is compute a probability distribution \(Pr(\rvw|\rvx)\) over possible world states.
Generative vs. Discriminative
- Model contingency of the world on the data \(Pr(w|x)\) (Discriminative)
- Model joint occurrence of world and data \(Pr(x, w)\) (Generative)
- Model contingency of data on world \(Pr(x|w)\) (Generative)
3 types of models
Model \(Pr(\rvw|\rvx)\) - Discriminative
Here are the steps:
1. Choose an appropriate form for \(Pr(\rvw)\) (e.g., a normal distribution)
2. Make parameters a function of \(\rvx\) (e.g., mean is linear function of \(\rvx\))
3. Function takes parameter \(\theta\) that defines its shape
Learning: Learn parameters \(\theta\) from training data \(\rvx, \rvw\)
Inference: Just evaluate \(Pr(\rvw|\rvx)\)
Example:
1. Choose a normal distribution for \(Pr(w)\)
- Make the mean a linear function of \(x\) and the variance \(\sigma^2\) a constant:
\[
Pr(w|x, \rvtheta) = \mathrm{Norm}_{w}[\phi_0 + \phi_1x, \sigma^2]
\]
- Learn the parameters using MAP:
\[
\begin{aligned}
\hat{\rvtheta} &=\arg\max_\rvtheta Pr(\rvtheta|w_{1\dots I}, x_{1\dots I})\\
&=\arg\max_\rvtheta Pr(w_{1\dots I}| x_{1\dots I}, \rvtheta) Pr(\rvtheta)
\end{aligned}
\]
Model \(Pr(\rvx, \rvw)\) - Generative
Here are the steps:
1. Concatenate \(\rvx\) and \(\rvw\) to make \(\rvz = \begin{bmatrix}\rvx^\top & \rvw^\top \end{bmatrix}^\top\)
2. Model the pdf of \(\rvz\)
3. Pdf takes parameter \(\theta\) that defines its shape
Learning: Learn parameters \(\theta\) from training data \(\rvx, \rvw\)
Inference: Compute \(Pr(\rvw|\rvx)\) using Bayes rule
\[
Pr(\rvw|\rvx) = \frac{Pr(\rvx, \rvw)}{Pr(\rvx)} = \frac{Pr(\rvx, \rvw)}{\int Pr(\rvx, \rvw)d\rvw}
\]
Model \(Pr(\rvx|\rvw)\) - Generative
Here are the steps:
1. Choose an appropriate form for \(Pr(\rvx)\)
2. Make parameters a function of \(\rvw\)
3. Function takes parameter \(\theta\) that defines its shape
Learning: Learn parameters \(\theta\) from training data \(\rvx, \rvw\)
Inference: Define prior \(Pr(\rvw)\) and then compute \(Pr(\rvw|\rvx)\) using Bayes rule:
\[
Pr(\rvw|\rvx) = \frac{Pr(\rvx|\rvw)Pr(\rvw)}{\int Pr(\rvx|\rvw)Pr(\rvw)d\rvw}
\]
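As a minimal sketch of this inference step, assume the world state is discrete (so the integral in the denominator becomes a sum) and each \(Pr(\rvx|\rvw=k)\) is a univariate normal. The function names and the numeric values below are hypothetical and chosen only for illustration.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Univariate normal density Norm_x[mean, var]."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def posterior_over_world(x, means, variances, prior):
    """Pr(w=k | x) for a discrete world state w with Gaussian Pr(x | w=k)."""
    likelihood = normal_pdf(x, means, variances)  # Pr(x | w=k) for every k
    joint = likelihood * prior                    # Pr(x | w=k) Pr(w=k)
    return joint / joint.sum()                    # divide by Pr(x) = sum over k

# Hypothetical two-state example: illustrative parameter values only.
means = np.array([0.0, 2.0])
variances = np.array([1.0, 1.0])
prior = np.array([0.5, 0.5])
post = posterior_over_world(1.0, means, variances, prior)
print(post)
```

With a uniform prior and equal variances, an observation halfway between the two means is equally well explained by both states; changing `prior` shifts the posterior accordingly.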
Summary (Which to use)
- Generative methods model the data, which is costly, and many aspects of the data may have no influence on the world state.
- Inference is simple for discriminative models.
- Data is generated from the world, and generative models match this.
- If data are missing, a generative model is preferred.
- Generative models allow users to impose prior knowledge.
Modeling Complex Densities
Models with hidden variables
Key idea: Represent the density \(Pr(\rvx)\) as the marginalization of a joint density with another variable \(\rvh\) that we do not observe:
\[
Pr(\rvx|\theta) = \int Pr(\rvx, \rvh|\theta)d\rvh
\]
Mixture of Gaussians (MoG)
Represent the density with a mixture of Gaussians:
\[
\begin{aligned}
Pr(\rvx|\theta) &= \sum_{k=1}^K Pr(\rvx, h=k|\theta)\\
&= \sum_{k=1}^K Pr(\rvx | h=k, \theta)Pr(h=k|\theta)\\
&=\sum_{k=1}^K \lambda_k \mathrm{Norm}_\rvx [\vect{\mu}_k, \vect{\Sigma}_k]
\end{aligned}
\]
where
\[
\begin{aligned}
Pr(\rvx|h, \rvtheta) &= \mathrm{Norm}_{\rvx}[\vect{\mu}_h, \vect{\Sigma}_h]\\
Pr(h| \rvtheta) &=\mathrm{Cat}_h[\vect{\lambda}]
\end{aligned}
\]
We can generate data from a MoG by first sampling \(h\) from \(Pr(h)\) and then sampling \(\rvx\) from \(Pr(\rvx|h)\). The hidden variable \(h\) has a clear interpretation: it indicates which Gaussian created the data point \(\rvx\).
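The two-stage sampling procedure can be sketched in NumPy as follows. This is an illustrative sketch, not a reference implementation; the function name `sample_mog` and the mixture parameters are assumptions made up for the example.

```python
import numpy as np

def sample_mog(n, lam, mus, Sigmas, rng):
    """Draw n samples: h ~ Cat[lam], then x ~ Norm[mu_h, Sigma_h]."""
    K = len(lam)
    h = rng.choice(K, size=n, p=lam)  # which Gaussian generates each point
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in h])
    return x, h

# Hypothetical 2-component mixture in 2D (illustrative values only).
rng = np.random.default_rng(0)
lam = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [5.0, 5.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])
x, h = sample_mog(1000, lam, mus, Sigmas, rng)
print(x.shape, np.bincount(h) / len(h))
```

The empirical fraction of samples per component should be close to `lam`, and the samples with `h == k` should cluster around `mus[k]`.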
Expectation maximization (EM)
- E-step: Maximize the bound w.r.t. the distributions \(q(\rvh_i)\), i.e., set them to the posterior over the hidden variables:
\[
\hat{q}(\rvh_i)=Pr(\rvh_i|\rvx_i,\rvtheta^{[t]})=\frac{Pr(\rvx_i|\rvh_i,\rvtheta^{[t]})Pr(\rvh_i|\rvtheta^{[t]})}{Pr(\rvx_i)}
\]
- M-step: Maximize the bound w.r.t. the parameters \(\rvtheta\):
\[
\hat{\theta}^{[t+1]} = \arg\max_{\theta}[\sum_{i=1}^I \sum_{k=1}^K \hat{q}(\rvh_i=k)\log[Pr(\rvx_i, \rvh_i=k|\rvtheta)]]
\]
E-step
\[
\begin{aligned}
Pr(\rvh_i=k|\rvx_i,\rvtheta^{[t]}) &= \frac{Pr(\rvx_i | \rvh_i=k, \rvtheta^{[t]})Pr(\rvh_i=k|\rvtheta^{[t]})}{\sum_{j=1}^K Pr(\rvx_i | \rvh_i=j, \rvtheta^{[t]})Pr(\rvh_i=j|\rvtheta^{[t]})}\\
&= \frac{\lambda_k \mathrm{Norm}_{\rvx_i} [\vect{\mu}_k, \vect{\Sigma}_k]}{\sum_{j=1}^K \lambda_j \mathrm{Norm}_{\rvx_i} [\vect{\mu}_j, \vect{\Sigma}_j]}\\
&=r_{ik}
\end{aligned}
\]
We call \(r_{ik}\) the responsibility of the \(k\)-th Gaussian for the \(i\)-th data point. Repeat this computation for every data point.
M-step
\[
\begin{aligned}
\hat{\theta}^{[t+1]} &= \arg\max_{\theta}[\sum_{i=1}^I \sum_{k=1}^K \hat{q}(\rvh_i=k)\log[Pr(\rvx_i, \rvh_i=k|\rvtheta)]]\\
&= \arg\max_{\theta}[\sum_{i=1}^I \sum_{k=1}^K r_{ik}\log[\lambda_k \mathrm{Norm}_{\rvx_i} [\vect{\mu}_k, \vect{\Sigma}_k]]]
\end{aligned}
\]
Here we take the derivative with respect to each parameter, set it to zero, and solve (using a Lagrange multiplier to enforce \(\sum_k \lambda_k = 1\)):
\[
\begin{aligned}
\lambda_k^{[t+1]} &= \frac{\sum_{i=1}^I r_{ik}}{\sum_{j=1}^K\sum_{i=1}^I r_{ij}}\\
\vect{\mu}_k^{[t+1]} &= \frac{\sum_{i=1}^I r_{ik}\rvx_i}{\sum_{i=1}^I r_{ik}}\\
\vect{\Sigma}_k^{[t+1]} &= \frac{\sum_{i=1}^I r_{ik}(\rvx_i-\vect{\mu}_k^{[t+1]})(\rvx_i-\vect{\mu}_k^{[t+1]})^\top}{\sum_{i=1}^I r_{ik}}
\end{aligned}
\]
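The E-step and M-step updates can be sketched in NumPy as below. This is a minimal illustration under stated assumptions: data are stored as rows of `X` (shape \(I \times D\)), means are initialized from random data points, and no covariance regularization is applied, so a component that collapses onto a single point would produce a singular covariance. All function names are hypothetical.

```python
import numpy as np

def log_normal_pdf(X, mu, Sigma):
    """Log of Norm_x[mu, Sigma] for each row of X (shape I x D)."""
    D = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum("id,id->i", diff @ np.linalg.inv(Sigma), diff)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

def e_step(X, lam, mus, Sigmas):
    """Return responsibilities r[i, k] and the data log-likelihood."""
    log_joint = np.stack(
        [np.log(lam[k]) + log_normal_pdf(X, mus[k], Sigmas[k])
         for k in range(len(lam))], axis=1)            # log(lam_k Norm[...])
    log_px = np.logaddexp.reduce(log_joint, axis=1)    # log sum_j lam_j Norm[...]
    return np.exp(log_joint - log_px[:, None]), log_px.sum()

def m_step(X, r):
    """Closed-form updates for lam, mu, Sigma given responsibilities."""
    Nk = r.sum(axis=0)                                 # sum_i r_ik
    lam = Nk / Nk.sum()
    mus = (r.T @ X) / Nk[:, None]
    Sigmas = []
    for k in range(r.shape[1]):
        diff = X - mus[k]
        Sigmas.append((r[:, k, None] * diff).T @ diff / Nk[k])
    return lam, mus, np.array(Sigmas)

def fit_mog(X, K, n_iter=50, seed=0):
    """Alternate E- and M-steps; history holds the log-likelihood per iteration."""
    rng = np.random.default_rng(seed)
    I, D = X.shape
    lam = np.full(K, 1.0 / K)
    mus = X[rng.choice(I, size=K, replace=False)]      # assumed initialization
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    history = []
    for _ in range(n_iter):
        r, ll = e_step(X, lam, mus, Sigmas)
        history.append(ll)
        lam, mus, Sigmas = m_step(X, r)
    return lam, mus, Sigmas, history

# Hypothetical data: two well-separated 2D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(6.0, 1.0, size=(200, 2))])
lam, mus, Sigmas, history = fit_mog(X, K=2)
print(lam, mus)
```

Working in the log domain (`np.logaddexp.reduce`) is a design choice to avoid underflow when a point is far from every component. The recorded log-likelihood should never decrease between iterations, which is a useful sanity check on any EM implementation.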
EM in detail
E-step
We want to define a lower bound on the log-likelihood \(\sum_{i=1}^I\log[\int Pr(\rvx_i, \rvh_i |\theta)d\rvh_i]\) and increase the bound iteratively:
\[
\begin{aligned}
\sum_{i=1}^I\log[\int Pr(\rvx_i, \rvh_i |\theta)d\rvh_i] &=
\sum_{i=1}^I\log[\int q_i(\rvh_i)\frac{Pr(\rvx_i, \rvh_i |\theta)}{q_i(\rvh_i)}d\rvh_i]\\
&\ge \sum_{i=1}^I\int q_i(\rvh_i)\log[\frac{Pr(\rvx_i, \rvh_i |\theta)}{q_i(\rvh_i)}]d\rvh_i \quad \text{Jensen's inequality $E[\log[y]]\le \log E[y]$}\\
&= \gB[{(q_i(\rvh_i))}, \rvtheta]
\end{aligned}
\]
Now we can maximize the log-likelihood by maximizing its lower bound \(\gB[{(q_i(\rvh_i))}, \rvtheta]\) w.r.t. the distributions \(q_i(\rvh_i)\):
\[
\begin{aligned}
\gB[{(q_i(\rvh_i))}, \rvtheta] &= \sum_{i=1}^I\int q_i(\rvh_i)\log[\frac{Pr(\rvx_i, \rvh_i |\theta)}{q_i(\rvh_i)}]d\rvh_i\\
&= \sum_{i=1}^I\int q_i(\rvh_i)\log[\frac{Pr(\rvh_i | \rvx_i, \theta) Pr(\rvx_i |\theta)}{q_i(\rvh_i)}]d\rvh_i\\
&= \sum_{i=1}^I\int q_i(\rvh_i)\log[ Pr(\rvx_i |\theta)]d\rvh_i - \sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{q_i(\rvh_i)}{Pr(\rvh_i | \rvx_i, \theta)}]d\rvh_i\\
&= \sum_{i=1}^I \underbrace{\log[ Pr(\rvx_i |\theta)]}_{\text{constant w.r.t $q(h)$}} - \sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{q_i(\rvh_i)}{Pr(\rvh_i | \rvx_i, \theta)}]d\rvh_i
\end{aligned}
\]
Since the first term is constant w.r.t. \(q_i(\rvh_i)\), to maximize the bound we only need to maximize the second term (including its minus sign):
\[
\begin{aligned}
\hat{q}_i(\rvh_i) &= \arg\max_{q_i(\rvh_i)} \gB[{(q_i(\rvh_i))}, \rvtheta]\\
&= \arg\max_{q_i(\rvh_i)}-\sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{q_i(\rvh_i)}{Pr(\rvh_i | \rvx_i, \theta)}]d\rvh_i\\
&= \arg\min_{q_i(\rvh_i)}-\sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{Pr(\rvh_i | \rvx_i, \theta)}{q_i(\rvh_i)}]d\rvh_i\\
\end{aligned}
\]
The quantity being minimized is the Kullback-Leibler divergence, a measure of dissimilarity between probability distributions (not a true distance, since it is asymmetric). We are maximizing the negative divergence, i.e., minimizing the divergence.
By using the relation \(\log[y]\le y-1\), we get:
\[
\begin{aligned}
\sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{Pr(\rvh_i | \rvx_i, \theta)}{q_i(\rvh_i)}]d\rvh_i &\le \sum_{i=1}^I\int q_i(\rvh_i)(\frac{Pr(\rvh_i | \rvx_i, \theta)}{q_i(\rvh_i)}-1) d\rvh_i \\
&= \sum_{i=1}^I\int (Pr(\rvh_i | \rvx_i, \theta) - q_i(\rvh_i)) d\rvh_i \\
&= \sum_{i=1}^I\int Pr(\rvh_i | \rvx_i, \theta)d\rvh_i -\sum_{i=1}^I\int q_i(\rvh_i)d\rvh_i\\
&= I - I = 0
\end{aligned}
\]
Thus the cost function (the KL divergence) is non-negative, so the best \(\hat{q}_i(\rvh_i)\) is the one that makes it equal to 0. Choosing \(\hat{q}_i(\rvh_i) = Pr(\rvh_i|\rvx_i, \rvtheta)\) achieves this:
\[
\begin{aligned}
\sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{Pr(\rvh_i | \rvx_i, \theta)}{q_i(\rvh_i)}]d\rvh_i &= \sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{Pr(\rvh_i | \rvx_i, \theta)}{Pr(\rvh_i | \rvx_i, \theta)}]d\rvh_i \\
&= \sum_{i=1}^I\int q_i(\rvh_i)\log[1]d\rvh_i = 0
\end{aligned}
\]
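A small numerical check can make this concrete. The sketch below assumes a single data point and a discrete hidden variable with three states, so the integral over \(\rvh_i\) becomes a sum; the joint probabilities are made-up values. It evaluates the bound for an arbitrary \(q\) and for the posterior.

```python
import numpy as np

def bound(q, joint):
    """B[q] = sum_k q_k log(joint_k / q_k) for one data point, discrete h."""
    return np.sum(q * np.log(joint / q))

# Hypothetical values of Pr(x, h=k | theta) for k = 1..3.
joint = np.array([0.10, 0.25, 0.05])
log_lik = np.log(joint.sum())          # log Pr(x | theta)
posterior = joint / joint.sum()        # Pr(h=k | x, theta)

rng = np.random.default_rng(0)
q_random = rng.dirichlet(np.ones(3))   # an arbitrary distribution over h

gap_random = log_lik - bound(q_random, joint)    # KL(q || posterior) >= 0
gap_posterior = log_lik - bound(posterior, joint)
print(gap_random, gap_posterior)
```

The gap between the log-likelihood and the bound equals the KL divergence between \(q\) and the posterior, so it is non-negative for any \(q\) and vanishes when \(q\) is the posterior.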
M-step
In the M-step, we maximize the bound w.r.t. \(\rvtheta\):
\[
\begin{aligned}
\rvtheta^{[t+1]} &= \arg\max_{\rvtheta} \gB[{(q_i^{[t]}(\rvh_i))}, \rvtheta]\\
&=\arg\max_{\rvtheta} \sum_{i=1}^I\int q_i^{[t]}(\rvh_i)\log[\frac{Pr(\rvx_i, \rvh_i |\theta)}{q_i^{[t]}(\rvh_i)}]d\rvh_i\\
&=\arg\max_{\rvtheta} \sum_{i=1}^I\int (q_i^{[t]}(\rvh_i)\log[Pr(\rvx_i, \rvh_i |\theta)] - \underbrace{q_i^{[t]}(\rvh_i)\log[q_i^{[t]}(\rvh_i)]}_{\text{constant w.r.t $\rvtheta$}}) d\rvh_i\\
&=\arg\max_{\rvtheta} \sum_{i=1}^I\int q_i^{[t]}(\rvh_i)\log[Pr(\rvx_i, \rvh_i |\theta)]d\rvh_i
\end{aligned}
\]
Regression models
Linear regression
We model the world state \(\rvw\) with a normal distribution whose mean is a linear function of the data, where each column of \(\rmX\) is one data example:
\[
Pr(\rvw|\rmX, \rvtheta) = \mathrm{Norm}_{\rvw}[\rmX^\top\rvphi, \sigma^2\rmI]
\]
Learning - Maximum likelihood
\[
\hat{\rvtheta}=\arg\max_{\rvtheta}[Pr(\rvw|\rmX, \rvtheta)] = \arg\max_{\rvtheta}[\log Pr(\rvw|\rmX, \rvtheta)]
\]
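As a sketch of the intermediate step, assuming \(\rmX\) is \(D\times I\) with one example per column and \(I\) is the number of examples, the log-likelihood and its derivatives can be written out as:
\[
\begin{aligned}
\log Pr(\rvw|\rmX, \rvtheta) &= -\frac{I}{2}\log[2\pi\sigma^2] - \frac{(\rvw-\rmX^\top\rvphi)^\top(\rvw-\rmX^\top\rvphi)}{2\sigma^2}\\
\frac{\partial}{\partial \rvphi}\log Pr(\rvw|\rmX, \rvtheta) &= \frac{1}{\sigma^2}\rmX(\rvw-\rmX^\top\rvphi)\\
\frac{\partial}{\partial \sigma^2}\log Pr(\rvw|\rmX, \rvtheta) &= -\frac{I}{2\sigma^2} + \frac{(\rvw-\rmX^\top\rvphi)^\top(\rvw-\rmX^\top\rvphi)}{2\sigma^4}
\end{aligned}
\]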
Take the derivative, set the result to zero, and rearrange:
\[
\begin{aligned}
\hat{\rvphi} &= (\rmX \rmX^\top)^{-1}\rmX\rvw\\
\hat{\sigma}^2 &= \frac{(\rvw-\rmX^\top\hat{\rvphi})^\top(\rvw-\rmX^\top\hat{\rvphi})}{I}
\end{aligned}
\]
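The closed-form solution can be sketched in NumPy as follows, assuming \(\rmX\) is stored as a \(D \times I\) array with one example per column and a row of ones prepended for the offset. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def fit_linear_regression(X, w):
    """ML estimates for X of shape D x I (one column per example), w of length I."""
    phi = np.linalg.solve(X @ X.T, X @ w)   # (X X^T)^{-1} X w without explicit inverse
    resid = w - X.T @ phi
    sigma2 = resid @ resid / X.shape[1]     # residual sum of squares divided by I
    return phi, sigma2

# Hypothetical 1D data: w = 1 + 2x + noise with standard deviation 0.1.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
X = np.vstack([np.ones_like(x), x])         # prepend 1 to model the offset
w = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=100)
phi, sigma2 = fit_linear_regression(X, w)
print(phi, sigma2)
```

Using `np.linalg.solve` instead of forming the inverse is a numerical-stability choice; the result is the same estimate as the formula for \(\hat{\rvphi}\).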
Bayesian linear regression
Likelihood:
\[
Pr(\rvw|\rmX, \rvtheta) = \mathrm{Norm}_{\rvw}[\rmX^\top\rvphi, \sigma^2\rmI]
\]
Prior:
\[
Pr(\rvphi) = \mathrm{Norm}_{\rvphi}[\vect{0}, \sigma^2_p\rmI]
\]
Bayes rule:
\[
Pr(\rvphi|\rmX, \rvw) = \frac{Pr(\rvw|\rmX, \rvphi)Pr(\rvphi)}{Pr(\rvw|\rmX)}
\]
Learning
Posterior:
\[
Pr(\rvphi|\rmX, \rvw) = \mathrm{Norm}_{\rvphi}[\frac{1}{\sigma^2}\rmA^{-1}\rmX\rvw, \rmA^{-1}] \quad \text{where $\rmA = \frac{1}{\sigma^2}\rmX\rmX^\top + \frac{1}{\sigma^2_p}\rmI$}
\]
Fit variance - Maximum likelihood:
\[
\begin{aligned}
Pr(\rvw|\rmX, \sigma^2) &= \int Pr(\rvw|\rmX, \rvphi)Pr(\rvphi)d\rvphi\\
&= \int \mathrm{Norm}_{\rvw}[\rmX^\top\rvphi, \sigma^2\rmI]\mathrm{Norm}_{\rvphi}[\vect{0}, \sigma^2_p\rmI]d\rvphi\\
&= \mathrm{Norm}_{\rvw}[\vect{0}, \sigma^2_p\rmX^\top\rmX + \sigma^2\rmI]
\end{aligned}
\]
Inference
\[
\begin{aligned}
Pr(w^\star|\rvx^\star, \rmX, \rvw) &= \int Pr(w^\star|\rvx^\star, \rvphi)Pr(\rvphi|\rmX, \rvw)d\rvphi\\
&= \int\mathrm{Norm}_{w^\star}[\rvphi^\top\rvx^\star, \sigma^2] \mathrm{Norm}_{\rvphi}[\frac{1}{\sigma^2}\rmA^{-1}\rmX\rvw, \rmA^{-1}] d\rvphi\\
&= \mathrm{Norm}_{w^\star}[\frac{1}{\sigma^2}\rvx^{\star\top}\rmA^{-1}\rmX\rvw, \rvx^{\star\top}\rmA^{-1}\rvx^\star + \sigma^2]
\end{aligned}
\]
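Learning and inference for Bayesian linear regression can be sketched in NumPy as below. The sketch assumes \(\rmX\) is \(D \times I\) with one example per column, that \(\sigma^2\) and \(\sigma_p^2\) are given rather than fitted, and uses the standard predictive variance \(\rvx^{\star\top}\rmA^{-1}\rvx^\star + \sigma^2\). The function names and data are hypothetical.

```python
import numpy as np

def posterior(X, w, sigma2, sigma2_p):
    """Posterior over phi: mean (1/sigma2) A^{-1} X w and covariance A^{-1}."""
    D = X.shape[0]
    A = X @ X.T / sigma2 + np.eye(D) / sigma2_p
    A_inv = np.linalg.inv(A)
    return A_inv @ X @ w / sigma2, A_inv

def predict(x_star, X, w, sigma2, sigma2_p):
    """Predictive mean and variance of w* for a new example x_star."""
    mean, A_inv = posterior(X, w, sigma2, sigma2_p)
    return x_star @ mean, x_star @ A_inv @ x_star + sigma2

# Hypothetical 1D data: w = 1 + 2x + noise, inputs confined to [-1, 1].
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.vstack([np.ones_like(x), x])
w = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=50)

mean_near, var_near = predict(np.array([1.0, 0.0]), X, w, 0.01, 10.0)
mean_far, var_far = predict(np.array([1.0, 10.0]), X, w, 0.01, 10.0)
print(mean_near, var_near, mean_far, var_far)
```

The predictive variance is never smaller than the noise variance \(\sigma^2\), and it grows for inputs far from the training data, reflecting greater uncertainty about \(\rvphi\) in those directions.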
Non-linear regression