Machine Learning for Machine Vision

Learning and Inference

Computer vision models

Given observed measurement data $\rvx$, we want to infer the state of the world $\rvw$. Because the measurements are noisy, the best we can do is compute a probability distribution $Pr(\rvw|\rvx)$ over possible world states.

Generative vs. Discriminative

Model contingency of the world on the data $Pr(w|x)$ (Discriminative)
Model joint occurrence of world and data $Pr(x, w)$ (Generative)
Model contingency of data on world $Pr(x|w)$ (Generative)

3 types of models

Model $Pr(\rvw|\rvx)$ - Discriminative

Here are the steps: 1. Choose an appropriate form for $Pr(\rvw)$ (e.g., a normal distribution) 2. Make its parameters a function of $\rvx$ (e.g., the mean is a linear function of $\rvx$) 3. The function takes parameters $\theta$ that define its shape

Learning: Learn parameters $\theta$ from training data $\rvx, \rvw$

Inference: Just evaluate $Pr(\rvw|\rvx)$

Example: 1. Choose a normal distribution for $Pr(w)$

2. Make the mean a linear function of $x$ and the variance $\sigma^2$ a constant

\[ Pr(w|x, \rvtheta) = \mathrm{Norm}_{w}[\phi_0 + \phi_1x, \sigma^2] \]

Learn the parameters $\rvtheta = \{\phi_0, \phi_1, \sigma^2\}$ using MAP:

\[ \begin{aligned} \hat{\rvtheta} &=\arg\max_\rvtheta Pr(\rvtheta|w_{1\dots I}, x_{1\dots I})\\ &=\arg\max_\rvtheta Pr(w_{1\dots I}| x_{1\dots I}, \rvtheta) Pr(\rvtheta) \end{aligned} \]
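
The learning and inference steps for this scalar example can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes a flat prior $Pr(\rvtheta)$, so the MAP estimate coincides with maximum likelihood (ordinary least squares), and the function names `fit_linear_gaussian` and `predict` and the synthetic data are hypothetical.

```python
import random

def fit_linear_gaussian(xs, ws):
    """Fit Pr(w|x) = Norm_w[phi0 + phi1*x, sigma2] by maximum likelihood.

    Assumes a flat prior on the parameters, so this equals the MAP estimate.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_w = sum(ws) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxw = sum((x - mean_x) * (w - mean_w) for x, w in zip(xs, ws))
    phi1 = sxw / sxx                      # slope of the mean
    phi0 = mean_w - phi1 * mean_x         # offset of the mean
    # ML variance: mean squared residual.
    sigma2 = sum((w - phi0 - phi1 * x) ** 2 for x, w in zip(xs, ws)) / n
    return phi0, phi1, sigma2

def predict(x, phi0, phi1, sigma2):
    """Inference: evaluate Pr(w|x), returned as (mean, variance)."""
    return phi0 + phi1 * x, sigma2

# Synthetic training data (hypothetical): w = 1 + 2x plus Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ws = [1.0 + 2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
phi0, phi1, sigma2 = fit_linear_gaussian(xs, ws)
mean_w, var_w = predict(2.0, phi0, phi1, sigma2)
```

Inference here is a single function evaluation, which is what makes discriminative models cheap at test time.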

Model $Pr(\rvx, \rvw)$ - Generative

Here are the steps: 1. Concatenate $\rvx$ and $\rvw$ to make $\rvz = \begin{bmatrix}\rvx^\top & \rvw^\top \end{bmatrix}^\top$ 2. Model the pdf of $\rvz$ 3. The pdf takes parameters $\theta$ that define its shape

Learning: Learn parameters $\theta$ from training data $\rvx, \rvw$

Inference: Compute $Pr(\rvw|\rvx)$ using Bayes rule

\[ Pr(\rvw|\rvx) = \frac{Pr(\rvx, \rvw)}{Pr(\rvx)} = \frac{Pr(\rvx, \rvw)}{\int Pr(\rvx, \rvw)d\rvw} \]
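
As an illustration, suppose the joint density is chosen to be a bivariate normal over $\rvz = [x, w]^\top$ (an assumption made here for the sketch; the notes do not fix the form). Then the division by $Pr(\rvx)$ has a closed form: the conditional is normal with mean $\mu_w + \frac{\Sigma_{wx}}{\Sigma_{xx}}(x - \mu_x)$ and variance $\Sigma_{ww} - \frac{\Sigma_{wx}^2}{\Sigma_{xx}}$. The function names and toy data below are hypothetical.

```python
def fit_joint_gaussian(xs, ws):
    """Learning: ML mean and covariance of the joint normal over z = [x, w]."""
    n = len(xs)
    mu_x = sum(xs) / n
    mu_w = sum(ws) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    s_ww = sum((w - mu_w) ** 2 for w in ws) / n
    s_xw = sum((x - mu_x) * (w - mu_w) for x, w in zip(xs, ws)) / n
    return mu_x, mu_w, s_xx, s_xw, s_ww

def condition_on_x(x, params):
    """Inference: Pr(w|x) = Pr(x, w) / Pr(x) for a joint normal.

    Returns the (mean, variance) of the resulting normal over w.
    """
    mu_x, mu_w, s_xx, s_xw, s_ww = params
    mean = mu_w + s_xw / s_xx * (x - mu_x)
    var = s_ww - s_xw ** 2 / s_xx
    return mean, var

# Toy training pairs (hypothetical), roughly w = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ws = [1.1, 2.9, 5.2, 6.8, 9.0]
params = fit_joint_gaussian(xs, ws)
cond_mean, cond_var = condition_on_x(3.0, params)
```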

Model $Pr(\rvx|\rvw)$ - Generative

Here are the steps: 1. Choose an appropriate form for $Pr(\rvx)$ 2. Make its parameters a function of $\rvw$ 3. The function takes parameters $\theta$ that define its shape

Learning: Learn parameters $\theta$ from training data $\rvx, \rvw$

Inference: Define prior $Pr(\rvw)$ and then compute $Pr(\rvw|\rvx)$ using Bayes rule:

\[ Pr(\rvw|\rvx) = \frac{Pr(\rvx|\rvw)Pr(\rvw)}{\int Pr(\rvx|\rvw)Pr(\rvw)d\rvw} \]
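
A minimal sketch of this recipe, assuming a discrete world state $w \in \{0, 1\}$ (so the integral becomes a sum), a normal likelihood $Pr(x|w=k)$ per state, and a prior $Pr(w)$ estimated from class frequencies. These choices, the function names, and the toy data are assumptions for illustration only.

```python
import math

def normal_pdf(x, mu, var):
    """Univariate normal density Norm_x[mu, var]."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_class_conditionals(xs, ws):
    """Learning: per world state k, fit Pr(x|w=k) and the prior Pr(w=k)."""
    params = {}
    for k in sorted(set(ws)):
        xk = [x for x, w in zip(xs, ws) if w == k]
        mu = sum(xk) / len(xk)
        var = sum((x - mu) ** 2 for x in xk) / len(xk)
        prior = len(xk) / len(xs)
        params[k] = (mu, var, prior)
    return params

def posterior(x, params):
    """Inference: Bayes rule, Pr(w|x) = Pr(x|w)Pr(w) / sum_w Pr(x|w)Pr(w)."""
    joint = {k: normal_pdf(x, mu, var) * prior
             for k, (mu, var, prior) in params.items()}
    evidence = sum(joint.values())
    return {k: p / evidence for k, p in joint.items()}

# Toy data (hypothetical): state 0 produces x near 1, state 1 produces x near 3.
xs = [0.9, 1.1, 1.0, 1.2, 0.8, 2.9, 3.1, 3.0, 3.2, 2.8]
ws = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
params = fit_class_conditionals(xs, ws)
post_low = posterior(1.0, params)
post_mid = posterior(2.0, params)
post_high = posterior(3.0, params)
```

Note that, unlike the discriminative case, inference needs the extra normalization step over all world states.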

Summary (Which to use)

Generative methods model the data, which is costly, and many aspects of the data may have no influence on the world state.
Inference is simple for discriminative models.
Data is generated from the world, and generative models match this.
If data is missing, a generative model is preferred.
Generative models allow users to impose prior knowledge.

Modeling Complex Densities

Models with hidden variables

Key idea: Represent the density $Pr(\rvx)$ as the marginalization of a joint density over another variable $\rvh$ that we do not observe:

\[ Pr(\rvx|\theta) = \int Pr(\rvx, \rvh|\theta)d\rvh \]

Mixture of Gaussians (MoG)

Represent the density with mixture of Gaussians:

\[ \begin{aligned} Pr(\rvx|\theta) &= \sum_{k=1}^K Pr(\rvx, h=k|\theta)\\ &= \sum_{k=1}^K Pr(\rvx | h=k, \theta)Pr(h=k|\theta)\\ &=\sum_{k=1}^K \lambda_k \mathrm{Norm}_\rvx [\vect{\mu}_k, \vect{\Sigma}_k] \end{aligned} \]

where

\[ \begin{aligned} Pr(\rvx|h=k, \rvtheta) &= \mathrm{Norm}_{\rvx}[\vect{\mu}_k, \vect{\Sigma}_k]\\ Pr(h| \rvtheta) &=\mathrm{Cat}_h[\vect{\lambda}] \end{aligned} \]

We can generate data from a MoG by first sampling $h$ from $Pr(h)$ and then sampling $\rvx$ from $Pr(\rvx|h)$. The hidden variable $h$ has a clear interpretation: it indicates which Gaussian created data point $\rvx$.
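
The two-stage sampling procedure can be sketched as follows for the univariate case. The function name `sample_mog` and the parameter values are hypothetical; note that `random.gauss` takes a standard deviation rather than a variance.

```python
import random

def sample_mog(lambdas, mus, sigmas, n, seed=0):
    """Draw n samples (h, x) from a 1D mixture of Gaussians.

    lambdas: mixing weights of Cat_h; mus, sigmas: per-component mean and
    standard deviation.
    """
    rng = random.Random(seed)
    components = range(len(lambdas))
    samples = []
    for _ in range(n):
        # Step 1: sample the hidden variable h from Pr(h) = Cat_h[lambda].
        h = rng.choices(components, weights=lambdas)[0]
        # Step 2: sample x from Pr(x|h) = Norm_x[mu_h, sigma_h^2].
        x = rng.gauss(mus[h], sigmas[h])
        samples.append((h, x))
    return samples

samples = sample_mog([0.3, 0.7], [-2.0, 3.0], [0.5, 1.0], n=2000)
frac_first = sum(1 for h, _ in samples if h == 0) / len(samples)
```

In a real dataset only the `x` values would be observed; the `h` values are kept here to show which Gaussian generated each point.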

Expectation maximization (EM)

E-step: Maximize the bound w.r.t. the distributions $q(\rvh_i)$, i.e., set each one to the posterior over the hidden variable:

\[ \hat{q}(\rvh_i)=Pr(\rvh_i|\rvx_i,\rvtheta^{[t]})=\frac{Pr(\rvx_i|\rvh_i,\rvtheta^{[t]})Pr(\rvh_i|\rvtheta^{[t]})}{Pr(\rvx_i|\rvtheta^{[t]})} \]

M-step: Maximize the bound w.r.t. the parameters $\theta$:

\[ \hat{\theta}^{[t+1]} = \arg\max_{\theta}[\sum_{i=1}^I \sum_{k=1}^K \hat{q}(\rvh_i=k)\log[Pr(\rvx_i, \rvh_i=k|\rvtheta)]] \]

E-step

\[ \begin{aligned} Pr(\rvh_i=k|\rvx_i,\rvtheta^{[t]}) &= \frac{Pr(\rvx_i | \rvh_i=k, \rvtheta^{[t]})Pr(\rvh_i=k|\rvtheta^{[t]})}{\sum_{j=1}^K Pr(\rvx_i | \rvh_i=j, \rvtheta^{[t]})Pr(\rvh_i=j|\rvtheta^{[t]})}\\ &= \frac{\lambda_k \mathrm{Norm}_{\rvx_i} [\vect{\mu}_k, \vect{\Sigma}_k]}{\sum_{j=1}^K \lambda_j \mathrm{Norm}_{\rvx_i} [\vect{\mu}_j, \vect{\Sigma}_j]}\\ &=r_{ik} \end{aligned} \]

We call this the responsibility of the $k$-th Gaussian for the $i$-th data point. Repeat this procedure for every data point.
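
A minimal sketch of the responsibility computation for univariate data, where each $\vect{\Sigma}_k$ reduces to a scalar variance. The function names and the example parameters are hypothetical.

```python
import math

def normal_pdf(x, mu, var):
    """Univariate normal density Norm_x[mu, var]."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(xs, lambdas, mus, variances):
    """Return responsibilities r[i][k] = Pr(h_i = k | x_i, theta)."""
    resp = []
    for x in xs:
        # Numerator terms: lambda_k * Norm_x[mu_k, var_k] for each component.
        weighted = [lam * normal_pdf(x, mu, var)
                    for lam, mu, var in zip(lambdas, mus, variances)]
        total = sum(weighted)  # denominator: sum over all components j
        resp.append([w / total for w in weighted])
    return resp

# Two equally weighted components centred at -2 and 3 (hypothetical values).
r = e_step([-2.0, 0.5, 3.0], [0.5, 0.5], [-2.0, 3.0], [1.0, 1.0])
```

The point `0.5` lies halfway between the two means, so both components share responsibility for it equally, while the points at the means are claimed almost entirely by their own component.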

M-step

\[ \begin{aligned} \hat{\theta}^{[t+1]} &= \arg\max_{\theta}[\sum_{i=1}^I \sum_{k=1}^K \hat{q}(\rvh_i=k)\log[Pr(\rvx_i, \rvh_i=k|\rvtheta)]]\\ &= \arg\max_{\theta}[\sum_{i=1}^I \sum_{k=1}^K r_{ik}\log[\lambda_k \mathrm{Norm}_{\rvx_i} [\vect{\mu}_k, \vect{\Sigma}_k]]] \end{aligned} \]

Here we take derivatives, set them to zero, and solve (using a Lagrange multiplier to enforce $\sum_k \lambda_k = 1$):

\[ \begin{aligned} \lambda_k^{[t+1]} &= \frac{\sum_{i=1}^I r_{ik}}{\sum_{j=1}^K\sum_{i=1}^I r_{ij}}\\ \vect{\mu}_k^{[t+1]} &= \frac{\sum_{i=1}^I r_{ik}\rvx_i}{\sum_{i=1}^I r_{ik}}\\ \vect{\Sigma}_k^{[t+1]} &= \frac{\sum_{i=1}^I r_{ik}(\rvx_i-\vect{\mu}_k^{[t+1]})(\rvx_i-\vect{\mu}_k^{[t+1]})^\top}{\sum_{i=1}^I r_{ik}} \end{aligned} \]
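
The update equations translate directly into code. The sketch below alternates E- and M-steps on univariate data; it is illustrative only, with hypothetical function names, a hand-picked initialization, a fixed iteration count instead of a convergence test, and no safeguards against collapsing variances.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(xs, lambdas, mus, variances):
    """Responsibilities r[i][k] under the current parameters."""
    resp = []
    for x in xs:
        weighted = [lam * normal_pdf(x, mu, var)
                    for lam, mu, var in zip(lambdas, mus, variances)]
        total = sum(weighted)
        resp.append([w / total for w in weighted])
    return resp

def m_step(xs, resp):
    """Closed-form updates for lambda_k, mu_k and the variance of component k."""
    n, num_k = len(xs), len(resp[0])
    lambdas, mus, variances = [], [], []
    for k in range(num_k):
        r_k = sum(resp[i][k] for i in range(n))
        mu = sum(resp[i][k] * xs[i] for i in range(n)) / r_k
        var = sum(resp[i][k] * (xs[i] - mu) ** 2 for i in range(n)) / r_k
        # Each row of resp sums to 1, so sum_j sum_i r_ij equals n.
        lambdas.append(r_k / n)
        mus.append(mu)
        variances.append(var)
    return lambdas, mus, variances

def fit_mog(xs, lambdas, mus, variances, n_iters=50):
    for _ in range(n_iters):
        resp = e_step(xs, lambdas, mus, variances)
        lambdas, mus, variances = m_step(xs, resp)
    return lambdas, mus, variances

# Synthetic data (hypothetical): 200 points near -2 and 300 points near 3.
rng = random.Random(0)
xs = [rng.gauss(-2.0, 0.5) for _ in range(200)]
xs += [rng.gauss(3.0, 1.0) for _ in range(300)]
lambdas, mus, variances = fit_mog(xs, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```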

EM in details

E-step

We want to define a lower bound on the log-likelihood $\sum_{i=1}^I\log[\int Pr(\rvx_i, \rvh_i |\theta)d\rvh_i]$ and increase the bound iteratively:

\[ \begin{aligned} \sum_{i=1}^I\log[\int Pr(\rvx_i, \rvh_i |\theta)d\rvh_i] &= \sum_{i=1}^I\log[\int q_i(\rvh_i)\frac{Pr(\rvx_i, \rvh_i |\theta)}{q_i(\rvh_i)}d\rvh_i]\\ &\ge \sum_{i=1}^I\int q_i(\rvh_i)\log[\frac{Pr(\rvx_i, \rvh_i |\theta)}{q_i(\rvh_i)}]d\rvh_i \quad \text{Jensen's inequality $E[\log[y]]\le \log E[y]$}\\ &= \gB[{(q_i(\rvh_i))}, \rvtheta] \end{aligned} \]

Now we can maximize the log-likelihood by maximizing its lower bound $\gB[{(q_i(\rvh_i))}, \rvtheta]$ w.r.t. the distributions $q_i(\rvh_i)$:

\[ \begin{aligned} \gB[{(q_i(\rvh_i))}, \rvtheta] &= \sum_{i=1}^I\int q_i(\rvh_i)\log[\frac{Pr(\rvx_i, \rvh_i |\theta)}{q_i(\rvh_i)}]d\rvh_i\\ &= \sum_{i=1}^I\int q_i(\rvh_i)\log[\frac{Pr(\rvh_i | \rvx_i, \theta) Pr(\rvx_i |\theta)}{q_i(\rvh_i)}]d\rvh_i\\ &= \sum_{i=1}^I\int q_i(\rvh_i)\log[ Pr(\rvx_i |\theta)]d\rvh_i - \sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{q_i(\rvh_i)}{Pr(\rvh_i | \rvx_i, \theta)}]d\rvh_i\\ &= \sum_{i=1}^I \underbrace{\log[ Pr(\rvx_i |\theta)]}_{\text{constant w.r.t $q(h)$}} - \sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{q_i(\rvh_i)}{Pr(\rvh_i | \rvx_i, \theta)}]d\rvh_i \end{aligned} \]

Since the first term is constant w.r.t. $q(h)$, to maximize the bound we only need to maximize the second term:

\[ \begin{aligned} \hat{q}_i(\rvh_i) &= \arg\max_{q_i(\rvh_i)} \gB[{(q_i(\rvh_i))}, \rvtheta]\\ &= \arg\max_{q_i(\rvh_i)}-\sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{q_i(\rvh_i)}{Pr(\rvh_i | \rvx_i, \theta)}]d\rvh_i\\ &= \arg\min_{q_i(\rvh_i)}-\sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{Pr(\rvh_i | \rvx_i, \theta)}{q_i(\rvh_i)}]d\rvh_i\\ \end{aligned} \]

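This cost is the Kullback-Leibler divergence between $q_i(\rvh_i)$ and the posterior $Pr(\rvh_i|\rvx_i, \theta)$, a measure of dissimilarity between probability distributions (it is not symmetric, so it is not a true distance). We are maximizing the negative divergence, i.e., minimizing the divergence.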

By using the relation $\log[y]\le y-1$, we get:

\[ \begin{aligned} \sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{Pr(\rvh_i | \rvx_i, \theta)}{q_i(\rvh_i)}]d\rvh_i &\le \sum_{i=1}^I\int q_i(\rvh_i)(\frac{Pr(\rvh_i | \rvx_i, \theta)}{q_i(\rvh_i)}-1) d\rvh_i \\ &= \sum_{i=1}^I\int Pr(\rvh_i | \rvx_i, \theta) - q_i(\rvh_i) d\rvh_i \\ &= \sum_{i=1}^I\int Pr(\rvh_i | \rvx_i, \theta)d\rvh_i -\sum_{i=1}^I\int q_i(\rvh_i)d\rvh_i\\ &= \sum_{i=1}^I (1-1) = 0 \end{aligned} \]

Thus the cost function (the negative of this expression) is non-negative, so the best $\hat{q}_i(\rvh_i)$ is one that makes the cost function equal to 0. Choosing the posterior under the current parameters, $\hat{q}_i(\rvh_i) = Pr(\rvh_i|\rvx_i, \theta)$, achieves this:

\[ \begin{aligned} \sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{Pr(\rvh_i | \rvx_i, \theta)}{q_i(\rvh_i)}]d\rvh_i &= \sum_{i=1}^I\int q_i(\rvh_i)\log[ \frac{Pr(\rvh_i | \rvx_i, \theta)}{Pr(\rvh_i | \rvx_i, \theta)}]d\rvh_i \\ &= \sum_{i=1}^I\int q_i(\rvh_i)\log[1]d\rvh_i = 0 \end{aligned} \]
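
The tightness of the bound can be checked numerically. The sketch below uses a single data point and a discrete hidden variable with two states, so the integral over $\rvh_i$ becomes a sum; the mixture parameters and function names are hypothetical. It compares the bound at the posterior with the bound at an arbitrary other choice of $q$.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def joint(x, lambdas, mus, variances):
    """Pr(x, h=k | theta) for every component k."""
    return [lam * normal_pdf(x, mu, var)
            for lam, mu, var in zip(lambdas, mus, variances)]

def lower_bound(q, joint_probs):
    """B[q, theta] = sum_k q_k * log(Pr(x, h=k | theta) / q_k)."""
    return sum(q_k * math.log(p_k / q_k)
               for q_k, p_k in zip(q, joint_probs) if q_k > 0.0)

x = 0.0
joint_probs = joint(x, [0.3, 0.7], [-1.0, 2.0], [1.0, 1.0])
log_lik = math.log(sum(joint_probs))          # log Pr(x | theta)
posterior = [p / sum(joint_probs) for p in joint_probs]

bound_at_posterior = lower_bound(posterior, joint_probs)  # touches log_lik
bound_at_uniform = lower_bound([0.5, 0.5], joint_probs)   # strictly below
```

The gap `log_lik - bound_at_uniform` is exactly the Kullback-Leibler divergence between the uniform $q$ and the posterior, which is why it vanishes only when $q$ equals the posterior.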

M-step

In the M-step, we optimize the bound w.r.t. $\theta$:

\[ \begin{aligned} \rvtheta^{[t+1]} &= \arg\max_{\rvtheta} \gB[{(q_i^{[t]}(\rvh_i))}, \rvtheta]\\ &=\arg\max_{\rvtheta} \sum_{i=1}^I\int q_i^{[t]}(\rvh_i)\log[\frac{Pr(\rvx_i, \rvh_i |\theta)}{q_i^{[t]}(\rvh_i)}]d\rvh_i\\ &=\arg\max_{\rvtheta} \sum_{i=1}^I\left[\int q_i^{[t]}(\rvh_i)\log[Pr(\rvx_i, \rvh_i |\theta)]d\rvh_i - \underbrace{\int q_i^{[t]}(\rvh_i)\log[q_i^{[t]}(\rvh_i)] d\rvh_i}_{\text{constant w.r.t $\rvtheta$}}\right]\\ &=\arg\max_{\rvtheta} \sum_{i=1}^I\int q_i^{[t]}(\rvh_i)\log[Pr(\rvx_i, \rvh_i |\theta)]d\rvh_i \end{aligned} \]

Regression models

Linear regression

We can model the world $\rvw$ with a normal distribution:

\[ Pr(\rvw|\rmX, \rvtheta) = \mathrm{Norm}_{\rvw}[\rmX^\top\rvphi, \sigma^2\rmI] \]

Learning - Maximum likelihood

\[ \hat{\rvtheta}=\arg\max_{\rvtheta}[Pr(\rvw|\rmX, \rvtheta)] = \arg\max_{\rvtheta}[\log Pr(\rvw|\rmX, \rvtheta)] \]

Take the derivative, set the result to zero, and rearrange:

\[ \begin{aligned} \hat{\rvphi} &= (\rmX \rmX^\top)^{-1}\rmX\rvw\\ \hat{\sigma}^2 &= \frac{(\rvw-\rmX^\top\hat{\rvphi})^\top(\rvw-\rmX^\top\hat{\rvphi})}{I} \end{aligned} \]
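
The differentiation step can be spelled out as follows. This is a sketch that assumes $\rmX$ is $D\times I$ with one training example per column (consistent with the mean $\rmX^\top\rvphi$ used above) and that $I$ is the number of training examples. Writing out the log of the normal density gives

\[ \log Pr(\rvw|\rmX, \rvtheta) = -\frac{I}{2}\log[2\pi] - \frac{I}{2}\log[\sigma^2] - \frac{(\rvw-\rmX^\top\rvphi)^\top(\rvw-\rmX^\top\rvphi)}{2\sigma^2} \]

and setting the derivatives w.r.t. $\rvphi$ and $\sigma^2$ to zero yields

\[ \begin{aligned} \frac{\partial}{\partial \rvphi}: \quad & \frac{1}{\sigma^2}\rmX(\rvw-\rmX^\top\rvphi) = \vect{0} \;\Rightarrow\; \rmX\rmX^\top\rvphi = \rmX\rvw\\ \frac{\partial}{\partial \sigma^2}: \quad & -\frac{I}{2\sigma^2} + \frac{(\rvw-\rmX^\top\rvphi)^\top(\rvw-\rmX^\top\rvphi)}{2\sigma^4} = 0 \;\Rightarrow\; \sigma^2 = \frac{(\rvw-\rmX^\top\rvphi)^\top(\rvw-\rmX^\top\rvphi)}{I} \end{aligned} \]

The first condition does not depend on $\sigma^2$, so $\hat{\rvphi}$ can be solved first and then substituted into the second.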

Bayesian linear regression

Likelihood:

\[ Pr(\rvw|\rmX, \rvtheta) = \mathrm{Norm}_{\rvw}[\rmX^\top\rvphi, \sigma^2\rmI] \]

Prior:

\[ Pr(\rvphi) = \mathrm{Norm}_{\rvphi}[\vect{0}, \sigma^2_p\rmI] \]

Bayes rule:

\[ Pr(\rvphi|\rmX, \rvw) = \frac{Pr(\rvw|\rmX, \rvphi)Pr(\rvphi)}{Pr(\rvw|\rmX)} \]

Learning

Posterior:

\[ Pr(\rvphi|\rmX, \rvw) = \mathrm{Norm}_{\rvphi}[\frac{1}{\sigma^2}\rmA^{-1}\rmX\rvw, \rmA^{-1}] \quad \text{where $\rmA = \frac{1}{\sigma^2}\rmX\rmX^\top + \frac{1}{\sigma^2_p}\rmI$} \]

Fit variance - Maximum likelihood:

\[ \begin{aligned} Pr(\rvw|\rmX, \sigma^2) &= \int Pr(\rvw|\rmX, \rvphi)Pr(\rvphi)d\rvphi\\ &= \int \mathrm{Norm}_{\rvw}[\rmX^\top\rvphi, \sigma^2\rmI]\mathrm{Norm}_{\rvphi}[\vect{0}, \sigma^2_p\rmI]d\rvphi\\ &= \mathrm{Norm}_{\rvw}[\vect{0}, \sigma^2_p\rmX^\top\rmX + \sigma^2\rmI] \end{aligned} \]

Inference

\[ \begin{aligned} Pr(w^\star|\rvx^\star, \rmX, \rvw) &= \int Pr(w^\star|\rvx^\star, \rvphi)Pr(\rvphi|\rmX, \rvw)d\rvphi\\ &= \int\mathrm{Norm}_{w^\star}[\rvphi^\top\rvx^\star, \sigma^2] \mathrm{Norm}_{\rvphi}[\frac{1}{\sigma^2}\rmA^{-1}\rmX\rvw, \rmA^{-1}] d\rvphi\\ &= \mathrm{Norm}_{w^\star}[\frac{1}{\sigma^2}\rvx^{\star\top}\rmA^{-1}\rmX\rvw, \rvx^{\star\top}\rmA^{-1}\rvx^\star + \sigma^2] \end{aligned} \]
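
The posterior and predictive formulas can be sketched for the smallest useful case: $D = 2$, where each column of $\rmX$ is $[1, x_i]^\top$ (offset and slope), so that $\rmA$ is $2\times 2$ and can be inverted by hand. The function name, the fixed values of $\sigma^2$ and $\sigma^2_p$, and the toy data are assumptions for illustration.

```python
def bayesian_linear_regression(xs, ws, x_star, sigma2, sigma2_p):
    """Posterior over phi and predictive distribution for w* at x*.

    Each data column of X is [1, x_i]^T, so phi = [offset, slope].
    Returns (posterior mean, posterior covariance A^{-1},
             predictive mean, predictive variance).
    """
    n = len(xs)
    # A = (1/sigma2) X X^T + (1/sigma2_p) I, a symmetric 2x2 matrix.
    a11 = n / sigma2 + 1.0 / sigma2_p
    a12 = sum(xs) / sigma2
    a22 = sum(x * x for x in xs) / sigma2 + 1.0 / sigma2_p
    det = a11 * a22 - a12 * a12
    a_inv = [[a22 / det, -a12 / det],
             [-a12 / det, a11 / det]]
    # X w is the 2-vector [sum_i w_i, sum_i x_i w_i].
    xw = [sum(ws), sum(x * w for x, w in zip(xs, ws))]
    # Posterior mean: (1/sigma2) A^{-1} X w.
    post_mean = [(a_inv[r][0] * xw[0] + a_inv[r][1] * xw[1]) / sigma2
                 for r in range(2)]
    # Predictive mean and variance at the new input x* = [1, x_star]^T.
    v = [1.0, x_star]
    pred_mean = post_mean[0] * v[0] + post_mean[1] * v[1]
    pred_var = sum(v[r] * a_inv[r][c] * v[c]
                   for r in range(2) for c in range(2)) + sigma2
    return post_mean, a_inv, pred_mean, pred_var

# Toy data (hypothetical): noiseless w = 1 + 2x observed at x = 0..9.
xs = [float(i) for i in range(10)]
ws = [1.0 + 2.0 * x for x in xs]
post_mean, post_cov, pred_mean, pred_var = bayesian_linear_regression(
    xs, ws, x_star=4.5, sigma2=0.01, sigma2_p=100.0)
_, _, far_mean, far_var = bayesian_linear_regression(
    xs, ws, x_star=20.0, sigma2=0.01, sigma2_p=100.0)
```

Comparing `pred_var` with `far_var` shows the characteristic behaviour of the Bayesian model: the predictive variance grows as $\rvx^\star$ moves away from the training data, because the $\rvx^{\star\top}\rmA^{-1}\rvx^\star$ term reflects uncertainty in $\rvphi$.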

Non-linear regression