
Probability

Common Probability Distributions

Bernoulli & Beta distribution

Bernoulli distribution:

\[ Pr(x) = Bern_x(\lambda) = \lambda^x(1-\lambda)^{1-x}, \text{ where } x \in \{0, 1\} \]

Beta distribution:

\[ Pr(\lambda) = Beta_{\lambda}[\alpha, \beta]=\frac{\Gamma[\alpha+\beta]}{\Gamma[\alpha]\Gamma[\beta]} \lambda^{\alpha-1} (1-\lambda)^{\beta-1} \]

Here \(\Gamma[\cdot]\) is the gamma function:

\[ \Gamma(z) = \int_0^\infty t^{z-1}e^{-t}dt, \qquad \Gamma(z) = (z-1)! \ \text{ for positive integer } z \]

Notes on beta distribution:

  1. \(\alpha, \beta\) are both \(> 0\)
  2. The mean depends on the relative values: \(E(\lambda) = \frac{\alpha}{\alpha+\beta}\)
  3. The concentration depends on the magnitude of \(\alpha + \beta\): the larger it is, the more peaked the distribution
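
As a quick check of these properties, here is a minimal sketch using NumPy/SciPy (the library choice and the parameter values \(\lambda = 0.7\), \(\alpha = 2\), \(\beta = 5\) are my own, not from the notes):

```python
import numpy as np
from scipy import stats

lam = 0.7                                # Bernoulli parameter lambda
print(stats.bernoulli.pmf(1, lam))       # Pr(x=1) = lambda
print(stats.bernoulli.pmf(0, lam))       # Pr(x=0) = 1 - lambda

alpha, beta_ = 2.0, 5.0                  # Beta parameters, both > 0
print(stats.beta.mean(alpha, beta_))     # E(lambda) = alpha / (alpha + beta) = 2/7
print(alpha / (alpha + beta_))           # same value from the formula

grid = np.linspace(0.0, 1.0, 200)
pdf = stats.beta.pdf(grid, alpha, beta_) # density values over [0, 1] for inspection
```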

Fitting probability models

Maximum likelihood (ML)

Fitting: Find the parameters under which the data \(x_{1\dots I}\) are most likely:

\[ \begin{aligned} \hat{\theta} &= \arg\max_\theta [Pr(x_{1\dots I}|\theta)]\\ &= \arg\max_\theta [\prod_{i=1}^I Pr(x_i|\theta)] \quad \text{Assumption: independent data} \end{aligned} \]

Predictive density: Evaluate new data \(x^\star\) under the probability distribution with the best parameters: \(Pr(x^\star | \hat{\theta})\).
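
For the Bernoulli model above, the ML estimate has a closed form; here is a minimal sketch on hypothetical coin-flip data (my own example):

```python
import numpy as np

# Hypothetical training data x_1..x_I for a Bernoulli model (coin flips)
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# ML estimate: argmax_lambda prod_i lambda^{x_i} (1-lambda)^{1-x_i}
# has the closed form lambda_hat = sum(x_i) / I
lam_ml = x.sum() / len(x)
print(lam_ml)                              # 0.7

# Predictive density for a new point x* = 1 under the fitted model
print(lam_ml**1 * (1 - lam_ml)**0)         # Pr(x*=1 | lambda_hat) = lambda_hat
```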

Maximum a posteriori (MAP)

Fitting: Find the parameters which maximize the posterior probability: \(Pr(\theta|x_{1\dots I})\).

\[ \begin{aligned} \hat{\theta} &= \arg\max_\theta [Pr(\theta|x_{1\dots I})]\\ &= \arg\max_\theta [\frac{Pr(x_{1\dots I}|\theta) Pr(\theta)}{Pr(x_{1\dots I})}]\\ &= \arg\max_\theta [\frac{\prod_{i=1}^IPr(x_{i}|\theta) Pr(\theta)}{Pr(x_{1\dots I})}] \quad \text{Assumption: independent data}\\ &= \arg\max_\theta [\prod_{i=1}^IPr(x_{i}|\theta) Pr(\theta)] \quad \text{since the denominator does not depend on } \theta \end{aligned} \]

Predictive density: Evaluate new data \(x^\star\) under the probability distribution with the MAP parameters: \(Pr(x^\star | \hat{\theta})\).

NOTE (Also a common mistake): We need to distinguish fitting from the predictive density here. Fitting finds the best model, and ML and MAP are simply different methods for fitting it. Once we have the fitted model \(\hat{\theta}\) based on the training data \(x_{1\dots I}\), we can evaluate the probability of a new data point \(x^\star\) using \(\hat{\theta}\). Thus, the predictive density equations for ML and MAP are exactly the same.
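
For a Bernoulli likelihood with a Beta prior over \(\lambda\), the MAP estimate also has a closed form. A minimal sketch on the same hypothetical data, with a hypothetical Beta(2, 2) prior (both choices are mine, not from the notes):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])    # same hypothetical coin flips
alpha, beta_ = 2.0, 2.0                          # hypothetical Beta prior over lambda

# MAP estimate: argmax_lambda [prod_i Pr(x_i|lambda)] * Beta_lambda[alpha, beta]
# For Bernoulli data with a Beta prior this has the closed form
lam_map = (x.sum() + alpha - 1) / (len(x) + alpha + beta_ - 2)
print(lam_map)                                   # 8/12 = 0.666...

# The predictive density has the same form as ML, just with the MAP estimate
print(lam_map)                                   # Pr(x*=1 | lambda_hat) = lambda_hat
```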

Bayesian approach

Fitting: Compute posterior distribution over possible parameters.

\[ Pr(\theta|x_{1\dots I}) = \frac{\prod_{i=1}^IPr(x_{i}|\theta) Pr(\theta)}{Pr(x_{1\dots I})} \]

Predictive density:

\[ Pr(x^\star|x_{1\dots I}) = \int Pr(x^\star|\theta) Pr(\theta|x_{1\dots I})d\theta \]

The posterior here can be seen as a weight over each possible parameter value. To evaluate a new data point, we take the weighted average of the predictions under all possible parameters.
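
A minimal sketch of the Bayesian treatment for the same hypothetical Bernoulli/Beta example, comparing a grid approximation of the integral with the closed-form conjugate result (again my own data and prior):

```python
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])    # hypothetical coin flips
alpha, beta_ = 2.0, 2.0                          # hypothetical Beta prior

# Beta prior + Bernoulli likelihood is conjugate: the posterior is again a Beta,
# Pr(lambda | x_1..I) = Beta_lambda[alpha + sum(x), beta + I - sum(x)]
a_post = alpha + x.sum()
b_post = beta_ + len(x) - x.sum()

# Predictive density Pr(x*=1 | x_1..I) = integral of lambda * posterior(lambda),
# approximated here as an average over a dense grid on [0, 1] (interval of length 1)
grid = np.linspace(0.0, 1.0, 100001)
posterior = stats.beta.pdf(grid, a_post, b_post)
print(np.mean(grid * posterior))                            # ~ 0.6429
print((alpha + x.sum()) / (alpha + beta_ + len(x)))         # closed form: 9/14 = 0.6428...
```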

Comparison of ML vs. MAP vs. Bayesian

For the predictive density, ML & MAP can be seen as special cases of the Bayesian approach where the weight is a delta function, i.e. zero probability everywhere except at the estimate:

\[ \begin{aligned} Pr(x^\star|x_{1\dots I}) &= \int Pr(x^\star|\theta) \delta[\theta - \hat{\theta}]d\theta \\ &= Pr(x^\star|\hat{\theta}) \end{aligned} \]

NOTE (An intuition): Suppose we want to select some classmates to solve a problem. ML gives an exam to all classmates and picks the one with the highest score. MAP means, for instance, that Peter got first place, but we know David has consistently done better on previous exams, so even though he came second this time, we still pick him. Finally, the Bayesian approach does not pick a single student: we give each student a weight based on their score and our prior, and this weight is the posterior. Each time, we collect answers from all students and use the weighted average as the solution.
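
To make the contrast concrete, here is a small numeric comparison on the same hypothetical coin-flip data and Beta(2, 2) prior used above (my own example): the three approaches give different predictive probabilities for \(x^\star = 1\).

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])    # hypothetical Bernoulli data
alpha, beta_ = 2.0, 2.0                          # hypothetical Beta prior
I = len(x)

p_ml = x.sum() / I                                        # ML: point estimate only
p_map = (x.sum() + alpha - 1) / (I + alpha + beta_ - 2)   # MAP: point estimate with prior
p_bayes = (x.sum() + alpha) / (I + alpha + beta_)         # Bayesian: average over all lambdas

print(p_ml, p_map, p_bayes)                      # 0.7, 0.667, 0.643
```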

The normal distribution

Univariate normal distribution

\[ Pr(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp [-0.5(x-\mu)^2/\sigma^2] \]
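
A quick sanity check of this formula against SciPy (the values \(\mu = 1\), \(\sigma = 2\) are my own, chosen only for illustration):

```python
import numpy as np
from scipy import stats

mu, sigma = 1.0, 2.0     # hypothetical mean and standard deviation
x = 0.5

# Direct evaluation of the density formula
pdf = np.exp(-0.5 * (x - mu)**2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
print(pdf)
print(stats.norm.pdf(x, loc=mu, scale=sigma))    # matches the closed-form value
```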

Multivariate normal distribution