# The Expectation-Maximization (EM) Algorithm
## Introduction
The Expectation-Maximization (EM) algorithm is a classic, general algorithm for iteratively computing maximum-likelihood (or maximum a posteriori) estimates when the observed data are incomplete.
Take an example. Suppose we have 200 height samples from 100 boys and 100 girls. Assuming that the height of each gender follows a Gaussian distribution, we could easily perform MLE on each group separately and estimate the parameters:
$$ x_i \sim \mathcal{N}(x_i \mid \mu_i, \sigma_i^2) $$
where $i \in \lbrace 1, 2 \rbrace$, $1$ for male and $2$ for female, and $x_i$ is a height from group $i$.
However, for some reason the gender of each sample is unknown, despite the fact that gender has an impact on height. We may then model the heights as a weighted average (mixture) of two Gaussians
$$ x \sim \pi_1 \mathcal{N}(x \mid \mu_1, \sigma_1^2) + \pi_2 \mathcal{N}(x \mid \mu_2, \sigma_2^2) $$
where $\pi_1 + \pi_2 = 1$. The model looks nice, but the maximum-likelihood estimates of its parameters are hard to find explicitly.
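To make the setting concrete, here is a minimal sketch in NumPy. The parameter values are made up purely for illustration; the point is that the per-gender MLE is trivial while the labels are known, and the problem only becomes interesting once they are lost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth parameters, for illustration only (heights in cm).
mu_m, sigma_m = 175.0, 6.0   # boys
mu_f, sigma_f = 162.0, 5.0   # girls

boys = rng.normal(mu_m, sigma_m, size=100)
girls = rng.normal(mu_f, sigma_f, size=100)

# Complete data: gender labels are known, so the MLE of each Gaussian is just
# the per-group sample mean and (biased) sample variance.
print("boys :", boys.mean(), boys.var())
print("girls:", girls.mean(), girls.var())

# Incomplete data: the labels are lost and only the pooled heights remain.
# Fitting the two-component mixture to `heights` is the job of EM.
heights = np.concatenate([boys, girls])
```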
## The EM Algorithm
Given a joint distribution $p(\boldsymbol{x}, \boldsymbol{z} \mid \boldsymbol{\theta})$ over the observed variable $\boldsymbol{x}$ and the latent variable $\boldsymbol{z}$, we wish to fit a model $p(\boldsymbol{x} \mid \boldsymbol{\theta})$ with parameters $\boldsymbol{\theta}$. $\boldsymbol{z}$ is called latent (hidden) because it is missing or cannot be observed, yet it affects the distribution of $\boldsymbol{x}$. $\boldsymbol{x}$ alone is the incomplete data; $\lbrace \boldsymbol{x}, \boldsymbol{z} \rbrace$ is the complete data.
We only have the joint pdf $p(\boldsymbol{x}, \boldsymbol{z} \mid \boldsymbol{\theta})$, but what we wish to fit is the marginal pdf $p(\boldsymbol{x} \mid \boldsymbol{\theta})$.
The marginal likelihood is given by $$ \begin{aligned} \mathcal{L}(\boldsymbol{\theta}) &= p(\boldsymbol{x} \mid \boldsymbol{\theta}) \\ &= \int_{\boldsymbol{z}} p(\boldsymbol{x}, \boldsymbol{z} \mid \boldsymbol{\theta}) d\boldsymbol{z} \\ &= \int_{\boldsymbol{z}} p(\boldsymbol{x} \mid \boldsymbol{z}, \boldsymbol{\theta}) p(\boldsymbol{z} \mid \boldsymbol{\theta}) d\boldsymbol{z} \end{aligned} $$
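For the height mixture, $\boldsymbol{z}$ is discrete (which of the two Gaussians generated each sample), so the integral reduces to a sum over components. Here is a minimal sketch of evaluating this marginal log-likelihood in NumPy, where `gaussian_pdf` and `log_marginal_likelihood` are helper names introduced here for illustration:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def log_marginal_likelihood(x, pi, mu, var):
    """log p(x | theta) for a K-component mixture: the integral over z
    becomes sum_k pi_k N(x_i | mu_k, var_k) for each sample x_i."""
    # densities[i, k] = pi_k * N(x_i | mu_k, var_k)
    densities = np.stack([p * gaussian_pdf(x, m, v)
                          for p, m, v in zip(pi, mu, var)], axis=1)
    return np.sum(np.log(densities.sum(axis=1)))

# Example: evaluate the pooled heights under a guessed parameter set.
# log_marginal_likelihood(heights, pi=[0.5, 0.5], mu=[170.0, 160.0], var=[30.0, 30.0])
```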
Instead of optimizing $\log{p(\boldsymbol{x} \mid \boldsymbol{\theta})}$ directly, the EM algorithm optimizes $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$, defined below, at each step $t$. The algorithm is given as follows; a runnable sketch for the height mixture appears after the boxed one-line update.
- Initialize parameters $\boldsymbol{\theta}$ randomly.
- Repeat E-step and M-step until convergence.
- Expectation step (E-step): Evaluate the expectation of the joint (complete-data) log-likelihood of $\boldsymbol{\theta}$ w.r.t. the posterior distribution of $\boldsymbol{z}$, given $\boldsymbol{x}$ and the current parameter estimates $\boldsymbol{\theta}^{(t)}$.
$$ \begin{align} Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) &= \mathbb{E}_{\boldsymbol{z} \sim p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)})} \lbrack \log{p(\boldsymbol{x}, \boldsymbol{z} \mid \boldsymbol{\theta})} \rbrack \\ &= \int_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}) \log{p(\boldsymbol{x}, \boldsymbol{z} \mid \boldsymbol{\theta})} d\boldsymbol{z} \end{align} $$
- Maximization step (M-step): Update parameters.
$$ \boldsymbol{\theta}^{(t+1)} = \underset{\boldsymbol{\theta}}{\text{argmax}}\ Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) $$
Written in one line, the EM update is:
$$ \boxed{\boldsymbol{\theta}^{(t+1)} = \underset{\boldsymbol{\theta}}{\text{argmax}}\ \left\lbrace \int_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}) \log{p(\boldsymbol{x}, \boldsymbol{z} \mid \boldsymbol{\theta})} d\boldsymbol{z} \right\rbrace} $$
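Below is a minimal EM sketch for the univariate Gaussian mixture from the introduction, assuming NumPy. The E-step computes the responsibilities $\gamma_{ik} = p(z_i = k \mid x_i, \boldsymbol{\theta}^{(t)})$, and the M-step applies the standard closed-form maximizers of $Q$ for a Gaussian mixture (responsibility-weighted mixing proportions, means, and variances). The function name `em_gmm` and its defaults are choices made here for illustration, and degenerate cases such as a collapsing component are not handled.

```python
import numpy as np

def em_gmm(x, n_components=2, n_iters=100, tol=1e-6, seed=0):
    """EM for a univariate Gaussian mixture; x is a 1-D NumPy array."""
    rng = np.random.default_rng(seed)
    n = len(x)

    # Random initialization of theta^(0).
    pi = np.full(n_components, 1.0 / n_components)
    mu = rng.choice(x, size=n_components, replace=False)
    var = np.full(n_components, x.var())

    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: responsibilities gamma[i, k] = p(z_i = k | x_i, theta^(t)).
        dens = np.stack(
            [p * np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)
             for p, m, v in zip(pi, mu, var)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: closed-form argmax of Q(theta | theta^(t)).
        nk = gamma.sum(axis=0)                 # effective count per component
        pi = nk / n
        mu = gamma.T @ x / nk
        var = np.array([(gamma[:, k] * (x - mu[k]) ** 2).sum() / nk[k]
                        for k in range(n_components)])

        # Observed-data log-likelihood under theta^(t); it never decreases
        # across iterations (see the convergence proof below).
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, var
```

Applied to the pooled `heights` array from the introduction sketch, `em_gmm(heights)` should recover mixing proportions near $0.5$ and means near the two generating values, although like any EM run it can land in a poor local optimum for an unlucky initialization.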
## Proof of Convergence
Next we show that the EM algorithm is guaranteed to converge, in the sense that the observed-data log-likelihood never decreases: for $t = 1, 2, \dots$ the following inequality holds.
$$ \log{p(\boldsymbol{x} \mid \boldsymbol{\theta}^{(t+1)})} \geq \log{p(\boldsymbol{x} \mid \boldsymbol{\theta}^{(t)})} $$
Proof. Since
$$ p(\boldsymbol{x} \mid \boldsymbol{\theta}) = \frac{p(\boldsymbol{x}, \boldsymbol{z} \mid \boldsymbol{\theta})}{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta})} $$
we have
$$ \log{p(\boldsymbol{x} \mid \boldsymbol{\theta})} = \log{p(\boldsymbol{x}, \boldsymbol{z} \mid \boldsymbol{\theta})} - \log{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta})} $$
Take the expectation over the posterior distribution of the latent variable, $p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)})$, obtained after finishing step $t$. The lhs is the expectation of a quantity that does not depend on $\boldsymbol{z}$, so it is unchanged; the rhs splits into two parts, $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ and $H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$
$$ \begin{align} &\mathbb{E}_{\boldsymbol{z} \sim p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)})} \lbrack \log{p(\boldsymbol{x} \mid \boldsymbol{\theta})} \rbrack = \mathbb{E}_{\boldsymbol{z} \sim p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)})} \lbrack \log{p(\boldsymbol{x}, \boldsymbol{z} \mid \boldsymbol{\theta})} - \log{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta})} \rbrack \\ \Rightarrow& \log{p(\boldsymbol{x} \mid \boldsymbol{\theta})} = \underbrace{\int_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}) \log{p(\boldsymbol{x}, \boldsymbol{z} \mid \boldsymbol{\theta})} d\boldsymbol{z}}_{Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})} - \underbrace{\int_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}) \log{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta})} d\boldsymbol{z}}_{H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})} \\ \Rightarrow& \log{p(\boldsymbol{x} \mid \boldsymbol{\theta})} = Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) \end{align} $$
Recall that the algorithm only optimizes $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$; the term $H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ is not considered.
In the algorithm, after finishing step $t$, since $\boldsymbol{\theta}^{(t+1)}$ is found to maximize $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$, we have
$$ Q(\boldsymbol{\theta}^{(t+1)} \mid \boldsymbol{\theta}^{(t)}) \geq Q(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) $$
Then, $$ \begin{align} \text{To prove}& \quad \log{p(\boldsymbol{x} \mid \boldsymbol{\theta}^{(t+1)})} \geq \log{p(\boldsymbol{x} \mid \boldsymbol{\theta}^{(t)})} \\ \Leftrightarrow \text{To prove}& \quad Q(\boldsymbol{\theta}^{(t+1)} \mid \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t+1)} \mid \boldsymbol{\theta}^{(t)}) \geq Q(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) \\ \Leftrightarrow \text{To prove}& \quad H(\boldsymbol{\theta}^{(t+1)} \mid \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) \leq Q(\boldsymbol{\theta}^{(t+1)} \mid \boldsymbol{\theta}^{(t)}) - Q(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) \\ \Leftarrow \text{To prove}& \quad H(\boldsymbol{\theta}^{(t+1)} \mid \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) \leq 0 \\ \Leftarrow \text{To prove}& \quad H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) \leq 0 \quad \forall \boldsymbol{\theta} \end{align} $$
$$ \begin{align} &H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) \\ =&\ \int_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}) \log{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta})} d\boldsymbol{z} - \int_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}) \log{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)})} d\boldsymbol{z} \\ =&\ \int_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}) \log{\left\lparen \frac{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta})}{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)})} \right\rparen} d\boldsymbol{z} \end{align} $$
Since $\log{x}$ is concave, by Jensen's inequality (if $f$ is a convex function and $\boldsymbol{x}$ is a random variable, then $f(\mathbb{E}\lbrack \boldsymbol{x} \rbrack) \leq \mathbb{E}\lbrack f(\boldsymbol{x}) \rbrack$; the inequality flips if $f$ is concave), we have
$$ \begin{align} &\int_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}) \log{\left\lparen \frac{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta})}{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)})} \right\rparen} d\boldsymbol{z} \\ \leq&\ \log{\left\lparen \int_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}) \frac{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta})}{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)})} d\boldsymbol{z} \right\rparen} \\ =&\ \log{\left\lparen \int_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}) d\boldsymbol{z} \right\rparen} \\ =&\ \log{1} \\ =&\ 0 \end{align} $$
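Equivalently, the quantity above is the negative of a Kullback–Leibler divergence, which is non-negative by Gibbs' inequality:
$$ \int_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}) \log{\left\lparen \frac{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta})}{p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)})} \right\rparen} d\boldsymbol{z} = -\operatorname{KL}\left\lparen p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}) \,\Vert\, p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\theta}) \right\rparen \leq 0 $$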
Thus, for all $\boldsymbol{\theta}$, $$ H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) \leq 0 $$
Therefore, $$ \log{p(\boldsymbol{x} \mid \boldsymbol{\theta}^{(t+1)})} \geq \log{p(\boldsymbol{x} \mid \boldsymbol{\theta}^{(t)})} $$