## Finite Mixture Models 

In general, a finite mixture model (有限混合模型) is defined as

$$ \begin{align} p(\boldsymbol{x} \mid \boldsymbol{\theta}) &= \sum_{\boldsymbol{z}} p(\boldsymbol{z})\ p(\boldsymbol{x} \mid \boldsymbol{z}, \boldsymbol{\theta}) \\ &= \sum_{k=1}^{K} \pi_k \ p(\boldsymbol{x} \mid z = k , \boldsymbol{\theta}_k) \end{align} $$

where

  • $\boldsymbol{z}$ is the latent variable following the categorical distribution $\text{Categorical}(\boldsymbol{\pi}, K)$.

    • $K$ is the number of mixture components.

    • $\boldsymbol{z} = (z_1, z_2, \dots, z_K)$ uses the one-hot representation: $z_k \in \lbrace 0, 1 \rbrace$ and $\sum_{k=1}^{K} z_k = 1$.

    • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$ is the probability vector whose entries $\pi_k$ are the mixture weights (混合权重), with $0 \leq \pi_k \leq 1$ and $\sum_{k=1}^{K}\pi_k = 1$.

      The marginal distribution over $\boldsymbol{z}$ is defined as

      $$ p(\boldsymbol{z}) = \prod_{k=1}^{K} \pi_{k}^{z_k} $$

      That is, $p(\boldsymbol{z}) = \pi_k$ if and only if the $k$-th element $z_k = 1$ (and all other elements are zero).

      Sometimes we simply let $z$ be a scalar, with $z = k$ whenever $z_k = 1$; this is equivalent to the one-hot representation above:

      $$ p(z = k) = \pi_k $$

  • $p(\boldsymbol{x} \mid z = k, \boldsymbol{\theta}_k)$ are the mixture components (混合分量). For each $k$, it is a PDF over $\boldsymbol{x}$ with parameters $\boldsymbol{\theta}_k$.

The mixture model (the PDF) $p(\boldsymbol{x} \mid \boldsymbol{\theta})$ is a convex combination of the PDFs of the mixture components.

$p(z)$ is always a categorical distribution, whereas $p(\boldsymbol{x} \mid z)$ can be any family of distributions you choose. The complete set of parameters of a mixture model is

$$ \boldsymbol{\theta} = \lbrace \pi_1, \dots, \pi_K, \boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_K \rbrace $$

where $\pi_k$ are mixture weights, and $\boldsymbol{\theta}_k$ are parameters of mixture components.

Mixture models are latent variable models: they assume the observed data fall into different clusters. First we sample the category $z$, which cannot be observed; then we sample the observable $\boldsymbol{x}$ from the distribution of component $z$.

When the data are likely to fall into distinct groups, or the distribution is complex (e.g. multimodal), a mixture model is worth trying.

Fitting a mixture model is a form of unsupervised clustering, and the trained model can be used for density estimation.
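
To make the generative view above concrete, here is a minimal sampling sketch in Python/NumPy; the three Gaussian components and their parameter values are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for K = 3 two-dimensional Gaussian components.
pi = np.array([0.3, 0.5, 0.2])                           # mixture weights, sum to 1
mus = np.array([[-2.0, 0.0], [1.0, 1.0], [4.0, -1.0]])   # component means
sigmas = np.array([0.5 * np.eye(2), np.eye(2), 0.3 * np.eye(2)])  # component covariances

def sample_mixture(n):
    """Ancestral sampling: first z ~ Categorical(pi), then x ~ p(x | z)."""
    z = rng.choice(len(pi), size=n, p=pi)                # latent component indices
    x = np.stack([rng.multivariate_normal(mus[k], sigmas[k]) for k in z])
    return x, z

X, Z = sample_mixture(1000)   # Z is latent; in practice only X is observed
```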

## Gaussian Mixtures 

The Gaussian Mixture Model (GMM, 高斯混合模型) is a typical and widely used mixture model.

A GMM is defined as follows, with Gaussian distributions as the mixture components:

$$ p(\boldsymbol{x} \mid \boldsymbol{\theta}) = p(\boldsymbol{x} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{k=1}^{K} \pi_k \ \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) $$

where

  • $\boldsymbol{\pi} = \lparen \pi_1, \dots, \pi_K \rparen$ is the vector of mixture weights.
  • $\boldsymbol{\mu} = \lbrace \boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K \rbrace$ is the set of mean vectors.
  • $\boldsymbol{\Sigma} = \lbrace \boldsymbol{\Sigma}_1, \dots, \boldsymbol{\Sigma}_K \rbrace$ is the set of covariance matrices.
  • $\boldsymbol{\theta} = \lbrace \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma} \rbrace$ is the set of parameters.
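
As a quick sanity check of the definition, the GMM density at a point is just the $\pi$-weighted sum of Gaussian densities. A minimal sketch using `scipy.stats.multivariate_normal` (the parameter values are illustrative assumptions, not from the text):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters for a 2-component GMM in two dimensions.
pi = np.array([0.4, 0.6])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]

def gmm_pdf(x, pi, mus, sigmas):
    """p(x | theta) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(w * multivariate_normal(m, s).pdf(x)
               for w, m, s in zip(pi, mus, sigmas))

print(gmm_pdf(np.array([1.0, 1.0]), pi, mus, sigmas))
```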

Introducing the latent variable $\boldsymbol{z}$ in the 1-of-K (one-hot) representation, which follows the categorical distribution

$$ p(\boldsymbol{z}) = \prod_{k=1}^{K} \pi_{k}^{z_k} $$

we can write the conditional distribution of $\boldsymbol{x}$ given $\boldsymbol{z}$ as

$$ p(\boldsymbol{x} \mid \boldsymbol{z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{k=1}^{K} \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k} $$

and the joint distribution of $\boldsymbol{x}$ and $\boldsymbol{z}$ is

$$ \begin{align} p(\boldsymbol{x}, \boldsymbol{z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) &= p(\boldsymbol{z}) p(\boldsymbol{x} \mid \boldsymbol{z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) \\ &= \prod_{k=1}^{K} \left\lparen \pi_{k} \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\rparen^{z_k} \end{align} $$

It’s important that introducing $\boldsymbol{z}$ does not change the marginal distribution of $\boldsymbol{x}$

$$ p(\boldsymbol{x} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{\boldsymbol{z}} p(\boldsymbol{z})\ p(\boldsymbol{x} \mid \boldsymbol{z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{k=1}^{K} \pi_k\ \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) $$

An important quantity is the conditional probability of $\boldsymbol{z}$ given $\boldsymbol{x}$, which indicates the likely underlying category of an observation. It is called the responsibility (响应度) that component $k$ takes for the observation $\boldsymbol{x}_n$, and is denoted by $\gamma_{nk}$.

$$ \begin{align} \gamma_{nk} &\triangleq p(z_k = 1 \mid \boldsymbol{x} = \boldsymbol{x}_n) \\ &= \frac{p(z_k = 1) p(\boldsymbol{x} = \boldsymbol{x}_n \mid z_k = 1)}{\sum_{i=1}^{K} p(z_i = 1) p(\boldsymbol{x} = \boldsymbol{x}_n \mid z_i = 1)} \end{align} $$
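
In code, computing the responsibilities amounts to one normalization per data point. A small NumPy/SciPy sketch (the array-shape conventions below are my own, not from the text):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, sigmas):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_i pi_i N(x_n | mu_i, Sigma_i).

    Shapes (assumed): X is (N, d), pi is (K,), mus is (K, d), sigmas is (K, d, d).
    """
    # Numerator pi_k * N(x_n | mu_k, Sigma_k) for every (n, k), shape (N, K).
    weighted = np.column_stack([
        pi[k] * multivariate_normal(mus[k], sigmas[k]).pdf(X) for k in range(len(pi))
    ])
    # Normalize each row so that the responsibilities of x_n sum to 1.
    return weighted / weighted.sum(axis=1, keepdims=True)
```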

## Learning Algorithms for Gaussian Mixture Models 

| | Maximum Likelihood Estimation (MLE) | Bayesian Estimation |
| --- | --- | --- |
| Probabilistic model | $p(\boldsymbol{X}, \boldsymbol{Z}\ ; \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = p(\boldsymbol{X} \mid \boldsymbol{Z}\ ; \boldsymbol{\mu}, \boldsymbol{\Sigma})\ p(\boldsymbol{Z} \mid \boldsymbol{\pi})$ | $p(\boldsymbol{X}, \boldsymbol{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = p(\boldsymbol{X} \mid \boldsymbol{Z}, \boldsymbol{\mu}, \boldsymbol{\Sigma})\ p(\boldsymbol{Z} \mid \boldsymbol{\pi})\ p(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ |
| Latent variable $\boldsymbol{Z}$ | Posterior calculation $p(\boldsymbol{Z} \mid \boldsymbol{X}\ ; \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ | Posterior calculation $p(\boldsymbol{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{X})$ |
| Parameters $\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}$ | Point estimation $\boldsymbol{\pi}^{\star}, \boldsymbol{\mu}^{\star}, \boldsymbol{\Sigma}^{\star} = \text{argmax}\ p(\boldsymbol{X}\ ; \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ | Posterior calculation $p(\boldsymbol{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{X})$ |
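
For orientation, both columns of the table have off-the-shelf counterparts in scikit-learn: `sklearn.mixture.GaussianMixture` fits point estimates of $\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}$ by EM (the MLE column), while `sklearn.mixture.BayesianGaussianMixture` places priors on the parameters and fits an approximate posterior by variational inference (one realization of the Bayesian column). A rough usage sketch on placeholder data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))   # placeholder data of shape (N, d)

# MLE column: point estimates of pi, mu, Sigma obtained by EM.
mle_gmm = GaussianMixture(n_components=3, covariance_type="full").fit(X)
print(mle_gmm.weights_, mle_gmm.means_, sep="\n")

# Bayesian column: priors over the parameters, approximate posterior via variational inference.
bayes_gmm = BayesianGaussianMixture(n_components=3, covariance_type="full").fit(X)
print(bayes_gmm.weights_)   # unneeded components tend to get weights close to zero
```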

### Maximum Likelihood Estimation 

#### Expectation Maximization 

EM for Gaussian mixtures (a runnable sketch of the full loop is given after this list):

  1. Randomly initialize parameters $\boldsymbol{\theta}_k = \lbrace \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k \rbrace$ of each Gaussian.

  2. Perform E-step and M-step

    • E-step: Evaluate the responsibility of each observation using the current parameters.

      Compute the current responsibilities

      $$ \gamma_{nk}^{(t)} = \frac{\pi_k^{(t)} \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_{i=1}^{K} \pi_i^{(t)} \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_i^{(t)}, \boldsymbol{\Sigma}_i^{(t)})} $$

      for $n = 1, 2, \dots, N$ and $k = 1, 2, \dots, K$.

    • M-step: Update parameters of each Gaussian by the current responsibilities.

      Update parameters of the $k$-th Gaussian by $$ \boldsymbol{\theta}_k^{(t+1)}: \begin{cases} \boldsymbol{\mu}_k^{(t+1)} = \displaystyle\frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}^{(t)} \boldsymbol{x}_n \\ \boldsymbol{\Sigma}_k^{(t+1)} = \displaystyle\frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}^{(t)} (\boldsymbol{x}_n - \boldsymbol{\mu}_k^{(t+1)})(\boldsymbol{x}_n - \boldsymbol{\mu}_k^{(t+1)})^T \\ \pi_k^{(t+1)} = \displaystyle\frac{N_k}{N} \end{cases} $$

      where

      $$ N_k = \sum_{n=1}^{N}\gamma_{nk}^{(t)} $$

      Derivation of the E-step and M-step for GMMs

      Suppose we are given a set of observed data $\mathcal{X} = \lbrace \boldsymbol{x}_1, \dots, \boldsymbol{x}_N \rbrace$, a set of latent data $\mathcal{Z} = \lbrace \boldsymbol{z}_1, \dots, \boldsymbol{z}_N \rbrace$, and the set of parameters of the GMM $\boldsymbol{\theta} = \lbrace \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma} \rbrace = \lbrace \pi_1, \dots, \pi_K, \boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_1, \dots, \boldsymbol{\Sigma}_K \rbrace$.

      First, we evaluate the log-likelihood of the complete data

      $$ \begin{align} \log{p(\mathcal{X}, \mathcal{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})} &= \log{\left\lparen \prod_{n=1}^{N} p(\boldsymbol{x}_n, \boldsymbol{z}_n \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) \right\rparen} \\ &= \log{\left\lparen \prod_{n=1}^{N} \prod_{k=1}^{K} (\pi_k \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k))^{z_{nk}} \right\rparen} \\ &= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \log{\left\lparen \pi_k \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\rparen} \end{align} $$

      Then, the Q-function of GMM in the EM algorithm is given by

      $$ \begin{align} Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\Sigma}^{(t)}) &= \mathbb{E}_{\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\Sigma}^{(t)}} \left\lbrack \log{p(\mathcal{X}, \mathcal{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})} \right\rbrack \\ &= \mathbb{E}_{\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\Sigma}^{(t)}} \left\lbrack \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \log{\left\lparen \pi_k \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\rparen} \right\rbrack \\ &= \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}_{\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\Sigma}^{(t)}} \left\lbrack z_{nk} \right\rbrack \log{\left\lparen \pi_k \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\rparen} \\ &= \color{blue}{\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{(t)} \log{\left\lparen \pi_k \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\rparen}} \end{align} $$

      where

      $$ \gamma_{nk}^{(t)} = \frac{\pi_k^{(t)} \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_{i=1}^{K} \pi_i^{(t)} \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_i^{(t)}, \boldsymbol{\Sigma}_i^{(t)})} $$

      Take the derivative of the Q-function w.r.t. the mean of the k-th Gaussian

      $$ \begin{align} \frac{\partial Q}{\partial \boldsymbol{\mu}_k} &= \frac{\partial \left\lparen \sum_{n=1}^{N} \sum_{j=1}^{K} \gamma_{nj}^{(t)} \log{\left\lparen \pi_j \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \right\rparen} \right\rparen}{\partial \boldsymbol{\mu}_k} \\ &= \sum_{n=1}^{N} \gamma_{nk}^{(t)} \frac{1}{\pi_k \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)} \cdot \pi_k \cdot \frac{\partial \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\partial \boldsymbol{\mu}_k} \\ &= \sum_{n=1}^{N} \gamma_{nk}^{(t)} \frac{\partial (-\frac{1}{2}(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}_k^{-1}(\boldsymbol{x}_n - \boldsymbol{\mu}_k))}{\partial \boldsymbol{\mu}_k} \end{align} $$

      where the last step uses $\frac{\partial \mathcal{N}}{\partial \boldsymbol{\mu}_k} = \mathcal{N} \cdot \frac{\partial \log{\mathcal{N}}}{\partial \boldsymbol{\mu}_k}$ together with the fact that only the quadratic term of $\log{\mathcal{N}}$ depends on $\boldsymbol{\mu}_k$.

      $$ \begin{align} \frac{\partial (-\frac{1}{2}(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}_k^{-1}(\boldsymbol{x}_n - \boldsymbol{\mu}_k))}{\partial \boldsymbol{\mu}_k} &\overset{\boldsymbol{y} = \boldsymbol{x}_n - \boldsymbol{\mu}_k}{=} -\frac{1}{2} \frac{\partial \left\lparen \boldsymbol{y}^T\boldsymbol{\Sigma}_k^{-1}\boldsymbol{y} \right\rparen}{\partial \boldsymbol{y}} \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{\mu}_k} \\ &= -\frac{1}{2} \cdot (2\boldsymbol{\Sigma}_k^{-1}\boldsymbol{y}) \cdot (-\boldsymbol{I}) \\ &= \boldsymbol{\Sigma}_k^{-1} \boldsymbol{y} \\ &= \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}_k) \end{align} $$

      Therefore,

      $$ \begin{align} &\frac{\partial Q}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^{N} \gamma_{nk}^{(t)} \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}_k) = 0 \\ &\Rightarrow \boldsymbol{\Sigma}_k^{-1} \sum_{n=1}^{N} \gamma_{nk}^{(t)} \boldsymbol{x}_n = \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k \sum_{n=1}^{N} \gamma_{nk}^{(t)} \\ &\Rightarrow \color{blue}{\boldsymbol{\mu}_k = \frac{\sum_{n=1}^{N} \gamma_{nk}^{(t)} \boldsymbol{x}_n}{\sum_{n=1}^{N} \gamma_{nk}^{(t)}}} \end{align} $$

      Take the derivative of the Q-function w.r.t. the covariance matrix of the k-th Gaussian

      $$ \begin{align} \frac{\partial Q}{\partial \boldsymbol{\Sigma}_k} &= \frac{\partial \left\lparen \sum_{n=1}^{N} \sum_{j=1}^{K} \gamma_{nj}^{(t)} \log{\left\lparen \pi_j \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \right\rparen} \right\rparen}{\partial \boldsymbol{\Sigma}_k} \\ &= \sum_{n=1}^{N} \gamma_{nk}^{(t)} \frac{\partial \log{\pi_k \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}}{\partial \boldsymbol{\Sigma}_k} \end{align} $$

      Since, writing $m$ for the dimensionality of $\boldsymbol{x}$,

      $$ \begin{align} \log{\pi_k \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)} &= \log{\pi_k} + \log{\mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)} \\ &= \log{\pi_k} - \frac{m}{2}\log{2\pi} - \frac{1}{2}\log{|\boldsymbol{\Sigma}_k|} - \frac{1}{2}(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}_k) \\ \frac{\partial \log{\pi_k \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}}{\partial \boldsymbol{\boldsymbol{\Sigma}}_k} &= -\frac{1}{2} \left\lparen \frac{\partial \log{|\boldsymbol{\Sigma}_k|}}{\partial \boldsymbol{\Sigma}_k} + \frac{\partial (\boldsymbol{x}_n - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1}(\boldsymbol{x}_n - \boldsymbol{\mu}_k)}{\partial \boldsymbol{\Sigma}_k} \right\rparen \end{align} $$

      Take the derivatives of these two terms separately:

      $$ \begin{align} & \frac{\partial \log{|\boldsymbol{\Sigma}_k|}}{\partial \boldsymbol{\Sigma}_k} = \boldsymbol{\Sigma}_k^{-1} \\ & \frac{\partial (\boldsymbol{x}_n - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}_k^{-1}(\boldsymbol{x}_n - \boldsymbol{\mu}_k)}{\partial \boldsymbol{\Sigma}_k} = -\boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} \end{align} $$

      Combining them gives

      $$ \begin{align} &\frac{\partial Q}{\partial \boldsymbol{\Sigma}_k} = -\frac{1}{2} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\boldsymbol{\Sigma}_k^{-1} -\boldsymbol{\Sigma}_k^{-1}(\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1}) = 0 \\ &\Rightarrow \left\lparen \sum_{n=1}^{N} \gamma_{nk}^{(t)} \right\rparen \boldsymbol{\Sigma}_k^{-1} = \boldsymbol{\Sigma}_k^{-1} \left\lparen \sum_{n=1}^{N} \gamma_{nk}^{(t)}(\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^T \right\rparen \boldsymbol{\Sigma}_k^{-1} \\ &\Rightarrow \boldsymbol{\Sigma}_k \left\lparen \sum_{n=1}^{N} \gamma_{nk}^{(t)} \right\rparen = \sum_{n=1}^{N} \gamma_{nk}^{(t)}(\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^T \\ &\Rightarrow \color{blue}{\boldsymbol{\Sigma}_k = \frac{\sum_{n=1}^{N} \gamma_{nk}^{(t)}(\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^T}{\sum_{n=1}^{N} \gamma_{nk}^{(t)}}} \end{align} $$

      Note that in the update rule for $\boldsymbol{\Sigma}_k$, we are supposed to use the newly computed means $\boldsymbol{\mu}_k^{(t+1)}$.

      For the mixing weights, maximizing the Q-function is a constrained optimization problem

      $$ \begin{align} \text{maximize}&\quad Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\Sigma}^{(t)}) \\ \text{s.t.}&\quad g(\boldsymbol{\pi}) = \sum_{k=1}^K \pi_k - 1 = 0 \end{align} $$

      We first formulate the Lagrangian

      $$ \begin{align} \mathcal{L}(\pi_k, \lambda) &= Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\Sigma}^{(t)}) + \lambda g(\boldsymbol{\pi}) \\ &= \sum_{n=1}^{N} \sum_{j=1}^{K} \gamma_{nj}^{(t)} \log{\left\lparen \pi_j \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \right\rparen} + \lambda (\sum_{j=1}^K \pi_j - 1) \\ \end{align} $$

      Differentiating the Lagrangian w.r.t. $\pi_k$ gives

      $$ \begin{align} \frac{\partial \mathcal{L}(\pi_k, \lambda)}{\partial \pi_k} &= \sum_{n=1}^{N} \gamma_{nk}^{(t)} \frac{1}{\pi_k \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)} \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) + \lambda \\ &= \frac{\sum_{n=1}^{N} \gamma_{nk}^{(t)}}{\pi_k} + \lambda \end{align} $$

      Setting the derivative to zero gives

      $$ \begin{align} &\ \frac{\partial \mathcal{L}(\pi_k, \lambda)}{\partial \pi_k} = 0 \\ \Rightarrow&\ \frac{\sum_{n=1}^{N} \gamma_{nk}^{(t)}}{\pi_k} + \lambda = 0 \\ \Rightarrow&\ \pi_k = -\frac{\sum_{n=1}^{N} \gamma_{nk}^{(t)}}{\lambda} \end{align} $$

      By the constraint,

      $$ \sum_{k=1}^K \pi_k = \sum_{k=1}^K -\frac{\sum_{n=1}^{N} \gamma_{nk}^{(t)}}{\lambda} = -\frac{\sum_{n=1}^{N} \sum_{k=1}^K \gamma_{nk}^{(t)}}{\lambda} = 1 $$

      Since

      $$ \sum_{k=1}^K \gamma_{nk}^{(t)} = 1 $$

      we have

      $$ \lambda = -N $$

      Therefore,

      $$ \pi_k = -\frac{\sum_{n=1}^{N} \gamma_{nk}^{(t)}}{\lambda} = \color{blue}{\frac{\sum_{n=1}^{N} \gamma_{nk}^{(t)}}{N}} $$

  3. Evaluate the log-likelihood and check for convergence; if it has not converged, repeat step 2.

    $$ \begin{align} \log{p(\mathcal{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})} &= \log{\prod_{\boldsymbol{x} \in \mathcal{X}} p(\boldsymbol{x} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})} \\ &= \sum_{i=1}^{N} \log{p(\boldsymbol{x}_i \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})} \\ &= \sum_{i=1}^{N} \log{\left\lparen \sum_{k=1}^{K} \pi_k \ \mathcal{N}(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\rparen} \end{align} $$
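
Putting the E-step, the M-step, and the convergence check together, below is a compact NumPy/SciPy sketch of the whole loop; the initialization and stopping rule are simple choices of my own, not prescribed by the derivation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """Fit a K-component GMM to X (shape (N, d)) by EM; returns (pi, mus, sigmas)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Simple initialization: uniform weights, K random data points as means,
    # and the (slightly regularized) overall data covariance for every component.
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    prev_ll = -np.inf

    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] under the current parameters.
        weighted = np.column_stack([
            pi[k] * multivariate_normal(mus[k], sigmas[k]).pdf(X) for k in range(K)
        ])
        gamma = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: re-estimate the parameters from the current responsibilities.
        Nk = gamma.sum(axis=0)                    # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]         # weighted means
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pi = Nk / N

        # Observed-data log-likelihood at the parameters used in this E-step.
        ll = np.log(weighted.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    return pi, mus, sigmas
```

A production implementation such as `sklearn.mixture.GaussianMixture` additionally computes the responsibilities in log space and regularizes the covariances (its `reg_covar` parameter) for numerical stability.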

### Bayesian Estimation 

#### Gibbs Sampling 

T.B.D.

#### Variational Bayes 

T.B.D.

