GMM

EM Algorithm for Gaussian Mixture

The goal is to estimate a three-component Gaussian mixture model, where \(\sigma_j\) denotes the standard deviation of component \(j\):

\begin{eqnarray*} P(Z = j) &=& \phi_j, \quad j=1, 2, 3\\ (X|Z=j) &\sim& N(\mu_j, \sigma_j^2). \end{eqnarray*}

When \(Z\) Can Be Observed

Denote by \(p(x; \mu, \sigma)\) the probability density function of a normal distribution with mean \(\mu\) and standard deviation \(\sigma\):

\[p(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}.\]

If the values of \(Z\) are observed, the so-called complete-data likelihood of a single sample point \((x_i, z_i)\) is

\[\phi_1p(x_i; \mu_1, \sigma_1)1_{\{z_i = 1\}} + \phi_2p(x_i; \mu_2, \sigma_2)1_{\{z_i = 2\}} + \phi_3p(x_i; \mu_3, \sigma_3)1_{\{z_i = 3\}}.\]

Note that this is different from the actual density function of the Gaussian mixture distribution because of the indicator functions. Although this formula has three terms, the indicator functions ensure that only one of them is nonzero, so the log-likelihood is

\[\log(\phi_1 p(x_i; \mu_1, \sigma_1))1_{\{z_i = 1\}} + \log(\phi_2 p(x_i; \mu_2, \sigma_2))1_{\{z_i = 2\}} + \log(\phi_3 p(x_i; \mu_3, \sigma_3))1_{\{z_i = 3\}}.\]

Here the indicator functions are NOT in the logarithm and again, only one term will be nonzero.
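
The difference between the complete-data log-likelihood and the log-density of the mixture can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the parameter values are hypothetical, chosen only for illustration, and components are indexed 0, 1, 2 instead of 1, 2, 3.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density p(x; mu, sigma) of a normal distribution with standard deviation sigma."""
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))

def complete_log_likelihood(x, z, phi, mu, sigma):
    """Log-likelihood of one labelled point (x, z).

    The indicator 1_{z = j} is realised by indexing: only component z contributes.
    """
    return math.log(phi[z] * normal_pdf(x, mu[z], sigma[z]))

def mixture_log_density(x, phi, mu, sigma):
    """Log-density of the mixture itself, where the label is not observed."""
    return math.log(sum(phi[j] * normal_pdf(x, mu[j], sigma[j])
                        for j in range(len(phi))))

# Hypothetical parameter values, for illustration only.
phi = [0.5, 0.3, 0.2]
mu = [-2.0, 0.0, 3.0]
sigma = [1.0, 0.5, 2.0]

ll_complete = complete_log_likelihood(0.0, 1, phi, mu, sigma)
ll_mixture = mixture_log_density(0.0, phi, mu, sigma)
```

Because the complete-data likelihood keeps a single term of the mixture sum, `ll_complete` can never exceed `ll_mixture` at the same point.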

Given \(m\) sample points \((x_1, z_1), (x_2, z_2), \ldots, (x_m, z_m)\), the log-likelihood is thus

\begin{eqnarray*} &&\sum_{i=1}^m \left(\log(\phi_1 p(x_i; \mu_1, \sigma_1))1_{\{z_i = 1\}} + \log(\phi_2 p(x_i; \mu_2, \sigma_2))1_{\{z_i = 2\}} + \log(\phi_3 p(x_i; \mu_3, \sigma_3))1_{\{z_i = 3\}} \right)\\ &=& \sum_{1\leq i \leq m, z_i=1} \log(\phi_1 p(x_i; \mu_1, \sigma_1)) + \sum_{1\leq i \leq m, z_i=2} \log(\phi_2 p(x_i; \mu_2, \sigma_2)) + \sum_{1\leq i \leq m, z_i=3} \log(\phi_3 p(x_i; \mu_3, \sigma_3))\\ &=& \left(\sum_{1\leq i \leq m, z_i=1} \log(p(x_i; \mu_1, \sigma_1)) + \sum_{1\leq i \leq m, z_i=2} \log(p(x_i; \mu_2, \sigma_2)) + \sum_{1\leq i \leq m, z_i=3} \log(p(x_i; \mu_3, \sigma_3))\right) + m_1\log\phi_1 + m_2\log\phi_2 + m_3\log\phi_3, \end{eqnarray*}

where \(m_j\) is the number of sample points where \(z_i = j\).

Only the first summation depends on \(\mu_1, \sigma_1\), so the usual MLE argument for a normal distribution applies, and the same holds for \(\mu_2, \sigma_2\) and \(\mu_3, \sigma_3\). As a result, the estimators for \(\mu_j, \sigma_j\) are the sample mean and the sample standard deviation of the points labelled \(j\):

\[\mu_j = \frac{1}{m_j}\sum_{1\leq i \leq m, z_i=j} x_i, \qquad\sigma_j = \sqrt{\frac{1}{m_j}\sum_{1\leq i \leq m, z_i=j}(x_i - \mu_j)^2}, \qquad j=1, 2, 3.\]

Finding the estimator for \(\phi_j\) amounts to maximizing \(m_1\log\phi_1 + m_2\log\phi_2 + m_3\log\phi_3\) subject to \(\phi_1 + \phi_2 + \phi_3 = 1\). The method of Lagrange multipliers says the estimator must satisfy

\[\nabla (m_1\log\phi_1 + m_2\log\phi_2 + m_3\log\phi_3) \parallel \nabla (\phi_1 + \phi_2 + \phi_3),\]

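or equivalently \((m_1/\phi_1, m_2/\phi_2, m_3/\phi_3) \parallel (1, 1, 1)\), that is, \(m_j/\phi_j\) equals a common constant \(\lambda\) for every \(j\). The constraint \(\phi_1 + \phi_2 + \phi_3 = 1\) then gives \(\lambda = m_1 + m_2 + m_3 = m\). Thus

\[\phi_j = \frac{m_j}{m}, \qquad j=1, 2, 3.\]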

Alternatively, the AM–GM inequality can be used to maximize \(\phi_1^{m_1}\phi_2^{m_2}\phi_3^{m_3}\) directly, and it yields the same estimator.
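
When the labels are observed, the estimators above reduce to per-group counts, means, and standard deviations. The following is a minimal sketch of that computation; `fit_labelled` and the toy data are hypothetical, labels are taken to be 0, 1, 2 instead of 1, 2, 3, and every group is assumed to be nonempty.

```python
import math

def fit_labelled(xs, zs, k=3):
    """Complete-data MLE of (phi, mu, sigma) for a k-component Gaussian mixture.

    Assumes labels in {0, ..., k-1} and at least one point per label.
    """
    m = len(xs)
    phi, mu, sigma = [], [], []
    for j in range(k):
        group = [x for x, z in zip(xs, zs) if z == j]
        m_j = len(group)
        mu_j = sum(group) / m_j
        # MLE of the variance divides by m_j, not m_j - 1.
        var_j = sum((x - mu_j) ** 2 for x in group) / m_j
        phi.append(m_j / m)
        mu.append(mu_j)
        sigma.append(math.sqrt(var_j))
    return phi, mu, sigma

# Toy labelled sample, for illustration only.
xs = [1.0, 3.0, 10.0, 12.0, 14.0, 20.0, 22.0]
zs = [0, 0, 1, 1, 1, 2, 2]
phi, mu, sigma = fit_labelled(xs, zs)
```

Here the first group \(\{1, 3\}\) has mean 2 and standard deviation 1, and the mixing proportions are the label frequencies \(2/7, 3/7, 2/7\).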

When \(Z\) Is a Latent Random Variable

When the value of \(Z\) cannot be observed, the EM algorithm is used for estimation: every expression involving the observation \(z_i\) in the above estimators, say \(g(z_i)\), is replaced by the conditional expectation \(E[g(z_i)|X=x_i]\). Computing \(E[g(z_i)|X=x_i]\) from the current estimates of all model parameters and the observation \(x_i\) is called the E-step. Maximizing the log-likelihood function with \(E[g(z_i)|X=x_i]\) substituted for \(g(z_i)\) is called the M-step. The EM algorithm alternates between E-steps and M-steps until convergence.

To derive the M-step, first rewrite the estimators derived above using indicator functions to make expressions involving \(z_i\) more explicit:

\begin{eqnarray*} \mu_j &=& \frac{\sum_{i=1}^{m} x_i 1_{\{z_i = j\}}}{\sum_{i=1}^{m} 1_{\{z_i = j\}}}, \\ \sigma_j &=& \sqrt{\frac{\sum_{i=1}^{m}(x_i - \mu_j)^21_{\{z_i = j\}}}{\sum_{i=1}^{m} 1_{\{z_i = j\}}}}, \\ \phi_j &=& \frac{\sum_{i=1}^{m} 1_{\{z_i = j\}}}{m}. \end{eqnarray*}

Now replace all \(1_{\{z_i = j\}}\) by the conditional expectation

\[E\left[1_{\{z_i = j\}}\big|X=x_i\right] = P(z_i = j|X=x_i),\]

which we denote by \(w_i^{(j)}\) for simplicity:

\begin{eqnarray*} \mu_j &=& \frac{\sum_{i=1}^{m} x_i w_i^{(j)}}{\sum_{i=1}^{m} w_i^{(j)}}, \\ \sigma_j &=& \sqrt{\frac{\sum_{i=1}^{m}(x_i - \mu_j)^2 w_i^{(j)}}{\sum_{i=1}^{m} w_i^{(j)}}}, \\ \phi_j &=& \frac{\sum_{i=1}^{m} w_i^{(j)}}{m}. \end{eqnarray*}

These are the formulas to update estimates in the M-step.

The closed-form formula for \(w_i^{(j)}\) can be obtained from Bayes' rule, in its mixed form with one discrete and one continuous random variable:

\begin{eqnarray*} w_i^{(j)} &=& P(z_i = j|X=x_i) \\ &=& \frac{p(x_i; \mu_j, \sigma_j)\phi_j}{p(x_i; \mu_1, \sigma_1)\phi_1 + p(x_i; \mu_2, \sigma_2)\phi_2 + p(x_i; \mu_3, \sigma_3)\phi_3}. \end{eqnarray*}

This is the formula to update the estimate for \(w_i^{(j)}\) in the E-step.
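
The E-step and M-step formulas can be put together into a short loop. The following is a minimal sketch using only the Python standard library. The function names, the synthetic data, the initial values, and the fixed iteration count are all assumptions made for illustration; a practical implementation would also guard against a component's \(\sigma_j\) collapsing to zero and would stop on a convergence criterion.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))

def e_step(xs, phi, mu, sigma):
    """Responsibilities w[i][j] = P(z_i = j | X = x_i) under the current parameters."""
    w = []
    for x in xs:
        joint = [phi[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(len(phi))]
        total = sum(joint)
        w.append([p / total for p in joint])
    return w

def m_step(xs, w):
    """Weighted versions of the complete-data estimators."""
    m, k = len(xs), len(w[0])
    phi, mu, sigma = [], [], []
    for j in range(k):
        n_j = sum(w[i][j] for i in range(m))  # effective count of component j
        mu_j = sum(w[i][j] * xs[i] for i in range(m)) / n_j
        var_j = sum(w[i][j] * (xs[i] - mu_j) ** 2 for i in range(m)) / n_j
        phi.append(n_j / m)
        mu.append(mu_j)
        sigma.append(math.sqrt(var_j))
    return phi, mu, sigma

def log_likelihood(xs, phi, mu, sigma):
    """Observed-data log-likelihood of the mixture."""
    return sum(math.log(sum(phi[j] * normal_pdf(x, mu[j], sigma[j])
                            for j in range(len(phi)))) for x in xs)

# Synthetic sample from three well-separated components (illustration only).
random.seed(0)
true_mu, true_sigma = [-5.0, 0.0, 6.0], [1.0, 0.5, 1.5]
xs = [random.gauss(true_mu[j], true_sigma[j]) for j in range(3) for _ in range(200)]

# Arbitrary initial guesses.
phi, mu, sigma = [1 / 3, 1 / 3, 1 / 3], [-1.0, 0.5, 2.0], [3.0, 3.0, 3.0]
history = []
for _ in range(50):
    w = e_step(xs, phi, mu, sigma)
    phi, mu, sigma = m_step(xs, w)
    history.append(log_likelihood(xs, phi, mu, sigma))
```

The list `history` records the observed-data log-likelihood after each iteration; a standard property of EM is that this sequence does not decrease, which makes it a convenient sanity check.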

Mixture Density Networks (MDN)

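EM learns a static distribution. An MDN also learns the parameters of a Gaussian mixture, but they are dynamic: they change with the input, so the network learns \(\mu_j(\mathbf x), \sigma_j(\mathbf x), \phi_j(\mathbf x)\).
The following transformations guarantee \(\sigma_j > 0\) and \(\sum_j \phi_j = 1\); the quantities actually learned are \(z^\alpha_j, z^\sigma_j, z^\mu_j\).

\[\phi_j = \frac{\exp(z_j^\alpha)}{\sum_k \exp(z_k^\alpha)}, \qquad\sigma_j = \exp(z_j^\sigma), \qquad\mu_j = z_j^\mu.\]

The cost function is the negative log-likelihood, evaluated at the sample target value \(y\), of the distribution constructed from the network outputs (\(z^\alpha_j, z^\sigma_j, z^\mu_j\)). The weights of the neural network are then learned by back-propagation.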
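
The cost for a single sample can be sketched as follows, using only the Python standard library. The function name and the example values are hypothetical; the network itself and back-propagation are omitted, so the three lists simply stand in for the raw outputs \(z^\alpha_j, z^\sigma_j, z^\mu_j\) at one input. Subtracting the maximum before exponentiating is a numerical-stability choice added here; it does not change the value of the softmax.

```python
import math

def normal_pdf(x, mu, sigma):
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))

def mdn_negative_log_likelihood(z_alpha, z_sigma, z_mu, y):
    """Negative log-likelihood of the target y under the mixture built from raw outputs."""
    # Softmax gives mixing proportions that are positive and sum to one.
    shift = max(z_alpha)
    exps = [math.exp(a - shift) for a in z_alpha]
    total = sum(exps)
    phi = [e / total for e in exps]
    # The exponential keeps every standard deviation positive.
    sigma = [math.exp(s) for s in z_sigma]
    # Means are used as they are.
    mu = list(z_mu)
    density = sum(phi[j] * normal_pdf(y, mu[j], sigma[j]) for j in range(len(phi)))
    return -math.log(density)

# Hypothetical raw outputs for one input; all zeros gives three identical
# standard normal components with equal weights.
nll = mdn_negative_log_likelihood([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], 0.0)
```

With these all-zero outputs the mixture is a standard normal, so the cost at \(y = 0\) is \(\frac{1}{2}\log(2\pi)\).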