結婚制度の廃止を望んでやまない

前回1変数をやったので多変量もやりたい.

導出

分布の定義

まずは多変量正規分布である. $D$ は $\textbf x$ の次元数.

$\begin{eqnarray} \mathcal N \left(\textbf x \mid \boldsymbol \mu, \boldsymbol \Lambda^{-1} \right) &=& \frac {\left| \boldsymbol \Lambda \right|^\frac {1} {2}} {(2\pi)^{\frac{D}{2}}} {\rm exp }\left( - \frac {1} {2} \left(\textbf x - \boldsymbol \mu\right)^{\rm T} \Lambda \left(\textbf x - \boldsymbol \mu\right) \right) \end{eqnarray}$

つづいてWishart分布である. $\boldsymbol \Lambda \left(D\times D\right)$ について

$\begin{eqnarray} \mathcal W \left(\boldsymbol \Lambda \mid \textbf W^{-1}, \nu \right) &=& {\frac {\left|\boldsymbol \Lambda \right|^{\frac {\nu-D-1}{2}}{\rm exp} \left( \frac {-\operatorname {tr} (\mathbf {W} ^{-1}{\boldsymbol \Lambda })} {2}\right)} {2^{\frac {\nu D}{2}}\left|{\mathbf {W} }\right|^{\frac {\nu} {2}}\Gamma _{D}\left({\frac {\nu}{2}}\right)}} \end{eqnarray}$

統計モデル

データ $\textbf x = \lbrace \textbf x_1, ..., \textbf x_N\rbrace$ は多変量正規分布に従う. 事前分布をNormal-Wishart分布とする.

$\begin{eqnarray} p\left(\boldsymbol \mu, \boldsymbol \Lambda \mid \textbf x \right) &=& \prod_{i=1}^N \mathcal N \left(\textbf x_i \mid \boldsymbol \mu, \boldsymbol \Lambda^{-1} \right) \mathcal N \left(\boldsymbol \mu \mid \boldsymbol \mu_0, \left(\lambda_0 \boldsymbol \Lambda\right)^{-1} \right) \mathcal W \left(\boldsymbol \Lambda \mid \textbf W_0^{-1}, \nu_0 \right) \end{eqnarray}$

$\mu_0, \lambda_0, \textbf W_0, \nu_0$ が事前分布のパラメータである.

$\boldsymbol \mu$ と $\boldsymbol \Lambda$ の事後分布

例によって対数をとって考える.

$\begin{eqnarray} \operatorname {ln} p\left(\boldsymbol \mu, \boldsymbol \Lambda \mid \textbf x \right) &=& \frac {N} {2} \operatorname {ln} \left| \boldsymbol \Lambda \right| - \frac {1} {2} \sum_{i=1}^N\left( \left(\textbf x_i - \boldsymbol \mu\right)^{\rm T} \boldsymbol \Lambda \left(\textbf x_i - \boldsymbol \mu\right) \right) + \frac {1} {2} \operatorname {ln} \left|\boldsymbol \Lambda \right| - \frac {\lambda_0} {2} \left(\boldsymbol \mu - \boldsymbol \mu_0\right)^{\rm T} \boldsymbol \Lambda \left(\boldsymbol \mu - \boldsymbol \mu_0\right) \\ && + \frac {\nu_{0} - D - 1} {2}\operatorname {ln} \left|\boldsymbol \Lambda \right| + \frac {-\operatorname {tr} (\mathbf {W_0} ^{-1}\mathbf {\boldsymbol \Lambda})} {2} + {\rm const} \\ &=& \frac {N + \nu_{0} - D} {2} \operatorname {ln} \left| \boldsymbol \Lambda \right| - \frac {1} {2} \underbrace {\left( \sum_{i=1}^N\left( \left(\textbf x_i - \boldsymbol \mu\right)^{\rm T} \boldsymbol \Lambda \left(\textbf x_i - \boldsymbol \mu\right) \right) + \lambda_0 \left(\boldsymbol \mu - \boldsymbol \mu_0\right)^{\rm T} \boldsymbol \Lambda \left(\boldsymbol \mu - \boldsymbol \mu_0\right) \right)}_{=: A} - \frac {\operatorname {tr} (\mathbf {W_0} ^{-1}\mathbf {\boldsymbol \Lambda})} {2} + {\rm const} \\ \end{eqnarray}$

ひとまず $\mu$ の項である $A$ をやっつけるとする.

$\begin {eqnarray} A &=& \sum_{i=1}^N\left( \left(\textbf x_i - \boldsymbol \mu\right)^{\rm T} \boldsymbol \Lambda \left(\textbf x_i - \boldsymbol \mu\right) \right) + \lambda_0\left(\boldsymbol \mu - \boldsymbol \mu_0\right)^{\rm T} \boldsymbol \Lambda \left(\boldsymbol \mu - \boldsymbol \mu_0\right) \\ &=& \sum_{i=1}^N \left( {\textbf x_i}^{\rm T} \boldsymbol \Lambda \textbf x_i \right) - \boldsymbol \mu^{\rm T} \boldsymbol \Lambda \sum_{i=1}^N \textbf x_i - \sum_{i=1}^N {\textbf x_i}^{\rm T} \boldsymbol \Lambda \boldsymbol \mu + N \boldsymbol \mu^{\rm T} \boldsymbol \Lambda \boldsymbol \mu + \lambda_0 \boldsymbol \mu^{\rm T} \boldsymbol \Lambda \boldsymbol \mu - \lambda_0 \boldsymbol \mu_0^{\rm T} \boldsymbol \Lambda \boldsymbol \mu - \lambda_0 \boldsymbol \mu^{\rm T} \boldsymbol \Lambda \boldsymbol \mu_0 + \lambda_0 \boldsymbol \mu_0^{\rm T} \boldsymbol \Lambda \boldsymbol \mu_0 \\ &=& \left(N + \lambda_0\right)\boldsymbol \mu^{\rm T} \boldsymbol \Lambda \boldsymbol \mu - \boldsymbol \mu^{\rm T} \boldsymbol \Lambda \left( \sum_{i=1}^N \textbf x_i + \lambda_0 \boldsymbol \mu_0 \right) - \left( \sum_{i=1}^N \textbf x_i + \lambda_0 \boldsymbol \mu_0 \right)^{\rm T} \boldsymbol \Lambda \boldsymbol \mu + \sum_{i=1}^N \left( {\textbf x_i}^{\rm T} \boldsymbol \Lambda \textbf x_i \right) + \lambda_0 \boldsymbol \mu_0^{\rm T} \boldsymbol \Lambda \boldsymbol \mu_0 \\ &=& (N + \lambda_0) \left( \boldsymbol \mu - \frac {N\overline {\textbf x} + \lambda_0 \boldsymbol \mu_0} {N + \lambda_0} \right)^{\rm T} \boldsymbol \Lambda \left( \boldsymbol \mu - \frac {N\overline {\textbf x} + \lambda_0 \boldsymbol \mu_0} {N + \lambda_0} \right) \underbrace {- \frac { \left(N\overline {\textbf x} + \lambda_0 \boldsymbol \mu_0\right)^{\rm T} \boldsymbol \Lambda \left(N\overline {\textbf x} + \lambda_0 \boldsymbol \mu_0\right) } {N + \lambda_0} + \sum_{i=1}^N \left( {\textbf x_i}^{\rm T} \boldsymbol \Lambda \textbf x_i \right) + \lambda_0 \boldsymbol \mu_0^{\rm T} \boldsymbol \Lambda \boldsymbol \mu_0}_{=: B} \\ \end {eqnarray}$

$B$ をやっつける.

$\begin{eqnarray} B &=& - \frac { \left(N\overline {\textbf x} + \lambda_0 \boldsymbol \mu_0\right)^{\rm T} \boldsymbol \Lambda \left(N\overline {\textbf x} + \lambda_0 \boldsymbol \mu_0\right) } {N + \lambda_0} + \sum_{i=1}^N \left( {\textbf x_i}^{\rm T} \boldsymbol \Lambda \textbf x_i \right) + \lambda_0 \boldsymbol \mu_0^{\rm T} \boldsymbol \Lambda \boldsymbol \mu_0 \\ &=& - \frac {{\rm tr}\left( \left(N\overline {\textbf x} + \lambda_0 \boldsymbol \mu_0\right) \left(N\overline {\textbf x} + \lambda_0 \boldsymbol \mu_0\right)^{\rm T} \boldsymbol \Lambda \right)} {N + \lambda_0} + {\rm tr}\left( \sum_{i=1}^N \left( \textbf x_i {\textbf x_i}^{\rm T} \right) \boldsymbol \Lambda \right) + {\rm tr} \left( \lambda_0 \boldsymbol \mu_0 \boldsymbol \mu_0^{\rm T} \boldsymbol \Lambda \right) \\ &=& \frac {1}{N+\lambda_0} {\rm tr} \left(\left( - N^2 \overline {\textbf x}\overline {\textbf x}^{\rm T} - N \lambda_0 \overline {\textbf x}\boldsymbol \mu_0^{\rm T} - N \lambda_0 \boldsymbol \mu_0 \overline {\textbf x}^{\rm T} - \lambda_0^2 \boldsymbol \mu_0 \boldsymbol \mu_0^{\rm T} + N \sum_{i=1}^N \left( \textbf x_i {\textbf x_i}^{\rm T} \right) + \lambda_0 \sum_{i=1}^N \left( \textbf x_i {\textbf x_i}^{\rm T} \right) + N \lambda_0 \boldsymbol \mu_0 \boldsymbol \mu_0^{\rm T} + \lambda_0^2 \boldsymbol \mu_0 \boldsymbol \mu_0^{\rm T} \right)\boldsymbol \Lambda \right) \\ &=& \frac {1}{N+\lambda_0} {\rm tr} \left(\left( - N^2 \overline {\textbf x}\overline {\textbf x}^{\rm T} - N \lambda_0 \overline {\textbf x}\boldsymbol \mu_0^{\rm T} - N \lambda_0 \boldsymbol \mu_0 \overline {\textbf x}^{\rm T} + N \sum_{i=1}^N \left( \textbf x_i {\textbf x_i}^{\rm T} \right) + \lambda_0 \sum_{i=1}^N \left( \textbf x_i {\textbf x_i}^{\rm T} \right) + N \lambda_0 \boldsymbol \mu_0 \boldsymbol \mu_0^{\rm T} \right)\boldsymbol \Lambda \right) \\ &=& \frac {1}{N+\lambda_0} {\rm tr} \left(\left( N\lambda _0 \left(\mu_0 - \overline {\textbf x}\right) \left(\mu_0 - \overline {\textbf x}\right)^{\rm T} - N\left( N + \lambda_0\right) \overline {\textbf x}\overline {\textbf x}^{\rm T} + \left( N + \lambda_0 \right) \sum_{i=1}^N \left( \textbf x_i {\textbf x_i}^{\rm T} \right) \right)\boldsymbol \Lambda \right) \\ &=& \frac {N\lambda _0}{N+\lambda_0} {\rm tr} \left( \left(\mu_0 - \overline {\textbf x}\right) \left(\mu_0 - \overline {\textbf x}\right)^{\rm T} \boldsymbol \Lambda \right) + {\rm tr} \left(\left( - N\overline {\textbf x}\overline {\textbf x}^{\rm T} + \sum_{i=1}^N \left( \textbf x_i {\textbf x_i}^{\rm T} \right) \right)\boldsymbol \Lambda \right) \\ &=& \frac {N\lambda _0}{N+\lambda_0} {\rm tr} \left( \left(\mu_0 - \overline {\textbf x}\right) \left(\mu_0 - \overline {\textbf x}\right)^{\rm T} \boldsymbol \Lambda \right) + {\rm tr} \left(\left( \sum_{i=1}^N \left( \textbf x_i {\textbf x_i}^{\rm T} - \overline {\textbf x}\overline {\textbf x}^{\rm T} - \overline {\textbf x}\overline {\textbf x}^{\rm T} + \overline {\textbf x}\overline {\textbf x}^{\rm T} \right) \right)\boldsymbol \Lambda \right) \\ &=& \frac {N\lambda _0}{N+\lambda_0} {\rm tr} \left( \left(\mu_0 - \overline {\textbf x}\right) \left(\mu_0 - \overline {\textbf x}\right)^{\rm T} \boldsymbol \Lambda \right) + {\rm tr} \left( \sum_{i=1}^N \left(\textbf x_i - \overline {\textbf x}\right) \left(\textbf x_i - \overline {\textbf x}\right)^{\rm T} \boldsymbol \Lambda \right) \\ \end{eqnarray}$

$A, B$ を対数事後分布に戻し, Normal-Wishart分布になるよう変形.

$\begin{eqnarray} \operatorname {ln} p\left(\boldsymbol \mu, \boldsymbol \Lambda \mid \textbf x \right) &=& - \frac {1} {2} (N + \lambda_0) \left( \boldsymbol \mu - \frac {N\overline {\textbf x} + \lambda_0 \boldsymbol \mu_0} {N + \lambda_0} \right)^{\rm T} \boldsymbol \Lambda \left( \boldsymbol \mu - \frac {N\overline {\textbf x} + \lambda_0 \boldsymbol \mu_0} {N + \lambda_0} \right) + \frac {1} {2} \operatorname {ln} \left| \boldsymbol \Lambda \right| - \frac {1} {2} \operatorname {ln} \left| \boldsymbol \Lambda \right| + \frac {N + \nu_{0} - D} {2} \operatorname {ln} \left| \boldsymbol \Lambda \right| \\ && - \frac {1} {2} \frac {N\lambda _0}{N+\lambda_0} {\rm tr} \left( \left(\mu_0 - \overline {\textbf x}\right) \left(\mu_0 - \overline {\textbf x}\right)^{\rm T} \boldsymbol \Lambda \right) - \frac {1} {2} {\rm tr} \left( \sum_{i=1}^N \left(\textbf x_i - \overline {\textbf x}\right) \left(\textbf x_i - \overline {\textbf x}\right)^{\rm T} \boldsymbol \Lambda \right) - \frac {\operatorname {tr} (\mathbf {W_0} ^{-1}\mathbf {\boldsymbol \Lambda})} {2} + {\rm const} \\ \end{eqnarray}$

事後分布が導出できた. そんなに難しくない.

$\begin{eqnarray} p\left(\boldsymbol \mu, \boldsymbol \Lambda \mid \textbf x \right) &=& \mathcal N\left( \boldsymbol \mu \mid \boldsymbol \mu_N, \left(\lambda_N \boldsymbol \Lambda \right)^{-1}\right) \mathcal W\left(\boldsymbol \Lambda \mid \textbf W_N^{-1}, \nu_N \right) \\ \boldsymbol \mu_N &=& \frac {N\overline {\textbf x} + \lambda_0 \boldsymbol \mu_0} {N + \lambda_0} \\ \boldsymbol \lambda_N &=& N+\lambda_0 \\ \textbf W_N^{-1} &=& \textbf W_0^{-1} + \frac {N\lambda _0}{N+\lambda_0} \left(\mu_0 - \overline {\textbf x}\right) \left(\mu_0 - \overline {\textbf x}\right)^{\rm T} + \sum_{i=1}^N \left(\textbf x_i - \overline {\textbf x}\right) \left(\textbf x_i - \overline {\textbf x}\right)^{\rm T} \\ \nu_N &=& N + \nu_0 \end{eqnarray}$

$\boldsymbol \mu$ の周辺事後分布

$\boldsymbol \mu$ と $\boldsymbol \Lambda$ の事後分布が得られたわけだが, ここから $\boldsymbol \mu$ の周辺分布は導出できるのか考えてみる. ちなみに $\boldsymbol \Lambda$ の周辺事後分布は明らかに事後分布のWishart分布の部分である.

要するに下のNormal-Wishart分布の $\mu$ の平均に対する周辺分布を導出して, パラメータを事後分布のパラメータに置き換えればよい.

Wishart分布の正規化項 $Z(\textbf W, \mu)$ を導入すると計算しやすい.

$Z(\textbf W, \nu) = {2^{\frac {\nu D}{2}}\left|{\mathbf {W} }\right|^{\frac {\nu} {2}}\Gamma _{D}\left({\frac {\nu}{2}}\right)}$

しこしこ解く.

$\begin{eqnarray} \mathcal f \left(\boldsymbol \mu, \boldsymbol \Lambda \mid \boldsymbol \mu_0, \boldsymbol \lambda, \textbf W, \nu \right) &=& \frac {\left| \lambda \boldsymbol \Lambda \right|^\frac {1} {2}} {(2\pi)^{\frac{D}{2}}} {\rm exp }\left( - \frac {1} {2} \left(\boldsymbol \mu - \boldsymbol \mu_0\right)^{\rm T} \lambda\Lambda \left(\boldsymbol \mu - \boldsymbol \mu_0\right) \right) {\frac {\left|\boldsymbol \Lambda \right|^{\frac {\nu-D-1}{2}}{\rm exp} \left( \frac {-\operatorname {tr} (\mathbf {W} ^{-1}{\boldsymbol \Lambda })} {2}\right)} {Z(\textbf W, \nu)}} \\ &=& \frac {\lambda ^\frac {D} {2}} {(2\pi)^{\frac{D}{2}}} \frac {1} {{Z(\textbf W, \nu)}} \left|\boldsymbol \Lambda \right|^{\frac {(\nu+1)-D-1}{2}} {\rm exp} \left(- \frac {1} {2} {\rm tr} \left( \underbrace { \textbf W^{-1} + \lambda \left(\boldsymbol \mu -\boldsymbol \mu_0\right) \left(\boldsymbol \mu -\boldsymbol \mu_0\right)^{\rm T} }_{=: \textbf W^{'-1}} \right)\right) \\ &=& \frac {\lambda ^\frac {D} {2}} {(2\pi)^{\frac{D}{2}}} \frac {Z(\textbf W^{\prime}, \nu+1)} {Z(\textbf W, \nu)} {\frac {\left|\boldsymbol \Lambda \right|^{\frac {(\nu+1)-D-1}{2}}{\rm exp} \left( - \frac {1} {2} \operatorname {tr} (\mathbf {W} ^{\prime-1}{\boldsymbol \Lambda }) \right)} {Z(\textbf W^{\prime}, \nu+1)}} \\ &=& \frac {\lambda ^\frac {D} {2}} {(2\pi)^{\frac{D}{2}}} \frac {Z(\textbf W^{\prime}, \nu+1)} {Z(\textbf W, \nu)} \mathcal W\left(\textbf W^{\prime-1}, \nu+1\right) \end{eqnarray}$

$\boldsymbol \Lambda$ について積分し, $\boldsymbol \mu$ の周辺分布を求める.

ここで $\left| \textbf W^{\prime}\right|$ の処理は光成滋生 (2017). "パターン認識と機械学習の学習普及版"のp79が参考になる.

まず公式として, $|\boldsymbol I + \boldsymbol a \boldsymbol b^{\rm T}| = 1 + \boldsymbol b^{\rm T} \boldsymbol a$ がある(参考). $\boldsymbol ここでI$ は単位行列.

$\begin{eqnarray} \left| \textbf W^{\prime -1}\right| &=& \left | \textbf W^{-1} + \lambda \left(\boldsymbol \mu -\boldsymbol \mu_0\right) \left(\boldsymbol \mu -\boldsymbol \mu_0\right)^{\rm T} \right| \\ &=& \left| \textbf W^{-1} \right| \left | \boldsymbol I + \textbf W\lambda \left(\boldsymbol \mu -\boldsymbol \mu_0\right) \left(\boldsymbol \mu -\boldsymbol \mu_0\right)^{\rm T} \right| \\ &=& \left| \textbf W^{-1} \right| \left ( 1 + \lambda \left(\boldsymbol \mu -\boldsymbol \mu_0\right)^{\rm T} \textbf W \left(\boldsymbol \mu -\boldsymbol \mu_0\right) \right) \\ \left| \textbf W^{\prime}\right| &=& \frac {\left| \textbf W \right|} { 1 + \lambda \left(\boldsymbol \mu -\boldsymbol \mu_0\right)^{\rm T} \textbf W \left(\boldsymbol \mu -\boldsymbol \mu_0\right) } \\ \end{eqnarray}$

よって以下のように導出できる.

$\begin{eqnarray} \int \mathcal f \left(\boldsymbol \mu, \boldsymbol \Lambda \mid \boldsymbol \mu_0, \boldsymbol \lambda, \textbf W, \nu \right) {\rm d} \boldsymbol \Lambda &=& \frac {\lambda ^\frac {D} {2}} {(2\pi)^{\frac{D}{2}}} \frac {2^{\frac {(\nu + 1)D} {2}}} {2^{\frac {\nu D} {2}}} \frac {\left| \textbf W ^{\prime}\right|^{\frac {\nu + 1} {2}}} {\left| \textbf W\right|^{\frac {\nu} {2}}} \frac {\Gamma _{D}\left({\frac {\nu+1}{2}}\right)} {\Gamma _{D}\left({\frac {\nu}{2}}\right)} \\ &=& \frac {\lambda ^\frac {D} {2}} {\pi^{\frac{D}{2}}} {\left| \textbf W \right|^{\frac {1} {2}}} \left( 1 + \lambda \left(\boldsymbol \mu -\boldsymbol \mu_0\right)^{\rm T} \textbf W \left(\boldsymbol \mu -\boldsymbol \mu_0\right) \right) ^ {-\frac {\nu + 1}{2}} \frac {\pi ^{D(D-1)/4}\prod _{j=1}^{D}\Gamma \left({\frac {\nu+2-j}{2}}\right)} {\pi ^{D(D-1)/4}\prod _{j=1}^{D}\Gamma \left({\frac {\nu+1-j}{2}}\right)} \\ &=& \frac {\lambda ^\frac {D} {2}} {\pi^{\frac{D}{2}}} {\left| \textbf W \right|^{\frac {1} {2}}} \left( 1 + \lambda \left(\boldsymbol \mu -\boldsymbol \mu_0\right)^{\rm T} \textbf W \left(\boldsymbol \mu -\boldsymbol \mu_0\right) \right) ^ {-\frac {\nu + 1}{2}} \frac {\Gamma \left({\frac {\nu+1}{2}}\right)} {\Gamma \left({\frac {\nu+1-D}{2}}\right)} \\ &=& \frac {\Gamma \left({\frac {\nu+1}{2}}\right)} {\Gamma \left({\frac {\nu+1-D}{2}}\right)} \frac {1} {\pi^{\frac{D}{2}}} \frac {\left(\lambda \left(\nu + 1 - D\right)\right)^\frac {D} {2}} {\left(\nu + 1 - D \right)^{\frac {D} {2}}} {\left| \textbf W \right|^{\frac {1} {2}}} \\ && \times \left( 1 + \frac {1} {\nu + 1 - D} \left(\boldsymbol \mu -\boldsymbol \mu_0\right)^{\rm T} \lambda \left(\nu + 1 - D\right) \textbf W \left(\boldsymbol \mu -\boldsymbol \mu_0\right) \right) ^ {-\frac {\left(\nu + 1 -D\right) + D}{2}} \\ &=& \frac {\Gamma \left({\frac {\nu+1}{2}}\right)} {\Gamma \left({\frac {\nu+1-D}{2}}\right)} \frac {1} {\pi^{\frac{D}{2}}} \frac {1} {\left(\nu + 1 - D \right)^{\frac {D} {2}}} {\left| \lambda \left(\nu + 1 - D\right) \textbf W \right|^{\frac {1} {2}}} \\ && \times \left( 1 + \frac {1} {\nu + 1 - D} \left(\boldsymbol \mu -\boldsymbol \mu_0\right)^{\rm T} \lambda \left(\nu + 1 - D\right) \textbf W \left(\boldsymbol \mu -\boldsymbol \mu_0\right) \right) ^ {-\frac {\left(\nu + 1 -D\right) + D}{2}} \\ \end{eqnarray}$

これは, 精度 $\lambda(\nu + 1 - D)\textbf W$ , 平均 $\mu_0$ , 自由度 $\nu + 1 - D$ のStudent's t分布である.

そういうわけで, $p(\mu \mid \textbf x)$ は, 精度 $\lambda_N(\nu_N + 1 - D)\textbf W_N$ , 平均 $\mu_N$ , 自由度 $\nu_N + 1 - D$ のStudent's t分布となる.

解けた.

事後予測分布

任意の $\hat {\textbf x}$ についての事後予測分布 $p(\hat {\textbf x} \mid \textbf x)$ を導出する.

まず $p( {\textbf x})$ を導出する. 前回やったように, 正規化項以外の割り算は割って1にならなければならない.

$\begin{eqnarray} p\left(\textbf x \right) &=& \frac {p\left( \boldsymbol \mu,\boldsymbol \Lambda \right) p\left(\textbf x \mid \boldsymbol \mu, \boldsymbol \Lambda \right)} {p\left(\boldsymbol \mu, \boldsymbol \Lambda \mid \textbf x \right)} \\ &=& \frac { \frac {\lambda_0^{\frac {D} {2}}} {\left(2\pi\right)^{\frac {D} {2}}} \frac {1} {2^{\frac {\nu_0 D}{2}}\left|{\mathbf {W}_0 }\right|^{\frac {\nu_0} {2}}\Gamma _{D}\left({\frac {\nu_0}{2}}\right)} \frac {1} {\left(2\pi\right)^{\frac {ND} {2}}} } { \frac {\lambda_N^{\frac {D} {2}}} {\left(2\pi\right)^{\frac {D} {2}}} \frac {1} {2^{\frac {\nu_N D}{2}}\left|{\mathbf {W}_N }\right|^{\frac {\nu_N} {2}}\Gamma _{D}\left({\frac {\nu_N}{2}}\right)} } \\ &=& \frac {1} {\pi^{\frac {ND} {2}}} \frac {\Gamma _{D}\left({\frac {\nu_N}{2}}\right)} {\Gamma _{D}\left({\frac {\nu_0}{2}}\right)} \frac {\left|{\mathbf {W}_N }\right|^{\frac {\nu_N} {2}}} {\left|{\mathbf {W}_0 }\right|^{\frac {\nu_0} {2}}} \left(\frac {\lambda_0} {\lambda_N}\right)^{\frac {D} {2}} \end{eqnarray}$

$p(\hat {\textbf x} \mid \textbf x)$ は以下となる.

$\begin{eqnarray} p(\hat {\textbf x} \mid \textbf x) &=& \frac {p(\hat {\textbf x}, \textbf x)} {p(\textbf x)} &=& \frac {1} {\pi^{\frac {D} {2}}} \frac {\Gamma _{D}\left({\frac {1}{2}\nu_{N+1}}\right)} {\Gamma _{D}\left({\frac {1}{2}\nu_N}\right)} \frac {\left|{\mathbf {W}_{N+1} }\right|^{\frac {1} {2}\nu_{N+1}}} {\left|{\mathbf {W}_N }\right|^{\frac {1} {2}\nu_N}} \left(\frac {\lambda_{N}} {\lambda_{N+1}}\right)^{\frac {D} {2}} \end{eqnarray}$

$\lbrace \hat {\textbf x}, \textbf x \rbrace$ による事後分布のパラメータが出てきた. このパラメータを添字 $N+1$ で表現すると以下となる.

$\begin{eqnarray} \boldsymbol \mu_{N+1} &=& \frac {\hat {\textbf x} + \lambda_N \boldsymbol \mu_N} {1 + \lambda_N} \\ \boldsymbol \lambda_{N+1} &=& 1+\lambda_N \\ \textbf W_{N+1}^{-1} &=& \textbf W_N^{-1} + \frac {\lambda_N}{1+\lambda_N} \left(\mu_N - \hat {\textbf x}\right) \left(\mu_N - \hat {\textbf x}\right)^{\rm T} \\ \nu_{N+1} &=& 1 + \nu_N \end{eqnarray}$ $\begin{eqnarray} p(\hat {\textbf x} \mid \textbf x) &=& \frac {1} {\pi^{\frac {D} {2}}} \frac {\Gamma _{D}\left({\frac {1}{2}(\nu_N+1)}\right)} {\Gamma _{D}\left({\frac {1}{2}\nu_N}\right)} \frac {1} {\left| \textbf W_N^{-1} + \frac {\lambda_N}{1+\lambda_N} \left(\mu_N - \hat {\textbf x}\right) \left(\mu_N - \hat {\textbf x}\right)^{\rm T} \right|^{\frac {1} {2}(\nu_N+1)}} \frac {1} {\left|{\mathbf {W}_N }\right|^{\frac {1} {2}\nu_N}} \left(\frac {\lambda_N} {\lambda_N + 1}\right)^{\frac {D} {2}} \\ &=& \frac {1} {\pi^{\frac {D} {2}}} \frac {\pi ^{D(D-1)/4}\prod _{j=1}^{D}\Gamma \left({\frac {\nu_N+2-j}{2}}\right)} {\pi ^{D(D-1)/4}\prod _{j=1}^{D}\Gamma \left({\frac {\nu_N+1-j}{2}}\right)} \\ && \times \frac {1} {\left\lbrace \left|\textbf W_N^{-1}\right| \left( 1 + \left(\mu_N - \hat {\textbf x}\right)^{\rm T} \frac {\lambda_N}{1+\lambda_N}\textbf W_N \left(\mu_N - \hat {\textbf x}\right) \right) \right\rbrace^{\frac {1} {2}(\nu_N+1)}} \frac {1} {\left|{\mathbf {W}_N }\right|^{\frac {1} {2}\nu_N}} \left(\frac {\lambda_N} {\lambda_N + 1}\right)^{\frac {D} {2}} \\ &=& \frac {1} {\pi^{\frac {D} {2}}} \frac {\Gamma \left({\frac {\nu_N+1}{2}}\right)} {\Gamma \left({\frac {\nu_N+1-D}{2}}\right)} \frac {1} {\left|{\mathbf {W}_N^{-1} }\right|^{\frac {1} {2}}} \frac {1} {\left(\frac {\lambda_N+1} {\lambda_N}\right)^{\frac {D} {2}}} \frac {1} { \left( 1 + \left(\mu_N - \hat {\textbf x}\right)^{\rm T} \frac {\lambda_N}{1+\lambda_N}\textbf W_N \left(\mu_N - \hat {\textbf x}\right) \right)^{\frac {1} {2}(\nu_N+1)}} \\ &=& \frac {1} {\pi^{\frac {D} {2}}} \frac {\Gamma \left({\frac {(\nu_N+1-D)+D}{2}}\right)} {\Gamma \left({\frac {\nu_N+1-D}{2}}\right)} \frac {1} {\left|\left( \frac {\lambda_N\left(\nu_N + 1 - D\right)} {1+\lambda_N} {\mathbf {W}_N} \right)^{-1}\right|^{\frac {1} {2}}} \frac {1} {\left(\frac {\lambda_N\left(\nu_N + 1 - D\right)} {1+\lambda_N}\right)^{\frac {D} {2}}} \frac {1} {\left(\frac {\lambda_N+1} {\lambda_N}\right)^{\frac {D} {2}}} \\ && \times \left( 1 + \frac {1} {\nu_N + 1 - D} \left(\mu_N - \hat {\textbf x}\right)^{\rm T} \frac {\lambda_N\left(\nu_N + 1 - D\right)} {1+\lambda_N}\textbf W_N \left(\mu_N - \hat {\textbf x}\right) \right)^{-\frac {1} {2}\left(\left(\nu_N+1-D\right)+D\right)} \\ &=& \frac {1} {\pi^{\frac {D} {2}}} \frac {\Gamma \left({\frac {(\nu_N+1-D)+D}{2}}\right)} {\Gamma \left({\frac {\nu_N+1-D}{2}}\right)} \frac {1} {\left|\left( \frac {\lambda_N\left(\nu_N + 1 - D\right)} {1+\lambda_N} {\mathbf {W}_N} \right)^{-1}\right|^{\frac {1} {2}}} \frac {1} {\left(\frac {\lambda_N\left(\nu_N + 1 - D\right)} {1+\lambda_N}\right)^{\frac {D} {2}}} \frac {1} {\left(\frac {\lambda_N+1} {\lambda_N}\right)^{\frac {D} {2}}} \\ && \times \left( 1 + \frac {1} {\nu_N + 1 - D} \left(\mu_N - \hat {\textbf x}\right)^{\rm T} \frac {\lambda_N\left(\nu_N + 1 - D\right)} {1+\lambda_N}\textbf W_N \left(\mu_N - \hat {\textbf x}\right) \right)^{-\frac {1} {2}\left(\left(\nu_N+1-D\right)+D\right)} \\ &=& \frac {1} {\pi^{\frac {D} {2}}} \frac {\Gamma \left({\frac {(\nu_N+1-D)+D}{2}}\right)} {\Gamma \left({\frac {\nu_N+1-D}{2}}\right)} \left| \frac {\lambda_N\left(\nu_N + 1 - D\right)} {1+\lambda_N} {\mathbf {W}_N} \right|^{\frac {1} {2}} \frac {1} {\left(\nu_N + 1 - D\right)^{\frac {D} {2}}}\\ && \times \left( 1 + \frac {1} {\nu_N + 1 - D} \left(\hat {\textbf x} - \mu_N\right)^{\rm T} \frac {\lambda_N\left(\nu_N + 1 - D\right)} {1+\lambda_N}\textbf W_N \left(\hat {\textbf x} - \mu_N\right) \right)^{-\frac {1} {2}\left(\left(\nu_N+1-D\right)+D\right)} \\ \end{eqnarray}$

$p(\hat {\textbf x} \mid \textbf x)$ は, 精度 $\frac {\lambda_N\left(\nu_N + 1 - D\right)} {1+\lambda_N}\textbf W_N$ , 平均 $\mu_N$ , 自由度 $\nu_N + 1 - D$ のstudentのt分布となる.

$p\left(\boldsymbol \mu \mid \textbf x \right)$ のStudent's t分布の精度を $1 + \lambda_N$ で割った分布なわけだがそこに意味があるのかは分からない.

なんにせよ導出は全部できた.

可視化と検証

計算して可視化してみる. Rstanでサンプリングもしてみて結果を比較してみる. 一致しなければ上の導出が誤っていることになる.

import numpy as np
import scipy as sp
from scipy import stats
from scipy import special

import pandas as pd
pd.set_option('display.width', 200)
import matplotlib.pyplot as plt
import matplotlib
plt.style.use('ggplot')

import os
import subprocess
import warnings
warnings.filterwarnings('ignore')

テストデータと事前分布

適当な二次元データをサンプリングする.

# 適当な二次元データ
np.random.seed(0)
mu = np.array([5, 3])
cov = np.array([[5,2],[2,1]])
N = 100
X = pd.DataFrame(sp.random.multivariate_normal(mu, cov, N), columns=['x', 'y'])
X.plot(kind='scatter', x='x', y='y').set_aspect('equal')
plt.savefig('1.png', bbox_inches='tight', pad_inches=0)

f:id:kazufusa1484:20180727120938p:plain

事前分布を適当に決める.

# 適当な事前分布のパラメータ

mu0 = np.array([0, 0]).reshape(2, 1) # 縦ベクトル
lambda0 = 1
W0 = np.array([[0.01, 0], [0, 0.01]])
nu0 = 1.1

事後分布の計算

meanX = np.array(X.mean()).reshape(2, 1)
muN = (N * meanX + lambda0 * mu0) / (N + lambda0)
lambdaN = N + lambda0
WNinv = np.linalg.inv(W0) \
    + (N * lambda0) / (N + lambda0) * (mu0 - meanX) @ (mu0 - meanX).T \
    + X.apply(
        lambda x:
            (np.array(x).reshape(2, 1) - meanX)
            @ (np.array(x).reshape(2, 1) - meanX).T
        , axis=1
    ).sum()
WN = np.linalg.inv(WNinv)
nuN = N + nu0

print(f"muN     : {muN.tolist()}")
print(f"lambdaN : {lambdaN}")
print(f"WN      : {WN.tolist()}")
print(f"nuN     : {nuN}")

    muN     : [[4.9302018085299], [3.0252711023662453]]
    lambdaN : 101
    WN      : [[0.002450141567402565, -0.002569876767106904], [-0.002569876767106904, 0.007340956277110485]]
    nuN     : 101.1

計算はできたがこれが正しいのかはわからない.

RStanで事後分布サンプリング

別の手法で事後分布を求めて比較する必要が有る. というわけでRStanで同じモデルでMCMCサンプリングする. 保存するのは $\boldsymbol \mu$ と $\boldsymbol \Lambda^{-1}\left(=\boldsymbol \Sigma \right)$ のサンプリング結果と, 各イテレーションにおいて $\boldsymbol \mu$ と $\boldsymbol \Lambda^{-1}$ をパラメータとする2変量正規分布から1点サンプリングしたものである.

X.to_csv("x.csv", index=False)
from subprocess import Popen, PIPE

R = Popen(["R", "--vanilla", "--silent"], stdin=PIPE, stdout=PIPE)

R.stdin.write(b'''
options(width=200, echo=F, digits=4)

sink(file="/dev/null")
suppressMessages({
    library(rstan)
})
x <- read.table("x.csv", sep=",", header=T)
data = list(
    N = nrow(x),
    x = x,
    mu0 = c(0, 0),
    lambda0 = 1,
    W0 = matrix(c(0.01, 0, 0, 0.01), nrow=2, ncol=2),
    nu0 = 1.1
)
model = "
data {
    int<lower = 0> N;
    vector[2] x[N];

    vector[2] mu0;
    real lambda0;
    cov_matrix[2] W0;
    real nu0;
}
transformed data {
    cov_matrix[2] W0inv;
    W0inv = inverse(W0);
}
parameters {
    vector[2] mu;
    cov_matrix[2] Sigma;
}
model {
    target += inv_wishart_lpdf(Sigma | nu0, W0inv);
    target += multi_normal_lpdf(mu | mu0, Sigma / lambda0);

    for (n in 1:N) {
        target += multi_normal_lpdf(x[n] | mu, Sigma);
    }
}
generated quantities {
    vector[2] sampled_x;
    sampled_x = multi_normal_rng(mu, Sigma);
}
"

fit <- stan(
    model_code = model,
    data       = data,
    iter       = 20000,
    warmup     = 10000,
    thin       = 1,
    cores      = 4,
    seed       = 1
)
sink()
write.table(rstan:::extract(fit), file="stan.csv", row.names=F, sep=",")
fit
''')
print(R.communicate()[0].decode('ascii'))
R.kill()

> 
> options(width=200, echo=F, digits=4)
Inference for Stan model: 28d63f23d0d81288800b4248b89bfb14.
4 chains, each with iter=20000; warmup=10000; thin=1; 
post-warmup draws per chain=10000, total post-warmup draws=40000.

                mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
mu[1]           4.93    0.00 0.26    4.43    4.76    4.93    5.10    5.44 35071    1
mu[2]           3.03    0.00 0.15    2.74    2.93    3.03    3.13    3.32 35489    1
Sigma[1,1]      6.57    0.01 0.95    4.97    5.90    6.48    7.14    8.67 35641    1
Sigma[1,2]      2.30    0.00 0.45    1.53    1.99    2.26    2.58    3.30 32605    1
Sigma[2,1]      2.30    0.00 0.45    1.53    1.99    2.26    2.58    3.30 32605    1
Sigma[2,2]      2.20    0.00 0.32    1.66    1.97    2.16    2.39    2.90 35722    1
sampled_x[1]    4.93    0.01 2.58   -0.12    3.20    4.93    6.66   10.01 39676    1
sampled_x[2]    3.02    0.01 1.49    0.11    2.04    3.02    4.03    5.94 39455    1
lp__         -397.57    0.01 1.60 -401.50 -398.41 -397.24 -396.38 -395.43 17924    1

Samples were drawn using NUTS(diag_e) at Fri Oct 13 01:54:59 2318.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

サンプリング結果のサマリを表示してみると. Rhatがどれも1でn_effも大きいので十分収束したんだと思う. サンプリング結果はCSV経由でPython側で取得する.

stan = pd.read_csv('stan.csv')
stan.columns = [f"{c}.stan" for c in stan.columns]
stan.columns

    Index(['mu.1.stan', 'mu.2.stan', 'Sigma.1.1.stan', 'Sigma.2.1.stan', 'Sigma.1.2.stan', 'Sigma.2.2.stan', 'sampled_x.1.stan', 'sampled_x.2.stan', 'lp__.stan'], dtype='object')

$\boldsymbol \mu$ を比較

2次元なので比較が難しい. まずはヒートマップを書いてみる. 多変量のStudent's t分布のPDFは自作する. 多分合ってると思う.

def student_t_pdf(x, m, Lambda, df):
    D = len(m)
    m = m.reshape(D, 1)
    x = x.T

    return np.diag(
        sp.special.gamma(D/2. + df/2.) /
        sp.special.gamma(df/2.) *
        np.power(np.linalg.det(Lambda), 1/2.) / 
        np.power(np.pi * df, D/2.) *
        np.power(1 + (x-m).T@Lambda@(x-m)/df, -(D+df)/2.)
    )

fig, axes = plt.subplots(ncols=3, figsize=(16, 4))
delta = 0.05
gx = np.arange(3.5, 6.5, delta)
gy = np.arange(1.5, 4.5, delta)
gxx, gyy = np.meshgrid(gx, gy)
gxgy = np.c_[gxx.ravel(), gyy.ravel()]

zz = student_t_pdf(
    gxgy,
    muN,
    lambdaN * (nuN + 1 - 2) * WN,
    (nuN + 1 - 2)
)

im = axes[0].imshow(
    zz.reshape(len(gx), len(gy)), 
    interpolation='none', 
    origin='lower',
    extent=[3.5, 6.5, 1.5, 4.5],
    cmap=matplotlib.cm.rainbow
)
axes[0].set_title('derivated $\mu$ posterior dist.')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
fig.colorbar(im, ax=axes[0])

x = stan['mu.1.stan']
y = stan['mu.2.stan']
H = np.histogram2d(x, y, bins=[gx, gy], normed=True)
im = axes[1].imshow(
    H[0].T,
    interpolation='none', 
    origin='lower',
    extent=[3.5, 6.5, 1.5, 4.5],
    cmap=matplotlib.cm.rainbow
)
axes[1].set_title('Stan sampled $\mu$ posterior dist.')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
fig.colorbar(im, ax=axes[1])

stan.head(1000).plot(kind='scatter', x='mu.1.stan', y='mu.2.stan', alpha=0.1, ax=axes[2]).set_aspect('equal')
axes[2].set_xlim(3.5, 6.5)
axes[2].set_ylim(1.5, 4.5)
axes[2].set_title('Stan sampled posterior $\mu$')
axes[2].set_xlabel('x')
axes[2].set_ylabel('y')
plt.savefig('2.png', bbox_inches='tight', pad_inches=0)

f:id:kazufusa1484:20180727121047p:plain

左が $\boldsymbol \mu$ の事後分布, 中央がStanがサンプリングした $\boldsymbol \mu$ の事後分布, 右はStanのサンプリング結果をそのままプロットしたもの. ヒートマップはだいたい同じ形だけれども, 赤い領域の大きさちょっと違う. Stanの方が若干ばらつきが大きいように見える.

単にStanのサンプルサイズの小ささゆえに2Dヒストグラムがガタついているのが目視での差異の原因かもしれない. だいたい同じなのでこれはこれでよしとしておく.

続いて各次元の周辺事後分布を比較してみると, こちらはぴったり一致している.

delta = 0.05
gridx = np.arange(1, 8, delta)
D = 2
data = pd.DataFrame(gridx, columns=['x'])

data['mu1'] = pd.DataFrame(sp.stats.t.pdf(
    gridx,
    df=(nuN + 1 - D),
    loc=muN.reshape(2)[0],
    scale=np.sqrt(np.linalg.inv(lambdaN * (nuN + 1 - D) * WN)[0,0])
))
data['mu2'] = pd.DataFrame(sp.stats.t.pdf(
    gridx,
    df=(nuN + 1 - D),
    loc=muN.reshape(2)[1],
    scale=np.sqrt(np.linalg.inv(lambdaN * (nuN + 1 - D) * WN)[1,1])
))

fig, axes = plt.subplots(ncols=2, figsize=(12,4))
data.plot(x='x', y='mu1', ax=axes[0])
stan[['mu.1.stan']].plot.kde(ax=axes[0])
data.plot(x='x', y='mu2', ax=axes[1])
stan[['mu.2.stan']].plot.kde(ax=axes[1])
axes[0].set_xlim(2.5, 7.5)
axes[1].set_xlim(1.5, 4.5)
plt.savefig('3.png', bbox_inches='tight', pad_inches=0)

f:id:kazufusa1484:20180727121105p:plain

まあいいんじゃないですかね.

$\boldsymbol \Lambda$ を比較

これこそどうやってStanのサンプリング結果と比較したらよいのか分からない. だれかおしえてくれ.

簡単に, $\boldsymbol \Lambda$ の事後分布から $\boldsymbol \Lambda^{-1}$ (分散共分散行列)をサンプリングして各要素の周辺分布をStanのサンプリング結果のSigmaと比較しよう.

sampled_Sigma = pd.DataFrame(
    sp.stats.invwishart.rvs(df=nuN, scale=np.linalg.inv(WN), size=10000).reshape(10000, 4),
    columns=['Sigma.1.1', 'Sigma.1.2', 'Sigma.2.1', 'Sigma.2.2']
)
fig, axes = plt.subplots(ncols=3, figsize=(14,4))
sampled_Sigma[['Sigma.1.1']].plot.kde(ax=axes[0])
sampled_Sigma[['Sigma.1.2']].plot.kde(ax=axes[1])
sampled_Sigma[['Sigma.2.2']].plot.kde(ax=axes[2])
stan[['Sigma.1.1.stan']].plot.kde(ax=axes[0])
stan[['Sigma.1.2.stan']].plot.kde(ax=axes[1])
stan[['Sigma.2.2.stan']].plot.kde(ax=axes[2])
plt.savefig('4.png', bbox_inches='tight', pad_inches=0)

f:id:kazufusa1484:20180727121123p:plain

一緒ですね. これ以上は深くつっこまない.

解析的に解けるのにStanのサンプリング結果と比較するための事後分布をサンプリングするというのはいかにも筋がわるい.

条件付き事後予測分布の比較

あとは事後予測分布を比較したい. Stanではsampled_xとしてサンプリングしているわけだが.

$\boldsymbol \mu$ のときはヒートマップと各次元の周辺事後分布を比較したが, 目的ベースでいくと, $\hat {\textbf x}$ の第2成分がある $y$ のときの第1成分の事後予測分布の条件付き確率分布を算出してStanと比較したい. これができると様々な条件に対する予測が簡単にできることになる.

そのためには事後予測分布である多変量Student's t分布の条件付き確率を導出しなければならない.

Student's t分布の確率密度関数 $\rm St$ は以下である.

$\begin{eqnarray} {\rm St}\left(\textbf x \mid \boldsymbol \mu, \boldsymbol \Lambda, \nu\right) &=& \frac {\Gamma \left({\frac {\nu+D}{2}}\right)} {\Gamma \left({\frac {\nu}{2}}\right)} \frac {\left| \boldsymbol \Lambda \right|^{\frac {1} {2}}} {\left(\pi \nu \right)^{\frac{D}{2}}} \left( 1 + \frac {1} {\nu} \left(\textbf x -\boldsymbol \mu\right)^{\rm T} \boldsymbol \Lambda \left(\textbf x -\boldsymbol \mu\right) \right) ^ {-\frac {\nu + D}{2}} \\ \end{eqnarray}$

ここで

${\mathbf {x}}= {\begin{bmatrix} {\mathbf {x}}_{1}\\ {\mathbf {x}}_{2} \end{bmatrix}} {\text{ with sizes }} {\begin{bmatrix} q\times 1\\ (D-q)\times 1 \end{bmatrix}}$ $\boldsymbol\mu = \begin{bmatrix} \boldsymbol\mu_1 \\ \boldsymbol\mu_2 \end{bmatrix} \text{ with sizes }\begin{bmatrix} q \times 1 \\ (D-q) \times 1 \end{bmatrix}$ $\boldsymbol\Lambda^{-1} = \boldsymbol\Sigma = \begin{bmatrix} \boldsymbol\Sigma_{11} & \boldsymbol\Sigma_{12} \\ \boldsymbol\Sigma_{21} & \boldsymbol\Sigma_{22} \end{bmatrix} \text{ with sizes }\begin{bmatrix} q \times q & q \times (D-q) \\ (D-q) \times q & (D-q) \times (D-q) \end{bmatrix}$

であるときに, $p(\textbf x_1 \mid \textbf x_2, \boldsymbol \mu, \boldsymbol \Lambda, \nu)$ が欲しいわけである.

多変量正規分布の条件付き確率によると,

$\mathbf x \sim \mathcal N \left(\boldsymbol \mu, \boldsymbol \Sigma \right)$

であるとき,

$\mathbf x_1 \sim \mathcal N \left( \boldsymbol\mu_1 + \boldsymbol\Sigma_{12} \boldsymbol\Sigma_{22}^{-1} \left(\mathbf x_2 - \boldsymbol\mu_2 \right) , {\boldsymbol {\Sigma }}_{11}-{\boldsymbol {\Sigma }}_{12}{\boldsymbol {\Sigma }}_{22}^{-1}{\boldsymbol {\Sigma }}_{21} \right)$

である.

また, Student'sのt分布は以下のように変形できる(PRML 式2.161).

$\begin{eqnarray} {\rm St}\left(\textbf x \mid \boldsymbol \mu, \boldsymbol \Lambda, \nu\right) &=& \int \mathcal N\left( \textbf x \mid \boldsymbol \mu , \left(\eta \boldsymbol \Lambda \right)^{-1} \right) {\rm Gamma} \left( \eta \,\middle| \frac {\nu} {2}, \frac {\nu} {2} \right) {\rm d} \eta \\ \end{eqnarray}$

上記より, 以下が得られる.

$\begin{eqnarray} p\left(\textbf x_1 \mid \mathbf x_2, \boldsymbol \mu, \boldsymbol \Lambda, \nu\right) &=& {\rm St} \left( \textbf x_1 \,\middle|\, \boldsymbol\mu_1 + \boldsymbol\Sigma_{12} \boldsymbol\Sigma_{22}^{-1} \left(\mathbf x_2 - \boldsymbol\mu_2\right) , \left( {\boldsymbol {\Sigma }}_{11} - {\boldsymbol {\Sigma }}_{12}{\boldsymbol {\Sigma }}_{22}^{-1}{\boldsymbol {\Sigma }}_{21} \right)^{-1} , \nu \right) \end{eqnarray}$

(多変量正規分布の条件付き確率やPRML 式2.161をリファレンスからそのまま引いているので)簡単じゃないか.

すでに導出したように, 事後予測分布 $p(\hat {\mathbf x} \mid \mathbf x)$ に当てはめると, 上の式のパラメータは, 精度 $\boldsymbol \Lambda = \frac {\lambda_N\left(\nu_N + 1 - D\right)} {1+\lambda_N}\textbf W_N$ , 平均 $\boldsymbol \mu = \boldsymbol \mu_N$ , 自由度 $\nu = \nu_N + 1 - D$ である.

Student's t分布の条件付き確率がわかったので, x1|x2=3の場合の条件付き確率を可視化し, stanのサンプリング結果と比較する. Stanの方は, 2.9 <=x2<=3.1の範囲にあるサンプリング結果だけを抽出してヒストグラムと密度分布を可視化してみる.

# x2 = yのときのxの条件付き確率を算出する
def conditional_student_t_pdf(x, y, m, Lambda, df):
    D = len(m)
    p = len(x)
    Sigma = np.linalg.inv(Lambda)
    Sigma11 = Sigma[0:p, 0:p]
    Sigma12 = Sigma[0:p, p:]
    Sigma21 = Sigma[p:, 0:p]
    Sigma22 = Sigma[p:, p:]
    mu1 = m[:p, 0]
    mu2 = m[p:, 0]
    cmu = mu1 + Sigma12@(Sigma22.T)*(y-mu2)
    cSigma = Sigma11 - Sigma12@np.linalg.inv(Sigma22)@Sigma21
    cLambda = np.linalg.inv(cSigma)
    
    return student_t_pdf(x.T, cmu, cLambda, df)

delta = 0.05
gridx = np.arange(-5, 15, delta)
D = 2
data = pd.DataFrame(gridx, columns=['x'])

# x2=3の場合の各xグリッド点の条件付き確率
data['derivated'] = pd.DataFrame(conditional_student_t_pdf(
    np.array([gridx]),
    np.array([[3]]),
    muN,
    WN * lambdaN * (nuN + 1 - 2) / (1 + lambdaN),
    (nuN + 1 - 2)
))

ax = data.plot(x='x', y='derivated')
# sampling結果をx2の値の範囲で絞り込み
cstan = stan[
    (2.9 <= stan['sampled_x.2.stan']) & (stan['sampled_x.2.stan'] <= 3.1)
]
# 十分な個数があることを確認, 1次元だと2000は必要と緑本に書いてあった
print(cstan.count())
cstan[['sampled_x.1.stan']].plot.hist(normed=True, bins=50, ax=ax)
cstan[['sampled_x.1.stan']].plot.density(ax = ax)
ax.set_xlim(-5, 15)
ax.set_title('conditional posterior predictive dist.(x1|x2=3)')
plt.savefig('5.png', bbox_inches='tight', pad_inches=0)

    mu.1.stan           2188
    mu.2.stan           2188
    Sigma.1.1.stan      2188
    Sigma.2.1.stan      2188
    Sigma.1.2.stan      2188
    Sigma.2.2.stan      2188
    sampled_x.1.stan    2188
    sampled_x.2.stan    2188
    lp__.stan           2188
    dtype: int64

f:id:kazufusa1484:20180727121148p:plain

よく一致しますね. よかったですね. やりましたね.

事後予測分布の比較

最後に事後予測分布のヒートマップを描いて終わる.

fig, axes = plt.subplots(ncols=3, figsize=(16, 4))
delta = 0.5
xmin = -6
xmax = 16
ymin = -6
ymax = 16
gx = np.arange(xmin, xmax, delta)
gy = np.arange(ymin, ymax, delta)
gxx, gyy = np.meshgrid(gx, gy)
gxgy = np.c_[gxx.ravel(), gyy.ravel()]

zz = student_t_pdf(
    gxgy,
    muN,
    WN * lambdaN * (nuN + 1 - 2) / (1 + lambdaN),
    (nuN + 1 - 2)
)

im = axes[0].imshow(
    zz.reshape(len(gx), len(gy)), 
    interpolation='none', 
    origin='lower',
    extent=[xmin, xmax, ymin, ymax],
    cmap=matplotlib.cm.rainbow
)
axes[0].set_title('derivated posterior predictive dist.')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_yticks(np.arange(-5, 20, 5))
fig.colorbar(im, ax=axes[0])

x = stan['sampled_x.1.stan']
y = stan['sampled_x.2.stan']
H = np.histogram2d(x, y, bins=[gx, gy], normed=True)
im = axes[1].imshow(
    H[0].T,
    interpolation='none', 
    origin='lower',
    extent=[xmin, xmax, ymin, ymax],
    cmap=matplotlib.cm.rainbow
)
axes[1].set_title('Stan sampled posterior predictive dist.')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].set_yticks(np.arange(-5, 20, 5))
fig.colorbar(im, ax=axes[1])

stan.head(10000).plot(kind='scatter', x='sampled_x.1.stan', y='sampled_x.2.stan', alpha=0.1, ax=axes[2]).set_aspect('equal')
axes[2].set_xlim(-6, 16)
axes[2].set_ylim(-6, 16)
axes[2].set_title('Stan sampled posterior x')
axes[2].set_xlabel('x')
axes[2].set_ylabel('y')
axes[2].set_yticks(np.arange(-5, 20, 5))
plt.savefig('6.png', bbox_inches='tight', pad_inches=0)

f:id:kazufusa1484:20180727121201p:plain

おしまい.

導出