The paper that I found clarifying with respect to expectation-maximization is Bayesian K-Means as a "Maximization-Expectation" Algorithm (pdf) by Welling and Kurihara.
Suppose we have a probabilistic model p(x,z,θ) with observations x, hidden random variables z, and parameters θ. We are given a dataset D and are forced (by higher powers) to establish p(z,θ|D).
1. Gibbs sampling
We can approximate p(z,θ|D) by sampling. Gibbs sampling produces samples from p(z,θ|D) by alternating:
θ ∼ p(θ | z, D)
z ∼ p(z | θ, D)
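As a concrete illustration, here is a minimal sketch of this alternation for a toy model of my own choosing (not the paper's Bayesian k-means): a 1-D mixture of K unit-variance Gaussians with uniform mixing weights and a N(0, τ²) prior on each mean. All variable names are assumptions for the example.

```python
# Gibbs sampling sketch: alternate theta ~ p(theta|z,D) and z ~ p(z|theta,D)
# for a toy 1-D mixture of K unit-variance Gaussians, N(0, tau^2) prior on means.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])  # data D
K, tau2, n_iter = 2, 10.0, 500

mu = rng.normal(0, 1, K)          # parameters theta (cluster means)
z = rng.integers(0, K, x.size)    # hidden indicator variables z

for _ in range(n_iter):
    # z ~ p(z | theta, D): categorical, proportional to the Gaussian likelihood
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=row) for row in p])

    # theta ~ p(theta | z, D): conjugate Gaussian posterior per cluster
    for k in range(K):
        xk = x[z == k]
        prec = 1.0 / tau2 + xk.size          # posterior precision
        mu[k] = rng.normal(xk.sum() / prec, np.sqrt(1.0 / prec))

print(np.sort(mu))  # after burn-in, these samples approximate p(theta | D)
```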
2. Variational Bayes
Instead, we can try to establish distributions q(θ) and q(z) and minimize their difference with the distribution we are after, p(θ,z|D). The difference between distributions has a convenient fancy name, the KL divergence. To minimize KL[q(θ)q(z) || p(θ,z|D)] we update:
q(θ) ∝ exp( E_q(z)[ log p(θ, z, D) ] )
q(z) ∝ exp( E_q(θ)[ log p(θ, z, D) ] )
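A minimal sketch of these mean-field updates for the same toy mixture as above (unit-variance components, N(0, τ²) prior on the means); the model and names are illustrative assumptions, not the paper's algorithm.

```python
# Variational Bayes sketch: coordinate-ascent updates of q(z) and q(theta)
# for a toy 1-D mixture of K unit-variance Gaussians, N(0, tau^2) prior on means.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, tau2, n_iter = 2, 10.0, 50

m = rng.normal(0, 1, K)   # q(mu_k) = N(m_k, s2_k)
s2 = np.ones(K)

for _ in range(n_iter):
    # q(z) ∝ exp( E_q(theta)[ log p(theta, z, D) ] )  -> soft responsibilities r
    e_loglik = -0.5 * ((x[:, None] - m[None, :]) ** 2 + s2[None, :])
    r = np.exp(e_loglik - e_loglik.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)

    # q(theta) ∝ exp( E_q(z)[ log p(theta, z, D) ] )  -> Gaussian per cluster
    Nk = r.sum(axis=0)
    prec = 1.0 / tau2 + Nk
    m = (r * x[:, None]).sum(axis=0) / prec
    s2 = 1.0 / prec

print(m, s2)   # approximate posterior over the cluster means
```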
3. Expectation-Maximization
To come up with full-fledged probability distributions for both z and θ might be considered extreme. Why don't we instead use a point estimate for one of them and keep the other nice and nuanced? In EM the parameter θ is established as the one unworthy of a full distribution, and it is set to its MAP (Maximum A Posteriori) value, θ∗.
θ∗ = argmax_θ E_q(z)[ log p(θ, z, D) ]
q(z) = p(z | θ∗, D)
Here θ∗ ∈ argmax would actually be a better notation, since the argmax operator can return multiple values. But let's not nitpick. Compared to variational Bayes you see that undoing the log with an exp does not change the result of the argmax, so it is no longer necessary.
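A minimal sketch of EM on the same toy mixture: q(z) is the exact posterior over the indicators given the current θ∗, and θ∗ is the MAP update under the assumed N(0, τ²) prior. The model and names are my own for illustration.

```python
# EM sketch: full q(z), point estimate theta*, for a toy 1-D mixture of
# K unit-variance Gaussians with a N(0, tau^2) prior on the means.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, tau2, n_iter = 2, 10.0, 50

mu = rng.normal(0, 1, K)   # point estimate theta*

for _ in range(n_iter):
    # E-step: q(z) = p(z | theta*, D)  -> soft responsibilities r
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)

    # M-step: theta* = argmax_theta E_q(z)[ log p(theta, z, D) ]  (MAP update)
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / (1.0 / tau2 + Nk)

print(mu)   # MAP estimates of the cluster means
```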
4. Maximization-Expectation
There is no reason to treat z as a spoiled child. We can just as well use point estimates z∗ for our hidden variables and give the parameters θ the luxury of a full distribution.
z∗ = argmax_z E_q(θ)[ log p(θ, z, D) ]
q(θ) = p(θ | z∗, D)
If our hidden variables z are indicator variables, we suddenly have a computationally cheap way to perform inference on the number of clusters. This is, in other words, model selection (or automatic relevance determination, or imagine another fancy name).
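A minimal sketch of this maximization-expectation scheme on the same toy mixture: hard assignments z∗ (the maximization over z) combined with a full conjugate posterior q(θ) over the means (the expectation over θ). This conveys the flavour of the Bayesian k-means idea but is not the paper's exact algorithm; the model is an assumption on my part.

```python
# Maximization-expectation sketch: point estimate z*, full posterior q(theta),
# for a toy 1-D mixture of K unit-variance Gaussians, N(0, tau^2) prior on means.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, tau2, n_iter = 2, 10.0, 50

m = rng.normal(0, 1, K)    # q(mu_k) = N(m_k, s2_k)
s2 = np.ones(K)

for _ in range(n_iter):
    # z* = argmax_z E_q(theta)[ log p(theta, z, D) ]  -> hard assignments
    e_loglik = -0.5 * ((x[:, None] - m[None, :]) ** 2 + s2[None, :])
    z = e_loglik.argmax(axis=1)

    # q(theta) = p(theta | z*, D): exact conjugate Gaussian posterior per cluster
    for k in range(K):
        xk = x[z == k]
        prec = 1.0 / tau2 + xk.size
        m[k] = xk.sum() / prec
        s2[k] = 1.0 / prec

print(m, s2)   # full posterior over the means, driven by hard assignments
```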
5. Iterated conditional modes
Of course, the poster child of approximate inference is to use point estimates for both the parameters θ and the hidden variables z.
θ∗ = argmax_θ p(θ, z∗, D)
z∗ = argmax_z p(θ∗, z, D)
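On the same toy mixture this reduces to something very close to k-means: alternating nearest-mean assignments with MAP mean updates. A minimal sketch, with the same assumed model and names as above:

```python
# Iterated conditional modes sketch: point estimates for both z and theta,
# i.e. essentially MAP-regularized k-means on unit-variance clusters.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, tau2, n_iter = 2, 10.0, 50

mu = rng.normal(0, 1, K)

for _ in range(n_iter):
    # z* = argmax_z p(theta*, z, D): assign each point to the nearest mean
    z = np.abs(x[:, None] - mu[None, :]).argmin(axis=1)

    # theta* = argmax_theta p(theta, z*, D): MAP mean per cluster
    for k in range(K):
        xk = x[z == k]
        mu[k] = xk.sum() / (1.0 / tau2 + xk.size)

print(mu)   # point estimates of the cluster means
```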
To see how maximization-expectation plays out, I highly recommend the article. In my opinion, however, the strength of the article is not its application to a k-means alternative, but its lucid and concise exposition of approximation.