The paper that I found clarifying with respect to expectation-maximization is Bayesian K-Means as a "Maximization-Expectation" Algorithm (pdf) by Welling and Kurihara.
Suppose we have a probabilistic model p(x,z,θ) with observations x, hidden random variables z, and parameters θ. We are given a dataset D and are forced (by higher powers) to establish p(z,θ|D).
1. Gibbs sampling
We can approximate p(z,θ|D) by sampling. Gibbs sampling produces samples from p(z,θ|D) by alternating:
θ ∼ p(θ | z, D)
z ∼ p(z | θ, D)
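As a concrete illustration, here is a minimal sketch of this alternation for a toy model of my own choosing (not the paper's Bayesian k-means): a 1-D mixture of K unit-variance Gaussians with uniform mixing weights and a N(0, τ²) prior on each mean. All variable names are assumptions for the example.

```python
# Gibbs sampling sketch: alternate theta ~ p(theta|z,D) and z ~ p(z|theta,D)
# for a toy 1-D mixture of K unit-variance Gaussians, N(0, tau^2) prior on means.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])  # data D
K, tau2, n_iter = 2, 10.0, 500

mu = rng.normal(0, 1, K)          # parameters theta (cluster means)
z = rng.integers(0, K, x.size)    # hidden indicator variables z

for _ in range(n_iter):
    # z ~ p(z | theta, D): categorical, proportional to the Gaussian likelihood
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=row) for row in p])

    # theta ~ p(theta | z, D): conjugate Gaussian posterior per cluster
    for k in range(K):
        xk = x[z == k]
        prec = 1.0 / tau2 + xk.size          # posterior precision
        mu[k] = rng.normal(xk.sum() / prec, np.sqrt(1.0 / prec))

print(np.sort(mu))  # after burn-in, these samples approximate p(theta | D)
```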
2. Variational Bayes
Instead, we can try to establish distributions q(θ) and q(z) and minimize their difference with the distribution we are after, p(θ,z|D). The difference between distributions has a convenient fancy name, the KL divergence. To minimize KL[q(θ)q(z) || p(θ,z|D)] we update:
q(θ) ∝ exp( E_q(z)[ log p(θ, z, D) ] )
q(z) ∝ exp( E_q(θ)[ log p(θ, z, D) ] )
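A minimal sketch of these mean-field updates for the same toy mixture as above (unit-variance components, N(0, τ²) prior on the means); the model and names are illustrative assumptions, not the paper's algorithm.

```python
# Variational Bayes sketch: coordinate-ascent updates of q(z) and q(theta)
# for a toy 1-D mixture of K unit-variance Gaussians, N(0, tau^2) prior on means.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, tau2, n_iter = 2, 10.0, 50

m = rng.normal(0, 1, K)   # q(mu_k) = N(m_k, s2_k)
s2 = np.ones(K)

for _ in range(n_iter):
    # q(z) ∝ exp( E_q(theta)[ log p(theta, z, D) ] )  -> soft responsibilities r
    e_loglik = -0.5 * ((x[:, None] - m[None, :]) ** 2 + s2[None, :])
    r = np.exp(e_loglik - e_loglik.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)

    # q(theta) ∝ exp( E_q(z)[ log p(theta, z, D) ] )  -> Gaussian per cluster
    Nk = r.sum(axis=0)
    prec = 1.0 / tau2 + Nk
    m = (r * x[:, None]).sum(axis=0) / prec
    s2 = 1.0 / prec

print(m, s2)   # approximate posterior over the cluster means
```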
3. Expectation-Maximization
To come up with full-fledged probability distributions for both z and θ might be considered extreme. Why don't we instead use a point estimate for one of them and keep the other nice and nuanced? In EM the parameter θ is established as the one unworthy of a full distribution, and it is set to its MAP (Maximum A Posteriori) value, θ∗.
θ∗ = argmax_θ E_q(z)[ log p(θ, z, D) ]
q(z) = p(z | θ∗, D)
Here θ∗ ∈ argmax would actually be a better notation, since the argmax operator can return multiple values. But let's not nitpick. Compared to variational Bayes you see that undoing the log with an exp does not change the result of the argmax, so it is no longer necessary.
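A minimal sketch of EM on the same toy mixture: q(z) is the exact posterior over the indicators given the current θ∗, and θ∗ is the MAP update under the assumed N(0, τ²) prior. The model and names are my own for illustration.

```python
# EM sketch: full q(z), point estimate theta*, for a toy 1-D mixture of
# K unit-variance Gaussians with a N(0, tau^2) prior on the means.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, tau2, n_iter = 2, 10.0, 50

mu = rng.normal(0, 1, K)   # point estimate theta*

for _ in range(n_iter):
    # E-step: q(z) = p(z | theta*, D)  -> soft responsibilities r
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)

    # M-step: theta* = argmax_theta E_q(z)[ log p(theta, z, D) ]  (MAP update)
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / (1.0 / tau2 + Nk)

print(mu)   # MAP estimates of the cluster means
```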
4. Maximization-Expectation
There is no reason to treat z as a spoiled child. We can just as well use point estimates z∗ for our hidden variables and give the parameters θ the luxury of a full distribution.
z∗ = argmax_z E_q(θ)[ log p(θ, z, D) ]
q(θ) = p(θ | z∗, D)
If our hidden variables z are indicator variables, we suddenly have a computationally cheap way to perform inference on the number of clusters. This is, in other words, model selection (or automatic relevance determination, or imagine another fancy name).
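A minimal sketch of this maximization-expectation scheme on the same toy mixture: hard assignments z∗ (the maximization over z) combined with a full conjugate posterior q(θ) over the means (the expectation over θ). This conveys the flavour of the Bayesian k-means idea but is not the paper's exact algorithm; the model is an assumption on my part.

```python
# Maximization-expectation sketch: point estimate z*, full posterior q(theta),
# for a toy 1-D mixture of K unit-variance Gaussians, N(0, tau^2) prior on means.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, tau2, n_iter = 2, 10.0, 50

m = rng.normal(0, 1, K)    # q(mu_k) = N(m_k, s2_k)
s2 = np.ones(K)

for _ in range(n_iter):
    # z* = argmax_z E_q(theta)[ log p(theta, z, D) ]  -> hard assignments
    e_loglik = -0.5 * ((x[:, None] - m[None, :]) ** 2 + s2[None, :])
    z = e_loglik.argmax(axis=1)

    # q(theta) = p(theta | z*, D): exact conjugate Gaussian posterior per cluster
    for k in range(K):
        xk = x[z == k]
        prec = 1.0 / tau2 + xk.size
        m[k] = xk.sum() / prec
        s2[k] = 1.0 / prec

print(m, s2)   # full posterior over the means, driven by hard assignments
```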
5. Iterated conditional modes
Of course, the poster child of approximate inference is to use point estimates for both the parameters θ and the hidden variables z.
θ∗ = argmax_θ p(θ, z∗, D)
z∗ = argmax_z p(θ∗, z, D)
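On the same toy mixture this reduces to something very close to k-means: alternating nearest-mean assignments with MAP mean updates. A minimal sketch, with the same assumed model and names as above:

```python
# Iterated conditional modes sketch: point estimates for both z and theta,
# i.e. essentially MAP-regularized k-means on unit-variance clusters.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, tau2, n_iter = 2, 10.0, 50

mu = rng.normal(0, 1, K)

for _ in range(n_iter):
    # z* = argmax_z p(theta*, z, D): assign each point to the nearest mean
    z = np.abs(x[:, None] - mu[None, :]).argmin(axis=1)

    # theta* = argmax_theta p(theta, z*, D): MAP mean per cluster
    for k in range(K):
        xk = x[z == k]
        mu[k] = xk.sum() / (1.0 / tau2 + xk.size)

print(mu)   # point estimates of the cluster means
```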
To see how maximization-expectation plays out, I highly recommend the article. In my opinion, however, the strength of the article is not its application to a k-means alternative, but its lucid and concise exposition of approximation.