I think this is a really interesting question. Having slept on it, I think I have an answer. The key issue is the following:
- You've treated the likelihood as a Gaussian pdf. But it isn't a probability distribution over $\mu$: it's a likelihood! Moreover, you haven't labelled your axes clearly. Together, these things have confused everything that follows.
[Figure: sketched against $\mu$, the prior $P(\mu\mid\mu',\sigma')$ centred at $\mu'$ with width $\sigma'$, the likelihood $P(X\mid\mu,\sigma)$ of the data $X$, and the posterior $P(\mu\mid X,\sigma,\mu',\sigma')$.]
How can the likelihood $P(X\mid\mu)$ have the same width and height as the prior? When you break it down, it's actually a strange situation. Think about the forms of the prior and the likelihood:
$$P(\mu\mid\mu',\sigma') = \frac{1}{\sqrt{2\pi\sigma'^2}}\,\exp\!\left(-\frac{(\mu-\mu')^2}{2\sigma'^2}\right)$$

$$P(X\mid\mu,\sigma) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$
The only way I can see that these can have the same width is if $\sigma'^2 = \sigma^2/N$. In other words, your prior is very informative, as its variance is going to be much lower than $\sigma^2$ for any reasonable value of $N$. It is literally as informative as the entire observed dataset $X$!
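To see where that comes from, complete the square in $\mu$: viewed as a function of $\mu$, the likelihood is itself a Gaussian-shaped curve centred on the sample mean $\bar{x}$ with width $\sigma/\sqrt{N}$,

$$P(X\mid\mu,\sigma) \propto \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2\right) \propto \exp\!\left(-\frac{N(\mu-\bar{x})^2}{2\sigma^2}\right),$$

so matching its width to the prior's forces $\sigma'^2 = \sigma^2/N$.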
So, the prior and the likelihood are equally informative. Why isn't the posterior bimodal? This is because of your modelling assumptions. You've implicitly assumed a normal distribution in the way this is set up (normal prior, normal likelihood), and that constrains the posterior to give a unimodal answer. That's just a property of normal distributions that you have baked into the problem by using them. A different model would not necessarily have done this. I have a feeling (though lack a proof right now) that a Cauchy distribution can have a multimodal likelihood, and hence a multimodal posterior.
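That feeling is easy to check numerically. Here's a minimal sketch (the two observations at $\pm 5$ and the unit scale are my own illustrative choices, not anything from the question): a Cauchy location likelihood built from two well-separated points has two distinct modes.

```python
import numpy as np
from scipy.stats import cauchy

# Two hypothetical observations, far apart relative to the scale
x = np.array([-5.0, 5.0])
mu_grid = np.linspace(-10, 10, 2001)

# Cauchy log-likelihood of the location parameter mu (scale fixed at 1)
loglik = np.array([cauchy.logpdf(x, loc=m, scale=1.0).sum() for m in mu_grid])

# Interior local maxima of the log-likelihood
peaks = (loglik[1:-1] > loglik[:-2]) & (loglik[1:-1] > loglik[2:])
print(mu_grid[1:-1][peaks])  # two modes, roughly at -4.9 and +4.9
```

Swap in `norm.logpdf` and the same grid search always finds a single peak at the sample mean, which is exactly the unimodality the normal model bakes in.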
So, the posterior has to be unimodal, and the prior is as informative as the likelihood. Under these constraints, the most sensible estimate starts to sound like a point directly between the prior and the likelihood, as we have no reasonable way to tell which to believe. But why does the posterior get tighter?
I think the confusion here comes from the fact that in this model, $\sigma$ is assumed to be known. Were it unknown, and we had a two-dimensional distribution over $\mu$ and $\sigma$, the observation of data far from the prior might make a high value of $\sigma$ more probable, and so increase the variance of the posterior distribution of the mean too (as the two are linked). But we're not in that situation: $\sigma$ is treated as known here. As such, adding more data can only make us more confident in our prediction of the position of $\mu$, and hence the posterior becomes narrower.
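Both effects drop straight out of the standard conjugate update for a normal mean with known variance:

$$\frac{1}{\sigma_{\mathrm{post}}^2} = \frac{1}{\sigma'^2} + \frac{N}{\sigma^2}, \qquad \mu_{\mathrm{post}} = \sigma_{\mathrm{post}}^2\left(\frac{\mu'}{\sigma'^2} + \frac{N\bar{x}}{\sigma^2}\right).$$

Precisions add, so the posterior is narrower than the prior no matter where the data land; and with $\sigma'^2 = \sigma^2/N$ as above, this gives $\mu_{\mathrm{post}} = (\mu' + \bar{x})/2$ and $\sigma_{\mathrm{post}}^2 = \sigma'^2/2$: exactly halfway between the two, and narrower by a factor of $\sqrt{2}$.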
(A way to visualise it might be to imagine estimating the mean of a Gaussian, with known variance, using just two sample points. If the two sample points are separated by very much more than the width of the Gaussian (i.e. they're out in the tails), then that's strong evidence the mean actually lies between them. Shifting the mean just slightly from this position will cause an exponential drop-off in the probability of one sample or the other.)
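A quick numerical check of that intuition (again, the two points at $\pm 5$ and $\sigma = 1$ are just illustrative choices): the joint log-probability is maximised at the midpoint and falls off quadratically in log-space, i.e. exponentially in probability, as $\mu$ moves away.

```python
import numpy as np
from scipy.stats import norm

x = np.array([-5.0, 5.0])   # two samples far out in the tails (sigma = 1)

# Joint log-probability of both samples as the assumed mean shifts
for mu in (0.0, 0.5, 1.0, 2.0):
    print(mu, norm.logpdf(x, loc=mu, scale=1.0).sum())

# Relative to the midpoint mu = 0, the joint log-probability drops by
# mu**2 / sigma**2, so the probability itself falls off exponentially fast.
```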
In summary, the situation you have described is a bit odd, and by using the model you have, you've baked some assumptions (e.g. unimodality) into the problem that you didn't realise you had. But otherwise, the conclusion is correct.