How to assess the similarity of two histograms?

33

Given two histograms, how do we assess whether they are similar or not?

Is it sufficient to simply look at the two histograms? A simple one-to-one mapping of bins has the problem that if one histogram is slightly different and slightly shifted, we will not get the desired result.

Any suggestions?

histogram image-processing

— Mew 3,4

2

What does "similar" mean? The chi-squared test and the KS test, for example, test whether two histograms are close to identical. But "similar" might mean "have the same shape," ignoring any differences of location and/or scale. Could you clarify your intent?

— whuber

8

A recent paper that may be worth reading is:

Cao, Y. and Petzold, L. Accuracy limitations and the measurement of errors in the stochastic simulation of chemically reacting systems, 2006.

Although the focus of this paper is on comparing stochastic simulation algorithms, the main idea is essentially how to compare two histograms.

You can access the pdf from the author's webpage.

— csgillespie

Hi, it's a nice paper, thanks for providing the pdf link. I will certainly go through it.

— Mew 3,4

12

Instead of only providing a reference, it would be good if you summarized the main points of the paper. Links die, so in the future your answer may become useless to non-subscribers of this journal (and the vast majority of the human population are non-subscribers).

— Tim

27

There are plenty of distance measures between two histograms. You can read a good categorization of these measures in:

K. Meshgi and S. Ishii, "Expanding Histogram of Colors with Gridding to Improve Tracking Accuracy," in Proc. of MVA'15, Tokyo, Japan, May 2015.

The most popular distance functions are listed here for your convenience:

$L_0$ or Hellinger Distance

$D_{L0} = \sum\limits_{i} h_1(i) \neq h_2(i)$

$L_1$, Manhattan, or City Block Distance

$D_{L1} = \sum_{i}\lvert h_1(i) - h_2(i) \rvert$

$L_2$ or Euclidean Distance

$D_{L2} = \sqrt{\sum_{i}\left( h_1(i) - h_2(i) \right) ^2 }$

$L_{\infty}$ or Chebyshev Distance

$D_{L\infty} = \max_{i}\lvert h_1(i) - h_2(i) \rvert$

$L_p$ or Fractional Distance (part of the Minkowski distance family)

$D_{Lp} = \left(\sum\limits_{i}\lvert h_1(i) - h_2(i) \rvert ^p \right)^{1/p}$ and $0<p<1$
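The Minkowski-family distances above differ only in the exponent, so they can be sketched with one function. This is a minimal illustration, not the author's Matlab code; the name `minkowski_distance` and the sample histograms are made up for the example, and both histograms are assumed to share the same bins.

```python
import math


def minkowski_distance(h1, h2, p):
    """L_p distance between two histograms defined on the same bins.

    p = 0 counts the bins that differ, p = math.inf gives the Chebyshev
    distance, and 0 < p < 1 gives the fractional distances.
    """
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    diffs = [abs(a - b) for a, b in zip(h1, h2)]
    if p == 0:
        return sum(1 for d in diffs if d != 0)
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


h1 = [0.1, 0.4, 0.3, 0.2]
h2 = [0.2, 0.3, 0.3, 0.2]
for p in (0, 0.5, 1, 2, math.inf):
    print(p, minkowski_distance(h1, h2, p))
```

All of these compare bin `i` only with bin `i`, so they react strongly when a histogram is merely shifted by a bin or two.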

Histogram Intersection

$D_{\cap} = 1 - \frac{\sum_{i} \min\left(h_1(i),h_2(i)\right)}{\min\left(\sum_i h_1(i),\sum_i h_2(i)\right)}$

Cosine Distance

$D_{CO} = 1 - \sum_i h_1(i)h_2(i)$

Canberra Distance

$D_{CB} = \sum_i \frac{\lvert h_1(i)-h_2(i) \rvert}{min\left( \lvert h_1(i)\rvert,\lvert h_2(i)\rvert \right)}$

Pearson's Correlation Coefficient

$D_{CR} = \frac{\sum_i \left(h_1(i)- \frac{1}{n} \right)\left(h_2(i)- \frac{1}{n} \right)}{\sqrt{\sum_i \left(h_1(i)- \frac{1}{n} \right)^2\sum_i \left(h_2(i)- \frac{1}{n} \right)^2}}$

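Kolmogorov-Smirnov Divergence

$D_{KS} = \max_{i}\lvert H_1(i) - H_2(i) \rvert$

Match Distance

$D_{MA} = \sum\limits_{i}\lvert H_1(i) - H_2(i) \rvert$

Cramér-von Mises Distance

$D_{CM} = \sum\limits_{i}\left( H_1(i) - H_2(i) \right)^2$

In these three distances, $H_k(i) = \sum_{j \leq i} h_k(j)$ denotes the cumulative histogram. Computed on the raw bins instead, they would coincide with $D_{L\infty}$, $D_{L1}$, and the square of $D_{L2}$.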
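The following sketch assumes these three distances are evaluated on cumulative histograms, which is their standard definition and what separates them from the bin-wise $L_\infty$, $L_1$, and $L_2$ distances. The function name `cumulative_distances` and the spike histograms are illustrative only.

```python
from itertools import accumulate


def cumulative_distances(h1, h2):
    """Return (KS, match, Cramer-von Mises) computed on cumulative histograms."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    gaps = [abs(a - b) for a, b in zip(accumulate(h1), accumulate(h2))]
    return max(gaps), sum(gaps), sum(g * g for g in gaps)


base = [1, 0, 0, 0]
near = [0, 1, 0, 0]  # the same spike shifted by one bin
far = [0, 0, 0, 1]   # the same spike shifted by three bins

# The bin-wise L1 distance cannot tell the two shifts apart ...
print(sum(abs(a - b) for a, b in zip(base, near)))  # 2
print(sum(abs(a - b) for a, b in zip(base, far)))   # 2

# ... while the match and Cramer-von Mises distances grow with the shift.
print(cumulative_distances(base, near))  # (1, 1, 1)
print(cumulative_distances(base, far))   # (1, 3, 3)
```

For one-dimensional histograms of equal total mass, the match distance coincides with the Earth Mover's Distance under the ground distance $|i-j|$, which is why it degrades gracefully under small shifts.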

$\chi^2$ Statistics

$D_{\chi^2} = \sum_i \frac{\left(h_1(i) - h_2(i)\right)^2}{h_1(i) + h_2(i)}$

Bhattacharyya Distance

$D_{BH} = \sqrt{1-\sum_i \sqrt{h_1(i)h_2(i)}}$ (in this form also known as the Hellinger distance)

Squared Chord

$D_{SC} = \sum_i\left(\sqrt{h_1(i)}-\sqrt{h_2(i)}\right)^2$
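A minimal sketch of the $\chi^2$, Bhattacharyya, and squared-chord distances follows. It assumes non-negative histograms normalized to sum to one, and it treats bins that are empty in both histograms as contributing zero to $\chi^2$ (an assumption, since the formula above is undefined there). The function names are illustrative.

```python
import math


def chi_square_distance(h1, h2):
    # Bins that are empty in both histograms are skipped (0/0 taken as 0).
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def bhattacharyya_distance(h1, h2):
    coefficient = sum(math.sqrt(a * b) for a, b in zip(h1, h2))
    # Clamp at zero to guard against round-off pushing 1 - coefficient negative.
    return math.sqrt(max(0.0, 1.0 - coefficient))


def squared_chord_distance(h1, h2):
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(h1, h2))


h1 = [0.5, 0.5, 0.0]
h2 = [0.5, 0.0, 0.5]
print(chi_square_distance(h1, h2))
print(bhattacharyya_distance(h1, h2))
print(squared_chord_distance(h1, h2))
```

For normalized histograms the squared chord equals $2 D_{BH}^2$, so the last two carry the same information and rank pairs of histograms identically.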

Kullback-Leibler Divergence

$D_{KL} = \sum_i h_1(i)\log\frac{h_1(i)}{h_2(i)}$

Jeffrey Divergence

$D_{JD} = \sum_i \left(h_1(i)\log\frac{h_1(i)}{m(i)}+h_2(i)\log\frac{h_2(i)}{m(i)}\right)$, where $m(i) = \frac{h_1(i)+h_2(i)}{2}$
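The two divergences can be sketched as below, taking the Kullback-Leibler divergence in its usual form with $h_2(i)$ in the denominator and $m(i)$ as the bin-wise average of the two histograms. The handling of empty bins ($0 \log 0 = 0$, and an infinite result when $h_1$ has mass where $h_2$ has none) is a common convention, stated here as an assumption. Function names are illustrative.

```python
import math


def kl_divergence(h1, h2):
    total = 0.0
    for a, b in zip(h1, h2):
        if a == 0:
            continue  # 0 * log(0 / b) is taken as 0
        if b == 0:
            return math.inf  # h1 has mass in a bin where h2 has none
        total += a * math.log(a / b)
    return total


def jeffrey_divergence(h1, h2):
    total = 0.0
    for a, b in zip(h1, h2):
        m = (a + b) / 2.0  # m > 0 whenever a > 0 or b > 0
        if a > 0:
            total += a * math.log(a / m)
        if b > 0:
            total += b * math.log(b / m)
    return total


h1 = [0.5, 0.5, 0.0]
h2 = [0.25, 0.25, 0.5]
print(kl_divergence(h1, h2))       # finite: h2 covers every bin h1 uses
print(kl_divergence(h2, h1))       # inf: h1 is empty where h2 has mass
print(jeffrey_divergence(h1, h2))  # symmetric and always finite
```

The asymmetry and the infinite values are the practical reason the symmetrized, averaged form tends to be preferred for sparse histograms.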

Earth Mover's Distance (this is the first member of the transportation distances that embed binning information $A$ into the distance; for more information, please refer to the above-mentioned paper or the Wikipedia entry)

$D_{EM} = \frac{\min_{f_{ij}}\sum_{i,j}f_{ij}A_{ij}}{\sum_{i,j}f_{ij}}$ subject to $\sum_j f_{ij} \leq h_1(i)$, $\sum_i f_{ij} \leq h_2(j)$, $\sum_{i,j} f_{ij} = \min\left( \sum_i h_1(i), \sum_j h_2(j) \right)$, where $f_{ij}$ represents the flow from $i$ to $j$

Quadratic Distance

$D_{QU} = \sqrt{\sum_{i,j} A_{ij}\left(h_1(i) - h_2(j)\right)^2}$

Quadratic-Chi Distance

$D_{QC} = \sqrt{\sum_{i,j} A_{ij}\left(\frac{h_1(i) - h_2(i)}{\left(\sum_c A_{ci}\left(h_1(c)+h_2(c)\right)\right)^m}\right)\left(\frac{h_1(j) - h_2(j)}{\left(\sum_c A_{cj}\left(h_1(c)+h_2(c)\right)\right)^m}\right)}$ and $\frac{0}{0} \equiv 0$

A Matlab implementation of some of these distances is available from my GitHub repository: https://github.com/meshgi/Histogram_of_Color_Advancements/tree/master/distance You can also search for the work of Yossi Rubner, Ofir Pele, Marco Cuturi, and Haibin Ling for more state-of-the-art distances.

Update: Alternative definitions of these distances appear here and there in the literature, so I list them here for the sake of completeness.

Canberra distance (another version)

$D_{CB}=\sum_i \frac{|h_1(i)-h_2(i)|}{|h_1(i)|+|h_2(i)|}$

Bray-Curtis Dissimilarity, Sorensen Distance (since each histogram sums to one, it equals $D_{L0}$)

$D_{BC} = 1 - \frac{2 \sum_i h_1(i) = h_2(i)}{\sum_i h_1(i) + \sum_i h_2(i)}$

Jaccard Distance (i.e. intersection over union, another version)

$D_{IOU} = 1 - \frac{\sum_i min(h_1(i),h_2(i))}{\sum_i max(h_1(i),h_2(i))}$
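The overlap-based distances can be sketched as follows. The intersection distance here divides by the smaller of the two total masses, which is one reading of the histogram intersection formula given earlier and should be treated as an assumption. Function names and the sample counts are illustrative.

```python
def intersection_distance(h1, h2):
    """1 - overlap / smaller total mass; 0 when one histogram contains the other."""
    smaller_mass = min(sum(h1), sum(h2))
    if smaller_mass == 0:
        raise ValueError("at least one histogram is empty")
    overlap = sum(min(a, b) for a, b in zip(h1, h2))
    return 1.0 - overlap / smaller_mass


def jaccard_distance(h1, h2):
    """1 - intersection over union of the two histograms."""
    union = sum(max(a, b) for a, b in zip(h1, h2))
    if union == 0:
        raise ValueError("both histograms are empty")
    overlap = sum(min(a, b) for a, b in zip(h1, h2))
    return 1.0 - overlap / union


h1 = [2, 3, 5]
h2 = [4, 1, 5]
print(intersection_distance(h1, h2))  # overlap 8 out of a mass of 10
print(jaccard_distance(h1, h2))       # overlap 8 out of a union of 12
```

Both work directly on raw counts, but the intersection distance ignores any excess mass in the larger histogram, whereas the Jaccard distance penalizes it.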

— Kourosh Meshgi

Welcome to our site! Thank you for this contribution.

— whuber

Here is the paper link: mva-org.jp/Proceedings/2015USB/papers/14-15.pdf

— neves

Thanks, the list is wonderful, although it doesn't let you define a comparison operator for histograms, e.g. to say that hist1 < hist2.

— Olha Pavliuk

22

The standard answer to this question is the chi-squared test. The KS test is for unbinned data, not binned data. (If you have the unbinned data, then by all means use a KS-style test, but if you only have the histogram, the KS test is not appropriate.)
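As a sketch of what that test computes for two histograms of raw counts on the same bins: the statistic below is the Pearson chi-squared statistic for homogeneity, written in the form that allows the two samples to have different sizes. The function name is illustrative, and bins that are empty in both samples are skipped by assumption.

```python
import math


def chi_square_two_sample(counts1, counts2):
    """Return (statistic, degrees_of_freedom) for two histograms of counts."""
    n1, n2 = sum(counts1), sum(counts2)
    if n1 == 0 or n2 == 0:
        raise ValueError("both histograms need at least one observation")
    scale1, scale2 = math.sqrt(n2 / n1), math.sqrt(n1 / n2)
    statistic, used_bins = 0.0, 0
    for a, b in zip(counts1, counts2):
        if a + b == 0:
            continue  # a bin empty in both samples carries no information
        statistic += (scale1 * a - scale2 * b) ** 2 / (a + b)
        used_bins += 1
    return statistic, used_bins - 1


print(chi_square_two_sample([10, 20, 30], [20, 40, 60]))  # same shape: statistic near 0
print(chi_square_two_sample([10, 10], [20, 0]))           # different shape
```

Under the null hypothesis that both samples come from the same distribution, the statistic is approximately chi-squared distributed with the returned degrees of freedom, provided the expected counts per bin are not too small. Turning it into a p-value requires the chi-squared survival function, for example `scipy.stats.chi2.sf`, which is not used here to keep the sketch dependency-free.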

— David Wright

You are correct that the KS test is not appropriate for histograms when it is understood as a hypothesis test about the distribution of the underlying data, but I see no reason why the KS statistic wouldn't work well as a measure of sameness of any two histograms.

— whuber

An explanation of why the Kolmogorov-Smirnov test is not appropriate with binned data would be useful.

— naught101

This may not be as useful in image processing as in statistical fit assessment. Often in image processing, a histogram of data is used as a descriptor for a region of an image, and the goal is for a distance between histograms to reflect the distance between image patches. Little, or possibly nothing at all, may be known about the general population statistics of the underlying image data used to get the histogram. For example, the underlying population statistics when using histograms of oriented gradients would differ considerably based on the actual content of the images.

— ely

1

naught101's question was answered by Stochtastic: stats.stackexchange.com/a/108523/37373

— Lapis

10

You're looking for the Kolmogorov-Smirnov test. Don't forget to divide the bar heights by the sum of all observations of each histogram.

Note that the KS test also reports a difference if, for example, the means of the distributions are shifted relative to one another. If translation of the histogram along the x-axis is not meaningful in your application, you may want to subtract the mean from each histogram first.

— Jonas

1

Subtracting the mean changes the null distribution of the KS statistic. @David Wright raises a valid objection to the application of the KS test to histograms anyway.

— whuber

7

As David's answer points out, the chi-squared test is necessary for binned data as the KS test assumes continuous distributions. Regarding why the KS test is inappropriate (naught101's comment), there has been some discussion of the issue in the applied statistics literature that is worth raising here.

An amusing exchange began with the claim (García-Berthou and Alcaraz, 2004) that one third of Nature papers contain statistical errors. However, a subsequent paper (Jeng, 2006, "Error in statistical tests of error in statistical tests" -- perhaps my all-time favorite paper title) showed that García-Berthou and Alcaraz (2004) used KS tests on discrete data, leading to their reporting inaccurate p-values in their meta-study. The Jeng (2006) paper provides a nice discussion of the issue, even showing that one can modify the KS test to work for discrete data. In this specific case, the distinction boils down to the difference between a uniform distribution of the trailing digit on [0,9],

$P(x) = \frac{1}{9},\ (0 \leq x \leq 9)$ (in the incorrect KS test) and a comb distribution of delta functions,

$P(x) = \frac{1}{10}\sum_{j=0}^9 \delta(x-j)$ (in the correct, modified form). As a result of the original error, García-Berthou and Alcaraz (2004) incorrectly rejected the null, while the chi-squared and modified KS tests do not. In any case, the chi-squared test is the standard choice in this scenario, even if KS can be modified to work here.
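To get a feel for how far apart those two reference distributions are, the sketch below computes the largest gap between their cumulative distribution functions. It is only an illustration of the mismatch, not a reproduction of either paper's calculation; the function name is made up. Because the comb CDF is a step function and the continuous CDF is increasing, the largest gap must occur at, or just below, one of the digits.

```python
from fractions import Fraction


def cdf_gap_uniform_vs_comb():
    """Largest gap between the CDF of the continuous uniform on [0, 9]
    and the CDF of ten equally likely digits 0..9."""
    worst = Fraction(0)
    for j in range(10):
        continuous = Fraction(j, 9)        # F(x) = x / 9 evaluated at x = j
        comb_below = Fraction(j, 10)       # comb CDF just below the jump at j
        comb_at = Fraction(j + 1, 10)      # comb CDF at j, after the jump
        worst = max(worst, abs(continuous - comb_below), abs(comb_at - continuous))
    return worst


print(cdf_gap_uniform_vs_comb())  # 1/10
```

So even perfectly uniform trailing digits sit at a KS distance of 0.1 from the continuous reference, and with a large enough sample a KS test against the continuous null will flag that built-in gap as significant.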

— Stochtastic

-1

You can compute the cross-correlation (convolution) between both histograms. That will take slight translations into account.
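A minimal sketch of that idea, assuming both histograms use the same equally spaced bins: slide one histogram over the other, score each offset by the sum of bin-wise products, and keep the best offset. The function name `best_shift` and the `max_lag` parameter are illustrative.

```python
def best_shift(h1, h2, max_lag):
    """Return (lag, score) where shifting h2 right by `lag` bins best matches h1."""

    def score(lag):
        # Sum of products over the bins where the shifted h2 overlaps h1.
        return sum(
            h1[i] * h2[i - lag]
            for i in range(len(h1))
            if 0 <= i - lag < len(h2)
        )

    candidates = ((lag, score(lag)) for lag in range(-max_lag, max_lag + 1))
    return max(candidates, key=lambda pair: pair[1])


h1 = [0, 0, 1, 3, 1, 0, 0]
h2 = [0, 1, 3, 1, 0, 0, 0]  # the same bump, one bin to the left
print(best_shift(h1, h2, max_lag=3))  # (1, 11)
```

The peak score can serve as a shift-tolerant similarity, and the lag itself says how far the histograms are displaced. The raw score depends on the overall scale of the histograms, so normalizing them first is advisable before comparing scores across pairs.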

— Juan Manuel Tonello

1

This is being automatically flagged as low quality, probably because it is so short. At present it is more of a comment than an answer by our standards. Can you expand on it? We can also turn it into a comment.

— gung - Reinstate Monica

Since histograms are fairly unstable representations of data, and also because they do not represent probabilities using height alone (they use area), one might reasonably question the applicability, generality, or usefulness of this approach unless more specific guidance is provided.

— whuber