Is the triangle inequality fulfilled for these correlation-based distances?

12

For hierarchical clustering I often see the following two "metrics" (strictly speaking, they are not metrics) used to measure the distance between two random variables $X$ and $Y$: $\newcommand{\Cor}{\mathrm{Cor}}$

$\begin{align} d_1(X,Y) &= 1-|\Cor(X,Y)|, \\ d_2(X,Y) &= 1-(\Cor(X,Y))^2 \end{align}$

Does either one fulfill the triangle inequality? If so, how should I prove it other than by a brute-force calculation? If they are not metrics, what is a simple counterexample?

— Gia Hân

You might be interested in having a look at this paper: arxiv.org/pdf/1208.3145.pdf .

— Chris

5

The triangle inequality for your $d_1$ would yield: $\newcommand{\Cov}{\mathrm{Cov}}$ $\newcommand{\Cor}{\mathrm{Cor}}$ $\newcommand{\Var}{\mathrm{Var}}$

$\begin{align*} d_1(X,Z) &\leq d_1(X,Y) + d_1(Y,Z) \\ 1 - |\Cor(X,Z)| &\leq 1 - |\Cor(X,Y)| + 1 - |\Cor(Y,Z)| \\ \implies |\Cor(X,Y)| + |\Cor(Y,Z)| &\leq 1 + |\Cor(X,Z)| \end{align*}$

This seems a fairly easy inequality to defeat. We can make the right-hand side as small as possible (exactly one) by making $X$ and $Z$ independent. Can we then find a $Y$ for which the left-hand side exceeds one?

If $Y=X+Z$, and $X$ and $Z$ have identical variance, then $\Cor(X,Y) = \frac{\sqrt{2}}{2} \approx 0.707$ and similarly for $\Cor(Y,Z)$, so the left-hand side is well above one and the inequality is violated. Here is an example of this violation in R, where $X$ and $Z$ are components of a multivariate normal:

library(MASS)
set.seed(123)
d1 <- function(a,b) {1 - abs(cor(a,b))}

Sigma    <- matrix(c(1,0,0,1), nrow=2) # covariance matrix of X and Z
matrixXZ <- mvrnorm(n=1e3, mu=c(0,0), Sigma=Sigma, empirical=TRUE)
X <- matrixXZ[,1] # mean 0, variance 1
Z <- matrixXZ[,2] # mean 0, variance 1
cor(X,Z) # nearly zero
Y <- X + Z

d1(X,Y) 
# 0.2928932
d1(Y,Z)
# 0.2928932
d1(X,Z)
# 1
d1(X,Z) <= d1(X,Y) + d1(Y,Z)
# FALSE
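
The sample values printed above agree with a short population-level calculation. As a sketch, write $\sigma^2$ for the common variance of $X$ and $Z$ (a symbol introduced here only for illustration) and assume $\mathrm{Cov}(X,Z)=0$:

$\begin{align*} \mathrm{Cov}(X,Y) &= \mathrm{Cov}(X,X+Z) = \mathrm{Var}(X) + \mathrm{Cov}(X,Z) = \sigma^2 \\ \mathrm{Var}(Y) &= \mathrm{Var}(X) + \mathrm{Var}(Z) + 2\,\mathrm{Cov}(X,Z) = 2\sigma^2 \\ \mathrm{Cor}(X,Y) &= \frac{\sigma^2}{\sqrt{\sigma^2 \cdot 2\sigma^2}} = \frac{1}{\sqrt{2}} \end{align*}$

By symmetry $\mathrm{Cor}(Y,Z) = 1/\sqrt{2}$ as well, so the left-hand side is $\sqrt{2} \approx 1.414$ while the right-hand side is $1$.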

Note, though, that this construction does not work with your $d_2$:

d2 <- function(a,b) {1 - cor(a,b)^2}
d2(X,Y) 
# 0.5
d2(Y,Z)
# 0.5
d2(X,Z)
# 1
d2(X,Z) <= d2(X,Y) + d2(Y,Z)
# TRUE

Rather than launch a theoretical attack on $d_2$, at this stage I just found it easier to play around with the covariance matrix `Sigma` in R until a nice counterexample popped out. Taking $\Var(X)=2$, $\Var(Z)=1$ and $\Cov(X,Z)=1$ gives:

$\Var(Y)=\Var(X+Z)=\Var(X)+\Var(Z)+2\Cov(X,Z)=2+1+2=5$

We can also work out the covariances:

$\Cov(X,Y)=\Cov(X,X+Z)=\Cov(X,X)+\Cov(X,Z)=2+1=3$

$\Cov(Y,Z)=\Cov(X+Z,Z)=\Cov(X,Z)+\Cov(Z,Z)=1+1=2$

The squared correlations are then:

$\Cor(X,Z)^2 = \frac{\Cov(X,Z)^2}{\Var(X)\Var(Z)}=\frac{1^2}{2\times1}=0.5$

$\Cor(X,Y)^2 = \frac{\Cov(X,Y)^2}{\Var(X)\Var(Y)}=\frac{3^2}{2\times5}=0.9$

$\Cor(Y,Z)^2 = \frac{\Cov(Y,Z)^2}{\Var(Y)\Var(Z)}=\frac{2^2}{5\times1}=0.8$

Then $d_2(X,Z)=0.5$ while $d_2(X,Y)=0.1$ and $d_2(Y,Z)=0.2$, so the triangle inequality is violated by a substantial margin ($0.5 > 0.1 + 0.2$).

Sigma    <- matrix(c(2,1,1,1), nrow=2) # covariance matrix of X and Z
matrixXZ <- mvrnorm(n=1e3, mu=c(0,0), Sigma=Sigma, empirical=TRUE)
X <- matrixXZ[,1] # mean 0, variance 2
Z <- matrixXZ[,2] # mean 0, variance 1
cor(X,Z) # 0.707
Y  <- X + Z
d2 <- function(a,b) {1 - cor(a,b)^2}
d2(X,Y) 
# 0.1
d2(Y,Z)
# 0.2
d2(X,Z)
# 0.5
d2(X,Z) <= d2(X,Y) + d2(Y,Z)
# FALSE
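
Both counterexamples can also be checked without simulation, directly from the covariance matrix. The following is only a sketch: the helpers `cor_with_sum` and `triangle_holds` are hypothetical names introduced here, not part of the answer above, and the distances are written as functions of a correlation value rather than of two data vectors.

```r
# Hypothetical helper: given the 2x2 covariance matrix of (X, Z), return the
# population correlation matrix of (X, Y, Z) with Y = X + Z.
cor_with_sum <- function(Sigma) {
  A <- rbind(X = c(1, 0),   # X = 1*X + 0*Z
             Y = c(1, 1),   # Y = 1*X + 1*Z
             Z = c(0, 1))   # Z = 0*X + 1*Z
  cov2cor(A %*% Sigma %*% t(A))
}

# TRUE if d(X,Z) <= d(X,Y) + d(Y,Z); tol guards against rounding error
# in the borderline case where both sides are equal.
triangle_holds <- function(R, d, tol = 1e-12) {
  d(R["X", "Z"]) <= d(R["X", "Y"]) + d(R["Y", "Z"]) + tol
}

d1_r <- function(r) 1 - abs(r)   # d1 as a function of the correlation
d2_r <- function(r) 1 - r^2      # d2 as a function of the correlation

R_indep <- cor_with_sum(matrix(c(1, 0, 0, 1), nrow = 2))  # first example
R_corr  <- cor_with_sum(matrix(c(2, 1, 1, 1), nrow = 2))  # second example

triangle_holds(R_indep, d1_r)  # FALSE
triangle_holds(R_indep, d2_r)  # TRUE (1 <= 0.5 + 0.5)
triangle_holds(R_corr,  d2_r)  # FALSE
```

Working from the covariance matrix avoids the dependence on `mvrnorm(..., empirical=TRUE)` and makes it quick to try other choices of `Sigma`.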

— Cá bạc

5

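Let us have three vectors (they could be variables or individuals) $X$, $Y$ and $Z$, each standardized to z-scores.

$\newcommand{\Cor}{\mathrm{Cor}}$

By the law of cosines, the squared Euclidean distance between two standardized vectors, say $X$ and $Y$, is $d_{XY}^2 = 2(n-1)(1-\cos_{XY})$, where the cosine similarity $\cos_{XY}$ equals the Pearson correlation $r_{XY}$ because the vectors are standardized. The constant multiplier $2(n-1)$ can be dropped from consideration.

So the distance given in the question as

$d_1(X,Y)=1-|\Cor(X,Y)|$

would be the squared Euclidean distance if the formula used $r$ rather than $|r|$, that is, if it did not ignore the sign of the correlation.

As for "d1" per se, it is "like" the squared Euclidean $d$, and squared Euclidean distance is not a metric. To see it violate the triangle inequality, draw three vectors in a special arrangement:

[Figure: three vectors in a plane, with squares drawn on the distances between them]

The angles $\alpha$, $\beta$ and $\alpha+\beta$ have cosines $r_{XY}$, $r_{XZ}$ and $r_{YZ}$ respectively, and they span the corresponding Euclidean distances $d_{XY}$, $d_{XZ}$ and $d_{YZ}$ between the vectors. For simplicity all three vectors lie in the same plane (so the largest angle is the sum of the other two, $\alpha+\beta$). This is the arrangement in which the violation of the triangle inequality by squared distances is most prominent:

$d_{YZ}^2 > d_{XY}^2 + d_{XZ}^2$ .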
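
The identity $d_{XY}^2 = 2(n-1)(1-\cos_{XY})$ for standardized vectors is easy to confirm numerically. This is a minimal sketch with made-up data; the seed, the sample size and the way `y` is generated are arbitrary choices for illustration.

```r
set.seed(1)                  # arbitrary seed, only for reproducibility
n <- 50
x <- rnorm(n)
y <- 0.6 * x + rnorm(n)      # any two non-constant vectors would do

zx <- as.numeric(scale(x))   # z-scores: mean 0, sample variance 1
zy <- as.numeric(scale(y))

lhs <- sum((zx - zy)^2)                # squared Euclidean distance
rhs <- 2 * (n - 1) * (1 - cor(x, y))   # law-of-cosines form
all.equal(lhs, rhs)
# TRUE
```

The factor $n-1$ appears because `scale()` standardizes with the sample standard deviation, so each z-scored vector has squared length $n-1$.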

Therefore, regarding the

$d_1(X,Y)=1-|\Cor(X,Y)|$

distance, we can say that it is not a metric: even when all the $r$ values are positive to begin with, the distance is the squared Euclidean $d^2$, which itself is not a metric.

What about the second distance?

$d_2(X,Y)=1-(\Cor(X,Y))^2$

Since correlation $r$ in the case of standardized vectors is $\cos$ , $1-r^2$ is $\sin^2$ . (Indeed, $1-r^2$ is SSerror/SStotal of a linear regression, a quantity which is the squared correlation of the dependent variable with something orthogonal to the predictor.) In that case draw the sines of the vectors, and make them squared (because we are talking about the distance which is $\sin^2$ ):

[Figure: the sines of the three vectors drawn as squares, green for $\sin_{YZ}^2$ and red for $\sin_{XY}^2$ and $\sin_{XZ}^2$]

Although it is not quite obvious visually, the green $\sin_{YZ}^2$ square is again larger than the sum of red areas $\sin_{XY}^2 + \sin_{XZ}^2$ .

It could be proved. On a plane, $\sin(\alpha+\beta) = \sin\alpha \cos\beta + \cos\alpha \sin\beta$ . Square both sides since we are interested in $\sin^2$ .

$\begin{align} \sin^2(\alpha+\beta) &= \sin^2\alpha (1-\sin^2\beta) + (1-\sin^2\alpha) \sin^2\beta + 2 \sin\alpha \cos\beta \cos\alpha \sin\beta \\ &= \sin^2\alpha + \sin^2\beta -2 [\sin^2\alpha \sin^2\beta] +2 [\sin\alpha \cos\alpha \sin\beta \cos\beta] \end{align}$

In the last expression, two important terms are shown bracketed. If the second of the two is (or can be) larger than the first one then $\sin^2(\alpha+\beta) > \sin^2\alpha + \sin^2\beta$ , and the "d2" distance violates triangular inequality. And it is so on our picture where $\alpha$ is about 40 degrees and $\beta$ is about 30 degrees (term 1 is .1033 and term 2 is .2132). "D2" isn't metric.
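
The two bracketed terms can be evaluated directly. This is a small sketch at the angles quoted above; the variable names are chosen here for illustration.

```r
deg   <- pi / 180            # degrees-to-radians factor
alpha <- 40 * deg
beta  <- 30 * deg

term1 <- sin(alpha)^2 * sin(beta)^2                       # first bracket
term2 <- sin(alpha) * cos(alpha) * sin(beta) * cos(beta)  # second bracket
round(c(term1 = term1, term2 = term2), 4)
# term1 = 0.1033, term2 = 0.2132

lhs <- sin(alpha + beta)^2            # "d2" along the long side
rhs <- sin(alpha)^2 + sin(beta)^2     # sum of "d2" along the two short sides
lhs > rhs
# TRUE
```

The gap `lhs - rhs` equals `2 * (term2 - term1)`, which is the content of the expansion above.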

The square root of the "d2" distance, the sine dissimilarity measure, is a metric though (I believe). You can play with various $\alpha$ and $\beta$ angles on my circle to make sure. Whether the square root of "d2" is also a metric in a non-coplanar setting (i.e. three vectors not lying in one plane) I can't say at this time, although I tentatively suppose it is.
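
One way to probe the non-planar case is a random search for violations. This is only a numerical sketch, not a proof: the helper name `sine_dist`, the seed, the vector length and the number of trials are all arbitrary choices made here.

```r
set.seed(42)   # arbitrary seed for reproducibility

# Square root of "d2"; max(0, .) guards against tiny negative rounding errors
sine_dist <- function(a, b) sqrt(max(0, 1 - cor(a, b)^2))

violations <- 0
for (i in 1:2000) {
  M <- matrix(rnorm(3 * 10), ncol = 3)   # three random vectors of length 10
  dxy <- sine_dist(M[, 1], M[, 2])
  dyz <- sine_dist(M[, 2], M[, 3])
  dxz <- sine_dist(M[, 1], M[, 3])
  # check all three orderings of the triangle inequality, with a small tolerance
  ok <- dxz <= dxy + dyz + 1e-12 &&
        dxy <= dxz + dyz + 1e-12 &&
        dyz <= dxy + dxz + 1e-12
  if (!ok) violations <- violations + 1
}
violations
# 0
```

Finding no violations is consistent with the square root of "d2" being a metric, but a search of this kind can only ever disprove the claim, never establish it.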

— ttnphns

3

See also this preprint that I wrote: http://arxiv.org/abs/1208.3145 . I still need to take time and properly submit it. The abstract:

We investigate two classes of transformations of cosine similarity and Pearson and Spearman correlations into metric distances, utilising the simple tool of metric-preserving functions. The first class puts anti-correlated objects maximally far apart. Previously known transforms fall within this class. The second class collates correlated and anti-correlated objects. An example of such a transformation that yields a metric distance is the sine function when applied to centered data.

The upshot for your question is that d1, d2 are indeed not metrics and that the square root of d2 is in fact a proper metric.

— micans

2

No.

Simplest counter-example:

for $X=(0,0)$ the distance is not defined at all, whatever your $Y$ is.

Any constant series has standard deviation $\sigma=0$ , and thus causes a division by zero in the definition of $Cor$ ...

At most it is a metric on a subset of the data space, not including any constant series.
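
In R this shows up as a missing value. A minimal illustration, using the constant series from the counterexample and an arbitrary `y`:

```r
x <- c(0, 0)   # constant series, standard deviation 0
y <- c(1, 2)   # arbitrary non-constant series

# cor() warns that the standard deviation is zero and returns NA
r <- suppressWarnings(cor(x, y))

is.na(r)            # TRUE: the correlation is undefined
is.na(1 - abs(r))   # TRUE: so is d1
is.na(1 - r^2)      # TRUE: and d2
```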

— Has QUIT--Anony-Mousse

Good point! I must mention this in the pre-print mentioned elsewhere.

— micans