Liệt kê các cặp tương quan cao nhất từ Ma trận tương quan lớn ở gấu trúc?

Question 1

Làm thế nào để bạn tìm thấy các mối tương quan hàng đầu trong ma trận tương quan với Gấu trúc? Có nhiều câu trả lời về cách thực hiện điều này với R ( Hiển thị các mối tương quan dưới dạng danh sách có thứ tự, không phải dưới dạng ma trận lớn hoặc Cách hiệu quả để nhận các cặp có tương quan cao từ tập dữ liệu lớn bằng Python hoặc R ), nhưng tôi đang tự hỏi làm thế nào để làm điều đó với gấu trúc? Trong trường hợp của tôi, ma trận là 4460x4460, vì vậy không thể làm điều đó một cách trực quan.

Question 2

Bạn có thể sử dụng DataFrame.valuesđể lấy một mảng dữ liệu numpy và sau đó sử dụng các hàm NumPy chẳng hạn như argsort()để lấy các cặp tương quan nhất.

Nhưng nếu bạn muốn làm điều này ở gấu trúc, bạn có thể unstackvà sắp xếp DataFrame:

import pandas as pd
import numpy as np

shape = (50, 4460)

data = np.random.normal(size=shape)

data[:, 1000] += data[:, 2000]

df = pd.DataFrame(data)

c = df.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

print so[-4470:-4460]

Đây là kết quả:

2192  1522    0.636198
1522  2192    0.636198
3677  2027    0.641817
2027  3677    0.641817
242   130     0.646760
130   242     0.646760
1171  2733    0.670048
2733  1171    0.670048
1000  2000    0.742340
2000  1000    0.742340
dtype: float64

Question 3

Câu trả lời của @ HYRY là hoàn hảo. Chỉ cần xây dựng câu trả lời đó bằng cách thêm một chút logic để tránh các tương quan trùng lặp và tương quan với nhau cũng như phân loại phù hợp:

import pandas as pd
d = {'x1': [1, 4, 4, 5, 6], 
     'x2': [0, 0, 8, 2, 4], 
     'x3': [2, 8, 8, 10, 12], 
     'x4': [-1, -4, -4, -4, -5]}
df = pd.DataFrame(data = d)
print("Data Frame")
print(df)
print()

print("Correlation Matrix")
print(df.corr())
print()

def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(df, 3))

Điều đó cho kết quả sau:

Data Frame
   x1  x2  x3  x4
0   1   0   2  -1
1   4   0   8  -4
2   4   8   8  -4
3   5   2  10  -4
4   6   4  12  -5

Correlation Matrix
          x1        x2        x3        x4
x1  1.000000  0.399298  1.000000 -0.969248
x2  0.399298  1.000000  0.399298 -0.472866
x3  1.000000  0.399298  1.000000 -0.969248
x4 -0.969248 -0.472866 -0.969248  1.000000

Top Absolute Correlations
x1  x3    1.000000
x3  x4    0.969248
x1  x4    0.969248
dtype: float64

Question 4

Một vài dòng giải pháp mà không có các cặp biến dư thừa:

corr_matrix = df.corr().abs()

#the matrix is symmetric so we need to extract upper triangle matrix without diagonal (k = 1)

sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
                  .stack()
                  .sort_values(ascending=False))

#first element of sol series is the pair with the biggest correlation

Sau đó, bạn có thể lặp lại tên của các cặp biến (là nhiều chỉ mục pandas.Series) và các giá trị của chúng như sau:

for index, value in sol.items():
  # do some staff

Question 5

Kết hợp một số tính năng của câu trả lời của @HYRY và @ arun, bạn có thể in các mối tương quan hàng đầu cho khung dữ liệu dftrong một dòng duy nhất bằng cách sử dụng:

df.corr().unstack().sort_values().drop_duplicates()

Lưu ý: một nhược điểm là nếu bạn có 1,0 tương quan không phải là một biến với chính nó, thì việc drop_duplicates()bổ sung sẽ loại bỏ chúng

Question 6

Sử dụng mã bên dưới để xem các mối tương quan theo thứ tự giảm dần.

# See the correlations in descending order

corr = df.corr() # df is the pandas dataframe
c1 = corr.abs().unstack()
c1.sort_values(ascending = False)

Question 7

Bạn có thể thực hiện bằng đồ thị theo mã đơn giản này bằng cách thay thế dữ liệu của mình.

corr = df.corr()

kot = corr[corr>=.9]
plt.figure(figsize=(12,8))
sns.heatmap(kot, cmap="Greens")

Question 8

Rất nhiều câu trả lời hay ở đây. Cách dễ nhất mà tôi tìm thấy là sự kết hợp của một số câu trả lời ở trên.

corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(np.bool))
corr = corr.unstack().transpose()\
    .sort_values(by='column', ascending=False)\
    .dropna()

Question 9

Sử dụng itertools.combinationsđể lấy tất cả các mối tương quan duy nhất từ ma trận tương quan của riêng gấu trúc .corr(), tạo danh sách các danh sách và đưa nó trở lại DataFrame để sử dụng '.sort_values'. Đặt ascending = Trueđể hiển thị các mối tương quan thấp nhất ở trên cùng

corranklấy DataFrame làm đối số vì nó yêu cầu .corr().

  def corrank(X: pandas.DataFrame):
        import itertools
        df = pd.DataFrame([[(i,j),X.corr().loc[i,j]] for i,j in list(itertools.combinations(X.corr(), 2))],columns=['pairs','corr'])    
        print(df.sort_values(by='corr',ascending=False))

  corrank(X) # prints a descending list of correlation pair (Max on top)

Question 10

Tôi không muốn unstackhoặc làm quá phức tạp vấn đề này, vì tôi chỉ muốn loại bỏ một số tính năng tương quan cao như một phần của giai đoạn lựa chọn tính năng.

Vì vậy, tôi đã kết thúc với giải pháp đơn giản sau:

# map features to their absolute correlation values
corr = features.corr().abs()

# set equality (self correlation) as zero
corr[corr == 1] = 0

# of each feature, find the max correlation
# and sort the resulting array in ascending order
corr_cols = corr.max().sort_values(ascending=False)

# display the highly correlated features
display(corr_cols[corr_cols > 0.8])

Trong trường hợp này, nếu bạn muốn loại bỏ các đối tượng địa lý tương quan, bạn có thể ánh xạ qua corr_colsmảng đã lọc và loại bỏ các đối tượng được lập chỉ mục lẻ (hoặc lập chỉ mục chẵn).

Question 11

Tôi đã thử một số giải pháp ở đây nhưng sau đó tôi thực sự nghĩ ra giải pháp của riêng mình. Tôi hy vọng điều này có thể hữu ích cho lần tiếp theo nên tôi chia sẻ nó ở đây:

def sort_correlation_matrix(correlation_matrix):
    cor = correlation_matrix.abs()
    top_col = cor[cor.columns[0]][1:]
    top_col = top_col.sort_values(ascending=False)
    ordered_columns = [cor.columns[0]] + top_col.index.tolist()
    return correlation_matrix[ordered_columns].reindex(ordered_columns)

Question 12

Đây là mã cải tiến từ @MiFi. Đây là một thứ tự trong abs nhưng không loại trừ các giá trị âm.

   def top_correlation (df,n):
    corr_matrix = df.corr()
    correlation = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
                 .stack()
                 .sort_values(ascending=False))
    correlation = pd.DataFrame(correlation).reset_index()
    correlation.columns=["Variable_1","Variable_2","Correlacion"]
    correlation = correlation.reindex(correlation.Correlacion.abs().sort_values(ascending=False).index).reset_index().drop(["index"],axis=1)
    return correlation.head(n)

top_correlation(ANYDATA,10)

Question 13

Hàm sau sẽ thực hiện thủ thuật. Triển khai này

Loại bỏ các tương quan tự
Loại bỏ các bản sao
Cho phép lựa chọn N tính năng tương quan cao nhất hàng đầu

và nó cũng có thể định cấu hình để bạn có thể giữ cả các tương quan tự cũng như các bản sao. Bạn cũng có thể báo cáo bao nhiêu cặp tính năng tùy thích.

def get_feature_correlation(df, top_n=None, corr_method='spearman',
                            remove_duplicates=True, remove_self_correlations=True):
    """
    Compute the feature correlation and sort feature pairs based on their correlation

    :param df: The dataframe with the predictor variables
    :type df: pandas.core.frame.DataFrame
    :param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
    :param corr_method: Correlation compuation method
    :type corr_method: str
    :param remove_duplicates: Indicates whether duplicate features must be removed
    :type remove_duplicates: bool
    :param remove_self_correlations: Indicates whether self correlations will be removed
    :type remove_self_correlations: bool

    :return: pandas.core.frame.DataFrame
    """
    corr_matrix_abs = df.corr(method=corr_method).abs()
    corr_matrix_abs_us = corr_matrix_abs.unstack()
    sorted_correlated_features = corr_matrix_abs_us \
        .sort_values(kind="quicksort", ascending=False) \
        .reset_index()

    # Remove comparisons of the same feature
    if remove_self_correlations:
        sorted_correlated_features = sorted_correlated_features[
            (sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
        ]

    # Remove duplicates
    if remove_duplicates:
        sorted_correlated_features = sorted_correlated_features.iloc[:-2:2]

    # Create meaningful names for the columns
    sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)']

    if top_n:
        return sorted_correlated_features[:top_n]

    return sorted_correlated_features

Question 14

Tôi thích bài đăng của Addison Klinke nhất, vì nó là bài đơn giản nhất, nhưng đã sử dụng gợi ý của Wojciech Moszczyńsk để lọc và lập biểu đồ, nhưng đã mở rộng bộ lọc để tránh các giá trị tuyệt đối, vì vậy, đưa ra một ma trận tương quan lớn, lọc nó, lập biểu đồ và sau đó làm phẳng nó:

Được tạo, lọc và lập biểu đồ

dfCorr = df.corr()
filteredDf = dfCorr[((dfCorr >= .5) | (dfCorr <= -.5)) & (dfCorr !=1.000)]
plt.figure(figsize=(30,10))
sn.heatmap(filteredDf, annot=True, cmap="Reds")
plt.show()

Chức năng

Cuối cùng, tôi đã tạo ra một hàm nhỏ để tạo ma trận tương quan, lọc nó và sau đó làm phẳng nó. Như một ý tưởng, nó có thể dễ dàng được mở rộng, ví dụ, giới hạn trên và dưới không đối xứng, v.v.

def corrFilter(x: pd.DataFrame, bound: float):
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr !=1.000)]
    xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
    return xFlattened

corrFilter(df, .7)

Liệt kê các cặp tương quan cao nhất từ ​​Ma trận tương quan lớn ở gấu trúc?

Liệt kê các cặp tương quan cao nhất từ Ma trận tương quan lớn ở gấu trúc?