Why are all the Lasso coefficients in my model 0.0?

I am using `from sklearn.linear_model import Lasso` in Python 2.7.6.

I wrote a script that I use to run Lasso regression on my features (X) and targets (y). I have used it before and it worked. Now I am running it on a new dataset (a completely different type of data) and every coefficient comes back as 0.

What does that mean? Is there anything I can change or tune to get usable results?

I have tried different alpha parameters. My function is below. I use this class system to store my models and tools. Let me know if it is confusing or needs to be generalized; I think it is pretty straightforward. My notation is DF_ = DataFrame, D_ = Dictionary, SR_ = Series.

#Create the models and store them
import time
from collections import defaultdict

import pandas as pd #needed for the pd.isnull checks below
from sklearn.cross_validation import LeavePOut
from sklearn.linear_model import Lasso

class Models:
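    def __init__(self,target=None,description=None,models=None,duration=0.0):
        #Use a fresh list per instance; a mutable default ([]) would be shared by every instance
        self.target = target; self.models = models if models is not None else []; self.duration = duration; self.description = description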
    def summation(self):
        return(float(sum([q[1] for q in self.models])))
    def score(self):
        return(self.summation()/len(self.models)) 

def synthesis(description, query_targets, D_target_Models, DF_attributes, DF_targets,alpha = 1):
    """
    Updates Model object
    Parameters:
    [description] key for dictionary of D_target_Models that stores instances of class
    [query_targets] list of targets to make models for in DF_targets
    [D_target_Models] dictionary of dictionaries:
        Outer dict: {description:targets}; 
        Inner dict: {target:model_instance}
    [DF_attributes] Pandas DataFrame of attributes (index = sample, column = attribute)
    [DF_targets] Pandas DataFrame of targets (index = sample, column = targets)
    [alpha] lambda for regression method
    """
    lpo = LeavePOut(len(DF_attributes.index)/1000, p=2)
    #Check order of indices
    if (list(DF_attributes.index) == list(DF_targets.index)) == True:
#         X.index = Y.index = range(len(X.index))
        for target in query_targets:
            #Create target instance
            D_target_Models[description][target] = Models(target=target)
            #Get query column for target
            SR_target = DF_targets[target]

            #Create and train models
            models = []
            for train_indices,test_indices in lpo:
                #Check if all test sets have values
                #NOTE: These conditions aren't essential for understanding the script. They are how I ensured there were no NAs.
                condition_1 = all([(pd.isnull(SR_target.iloc[test_i]) == False) for test_i in test_indices])
                #Check every test row, rather than relying on test_i leaking out of the list comprehension above
                condition_2 = DF_attributes.iloc[test_indices].isnull().values.any() == False
                condition_3 = None #Impute missing data on DF_attributes
                conditions = [condition_1,condition_2]

                if all(conditions) == True: #Assumes data is present for all features
                    #Create model
                    duration_start = time.time() #So I can time the modeling, not essential
                    model = Lasso(alpha=alpha)

                    #Update training indices with non-null target/sensitivity indices
                    train_indices = [train_i for train_i in train_indices if pd.isnull(SR_target.iloc[train_i]) == False]
                    #Assign X and y
                    train_X = DF_attributes.iloc[train_indices,:]
                    test_X = DF_attributes.iloc[test_indices,:]
                    train_y = SR_target.iloc[train_indices]
                    test_y = SR_target.iloc[test_indices]
                    #Fit model
                    model.fit(train_X,train_y)

                    #Predict
                    predicted_values = model.predict(test_X)
                    correct_values = test_y
                    accuracy = int((predicted_values[0] > predicted_values[1]) == (correct_values[0] > correct_values[1]))
                    if accuracy == 1:
                        if len(set(model.coef_)) > 1:
                            print(set(model.coef_)) #ALL COEFFICIENTS ARE 0.0
                    #Store models
                    models.append((model,accuracy,test_indices))

                #Store time for models
            D_target_Models[description][target].models = models
            D_target_Models[description][target].duration = float(time.time() - duration_start)
        return(D_target_Models)
    else:
        return("DF_attributes.index != DF_target.index")

— O.rka

Could it be that your variables simply aren't strongly related to the response? (Note that if you want someone to read through your code and look for problems, that would be off-topic here, but you could try Code Review.)

— gung - Reinstate Monica

Is that what a coefficient of 0 for every attribute means? I ran it with LassoCV instead of Lasso and got nonzero coefficients. So does that mean my alpha is the problem?

— O.rka

That could be it.

— gung - Reinstate Monica

The key fact about LASSO regression here is that it minimizes the sum of squared errors subject to the constraint that the sum of the absolute values of the coefficients is less than some constant $c$. (See here.) So, for all the coefficients to be zero, there must be no coefficient vector with summed absolute value less than $c$ that improves the error.

For another perspective, consider the LASSO loss function:

$\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$

As stated in the tutorial referenced above, "If $\lambda$ is sufficiently large, some of the coefficients are driven to zero, leading to a sparse model." For the all-zero coefficient vector to minimize this function, $\lambda$ must be large enough that any improvement in the error (the left-hand term) is smaller than the added loss from the increased norm (the right-hand term).
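One way to make "large enough" concrete is the subgradient condition at $\beta = 0$. This is a sketch that assumes no intercept term (or centered data) and the squared-error form of the loss, $L(\beta) = \sum_{i=1}^{n} (Y_i - X_i^T\beta)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$. The gradient of the squared-error term at zero has components $-2\sum_i X_{ij} Y_i$, and the penalty contributes a subgradient anywhere in $[-\lambda, \lambda]$, so the zero vector minimizes $L$ exactly when the two can cancel in every coordinate:

$$\left| 2\sum_{i=1}^{n} X_{ij} Y_i \right| \le \lambda \ \text{ for all } j \quad\Longleftrightarrow\quad \lambda \ge 2\max_{j}\left|\sum_{i=1}^{n} X_{ij} Y_i\right|$$

Any $\lambda$ at or above that threshold yields all-zero coefficients. Note that scikit-learn scales the squared-error term by $1/(2n)$, so for its `alpha` the corresponding threshold is $\max_j \left|\sum_i X_{ij} Y_i\right| / n$.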

It is common to use cross-validation to set this parameter so that the model minimizes CV error. This may be why LassoCV gives you different results: it probably chose $\lambda$ (the `alpha` argument in scikit-learn) for you.
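To see this effect in isolation, here is a minimal sketch on synthetic data; it is not the asker's dataset or pipeline, and the variable names and data-generating setup are assumptions made for illustration. With scikit-learn's scaling of the objective, the smallest `alpha` that zeroes every coefficient is the largest absolute value of `X.T.dot(y) / n_samples`, computed on centered data. The sketch fits `Lasso` just above and well below that value, then lets `LassoCV` pick `alpha` by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

# Synthetic data (an assumption for illustration): 3 informative features out of 20.
rng = np.random.RandomState(0)
n_samples, n_features = 100, 20
X = rng.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:3] = [0.5, -0.3, 0.2]
y = X.dot(true_coef) + 0.1 * rng.randn(n_samples)

# Smallest alpha for which every Lasso coefficient is zero.
# Lasso fits an intercept by default, so center X and y first.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = np.abs(Xc.T.dot(yc)).max() / n_samples

# Just above the threshold: the penalty outweighs any gain in fit.
too_large = Lasso(alpha=1.01 * alpha_max).fit(X, y)

# Well below the threshold: some coefficients become nonzero.
smaller = Lasso(alpha=0.1 * alpha_max).fit(X, y)

# Let cross-validation choose alpha from a grid.
cv_model = LassoCV(cv=5).fit(X, y)

print("alpha_max: %.4f" % alpha_max)
print("nonzero at 1.01 * alpha_max: %d" % np.sum(too_large.coef_ != 0))
print("nonzero at 0.10 * alpha_max: %d" % np.sum(smaller.coef_ != 0))
print("LassoCV alpha: %.4f, nonzero: %d"
      % (cv_model.alpha_, np.sum(cv_model.coef_ != 0)))
```

If a fixed `alpha` such as the default of 1 sits above `alpha_max` for a new dataset, all-zero coefficients are the expected outcome; comparing the two numbers is a quick check before concluding anything about the features themselves.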

— Sean Easter