Use scikit-learn to classify into multiple categories

Question 1

I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of every algorithm I have tried returns only one match.

For example, I have a piece of text:

"Theaters in New York compared to those in London"

And I have trained the algorithm to pick a place for every text snippet I feed it.

In the example above, I would want it to return New York and London, but it only returns New York.

Is it possible to use scikit-learn to return multiple results? Or even to return the label with the next-highest probability?

Thanks for your help.

--- Update

I tried using OneVsRestClassifier, but I still get only one option back per piece of text. Below is the sample code I am using:

y_train = ('New York','London')


train_set = ("new york nyc big apple", "london uk great britain")
vocab = {'new york' :0,'nyc':1,'big apple':2,'london' : 3, 'uk': 4, 'great britain' : 5}
count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2),vocabulary=vocab)
test_set = ('nice day in nyc','london town','hello welcome to the big apple. enjoy it here and london too')

X_vectorized = count.transform(train_set).todense()
smatrix2  = count.transform(test_set).todense()


base_clf = MultinomialNB(alpha=1)

clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
Y_pred = clf.predict(smatrix2)
print Y_pred

Result: ['New York' 'London' 'London']

Question 2

What you want is called multi-label classification. Scikit-learn can do that. See here: http://scikit-learn.org/dev/modules/multiclass.html .

I'm not sure what is going wrong in your example; my version of sklearn apparently doesn't have WordNGramAnalyzer. Perhaps it's a matter of using more training examples or trying a different classifier? Note, though, that the multi-label classifier expects the target to be a list of tuples or lists of labels.

The following works for me:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
# One list of label indices per document; the last two documents carry both labels.
y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[0,1],[0,1]]
X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

# Older releases accepted the list of lists directly; current releases
# need a binary indicator matrix, so the labels are binarized first.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)

classifier = Pipeline([
    # Unigrams and bigrams (older releases spelled this min_n=1, max_n=2).
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y_train)
# Map the predicted indicator rows back to tuples of label indices.
predicted = mlb.inverse_transform(classifier.predict(X_test))
for item, labels in zip(X_test, predicted):
    print('%s => %s' % (item, ', '.join(target_names[x] for x in labels)))

For me, this produces the output:

nice day in nyc => New York
welcome to london => London
hello welcome to new york. enjoy it here and london too => New York, London

Hope this helps.

Question 3

EDIT: Updated for Python 3 and scikit-learn 0.18.1, using MultiLabelBinarizer as suggested.

I've been working on this as well, and made a slight enhancement to mwv's excellent answer that may be useful. It takes text labels as input rather than binary labels and encodes them with MultiLabelBinarizer.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)

for item, labels in zip(X_test, all_labels):
    print('{0} => {1}'.format(item, ', '.join(labels)))

This gives me the following output:

nice day in nyc => new york
welcome to london => london
london is rainy => london
it is raining in britian => london
it is raining in britian and the big apple => new york
it is raining in britian and nyc => london, new york
hello welcome to new york. enjoy it here and london too => london, new york
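
The question also asked about getting the label with the next-highest score. Here is a minimal sketch, assuming a pipeline like the one above: `decision_function` returns one signed margin per label (LinearSVC does not expose probabilities), and sorting those margins ranks the labels. The tiny training set and the variable names are illustrative only.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative training data (not the full set from the answer above).
train_docs = ["new york is also called the big apple",
              "nyc is nice",
              "london is in the uk",
              "it rains a lot in london",
              "new york is great and so is london"]
train_labels = [["new york"], ["new york"], ["london"], ["london"],
                ["new york", "london"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_labels)

model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
model.fit(train_docs, Y)

# One margin per label; a larger value means a more confident "yes".
scores = model.decision_function(["nice day in nyc"])
ranking = np.argsort(scores[0])[::-1]          # best label first
ranked_labels = [mlb.classes_[i] for i in ranking]
print(ranked_labels)
```

The margins are not calibrated probabilities; if probabilities are needed, a base estimator with `predict_proba` (for example LogisticRegression) could be swapped in for LinearSVC.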

Question 4

I just ran into this as well, and the problem for me was that my y_train was a sequence of strings rather than a sequence of sequences of strings. Apparently, OneVsRestClassifier decides from the format of the input labels whether to do multi-class or multi-label classification. So change:

y_train = ('New York','London')

to

y_train = (['New York'],['London'])

Apparently this behavior will go away in the future, since it breaks when all the labels are the same: https://github.com/scikit-learn/scikit-learn/pull/1987
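
To see how a given label format is interpreted, the helper `sklearn.utils.multiclass.type_of_target` can be queried directly. This is only a sketch for current scikit-learn releases, where multi-label targets are passed as a binary indicator matrix; the sample arrays are made up for illustration.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# A flat sequence of strings is treated as single-label data.
two_flat_labels = ['New York', 'London']
three_flat_labels = ['New York', 'London', 'Paris']

# A 0/1 matrix with one column per label is treated as multi-label data.
indicator = np.array([[1, 0],
                      [0, 1],
                      [1, 1]])

kind_two = type_of_target(two_flat_labels)      # 'binary'
kind_three = type_of_target(three_flat_labels)  # 'multiclass'
kind_indicator = type_of_target(indicator)      # 'multilabel-indicator'
print(kind_two, kind_three, kind_indicator)
```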

Question 5

Change this line to make it work with newer versions of scikit-learn:

# lb = preprocessing.LabelBinarizer()
lb = preprocessing.MultiLabelBinarizer()
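
As a small sketch of what that swap does: MultiLabelBinarizer takes one collection of labels per sample and produces a 0/1 matrix with one column per label. The label lists below are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# One list of labels per sample; the last sample has two labels.
labels = [['new york'], ['london'], ['new york', 'london']]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

print(mlb.classes_)  # columns are the sorted labels: 'london', 'new york'
print(Y)
# [[0 1]
#  [1 0]
#  [1 1]]
```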

Question 6

Here are a few multi-class label encoding examples:

Example 1:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

arr2d = np.array([1, 2, 3,4,5,6,7,8,9,10,11,12,13,14,1])
transfomed_label = encoder.fit_transform(arr2d)
print(transfomed_label)

The output is

[[1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Example 2:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

arr2d = np.array(['Leopard','Lion','Tiger', 'Lion'])
transfomed_label = encoder.fit_transform(arr2d)
print(transfomed_label)

The output is

[[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 1 0]]
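
A short follow-up sketch on the same data: the fitted encoder keeps the column order in `classes_`, and `inverse_transform` maps the one-hot rows back to the original labels. Note that LabelBinarizer handles exactly one label per sample, so this is multi-class rather than multi-label encoding.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
onehot = encoder.fit_transform(np.array(['Leopard', 'Lion', 'Tiger', 'Lion']))

# Column i of the one-hot matrix corresponds to encoder.classes_[i].
print(encoder.classes_)

# Recover the original labels from the one-hot rows.
decoded = encoder.inverse_transform(onehot)
print(decoded)
```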