Python: tf-idf-cosine: to find document similarity

Question 1

I was following a tutorial that is available as Part 1 and Part 2. Unfortunately, the author didn't have time for the final section, which involved using cosine similarity to actually find the distance between two documents. I followed the examples in the article with the help of the following link from stackoverflow; the code mentioned in that link is included here (just to make life easier):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray

transformer.fit(trainVectorizerArray)
print
print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)
print 
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()

As a result of the above code, I have the following matrices:

Fit Vectorizer to train set [[1 0 1 0]
 [0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.70710678  0.          0.70710678]]

[[ 0.          0.57735027  0.57735027  0.57735027]]

I am not sure how to use this output to calculate cosine similarity. I know how to compute cosine similarity for two vectors of the same length, but here I am not sure how to identify the two vectors.
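For reference, the two TF-IDF matrices printed above can be reproduced by hand. The sketch below is a NumPy-only illustration, assuming TfidfTransformer's default settings (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The helper name tfidf_l2 is hypothetical and not part of scikit-learn.

```python
import numpy as np

def tfidf_l2(counts, fit_counts):
    """TF-IDF with smoothed IDF and row-wise L2 normalization (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    fit_counts = np.asarray(fit_counts, dtype=float)
    n_docs = fit_counts.shape[0]
    df = (fit_counts > 0).sum(axis=0)                  # document frequency per term
    idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0    # assumed smoothed IDF formula
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                            # leave all-zero rows untouched
    return weighted / norms

train_counts = [[1, 0, 1, 0], [0, 1, 0, 1]]
test_counts = [[0, 1, 1, 1]]

train_tfidf = tfidf_l2(train_counts, train_counts)
# The code above refits the transformer on the query, so the same is done here.
test_tfidf = tfidf_l2(test_counts, test_counts)
print(train_tfidf)
print(test_tfidf)
```

Each row of the result is one document vector, so the "two vectors" to compare are a row of the train matrix and the single row of the test matrix.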

Question 2

First off, if you want to extract count features and apply TF-IDF normalization and row-wise Euclidean normalization, you can do it in one operation with TfidfVectorizer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()

>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
>>> tfidf
<11314x130088 sparse matrix of type '<type 'numpy.float64'>'
    with 1787553 stored elements in Compressed Sparse Row format>

Now, to find the cosine similarities between one document (e.g. the first in the dataset) and all of the others, you just need to compute the dot products of the first vector with all of the others, because the tfidf vectors are already row-normalized.

As explained by Chris Clark in the comments and here, cosine similarity does not take the magnitude of the vectors into account. Row-normalized vectors have a magnitude of 1, so the linear kernel is sufficient to calculate the similarity values.
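A small numerical check of that claim, as a sketch using made-up numbers and plain NumPy rather than scikit-learn: once every row has length 1, the dot product of two rows equals the cosine similarity of the original rows.

```python
import numpy as np

# Made-up, non-negative rows standing in for un-normalized tf-idf vectors.
X = np.array([[1.0, 2.0, 0.0, 0.0],
              [0.0, 1.0, 3.0, 0.0],
              [2.0, 0.0, 0.0, 1.0]])

# L2-normalize each row so that it has magnitude 1.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Linear kernel: plain dot products between the first row and every row.
dot_products = X_unit[0:1] @ X_unit.T

# Cosine similarity computed from the un-normalized rows, for comparison.
norms = np.linalg.norm(X, axis=1)
cosine = (X[0:1] @ X.T) / (norms[0] * norms)

print(np.allclose(dot_products, cosine))
```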

The scipy sparse matrix API is a bit weird (not as flexible as dense N-dimensional numpy arrays). To get the first vector, you need to slice the matrix row-wise to get a submatrix with a single row:

>>> tfidf[0:1]
<1x130088 sparse matrix of type '<type 'numpy.float64'>'
    with 89 stored elements in Compressed Sparse Row format>

scikit-learn already provides pairwise metrics (a.k.a. kernels in machine learning parlance) that work for both dense and sparse representations of vector collections. In this case we need a dot product, which is also known as the linear kernel:

>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
    0.04457106,  0.03293218])

Hence, to find the top related documents, we can use argsort and some negative array slicing (the most related documents have the highest cosine similarity values, so they sit at the end of the sorted indices array). Note that the slice [:-5:-1] used here returns the four highest-scoring indices; [:-6:-1] would return five:

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([    0,   958, 10576,  3277])
>>> cosine_similarities[related_docs_indices]
array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])
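The slicing arithmetic is easy to get wrong, so here is a minimal sketch with made-up similarity scores. The helper top_k is a hypothetical name introduced only for illustration.

```python
import numpy as np

# Made-up similarity scores; index 0 plays the role of the query itself.
scores = np.array([1.0, 0.04, 0.11, 0.55, 0.33, 0.28, 0.02])

order = scores.argsort()   # ascending: least similar first
top4 = order[:-5:-1]       # walks back from the end, stops before position -5: 4 items
top5 = order[:-6:-1]       # 5 items

def top_k(scores, k):
    """Indices of the k largest scores, most similar first (hypothetical helper)."""
    return np.argsort(scores)[::-1][:k]

print(top4)
print(top5)
print(top_k(scores, 5))
```

Reversing first and then taking [:k] keeps the requested count explicit, which avoids the off-by-one in the negative-step slice.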

The first result is a sanity check: we find the query document as the most similar document, with a cosine similarity score of 1. It has the following text:

>>> print twenty.data[0]
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----

The second most similar document is a reply that quotes the original message and hence has many words in common with it:

>>> print twenty.data[958]
From: rseymour@reed.edu (Robert Seymour)
Subject: Re: WHAT car is this!?
Article-I.D.: reed.1993Apr21.032905.29286
Reply-To: rseymour@reed.edu
Organization: Reed College, Portland, OR
Lines: 26

In article <1993Apr20.174246.14375@wam.umd.edu> lerxst@wam.umd.edu (where's my
thing) writes:
>
>  I was wondering if anyone out there could enlighten me on this car I saw
> the other day. It was a 2-door sports car, looked to be from the late 60s/
> early 70s. It was called a Bricklin. The doors were really small. In
addition,
> the front bumper was separate from the rest of the body. This is
> all I know. If anyone can tellme a model name, engine specs, years
> of production, where this car is made, history, or whatever info you
> have on this funky looking car, please e-mail.

Bricklins were manufactured in the 70s with engines from Ford. They are rather
odd looking with the encased front bumper. There aren't a lot of them around,
but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a
performance Ford with new styling slapped on top.

>    ---- brought to you by your neighborhood Lerxst ----

Rush fan?

--
Robert Seymour              rseymour@reed.edu
Physics and Philosophy, Reed College    (NeXTmail accepted)
Artificial Life Project         Reed College
Reed Solar Energy Project (SolTrain)    Portland, OR

Question 3

With the help of @excray's comment, I managed to figure out the answer. What we need to do is write a simple for loop that iterates over the two arrays representing the train data and the test data.

First, implement a simple lambda function to hold the formula for the cosine calculation:

cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

Then just write a simple for loop to iterate over the vectors. The logic is: "For each vector in trainVectorizerArray, find the cosine similarity with the vector in testVectorizerArray."

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray
cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

for vector in trainVectorizerArray:
    print vector
    for testV in testVectorizerArray:
        print testV
        cosine = cx(vector, testV)
        print cosine

transformer.fit(trainVectorizerArray)
print
print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)
print 
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()

Here is the output:

Fit Vectorizer to train set [[1 0 1 0]
 [0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]
[1 0 1 0]
[0 1 1 1]
0.408
[0 1 0 1]
[0 1 1 1]
0.816

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.70710678  0.          0.70710678]]

[[ 0.          0.57735027  0.57735027  0.57735027]]
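The nested loop above can also be written as a single matrix operation. This is a sketch using only NumPy and the count vectors printed in the output; cosine_matrix is a hypothetical helper name, and it assumes no row is all zeros.

```python
import numpy as np

train = np.array([[1, 0, 1, 0], [0, 1, 0, 1]], dtype=float)
test = np.array([[0, 1, 1, 1]], dtype=float)

def cosine_matrix(a, b):
    """Cosine similarity between every row of a and every row of b.

    Assumes no row is all zeros (that would divide by zero).
    """
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

similarities = cosine_matrix(train, test)   # shape (2, 1): one row per train vector
print(np.round(similarities, 3))
```

The two values match the 0.408 and 0.816 printed by the loop.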

Question 4

I know this is an old post, but I tried the http://scikit-learn.sourceforge.net/stable/ package. The question was how to calculate the cosine similarity with this package, and here is my code for that:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

f = open("/root/Myfolder/scoringDocuments/doc1")
doc1 = str.decode(f.read(), "UTF-8", "ignore")
f = open("/root/Myfolder/scoringDocuments/doc2")
doc2 = str.decode(f.read(), "UTF-8", "ignore")
f = open("/root/Myfolder/scoringDocuments/doc3")
doc3 = str.decode(f.read(), "UTF-8", "ignore")

train_set = ["president of India",doc1, doc2, doc3]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)  #finds the tfidf score with normalization
print "cosine scores ==> ",cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train)  #here the first element of tfidf_matrix_train is matched with other three elements

Here, suppose the query is the first element of train_set, and doc1, doc2 and doc3 are the documents I want to rank with the help of cosine similarity. Then I can use this code.

Also, the tutorials provided in the question were very useful. Here are all the parts: part-I, part-II, part-III

The output will be as follows:

[[ 1.          0.07102631  0.02731343  0.06348799]]

Here 1 indicates that the query is matched with itself, and the other three values are the scores for matching the query against the respective documents.

Question 5

Let me give you another tutorial, written by me. It answers your question, but also explains why we are doing some of the things. I also tried to keep it concise.

So you have list_of_documents, which is just an array of strings, and another string, document. You need to find the document in list_of_documents that is most similar to document.

Let's combine them together: documents = list_of_documents + [document]

Let's start with the dependencies. It will become clear why we use each of them.

from nltk.corpus import stopwords
import string
from nltk.tokenize import wordpunct_tokenize as tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine

One approach that can be used is the bag-of-words approach, where we treat each word in the document independently of the others and just throw all of them together into one big bag. From one point of view, it loses a lot of information (such as how the words are connected), but from another point of view it makes the model simple.

In English, and in any other human language, there are a lot of "useless" words like 'a', 'the', 'in' which are so common that they do not carry much meaning. They are called stop words and it is a good idea to remove them. Another thing one can notice is that words like 'analyze', 'analyzer', 'analysis' are really similar. They have a common root and can all be converted to just one word. This process is called stemming, and there are different stemmers which differ in speed, aggressiveness and so on. So we transform each document into a list of word stems without stop words. We also discard all the punctuation.

porter = PorterStemmer()
stop_words = set(stopwords.words('english'))

modified_arr = [[porter.stem(i.lower()) for i in tokenize(d.translate(None, string.punctuation)) if i.lower() not in stop_words] for d in documents]

So how will this bag of words help us? Imagine we have 3 bags: [a, b, c], [a, c, a] and [b, c, d]. We can convert them to vectors in the basis [a, b, c, d]. So we end up with the vectors [1, 1, 1, 0], [2, 0, 1, 0] and [0, 1, 1, 1]. The same goes for our documents (only the vectors will be much longer). Now we see why we removed a lot of words and stemmed the others: to decrease the dimensions of the vectors. There is one more interesting observation. Longer documents will have many more positive elements than shorter ones, which is why it is nice to normalize the vectors. This is called term frequency (TF). People also use additional information about how often the word is used in other documents: inverse document frequency (IDF). Together we have the TF-IDF metric, which comes in a couple of flavors. This can be achieved with one line in sklearn :-)

modified_doc = [' '.join(i) for i in modified_arr] # this is only to convert our list of lists to list of strings that vectorizer uses.
tf_idf = TfidfVectorizer().fit_transform(modified_doc)

Actually, the vectorizer allows you to do a lot of things, like removing stop words and lowercasing. I have done them in a separate step only because sklearn does not have non-English stop words, but nltk does.
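The conversion of the three example bags into count vectors described earlier can be sketched in a few lines of plain Python. The helper to_vector is a hypothetical name, and the basis is assumed to be the sorted set of all words seen.

```python
from collections import Counter

bags = [["a", "b", "c"], ["a", "c", "a"], ["b", "c", "d"]]

# The basis is the sorted set of all distinct words seen in any bag.
basis = sorted({word for bag in bags for word in bag})

def to_vector(bag, basis):
    """Count how often each basis word occurs in the bag (hypothetical helper)."""
    counts = Counter(bag)
    return [counts[word] for word in basis]   # Counter returns 0 for missing words

vectors = [to_vector(bag, basis) for bag in bags]
print(basis)
print(vectors)
```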

So we have all the vectors calculated. The last step is to find which one is the most similar to the last one. There are various ways to achieve that. One of them is Euclidean distance, which is not so great for the reason discussed here. Another approach is cosine similarity. We iterate over all the documents and calculate the cosine distance (1 minus the cosine similarity, which is what scipy's cosine returns) between each document and the last one, keeping the smallest:

l = len(documents) - 1  # index of the query document, which was appended last
minimum = (1, None)     # (cosine distance, index) of the best match so far
for i in range(l):
    # scipy's cosine() expects 1-D dense arrays
    distance = cosine(tf_idf[i].toarray().ravel(), tf_idf[l].toarray().ravel())
    if distance < minimum[0]:
        minimum = (distance, i)
print(minimum)

Now minimum holds the cosine distance of the best-matching document and its index.
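One commonly cited drawback of Euclidean distance on raw count vectors is its sensitivity to document length. The sketch below illustrates that with made-up vectors, and also shows the relation between cosine similarity and the cosine distance (1 minus the similarity) returned by scipy.spatial.distance.cosine. The helper cosine_similarity_1d is a hypothetical name, written with NumPy only.

```python
import numpy as np

def cosine_similarity_1d(u, v):
    """Cosine similarity of two 1-D vectors (hypothetical helper)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

short_doc = np.array([1.0, 1.0, 1.0, 0.0])   # counts for the bag [a, b, c]
long_doc = 2 * short_doc                     # the same text repeated twice
other_doc = np.array([0.0, 1.0, 1.0, 1.0])   # counts for the bag [b, c, d]

# Euclidean distance ranks the different document as closer than the repeated one.
euclid_long = np.linalg.norm(short_doc - long_doc)
euclid_other = np.linalg.norm(short_doc - other_doc)

# Cosine similarity ignores magnitude, so the repeated text is a perfect match.
cos_long = cosine_similarity_1d(short_doc, long_doc)
cos_other = cosine_similarity_1d(short_doc, other_doc)

# Cosine distance is 1 minus the similarity: smaller means more similar.
cosine_distance_other = 1.0 - cos_other

print(euclid_long, euclid_other)
print(cos_long, cos_other, cosine_distance_other)
```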

Question 6

This should help you.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# train_set is assumed to be a list of document strings;
# the last document is compared against all of them
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(train_set)
print(tfidf_matrix)
length = tfidf_matrix.shape[0]  # number of documents
cosine = cosine_similarity(tfidf_matrix[length-1], tfidf_matrix)
print(cosine)

and the output will be:

[[ 0.34949812  0.81649658  1.        ]]

Question 7

Here is a function that compares your test data against the training data, with the Tf-Idf transformer fitted on the training data. The advantage is that you can quickly pivot or group by to find the n closest elements, and that the calculations are done matrix-wise.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def create_tokenizer_score(new_series, train_series, tokenizer):
    """
    return the tf idf score of each possible pairs of documents
    Args:
        new_series (pd.Series): new data (To compare against train data)
        train_series (pd.Series): train data (To fit the tf-idf transformer)
    Returns:
        pd.DataFrame
    """

    train_tfidf = tokenizer.fit_transform(train_series)
    new_tfidf = tokenizer.transform(new_series)
    X = pd.DataFrame(cosine_similarity(new_tfidf, train_tfidf), columns=train_series.index)
    X['ix_new'] = new_series.index
    score = pd.melt(
        X,
        id_vars='ix_new',
        var_name='ix_train',
        value_name='score'
    )
    return score

train_set = pd.Series(["The sky is blue.", "The sun is bright."])
test_set = pd.Series(["The sun in the sky is bright."])
tokenizer = TfidfVectorizer() # initiate here your own tokenizer (TfidfVectorizer, CountVectorizer, with stopwords...)
score = create_tokenizer_score(train_series=train_set, new_series=test_set, tokenizer=tokenizer)
score

   ix_new   ix_train    score
0   0       0       0.617034
1   0       1       0.862012