Save / load scipy sparse csr_matrix in portable data format

Question 1

How do you save/load a scipy sparse csr_matrix in a portable format? The scipy sparse matrix is created on Python 3 (Windows 64-bit) to run on Python 2 (Linux 64-bit). Initially, I used pickle (with protocol=2 and fix_imports=True), but this didn't work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit) and gave the error:

TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).

Next, I tried numpy.save and numpy.load as well as scipy.io.mmwrite() and scipy.io.mmread(), and none of these methods worked either.

Question 2

Edit: SciPy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.

from scipy import sparse

sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument may also be a file-like object (i.e. the result of open) instead of a filename.
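
To illustrate the file-like variant, here is a minimal sketch that round-trips a matrix through an in-memory io.BytesIO buffer rather than a named file. The example matrix and the variable names are made up for illustration; an object returned by open(..., 'wb') / open(..., 'rb') should be usable in the same way.

```python
import io

import numpy as np
from scipy import sparse

# Small example matrix (hypothetical data, for illustration only)
your_matrix = sparse.csr_matrix(np.array([[0, 0, 3], [4, 0, 0]]))

# Write to an in-memory binary buffer instead of a named file
buffer = io.BytesIO()
sparse.save_npz(buffer, your_matrix)

# Rewind before reading the same buffer back
buffer.seek(0)
your_matrix_back = sparse.load_npz(buffer)

# True when the two matrices differ in no position
same = (your_matrix != your_matrix_back).nnz == 0
```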

Got an answer from the Scipy user group:

A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with:
new_csr = csr_matrix((data, indices, indptr), shape=(M, N))

So for example:

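import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # np.savez appends '.npz' to filename if it is not already there
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # pass the full file name here, including the '.npz' extension
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])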
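
For readers unfamiliar with the CSR layout, the following sketch shows what the three arrays contain for a tiny matrix and rebuilds the matrix from them. The matrix is an arbitrary example chosen so that one row is empty.

```python
import numpy as np
from scipy.sparse import csr_matrix

# 3x3 example with an empty first row
m = csr_matrix(np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]]))

data, indices, indptr = m.data, m.indices, m.indptr
# data    -> the stored values, row by row
# indices -> the column index of each stored value
# indptr  -> row i owns the slice indptr[i]:indptr[i + 1] of data and indices

# Rebuild the matrix from the three arrays plus the shape
new_csr = csr_matrix((data, indices, indptr), shape=m.shape)
```

Here data is [1, 1], indices is [0, 1] and indptr is [0, 0, 1, 2]; the repeated 0 at the start of indptr is how the empty first row is encoded.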

Question 3

Though you write that scipy.io.mmwrite and scipy.io.mmread don't work for you, I just want to add how they work. This question is the No. 1 Google hit, so I myself started with np.savez and pickle.dump before switching to the simple and obvious scipy functions. They work for me and shouldn't be overlooked by those who haven't tried them yet.

from scipy import sparse, io

m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
m              # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

io.mmwrite("test.mtx", m)
del m

newm = io.mmread("test.mtx")
newm           # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
newm.tocsr()   # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)
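
As a self-checking variant of the session above, this sketch writes the matrix to a temporary directory, reads it back, and converts the COO result to CSR again. The temporary path and the variable names are only illustrative.

```python
import os
import tempfile

from scipy import io, sparse

m = sparse.csr_matrix([[0, 0, 0], [1, 0, 0], [0, 1, 0]])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test.mtx")
    io.mmwrite(path, m)
    loaded = io.mmread(path)  # comes back in COO format

# Convert back to CSR and check that no entry changed
restored = loaded.tocsr()
unchanged = (restored != m).nnz == 0
```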

Question 4

Here is a performance comparison of the three most upvoted answers, run in a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:

from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

`io.mmwrite` / `io.mmread`

from scipy import io

%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s

%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>    

Filesize: 3.0G.

(Note that the format has been changed from csr to coo.)

`np.savez` / `np.load`

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])


%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s    

%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

`cPickle`

import cPickle as pickle

def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)
def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)    
    return matrix    

%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s    

%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

Note: cPickle does not work with very large objects (see this answer). In my experience, it didn't work for a 2.7M x 50k matrix with 270M non-zero values. The np.savez solution worked well.

Conclusion

(Based on this simple test for CSR matrices.) cPickle is the fastest method, but it doesn't work with very large matrices; np.savez is only slightly slower; io.mmwrite is much slower, produces a bigger file, and restores the matrix in the wrong format. So np.savez is the winner here.
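
To repeat the comparison outside a notebook, a scaled-down sketch along these lines may help. It is not the benchmark used above: the matrix is far smaller (2000 x 1000) so it finishes quickly, the timed helper and the file names are made up, and the numbers it prints will not match the figures reported in this answer.

```python
import os
import tempfile
import time

import numpy as np
from scipy import io
from scipy.sparse import csr_matrix, random


def timed(func, *args):
    # Return the function result and the elapsed wall time in seconds
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def save_sparse_csr(path, array):
    np.savez(path, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)


def load_sparse_csr(path):
    with np.load(path) as loader:
        return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                          shape=tuple(loader['shape']))


# Much smaller than the 1M x 100K matrix used in the answer
matrix = random(2000, 1000, density=0.001, format='csr', random_state=0)

results = {}
with tempfile.TemporaryDirectory() as tmpdir:
    mtx_path = os.path.join(tmpdir, 'test_io.mtx')
    npz_path = os.path.join(tmpdir, 'test_savez.npz')

    _, write_s = timed(io.mmwrite, mtx_path, matrix)
    loaded_mtx, read_s = timed(io.mmread, mtx_path)
    results['io.mmwrite / io.mmread'] = (write_s, read_s, os.path.getsize(mtx_path))

    _, write_s = timed(save_sparse_csr, npz_path, matrix)
    loaded_npz, read_s = timed(load_sparse_csr, npz_path)
    results['np.savez / np.load'] = (write_s, read_s, os.path.getsize(npz_path))

for name, (write_s, read_s, size) in results.items():
    print('%s: write %.4f s, read %.4f s, %d bytes' % (name, write_s, read_s, size))
```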

Question 5

You can now use scipy.sparse.save_npz: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html

Question 6

Assuming you have scipy on both machines, you can just use pickle.

However, be sure to specify a binary protocol when pickling NumPy arrays. Otherwise you'll wind up with a huge file.

At any rate, you should be able to do this:

import cPickle as pickle
import numpy as np
import scipy.sparse

# Just for testing, let's make a dense array and convert it to a csr_matrix
x = np.random.random((10,10))
x = scipy.sparse.csr_matrix(x)

with open('test_sparse_array.dat', 'wb') as outfile:
    pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)

Then you can load it with:

import cPickle as pickle

with open('test_sparse_array.dat', 'rb') as infile:
    x = pickle.load(infile)
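
The snippets above import cPickle, which exists only on Python 2. A minimal sketch that runs on either major version could fall back to the standard pickle module and pin protocol 2, the highest protocol Python 2 can read. This is an assumption-laden illustration, not a guarantee of cross-version loading: the question above reports that a protocol 2 pickle written on Python 3 failed to load on Python 2.

```python
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle             # Python 3

import os
import shutil
import tempfile

import numpy as np
import scipy.sparse

# Just for testing, make a dense array and convert it to a csr_matrix
x = scipy.sparse.csr_matrix(np.random.random((10, 10)))

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'test_sparse_array.dat')
try:
    with open(path, 'wb') as outfile:
        # Protocol 2 is binary and is the newest protocol Python 2 understands
        pickle.dump(x, outfile, 2)
    with open(path, 'rb') as infile:
        x_back = pickle.load(infile)
finally:
    shutil.rmtree(tmpdir)

# True when the reloaded matrix matches the original everywhere
same = (x != x_back).nnz == 0
```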

Question 7

As of scipy 0.19.0, you can save and load sparse matrices this way:

from scipy import sparse

data = sparse.csr_matrix((3, 4))

#Save
sparse.save_npz('data_sparse.npz', data)

#Load
data = sparse.load_npz("data_sparse.npz")

Question 8

EDIT: Apparently it is simple enough to:

def sparse_matrix_tuples(m):
    yield from m.todok().items()

This yields ((i, j), value) tuples, which are easy to serialize and deserialize. I'm not sure how it compares performance-wise with the code below for csr_matrix, but it's definitely simpler. I'm leaving the original answer below as I hope it's informative.
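
As a sketch of that serialize/deserialize round trip, the tuples can be written as "i,j,value" CSV rows and turned back into a matrix through coo_matrix. The in-memory buffer and the variable names are illustrative; note that the tuples do not carry the matrix shape, so it has to be kept separately.

```python
import csv
import io

from scipy import sparse


def sparse_matrix_tuples(m):
    # Same generator as above: yields ((i, j), value) pairs
    yield from m.todok().items()


m = sparse.csr_matrix([[0, 0, 0], [1, 0, 0], [0, 2, 0]])

# Serialize: one "i,j,value" row per stored element
buffer = io.StringIO()
writer = csv.writer(buffer)
for (i, j), value in sparse_matrix_tuples(m):
    writer.writerow([int(i), int(j), float(value)])

# Deserialize: collect the triples and rebuild through COO
buffer.seek(0)
rows, cols, vals = [], [], []
for i, j, value in csv.reader(buffer):
    rows.append(int(i))
    cols.append(int(j))
    vals.append(float(value))

# The shape is not stored in the tuples, so it is passed in explicitly
restored = sparse.coo_matrix((vals, (rows, cols)), shape=m.shape).tocsr()
```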

Adding my two cents: for me, npz is not portable, as I cannot use it to export my matrix easily to non-Python clients (e.g. PostgreSQL; glad to be corrected). So I would have liked to get CSV output for the sparse matrix (much like what you get when you print() the sparse matrix). How to achieve this depends on the representation of the sparse matrix. For a CSR matrix, the following code produces CSV output. You can adapt it for other representations.

import numpy as np

def csr_matrix_tuples(m):
    # not using unique will lag on empty elements
    uindptr = np.unique(m.indptr)
    # a segment starting at offset s belongs to the last row whose indptr entry equals s
    # (earlier rows sharing that offset are empty)
    rows = np.searchsorted(m.indptr, uindptr[:-1], side='right') - 1
    for i, start_index, end_index in zip(rows, uindptr[:-1], uindptr[1:]):
        for j, data in zip(m.indices[start_index:end_index], m.data[start_index:end_index]):
            yield (i, j, data)

for i, j, data in csr_matrix_tuples(my_csr_matrix):
    print(i, j, data, sep=',')

It's about 2 times slower than save_npz in the current implementation, from what I've tested.

Question 9

Here is what I used to save a lil_matrix.

import numpy as np
from scipy.sparse import lil_matrix

def save_sparse_lil(filename, array):
    # use np.savez_compressed(..) for compression
    np.savez(filename, dtype=array.dtype.str, data=array.data,
        rows=array.rows, shape=array.shape)

def load_sparse_lil(filename):
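    # data and rows are object arrays, which recent NumPy only loads with allow_pickle=True
    loader = np.load(filename, allow_pickle=True)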
    result = lil_matrix(tuple(loader["shape"]), dtype=str(loader["dtype"]))
    result.data = loader["data"]
    result.rows = loader["rows"]
    return result

I must say I found NumPy's np.load(..) to be very slow. This is my current solution, which I feel runs much faster:

from scipy.sparse import lil_matrix
import numpy as np
import json

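def lil_matrix_to_dict(myarray):
    # .tolist() turns the object arrays into nested lists that json can serialize
    # (assumes the stored values are plain Python numbers)
    result = {
        "dtype": myarray.dtype.str,
        "shape": myarray.shape,
        "data":  myarray.data.tolist(),
        "rows":  myarray.rows.tolist()
    }
    return result

def lil_matrix_from_dict(mydict):
    result = lil_matrix(tuple(mydict["shape"]), dtype=mydict["dtype"])
    # fill the per-row lists one by one so rows of different lengths stay lists
    for i, (data, rows) in enumerate(zip(mydict["data"], mydict["rows"])):
        result.data[i] = data
        result.rows[i] = rows
    return result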

def load_lil_matrix(filename):
    result = None
    with open(filename, "r", encoding="utf-8") as infile:
        mydict = json.load(infile)
        result = lil_matrix_from_dict(mydict)
    return result

def save_lil_matrix(filename, myarray):
    with open(filename, "w", encoding="utf-8") as outfile:
        mydict = lil_matrix_to_dict(myarray)
        json.dump(mydict, outfile)

Question 10

This works for me:

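import numpy as np
import scipy.sparse as sp
x = sp.csr_matrix([1,2,3])
y = sp.csr_matrix([2,3,4])
file = 'matrices.npz'  # any path or file object; this name is only an example
np.savez(file, x=x, y=y)
# the matrices are stored as pickled object arrays, so recent NumPy needs allow_pickle=True
npz = np.load(file, allow_pickle=True)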

>>> npz['x'].tolist()
<1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

>>> npz['x'].tolist().toarray()
array([[1, 2, 3]], dtype=int64)

The trick is to call .tolist() to convert the 0-dimensional object array back into the original object.

Question 11

I was asked to send the matrix in a simple and generic format:

<x,y,value>

I ended up with this:

import numpy as np

def save_sparse_matrix(m, filename):
    nonZeros = np.array(m.nonzero())
    # the with block closes the file even if a write fails
    with open(filename, 'w') as thefile:
        for entry in range(nonZeros.shape[1]):
            x, y = nonZeros[0, entry], nonZeros[1, entry]
            thefile.write("%s,%s,%s\n" % (x, y, m[x, y]))
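
For completeness, here is a minimal sketch of reading such an x,y,value file back into a sparse matrix. The function names, the COO-based writer, and the temporary file are assumptions made for this example. Because the text format does not record the matrix dimensions, the loader takes the shape as an argument, and it assumes the file contains at least one line.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def save_sparse_triples(m, filename):
    # Iterating over the COO arrays avoids indexing m once per element
    coo = m.tocoo()
    with open(filename, 'w') as thefile:
        for x, y, value in zip(coo.row, coo.col, coo.data):
            thefile.write("%s,%s,%s\n" % (x, y, value))


def load_sparse_triples(filename, shape):
    # ndmin=2 keeps a single-line file two-dimensional
    triples = np.loadtxt(filename, delimiter=',', ndmin=2)
    rows = triples[:, 0].astype(int)
    cols = triples[:, 1].astype(int)
    return sparse.coo_matrix((triples[:, 2], (rows, cols)), shape=shape).tocsr()


m = sparse.csr_matrix(np.array([[0.0, 1.5], [2.0, 0.0]]))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'matrix.csv')
    save_sparse_triples(m, path)
    restored = load_sparse_triples(path, m.shape)
```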
