Most efficient way to find the mode in a numpy array

Question 1

I have a 2D array containing integers (both positive and negative). Each row represents the values over time for a particular spatial site, whereas each column represents the values for various spatial sites at a given time.

So if the array looks like this:

1 3 4 2 2 7
5 2 2 1 4 1
3 3 2 2 1 1

The result should be

1 3 2 2 2 1

Note that when there are multiple values for the mode, any one of them (selected randomly) may be set as the mode.

I can iterate over the columns and find the mode one at a time, but I was hoping numpy might have some built-in function to do that, or that there is a trick to find it efficiently without looping.

Question 2

Check out scipy.stats.mode() (inspired by @tom10's comment):

import numpy as np
from scipy import stats

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

m = stats.mode(a)
print(m)

Output:

ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))

As you can see, it returns both the mode and the counts. You can select the modes directly via m[0]:

print(m[0])

Output:

[[1 3 2 2 1 1]]

Question 3

Update

The scipy.stats.mode function has been significantly optimized since this post, and would be the recommended method.

Old answer

This is a tricky problem, since there is not much out there to calculate the mode along an axis. The solution is straightforward for 1-D arrays, where numpy.bincount is handy, along with numpy.unique with the return_counts argument set to True. The most common n-dimensional function I see is scipy.stats.mode, although it is prohibitively slow, especially for large arrays with many unique values. As a solution, I've developed this function, and use it heavily:

import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except (IndexError, TypeError):
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If array is 1-D and numpy version is >= 1.9 numpy.unique will suffice
    if ndim == 1 and tuple(int(p) for p in numpy.__version__.split('.')[:2]) >= (1, 9):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1

    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
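    # ogrid may return a list or a tuple depending on the numpy version
    index = list(numpy.ogrid[tuple(slices)])
    index.insert(axis, numpy.argmax(counts, axis=axis))
    index = tuple(index)  # multi-dimensional indexing needs a tuple
    return sort[index], counts[index]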

Results:

In [2]: a = numpy.array([[1, 3, 4, 2, 2, 7],
                         [5, 2, 2, 1, 4, 1],
                         [3, 3, 2, 2, 1, 1]])

In [3]: mode(a)
Out[3]: (array([1, 3, 2, 2, 1, 1]), array([1, 2, 2, 2, 1, 2]))

Some benchmarks:

In [4]: import scipy.stats

In [5]: a = numpy.random.randint(1,10,(1000,1000))

In [6]: %timeit scipy.stats.mode(a)
10 loops, best of 3: 41.6 ms per loop

In [7]: %timeit mode(a)
10 loops, best of 3: 46.7 ms per loop

In [8]: a = numpy.random.randint(1,500,(1000,1000))

In [9]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 1.01 s per loop

In [10]: %timeit mode(a)
10 loops, best of 3: 80 ms per loop

In [11]: a = numpy.random.random((200,200))

In [12]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 3.26 s per loop

In [13]: %timeit mode(a)
1000 loops, best of 3: 1.75 ms per loop

EDIT: Provided more background and modified the approach to be more memory-efficient.
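The sort-then-count-runs idea behind the function above may be easier to follow on a 1-D array. The sketch below is only an illustration of that idea, not the author's implementation; the name mode_1d_sorted is hypothetical and the input is assumed to be non-empty.

```python
import numpy as np

def mode_1d_sorted(values):
    """Return (mode, count) of a non-empty 1-D sequence via sorting.

    Hypothetical helper: after sorting, equal values sit next to each
    other, so the mode is the value of the longest run.
    """
    s = np.sort(np.asarray(values))
    # True wherever a new run of equal values begins
    is_start = np.concatenate(([True], s[1:] != s[:-1]))
    starts = np.flatnonzero(is_start)
    # Run length = distance from each start to the next start (or the end)
    lengths = np.diff(np.concatenate((starts, [s.size])))
    best = np.argmax(lengths)  # first longest run, i.e. smallest value on ties
    return s[starts[best]], lengths[best]

value, count = mode_1d_sorted([3, 1, 3, 2, 2, 3])
```

The n-dimensional function applies the same run-counting along one axis, which is where the extra padding and transposing comes from.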

Question 4

Expanding on this method, applied to finding the mode of data where you may need the index into the actual array to see how far the value is from the center of the distribution.

(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]

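Remember to discard the mode when more than one value attains the maximum count. np.argmax(counts) returns only the first such index, so test np.sum(counts == counts.max()) > 1 rather than len(np.argmax(counts)) > 1. To confirm that the mode is actually representative of the center of your data's distribution, you can also check whether it falls within your standard deviation interval.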
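A minimal sketch of the tie check for a 1-D array, assuming that "more than one mode" means several values share the maximum count. The sample data and the variable names (tied, modes, first_positions) are made up for illustration.

```python
import numpy as np

a = np.array([4, 1, 4, 2, 2, 7])  # illustrative data with two tied modes

vals, idx, counts = np.unique(a, return_index=True, return_counts=True)

# Boolean mask of every unique value that reaches the maximum count
tied = counts == counts.max()
modes = vals[tied]            # all tied modes, in sorted order
first_positions = idx[tied]   # index of each mode's first occurrence in a
has_unique_mode = modes.size == 1
```

When has_unique_mode is False, the result of np.argmax(counts) is just the smallest of the tied values, so it may be worth discarding or flagging.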

Question 5

A neat solution that only uses numpy (neither scipy nor the Counter class):

import numpy as np

A = np.array([[1,3,4,2,2,7], [5,2,2,1,4,1], [3,3,2,2,1,1]])

np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=A)

array([1, 3, 2, 2, 1, 1])
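One caveat: np.bincount only accepts non-negative integers, while the question allows negative values. A possible workaround, sketched here under the assumption that the value range is small enough for bincount to stay cheap, is to shift the data by its minimum first. The function name column_modes_bincount is hypothetical.

```python
import numpy as np

def column_modes_bincount(arr):
    """Per-column mode of an integer array that may contain negatives.

    Hypothetical helper: shifts values so the minimum becomes 0, applies
    bincount column by column, then shifts the result back.
    """
    arr = np.asarray(arr)
    offset = arr.min()
    shifted = arr - offset  # now every entry is >= 0
    modes = np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), 0, shifted
    )
    return modes + offset

B = np.array([[-1,  3, 4],
              [ 5, -2, 4],
              [-1, -2, 2]])
result = column_modes_bincount(B)
```

Note that apply_along_axis still loops over the columns in Python, so this is compact rather than fast.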

Question 6

If you want to use numpy only:

import numpy as np

x = [-1, 2, 1, 3, 3]
vals, counts = np.unique(x, return_counts=True)

gives

(array([-1,  1,  2,  3]), array([1, 1, 1, 2]))

And extract it:

index = np.argmax(counts)
mode = vals[index]  # 3 for the example above
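To apply the same np.unique idea to each column of the 2D array from the question, one option is to wrap it in a small function and hand it to np.apply_along_axis. This is a sketch only; mode_1d is a hypothetical name, and ties resolve to the smallest value because np.unique returns sorted values and np.argmax picks the first maximum.

```python
import numpy as np

def mode_1d(x):
    """Most frequent value of a 1-D sequence (smallest value on ties)."""
    vals, counts = np.unique(x, return_counts=True)
    return vals[np.argmax(counts)]

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

# axis=0 passes each column to mode_1d in turn
col_modes = np.apply_along_axis(mode_1d, 0, a)
```

Unlike the bincount variant, this also works for negative integers without any shifting.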

Question 7

I think a very simple way is to use the Counter class. You can then use the most_common() method of the Counter instance.

For 1-D arrays:

import numpy as np
from collections import Counter

nparr = np.arange(10) 
nparr[2] = 6 
nparr[3] = 6 #6 is now the mode
mode = Counter(nparr).most_common(1)
# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])

For multi-dimensional arrays (only a small difference):

import numpy as np
from collections import Counter

nparr = np.arange(10) 
nparr[2] = 6 
nparr[3] = 6 
nparr = nparr.reshape((2, 5))     # same data, reshaped into a 2-D ndarray (10 elements cannot fill a (10,2,5) shape)
mode = Counter(nparr.flatten()).most_common(1)  # just use .flatten() method

# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])

This may or may not be an efficient implementation, but it is convenient.
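Flattening yields a single mode for the whole array. If a mode per column is wanted, as in the original question, the same Counter approach can be applied column by column. The sketch below uses plain nested lists; for a numpy array, iterating over a.T should give the columns in the same way.

```python
from collections import Counter

rows = [[1, 3, 4, 2, 2, 7],
        [5, 2, 2, 1, 4, 1],
        [3, 3, 2, 2, 1, 1]]

# zip(*rows) transposes the nested list, yielding one tuple per column
column_modes = [Counter(col).most_common(1)[0][0] for col in zip(*rows)]
```

On ties, most_common() keeps the value that was encountered first (Python 3.7+), so the result can differ from the numpy-based answers, which pick the smallest tied value.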

Question 8

from collections import Counter

n = int(input())
data = sorted([int(i) for i in input().split()])

# Sort by value first, then (stably) by frequency, highest first
mode = sorted(sorted(Counter(data).items()), key = lambda x: x[1], reverse = True)[0][0]

print(mode)

Counter(data) counts the frequencies and returns a Counter, which is a dict subclass (not a defaultdict). sorted(Counter(data).items()) sorts by the keys, not by the frequencies. Finally, the frequencies need to be sorted with a second sorted call using key = lambda x: x[1]. reverse = True tells Python to sort the frequencies from largest to smallest. Because the sort is stable, ties in frequency are resolved in favor of the smallest value.
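The two nested sorts select the highest frequency and break ties by the smallest value. As a sketch of an equivalent single pass, the same selection can be written with min() and a composite key; the function name smallest_mode is hypothetical.

```python
from collections import Counter

def smallest_mode(data):
    """Most frequent value; on ties, the smallest such value."""
    counts = Counter(data)
    # Negating the count makes the highest frequency sort first,
    # and the value itself breaks ties in ascending order.
    value, _ = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return value

example = smallest_mode([3, 1, 3, 1, 2])
```

This avoids building two sorted lists when only the top entry is needed.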

Question 9

The simplest way in Python to get the mode of a list or array a:

import statistics
print("mode = " + str(statistics.mode(a)))

That's it.
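When several values tie for the highest count, statistics.mode returns the first one encountered on Python 3.8 and later (earlier versions raise StatisticsError instead). If all tied values are needed, statistics.multimode, also available from Python 3.8, returns them as a list. A short sketch with made-up data:

```python
import statistics

a = [1, 3, 3, 2, 2, 7]  # illustrative data: 3 and 2 both occur twice

first_mode = statistics.mode(a)      # first value to reach the top count
all_modes = statistics.multimode(a)  # every value sharing the top count
```

Both functions work on flat sequences, so a 2D array would have to be processed one row or column at a time.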