Why do we use ReLU in neural networks and how do we use it?

31

Why do we use rectified linear units (ReLU) with neural networks? How does that improve neural networks?

Why do we say that ReLU is an activation function? Isn't softmax the activation function for neural networks? I'm guessing that we use both ReLU and softmax, like this:

neuron 1 with softmax output ----> ReLU on the output of neuron 1, which is
the input of neuron 2 ---> neuron 2 with softmax output -> ...

so that the input of neuron 2 is basically ReLU(softmax(x1)). Is this correct?

neural-networks

— user2896492634

36

The ReLU function is $f(x)=\max(0, x).$

One way ReLUs improve neural networks is by speeding up training. The gradient computation is very simple (either 0 or 1, depending on the sign of $x$). The computational step of a ReLU is also easy: any negative elements are set to 0.0, with no exponentials and no multiplication or division operations.
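To make the "0 or 1" gradient concrete, here is a minimal pure-Python sketch. The helper names `relu` and `relu_grad` are just illustrative, and taking the gradient at exactly $x = 0$ to be 0 is an assumed convention, not something stated above.

```python
def relu(xs):
    """Element-wise ReLU: negative entries become 0.0, the rest pass through."""
    return [x if x > 0 else 0.0 for x in xs]


def relu_grad(xs):
    """Element-wise gradient of ReLU: 1.0 where x > 0, otherwise 0.0.

    The value at exactly x == 0 is a convention; 0.0 is assumed here.
    """
    return [1.0 if x > 0 else 0.0 for x in xs]


inputs = [-2.0, -0.5, 0.0, 0.5, 3.0]
outputs = relu(inputs)
grads = relu_grad(inputs)

print(outputs)  # [0.0, 0.0, 0.0, 0.5, 3.0]
print(grads)    # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Both functions need only a comparison per element, which is the point about cheap computation.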

Gradients of logistic and hyperbolic tangent networks are smaller than the positive portion of the ReLU. This means that the positive portion is updated more rapidly as training progresses. However, this comes at a cost. The 0 gradient on the left-hand side has its own problem, called "dead neurons", in which a gradient update sets the incoming values to a ReLU such that the output is always zero; modified ReLU units such as ELU (or Leaky ReLU, or PReLU, etc.) can ameliorate this.
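A rough illustration of the dead-neuron issue, as a sketch only: the single unit below, its `weight` and `bias`, the four data points, and the leak slope `alpha=0.01` are all made-up values chosen so that the pre-activation is negative for every input.

```python
def relu_grad(z):
    """Gradient of ReLU with respect to its input z."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z, alpha=0.01):
    """Gradient of Leaky ReLU; alpha is an assumed small slope for z <= 0."""
    return 1.0 if z > 0 else alpha


# Hypothetical unit whose bias has been pushed far into the negative range.
weight, bias = 0.5, -10.0
data = [0.0, 1.0, 2.0, 3.0]
pre_activations = [weight * x + bias for x in data]  # all negative

# Total gradient signal flowing back through the unit over the data.
relu_signal = sum(relu_grad(z) for z in pre_activations)
leaky_signal = sum(leaky_relu_grad(z) for z in pre_activations)

print(relu_signal)   # 0.0: no update can revive the unit on this data
print(leaky_signal)  # small but nonzero, so the unit can still recover
```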

$\frac{d}{dx}\text{ReLU}(x)=1\ \forall x > 0$. By contrast, the gradient of a sigmoid unit is at most $0.25$; on the other hand, $\tanh$ fares better for inputs in a region near 0 since $0.25 < \frac{d}{dx}\tanh(x) \le 1\ \forall x \in [-1.31, 1.31]$ (approximately).
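These bounds can be checked numerically. The sketch below evaluates both derivatives on an assumed grid from -5 to 5 in steps of 0.01, so the interval it reports is only accurate to that resolution.

```python
import math


def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


# Assumed evaluation grid: -5.00, -4.99, ..., 5.00
grid = [i / 100.0 for i in range(-500, 501)]

max_sigmoid_grad = max(sigmoid_grad(x) for x in grid)
tanh_above_quarter = [x for x in grid if tanh_grad(x) > 0.25]

print(max_sigmoid_grad)                                  # 0.25, attained at x = 0
print(min(tanh_above_quarter), max(tanh_above_quarter))  # -1.31 1.31
```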

— Sycorax says Reinstate Monica

@aginensky You can ask questions by clicking the Ask Question button at the top of the page.

— Sycorax says Reinstate Monica

I see no evidence that I wanted to ask a question or that I participated in this page. Frankly I'm amazed at how well ReLU works, but I've stopped questioning it :).

— aginensky

@aginensky It appears that the comment was removed in the interim.

— Sycorax says Reinstate Monica

The comment was not removed by me nor was I informed. I've stopped answering questions and I guess this means I'm done with commenting too.

— aginensky

@aginensky I don't know why this would cause you to stop commenting. If you have any questions about comments and moderation, you could ask a question in meta.stats.SE.

— Sycorax says Reinstate Monica

4

One important thing to point out is that ReLU is idempotent. Given that ReLU is $\rho(x) = \max(0, x)$, it's easy to see that $\rho \circ \rho \circ \rho \circ \dots \circ \rho = \rho$ is true for any finite composition. This property is very important for deep neural networks, because each layer in the network applies a nonlinearity. Now, let's apply two sigmoid-family functions to the same input repeatedly 1-3 times:

You can immediately see that sigmoid functions "squash" their inputs resulting in the vanishing gradient problem: derivatives approach zero as $n$ (the number of repeated applications) approaches infinity.
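The same comparison can be reproduced numerically. This is a sketch under assumptions: the two sigmoid-family functions are taken to be the logistic sigmoid and tanh, the sample inputs are arbitrary, and `compose` and `spread` are illustrative helpers. The spread (max minus min) of the outputs stands in for the "squashing".

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def compose(f, n, x):
    """Apply f to x repeatedly, n times."""
    for _ in range(n):
        x = f(x)
    return x


def spread(f, n, xs):
    """Range of outputs after n applications of f: max minus min."""
    ys = [compose(f, n, x) for x in xs]
    return max(ys) - min(ys)


inputs = [-3.0, -1.0, 0.5, 2.0]  # arbitrary sample inputs

for n in (1, 2, 3):
    print(
        n,
        round(spread(relu, n, inputs), 4),
        round(spread(sigmoid, n, inputs), 4),
        round(spread(math.tanh, n, inputs), 4),
    )
```

The ReLU column stays the same for every `n` because of idempotence, while the sigmoid and tanh columns shrink with each application.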

— Eli Korvigo

0

ReLU is the function max(x, 0) applied to an input x, e.g. a matrix from a convolved image. ReLU sets all negative values in the matrix x to zero and leaves all other values unchanged.

ReLU is computed after the convolution and therefore a nonlinear activation function like tanh or sigmoid.

Softmax is a classifier at the end of the neural network. That is logistic regression to regularize outputs to values between 0 and 1. (Alternative here is a SVM classifier).

CNN Forward Pass e.g.: input->conv->ReLU->Pool->conv->ReLU->Pool->FC->softmax
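The ordering in that forward pass (ReLU inside the network, softmax only at the very end) can be sketched in plain Python. This is a simplification, not a CNN: the convolution and pooling steps are replaced by fully connected layers, and the input and all weights and biases (`x`, `W1`, `b1`, `W2`, `b2`) are made-up values.

```python
import math


def dense(x, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    """Numerically stable softmax: non-negative outputs that sum to 1."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical input and parameters, chosen only for illustration.
x = [1.0, -2.0]
W1 = [[0.5, -1.0], [1.5, 1.0], [-1.0, 0.5]]
b1 = [0.0, 0.1, 0.2]
W2 = [[1.0, -1.0, 0.5], [-0.5, 2.0, 1.0]]
b2 = [0.0, 0.0]

hidden = relu(dense(x, W1, b1))          # ReLU on the hidden layer
probs = softmax(dense(hidden, W2, b2))   # softmax only on the output layer

print(hidden)
print(probs, sum(probs))
```

Note that nothing here computes ReLU(softmax(x1)) as guessed in the question: hidden units use ReLU and softmax is applied once, to the final layer's outputs.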

— Randy Welt

8

Downvoting. This is a very bad answer! Softmax is not a classifier! It is a function that normalizes (scales) the outputs to the range [0,1] and ensures they sum up to 1. Logistic regression does not "regularize" anything! The sentence "ReLU is computed after the convolution and therefore a nonlinear activation function like tanh or sigmoid." lacks a verb, or sense.

— Jan Kukacka

1

The answer is not that bad. The sentence without the verb must be "ReLU is computed after the convolution and IS therefore a nonlinear activation function like tanh or sigmoid." Thinking of softmax as a classifier makes sense too. It can be seen as a probabilistic classifier that assigns a probability to each class. It "regularizes"/"normalizes" the outputs to the [0,1] interval.

— user118967

0

ReLU is a literal switch. With an electrical switch that is on, 1 volt in gives 1 volt out and n volts in gives n volts out; when it is off, the output is zero. Switching on or off at zero gives exactly the same graph as ReLU. A weighted sum (dot product) of a number of weighted sums is still a linear system. For a particular input, the ReLU switches are individually on or off. That results in a particular linear projection from the input to the output, as various weighted sums of weighted sums of ... are connected together by the switches. For a particular input and a particular output neuron, there is a compound system of weighted sums that can be summarized as a single effective weighted sum. Since ReLU switches state at zero, there are no sudden discontinuities in the output for gradual changes in the input.
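The "single effective weighted sum" can be checked on a toy case. The sketch below assumes a two-layer network without biases and with made-up weights `W1` and `W2`. It records which switches are on for one input, folds them into one matrix `W_eff`, and confirms that this matrix gives the same output as the network for that input.

```python
def matvec(m, v):
    """Matrix-vector product: one weighted sum (dot product) per row."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# Hypothetical weights: 2 inputs -> 3 hidden ReLU units -> 1 output.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.5]]
W2 = [[1.0, 2.0, -1.0]]
x = [2.0, 1.0]

# Forward pass with ReLU written as an on/off switch per hidden unit.
pre = matvec(W1, x)
switches = [1.0 if z > 0 else 0.0 for z in pre]
y = matvec(W2, [s * z for s, z in zip(switches, pre)])

# Fold the switch states into one matrix: W_eff = W2 * diag(switches) * W1.
W_eff = [[sum(W2[j][i] * switches[i] * W1[i][k] for i in range(len(W1)))
          for k in range(len(x))]
         for j in range(len(W2))]
y_eff = matvec(W_eff, x)

print(switches)  # [1.0, 1.0, 0.0]
print(W_eff)     # [[2.0, 3.0]]
print(y, y_eff)  # [7.0] [7.0]
```

`W_eff` is only valid for inputs that leave the same switches on; a different input may flip some of them and give a different effective matrix.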

There are other numerically efficient weighted sum (dot product) algorithms around, like the FFT and the Walsh-Hadamard transform. There is no reason you can't incorporate those into a ReLU-based neural network and benefit from the computational gains (e.g., fixed filter bank neural networks).

— Sean O'Connor