Can someone please explain the back-propagation algorithm? [duplicate]

13

What is the back-propagation algorithm and how does it work?

algorithms optimization neural-networks

— Ami

1

I posted an answer to this question here in case anyone is interested (I didn't want to repost it).

— Phylliida

14

The back-propagation algorithm is a gradient descent algorithm for fitting a neural network model (as mentioned by @Dikran). Let me explain how.

Formally: plugging the gradient computed at the end of this post into the gradient descent iteration given below, applied to the empirical loss [1], yields the back-propagation algorithm as a particular case of gradient descent.

A neural network model

Formally, we fix ideas with a simple single-hidden-layer model:

$f(x)=g(A^1(s(A^2(x))))$

where $g:\mathbb{R} \rightarrow \mathbb{R}$ and $s:\mathbb{R}^M\rightarrow \mathbb{R}^M$ are known, with $s(x)[m]=\sigma(x[m])$ for all $m=1,\dots,M$, and $A^1:\mathbb{R}^M\rightarrow \mathbb{R}$ and $A^2:\mathbb{R}^p\rightarrow \mathbb{R}^M$ are unknown affine functions. The function $\sigma:\mathbb{R}\rightarrow \mathbb{R}$ is called the activation function in the framework of classification.
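To make the model concrete, here is a minimal sketch of the forward pass in plain Python. The logistic choice for $\sigma$, the identity default for $g$, the parameter names (`W2`, `b2`, `w1`, `b1`) and the numbers are all assumptions made for illustration; the answer itself leaves them unspecified.

```python
import math

def sigma(t):
    """Logistic activation (an assumed choice for sigma)."""
    return 1.0 / (1.0 + math.exp(-t))

def affine(W, b, x):
    """Affine map x -> W x + b, with W given as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def f(x, W2, b2, w1, b1, g=lambda z: z):
    """f(x) = g(A1(s(A2(x)))) with A2(x) = W2 x + b2 and A1(h) = <w1, h> + b1."""
    u = affine(W2, b2, x)                           # A2(x), a vector in R^M
    h = [sigma(t) for t in u]                       # s applies sigma coordinate-wise
    z = sum(w * hm for w, hm in zip(w1, h)) + b1    # A1(h), a scalar
    return g(z)

# p = 2 inputs, M = 3 hidden units, arbitrary illustrative parameters
W2 = [[0.0, 0.0], [1.0, -1.0], [0.5, 0.5]]
b2 = [0.0, 0.0, 0.0]
w1 = [2.0, 1.0, -1.0]
b1 = 0.5
y_hat = f([1.0, 1.0], W2, b2, w1, b1)
```

For the input `[1.0, 1.0]` the hidden pre-activations are `[0, 0, 1]`, so `y_hat` equals `2 - sigma(1)`.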

A quadratic loss function is taken to fix ideas. Hence the input vectors $(x_1,\dots,x_n)$ of $\mathbb{R}^p$ can be fitted to the real outputs $(y_1,\dots,y_n)$ of $\mathbb{R}$ (they could be vectors) by minimizing the empirical loss

$\mathcal{R}_n(A^1,A^2)=\sum_{i=1}^n (y_i-f(x_i))^2\;\;\;\;\;\;\; [1]$

with respect to the choice of $A^1$ and $A^2$.

Gradient descent

A gradient descent for minimizing $\mathcal{R}$ is an algorithm that iterates

$\mathbf{a}_{l+1}=\mathbf{a}_l-\gamma_l \nabla \mathcal{R}(\mathbf{a}_l),\ l \ge 0,$

for well-chosen step sizes $(\gamma_l)_l$ (also called the learning rate in the framework of back-propagation). It requires the calculation of the gradient of $\mathcal{R}$. In the case considered here, $\mathbf{a}_l=(A^1_{l},A^2_{l})$.
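The iteration itself is independent of neural networks. As a rough sketch, here it is applied to a toy quadratic objective rather than to the network loss; the function names, the constant step size 0.1 and the number of iterations are assumptions chosen for illustration.

```python
def gradient_descent(grad, a0, step, n_iter):
    """Iterate a_{l+1} = a_l - gamma_l * grad(a_l) for l = 0, ..., n_iter - 1."""
    a = list(a0)
    for l in range(n_iter):
        gamma = step(l)  # step size (learning rate) at iteration l
        a = [ai - gamma * gi for ai, gi in zip(a, grad(a))]
    return a

# Toy objective R(a) = (a[0] - 3)^2 + (a[1] + 1)^2, minimised at (3, -1)
def grad_R(a):
    return [2.0 * (a[0] - 3.0), 2.0 * (a[1] + 1.0)]

a_min = gradient_descent(grad_R, a0=[0.0, 0.0], step=lambda l: 0.1, n_iter=200)
```

Back-propagation supplies the `grad` argument when the objective is the network loss [1].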

Gradient of $\mathcal{R}$ (for the simple neural network model considered)

Let us denote by $\nabla_1 \mathcal{R}$ the gradient of $\mathcal{R}$ as a function of $A^1$, and by $\nabla_2\mathcal{R}$ the gradient of $\mathcal{R}$ as a function of $A^2$. A standard calculation (using the rule for differentiating a composition of functions), with the notation $z_i=A^1(s(A^2(x_i)))$, so that $f(x_i)=g(z_i)$, and writing $h_i=s(A^2(x_i))$ for the hidden-layer output and $w^1\in\mathbb{R}^M$ for the linear part of $A^1$, gives

$\nabla_1 \mathcal{R}[1:M] =-2\times \sum_{i=1}^n h_i\, g'(z_i) (y_i-f(x_i))$

and, for all $m=1,\dots,M$,

$\nabla_2 \mathcal{R}[1:p,m] =-2\times \sum_{i=1}^n x_i\, g'(z_i)\, w^1[m]\,\sigma'(A^2(x_i)[m]) (y_i-f(x_i))$

Here I used the R notation: $x[a:b]$ is the vector composed of the coordinates of $x$ from index $a$ to index $b$ .
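A gradient computed by hand is easy to get wrong, so it helps to compare it with finite differences. The sketch below applies the chain rule directly to the quadratic loss [1], writing `h` for $s(A^2(x))$ and `w1` for the linear part of $A^1$. The logistic $\sigma$, $g=\tanh$, the data and all names are assumptions for illustration, and the gradients with respect to the offsets of the affine maps are left out for brevity.

```python
import math

def sigma(t):
    return 1.0 / (1.0 + math.exp(-t))

def dsigma(t):
    s = sigma(t)
    return s * (1.0 - s)

g = math.tanh  # assumed output function

def dg(z):
    return 1.0 - math.tanh(z) ** 2

def forward(x, W2, b2, w1, b1):
    """Return (u, h, z) with u = A2(x), h = s(u), z = A1(h)."""
    u = [sum(w * xj for w, xj in zip(row, x)) + b for row, b in zip(W2, b2)]
    h = [sigma(t) for t in u]
    z = sum(w * hm for w, hm in zip(w1, h)) + b1
    return u, h, z

def loss(X, Y, W2, b2, w1, b1):
    """Empirical quadratic loss, equation [1]."""
    return sum((y - g(forward(x, W2, b2, w1, b1)[2])) ** 2 for x, y in zip(X, Y))

def gradients(X, Y, W2, b2, w1, b1):
    """Gradient of the loss w.r.t. w1 (length M) and W2 (M rows of length p)."""
    M, p = len(W2), len(W2[0])
    grad_w1 = [0.0] * M
    grad_W2 = [[0.0] * p for _ in range(M)]
    for x, y in zip(X, Y):
        u, h, z = forward(x, W2, b2, w1, b1)
        delta = -2.0 * (y - g(z)) * dg(z)        # derivative of the i-th term w.r.t. z_i
        for m in range(M):
            grad_w1[m] += delta * h[m]
            back = delta * w1[m] * dsigma(u[m])  # error passed back to hidden unit m
            for j in range(p):
                grad_W2[m][j] += back * x[j]     # row m holds the p weights of unit m
    return grad_w1, grad_W2

X = [[0.5, -1.0], [1.5, 2.0], [-0.3, 0.8]]
Y = [0.2, -0.4, 0.9]
W2 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
b2 = [0.0, 0.1, -0.1]
w1 = [0.3, -0.6, 0.8]
b1 = 0.05

grad_w1, grad_W2 = gradients(X, Y, W2, b2, w1, b1)

def numeric(param, idx, eps=1e-6):
    """Central finite difference of the loss w.r.t. param[idx] (param is edited in place)."""
    old = param[idx]
    param[idx] = old + eps
    up = loss(X, Y, W2, b2, w1, b1)
    param[idx] = old - eps
    down = loss(X, Y, W2, b2, w1, b1)
    param[idx] = old
    return (up - down) / (2.0 * eps)

max_err = max(
    [abs(numeric(w1, m) - grad_w1[m]) for m in range(3)]
    + [abs(numeric(W2[m], j) - grad_W2[m][j]) for m in range(3) for j in range(2)]
)
```

A small `max_err` indicates that the analytic and numerical gradients agree. Note that `grad_W2[m]` is stored as a row of length $p$ per hidden unit, which is the transpose of the $[1:p, m]$ indexing used in the text.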

— robin girard

11

Back-propagation is a way of working out the derivative of the error function with respect to the weights, so that the model can be trained by gradient descent optimisation methods. It is basically just the application of the "chain rule". There isn't really much more to it than that, so if you are comfortable with calculus that is the best way to look at it.

If you are not comfortable with calculus, a better way would be to say that we know how badly the output units are doing because we have a desired output with which to compare the actual output. However, we don't have a desired output for the hidden units, so what do we do? The back-propagation rule is basically a way of spreading out the blame for the error of the output units onto the hidden units. The more influence a hidden unit has on a particular output unit, the more blame it gets for the error. The total blame associated with a hidden unit then gives an indication of how much the input-to-hidden layer weights need changing. The two things that govern how much blame is passed back are the weight connecting the hidden and output layers (obviously) and the output of the hidden unit (if it is shouting rather than whispering it is likely to have a larger influence). The rest is just the mathematical niceties that turn that intuition into the derivative of the training criterion.
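The blame intuition can be sketched in a few lines. This is a simplified illustration that assumes linear units (so the activation-derivative factors are dropped) and uses made-up weights and errors; `V`, `hidden_blame` and `output_weight_gradient` are hypothetical names.

```python
def hidden_blame(V, output_errors):
    """Blame for hidden unit j: sum over output units k of V[k][j] * error_k."""
    n_hidden = len(V[0])
    return [
        sum(V[k][j] * output_errors[k] for k in range(len(V)))
        for j in range(n_hidden)
    ]

def output_weight_gradient(output_errors, hidden_outputs):
    """Change signal for weight V[k][j]: error_k times the output of hidden unit j."""
    return [[e * h for h in hidden_outputs] for e in output_errors]

# Two output units, three hidden units; V[k][j] connects hidden unit j to output unit k.
V = [[2.0, 0.1, 0.0],
     [1.0, 0.1, 0.0]]
output_errors = [0.5, -0.2]        # actual output minus desired output
hidden_outputs = [0.9, 0.9, 0.01]  # the third unit is "whispering"

blame = hidden_blame(V, output_errors)
weight_grad = output_weight_gradient(output_errors, hidden_outputs)
```

Here the first hidden unit, which has large weights to both outputs, receives most of the blame, while the third receives none because its outgoing weights are zero. The weights leaving the "whispering" third unit also get a much smaller change signal than those leaving the first two.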

I'd also recommend Bishop's book for a proper answer! ;o)

— Dikran Marsupial

2

It's an algorithm for training feedforward multilayer neural networks (multilayer perceptrons). There are several nice Java applets around the web that illustrate what's happening, like this one: http://neuron.eng.wayne.edu/bpFunctionApprox/bpFunctionApprox.html. Also, Bishop's book on NNs is the standard desk reference for anything to do with NNs.

— Stephen Turner

In trying to build a permanent repository of high-quality statistical information in the form of questions & answers, we try to avoid link-only answers. If you're able, could you expand this, perhaps by giving a summary of the information at the link?

— Glen_b -Reinstate Monica