Step-by-step example of reverse-mode automatic differentiation

I'm not sure whether this question belongs here, but it is closely related to gradient methods in optimization, which seem to be on-topic here. In any case, feel free to migrate it if you think another community has better expertise on the topic.

In short, I'm looking for a step-by-step example of reverse-mode automatic differentiation. There isn't much literature on the topic out there, and existing implementations (like the one in TensorFlow) are hard to understand without knowing the theory behind them. So I'd be very grateful if someone could show in detail what we pass in, how we process it, and what we take out of the computational graph.

A couple of questions that I have the most difficulty with:

seeds - why do we need them at all?
reverse differentiation rules - I know how to do forward differentiation, but how do we go backward? E.g. in the example from this section, how do we know that $\bar{w_2}=\bar{w_3}w_1$?
do we work with symbols only, or pass through actual values? E.g. in the same example, are $w_i$ and $\bar{w_i}$ symbols or values?

— bạn trai

"Hands-On Machine Learning with Scikit-Learn & TensorFlow", Appendix D, gives a very good explanation in my opinion. I recommend it.

— Agustin Barrachina

Let's say we have the expression $z = x_1x_2 + \sin(x_1)$ and want to find the derivatives $\frac{dz}{dx_1}$ and $\frac{dz}{dx_2}$. Reverse-mode AD splits this task into 2 parts, namely the forward pass and the reverse pass.

Forward pass

First, we decompose our complex expression into a set of primitive ones, i.e. expressions consisting of at most a single function call. Note that I also renamed the input and output variables for consistency, though it's not necessary:

$w_1 = x_1$

$w_2 = x_2$

$w_3 = w_1w_2$

$w_4 = \sin(w_1)$

$w_5 = w_3 + w_4$

$z = w_5$

The advantage of this representation is that the differentiation rules for each separate expression are already known. For example, we know that the derivative of $\sin$ is $\cos$, and so $\frac{dw_4}{dw_1} = \cos(w_1)$. We will use this fact in the reverse pass below.

Essentially, the forward pass consists of evaluating each of these expressions and saving the results. Say our inputs are $x_1 = 2$ and $x_2 = 3$. Then we have:

$w_1 = x_1 = 2$

$w_2 = x_2 = 3$

$w_3 = w_1w_2 = 6$

$w_4 = \sin(w_1) \approx 0.9$

$w_5 = w_3 + w_4 \approx 6.9$

$z = w_5 \approx 6.9$
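
As a minimal sketch (not part of the original answer), the forward pass can be written as straight-line Python with one assignment per primitive expression. The keys mirror the $w_i$ above; the function name `forward` and the dictionary `tape` are hypothetical names chosen for illustration.

```python
import math

def forward(x1, x2):
    """Evaluate z = x1*x2 + sin(x1) one primitive at a time, saving every value."""
    tape = {}
    tape["w1"] = x1
    tape["w2"] = x2
    tape["w3"] = tape["w1"] * tape["w2"]
    tape["w4"] = math.sin(tape["w1"])
    tape["w5"] = tape["w3"] + tape["w4"]
    return tape["w5"], tape

z, tape = forward(2.0, 3.0)
print(round(z, 4))  # 6.9093
```

The saved intermediate values are exactly what the reverse pass needs later, e.g. $w_1$ for $\cos(w_1)$.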

Reverse pass

This is where the magic starts, and it starts with the chain rule. In its basic form, the chain rule states that if you have a variable $t(u(v))$ which depends on $u$ which, in its turn, depends on $v$, then:

$\frac{dt}{dv} = \frac{dt}{du}\frac{du}{dv}$

or, if $t$ depends on $v$ via several paths / variables $u_i$, e.g.:

$u_1 = f(v)$

$u_2 = g(v)$

$t = h(u_1, u_2)$

then (see the proof here):

$\frac{dt}{dv} = \sum_i \frac{dt}{du_i}\frac{du_i}{dv}$
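
To make the multi-path rule concrete, here is a small numerical check. The specific choices $f(v) = v^2$, $g(v) = \sin(v)$ and $h(u_1, u_2) = u_1 u_2$ are assumptions made only for this illustration; the sum over both paths is compared against a central finite difference.

```python
import math

def t_of_v(v):
    u1 = v ** 2          # u1 = f(v)
    u2 = math.sin(v)     # u2 = g(v)
    return u1 * u2       # t = h(u1, u2)

def dt_dv_chain(v):
    u1, u2 = v ** 2, math.sin(v)
    dt_du1, dt_du2 = u2, u1                   # partial derivatives of h = u1 * u2
    du1_dv, du2_dv = 2 * v, math.cos(v)       # derivatives of f and g
    return dt_du1 * du1_dv + dt_du2 * du2_dv  # sum over both paths

def dt_dv_numeric(v, eps=1e-6):
    # Central finite difference, used only as an independent check.
    return (t_of_v(v + eps) - t_of_v(v - eps)) / (2 * eps)

v = 1.3
print(abs(dt_dv_chain(v) - dt_dv_numeric(v)) < 1e-6)  # True
```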

In terms of the expression graph, if we have a final node $z$ and input nodes $w_i$, and the path from $z$ to $w_i$ goes through intermediate nodes $w_p$ (i.e. $z = g(w_p)$ where $w_p = f(w_i)$), we can find the derivative $\frac{dz}{dw_i}$ as

$\frac{dz}{dw_i} = \sum_{p \in \mathrm{parents}(i)} \frac{dz}{dw_p} \frac{dw_p}{dw_i}$

Here the parents of $w_i$ are the nodes that take $w_i$ as a direct input. In other words, to calculate the derivative of the output variable $z$ w.r.t. any intermediate or input variable $w_i$, we only need to know the derivatives of its parents and the formula for the derivative of the primitive expression $w_p = f(w_i)$.

The reverse pass starts at the end (i.e. $\frac{dz}{dz}$) and propagates backward to all dependencies. Here we have (the expression for the "seed"):

$\frac{dz}{dz} = 1$

That may be read as "a change in $z$ results in exactly the same change in $z$", which is quite obvious.

Then we know that $z = w_5$ and so:

$\frac{dz}{dw_5} = 1$

$w_5$ depends linearly on $w_3$ and $w_4$, so $\frac{dw_5}{dw_3} = 1$ and $\frac{dw_5}{dw_4} = 1$. Using the chain rule we find:

$\frac{dz}{dw_3} = \frac{dz}{dw_5} \frac{dw_5}{dw_3} = 1 \times 1 = 1$

$\frac{dz}{dw_4} = \frac{dz}{dw_5} \frac{dw_5}{dw_4} = 1 \times 1 = 1$

From the definition $w_3 = w_1w_2$ and the rules of partial derivatives, we find that $\frac{dw_3}{dw_2} = w_1$. Thus:

$\frac{dz}{dw_2} = \frac{dz}{dw_3} \frac{dw_3}{dw_2} = 1 \times w_1 = w_1$

Which, as we already know from the forward pass, is:

$\frac{dz}{dw_2} = w_1 = 2$

Finally, $w_1$ contributes to $z$ via $w_3$ and $w_4$. Once again, from the rules of partial derivatives we know that $\frac{dw_3}{dw_1} = w_2$ and $\frac{dw_4}{dw_1} = \cos(w_1)$. Thus:

$\frac{dz}{dw_1} = \frac{dz}{dw_3} \frac{dw_3}{dw_1} + \frac{dz}{dw_4} \frac{dw_4}{dw_1} = w_2 + \cos(w_1)$

And again, given known inputs, we can calculate it:

$\frac{dz}{dw_1} = w_2 + \cos(w_1) = 3 + \cos(2) \approx 2.58$

Since $w_1$ and $w_2$ are just aliases for $x_1$ and $x_2$, we get our answer:

$\frac{dz}{dx_1} \approx 2.58$

$\frac{dz}{dx_2} = 2$

And that's it!
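
The whole walkthrough can be sketched in a few lines of Python. This is a hand-written illustration for this one expression, not a general AD implementation; the function name `grad` and the `*_bar` variable names (each holding $\frac{dz}{dw_i}$, the $\bar{w_i}$ notation from the question) are assumptions for the example. Note that the reverse pass here works with actual numbers, not symbols.

```python
import math

def grad(x1, x2):
    # Forward pass: evaluate and keep every primitive value.
    w1, w2 = x1, x2
    w3 = w1 * w2
    w4 = math.sin(w1)
    w5 = w3 + w4

    # Reverse pass: wN_bar holds dz/dwN.
    w5_bar = 1.0                                  # seed: dz/dz = 1 and z = w5
    w3_bar = w5_bar * 1.0                         # dw5/dw3 = 1
    w4_bar = w5_bar * 1.0                         # dw5/dw4 = 1
    w2_bar = w3_bar * w1                          # dw3/dw2 = w1
    w1_bar = w3_bar * w2 + w4_bar * math.cos(w1)  # two paths: via w3 and via w4
    return w5, (w1_bar, w2_bar)

z, (dz_dx1, dz_dx2) = grad(2.0, 3.0)
print(round(dz_dx1, 2), dz_dx2)  # 2.58 2.0
```

The line `w2_bar = w3_bar * w1` is the rule $\bar{w_2} = \bar{w_3} w_1$ asked about in the question: the derivative already accumulated at the parent, multiplied by the local derivative of the primitive.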

This description concerns only scalar inputs, i.e. numbers, but in fact it can also be applied to multidimensional arrays such as vectors and matrices. Two things to keep in mind when differentiating expressions with such objects:

Derivatives may have much higher dimensionality than the inputs or the output, e.g. the derivative of a vector w.r.t. a vector is a matrix, and the derivative of a matrix w.r.t. a matrix is a 4-dimensional array (sometimes referred to as a tensor). In many cases such derivatives are very sparse.
Each component of the output array is an independent function of one or more components of the input array(s). E.g. if $y = f(x)$ and both $x$ and $y$ are vectors, $y_i$ never depends on $y_j$, but only on a subset of the $x_k$. In particular, this means that finding the derivative $\frac{dy_i}{dx_j}$ boils down to tracking how $y_i$ depends on $x_j$.
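
As a rough illustration of both points, the sketch below builds the derivative of a vector w.r.t. a vector (a matrix) for an element-wise square, where $y_i$ depends only on $x_i$. The function `f` and the finite-difference helper `jacobian` are hypothetical names for this example, and finite differences are used here only to keep the sketch short; they are not how AD computes derivatives.

```python
def f(x):
    """Element-wise square: y_i depends only on x_i."""
    return [xi ** 2 for xi in x]

def jacobian(func, x, eps=1e-6):
    """Central-difference matrix J[i][j] = dy_i/dx_j (numerical, for illustration)."""
    n_out = len(func(x))
    J = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[j] += eps
        lo[j] -= eps
        y_hi, y_lo = func(hi), func(lo)
        for i in range(n_out):
            J[i][j] = (y_hi[i] - y_lo[i]) / (2 * eps)
    return J

J = jacobian(f, [1.0, 2.0, 3.0])
for row in J:
    print([round(v, 3) for v in row])
```

Only the diagonal entries are non-zero (approximately 2, 4 and 6), which is the kind of sparsity mentioned above.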

The power of automatic differentiation is that it can deal with complicated structures from programming languages, like conditions and loops. However, if all you need is algebraic expressions and you have a good enough framework for working with symbolic representations, it's possible to construct fully symbolic expressions. In fact, in this example we could produce the expression $\frac{dz}{dw_1} = w_2 + \cos(w_1) = x_2 + \cos(x_1)$ and calculate this derivative for whatever inputs we want.
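
To show how conditions and loops fit in, here is a minimal sketch of reverse-mode AD via operator overloading. The class `Var`, the helpers `sin` and `backward`, and the test function `f` are all hypothetical names invented for this illustration, and real frameworks are organized differently. The graph is recorded while ordinary Python control flow runs, so whichever branch is taken is simply what gets differentiated.

```python
import math

class Var:
    """A value plus links to the inputs it was computed from."""
    def __init__(self, value, inputs=()):
        self.value = value
        self.inputs = inputs   # tuples of (input Var, local derivative)
        self.grad = 0.0        # will hold d(output)/d(this node)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))

def backward(out):
    """Propagate derivatives from `out` to every node it depends on."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for inp, _ in node.inputs:
                visit(inp)
            order.append(node)       # a node is listed after all of its inputs
    visit(out)
    out.grad = 1.0                   # seed: d(out)/d(out) = 1
    for node in reversed(order):     # consumers are processed before their inputs
        for inp, local in node.inputs:
            inp.grad += node.grad * local

def f(x, n):
    y = x
    for _ in range(n):               # ordinary Python loop
        if y.value > 1.0:            # ordinary Python condition
            y = sin(y)
        else:
            y = y * y
    return y

x = Var(2.0)
y = f(x, 2)        # first step takes sin(x), second step squares it
backward(y)
print(abs(x.grad - math.sin(4.0)) < 1e-12)  # True: d/dx sin(x)^2 = sin(2x)
```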

— ffriend

Very useful question/answer. Thanks. Just a little criticism: you seem to move on a tree structure without explaining it (that's when you start talking about parents, etc.)

— MadHatter

Also it won't hurt clarifying why we need seeds.

— MadHatter

@MadHatter thanks for the comment. I tried to rephrase a couple of paragraphs (those that refer to parents) to emphasize a graph structure. I also added "seed" to the text, although this name itself may be misleading in my opinion: in AD the seed is always a fixed expression - $\frac{dz}{dz} = 1$, not something you can choose or generate.

— ffriend

Thanks! I noticed when you have to set more than one "seed", generally one chooses 1 and 0. I'd like to know why. I mean, one takes the "quotient" of a differential w.r.t. itself, so "1" is at least intuitively justified.. But what about 0? And what if one has to pick more than 2 seeds?

— MadHatter

As far as I understand, more than one seed is used only in forward-mode AD. In this case you set the seed to 1 for an input variable you want to differentiate with respect to and set the seed to 0 for all the other input variables so that they don't contribute to the output value. In reverse-mode you set the seed to an output variable, and you normally have only one output variable. I guess, you can construct reverse-mode AD pipeline with several output variables and set all of them but one to 0 to get the same effect as in forward mode, but I have never investigated this option.

— ffriend
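
For readers following the seed discussion in the comments above, here is a rough forward-mode sketch using dual numbers on the same expression $z = x_1x_2 + \sin(x_1)$. The class `Dual` and the helper `dsin` are hypothetical names for this illustration. Each input carries a seed: 1 for the input we differentiate with respect to, 0 for the others, so one run yields one derivative.

```python
import math

class Dual:
    """A value together with its derivative w.r.t. one chosen input."""
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def dsin(x):
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)

def z(x1, x2):
    return x1 * x2 + dsin(x1)

# Seed 1 on x1 and 0 on x2 gives dz/dx1; swapping the seeds gives dz/dx2.
dz_dx1 = z(Dual(2.0, 1.0), Dual(3.0, 0.0)).deriv
dz_dx2 = z(Dual(2.0, 0.0), Dual(3.0, 1.0)).deriv
print(round(dz_dx1, 2), dz_dx2)  # 2.58 2.0
```

Two inputs need two forward runs here, whereas the reverse pass in the answer produces both derivatives from a single seed on the output.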