What is the time complexity for training a neural network using back-propagation?

16

Suppose that a NN contains $n$ hidden layers, $m$ training examples, $x$ features, and $n_i$ nodes in each layer. What is the time complexity of training this NN using back-propagation?

I have a basic idea of how the time complexity of an algorithm is found, but here there are at least 4 different factors to consider: iterations, layers, nodes in each layer, training examples, and maybe more. I found an answer here, but it was not clear enough.

Are there other factors, apart from the ones I mentioned above, that influence the time complexity of the training algorithm of a NN?

— DuttaA

See also https://qr.ae/TWttzq .

— nbro

9

I haven't seen an answer from a trusted source, but I'll try to answer this myself, with a simple example (with my current knowledge).

In general, note that training an MLP using back-propagation is usually implemented with matrices.

Time complexity of matrix multiplication

The time complexity of matrix multiplication for $M_{ij} * M_{jk}$ is simply $\mathcal{O}(i*j*k)$.

Note that we are assuming the simplest multiplication algorithm here: there exist other algorithms with somewhat better time complexity.
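To make the $\mathcal{O}(i*j*k)$ count concrete, here is a minimal sketch of the schoolbook algorithm in plain Python that also counts the scalar multiplications it performs. The function name `matmul_count` and the small matrices are illustrative assumptions, not something from the answer.

```python
def matmul_count(A, B):
    """Multiply A (i x j) by B (j x k) with the schoolbook algorithm.

    Returns the product and the number of scalar multiplications performed.
    """
    i, j, k = len(A), len(B), len(B[0])
    assert all(len(row) == j for row in A), "inner dimensions must match"
    C = [[0] * k for _ in range(i)]
    mults = 0
    for r in range(i):          # i rows of the result
        for c in range(k):      # k columns of the result
            for s in range(j):  # j terms in each dot product
                C[r][c] += A[r][s] * B[s][c]
                mults += 1
    return C, mults


A = [[1, 2, 3], [4, 5, 6]]    # 2 x 3
B = [[1, 0], [0, 1], [1, 1]]  # 3 x 2
C, mults = matmul_count(A, B)
print(C, mults)  # mults equals 2 * 3 * 2
```

The three nested loops run $i$, $k$ and $j$ times respectively, which is where the product $i*j*k$ comes from.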

Feedforward pass algorithm

The feedforward propagation algorithm is as follows.

First, to go from layer $i$ to layer $j$, you compute

$S_j = W_{ji}*Z_i$

Then you apply the activation function

$Z_j = f(S_j)$

If we have $N$ layers (including the input and output layers), these two steps run $N-1$ times.

Example

As an example, let's compute the time complexity of the forward pass for an MLP with $4$ layers, where $i$ denotes the number of nodes in the input layer, $j$ the number of nodes in the second layer, $k$ the number of nodes in the third layer, and $l$ the number of nodes in the output layer.

Since there are $4$ layers, you need $3$ matrices to represent the weights between them. Let's denote them by $W_{ji}$, $W_{kj}$ and $W_{lk}$, where $W_{ji}$ is a matrix with $j$ rows and $i$ columns ($W_{ji}$ thus contains the weights going from layer $i$ to layer $j$).

Assume you have $t$ training examples. To propagate from layer $i$ to $j$, we first have

$S_{jt} = W_{ji} * Z_{it}$

and this operation (i.e. matrix multiplication) has time complexity $\mathcal{O}(j*i*t)$. Then we apply the activation function

$Z_{jt} = f(S_{jt})$

and this has time complexity $\mathcal{O}(j*t)$, because it is an element-wise operation.

So, in total, we have

$\mathcal{O}(j*i*t + j*t) = \mathcal{O}(j*t*(i + 1)) = \mathcal{O}(j*i*t)$

Using the same logic, for going $j \to k$, we have $\mathcal{O}(k*j*t)$, and, for $k \to l$, we have $\mathcal{O}(l*k*t)$.

In total, the time complexity for feedforward propagation will be

$\mathcal{O}(j*i*t + k*j*t + l*k*t) = \mathcal{O}(t*(ij + jk + kl))$

I don't think this can be simplified further in general: each of the three terms depends on a different pair of layer sizes, so none of them can be dropped without knowing those sizes. $\mathcal{O}(t*i*j*k*l)$ is also a valid upper bound, but a much looser one.
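As a quick sanity check of the $t*(ij + jk + kl)$ count, the sketch below tallies the work of one forward pass layer by layer. The helper name `forward_cost` and the layer sizes are made up for illustration; biases are ignored, as in the derivation above.

```python
def forward_cost(layer_sizes, t):
    """Count the work of one forward pass of a fully connected MLP on t examples.

    Returns (scalar multiplications in S = W * Z, activation evaluations in Z = f(S)).
    """
    mults = 0
    activations = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        mults += n_out * n_in * t   # (n_out x n_in) times (n_in x t)
        activations += n_out * t    # f applied element-wise to an n_out x t matrix
    return mults, activations


i, j, k, l, t = 5, 4, 3, 2, 10
mults, activations = forward_cost([i, j, k, l], t)
print(mults, t * (i * j + j * k + k * l))  # the two numbers agree
print(activations)                          # lower-order term t * (j + k + l)
```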

Back-propagation algorithm

The back-propagation algorithm proceeds as follows. Starting from the output layer $l \to k$, we compute the error signal, $E_{lt}$, a matrix containing the error signals for the nodes at layer $l$

$E_{lt} = f'(S_{lt}) \odot (Z_{lt} - O_{lt})$

where $\odot$ means element-wise multiplication. Note that $E_{lt}$ has $l$ rows and $t$ columns: each column is the error signal for one training example.

We then compute the "delta weights", $D_{lk} \in \mathbb{R}^{l \times k}$ (between layer $l$ and layer $k$)

$D_{lk} = E_{lt} * Z_{tk}$

where $Z_{tk}$ is the transpose of $Z_{kt}$.

We then adjust the weights

$W_{lk} = W_{lk} - D_{lk}$

For $l \to k$, we thus have the time complexity $\mathcal{O}(lt + lt + ltk + lk) = \mathcal{O}(l*t*k)$.

Now, going back from $k \to j$, we first have

$E_{kt} = f'(S_{kt}) \odot (W_{kl} * E_{lt})$

where $W_{kl}$ is the transpose of $W_{lk}$. Then

$D_{kj} = E_{kt} * Z_{tj}$

And then

$W_{kj} = W_{kj} - D_{kj}$

For $k \to j$, we have the time complexity $\mathcal{O}(kt + klt + ktj + kj) = \mathcal{O}(k*t(l+j))$.

And finally, for $j \to i$, we have $\mathcal{O}(j*t(k+i))$. In total, we have

$\mathcal{O}(ltk + tk(l + j) + tj (k + i)) = \mathcal{O}(t*(lk + kj + ji))$

which is the same as for the feedforward pass. Since they are the same, the total time complexity for one epoch is

$\mathcal{O}(t*(ij + jk + kl)).$

This time complexity is then multiplied by the number of iterations (epochs). So, we have

$\mathcal{O}(n*t*(ij + jk + kl)),$

where $n$ is the number of iterations.
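The sketch below counts the matrix-product multiplications of the forward and backward passes, following the steps described above, and multiplies by the number of epochs. The function names and layer sizes are hypothetical, and element-wise operations and biases are left out because they are lower-order terms.

```python
def forward_mults(sizes, t):
    """Multiplications in the forward products S = W * Z over all layers."""
    return sum(n_out * n_in * t for n_in, n_out in zip(sizes, sizes[1:]))


def backward_mults(sizes, t):
    """Multiplications in the backward products over all layers."""
    total = 0
    last = len(sizes) - 1
    for idx in range(1, len(sizes)):
        n_in, n_out = sizes[idx - 1], sizes[idx]
        total += n_out * t * n_in                # D = E * Z^T
        if idx < last:
            total += n_out * sizes[idx + 1] * t  # W^T * E for a hidden layer
    return total


def training_mults(sizes, t, n):
    """Total multiplications for n epochs of batch gradient descent."""
    return n * (forward_mults(sizes, t) + backward_mults(sizes, t))


sizes, t, n = [5, 4, 3, 2], 10, 100
per_epoch_bound = t * (5 * 4 + 4 * 3 + 3 * 2)  # t * (ij + jk + kl)
print(forward_mults(sizes, t), backward_mults(sizes, t))
print(training_mults(sizes, t, n) <= 3 * n * per_epoch_bound)
```

The backward pass does at most twice the work of the forward pass here, so the total stays within a constant factor of $n*t*(ij + jk + kl)$, which is all the big-O statement claims.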

Notes

Note that these matrix operations can be heavily parallelized on GPUs.

Conclusion

We tried to find the time complexity for training a neural network that has 4 layers with respectively $i$ , $j$ , $k$ and $l$ nodes, with $t$ training examples and $n$ epochs. The result was $\mathcal{O}(nt*(ij + jk + kl))$ .

We assumed the simplest form of matrix multiplication, which has cubic time complexity, and we used the batch gradient descent algorithm. The results for stochastic and mini-batch gradient descent should be the same. (Let me know if you think otherwise: note that batch gradient descent is the general form; with little modification, it becomes stochastic or mini-batch.)
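One way to check the claim about stochastic and mini-batch gradient descent is to split the $t$ examples into batches and add up the work per epoch. This is only a sketch under the assumption that each batch costs (number of weights) times (batch size) multiplications for its matrix products, plus one element-wise update of every weight; `epoch_costs` is a made-up helper.

```python
def epoch_costs(weights, t, batch_size):
    """Work in one epoch when t examples are processed in batches.

    Returns (multiplications in forward matrix products, element-wise update operations).
    """
    product_mults = 0
    update_ops = 0
    remaining = t
    while remaining > 0:
        b = min(batch_size, remaining)
        product_mults += weights * b  # matrix products scale with the batch size
        update_ops += weights         # one weight update per batch
        remaining -= b
    return product_mults, update_ops


w, t = 5 * 4 + 4 * 3 + 3 * 2, 10     # w = ij + jk + kl for layer sizes 5, 4, 3, 2
for batch_size in (t, 4, 1):         # batch, mini-batch, stochastic
    print(batch_size, epoch_costs(w, t, batch_size))
```

The matrix-product count is the same for every batch size. The number of weight updates grows as the batches shrink, but it never exceeds $t$ updates of $w$ weights, so the per-epoch bound stays $\mathcal{O}(t*(ij + jk + kl))$.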

Also, if you use momentum optimization, you will have the same time complexity, because the extra matrix operations required are all element-wise operations, so they do not affect the time complexity of the algorithm.

I'm not sure what the results would be using other optimizers such as RMSprop.

Sources

The following article http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5 describes an implementation using matrices. Although this implementation is using "row major", the time complexity is not affected by this.

If you're not familiar with back-propagation, check this article:

http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4

— M.kazem Akhgary

Your answer is great. I could not find any ambiguity so far, but you forgot the number of iterations part; just add it. If no one answers in 5 days, I'll surely accept your answer.

— DuttaA

@DuttaA I tried to put in everything I knew. It may not be 100% correct, so feel free to leave this unaccepted :) I'm also waiting for other answers to see what other points I missed.

— M.kazem Akhgary

3

For the evaluation of a single pattern, you need to process all weights and all neurons. Given that every neuron has at least one weight, we can ignore the neurons, and have $\mathcal{O}(w)$ where $w$ is the number of weights, i.e., $n * n_i$ , assuming full connectivity between your layers.

The back-propagation has the same complexity as the forward evaluation (just look at the formula).

So, the complexity for learning $m$ examples, where each gets repeated $e$ times, is $\mathcal{O}(w*m*e)$ .
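A minimal sketch of this estimate, assuming fully connected layers without biases; the helper names and the layer sizes are illustrative only.

```python
def count_weights(layer_sizes):
    """Number of weights in a fully connected MLP (biases ignored)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def training_work(layer_sizes, m, e):
    """Estimate w * m * e: constant work per weight, per example, per epoch."""
    return count_weights(layer_sizes) * m * e


w = count_weights([5, 4, 3, 2])
print(w)                                        # 5*4 + 4*3 + 3*2
print(training_work([5, 4, 3, 2], m=10, e=100))
```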

The bad news is that there's no formula telling you what number of epochs $e$ you need.

— maaartinus

From the above answer, don't you think it depends on more factors?

— DuttaA

1

@DuttaA No. There's a constant amount of work per weight, which gets repeated e times for each of m examples. I didn't bother to compute the number of weights, I guess, that's the difference.

— maaartinus

I think the answers are the same. In my answer I can assume the number of weights is w = ij + jk + kl, basically the sum of n * n_i between layers, as you noted.

— M.kazem Akhgary