YOLO loss function explanation

15

I am trying to understand the YOLO v2 loss function:

$\begin{align} &\lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{obj}[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2 ] \\&+ \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{obj}[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2 +(\sqrt{h_i}-\sqrt{\hat{h}_i})^2 ]\\ &+ \sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{obj}(C_i - \hat{C}_i)^2 + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{noobj}(C_i - \hat{C}_i)^2 \\ &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj}\sum_{c \in classes}(p_i(c) - \hat{p}_i(c))^2 \\ \end{align}$

Could anyone explain the function in detail?

— Kamel BOUYACOUB

5

No one can help you without context ... at least tell us which paper this is from.

— bdeonovic

1

"I don't understand" and "detail the function" are too broad. Please try to identify specific questions. Note that there are already many questions relating to YOLO, some of which may give you at least part of what you are looking for.

— Glen_b -Reinstate Monica

I will add my answer if you point out what is unclear in this excellent explanation: Medium.com/@jonathan_hui/…

— Aksakal

This blog has a detailed graphical explanation of YOLO and YOLOv2. It answers the question about the loss function. I find it very useful for beginners and for more advanced users.

— MBoaretto

18

Explanation of the different terms:

The three $\lambda$ constants are just weights that give more importance to one aspect of the loss function than another. In the article, $\lambda_{coord}$ is the largest, so that the first term carries more weight.
The YOLO prediction is an $S*S*(B*5+C)$ vector: $B$ bbox predictions for each grid cell and $C$ class predictions for each grid cell (where $C$ is the number of classes). The 5 bbox outputs of box $j$ of cell $i$ are the coordinates of the bbox center $x_{ij}$, $y_{ij}$, the height $h_{ij}$, the width $w_{ij}$, and a confidence score $C_{ij}$.
I imagine that the values with a hat are the real ones read from the label and the ones without a hat are the predicted ones. So what is the real value from the label for the confidence score of each bbox, $\hat{C}_{ij}$? It is the intersection over union of the predicted bounding box with the one from the label.
$\mathbb{1}_{i}^{obj}$ is $1$ when there is an object in cell $i$ and $0$ elsewhere.
$\mathbb{1}_{ij}^{obj}$ "denotes that the $j$th bounding box predictor in cell $i$ is responsible for that prediction". In other words, it is equal to $1$ if there is an object in cell $i$ and the confidence of the $j$th predictor of this cell is the highest among all the predictors of this cell. $\mathbb{1}_{ij}^{noobj}$ is almost the same, except that it is 1 when there is NO object in cell $i$.

Note that I used two indices, $i$ and $j$, for each bbox prediction. This is not the case in the article, because there is always a factor $\mathbb{1}_{ij}^{obj}$ or $\mathbb{1}_{ij}^{noobj}$, so there is no ambiguity: the $j$ chosen is the one corresponding to the highest confidence score in that cell.
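The confidence target described above, the intersection over union (IOU) of a predicted box with the labelled box, can be sketched in a few lines. This is only an illustration: the function name `iou` and the `(cx, cy, w, h)` box layout are assumptions made for this sketch, not something taken from the paper or from any particular implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h).

    The (center x, center y, width, height) layout is an assumption
    made for this sketch.
    """
    # Convert center/size form to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Overlap is clamped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


predicted = (0.5, 0.5, 0.2, 0.2)
labelled = (0.6, 0.5, 0.2, 0.2)
print(iou(predicted, labelled))
```

Under the reading given in this answer, the value returned for the responsible predictor would play the role of $\hat{C}_{ij}$ in the confidence term.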

More general explanation of each term of the sum:

The first term penalizes poor localization of the bounding box center.
The second term penalizes bounding boxes with inaccurate height and width. The square root is there so that errors in small bounding boxes are penalized more than errors in large bounding boxes.
The third term tries to make the confidence score equal to the IOU between the object and the prediction when there is an object.
The fourth term tries to make the confidence score close to $0$ when there is no object in the cell.
The fifth term is a simple classification loss (not explained in the article).
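To make the five terms concrete, here is a minimal pure-Python sketch of the loss under the reading given above. Several things are assumptions for illustration only: the function name `yolo_v1_loss`, the dictionary layout of a cell, the choice to treat $\mathbb{1}_{ij}^{noobj}$ as the complement of $\mathbb{1}_{ij}^{obj}$ (the answers in this thread differ on that point), and the weights 5 and 0.5, which are the values quoted elsewhere in this thread.

```python
import math

LAMBDA_COORD = 5.0  # weight on the two coordinate terms
LAMBDA_NOOBJ = 0.5  # weight on the no-object confidence term


def yolo_v1_loss(cells):
    """Sum-squared loss over a list of grid cells.

    Each cell is a dict (a hypothetical layout chosen for this sketch):
      "boxes":       B predicted tuples (x, y, w, h, conf), with w, h >= 0
      "class_probs": predicted class probabilities for the cell
      "truth":       None if no object falls in the cell, else a dict with
                     "box" (x, y, w, h), "class_probs" (one-hot list),
                     "responsible" (index j of the responsible predictor)
                     and "iou" (target confidence for that predictor)
    """
    loss = 0.0
    for cell in cells:
        truth = cell["truth"]
        for j, (x, y, w, h, conf) in enumerate(cell["boxes"]):
            if truth is not None and j == truth["responsible"]:
                tx, ty, tw, th = truth["box"]
                # Term 1: center coordinates.
                loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
                # Term 2: square roots of width and height.
                loss += LAMBDA_COORD * (
                    (math.sqrt(w) - math.sqrt(tw)) ** 2
                    + (math.sqrt(h) - math.sqrt(th)) ** 2
                )
                # Term 3: confidence should match the IOU target.
                loss += (conf - truth["iou"]) ** 2
            else:
                # Term 4 (assumption: every other predictor is "noobj"):
                # confidence should be close to 0.
                loss += LAMBDA_NOOBJ * conf ** 2
        if truth is not None:
            # Term 5: squared error on class probabilities, object cells only.
            loss += sum(
                (p - t) ** 2
                for p, t in zip(cell["class_probs"], truth["class_probs"])
            )
    return loss


empty_cell = {
    "boxes": [(0.5, 0.5, 0.2, 0.2, 0.5), (0.1, 0.1, 0.1, 0.1, 0.5)],
    "class_probs": [0.0, 1.0],
    "truth": None,
}
print(yolo_v1_loss([empty_cell]))
```

In the example, the cell contains no object, so only the fourth term contributes: both predictors are penalized for reporting a confidence of 0.5.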

— user753566

1

Is the second point supposed to be B*(5+C)? At least that is the case for YOLO v3.

— sachinruk

@sachinruk this reflects the changes in the model between the original YOLO and its v2 and v3.

— David Refaeli

12

$\begin{align} &\lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{obj}[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2 ] \\&+ \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{obj}[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2 +(\sqrt{h_i}-\sqrt{\hat{h}_i})^2 ]\\ &+ \sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{obj}(C_i - \hat{C}_i)^2 + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{noobj}(C_i - \hat{C}_i)^2 \\ &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj}\sum_{c \in classes}(p_i(c) - \hat{p}_i(c))^2 \\ \end{align}$

Doesn't the YOLOv2 loss function look scary? It actually isn't! It is one of the boldest, smartest loss functions around.

Let's first look at what the network actually predicts.

If we recap, YOLOv2 predicts detections on a 13x13 feature map, so in total, we have 169 maps/cells.

We have 5 anchor boxes. For each anchor box we need Objectness-Confidence Score (whether any object was found?), 4 Coordinates ( $t_x, t_y, t_w,$ and $t_h$ ) for the anchor box, and 20 top classes. This can crudely be seen as 20 coordinates, 5 confidence scores, and 100 class probabilities for all 5 anchor box predictions put together.
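The counts in the paragraph above can be checked with a little arithmetic. The snippet below is just that bookkeeping; the constant names are made up for this sketch.

```python
GRID = 13          # feature map is 13x13
NUM_ANCHORS = 5    # anchor boxes per cell
NUM_CLASSES = 20   # classes per anchor box

cells = GRID * GRID                              # number of grid cells
coords_per_cell = NUM_ANCHORS * 4                # t_x, t_y, t_w, t_h per anchor
conf_per_cell = NUM_ANCHORS * 1                  # one objectness score per anchor
class_per_cell = NUM_ANCHORS * NUM_CLASSES       # class scores per anchor
values_per_cell = coords_per_cell + conf_per_cell + class_per_cell
total_values = cells * values_per_cell

print(cells, coords_per_cell, conf_per_cell, class_per_cell)
print(values_per_cell, total_values)
```

So each cell carries 125 numbers, which is the same as 5 anchors times (4 + 1 + 20) values.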

We have a few things to worry about:

$x_i, y_i$ , which is the location of the centroid of the anchor box
$w_i, h_i$ , which is the width and height of the anchor box
$C_i$ , which is the Objectness, i.e. confidence score of whether there is an object or not, and
$p_i(c)$ , which is the class probability used in the classification loss.
We not only need to train the network to detect an object if there is one in a cell, we also need to punish the network if it predicts an object in a cell where there wasn't any. How do we do this? We use a mask ( $𝟙_{i}^{obj}$ and $𝟙_{i}^{noobj}$ ) for each cell. If there was originally an object in the cell, $𝟙_{i}^{obj}$ is 1; for the other, no-object cells it is 0. $𝟙_{i}^{noobj}$ is just the inverse of $𝟙_{i}^{obj}$ : it is 1 if there was no object in the cell and 0 if there was.
We need to do this for all 169 cells, and
We need to do this 5 times (for each anchor box).

All losses are mean-squared errors, except classification loss, which uses cross-entropy function.

Now, let's break down the formula above.

We need to compute losses for each anchor box (5 in total).
- $\sum_{j=0}^B$ represents this part, where B = 5. If the index starts from 0, the last index should be B - 1 = 4.
We need to do this for each of the 13x13 cells, where S = 13. If the index starts from 0, the last index should be $S^2 - 1 = 168$.
- $\sum_{i=0}^{S^2}$ represents this part.
$𝟙_{ij}^{obj}$ is 1 when there is an object in the cell $i$ , else 0.
$𝟙_{ij}^{noobj}$ is 1 when there is no object in the cell $i$ , else 0.
$𝟙_{i}^{obj}$ is 1 when an object appears in cell $i$ , else 0, so the class term is only counted for cells that contain an object.
λs are constants. λ is highest for the coordinates in order to focus more on detection (remember, in YOLOv2 we first train for recognition and then for detection; penalizing heavily for recognition is a waste of time, so we focus instead on getting the best bounding boxes!)
We can also notice that $w_i, h_i$ are under a square root. This is done to penalize errors on smaller bounding boxes more, as we need better predictions on smaller objects than on bigger objects (author's call). Check out the table below and observe how the smaller values are punished more if we follow the "square-root" method (look at the crossover point when we have 0.3 and 0.2 as the input values) (PS: I have kept the ratio of var1 and var2 the same just for explanation):

var1 | var2 | (var1 - var2)^2 | (sqrt(var1) - sqrt(var2))^2
--- | --- | --- | ---
0.0300 | 0.020 | 9.99e-05 | 0.001
0.0330 | 0.022 | 0.00012 | 0.0011
0.0693 | 0.046 | 0.000533 | 0.00233
0.2148 | 0.143 | 0.00512 | 0.00723
0.3030 | 0.202 | 0.01 | 0.01
0.8808 | 0.587 | 0.0862 | 0.0296
4.4920 | 2.994 | 2.2421 | 0.1512
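The comparison in the table can be reproduced with a short script. This is only a sketch: the helper names are invented here, and the input pairs are three rows taken from the table.

```python
import math


def squared_error(a, b):
    """Plain squared difference."""
    return (a - b) ** 2


def sqrt_squared_error(a, b):
    """Squared difference of the square roots, as in the w/h term."""
    return (math.sqrt(a) - math.sqrt(b)) ** 2


# (var1, var2) pairs with var2 / var1 kept at roughly 2/3.
pairs = [(0.03, 0.02), (0.303, 0.202), (4.492, 2.994)]

for var1, var2 in pairs:
    plain = squared_error(var1, var2)
    rooted = sqrt_squared_error(var1, var2)
    print(f"{var1:7.4f} {var2:7.4f} {plain:10.6f} {rooted:10.6f}")
```

For a fixed ratio r = var2 / var1, the two errors are equal when var1 = 1 / (1 + sqrt(r))^2, which is about 0.303 for r = 2/3. Below that point the square-root version gives the larger penalty, above it the smaller one, which matches the crossover visible in the table.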

Not that scary, right!

Read HERE for further details.

— RShravan

1

Should i and j in \sum start from 1 instead of 0?

— webbertiger

1

Yes, that's correct webertiger, have updated the answer accordingly. Thanks!

— RShravan

Isn't $\mathbb{1}_{ij}^{obj}$ 1 when there is an object in cell i for bounding box j, and not for all j? How do we choose which j to set to one and the rest to zero, i.e. what is the correct scale/anchor where it is turned on?

— sachinruk

1

I believe S should still be 13, but if the summation starts at 0 it should end at $S^2 - 1$.

— Julian

3

@RShravan, you say: "All losses are mean-squared errors, except classification loss, which uses cross-entropy function". Could you explain? In this equation, it looks as MSE also. Thanks in advance

— Julian

3

Your loss function is for YOLO v1, not YOLO v2. I was also confused by the difference between the two loss functions, and it seems many people are: https://groups.google.com/forum/#!topic/darknet/TJ4dN9R4iJk

YOLOv2 paper explains the difference in architecture from YOLOv1 as follows:

We remove the fully connected layers from YOLO(v1) and use anchor boxes to predict bounding boxes... When we move to anchor boxes we also decouple the class prediction mechanism from the spatial location and instead predict class and objectness for every anchorbox.

This means that the class probability $p_i(c)$ above should depend not only on $i$ and $c$ but also on an anchor box index, say $j$ . Therefore, the loss needs to be different from the one above. Unfortunately, the YOLOv2 paper does not explicitly state its loss function.

I try to make a guess on the loss function of YOLOv2 and discuss it here: https://fairyonice.github.io/Part_4_Object_Detection_with_Yolo_using_VOC_2012_data_loss.html

— FairyOnIce

1

Here is my Study Note

Loss function: sum-squared error

a. Reason: easy to optimize.
b. Problem: (1) It does not perfectly align with our goal of maximizing average precision. (2) In every image, many grid cells do not contain any object. This pushes the confidence scores of those cells towards 0, often overpowering the gradient from cells that do contain an object.
c. Solution: increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$.
d. Sum-squared error also weights errors in large boxes and small boxes equally.

Only one bounding box should be responsible for each object. We assign one predictor to be responsible for predicting an object based on which prediction has the highest current IOU with the ground truth.
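The assignment rule in the last sentence amounts to an argmax over the IOU values of a cell's predictors. The sketch below assumes the IOUs have already been computed; the function names `responsible_predictor` and `obj_mask` are hypothetical.

```python
def responsible_predictor(ious):
    """Index of the predictor with the highest IOU against the ground truth."""
    return max(range(len(ious)), key=lambda j: ious[j])


def obj_mask(ious, has_object):
    """Per-predictor indicator for one cell.

    Returns 1 for the responsible predictor and 0 for the others; all zeros
    when no object falls in the cell.
    """
    if not has_object:
        return [0] * len(ious)
    j_star = responsible_predictor(ious)
    return [1 if j == j_star else 0 for j in range(len(ious))]


# Two predictors in a cell that contains an object.
print(obj_mask([0.3, 0.7], has_object=True))
# The same predictors in a cell with no object.
print(obj_mask([0.3, 0.7], has_object=False))
```

Because the IOUs depend on the current predictions, the responsible predictor can change from one training step to the next.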

a. Loss from the bounding box coordinates (x, y). Note that the loss comes from one bounding box from one grid cell, even if the object is not in the grid cell as ground truth.

$\begin{cases} \lambda_{coord} \sum^{S^2}_{i=0} [(x_i - \hat{x}_i)^2 + (y_i - \hat{y_i})^2] &\text{responsible bounding box} \\ 0 &\text{ other} \\ \end {cases}$

b. Loss from width w and height h. Note that the loss comes from one bounding box from one grid cell, even if the object is not in the grid cell as ground truth.

$\begin {cases} \lambda_{coord} \sum^{S^2}_{i=0} [(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2] &\text{responsible bounding box} \\ 0 &\text{ other} \\ \end {cases}$

c. Loss from the confidence in each bounding box. Note that the loss comes from one bounding box from one grid cell, even if the object is not in the grid cell as ground truth.

$\begin {cases} \sum^{S^2}_{i=0}(C_i - \hat{C}_i)^2 &\text{obj in grid cell and responsible bounding box} \\ \lambda_{noobj} \sum^{S^2}_{i=0}(C_i - \hat{C}_i)^2 &\text{obj not in grid cell and responsible bounding box} \\ 0 &\text{other} \end {cases}$

d. Loss from the class probability of the grid cell, only when an object is in the grid cell as ground truth.

$\begin {cases} \sum^{S^2}_{i=0} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2 &\text{obj in grid cell}\\ 0 &\text{other} \\ \end {cases}$

The loss function only penalizes classification error if an object is present in the grid cell. It also only penalizes bounding box coordinate error if that box is responsible for the ground truth box (highest IOU).

— roy

Question about 'C': in the paper, the confidence is the object-or-no-object output multiplied by the IOU. Is that just for test time, or is it used in the training cost function as well? I thought we just subtract the C value of the output from the label (just like we did with the grid values), but is that incorrect?

— moondra

0

The loss formula you wrote is of the original YOLO paper loss, not the v2, or v3 loss.

There are some major differences between versions. I suggest reading the papers, or checking the code implementations. Papers: v2, v3.

Some major differences I noticed:

Class probability is calculated per bounding box (hence the output is now S*S*B*(5+C) instead of S*S*(B*5 + C))
Bounding box coordinates now have a different representation
In v3 they use 3 boxes across 3 different "scales"
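The first difference in the list above changes the size of the output. A small sketch of the two formulas follows; the function names are invented, and the example values S = 7, B = 2, C = 20 are the settings reported in the original YOLO paper for PASCAL VOC, used here for both layouts only so the formulas can be compared like for like.

```python
def shared_class_output_size(S, B, C):
    """Original layout: one set of C class scores per cell."""
    return S * S * (B * 5 + C)


def per_box_class_output_size(S, B, C):
    """Later layout: each of the B boxes carries its own C class scores."""
    return S * S * B * (5 + C)


S, B, C = 7, 2, 20
print(shared_class_output_size(S, B, C))
print(per_box_class_output_size(S, B, C))
```

With the same S, B and C, the per-box layout is larger by S*S*(B-1)*C values, since every extra box per cell brings its own copy of the class scores.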

You can try getting into the nitty-gritty details of the loss, either by looking at the python/keras implementation v2, v3 (look for the function yolo_loss) or directly at the c implementation v3 (look for delta_yolo_box, and delta_yolo_class).

— David Refaeli