Wise-IoU: Bounding Box Regression Loss
with Dynamic Focusing Mechanism
Abstract
The loss function for bounding box regression (BBR) is essential to object detection. Its good definition will bring significant performance improvement to the model. Most existing works assume that the examples in the training data are high-quality and focus on strengthening the fitting ability of BBR loss. If we blindly strengthen BBR on low-quality examples, it will jeopardize localization performance. Focal-EIoU v1 was proposed to solve this problem, but due to its static focusing mechanism (FM), the potential of non-monotonic FM was not fully exploited. Based on this idea, we propose an IoU-based loss with a dynamic non-monotonic FM named Wise-IoU (WIoU). When WIoU is applied to the state-of-the-art real-time detector YOLOv7, the on the MS-COCO dataset is improved from 53.03% to 54.50%.
Index Terms:
Object detection, bounding box regression, dynamic non-monotonic focusing mechanism, generalization performanceI Introduction
The real-time detectors of the YOLO series have been recognized by most researchers and applied in many scenarios [1, 2, 3, 4, 5, 6] since their inception. Such as YOLOv1 [7], which constructs a loss function weighted by BBR loss, classification loss, and objectness loss. Until now, this construction is still the most effective loss function paradigm [7, 8, 9, 10, 11, 12, 13, 14] for object detection tasks, where the BBR loss directly determines the localization performance of the model. To further improve the localization performance of the model, a well-designed BBR loss is essential.
I-A norm Loss
For the anchor box , the values in it correspond to the center coordinates and size of the bounding box. Similarly, describes the properties of the target box.
YOLOv1 [7] and YOLOv2 [8] are quite similar in the definition of BBR loss. YOLOv2 defines the BBR loss as:
(1) |
However, this form of the loss function cannot shield the interference of the size of the bounding box, making YOLOv2 [8] poor localization performance for small objects. Although YOLOv3 [9] constructs in an attempt to reduce the model’s attention to large objects, the localization performance brought by this BBR loss to the model is still very limited.
I-B Intersection over Union
Intersection over Union [19] (IoU) is used to measure the degree of overlap between the anchor box and the target box in the object detection task. It effectively shields the interference of bounding box size in the form of proportion, which makes the model can well balance the learning of large objects and small objects when (Eq. 2) is used as the BBR loss.
(2) |
However, has another fatal flaw which can be observed in Eq. 3. when there is no overlap between the bounding boxes ( or ), the gradient back-propagated by vanishes. As a result, the width of the overlapping region (Fig. 1) cannot be updated during training.
(3) |
Existing works [15, 16, 17, 18] consider many geometric factors related to the bounding box and construct the penalty term to solve this problem. The existing BBR loss follows the following paradigm:
(4) |
I-C Focusing Mechanism
Fig. 2 shows some low-quality examples in the training data. A well-performing model will produce large when it produces high-quality anchor boxes for low-quality examples. If the monotonic FM assigns these anchor boxes large gradient gains, the learning of the model will be jeopardized.
In [17], Yifan Zhang et al. proposed Focal-EIoU v1 using the non-monotonic FM. The FM of Focal-EIoU v1 is static, which specifies the boundary value of the anchor boxes so that the anchor box with equal to the boundary value has the highest gradient gain. Focal-EIoU v1 does not notice that the quality evaluation of anchor boxes is reflected in mutual comparison. It does not fully exploit the potential of the non-monotonic FM.
We define a dynamic FM by estimating the outlier degree of the anchor box as . Our FM enables the BBR to focus on ordinary-quality anchor boxes by assigning small gradient gains to high-quality anchor boxes with small . At the same time, this mechanism assigns small gradient gains to low-quality anchor boxes with large , which effectively weakens the harm of low-quality examples to the BBR.
We combine such a wise FM with IoU-based loss and call it Wise-IoU (WIoU). To evaluate our proposed method, we incorporate WIoU into the state-of-the-art real-time detector YOLOv7 [11]. The main contributions of this paper are summarized as follows:
-
•
We propose the attention-based loss WIoU v1 for BBR, which achieves a lower regression error than the state-of-the-art SIoU [18] in simulation experiments.
-
•
We design WIoU v2 with monotonic FM, and WIoU v3 with dynamic non-monotonic FM. Benefiting from the wise gradient gain allocation strategy of the dynamic non-monotonic FM, WIoU v3 achieves superior performance.
-
•
We perform a series of detailed studies of the influence of low-quality examples, demonstrating the effectiveness and efficiency of the dynamic non-monotonic FM.
II Related work
II-A Loss Functions for BBR
To compensate for the scale sensitivity of the -norm loss, YOLOv1 [7] weakens the influence of large bounding boxes by performing square root transformation on the size of the bounding boxes. YOLOv3 [9] proposes to construct a penalty term to reduce the competitiveness of large boxes. However, the -norm loss ignores the correlation between the bounding box’s properties, making this type of BBR loss less effective.
To solve the gradient vanishing problem of IoU loss, GIoU [15] uses the penalty term constructed by the smallest enclosing box. DIoU [16] uses the penalty term constructed by distance metric, and CIoU [16] is obtained by adding the aspect ratio metric based on the DIoU. Zhora Gevorgyan constructs SIoU [18] with angle cost, distance cost, and shape cost, which has a faster convergence rate and better performance.
II-B Loss Functions with FM
Cross-entropy loss is widely used in binary classification tasks. However, a salient property of this loss function is that even easy examples produce a large loss value, competing with hard examples. Tsungyi Lin et al. proposed focal loss [20] with the monotonic FM, which effectively reduces the competitiveness of easy examples.
In [17], Yifan Zhang et al. proposed Focal-EIoU v1 with the non-monotonic FM and Focal-EIoU with the monotonic FM. In their experiments, monotonic FM was shown to be a better choice than non-monotonic FM.
The FM of Focal-EIoU v1 is static, which specifies the quality demarcation standard of the anchor boxes. It gives the highest gradient gain to the anchor box when the IoU loss of the anchor box equals the bound value. It is not noticed that the quality evaluation of the anchor boxes is reflected in the intercomparison, so it does not fully exploit the potential of the non-monotonic FM.
III Method
III-A Simulation Experiment
To preliminarily compare each loss function for BBR, we used the simulation experiment proposed by Zhaohui Zheng et al. [16] for evaluation. We generated target boxes (all with area 1/32) at (0.5, 0.5) with 7 aspect ratios (i.e., 1:4, 1:3, 1:2, 1:1, 2:1, 3:1, 4:1). In circular region centered at (0.5, 0.5) with radius , anchor points are uniformly generated. Meanwhile, 49 anchor boxes with 7 scales (i.e., 1/32, 1/24, 3/64, 1/16, 1/12, 3/32, 1/8) and 7 aspect ratios (i.e., 1:4, 1:3, 1:2, 1:1, 2:1, 3:1, 4:1) are placed for each anchor points. Each anchor box needs to be fitted to each target box, and there are 6860000 regression cases. To compare the convergence rate in different periods, we set up the following experimental environments:
We also define the loss value as overall regression cases and optimize it by using the gradient descent algorithm with a learning rate of 0.01.
III-B The Solutions of Gradient Vanishing Problem
Existing BBR losses [15, 16, 17, 18] are addition-based and follow the paradigm as shown in Eq. 4.
Distance IoU: Zhaohui Zheng et al. defined [16] as the normalized distance between central points of two bounding boxes:
(5) |
This term not only solves the gradient vanishing problem of , but also serves as a geometric factor. allows DIoU to make a more intuitive choice when faced with anchor boxes with the same .
(6) |
Meanwhile, provides a negative gradient for the size of the smallest enclosing box, which will make and increase and hinder the overlap between the anchor box and the target box. However, there is no denying that distance metric is indeed an extremely effective solution and becomes a necessary metric for BBR [17, 18]. On this basis, Yifan Zhang et al. increased the punishment for distance metric and proposed EIoU [17]:
(7) |
Complete IoU: On the basis of , Zhaohui Zheng et al. added the consideration of aspect ratio and proposed [16]:
(8) |
where describes the consistency of the aspect ratio:
(9) |
(10) |
Yifan Zhang et al. [17] argued that the irrationality of CIoU is that , which means that cannot provide gradients of the same sign for the width and height of the anchor box. In the previous analysis of DIoU, it can be seen that will produce a negative gradient (Eq. 6). When this negative gradient exactly offsets the gradient generated by on the anchor box, the anchor box will not be optimized. The consideration of aspect ratio by CIoU will break this deadlock (Fig. 3b).
Scylla IoU: Zhora Gevorgyan [18] proved that the center-aligned anchor box would have a faster convergence speed, and constructed SIoU in terms of angle cost, distance cost, and shape cost.
The angle cost describes the minimum angle between the central points’ connection (Fig. 1) and the x-y axis:
(11) |
When the central points are aligned on the x-axis or y-axis, . When the central points’ connection is at 45° to the x-axis, . This term can guide the anchor box to drift to the nearest axis of the target box, reducing the total number of degrees of freedom of BBR.
The distance cost describes the distance between central points, and its penalty is positively correlated with the angle cost. The distance cost is defined as:
(12) |
(13) |
The shape cost describes the size difference between the bounding boxes. When the sizes of the bounding boxes are inconsistent, . And it is defined as:
(14) |
(15) |
is similar to , they both consist of distance cost and shape cost:
(16) |
Since the penalty of on the distance metric increases with the increase of shape cost, the models trained by SIoU have faster convergence speed and lower regression error.
III-C The Proposed Methods
Because training data inevitably contain low-quality examples, geometric factors such as distance and aspect ratio will aggravate the penalty for low-quality examples and thus degrade the generalization performance of the model. A good loss function should weaken the penalty of geometric factors when the anchor box coincides well with the target box, and less intervention in training will make the model get better generalization ability. Based on this, we construct distance attention (Eq. 17) and obtain WIoU v1 with two layers of attention mechanism:
-
•
, which will significantly amplify of the ordinary-quality anchor box.
-
•
, which will significantly reduce of the high-quality anchor box and its focus on the distance between central points when the anchor box coincides well with the target box.
(17) |
where are the size of the smallest enclosing box (Fig. 1). In order to prevent from producing the gradient that hinders convergence, are detached from the computational graph (the superscript indicates this operation). Because it effectively eliminates the factor that hinders convergence, we do not introduce new metrics such as aspect ratio.
Through the simulation experiment mentioned in III-A, we compare the performance of BBR losses without FMs. From the results of Fig. 6, we have the following observations:
-
1.
Among a series of BBR losses mentioned in existing works, SIoU [18] has the fastest convergence rate.
-
2.
For the main cases in the BBR, all BBR losses have extremely similar convergence rates. It follows that the difference in convergence rate mainly comes from non-overlapping bounding boxes. Our proposed attention-based WIoU v1 has the best effect on this part.
Learning from focal loss: Focal loss [20] designs a monotonic FM for cross-entropy, which effectively reduces the contribution of easy examples to the loss value. Thus, the model can focus on hard examples and obtain classification performance improvement. Similarly, we construct monotonic focusing coefficient for .
(18) |
Due to the addition of the focusing coefficient, the gradient back-propagated by WIoU v2 also changes:
(19) |
Note that the gradient gain is . During the model’s training, the gradient gain decreases with the decrease of , resulting in a slow convergence rate in the late stages of training. Therefore, the mean of is introduced as the normalizing factor:
(20) |
where is the running mean with momentum . Dynamically updating the normalizing factor keeps the gradient gain at a high level overall, which solves the problem of slow convergence in the late stages of training.
Dynamic non-monotonic FM: The outlier degree of the anchor box is characterized by the ratio of to :
(21) |
A small outlier degree means that the anchor box is high-quality. We assign a small gradient gain to it in order to focus the BBR on ordinary-quality anchor boxes. Additionally, assigning a small gradient gain to the anchor box with a large outlier degree will effectively prevent large harmful gradients from low-quality examples. We construct a non-monotonic focusing coefficient using and applied it to WIoU v1:
(22) |
where makes when . As shown in Fig. 8, the anchor box will enjoy the highest gradient gain when its outlier degree satisfies ( is a constant value). Since is dynamic, the quality demarcation standard of anchor boxes is also dynamic, which allows WIoU v3 to make the gradient gain allocation strategy that is most in line with the current situation at every moment.
In order to prevent low-quality anchor boxes from being left behind in the early stages of training, we initialize , making the anchor box that enjoys the highest gradient gain. To maintain such a strategy in the early stages of training, it is necessary to set a small momentum to delay the time when approaches the real value . For the training with the number of data batches , we propose to set the momentum as:
(23) |
this setup makes after epochs of training.
In the middle and late stages of training, WIoU v3 assigns small gradient gains to low-quality anchor boxes to reduce the harmful gradients. At the same time, it also focuses on ordinary-quality anchor boxes to improve the localization performance of the model.
IV Experiments
IV-A Experimental Setup
For a fair comparison, all of our experiments were performed on the PyTorch framework [21]. For the dataset, we selected 20 categories in the MS-COCO dataset [22], and selected 28474 images as the training data and 1219 images as the validation data. For the model, we chose YOLOv7-w6 [11] with the layer channel multiple of 0.75 for training. The models are trained for 120 epochs with different BBR losses.
The anchor boxes produced by the detection heads of YOLOv7 mainly contain two parts: the anchor boxes from lead heads (ABLH) and the anchor boxes from auxiliary heads (ABAH). The ABLH tends to have better-fitting results and less information, and the ABAH is the opposite. If we only count the mean of the ABLH, it will cause the gradient gains of the ABAH to disappear gradually, making the FM ignore the rich information amount of the ABAH. Therefore, our mean statistics include the ABLH and the ABAH.
CIoU [16] | 53.03 | 63.14 | 45.20 |
---|---|---|---|
CIoU v2 () | 53.47 (+0.44) | 63.41 (+0.27) | 45.12 |
CIoU v3 () | 53.25 (+0.22) | 63.34 (+0.20) | 44.76 |
CIoU v3 () | 53.68 (+0.65) | 63.34 (+0.20) | 45.10 |
CIoU v3 () | 53.04 | 62.92 | 44.91 |
SIoU [18] | 53.15 | 63.46 | 45.21 |
SIoU v2 () | 53.07 | 63.12 | 44.66 |
SIoU v3 () | 53.27 (+0.12) | 64.13 (+0.67) | 45.15 |
SIoU v3 () | 53.21 | 63.48 | 44.89 |
SIoU v3 () | 53.42 (+0.27) | 63.28 | 45.03 |
EIoU [17] | 53.55 | 63.17 | 45.39 |
Focal-EIoU [17] | 52.88 | 63.37 (+0.20) | 44.75 |
WIoU v1 | 52.82 | 63.15 | 44.87 |
WIoU v2 () | 53.67 (+0.85) | 64.15 (+1.00) | 45.56 (+0.68) |
WIoU v3 () | 53.75 (+1.07) | 64.05 (+0.90) | 45.15 (+0.28) |
WIoU v3 () | 53.91 (+1.09) | 64.16 (+1.01) | 45.44 (+0.57) |
WIoU v3 () | 54.50 (+1.68) | 64.20 (+1.05) | 45.68 (+0.81) |
airplane | train | truck | traffic light | stop sign | parking meter | bench | elephant | |
---|---|---|---|---|---|---|---|---|
SIoU | 74.74 | 78.85 | 51.37 | 54.96 | 58.33 | 58.18 | 40.27 | 86.32 |
Focal-EIoU | 72.92 | 84.26 | 53.37 | 51.95 | 55.10 | 54.55 | 38.13 | 85.17 |
WIoU v3 | 71.58 | 81.82 | 55.94 | 55.45 | 64.58 | 62.26 | 36.81 | 87.83 |
IV-B Ablation Study
We applied the FMs to BBR losses to investigate the effect of the FMs on addition-based losses. Version 2 of these BBR losses used a setting of , to align with the monotonic FM of Focal-EIoU [17]. Their version 3 uses the dynamic non-monotonic FM proposed in this paper.
By comparing BBR losses’ version 2 with the original version (TABLE I), it is known that the monotonic FM negatively affects the performance of both SIoU [18] and EIoU [17]. Because these two punish the distance metric more strongly, larger harmful gradients are synthesized under the action of the monotonic FM. CIoU [16] and WIoU v1 are less penalized for the distance metric, which allows them to effectively weaken the amplification of the harmful gradient by the monotonic FM.
By comparing BBR losses’ version 3 with the original version (TABLE I), we can know that the non-monotonic FM can effectively improve the performance of BBR losses. For each BBR loss, there is a set of unique parameters that can maximize this performance gain.
In addition, we compare the regression results for the anchor boxes (Fig. 5). WIoU v2 with monotonic FM is affected by low-quality examples, resulting in poor predictions. Our WIoU v3 benefits from the dynamic non-monotonic FM, which effectively shields the influence from low-quality examples and achieves ideal predictions.
IV-C Comparison Study
In TABLE I, the performance ranking of the BBR losses’ original version is: EIoU SIoU CIoU WIoU v1. Such an order also agrees with the strength of their penalties for distance metric. However, when the FMs are applied, the performance gains of BBR losses are in the opposite order. In the experiments we performed, the model trained by WIoU v3 achieved the best performance.
We monitor the change in the precision of YOLOv7 during training (Fig. 9). Due to the dynamic non-monotonic FM, our proposed WIoU v3 effectively shields many negative effects during training, so the model’s precision can increase faster.
We compare WIoU v3 with the state-of-the-art BBR losses and obtained several categories with large differences in precision (TABLE II). Benefiting from the ability to identify low-quality examples, the model trained by WIoU v3 has greatly improved precision for some categories. At the same time, the model’s precision for airplanes and benches has decreased.
We noticed that some of the labels of airplanes are controversial (Fig. 7), and some of the selected airplanes lack prominent features such as fuselage. These examples are as hard to learn as low-quality examples, and this part of hard examples is discarded by the FM of WIoU v3. In addition, there are a large number of errors in the labels of benches, and there are also a large number of benches that have not been labeled. This is unfair for models that generalize well and detect more benches.
Learning appropriate knowledge with limited parameters is the key to the success of real-time detectors. WIoU v3 improves the overall performance of the model by weighing the learning of low-quality examples and high-quality examples.
V Conclusion
In this paper, we observe that the low-quality examples in the training data will hinder the generalization of the object detection model. Most existing works are limited to static FM, which does not fully exploit the potential of non-monotonic FM. Although the monotonic FM they advocate can improve localization performance, it does not solve this problem. We propose a dynamic non-monotonic FM that can reduce the competitiveness of high-quality anchor boxes and mask the influence of low-quality examples. When WIoU with this mechanism is applied to the state-of-the-art real-time detector YOLOv7, the model’s generalization ability is effectively improved.
References
- [1] Jing Nie, Yanwei Pang, Shengjie Zhao, Jungong Han, and Xuelong Li, “Efficient selective context network for accurate object detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 9, pp. 3456–3468, 2020.
- [2] Song Peng, Zhenfeng Shao, Xiao Huang, Yi Zhu, Ruiqian Zhang, and Junwei Zha, “Bicsnet: A bidirectional cross-scale backbone for recognition and localization,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
- [3] Jung Uk Kim, Jungsu Kwon, Hak Gu Kim, and Yong Man Ro, “Bbc net: Bounding-box critic network for occlusion-robust object detection,” IEEE transactions on circuits and systems for video technology, vol. 30, no. 4, pp. 1037–1050, 2019.
- [4] Meng Cheng, Hanli Wang, and Yu Long, “Meta-learning-based incremental few-shot object detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 4, pp. 2158–2169, 2021.
- [5] Kaiyou Song, Hua Yang, and Zhouping Yin, “Multi-scale attention deep neural network for fast accurate object detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 10, pp. 2972–2985, 2018.
- [6] Shifeng Zhang, Longyin Wen, Zhen Lei, and Stan Z Li, “Refinedet++: Single-shot refinement neural network for object detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 2, pp. 674–687, 2020.
- [7] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
- [8] Joseph Redmon and Ali Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
- [9] Joseph Redmon and Ali Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
- [10] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
- [11] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao, “Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” arXiv preprint arXiv:2207.02696, 2022.
- [12] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9627–9636.
- [13] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He, “Fcos: A simple and strong anchor-free object detector,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
- [14] Xiang Long, Kaipeng Deng, Guanzhong Wang, Yang Zhang, Qingqing Dang, Yuan Gao, Hui Shen, Jianguo Ren, Shumin Han, Errui Ding, et al., “Pp-yolo: An effective and efficient implementation of object detector,” arXiv preprint arXiv:2007.12099, 2020.
- [15] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 658–666.
- [16] Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren, “Distance-iou loss: Faster and better learning for bounding box regression,” in Proceedings of the AAAI conference on artificial intelligence, 2020, vol. 34, pp. 12993–13000.
- [17] Yi-Fan Zhang, Weiqiang Ren, Zhang Zhang, Zhen Jia, Liang Wang, and Tieniu Tan, “Focal and efficient iou loss for accurate bounding box regression,” Neurocomputing, vol. 506, pp. 146–157, 2022.
- [18] Zhora Gevorgyan, “Siou loss: More powerful learning for bounding box regression,” arXiv preprint arXiv:2205.12740, 2022.
- [19] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang, “Unitbox: An advanced object detection network,” in Proceedings of the 24th ACM international conference on Multimedia, 2016, pp. 516–520.
- [20] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
- [21] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in pytorch,” 2017.
- [22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.