The label smoothing technique (Szegedy et al., 2016) reduces overfitting by preventing the network from assigning the full probability mass to the correct training example (Pereyra et al., 2017). Concretely, each positive example in a batch is assigned a probability of 0.8, while the remaining probability mass is redistributed evenly across the in-batch negative examples.
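To make this concrete, the sketch below (in PyTorch, not the authors' code) shows one way to build such a smoothed target distribution and the corresponding cross-entropy loss. The function name and the convention of a (B, B) similarity matrix whose diagonal holds the positive pairs are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def smoothed_in_batch_loss(scores: torch.Tensor,
                           positive_prob: float = 0.8) -> torch.Tensor:
    """Cross-entropy with label smoothing over in-batch negatives.

    `scores` is a (B, B) similarity matrix where scores[i, i] scores
    example i against its positive, and the off-diagonal entries score
    it against the in-batch negatives. The target for row i places
    `positive_prob` on the positive and spreads the remaining mass
    evenly over the B - 1 negatives.
    """
    batch_size = scores.size(0)
    off_diag_prob = (1.0 - positive_prob) / (batch_size - 1)
    # Smoothed targets: 0.8 on the diagonal, the rest spread uniformly.
    targets = torch.full((batch_size, batch_size), off_diag_prob,
                         device=scores.device)
    targets.fill_diagonal_(positive_prob)
    # Cross-entropy between the smoothed targets and the model's
    # predicted distribution over in-batch candidates.
    log_probs = F.log_softmax(scores, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()

# Example: random similarity scores for a batch of 4 pairs.
loss = smoothed_in_batch_loss(torch.randn(4, 4))
```

Relative to one-hot targets, this caps the reachable probability for the positive at 0.8, so the softmax scores are never pushed toward saturation.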