6 During both training and inference, we zero out the attention scores for pairs of words that are further apart than the set maximum relative attention value.
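As a minimal sketch of this masking step, the snippet below zeroes every entry of a square score matrix whose row and column positions differ by more than a cutoff. The function name `zero_out_distant_scores`, the parameter `max_distance`, and the choice to measure distance in token positions are assumptions made for illustration. The footnote does not say whether the zeroing is applied before or after normalization, so the sketch simply operates on whatever matrix it is given.

```python
def zero_out_distant_scores(scores, max_distance):
    """Return a copy of `scores` with entries for far-apart positions set to 0.0.

    `scores` is a square list of lists where scores[i][j] is the attention
    score from position i to position j. `max_distance` is a hypothetical
    name for the maximum relative attention value.
    """
    return [
        [s if abs(i - j) <= max_distance else 0.0 for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]


# Toy example: four positions with uniform scores and a cutoff of 1.
scores = [[1.0] * 4 for _ in range(4)]
masked = zero_out_distant_scores(scores, max_distance=1)

for row in masked:
    print(row)
```

With a cutoff of 1, each position keeps scores only for itself and its immediate neighbours, so the result is a banded matrix. The same function would be called identically in training and inference, which matches the footnote's statement that the masking applies in both.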