In fact, rather than computing the self-attended sequence and then reducing it, we reduce the attention weights accordingly and apply them directly to the input sequence via matrix multiplication to obtain the final reduced representation; that is, we fuse these two operations. This is more computationally efficient, as it avoids another multiplication involving a 3-tensor.
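A minimal sketch of this fusion in PyTorch, assuming the reduction is a mean over query positions (the tensor names, shapes, and the choice of mean are illustrative assumptions, not taken from the source):

```python
import torch

# Hypothetical shapes: batch b, sequence length n, model dimension d.
b, n, d = 2, 16, 64
x = torch.randn(b, n, d)                          # input sequence
attn = torch.softmax(torch.randn(b, n, n), -1)    # attention weights (rows sum to 1)

# Unfused: build the full self-attended sequence (b, n, d), then reduce it
# by averaging over the n query positions.
attended = torch.einsum("bqk,bkd->bqd", attn, x)  # (b, n, d)
reduced_unfused = attended.mean(dim=1)            # (b, d)

# Fused: because the mean is linear, reduce the attention weights first,
# (b, n, n) -> (b, n), then apply them to x with a single matrix multiply.
reduced_attn = attn.mean(dim=1)                   # (b, n)
reduced_fused = torch.einsum("bk,bkd->bd", reduced_attn, x)  # (b, d)

# Both paths agree, but the fused one never materialises the (b, n, d) tensor.
assert torch.allclose(reduced_unfused, reduced_fused, atol=1e-6)
```

The equivalence holds because the reduction is linear, so it commutes with the matrix multiplication; the saving is that the intermediate self-attended 3-tensor is never formed.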