Multi-headed attention requires running computations on 4-tensors of shape [batch, time, head, embedding]. Single-headed attention reduces every intermediate to a 3-tensor, which effectively speeds up training without hurting performance.
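
A minimal sketch of the shape difference, assuming PyTorch and hypothetical dimension sizes (not taken from the source): the single-headed path keeps every intermediate as a 3-tensor, while the multi-headed path splits the embedding into heads and carries an extra dimension through the same matmuls.

```python
import torch

# Hypothetical sizes for illustration only.
batch, time, embedding, heads = 8, 128, 512, 8
head_dim = embedding // heads

q = k = v = torch.randn(batch, time, embedding)

# Single-headed attention: every intermediate is a 3-tensor,
# [batch, time, embedding] or [batch, time, time].
scores = torch.matmul(q, k.transpose(-2, -1)) / embedding ** 0.5     # [batch, time, time]
single_out = torch.matmul(torch.softmax(scores, dim=-1), v)          # [batch, time, embedding]

# Multi-headed attention: the embedding is split across heads, so the
# same matmuls now run over 4-tensors with an extra head dimension.
def split_heads(x):
    return x.view(batch, time, heads, head_dim).transpose(1, 2)      # [batch, head, time, head_dim]

qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)
scores_h = torch.matmul(qh, kh.transpose(-2, -1)) / head_dim ** 0.5  # [batch, head, time, time]
multi_out = torch.matmul(torch.softmax(scores_h, dim=-1), vh)        # [batch, head, time, head_dim]
multi_out = multi_out.transpose(1, 2).reshape(batch, time, embedding)  # merge heads back

print(single_out.shape, multi_out.shape)  # both torch.Size([8, 128, 512])
```

Both paths end with the same output shape; the difference is the extra head dimension (and the extra reshapes and transposes) the multi-headed version carries through every intermediate tensor.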