⁷We also trained a discriminator that learns from a random 15% of the input tokens distinct from the subset that was originally masked out; this model performed slightly worse.