²It has 1/20th the parameters and requires 1/135th the pre-training compute of BERT-Large.