arXiv Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus