Web Datasets

A crawl covering much of the open Internet. Over 6 billion documents (current and archived) are available as an Amazon S3 Public Data Set.
File formats: ARC raw content, Text Only, and Metadata
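
As a rough illustration of what the ARC raw content looks like, the sketch below parses a single record header line. It assumes the files follow the Internet Archive ARC version 1 layout, in which each record begins with a line of five space-separated fields (URL, IP address, archive date, content type, length); the listing above does not confirm the ARC version, and the names `ArcRecordHeader` and `parse_arc_header` are hypothetical helpers made up for this example.

```python
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    """Fields of an ARC v1 URL-record header line (assumed layout)."""
    url: str
    ip_address: str
    archive_date: str   # timestamp string, typically YYYYMMDDhhmmss
    content_type: str
    length: int         # byte length of the record body that follows


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one header line into its five space-separated fields."""
    # Split from the right so a URL that itself contains spaces stays whole.
    parts = line.rstrip("\r\n").rsplit(" ", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    url, ip_address, archive_date, content_type, length = parts
    return ArcRecordHeader(url, ip_address, archive_date, content_type, int(length))


# Example with a made-up header line (not taken from the dataset).
header = parse_arc_header(
    "http://example.com/ 192.0.2.1 20090101120000 text/html 1234\n"
)
print(header.url, header.content_type, header.length)
```

The `length` field tells a reader how many bytes of raw content to consume before the next header line, which is how a sequential scan over an ARC file would typically step from record to record.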

