Question about ROOTS corpus: availability & earlier web data

Thanks for ROOTS — it’s an awesome multilingual dataset that’s super helpful.

I have a few questions:

  1. Is the full ROOTS corpus (beyond the “large initial subset”) publicly downloadable, or is there some other way to access it?

  2. Does anyone know whether ROOTS or related BigScience projects have plans or workflows for collecting web text from before 2008? Any archives, tools, or datasets people have used for that time period?

  3. If I wanted to combine ROOTS with other historical web datasets (or reconstruct earlier web snapshots), would the preprocessing / filtering tools from the data-preparation GitHub repo be helpful for that?

Thanks a lot for any pointers or suggestions!

Best,
Patrick

There doesn’t seem to be a full version of the data publicly available…

There’s far too much data that once existed on the internet but is now lost. Internet archaeology is tough…