Update README.md
README.md
CHANGED
@@ -113,7 +113,7 @@ The following datasets were used for continual pre-training.
 - [English Wikipedia](https://dumps.wikimedia.org/other/cirrussearch)
 - [Japanese Wikipedia](https://dumps.wikimedia.org/other/cirrussearch)
 - [Laboro ParaCorpus](https://github.com/laboroai/Laboro-ParaCorpus)
-- [Swallow Corpus Version 2](https://arxiv.org/abs/2404.17733)
+- [Swallow Corpus Version 2](https://arxiv.org/abs/2404.17733) (filtered using [Swallow Education Classifier(Wiki-based)](https://huggingface.co/tokyotech-llm/edu-classifier))
 - [The-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids)
 
 ## Risks and Limitations