Educational resources: multilingual materials, mathematical datasets.
nyuuzyou PRO
nyuuzyou
Recent Activity
posted an update about 9 hours ago
🇨🇳 Gitee Code Dataset - The Missing Piece of the Stack
https://huggingface.co/datasets/nyuuzyou/gitee-code
Gitee is not included in the Software Heritage archive, so it is currently missing from datasets like The Stack. This release fills that gap: it is the largest Chinese code dataset and one of the largest code corpora overall.
- 819,472,785 files from 3,105,923 repositories
- 536 GB compressed Parquet storage
- 554 programming languages
- Extensive quality filtering: Removed vendor code, artifacts, and generated files
- Rich Chinese language understanding: High volume of Chinese comments and docs
Huge thanks to Hugging Face for the storage grant that made hosting this (and all my other datasets) possible!
I have also already released several other new code datasets and rolled out quality-of-life improvements for older ones. I will be posting about those throughout the week.
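For readers who want to look at a few records without fetching all 536 GB, here is a minimal sketch using the Hugging Face `datasets` library in streaming mode. The repository id comes from the URL above. The `train` split name is an assumption and may need adjusting, and `stream_gitee_code` and `take` are hypothetical helper names. No column names are assumed, since the post does not list them.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def stream_gitee_code(split: str = "train") -> Iterator[Dict[str, Any]]:
    """Lazily iterate over records of nyuuzyou/gitee-code.

    The default split name "train" is an assumption, not taken from the post.
    """
    # Imported inside the function so `take` stays usable without the library.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of downloading
    # every Parquet shard up front.
    dataset = load_dataset("nyuuzyou/gitee-code", split=split, streaming=True)
    return iter(dataset)


def take(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most the first n records from any iterable of records."""
    return list(islice(records, n))
```

Calling `take(stream_gitee_code(), 3)` should return up to three records as dictionaries, and printing their keys is a quick way to discover the actual column names before writing any filtering logic.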
new activity 4 days ago
nyuuzyou/joyreactor: [bot] Conversion to Parquet