Curating datasets directly on the Hub
You can now edit datasets directly on the Hub. This is huge - no more download/edit/upload cycles for fixes and quick data curation. It's early days, but this will fundamentally change dataset workflows for AI.
The most interesting part, in my opinion, is collaborative dataset curation. Multiple people across your organization can make commits to the same dataset, review changes, and improve data quality together - all with full versioning and traceability.
In this post, I'll walk you through how it works with a practical example.
Requirements
Currently, you can edit datasets on the Hub if:
- The dataset contains a single CSV file (more formats coming).
- You have write access (your personal datasets or any org/dataset where you have write permissions).
- It has textual (string) columns.
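If you want a rough idea of whether a dataset is a candidate before opening it on the Hub, you can run a quick local check. The sketch below is only an approximation: the helper names (count_csv_files, string_columns) are hypothetical, the file list and sample rows are made up for illustration, and the "not parseable as a number" heuristic is an assumption rather than the Hub's actual type inference. Write access cannot be checked this way.

```python
import csv
import io


def count_csv_files(filenames):
    """Count the CSV files in a repo file listing (hypothetical helper)."""
    return sum(1 for name in filenames if name.lower().endswith(".csv"))


def _is_number(value):
    """Return True if the cell text parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def string_columns(csv_text):
    """List columns holding at least one non-empty, non-numeric value.

    Assumption: this is a rough stand-in for "textual (string) columns",
    not the type inference the Hub actually performs.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    return [
        col
        for col in columns
        if any(row[col] and not _is_number(row[col]) for row in rows)
    ]


# Made-up repo listing and sample data, for illustration only.
repo_files = ["README.md", "train.csv"]
sample = "text,label,score\nGreat movie,positive,0.9\nToo long,negativ,0.2\n"

print("single CSV:", count_csv_files(repo_files) == 1)
print("string columns:", string_columns(sample))
```

Here only the text and label columns would count as editable candidates, since every value in score parses as a number.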
Walkthrough: fixing dataset errors
Say you or your team published a sentiment analysis dataset, and someone spots errors. Here's how you fix them.
- Go to the dataset page.
- Go to Data Studio to inspect the dataset. For example, in the screenshot below, you can spot an error in the label distribution, with some values "negativ" instead of "negative".
- If you have write access, you will see a Toggle Edit Mode button. Click it to edit individual cells in string columns, as in the screenshot below:
- Once you're happy with your edits, click Commit to submit your changes. This commits your changes to the dataset repo and lets you write a descriptive commit message:
This is the resulting change in the dataset, which lets you trace back to all curation actions:
- Once you're done with a round of edits, you can make more changes iteratively. Say you identify mislabelled examples (e.g., "positive" instead of "negative"): just edit the target cells and commit your changes with a new message:
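Before you start editing cells, it can help to know which values to look for. The sketch below works on a local copy of the data using only the Python standard library: it counts the label distribution and suggests corrections for near-miss spellings. The sample rows, the VALID_LABELS list, the helper names, and the 0.8 similarity cutoff are all assumptions for illustration. This is not how Data Studio computes its distribution, and it does not touch the Hub.

```python
import csv
import io
from collections import Counter
from difflib import get_close_matches

# Assumption: the label set for this hypothetical sentiment dataset.
VALID_LABELS = ["positive", "negative"]


def label_distribution(rows, column="label"):
    """Count how often each label value occurs."""
    return Counter(row[column] for row in rows)


def suggest_fixes(distribution, valid_labels=VALID_LABELS, cutoff=0.8):
    """Map each unexpected label to its closest valid label, if any."""
    fixes = {}
    for value in distribution:
        if value in valid_labels:
            continue
        match = get_close_matches(value, valid_labels, n=1, cutoff=cutoff)
        if match:
            fixes[value] = match[0]
    return fixes


# Made-up sample standing in for a downloaded copy of the CSV.
sample = (
    "text,label\n"
    "Loved it,positive\n"
    "Terrible plot,negativ\n"
    "Waste of time,negative\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))

distribution = label_distribution(rows)
fixes = suggest_fixes(distribution)

# Apply the suggested fixes to a copy of the rows, leaving the input untouched.
cleaned = [{**row, "label": fixes.get(row["label"], row["label"])} for row in rows]

print(distribution)
print(fixes)
```

A check like this only catches misspelled labels. Mislabelled examples, where a valid label is attached to the wrong text, still need a human to read the row, which is where editing cells directly comes in.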
What's next
This is just the beginning for dataset curation on the Hub. The team is actively working on what comes next. I'm personally excited to see how AI models can help you curate data faster and better directly on the Hub and in your browser. Stay tuned.
And get involved!
Try it out and share your feedback. Leave a comment on this blog post.