Generating Croissant Metadata for Custom Image Dataset

Dear Hugging Face Team,

I am currently hosting a 3D reconstruction image dataset on the Hugging Face Hub: yinyue27/RefRef. The dataset is in Blender format and contains multiple subsets, where each subset corresponds to a scene. Each scene is further divided into three splits: train, validation, and test, and includes three JSON files that store the image paths for each split.
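For reference, each scene follows the usual Blender-format layout (a sketch, assuming the standard NeRF Blender convention; file names are illustrative):

scene_name/
  train/  val/  test/           # rendered images, e.g. r_0.png, r_1.png, ...
  transforms_train.json         # camera poses + image paths for the train split
  transforms_val.json
  transforms_test.json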

Since Hugging Face could not automatically recognise the dataset’s structure, I implemented a custom data loader script. However, I noticed that the Croissant metadata is not being generated automatically.

Would it be possible for someone to help review my loader script and provide guidance on properly structuring the dataset so that the Croissant metadata can be generated?

I appreciate your time and any advice you can offer.


I wonder if Croissant is not automatically generated for dataset repositories that use loading scripts…? @lhoestq

from datasets import load_dataset
ds = load_dataset("yinyue27/RefRef", "single-convex", trust_remote_code=True) # error
print(ds)

The dataset viewer automatically generates the metadata in Croissant format (JSON-LD) for every dataset on the Hugging Face Hub. It lists the dataset’s name, description, URL, and the distribution of the dataset as Parquet files, including the columns’ metadata. The Croissant metadata is available for all the datasets that can be converted to Parquet format.


Hi John666,

Thanks for looking into my problem! If you modify your script a little by adding the 'scene' keyword I customised, it will load the dataset correctly:

from datasets import load_dataset
ds = load_dataset("yinyue27/RefRef", split="textured_sphere_scene", name="multiple-non-convex", scene="beaker", trust_remote_code=True)
print(ds)

It worked!

from datasets import load_dataset
ds = load_dataset("yinyue27/RefRef", split="textured_sphere_scene", name="multiple-non-convex", scene="beaker", trust_remote_code=True)
print(ds)
#Generating textured_sphere_scene split: 300 examples [00:06, 47.17 examples/s]
#Generating textured_cube_scene split: 300 examples [00:05, 55.87 examples/s]
#Dataset({
#    features: ['image', 'depth', 'mask', 'transform_matrix', 'rotation'],
#    num_rows: 300
#})

#ds.push_to_hub("yinyue27/RefRef_parquet") # works with the Dataset Viewer, but only uploads part of the dataset...

Yes, so I assume the data loader is good to use for the dataset. Any clues on how I can generate the croissant metadata? :thinking:


The croissant metadata on HF is generated automatically for datasets in supported formats like Parquet or ImageFolder (a folder of images plus a metadata file). If you convert your dataset to Parquet, or structure it as an ImageFolder, the croissant metadata will be available.

There is no way to automatically get a croissant metadata file for a dataset based on a python script.
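For reference, an ImageFolder repo looks roughly like this (a sketch; file names are hypothetical, the metadata file needs a file_name column with relative paths, and the other columns become dataset features):

my_dataset/
  metadata.csv                  # file_name,label,...
  train/
    r_0.png
    r_1.png
  test/
    r_0.png

With that layout, load_dataset("OWNER/REPO") works without a loading script, the viewer can publish Parquet, and the croissant metadata becomes available.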


Hi John, I’m now trying to load my dataset and use push_to_hub to push it to a new dataset. This is the script I’m using:

from datasets import load_dataset

dataset = load_dataset(
    # path="eztao/RefRef_test",
    path="yinyue27/RefRef",
    name="single-convex",
    scene="ball",
    split="textured_sphere_scene",
    trust_remote_code=True
)

print(dataset)  # Should show the dataset structure

dataset.push_to_hub("eztao/RefRef_parquet")

But I’m getting this error:

Dataset({
   features: ['image', 'depth', 'mask', 'transform_matrix', 'rotation'],
   num_rows: 300
})
Traceback (most recent call last):
 File "/home/u7543832/PhD/DataBuilder.py", line 14, in <module>
   dataset.push_to_hub("eztao/RefRef_parquet")
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5549, in push_to_hub
   additions, uploaded_size, dataset_nbytes = self._push_parquet_shards_to_hub(
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5349, in _push_parquet_shards_to_hub
   dataset_nbytes = self._estimate_nbytes()
                    ^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5177, in _estimate_nbytes
   table_visitor(table, extra_nbytes_visitor)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2378, in table_visitor
   _visit(table[name], feature)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2358, in _visit
   _visit(chunk, feature)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2362, in _visit
   function(array, feature)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5172, in extra_nbytes_visitor
   size = xgetsize(x["path"])
          ^^^^^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 769, in xgetsize
   size = fs.size(main_hop)
          ^^^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/fsspec/spec.py", line 696, in size
   return self.info(path).get("size", None)
          ^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/huggingface_hub/hf_file_system.py", line 727, in info
   _raise_file_not_found(path, None)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/huggingface_hub/hf_file_system.py", line 1136, in _raise_file_not_found
   raise FileNotFoundError(msg) from err
FileNotFoundError: datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png

It seems that I can load the dataset (I also plotted an image to make sure of it), and the file path looks correct, but I keep getting this error. Could you help me with it? Thanks!


(I’m assuming you’re passing the token using login() or something similar)
The version of the library in charge of serialization and uploading may be out of date.

pip install -U huggingface_hub

I used huggingface-cli login and generated a token to log in, and I'm still getting the error after updating huggingface_hub with pip (I'm already at the latest version, actually) :smiling_face_with_tear:


self.info(path).get("size", None)

It’s strange that it returns None… in other words, it means that this path cannot be found.
I think it’s a bug, but I wonder what kind of bug it is…
It’s different from the case below, and it’s probably looking for a local path, so it’s not related to networking, is it…?

If it happens with .save_to_disk(), it’s definitely a bug.


Well, I do get the same error when trying to run dataset.save_to_disk("./data/RefRef_test_ball") :anguished_face:, and I printed out the main_hop to confirm that it’s a remote path instead of a local one: hf://datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png

 File "/home/u7543832/PhD/DataBuilder.py", line 14, in <module>
    dataset.save_to_disk("./data/RefRef_test_ball")
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 1476, in save_to_disk
    dataset_nbytes = self._estimate_nbytes()
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5177, in _estimate_nbytes
    table_visitor(table, extra_nbytes_visitor)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2378, in table_visitor
    _visit(table[name], feature)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2358, in _visit
    _visit(chunk, feature)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2362, in _visit
    function(array, feature)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5172, in extra_nbytes_visitor
    size = xgetsize(x["path"])
           ^^^^^^^^^^^^^^^^^^^
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 769, in xgetsize
    size = fs.size(main_hop)
           ^^^^^^^^^^^^^^^^^
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/fsspec/spec.py", line 696, in size
    return self.info(path).get("size", None)
           ^^^^^^^^^^^^^^^
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/huggingface_hub/hf_file_system.py", line 727, in info
    _raise_file_not_found(path, None)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/huggingface_hub/hf_file_system.py", line 1136, in _raise_file_not_found
    raise FileNotFoundError(msg) from err
FileNotFoundError: datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png

If this is a bug, I'll have to generate the croissant metadata manually then. :face_with_monocle: Unfortunately, I couldn't find any guide on how to do this. Any advice?


I couldn't find any documentation on writing Croissant metadata manually…
There is the source code, though…

For now, I think I've figured out where the bug occurs: the relative path is causing the problem.

# FileNotFoundError: datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png

# /ball_sphere/./train/r_0.png  <= the path concatenation fails here (note the stray "./")
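If that is the cause, one possible workaround is to normalise the joined path in the loader before using it (a sketch; the path is copied from the error message):

import posixpath

p = "image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png"
print(posixpath.normpath(p))
# image_data/textured_sphere_scene/single-convex/ball_sphere/train/r_0.png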

Great script @John6666! I'm currently trying to port a beautiful dataset, https://idr.openmicroscopy.org/study/idr0012/, to Croissant, and might use your solution.

Did you find a way to manually supply a Croissant descriptor? I have a script that converts my metadata to JSON-LD, so I thought I'd take it from there and write a manual Croissant JSON-LD afterwards.


I tried generating a script to write Croissant files for existing Hugging Face datasets. It seems to work for now, but it probably needs improvement…


Goal: generate a valid croissant.json for an existing Hugging Face dataset repo that does not already expose Croissant.

Summary:

  • First try auto-Croissant. If your repo is Parquet or ImageFolder-like, Hugging Face exposes /croissant automatically. If it’s not exposed, author croissant.json yourself, validate with mlcroissant, and commit it at the repo root. (Hugging Face)

Background you need

  • Croissant is JSON-LD for ML datasets. It wraps four layers: metadata, resources, structure, ML semantics. Stable spec is 1.0; the reference repo’s latest release is v1.0.22 (2025-08-25). (docs.mlcommons.org)
  • Hugging Face publishes Croissant for datasets that can be converted to Parquet or follow ImageFolder. The API endpoint is documented and the JSON-LD is also embedded in dataset pages. (Hugging Face)

Way 0: check auto-Croissant (fast path)

  • Try either endpoint. Use whichever you prefer.

    • https://huggingface.co/api/datasets/<OWNER>/<REPO>/croissant (documented by MLCommons as an HF API example). (docs.mlcommons.org)
    • https://datasets-server.huggingface.co/croissant?dataset=<OWNER>/<REPO> (dataset viewer API doc). (Hugging Face)
  • If it returns JSON-LD, you are done. If it 404s, your repo likely isn’t Parquet/ImageFolder-convertible; convert, or proceed to manual (a quick probe sketch follows). (Hugging Face)
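A minimal probe sketch (uses the documented dataset-viewer endpoint; OWNER/REPO is a placeholder for your dataset id):

import requests

repo_id = "OWNER/REPO"
resp = requests.get(f"/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co%2Fcroissant%3Fdataset%3D%7Brepo_id%7D")
if resp.ok:
    print(resp.json().get("name"))  # auto-Croissant exists; reuse it
else:
    print(resp.status_code)  # e.g. 404: repo is not Parquet/ImageFolder-convertible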

Way 1: author and commit croissant.json (manual, reliable)

What must be in the file

Minimum, with names per spec:

  • @context, @type: "Dataset", name, url, conformsTo, distribution (list of cr:FileObject or cr:FileSet), and one or more recordSet with cr:Fields mapping columns to sources. See the spec’s minimal example using contentUrl, encodingFormat, and optional sha256. Use conformsTo: "http://mlcommons.org/croissant/1.0". (docs.mlcommons.org)
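A minimal sketch of such a file, assuming a single CSV at the repo root (all names and URLs are placeholders):

{
  "@context": { "@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/" },
  "@type": "Dataset",
  "name": "my-dataset",
  "url": "https://huggingface.co/datasets/OWNER/REPO",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "distribution": [{
    "@type": "cr:FileObject",
    "@id": "data.csv",
    "name": "data.csv",
    "contentUrl": "https://huggingface.co/datasets/OWNER/REPO/resolve/main/data.csv",
    "encodingFormat": "text/csv"
  }],
  "recordSet": [{
    "@type": "cr:RecordSet",
    "@id": "samples",
    "name": "samples",
    "field": [{
      "@type": "cr:Field",
      "@id": "samples/label",
      "name": "label",
      "dataType": "Text",
      "source": { "fileObject": { "@id": "data.csv" }, "extract": { "column": "label" } }
    }]
  }]
}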

Where to put it

  • Commit croissant.json at the repo root. Many public datasets do this. Examples: CharXiv, TerraIncognita, worldmodelbench. Inspect their structure for patterns. (Hugging Face)

How to generate it quickly (scriptable)

Use the Hub to list files, build distribution, and infer simple recordSets from CSV headers.

# Generates a skeleton croissant.json for an HF dataset repo.
# References:
# - HfApi/HfFileSystem and upload: https://huggingface.co/docs/huggingface_hub/guides/upload
# - Build raw URLs with hf_hub_url: https://huggingface.co/docs/huggingface_hub/guides/download
# - Croissant minimal keys example: https://docs.mlcommons.org/croissant/
from huggingface_hub import HfFileSystem, hf_hub_url
import json, re
import pandas as pd

REPO = "OWNER/REPO"  # e.g., "BGLab/TerraIncognita"
fs = HfFileSystem()

# List every file in the dataset repo as repo-relative paths (fs.find returns files only)
files = [p.removeprefix(f"datasets/{REPO}/") for p in fs.find(f"datasets/{REPO}")]

# Buckets
csvs = [f for f in files if f.lower().endswith((".csv", ".tsv"))]
images = [f for f in files if re.search(r"\.(jpg|jpeg|png|tif|tiff|bmp|gif)$", f, re.I)]

# Build distribution
dist = []
if images:
    # Group globs per top folder
    topdirs = sorted({f.split("/")[0] for f in images if "/" in f} or {"."})
    for d in topdirs:
        includes = f"{d}/**/*" if d != "." else "**/*"
        dist.append({
            "@type": "cr:FileSet",
            "@id": f"images-{d}".replace("/", "_"),
            "name": f"images-{d}",
            "encodingFormat": "image/*",
            "includes": includes
        })

for c in csvs:
    dist.append({
        "@type": "cr:FileObject",
        "@id": c,
        "name": c,
        "contentUrl": hf_hub_url(repo_id=REPO, filename=c, repo_type="dataset"),  # raw resolve URL
        "encodingFormat": "text/csv" if c.lower().endswith(".csv") else "text/tab-separated-values",
    })

# Basic recordSet from first CSV (optional but useful)
record_sets = []
if csvs:
    url = hf_hub_url(repo_id=REPO, filename=csvs[0], repo_type="dataset")
    sep = "\t" if csvs[0].lower().endswith(".tsv") else ","  # first file may be TSV
    cols = list(pd.read_csv(url, nrows=0, sep=sep).columns)
    fields = [{
        "@type": "cr:Field",
        "@id": f"samples/{col}",
        "name": col,
        "dataType": "Text",
        "source": {"fileObject": {"@id": csvs[0]}, "extract": {"column": col}}
    } for col in cols]
    record_sets.append({"@type": "cr:RecordSet", "@id": "samples", "name": "samples", "field": fields})

croissant = {
  "@context": { "@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/" },
  "@type": "Dataset",
  "name": REPO.split("/")[-1],
  "url": f"https://huggingface.co/datasets/{REPO}",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "distribution": dist,
  "recordSet": record_sets
}

with open("croissant.json", "w") as f:
    json.dump(croissant, f, indent=2)
print("Wrote croissant.json")

Notes:

  • Use hf_hub_url(..., filename=..., repo_type="dataset") to produce a raw .../resolve/... URL. Do not point contentUrl at the web UI. (Hugging Face)
  • Prefer cr:FileSet with includes for folders and cr:FileObject with contentUrl for single files. The spec’s minimal example shows both patterns and sha256 support. (docs.mlcommons.org)

Validate and iterate

  • Install and validate:

    • pip install "mlcroissant[parquet]" then python -c "import mlcroissant; print('ok')"
    • Load/validate in code or from CLI. The library docs show loading a Croissant URL and are kept current in HF docs. (Hugging Face)
  • Practical loop:

    1. Generate croissant.json.
    2. Validate by constructing an mlcroissant.Dataset(jsonld=...) and iterating records, as in the sketch after this list. The README and docs show exact calls. (github.com)
    3. Commit croissant.json to the dataset repo via upload_file or CLI. (Hugging Face)
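A minimal validation sketch, assuming croissant.json sits in the working directory and defines a "samples" RecordSet as above:

import mlcroissant as mlc

ds = mlc.Dataset(jsonld="croissant.json")  # parses and validates the JSON-LD
for i, record in enumerate(ds.records(record_set="samples")):
    print(record)  # inspect a few records to confirm the field mappings
    if i >= 2:
        break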

Optional GUI instead of code

  • Use the Croissant Editor Space. It infers resources and RecordSets from your files and lets you export JSON-LD. Good for images or nested folders. (Hugging Face)

Way 2: restructure to get auto-Croissant (no manual file)

  • Convert to Parquet or organize as ImageFolder (+ simple metadata file). HF will auto-publish Parquet and then /croissant appears. Docs and maintainer posts confirm this behavior in 2025. (Hugging Face)

Working example repos to copy from

  • princeton-nlp/CharXiv uses a top-level croissant.json with recordSet and RAI keys. Shows complete structure. Updated 2024-06-11. (Hugging Face)
  • BGLab/TerraIncognita shows CSV-based distribution and fields. Updated ~5 months ago. (Hugging Face)
  • Efficient-Large-Model/worldmodelbench shows recent practice. Updated ~5 months ago. (Hugging Face)

Common pitfalls to avoid

  • Missing required top-level keys or wrong vocabulary prefix. Follow the minimal example in the spec. (docs.mlcommons.org)
  • contentUrl pointing to the web UI instead of the raw .../resolve/.... Use hf_hub_url. (Hugging Face)
  • Expecting your manual croissant.json to change HF’s /croissant endpoint. That endpoint is auto-generated from Parquet/ImageFolder. Your file serves external tools and users. (Hugging Face)
  • Big private repos: dataset viewer Parquet publishing has limits and requirements. See size and visibility rules. (Hugging Face)

End-to-end checklist (redundant on purpose)

  1. Probe /croissant. If present, reuse it. If absent, proceed. (Hugging Face)

  2. Generate croissant.json:

    • List files from the repo.
    • Build distribution with cr:FileSet globs and cr:FileObject URLs from hf_hub_url.
    • Create a recordSet per main table with cr:Fields mapped by column. (Hugging Face)
  3. Validate with mlcroissant and load a few records. Fix errors. (Hugging Face)

  4. Commit croissant.json at the repo root with upload_file or CLI (see the sketch after this list). (Hugging Face)

  5. Optional: rebuild the repo as Parquet or ImageFolder to get HF auto-Croissant as well. (Hugging Face)
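For step 4, a minimal upload sketch (assumes you are already logged in, e.g. via huggingface-cli login; OWNER/REPO is a placeholder):

from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="croissant.json",  # the local file generated above
    path_in_repo="croissant.json",     # committed at the repo root
    repo_id="OWNER/REPO",
    repo_type="dataset",
)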

Supplemental materials (curated, dated)

Spec and core

  • MLCommons Croissant docs and minimal example. Accessed 2025-10-15. (docs.mlcommons.org)
  • Croissant repo README and examples. Latest release v1.0.22 on 2025-08-25. (github.com)

Hugging Face APIs

  • Get Croissant metadata via dataset viewer. Accessed 2025-10-15. (Hugging Face)
  • Dataset viewer overview and Parquet auto-publish rules. Updated 2024-2025. (Hugging Face)
  • mlcroissant usage on HF docs. Accessed 2025-10-15. (Hugging Face)
  • Upload files programmatically and with CLI. Updated 2024-07-22 and 2024-??. (Hugging Face)
  • Build raw URLs with hf_hub_url. Accessed 2025-10-15. (Hugging Face)

HF forum guidance

  • Auto-Croissant for Parquet/ImageFolder. Posts from 2025-04-14 and 2025-05-12. (Hugging Face Forums)

Editor

  • Croissant Editor Space. Accessed 2025-10-15. (Hugging Face)

Great work @John6666, thank you so much! I'll try it out soon.
