Generating Croissant Metadata for Custom Image Dataset

Dear Hugging Face Team,

I am currently hosting a 3D reconstruction image dataset on the Hugging Face Hub: yinyue27/RefRef. The dataset is in Blender format and contains multiple subsets, where each subset corresponds to a scene. Each scene is further divided into three splits: train, validation, and test, and includes three JSON files that store the image paths for each split.
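For reference, each scene follows the usual Blender-format layout (a sketch, assuming the standard NeRF Blender convention; file names are illustrative):

scene_name/
  train/  val/  test/           # rendered images, e.g. r_0.png, r_1.png, ...
  transforms_train.json         # camera poses + image paths for the train split
  transforms_val.json
  transforms_test.json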

Since Hugging Face could not automatically recognise the dataset’s structure, I implemented a custom data loader script. However, I noticed that the Croissant metadata is not being generated automatically.

Would it be possible for someone to help review my loader script and provide guidance on properly structuring the dataset so that the Croissant metadata can be generated?

I appreciate your time and any advice you can offer.


I wonder if Croissant is not automatically generated for dataset repositories that use loading scripts…? @lhoestq

from datasets import load_dataset
ds = load_dataset("yinyue27/RefRef", "single-convex", trust_remote_code=True) # error
print(ds)

The dataset viewer automatically generates the metadata in Croissant format (JSON-LD) for every dataset on the Hugging Face Hub. It lists the dataset’s name, description, URL, and the distribution of the dataset as Parquet files, including the columns’ metadata. The Croissant metadata is available for all the datasets that can be converted to Parquet format.


Hi John666,

Thanks for looking into my problem! If you modify your script a little by adding the 'scene' keyword I customised, it will load the dataset correctly:

from datasets import load_dataset
ds = load_dataset("yinyue27/RefRef", split="textured_sphere_scene", name="multiple-non-convex", scene="beaker", trust_remote_code=True)
print(ds)

It worked!

from datasets import load_dataset
ds = load_dataset("yinyue27/RefRef", split="textured_sphere_scene", name="multiple-non-convex", scene="beaker", trust_remote_code=True)
print(ds)
#Generating textured_sphere_scene split: 300 examples [00:06, 47.17 examples/s]
#Generating textured_cube_scene split: 300 examples [00:05, 55.87 examples/s]
#Dataset({
#    features: ['image', 'depth', 'mask', 'transform_matrix', 'rotation'],
#    num_rows: 300
#})

#ds.push_to_hub("yinyue27/RefRef_parquet") # works with the Dataset Viewer, but only uploads part of the dataset...

Yes, so I assume the data loader is good to use for the dataset. Any clues on how I can generate the croissant metadata? :thinking:


The croissant metadata on HF is generated automatically for datasets in supported formats like Parquet or ImageFolder (a folder of images plus a metadata file). If you convert your dataset to Parquet, or structure it as an ImageFolder, the croissant metadata will be available.

There is no way to automatically get a croissant metadata file for a dataset based on a python script.
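For reference, an ImageFolder repo looks roughly like this (a sketch; file names are hypothetical, the metadata file needs a file_name column with relative paths, and the other columns become dataset features):

my_dataset/
  metadata.csv                  # file_name,label,...
  train/
    r_0.png
    r_1.png
  test/
    r_0.png

With that layout, load_dataset("OWNER/REPO") works without a loading script, the viewer can publish Parquet, and the croissant metadata becomes available.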


Hi John, I’m now trying to load my dataset and use push_to_hub to push it to a new dataset. This is the script I’m using:

from datasets import load_dataset

dataset = load_dataset(
    # path="eztao/RefRef_test",
    path="yinyue27/RefRef",
    name="single-convex",
    scene="ball",
    split="textured_sphere_scene",
    trust_remote_code=True
)

print(dataset)  # Should show the dataset structure

dataset.push_to_hub("eztao/RefRef_parquet")

But I’m getting this error:

Dataset({
   features: ['image', 'depth', 'mask', 'transform_matrix', 'rotation'],
   num_rows: 300
})
Traceback (most recent call last):
 File "/home/u7543832/PhD/DataBuilder.py", line 14, in <module>
   dataset.push_to_hub("eztao/RefRef_parquet")
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5549, in push_to_hub
   additions, uploaded_size, dataset_nbytes = self._push_parquet_shards_to_hub(
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5349, in _push_parquet_shards_to_hub
   dataset_nbytes = self._estimate_nbytes()
                    ^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5177, in _estimate_nbytes
   table_visitor(table, extra_nbytes_visitor)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2378, in table_visitor
   _visit(table[name], feature)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2358, in _visit
   _visit(chunk, feature)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2362, in _visit
   function(array, feature)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5172, in extra_nbytes_visitor
   size = xgetsize(x["path"])
          ^^^^^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 769, in xgetsize
   size = fs.size(main_hop)
          ^^^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/fsspec/spec.py", line 696, in size
   return self.info(path).get("size", None)
          ^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/huggingface_hub/hf_file_system.py", line 727, in info
   _raise_file_not_found(path, None)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/huggingface_hub/hf_file_system.py", line 1136, in _raise_file_not_found
   raise FileNotFoundError(msg) from err
FileNotFoundError: datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png

It seems that I can load the dataset (I also plotted an image to make sure of it), and the file path looks correct, but I keep getting this error. Could you help me with it? Thanks!


(I’m assuming you’re passing the token using login() or something similar)
The version of the library in charge of serialization and uploading may be out of date.

pip install -U huggingface_hub

I used huggingface-cli login and generated a token to log in, and I'm still getting the error after updating huggingface_hub with pip (I'm already at the latest version, actually) :smiling_face_with_tear:


self.info(path).get("size", None)

It’s strange that it returns None… in other words, it means that this path cannot be found.
I think it’s a bug, but I wonder what kind of bug it is…
It’s different from the case below, and it’s probably looking for a local path, so it’s not related to networking, is it…?

If it happens with .save_to_disk(), it’s definitely a bug.


Well, I do get the same error when trying to run dataset.save_to_disk("./data/RefRef_test_ball") :anguished_face:, and I printed out the main_hop to confirm that it’s a remote path instead of a local one: hf://datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png

 File "/home/u7543832/PhD/DataBuilder.py", line 14, in <module>
    dataset.save_to_disk("./data/RefRef_test_ball")
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 1476, in save_to_disk
    dataset_nbytes = self._estimate_nbytes()
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5177, in _estimate_nbytes
    table_visitor(table, extra_nbytes_visitor)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2378, in table_visitor
    _visit(table[name], feature)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2358, in _visit
    _visit(chunk, feature)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2362, in _visit
    function(array, feature)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5172, in extra_nbytes_visitor
    size = xgetsize(x["path"])
           ^^^^^^^^^^^^^^^^^^^
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 769, in xgetsize
    size = fs.size(main_hop)
           ^^^^^^^^^^^^^^^^^
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/fsspec/spec.py", line 696, in size
    return self.info(path).get("size", None)
           ^^^^^^^^^^^^^^^
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/huggingface_hub/hf_file_system.py", line 727, in info
    _raise_file_not_found(path, None)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/huggingface_hub/hf_file_system.py", line 1136, in _raise_file_not_found
    raise FileNotFoundError(msg) from err
FileNotFoundError: datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png

If this is a bug, I'll have to generate the croissant metadata manually then. :face_with_monocle: Unfortunately, I couldn't find any guide on how to do this. Any advice?


I couldn't find any documentation on writing Croissant metadata manually…
There is the source code, though…

For now, I think I've figured out where the bug occurs: the relative path is causing the problem.

# FileNotFoundError: datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png

# /ball_sphere/./train/r_0.png  <= the path concatenation fails here (note the stray "./")
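If that is the cause, one possible workaround is to normalise the joined path in the loader before using it (a sketch; the path is copied from the error message):

import posixpath

p = "image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png"
print(posixpath.normpath(p))
# image_data/textured_sphere_scene/single-convex/ball_sphere/train/r_0.png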

Great script @John6666! I'm currently trying to port a beautiful dataset, https://idr.openmicroscopy.org/study/idr0012/, to Croissant, and might use your solution.

Did you find a way to manually supply a Croissant descriptor? I have a script that converts my metadata to JSON-LD, so I thought I'd take it from there and write a manual Croissant JSON-LD afterwards.


I tried generating a script to write Croissant files for existing Hugging Face datasets. It seems to work for now, but it probably needs improvement…


Goal: generate a valid croissant.json for an existing Hugging Face dataset repo that does not already expose Croissant.

Summary:

  • First try auto-Croissant. If your repo is Parquet or ImageFolder-like, Hugging Face exposes /croissant automatically. If it’s not exposed, author croissant.json yourself, validate with mlcroissant, and commit it at the repo root. (Hugging Face)

Background you need

  • Croissant is JSON-LD for ML datasets. It wraps four layers: metadata, resources, structure, ML semantics. Stable spec is 1.0; the reference repo’s latest release is v1.0.22 (2025-08-25). (docs.mlcommons.org)
  • Hugging Face publishes Croissant for datasets that can be converted to Parquet or follow ImageFolder. The API endpoint is documented and the JSON-LD is also embedded in dataset pages. (Hugging Face)

Way 0: check auto-Croissant (fast path)

  • Try either endpoint. Use whichever you prefer.

    • https://huggingface.co/api/datasets/<OWNER>/<REPO>/croissant (documented by MLCommons as an HF API example). (docs.mlcommons.org)
    • https://datasets-server.huggingface.co/croissant?dataset=<OWNER>/<REPO> (dataset viewer API doc). (Hugging Face)
  • If it returns JSON-LD, you are done. If it 404s, your repo likely isn’t Parquet/ImageFolder-convertible; convert, or proceed to manual (a quick probe sketch follows). (Hugging Face)
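A minimal probe sketch (uses the documented dataset-viewer endpoint; OWNER/REPO is a placeholder for your dataset id):

import requests

repo_id = "OWNER/REPO"
resp = requests.get(f"/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co%2Fcroissant%3Fdataset%3D%7Brepo_id%7D")
if resp.ok:
    print(resp.json().get("name"))  # auto-Croissant exists; reuse it
else:
    print(resp.status_code)  # e.g. 404: repo is not Parquet/ImageFolder-convertible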

Way 1: author and commit croissant.json (manual, reliable)

What must be in the file

Minimum, with names per spec:

  • @context, @type: "Dataset", name, url, conformsTo, distribution (list of cr:FileObject or cr:FileSet), and one or more recordSet with cr:Fields mapping columns to sources. See the spec’s minimal example using contentUrl, encodingFormat, and optional sha256. Use conformsTo: "http://mlcommons.org/croissant/1.0". (docs.mlcommons.org)
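A minimal sketch of such a file, assuming a single CSV at the repo root (all names and URLs are placeholders):

{
  "@context": { "@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/" },
  "@type": "Dataset",
  "name": "my-dataset",
  "url": "https://huggingface.co/datasets/OWNER/REPO",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "distribution": [{
    "@type": "cr:FileObject",
    "@id": "data.csv",
    "name": "data.csv",
    "contentUrl": "https://huggingface.co/datasets/OWNER/REPO/resolve/main/data.csv",
    "encodingFormat": "text/csv"
  }],
  "recordSet": [{
    "@type": "cr:RecordSet",
    "@id": "samples",
    "name": "samples",
    "field": [{
      "@type": "cr:Field",
      "@id": "samples/label",
      "name": "label",
      "dataType": "Text",
      "source": { "fileObject": { "@id": "data.csv" }, "extract": { "column": "label" } }
    }]
  }]
}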

Where to put it

  • Commit croissant.json at the repo root. Many public datasets do this. Examples: CharXiv, TerraIncognita, worldmodelbench. Inspect their structure for patterns. (Hugging Face)

How to generate it quickly (scriptable)

Use the Hub to list files, build distribution, and infer simple recordSets from CSV headers.

# Generates a skeleton croissant.json for an HF dataset repo.
# References:
# - HfApi/HfFileSystem and upload: https://huggingface.co/docs/huggingface_hub/guides/upload
# - Build raw URLs with hf_hub_url: https://huggingface.co/docs/huggingface_hub/guides/download
# - Croissant minimal keys example: https://docs.mlcommons.org/croissant/
from huggingface_hub import HfFileSystem, hf_hub_url
import json, re
import pandas as pd

REPO = "OWNER/REPO"  # e.g., "BGLab/TerraIncognita"
fs = HfFileSystem()

# List every file in the dataset repo as repo-relative paths (fs.find returns files only)
files = [p.removeprefix(f"datasets/{REPO}/") for p in fs.find(f"datasets/{REPO}")]

# Buckets
csvs = [f for f in files if f.lower().endswith((".csv", ".tsv"))]
images = [f for f in files if re.search(r"\.(jpg|jpeg|png|tif|tiff|bmp|gif)$", f, re.I)]

# Build distribution
dist = []
if images:
    # Group globs per top folder
    topdirs = sorted({f.split("/")[0] for f in images if "/" in f} or {"."})
    for d in topdirs:
        includes = f"{d}/**/*" if d != "." else "**/*"
        dist.append({
            "@type": "cr:FileSet",
            "@id": f"images-{d}".replace("/", "_"),
            "name": f"images-{d}",
            "encodingFormat": "image/*",
            "includes": includes
        })

for c in csvs:
    dist.append({
        "@type": "cr:FileObject",
        "@id": c,
        "name": c,
        "contentUrl": hf_hub_url(repo_id=REPO, filename=c, repo_type="dataset"),  # raw resolve URL
        "encodingFormat": "text/csv" if c.lower().endswith(".csv") else "text/tab-separated-values",
    })

# Basic recordSet from first CSV (optional but useful)
record_sets = []
if csvs:
    url = hf_hub_url(repo_id=REPO, filename=csvs[0], repo_type="dataset")
    sep = "\t" if csvs[0].lower().endswith(".tsv") else ","  # first file may be TSV
    cols = list(pd.read_csv(url, nrows=0, sep=sep).columns)
    fields = [{
        "@type": "cr:Field",
        "@id": f"samples/{col}",
        "name": col,
        "dataType": "Text",
        "source": {"fileObject": {"@id": csvs[0]}, "extract": {"column": col}}
    } for col in cols]
    record_sets.append({"@type": "cr:RecordSet", "@id": "samples", "name": "samples", "field": fields})

croissant = {
  "@context": { "@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/" },
  "@type": "Dataset",
  "name": REPO.split("/")[-1],
  "url": f"https://huggingface.co/datasets/{REPO}",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "distribution": dist,
  "recordSet": record_sets
}

with open("croissant.json", "w") as f:
    json.dump(croissant, f, indent=2)
print("Wrote croissant.json")

Notes:

  • Use hf_hub_url(..., filename=..., repo_type="dataset") to produce a raw .../resolve/... URL. Do not point contentUrl at the web UI. (Hugging Face)
  • Prefer cr:FileSet with includes for folders and cr:FileObject with contentUrl for single files. The spec’s minimal example shows both patterns and sha256 support. (docs.mlcommons.org)

Validate and iterate

  • Install and validate:

    • pip install "mlcroissant[parquet]" then python -c "import mlcroissant; print('ok')"
    • Load/validate in code or from CLI. The library docs show loading a Croissant URL and are kept current in HF docs. (Hugging Face)
  • Practical loop:

    1. Generate croissant.json.
    2. Validate by constructing an mlcroissant.Dataset(jsonld=...) and iterating records, as in the sketch after this list. The README and docs show exact calls. (github.com)
    3. Commit croissant.json to the dataset repo via upload_file or CLI. (Hugging Face)
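A minimal validation sketch, assuming croissant.json sits in the working directory and defines a "samples" RecordSet as above:

import mlcroissant as mlc

ds = mlc.Dataset(jsonld="croissant.json")  # parses and validates the JSON-LD
for i, record in enumerate(ds.records(record_set="samples")):
    print(record)  # inspect a few records to confirm the field mappings
    if i >= 2:
        break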

Optional GUI instead of code

  • Use the Croissant Editor Space. It infers resources and RecordSets from your files and lets you export JSON-LD. Good for images or nested folders. (Hugging Face)

Way 2: restructure to get auto-Croissant (no manual file)

  • Convert to Parquet or organize as ImageFolder (+ simple metadata file). HF will auto-publish Parquet and then /croissant appears. Docs and maintainer posts confirm this behavior in 2025. (Hugging Face)

Working example repos to copy from

  • princeton-nlp/CharXiv uses a top-level croissant.json with recordSet and RAI keys. Shows complete structure. Updated 2024-06-11. (Hugging Face)
  • BGLab/TerraIncognita shows CSV-based distribution and fields. Updated ~5 months ago. (Hugging Face)
  • Efficient-Large-Model/worldmodelbench shows recent practice. Updated ~5 months ago. (Hugging Face)

Common pitfalls to avoid

  • Missing required top-level keys or wrong vocabulary prefix. Follow the minimal example in the spec. (docs.mlcommons.org)
  • contentUrl pointing to the web UI instead of the raw .../resolve/.... Use hf_hub_url. (Hugging Face)
  • Expecting your manual croissant.json to change HF’s /croissant endpoint. That endpoint is auto-generated from Parquet/ImageFolder. Your file serves external tools and users. (Hugging Face)
  • Big private repos: dataset viewer Parquet publishing has limits and requirements. See size and visibility rules. (Hugging Face)

End-to-end checklist (redundant on purpose)

  1. Probe /croissant. If present, reuse it. If absent, proceed. (Hugging Face)

  2. Generate croissant.json:

    • List files from the repo.
    • Build distribution with cr:FileSet globs and cr:FileObject URLs from hf_hub_url.
    • Create a recordSet per main table with cr:Fields mapped by column. (Hugging Face)
  3. Validate with mlcroissant and load a few records. Fix errors. (Hugging Face)

  4. Commit croissant.json at the repo root with upload_file or CLI (see the sketch after this list). (Hugging Face)

  5. Optional: rebuild the repo as Parquet or ImageFolder to get HF auto-Croissant as well. (Hugging Face)
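For step 4, a minimal upload sketch (assumes you are already logged in, e.g. via huggingface-cli login; OWNER/REPO is a placeholder):

from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="croissant.json",  # the local file generated above
    path_in_repo="croissant.json",     # committed at the repo root
    repo_id="OWNER/REPO",
    repo_type="dataset",
)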

Supplemental materials (curated, dated)

Spec and core

  • MLCommons Croissant docs and minimal example. Accessed 2025-10-15. (docs.mlcommons.org)
  • Croissant repo README and examples. Latest release v1.0.22 on 2025-08-25. (github.com)

Hugging Face APIs

  • Get Croissant metadata via dataset viewer. Accessed 2025-10-15. (Hugging Face)
  • Dataset viewer overview and Parquet auto-publish rules. Updated 2024-2025. (Hugging Face)
  • mlcroissant usage on HF docs. Accessed 2025-10-15. (Hugging Face)
  • Upload files programmatically and with CLI. Updated 2024-07-22 and 2024-??. (Hugging Face)
  • Build raw URLs with hf_hub_url. Accessed 2025-10-15. (Hugging Face)

HF forum guidance

  • Auto-Croissant for Parquet/ImageFolder. Posts from 2025-04-14 and 2025-05-12. (Hugging Face Forums)

Editor

  • Croissant Editor Space. Accessed 2025-10-15. (Hugging Face)

Great work @John6666, thank you so much! I'll try it out soon.
