Dataset viewer broke after repo rename

Hi,

After renaming my dataset repository, the dataset viewer, which worked fine before the rename, began failing with the error below. In addition, the internally generated refs/convert/parquet branch that the parquet-converter bot had previously created is now missing.

The full dataset viewer is not available (click to read why). Only showing a preview of the rows.

Error code:   DatasetGenerationError
Exception:    IndexError
Message:      list index out of range
Traceback:    Traceback (most recent call last):
                File "/usr/local/lib/python3.12/site-packages/datasets/builder.py", line 1904, in _prepare_split_single
                  original_shard_lengths[original_shard_id] += len(table)
                  ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
              IndexError: list index out of range
              
              The above exception was the direct cause of the following exception:
              
              Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1342, in compute_config_parquet_and_info_response
                  parquet_operations, partial, estimated_dataset_info = stream_convert_to_parquet(
                                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 907, in stream_convert_to_parquet
                  builder._prepare_split(split_generator=splits_generators[split], file_format="parquet")
                File "/usr/local/lib/python3.12/site-packages/datasets/builder.py", line 1739, in _prepare_split
                  for job_id, done, content in self._prepare_split_single(
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
                File "/usr/local/lib/python3.12/site-packages/datasets/builder.py", line 1925, in _prepare_split_single
                  raise DatasetGenerationError("An error occurred while generating the dataset") from e
              datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

Renaming the dataset itself may simply be the trigger here, not the cause.
Tagging @lhoestq just in case.


The most likely explanation is that the rename triggered a fresh viewer-side Parquet rebuild, and that rebuild failed inside Hugging Face’s backend conversion path. The missing refs/convert/parquet ref is consistent with that failure, because Hugging Face documents that the dataset viewer’s Parquet copy is published on refs/convert/parquet, and the Hub API treats converts as internal preprocessed refs separate from ordinary branches and tags. (huggingface.co, huggingface.co)

1. What the dataset viewer actually does

The full dataset viewer is not just rendering your raw files directly from the repo. Hugging Face exposes a dedicated dataset-viewer API with endpoints such as /is-valid, /first-rows, /rows, /search, /filter, /parquet, and /size. For the full viewer, Hugging Face builds and serves a Parquet-backed representation of the dataset; the Parquet docs explicitly say those files are published on refs/convert/parquet. (huggingface.co, huggingface.co)

That architectural detail matters because it means there are really two layers involved:

  1. your visible dataset repo and its contents, and
  2. an internal, generated viewer layer that mirrors the dataset in Parquet for browsing and querying. (huggingface.co, huggingface.co)

So when the UI says:

“The full dataset viewer is not available. Only showing a preview of the rows.”

that does not automatically mean your dataset files are broken. Hugging Face’s validity docs explicitly document that preview and viewer are separate capabilities, so a dataset can remain previewable while the full viewer is unavailable. (huggingface.co)
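As a concrete check of that distinction, the validity endpoint can be queried directly. A minimal stdlib-only sketch — the `preview` and `viewer` boolean fields follow the validity docs, and `<namespace>/<repo>` stands in for the actual repo id:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API = "/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co"

def is_valid_url(dataset: str) -> str:
    # Build the /is-valid endpoint URL for a dataset repo id.
    return f"{API}/is-valid?dataset={quote(dataset, safe='')}"

def summarize(validity: dict) -> str:
    # Interpret the capability flags: a dataset can be previewable
    # while the full viewer is unavailable, which is exactly the
    # state described in the banner above.
    if validity.get("viewer"):
        return "full viewer available"
    if validity.get("preview"):
        return "preview only"
    return "neither preview nor viewer"

# Live check (requires network access):
# with urlopen(is_valid_url("<namespace>/<repo>")) as r:
#     print(summarize(json.load(r)))
```

If this prints "preview only" for the renamed repo, it confirms the preview/viewer split rather than a data problem.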

2. What your traceback says, technically

The important part of your traceback is not the outer DatasetGenerationError. The important part is the inner crash:

original_shard_lengths[original_shard_id] += len(table)
IndexError: list index out of range

That line comes from the dataset-building logic used during the viewer’s Parquet-generation job. In other words, the failure is happening while Hugging Face is preparing the split for Parquet output, not while the browser is simply reading an already-existing table. (github.com)

This is especially significant because Hugging Face already has a public upstream fix for that exact failure class. There is a huggingface/datasets PR titled “Fix index out of bound error with original_shard_lengths”, and the related datasets 4.6.0 release notes include “Support empty shard in from_generator.” That is the strongest single clue in your entire report. It means the crash pattern itself is already known to Hugging Face and is not just something unique to your repo rename. (github.com, github.com)
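To make the failure mode concrete, here is a hypothetical sketch of that kind of shard bookkeeping — the names are illustrative, not the actual datasets internals. If shard numbering shifts (for example because a shard produced no tables), the increment indexes past the end of the tracked list:

```python
def tally_shard_lengths(tables_per_shard, num_tracked_shards):
    # Track how many rows each original shard contributed
    # (an illustrative stand-in for original_shard_lengths).
    shard_lengths = [0] * num_tracked_shards
    for shard_id, table_len in tables_per_shard:
        # If shard_id >= num_tracked_shards -- e.g. an empty shard threw
        # the numbering off -- this raises
        # "IndexError: list index out of range",
        # the same inner crash as in the traceback above.
        shard_lengths[shard_id] += table_len
    return shard_lengths
```

Nothing about the rows themselves has to be wrong for this to blow up; it is purely an accounting mismatch between the expected and actual shard layout.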

3. Why the rename likely triggered it

A repo rename can be completely harmless for the raw data and still break the viewer layer.

Hugging Face documents repo moves through repository tooling, but the viewer’s Parquet mirror is documented separately as generated state living in refs/convert/parquet, and the Hub API classifies these as internal converts refs. That means a rename is not just a Git rename from the viewer backend’s perspective; it can require the backend to re-resolve, regenerate, or republish the derived Parquet artifacts under the new repo identity. (huggingface.co, huggingface.co, huggingface.co)

That gives a very plausible sequence for your case:

  1. the dataset worked before because the old hidden Parquet mirror already existed,
  2. the repo was renamed,
  3. the viewer backend had to rebuild or reattach the Parquet mirror,
  4. the rebuild hit the original_shard_lengths bug,
  5. the Parquet publish step never completed,
  6. refs/convert/parquet is now missing,
  7. the page falls back to preview-only mode.

That sequence is an inference, but it is strongly supported by the official architecture docs and the public bug history. (huggingface.co, github.com, github.com)

4. Why the missing refs/convert/parquet ref is such a strong clue

The missing ref is not just a side symptom. It is one of the most important parts of the diagnosis.

Hugging Face’s Parquet docs say the viewer’s Parquet files are published on refs/convert/parquet. Meanwhile, the Hub API docs explain that converts are internal refs used to push preprocessed data in dataset repos. So if that ref existed before the rename and is absent afterward, the natural reading is:

  • the old generated viewer state is gone or no longer attached, and
  • the new generated viewer state failed to build. (huggingface.co, huggingface.co)

There is also public evidence that these generated refs can become stale or out of sync after repo changes. In one Hugging Face discussion, a maintainer explains that the auto-generated refs/convert/* branches are updated only when the viewer updates, and the user shows refs/convert/parquet not matching newer content on main. In a separate dataset-viewer issue, Hugging Face describes a corner case where old Parquet files remain on the Hub after the dataset is updated, so the viewer layer and main can diverge. (huggingface.co, github.com)

So the rename-specific angle in your case is not “renaming destroys data.” It is more like “renaming forced the system back through a fragile generated-state path.”

5. What is most likely happening in your specific case

My best technical reading is this:

  • your underlying dataset files are probably still fine,
  • the viewer backend tried to regenerate the Parquet mirror for the renamed repo,
  • during split preparation, it encountered a shard bookkeeping pattern that the older code handled incorrectly,
  • the Parquet-generation job aborted before it could republish refs/convert/parquet,
  • the viewer UI now has only the preview path available. (github.com, huggingface.co, huggingface.co)

If I had to rank the causes:

Most likely

A known Hugging Face backend bug in Parquet generation around shard bookkeeping, exposed when the rename forced regeneration. (github.com, github.com)

Also plausible

A stale or desynchronized hidden Parquet ref problem after repo change, where the viewer’s generated state no longer lines up cleanly with main. (huggingface.co, github.com)

Less likely

A real corruption or format defect in your dataset content itself. The traceback is pointing much more strongly at the generation layer than at raw-data parsing. (github.com)

Least likely

A generic upstream rate limit or external hosting failure. Hugging Face does have dataset-viewer failures where external 429/403 errors bubble up as a generic generation failure, but those cases have a different shape than your original_shard_lengths crash. (github.com)

6. Why I do not think the rename directly “broke the data”

A rename changes the repo identity. It does not normally rewrite the actual dataset contents. The evidence in your traceback points to the conversion job that rebuilds viewer artifacts, not to a change in the rows themselves. The official viewer docs and the Hub API docs reinforce that distinction: the full viewer depends on separate generated Parquet state, and that state is managed through hidden convert refs. (huggingface.co, huggingface.co)

That is why the right mental model is:

rename = trigger
conversion bug / stale viewer state = root problem

not:

rename = data corruption

7. What to do now

Step 1. Check the viewer state directly

Run:

curl "/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co%2Fis-valid%3Fdataset%3D%26lt%3Bnamespace%26gt%3B%2F%26lt%3Brepo%26gt%3B"
curl "/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co%2Fparquet%3Fdataset%3D%26lt%3Bnamespace%26gt%3B%2F%26lt%3Brepo%26gt%3B"

and:

from huggingface_hub import HfApi

api = HfApi()
print(api.list_repo_refs("<namespace>/<repo>", repo_type="dataset"))

These tell you three different things: whether the viewer considers the dataset valid at all, whether any Parquet files are currently published for it, and whether refs/convert/parquet still exists among the repo's refs.

Step 2. Make one tiny commit

Hugging Face’s dataset-viewer issue history indicates that dataset updates trigger backend jobs through a webhook path. So a small README or dataset card edit is a reasonable way to retrigger Parquet-and-info generation. It is not guaranteed to work if the worker still carries the buggy code path, but it is the simplest clean retry. (github.com)
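A minimal way to make that nudge commit, sketched with the stdlib for the edit itself and huggingface_hub for the push. The HTML-comment marker is my own convention, not anything Hugging Face requires; it just keeps the diff invisible to readers of the card:

```python
from datetime import datetime, timezone

def nudge_readme(text: str) -> str:
    # Append a timestamped HTML comment so the README content is
    # visually unchanged but the commit is new, which should hit the
    # webhook path that retriggers viewer jobs.
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return text.rstrip("\n") + f"\n\n<!-- viewer retrigger {stamp} -->\n"

# Pushing the edit (requires huggingface_hub and a write token):
# from huggingface_hub import HfApi, hf_hub_download
# path = hf_hub_download("<namespace>/<repo>", "README.md", repo_type="dataset")
# new_text = nudge_readme(open(path).read())
# HfApi().upload_file(path_or_fileobj=new_text.encode(),
#                     path_in_repo="README.md",
#                     repo_id="<namespace>/<repo>", repo_type="dataset")
```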

Step 3. Open a dataset discussion with the exact details

This matters. The dataset-viewer repo itself says that when a dataset page shows a viewer error, the efficient route is to open a discussion on the dataset page and tag the viewer team. Your report should include:

  • repo was renamed,
  • refs/convert/parquet existed before and is now missing,
  • exact traceback,
  • especially the original_shard_lengths[...] IndexError. (github.com)

That gives Hugging Face maintainers the strongest possible signal that this is a backend conversion problem, not a generic UI complaint.

8. The best self-service workaround

If you need a durable fix without waiting for Hugging Face to repair or rerun the conversion, the cleanest workaround is to publish the dataset natively as Parquet on main.

Why that works:

  • Hugging Face documents that if the dataset is already in Parquet, the refs/convert/parquet branch can usually just link to the original Parquet files instead of performing a new conversion. (huggingface.co)
  • The datasets docs say Dataset.push_to_hub() publishes the dataset as a Parquet dataset and exposes shard controls such as max_shard_size and num_shards. (huggingface.co)

So in your case, publishing Parquet directly is not just an optimization. It is a way to bypass the exact conversion layer that is currently failing.
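A sketch of that workaround, assuming the dataset can still be loaded locally or from the Hub. The shard-planning helper is my own illustration; `push_to_hub` itself accepts either `max_shard_size` or `num_shards`, per the datasets docs:

```python
import math

def plan_num_shards(total_bytes: int, target_shard_bytes: int = 500 * 2**20) -> int:
    # Pick a shard count so each Parquet shard lands near the target
    # size (500 MB here, an arbitrary but common choice).
    return max(1, math.ceil(total_bytes / target_shard_bytes))

# With the datasets library (requires network and a write token):
# from datasets import load_dataset
# ds = load_dataset("<namespace>/<repo>")
# ds.push_to_hub("<namespace>/<repo>",
#                max_shard_size="500MB")  # or num_shards={"train": ...}
```

Because the repo then already holds Parquet on main, the viewer's refs/convert/parquet can link to those files instead of re-running the conversion that is currently failing.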

9. If you regenerate locally

If your workflow involves regenerating the dataset locally before upload, then use a datasets version new enough to include the fix path associated with the original_shard_lengths bug. The public evidence points to that class of fix being present by datasets 4.6.0. That does not control the version Hugging Face is running in its backend workers, but it does reduce the chance of reproducing the same problem on your side while you generate and upload Parquet yourself. (github.com, github.com)
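A small guard for that, with a deliberately naive version comparison — it assumes plain `X.Y.Z` version strings, so for pre-release suffixes and the like, `packaging.version` is the robust choice:

```python
def at_least(version: str, minimum: str) -> bool:
    # Naive semver-ish comparison: compare the first three numeric parts.
    to_tuple = lambda v: tuple(int(p) for p in v.split(".")[:3])
    return to_tuple(version) >= to_tuple(minimum)

# Before regenerating locally (4.6.0 per the public release notes):
# import datasets
# assert at_least(datasets.__version__, "4.6.0"), \
#     "upgrade first: pip install -U datasets"
```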

10. My bottom-line diagnosis

Here is my actual opinion, stated plainly:

The rename probably exposed a Hugging Face viewer-generation bug or stale hidden-ref state. The missing refs/convert/parquet ref is best understood as a symptom of failed or incomplete viewer regeneration. The exact traceback points strongly toward a known shard-indexing bug in the backend conversion path. Your dataset contents are probably not the real problem. (huggingface.co, huggingface.co, github.com, github.com, huggingface.co, github.com)

11. Recommended action order

  1. Check /is-valid, /parquet, and list_repo_refs(). (huggingface.co, huggingface.co, huggingface.co)
  2. Make one tiny commit to retrigger the viewer jobs. (github.com)
  3. Open a dataset discussion with the exact traceback and the rename correlation. (github.com)
  4. If you need a self-service fix, republish as native Parquet. (huggingface.co, huggingface.co)

Hi! Can you share which repo, if it’s a public one? Which file format are you using?


Hi @lhoestq

Thanks for your reply!

The repository is:

Please note that my Parquet files do not contain image bytes; they only store relative image paths.

So far, I have tried the following:

  1. I added the following to the configs section of README.md. In this case, no error appears, but image thumbnails are no longer shown as they were before the rename, and the viewer shows only the relative paths instead:
configs:
  - config_name: default
    data_files:
      - split: train
        path: train/metadata.parquet
      - split: validation
        path: val/metadata.parquet
      - split: test
        path: test/metadata.parquet
  2. I also tried JSONL files, which are now in the repository root. In those files, the file_name column contains fully resolved URLs such as https://huggingface.co/datasets/amir-kazemi/aidovecl-vehicle-detection-classification-localization/resolve/main/test/images/<image_name> rather than relative paths. In that case, the full viewer never finished loading: a few initial rows were displayed successfully, but then the viewer showed an error that I believe was related to PyArrow and masks. I also tried to align the objects field in the JSONL files more closely with standard computer vision formats, but I suspect the error comes from the viewer interpreting my annotations as masks, even though they are only bounding boxes.

Please let me know if you need more information.
