403 error on dataset fineweb-2

Hi,

I was training a small model just for fun when this error occurred (after more than 100k steps):

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2/resolve/a8a99b128121a41b17d95901715603386f6b1daf/data/fra_Latn/train/000_00000.parquet

I’m wondering whether I have hit a rate limit or something else. I guess it should have failed much earlier if I were doing something wrong?

I’m using it with streaming on:

    ds_fr = load_dataset(
        "HuggingFaceFW/fineweb-2",
        name="fra_Latn",
        split="train",
        streaming=True
    )

Any idea what the problem could be?

Thanks,


HTTPError: 403 Client Error: Forbidden for url

When streaming=True, shards are fetched on demand, so it’s not unusual for an error to show up partway through training rather than at the start. Judging from the error message, it looks like a CDN or network-side error, so I don’t think it’s a problem with your code.

Since rate limits are likely less restrictive for logged-in users, how about calling huggingface_hub.login() before training, and configuring datasets to be more tolerant of errors, for example by increasing the retry count?
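
As a rough illustration of the retry idea, here is a minimal sketch of a wrapper that restarts the stream where it left off when a fetch fails. `resilient_stream` and its parameters are hypothetical names of mine, not part of datasets. It assumes you can rebuild the stream at a given offset, and it retries on `OSError`, which `requests.exceptions.HTTPError` inherits from.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def resilient_stream(
    make_stream: Callable[[int], Iterable[T]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items from a stream, rebuilding it after transient errors.

    make_stream(n) is a hypothetical factory that must return an iterable
    starting at the n-th example (0-based) of the underlying data.
    """
    seen = 0      # number of examples already handed to the caller
    failures = 0  # consecutive failures since the last successful example
    while True:
        try:
            for item in make_stream(seen):
                yield item
                seen += 1
                failures = 0  # progress was made, so reset the counter
            return  # the stream ended normally
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise  # give up and surface the original error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (failures - 1))
```

With the dataset from the question, you would call huggingface_hub.login() once before training and then iterate over `resilient_stream(lambda n: ds_fr.skip(n))` instead of `ds_fr`. As far as I know, skip() on a streaming dataset still has to read past the skipped examples, so resuming late in training is not free. I believe recent datasets versions also have built-in settings such as datasets.config.STREAMING_READ_MAX_RETRIES, but please check that against your installed version.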

Although I don’t think it’s the case this time, the dataset repository can occasionally be updated while you are streaming from it. To rule that out, explicitly specifying the revision when loading the dataset would be the surest way.
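
To make the revision suggestion concrete, here is a small sketch. The commit hash is simply the one that appears in the URL of the error message above. `revision_from_resolve_url` and `load_pinned` are hypothetical helper names, and the sketch assumes the usual `/datasets/<org>/<name>/resolve/<revision>/<path>` URL layout.

```python
from urllib.parse import urlparse

# URL copied from the error message in the question.
ERROR_URL = (
    "https://huggingface.co/datasets/HuggingFaceFW/fineweb-2/resolve/"
    "a8a99b128121a41b17d95901715603386f6b1daf/data/fra_Latn/train/000_00000.parquet"
)


def revision_from_resolve_url(url: str) -> str:
    """Return the path segment that follows 'resolve' in a Hub file URL."""
    parts = urlparse(url).path.strip("/").split("/")
    return parts[parts.index("resolve") + 1]


def load_pinned(revision: str):
    """Open the same streaming split, pinned to a single revision."""
    # Imported here so the URL helper above works without datasets installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb-2",
        name="fra_Latn",
        split="train",
        streaming=True,
        revision=revision,  # commit hash, branch, or tag
    )
```

Calling `load_pinned(revision_from_resolve_url(ERROR_URL))` would then stream from exactly the commit the failed request was reading.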
