Hey,
I have a scenario where I’ll need to run distributed training on SageMaker. A couple of questions on integration with Fast File mode, `IterableDataset`, memory mapping, and performance:
- With `streaming=True`, is the dataset memory-mapped, given that it’s not actually on disk to map to/from? If not, is streaming less performant than loading from memory-mapped files, as indicated here? `FastFile` mode on SageMaker exposes S3 objects as if they were on local disk, but they are actually streamed on demand as they are accessed. If using a standard `Dataset`, I imagine each file needs to be streamed from S3 via `FastFile` in its entirety before it can be memory-mapped, is that correct? In that case, when using a standard `Dataset`, should I avoid `FastFile` to skip this two-step process and just download all the data upfront?
- Is native HF dataset streaming performant compared to multi-process, byte-range fetches directly from a large file on S3 or other object storage? I see that `IterableDataset` does not support multiple workers.
- Along the same lines: if `Dataset` is based on the Arrow format, why doesn’t `IterableDataset` allow streaming Arrow files (according to the docs) from remote storage, or loading Arrow files progressively from a local file? Is there a fundamental limitation here, or would it just not provide better performance in theory over progressively loading a JSON or CSV file?
Thanks in advance. @lhoestq