Hey,
I have a scenario where I’ll need to run distributed training on SageMaker. A couple of questions on integration with Fast File mode, `IterableDataset`, memory mapping, and performance:
- With `streaming=True`, is the dataset memory-mapped, given that it’s not actually on disk to map to/from? If not, is streaming less performant than loading from memory-mapped files, as indicated here? `FastFile` mode on SageMaker exposes S3 objects as if they were on local disk, but they are actually streamed on demand as they are accessed. If using a standard `Dataset`, I imagine each file needs to be streamed from S3 via `FastFile` in its entirety before it can be memory-mapped, is that correct? In that case, when using a standard `Dataset`, should I avoid `FastFile` to skip this two-step process and just download all the data upfront?
- Is native HF dataset streaming performant compared to multi-process, byte-range fetches directly from a large file on S3 or other object storage? I see that `IterableDataset` does not support multiple workers.
- Along the same lines: if `Dataset` is based on the Arrow format, why doesn’t `IterableDataset` allow streaming Arrow files (according to the docs) from remote storage, or loading Arrow files progressively from a local file? Is there a fundamental limitation here, or would it just not provide better performance in theory over progressively loading a JSON or CSV file?
Thanks in advance. @lhoestq