Xet Protocol Specification

Version 1.0.0

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

This specification defines the end-to-end Xet protocol for content-addressed data: chunking and hashing rules, deduplication strategy, xorb and shard object formats, file reconstruction semantics, authentication, and the CAS APIs for upload and download. Its goal is interoperability and determinism: independent implementations MUST produce the same hashes, objects, and API behavior so that data written by one client can be read by another with integrity and performance. Implementors can create their own clients, SDKs, and tools that speak the Xet protocol and interface with the CAS service, as long as they adhere to the requirements defined here.

Building a Client Library for Xet Storage

Overall Xet Architecture

  • Content-Defined Chunking: Gearhash-based CDC with parameters, boundary rules, and performance optimizations.
  • Hashing Methods: Descriptions and definitions of the different hashing functions used for chunks, xorbs and term verification entries.
  • File Reconstruction: Defines the “term”-based representation of files, in which each term pairs a xorb hash with a chunk range.
  • Xorb Format: Explains grouping chunks into xorbs, 64 MiB limits, binary layout, and compression schemes.
  • Shard Format: Binary shard structure (header, file info, CAS info, footer), offsets, HMAC key usage, and bookends.
  • Deduplication: Explains chunk-level deduplication, including global (system-wide) chunk-level deduplication.
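
To make the pieces above concrete, here is a minimal Python sketch of how content-defined chunking, chunk hashing, xorbs, and term-based reconstruction could fit together. It is an illustration under stated assumptions, not the specified algorithm: the chunk-size limits, the boundary mask, the gear table, the use of SHA-256, the way a xorb identifier is derived, and the helper names (split_chunks, chunk_hash, upload, reconstruct) are all hypothetical stand-ins. The linked sections define the normative values and formats.

```python
import hashlib
import random

# Toy parameters chosen for readability; they are NOT the values the
# Xet specification mandates.
MIN_CHUNK = 64                   # never cut a chunk shorter than this
MAX_CHUNK = 1024                 # always cut once a chunk reaches this size
MASK = 0xFF00_0000_0000_0000     # cut when the top 8 hash bits are zero
U64 = (1 << 64) - 1

# Stand-in gear table: 256 pseudo-random 64-bit values from a fixed seed.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def split_chunks(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries using a gear rolling hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & U64          # gear hash update
        size = i + 1 - start
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                # trailing partial chunk
    return chunks


def chunk_hash(chunk: bytes) -> str:
    # SHA-256 is a placeholder; the specification defines its own hashes.
    return hashlib.sha256(chunk).hexdigest()


def upload(data: bytes, xorbs: dict, index: dict) -> list[tuple]:
    """Store unseen chunks in one new toy xorb and return the file's terms.

    A term is (xorb_id, first_chunk, end_chunk) with end_chunk exclusive.
    """
    chunks = split_chunks(data)
    fresh = {}                                     # hash -> chunk, in order
    for c in chunks:
        h = chunk_hash(c)
        if h not in index and h not in fresh:
            fresh[h] = c
    if fresh:
        # Hypothetical xorb identifier: hash of the member chunk hashes.
        xorb_id = hashlib.sha256("".join(fresh).encode()).hexdigest()
        xorbs[xorb_id] = list(fresh.values())
        for i, h in enumerate(fresh):
            index[h] = (xorb_id, i)
    terms = []
    for c in chunks:
        xorb_id, i = index[chunk_hash(c)]
        if terms and terms[-1][0] == xorb_id and terms[-1][2] == i:
            terms[-1] = (xorb_id, terms[-1][1], i + 1)   # extend the range
        else:
            terms.append((xorb_id, i, i + 1))
    return terms


def reconstruct(terms: list[tuple], xorbs: dict) -> bytes:
    """Rebuild a file by concatenating the chunk ranges named by its terms."""
    return b"".join(b"".join(xorbs[x][s:e]) for x, s, e in terms)


# Usage: upload a file, then a copy with one byte inserted in the middle.
source = random.Random(1)
original = bytes(source.getrandbits(8) for _ in range(20_000))
edited = original[:10_000] + b"X" + original[10_000:]

xorbs, index = {}, {}
terms_a = upload(original, xorbs, index)
terms_b = upload(edited, xorbs, index)
stored = sum(len(c) for xorb in xorbs.values() for c in xorb)
print(len(terms_a), len(terms_b), stored, len(original) + len(edited))
```

Because boundaries depend only on the bytes seen so far, every chunk that ends before the inserted byte is identical in both files. The second upload therefore reuses those chunks through terms that point into the first xorb, and the total stored size stays below the combined size of the two files.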

Reference Implementation

xet-core: hf-xet + git-xet

The primary reference implementation of the protocol, written in Rust 🦀, lives in the xet-core repository and is split across multiple crates:

  • cas_types - Common reusable types for interacting with the CAS APIs.
  • cas_client - Client interface that calls the CAS APIs, including a comprehensive implementation of the download protocol.
  • mdb_shard - Library for interacting with shards and the shard binary format.
  • deduplication - Exposes interfaces to deduplicate chunks both locally and through global deduplication.
  • merklehash - Exports a MerkleHash type used extensively to represent hashes, along with functions to compute the different hashes used to track chunks, xorbs, and files.
  • data - Comprehensive package exposing interfaces to upload and download contents.
  • hf_xet - Python bindings to use the Xet protocol for uploads and downloads with the Hugging Face Hub.
  • git-xet - Git LFS custom transfer agent that uploads files to the Hugging Face Hub using the Xet protocol.

huggingface.js

There is also a second reference implementation in huggingface.js, which can be used when downloading or uploading files with the @huggingface/hub library.

  • Download uses the XetBlob class, which can be found in XetBlob.ts.
  • The upload implementation is more involved; its entry point is uploadShards.
    • The upload process uses xet-core constructs compiled from Rust to WebAssembly, particularly all functions exported from the hf_xet_thin_wasm crate.