Download Protocol
This document describes the complete process of downloading a single file from the Xet protocol using the Content Addressable Storage (CAS) reconstruction API.
Overview
File download in the Xet protocol is a two-stage process:
- Reconstruction Query: Query the CAS API to get file reconstruction metadata
- Data Fetching: Download and reassemble the file using the reconstruction metadata
Stage 1: Calling the Reconstruction API
Single File Reconstruction
To download a file given a file hash, first call the reconstruction API to get the file reconstruction; follow the steps in the API documentation. Note that you will need an auth token with at least `read` scope; see the authentication reference.
For large files it is RECOMMENDED to request the reconstruction in batches, i.e. fetch the first 10GB, download all of that data, then the next 10GB, and so on. Clients can use the `Range` header to specify a range of file data.
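As a sketch, a batched download loop might look like the following. The `get` and `write_batch` helpers are hypothetical placeholders; the `/reconstructions/{file_id}` endpoint shape follows the example later in this document.

```python
# Sketch: request reconstruction metadata (and then data) in 10GB batches.
BATCH_SIZE = 10 * 1024 * 1024 * 1024  # 10GB

def download_in_batches(get, api_endpoint, token, file_id, file_size, write_batch):
    start = 0
    while start < file_size:
        end = min(start + BATCH_SIZE, file_size) - 1  # Range header is end-inclusive
        headers = {
            "Authorization": "Bearer " + token,
            "Range": "bytes=%d-%d" % (start, end),
        }
        reconstruction = get(api_endpoint + "/reconstructions/" + file_id,
                             headers=headers)
        write_batch(reconstruction)  # fetch and assemble this batch's terms
        start = end + 1
```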
Stage 2: Understanding the Reconstruction Response
The reconstruction API returns a `QueryReconstructionResponse` object with three key components:
QueryReconstructionResponse Structure
```json
{
  "offset_into_first_range": 0,
  "terms": [
    {
      "hash": "a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456",
      "unpacked_length": 263873,
      "range": {
        "start": 0,
        "end": 4
      }
    },
    ...
  ],
  "fetch_info": {
    "a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456": [
      {
        "range": {
          "start": 0,
          "end": 4
        },
        "url": "https://transfer.xethub.hf.co/xorb/default/a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456",
        "url_range": {
          "start": 0,
          "end": 131071
        }
      },
      ...
    ],
    ...
  }
}
```
Fields
`offset_into_first_range`
- Type: `number`
- For a full file, or when the specified range starts at byte 0, this is guaranteed to be `0`.
- For range queries this is the byte offset into the first term (after chunks are deserialized/decompressed) at which to start keeping data.
- Since the requested range may start in the middle of a chunk, and data MUST be downloaded in full chunks (they may need to be deserialized), this offset tells a client how many bytes to skip in the first chunk (or possibly across multiple chunks within the first term).
`terms`
- Type: `Array<CASReconstructionTerm>`
- Ordered list of reconstruction terms describing which chunks to download from which xorb.
- Each `CASReconstructionTerm` contains:
  - `hash`: the xorb hash (64-character lowercase hex string)
  - `range`: chunk index range `{ start: number, end: number }` within the xorb; end-exclusive `[start, end)`
  - `unpacked_length`: expected length after decompression (for validation)
`fetch_info`
- Type: `Map<Xorb Hash (64-character lowercase hex string), Array<CASReconstructionFetchInfo>>`
- Maps xorb hashes to the information required to download some of their chunks.
- The mapping is to an array of 1 or more `CASReconstructionFetchInfo`.
- Each `CASReconstructionFetchInfo` contains:
  - `url`: HTTP URL for downloading the xorb data; a presigned URL containing authorization information
  - `url_range` (bytes_start, bytes_end): byte range `{ start: number, end: number }` for the `Range` header; end-inclusive `[start, end]`
    - The `Range` header MUST be set as `Range: bytes=<start>-<end>` when downloading this chunk range
  - `range` (index_start, index_end): chunk index range `{ start: number, end: number }` that this URL provides; end-exclusive `[start, end)`
    - This range indicates which chunk indices within this xorb this fetch info item describes
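Since `url_range` is end-inclusive, its values map directly onto the header value. A minimal helper (illustrative only) using the values from the example above:

```python
def range_header(url_range):
    """Format an end-inclusive url_range as a Range header value."""
    return "bytes=%d-%d" % (url_range["start"], url_range["end"])
```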
Stage 3: Downloading and Reconstructing the File
Process Overview
1. Process each `CASReconstructionTerm` in order from the `terms` array.
2. For each `CASReconstructionTerm`, find matching fetch info using the term's hash:
   - Get the list of fetch_info items under the xorb hash from the `CASReconstructionTerm`. The xorb hash is guaranteed to exist as a key in the fetch_info map.
   - Linearly iterate through the list of `CASReconstructionFetchInfo` and find one that refers to a chunk range equal to or encompassing the term's chunk range. Such a fetch_info item is guaranteed to exist; if none exists the server is at fault.
3. Download the required data using an HTTP `GET` request; the `Range` header MUST be set.
4. Deserialize the downloaded xorb data to extract chunks.
   - This series of chunks contains the chunks at the indices specified by the `CASReconstructionFetchInfo`'s `range` field. Trim chunks at the beginning or end to match the chunks specified by the reconstruction term's `range` field.
   - (For the first term only) skip `offset_into_first_range` bytes.
5. Concatenate the results in term order to reconstruct the file.
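The steps above can be sketched end-to-end as follows. The `get` and `deserialize_chunks` helpers are hypothetical stand-ins for an HTTP client and a xorb-format decoder; this is an illustrative sketch, not a definitive implementation.

```python
def reconstruct_file(get, deserialize_chunks, reconstruction, output):
    """Assemble a file from a QueryReconstructionResponse-shaped dict."""
    offset = reconstruction["offset_into_first_range"]
    for term in reconstruction["terms"]:
        # Find a fetch_info entry whose chunk range covers this term's range.
        entry = next(
            e for e in reconstruction["fetch_info"][term["hash"]]
            if e["range"]["start"] <= term["range"]["start"]
            and e["range"]["end"] >= term["range"]["end"]
        )
        # url_range is end-inclusive, so its values go straight into the header.
        header = "bytes=%d-%d" % (entry["url_range"]["start"], entry["url_range"]["end"])
        chunks = deserialize_chunks(get(entry["url"], headers={"Range": header}))
        # Trim to the term's chunks; indices are relative to the entry's range.
        first = term["range"]["start"] - entry["range"]["start"]
        last = term["range"]["end"] - entry["range"]["start"]
        for chunk in chunks[first:last]:
            # offset_into_first_range may skip whole chunks at the very start.
            if offset >= len(chunk):
                offset -= len(chunk)
                continue
            output.write(chunk[offset:])
            offset = 0
```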
Detailed Download Process
Download Reconstruction
Use the reconstruction API to download the reconstruction object for a given file:

```python
file_id = "0123...abcdef"
api_endpoint, token = get_token()  # follow auth instructions
url = api_endpoint + "/reconstructions/" + file_id
reconstruction = get(url, headers={"Authorization": "Bearer " + token})

# break the reconstruction into components
terms = reconstruction["terms"]
fetch_info = reconstruction["fetch_info"]
offset_into_first_range = reconstruction["offset_into_first_range"]
```
Step 1: Match Terms to Fetch Info
For each `CASReconstructionTerm` in the `terms` array:
- Look up the term's `hash` in the `fetch_info` map to get a list of `CASReconstructionFetchInfo`.
- Find a `CASReconstructionFetchInfo` entry whose `range` contains the term's `range`:
  - Linearly search through the array of `CASReconstructionFetchInfo` and find the element where the range block (`{ "start": number, "end": number }`) of the `CASReconstructionFetchInfo` has start <= the term's range start AND end >= the term's range end.
  - The server is meant to guarantee a match; if there isn't one, the download is considered failed and the server made an error.
```python
for term in terms:
    xorb_hash = term["hash"]
    fetch_info_entries = fetch_info[xorb_hash]
    fetch_info_entry = None
    for entry in fetch_info_entries:
        if entry["range"]["start"] <= term["range"]["start"] and entry["range"]["end"] >= term["range"]["end"]:
            fetch_info_entry = entry
            break
    if fetch_info_entry is None:
        # Error! The server failed to provide a covering fetch info entry.
```
Step 2: Download Xorb Data
For each matched fetch info:
- Make an HTTP GET request to the `url` in the fetch info entry.
- Include a `Range` header: `bytes={url_range.start}-{url_range.end}`.
```python
for term in terms:
    ...
    data_url = fetch_info_entry["url"]
    range_header = "bytes=%d-%d" % (fetch_info_entry["url_range"]["start"], fetch_info_entry["url_range"]["end"])
    data = get(data_url, headers={"Range": range_header})
```
Step 3: Deserialize Downloaded Data
The downloaded data is in xorb format and MUST be deserialized:
- Parse xorb structure: the data contains compressed chunks with headers
- Decompress chunks: each chunk has a header followed by compressed data
- Extract byte indices: track byte boundaries between chunks for range extraction
- Validate length: the decompressed length MUST match `unpacked_length` from the term

Note: The deserialization process depends on the xorb format.
```python
for term in terms:
    ...
    chunks = {}
    for i in range(fetch_info_entry["range"]["start"], fetch_info_entry["range"]["end"]):
        chunk = deserialize_chunk(data)  # assume data is a reader that advances forwards
        chunks[i] = chunk
    # at this point data should be fully consumed
```
Step 4: Extract Term Data
From the deserialized xorb data:
- Use the term's `range` to identify which chunks are needed.
- Extract only the chunks specified by `range.start` to `range.end - 1` (end-exclusive).
- Apply any range offsets if processing a partial file download.
```python
file_chunks = []
for term in terms:
    ...
    for i in range(term["range"]["start"], term["range"]["end"]):
        chunk = chunks[i]
        # the offset may cover one or more whole chunks, so we may need to skip entire chunks
        if offset_into_first_range >= len(chunk):
            offset_into_first_range -= len(chunk)
            continue
        if offset_into_first_range > 0:
            chunk = chunk[offset_into_first_range:]
            offset_into_first_range = 0
        file_chunks.append(chunk)
```
Step 5: Stitch Results Together
Write all of the chunks to the output file or buffer.
If a range was specified then the total data will need to be truncated to the number of bytes requested.
When a range is specified but does not end on a chunk boundary, the last byte of the requested range will fall in the middle of the last chunk.
A client knows the start of the data from `offset_into_first_range` and can then use the length of the specified range to determine the end offset.

```python
with open(file_path, "wb") as f:
    for chunk in file_chunks:
        f.write(chunk)
```
Range Downloads
For partial file downloads, the reconstruction API supports range queries:
- Include a `Range: bytes=start-end` header in the reconstruction request.
- The `offset_into_first_range` field indicates where your range starts within the first term.
- The end of the content will need to be truncated to fit the requested range.
  - Except if the requested range exceeds the total file length; then the returned content will be shorter and no truncation is necessary.
When downloading individual term data:
A client MUST include the `Range` header, formed with the values from the `url_range` field, to specify the exact range of xorb data that they are accessing.
Xet global deduplication requires that access to xorbs is only granted to authorized ranges; not specifying this header will result in an authorization failure.
Performance Considerations
- Range coalescing: multiple terms may share the same fetch info for efficiency, so a single fetch info may be larger than any one term and may fulfill multiple terms. Consider downloading such content only once and reusing the data.
- Parallel downloads: terms can be downloaded in parallel, but MUST be assembled in order.
  - On file systems with fast seeking, it MAY be advantageous to open the output file in different threads and write contents at different offsets.
- Caching: clients SHOULD consider caching downloaded xorb ranges to avoid redundant requests.
- Retry logic: implement exponential backoff for transient failures.
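A hedged sketch of such retry logic with exponential backoff and jitter; the injectable `get` and `sleep` parameters are illustrative conveniences, not part of the protocol:

```python
import random
import time

def get_with_retries(get, url, headers, max_attempts=5,
                     retryable=(ConnectionError,), sleep=time.sleep):
    """Call `get` with exponential backoff on transient failures."""
    for attempt in range(max_attempts):
        try:
            return get(url, headers=headers)
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            # back off 1s, 2s, 4s, ... plus jitter to spread retries out
            sleep((2 ** attempt) + random.random())
```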
Caching recommendations
- It can be ineffective to cache the reconstruction object.
  - The fetch_info section provides short-expiration presigned URLs, so clients SHOULD NOT cache the URLs beyond their short expiration.
  - To get fresh URLs to access the data you will need to call the reconstruction API again anyway.
- Cache chunks by range, not just individually.
  - If you need one chunk from a xorb it is very likely that you will need another, so cache them together.
- Caching helps when downloading similar contents; it may not be worth caching data if you are always downloading different things.
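One possible shape for such a cache, keyed by xorb hash and chunk index, is sketched below; this is purely illustrative and not part of the protocol:

```python
class XorbRangeCache:
    """Cache decompressed chunks keyed by (xorb_hash, chunk_index).

    Storing whole downloaded ranges means a later term that needs a
    neighbouring chunk from the same xorb can be served locally.
    """
    def __init__(self):
        self._chunks = {}

    def put_range(self, xorb_hash, start, chunks):
        # store each chunk of the range under its absolute chunk index
        for i, chunk in enumerate(chunks, start=start):
            self._chunks[(xorb_hash, i)] = chunk

    def get_range(self, xorb_hash, start, end):
        """Return chunks [start, end) if all are cached, else None."""
        out = []
        for i in range(start, end):
            chunk = self._chunks.get((xorb_hash, i))
            if chunk is None:
                return None
            out.append(chunk)
        return out
```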
More complex QueryReconstruction Example
Here's an example of a serialized `QueryReconstructionResponse` struct that shows how file reconstruction would work across multiple xorbs:
```json
{
  "offset_into_first_range": 0,
  "terms": [
    {
      "hash": "a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456",
      "unpacked_length": 263873,
      "range": {
        "start": 1,
        "end": 4
      }
    },
    {
      "hash": "fedcba0987654321098765432109876543210fedcba098765432109876543",
      "unpacked_length": 143890,
      "range": {
        "start": 0,
        "end": 3
      }
    },
    {
      "hash": "a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456",
      "unpacked_length": 3063572,
      "range": {
        "start": 3,
        "end": 43
      }
    }
  ],
  "fetch_info": {
    "a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456": [
      {
        "range": {
          "start": 1,
          "end": 43
        },
        "url": "https://transfer.xethub.hf.co/xorb/default/a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE%2F20130721%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20130721T201207Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=d6796aa6097c82ba7e33b4725e8396f8a9638f7c3d4b5a6b7c8d9e0f1a2b3c4d",
        "url_range": {
          "start": 57980,
          "end": 1433008
        }
      }
    ],
    "fedcba0987654321098765432109876543210fedcba098765432109876543": [
      {
        "range": {
          "start": 0,
          "end": 3
        },
        "url": "https://transfer.xethub.hf.co/xorb/default/fedcba0987654321098765432109876543210fedcba098765432109876543?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE%2F20130721%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20130721T201207Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=d6796aa6097c82ba7e33b4725e8396f8a9638f7c3d4b5a6b7c8d9e0f1a2b3c4d",
        "url_range": {
          "start": 0,
          "end": 65670
        }
      }
    ]
  }
}
```
This example shows reconstruction of a file that requires:
- Chunks `[1, 4)` from the first xorb (~264KB of unpacked data)
- Chunks `[0, 3)` from the second xorb (~144KB of unpacked data)
- Chunks `[3, 43)` from the same xorb as the first term (~3MB of unpacked data)
The `fetch_info` section provides the HTTP URLs and byte ranges needed to download the required chunk data from each xorb. The `range` values provided within `fetch_info` items and terms are always end-exclusive, i.e. `{ "start": 0, "end": 3 }` is a range of 3 chunks at indices 0, 1 and 2.
The ranges provided under a `fetch_info` item's `url_range` key are used to form the `Range` header when downloading the chunk range.
A `"url_range"` value of `{ "start": X, "end": Y }` creates a `Range` header value of `bytes=X-Y`.
When downloading and deserializing the chunks from xorb `a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456` we will have the chunks at indices `[1, 43)`.
We need only the chunks at `[1, 4)` to fulfill the first term and then the chunks at `[3, 43)` to fulfill the third term.
Note that in this example the chunk at index 3 is used twice! This is the benefit of deduplication; we only need to download the chunk content once.