tense #3
opened by reach-vb (HF Staff)
app/src/content/article.mdx (+17 −17) CHANGED
|
@@ -18,7 +18,7 @@ authors:
|
|
| 18 |
affiliations:
|
| 19 |
- name: "Hugging Face"
|
| 20 |
url: "https://huggingface.co"
|
| 21 |
-
published: "October
|
| 22 |
tags: [transformers, engineering, design-philosophy]
|
| 23 |
tableOfContentsAutoCollapse: true
|
| 24 |
acknowledgements: "Special thanks to all the reviewers on this! <a href='https://huggingface.co/reach-vb'>Vaibhav Srivastav</a>, <a href='https://huggingface.co/cyrilvallez'>Cyril Vallez</a>, <a href='https://huggingface.co/yonigozlan'>Yoni Gozlan</a>, also for his excellent work on fast image processors, <a href='https://huggingface.co/ArthurZ'>Arthur Zucker</a> for his guidance, and of course the wonderful <a href='https://huggingface.co/tfrere'>Thibaud Frere</a> for designing this template and helping me out with it!<br><br>Most importantly: thanks to the entire Open-Source community, sincerely."
|
|
@@ -56,7 +56,7 @@ This scale presents a monumental engineering challenge.
|
|
| 56 |
How do you keep such a ship afloat, made of so many moving, unrelated parts, contributed to by a buzzing hivemind? Especially as the pace of ML research accelerates? We receive constant feedback on everything from function signatures with hundreds of arguments to duplicated code and optimization concerns, and we listen to all of it, or try to. The library's usage keeps on growing, and we are a small team of maintainers and contributors, backed by hundreds of open-source community members.
|
| 57 |
We continue to support all new models and expect to do so for the foreseeable future.
|
| 58 |
|
| 59 |
-
This post dissects the design philosophy that makes this possible. It's the result of an evolution from our older principles, detailed on our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page, as well as its accompanying [blog post from 2022](https://huggingface.co/blog/transformers-design-philosophy). More recently (and we strongly recommend the read) we
|
| 60 |
|
| 61 |
We formalize and articulate the "tenets" that have been guiding our development, demonstrate how they are implemented in code, and show the measurable impact they have on the library's sustainability and growth.
|
| 62 |
|
|
@@ -74,7 +74,7 @@ Conventions used throughout this post:
|
|
| 74 |
Breadcrumb boxes summarize what you just learned, connect it to the tenets, and point to what's coming <strong>Next</strong>. Think of them as narrative signposts to help you keep track.
|
| 75 |
</Note>
|
| 76 |
|
| 77 |
-
We
|
| 78 |
|
| 79 |
## The core tenets of transformers
|
| 80 |
|
|
@@ -166,7 +166,7 @@ But the LOC count kept creeping up. Each new model copied over hundreds of lines
|
|
| 166 |
|
| 167 |
We needed to separate two principles that had so far been intertwined, <Tenet term="do-repeat-yourself" display="repetition" position="top" /> and <Tenet term="one-model-one-file" display="hackability" position="top" />.
|
| 168 |
|
| 169 |
-
What
|
| 170 |
|
| 171 |
<Note variant="info">
|
| 172 |
<strong>TL;DR:</strong> Read the code in one place, <Tenet term="one-model-one-file" display="one model, one file." position="top" />. Keep semantics local (<a href="#standardize-dont-abstract">Standardize, Don't Abstract</a>). Allow strategic duplication for end users (<a href="#do-repeat-yourself">DRY*</a>). Keep the public surface minimal and stable (<a href="#minimal-user-api">Minimal API</a>, <a href="#backwards-compatibility">Backwards Compatibility</a>, <a href="#consistent-public-surface">Consistent Surface</a>).
|
|
@@ -182,7 +182,7 @@ Transformers is an opinionated library. The previous [philosophy](https://huggin
|
|
| 182 |
We amended the principle of <Tenet term="do-repeat-yourself" display="DRY*" position="top" /> by progressively removing all pieces of code that were "copied from" another file.
|
| 183 |
|
| 184 |
It works as follows. In order to contribute a model, `GLM` for instance, we define a `modular_` file that can inherit from _any function across all other modeling, configuration and processor files_ already existing in the library.
|
| 185 |
-
The modular file can use inheritance across models: and then, it
|
| 186 |
|
| 187 |
<Reference name="generated-modeling" caption="<strong>Left:</strong> Clean modular definition with inheritance. <strong>Right:</strong> Auto-expanded version with all inherited functionality visible.">
|
| 188 |
<Wide>
|
|
@@ -348,7 +348,7 @@ You can see below the difference between `GlmAttention` and `LlamaAttention`, wi
|
|
| 348 |
|
| 349 |
What is the consequence? When adding a model, we do not need to go over the entire modeling file. The modular (left side above) is enough.
|
| 350 |
|
| 351 |
-
When `AutoModel.from_pretrained(...)` is called, it is indeed the modeling (right side) that is
|
| 352 |
|
| 353 |
More importantly, the auto-generated modeling file is what users _read_ to understand the code, what they step through in their debuggers and what they hack for their needs.
|
| 354 |
|
|
@@ -383,7 +383,7 @@ We recently underwent a deep refactor of the attention implementation. You've li
|
|
| 383 |
|
| 384 |
The _attention computation_ itself happens at a _lower_ level of abstraction than the model itself.
|
| 385 |
|
| 386 |
-
However, we were adding specific torch operations for each backend (sdpa, the several flash-attention iterations, flex attention) but it
|
| 387 |
|
| 388 |
<Note variant="info">
|
| 389 |
Evidence: effective (i.e., maintainable) LOC growth drops ~15× when counting shards instead of expanded modeling files. Less code to read, fewer places to break.
|
|
@@ -411,7 +411,7 @@ Another strength of the new attention interface is the possibility to enforce sp
|
|
| 411 |
|
| 412 |
Backend integrations sometimes require specific kwargs.
|
| 413 |
|
| 414 |
-
We know that kwargs are often a necessary evil that plagues tools with widespread compatibility; and it is something we have aimed to reduce, and
|
| 415 |
|
| 416 |
We reduce that surface and document expectations; where flexibility is necessary, we plan to use `typing.Annotated` to convey shapes and invariants without constraining integrations. Such an implementation could look like this in the future:
|
| 417 |
|
|
@@ -545,9 +545,9 @@ Models define semantics; kernels define how to run them faster. Use decorations
|
|
| 545 |
|
| 546 |
With `modular` transformers, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into the boundaries of engineering if this effort is made, and we're striving for it.
|
| 547 |
It's hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions.
|
| 548 |
-
So I
|
| 549 |
|
| 550 |
-
To get this graph, I
|
| 551 |
1. Does this model have a `modular` file?
|
| 552 |
2. In this `modular` file, what models, configurations and processings are imported?
|
| 553 |
3. Recurse through the model list that way.
|
|
@@ -594,7 +594,7 @@ Llama-lineage is a hub; several VLMs remain islands — engineering opportunity
|
|
| 594 |
|
| 595 |
### Many models, but not enough yet, are alike
|
| 596 |
|
| 597 |
-
I
|
| 598 |
|
| 599 |
It is interesting, for our comparison, to look at _when_ we deployed the modular logic and what its ripple effect on the library was. Looking at the timeline makes it obvious: adding modular allowed us to connect more and more models to solid reference points.
|
| 600 |
|
|
@@ -691,7 +691,7 @@ The following [Pull request to standardize placeholder masking](https://github.c
|
|
| 691 |
return special_image_mask, special_video_mask
|
| 692 |
```
|
| 693 |
|
| 694 |
-
But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It
|
| 695 |
|
| 696 |
What do we conclude? Going forward, we should aim for VLMs to have a form of centrality similar to that of `Llama` for text-only models. This centrality should not be achieved at the cost of abstracting and hiding away crucial inner workings of said models.
|
| 697 |
|
|
@@ -705,7 +705,7 @@ Keep VLM embedding mix in the modeling file (semantics), standardize safe helper
|
|
| 705 |
|
| 706 |
Deciding to become a `torch`-first library meant shedding a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more lenient about the amount of torch-dependent utilities we accept. One of these is the _fast processing_ of images. Where inputs were once minimally assumed to be ndarrays, enforcing native `torch` and `torchvision` inputs allowed us to massively improve processing speed for each model.
|
| 707 |
|
| 708 |
-
The gains in performance are immense, up to 20x speedup for most models when using compiled torchvision ops. Furthermore,
|
| 709 |
|
| 710 |
<Image src={fastImageProcessors} alt="Fast Image Processors Performance" zoomable caption="<strong>Figure 9:</strong> Performance gains of fast image processors, up to 20x acceleration with compiled torchvision." />
|
| 711 |
|
|
@@ -798,7 +798,7 @@ Forward interception and nested JSON logging align ports to reference implementa
|
|
| 798 |
|
| 799 |
### Cooking faster CUDA warmups
|
| 800 |
|
| 801 |
-
Having a clean _external_ API allows us to work on the <Tenet term="code-is-product" display="true inner workings of transformers" position="top" />. One of a few recent additions
|
| 802 |
|
| 803 |
<Wide>
|
| 804 |
<HtmlEmbed title="Mem allocation patterns during model loading" align="center" src="transformers/warmup_demo.html" />
|
|
@@ -816,7 +816,7 @@ Pre-allocating GPU memory removes malloc spikes (e.g., 7× for 8B, 6× for 32B i
|
|
| 816 |
|
| 817 |
### Transformers-serve and continuous batching
|
| 818 |
|
| 819 |
-
Having all these models readily available and sharing the same interface
|
| 820 |
|
| 821 |
```bash
|
| 822 |
transformers serve
|
|
@@ -845,7 +845,7 @@ OpenAI-compatible surface + continuous batching; kernels/backends slot in becaus
|
|
| 845 |
|
| 846 |
## Community reusability
|
| 847 |
|
| 848 |
-
The transformers-serve CLI built on transformers, for sure, but the library is made first and foremost to be _reused_ at large by the open-source ecosystem.
|
| 849 |
|
| 850 |
Adding a model to transformers means:
|
| 851 |
|
|
@@ -869,6 +869,6 @@ The next major version of `transformers` is just around the corner (and will hav
|
|
| 869 |
|
| 870 |
We will lean further into a modular toolbox, not a framework. You should not be forced to rewrite modeling code. It's better when a model can inherit from `PreTrainedModel` and opt into Tensor Parallel, `from_pretrained`, sharding, `push_to_hub`, loss plumbing, and external stacks like PEFT/TRL/SGLang/vLLM.
|
| 871 |
|
| 872 |
-
We
|
| 873 |
|
| 874 |
This is a living document, not a stone tablet. Tell us where these tenets fall short or should evolve next. We’ll keep working, and we'll be here to share the journey with you all.
|
|
|
|
| 18 |
affiliations:
|
| 19 |
- name: "Hugging Face"
|
| 20 |
url: "https://huggingface.co"
|
| 21 |
+
published: "October 6, 2025"
|
| 22 |
tags: [transformers, engineering, design-philosophy]
|
| 23 |
tableOfContentsAutoCollapse: true
|
| 24 |
acknowledgements: "Special thanks to all the reviewers on this! <a href='https://huggingface.co/reach-vb'>Vaibhav Srivastav</a>, <a href='https://huggingface.co/cyrilvallez'>Cyril Vallez</a>, <a href='https://huggingface.co/yonigozlan'>Yoni Gozlan</a>, also for his excellent work on fast image processors, <a href='https://huggingface.co/ArthurZ'>Arthur Zucker</a> for his guidance, and of course the wonderful <a href='https://huggingface.co/tfrere'>Thibaud Frere</a> for designing this template and helping me out with it!<br><br>Most importantly: thanks to the entire Open-Source community, sincerely."
|
|
|
|
| 56 |
How do you keep such a ship afloat, made of so many moving, unrelated parts, contributed to by a buzzing hivemind? Especially as the pace of ML research accelerates? We receive constant feedback on everything from function signatures with hundreds of arguments to duplicated code and optimization concerns, and we listen to all of it, or try to. The library's usage keeps on growing, and we are a small team of maintainers and contributors, backed by hundreds of open-source community members.
|
| 57 |
We continue to support all new models and expect to do so for the foreseeable future.
|
| 58 |
|
| 59 |
+
This post dissects the design philosophy that makes this possible. It's the result of an evolution from our older principles, detailed on our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page, as well as its accompanying [blog post from 2022](https://huggingface.co/blog/transformers-design-philosophy). More recently (and we strongly recommend the read), we published a blog post about [recent upgrades to transformers](https://huggingface.co/blog/faster-transformers), focusing on what makes the library faster today. All of these developments are only made possible thanks to these principles.
|
| 60 |
|
| 61 |
We formalize and articulate the "tenets" that have been guiding our development, demonstrate how they are implemented in code, and show the measurable impact they have on the library's sustainability and growth.
|
| 62 |
|
|
|
|
| 74 |
Breadcrumb boxes summarize what you just learned, connect it to the tenets, and point to what's coming <strong>Next</strong>. Think of them as narrative signposts to help you keep track.
|
| 75 |
</Note>
|
| 76 |
|
| 77 |
+
We start by enumerating the tenets, then look at concrete examples that show how they shape our decision-making. These examples are necessarily detailed, and sometimes complex, because they illustrate the challenges of maintaining and growing a large codebase that caters to multiple collectives, serves millions of users, counts hundreds of contributors, and always strives for simplicity and consistency.
|
| 78 |
|
| 79 |
## The core tenets of transformers
|
| 80 |
|
|
|
|
| 166 |
|
| 167 |
We needed to separate two principles that had so far been intertwined, <Tenet term="do-repeat-yourself" display="repetition" position="top" /> and <Tenet term="one-model-one-file" display="hackability" position="top" />.
|
| 168 |
|
| 169 |
+
What is the solution to this? Let's talk about modular transformers.
|
| 170 |
|
| 171 |
<Note variant="info">
|
| 172 |
<strong>TL;DR:</strong> Read the code in one place, <Tenet term="one-model-one-file" display="one model, one file." position="top" />. Keep semantics local (<a href="#standardize-dont-abstract">Standardize, Don't Abstract</a>). Allow strategic duplication for end users (<a href="#do-repeat-yourself">DRY*</a>). Keep the public surface minimal and stable (<a href="#minimal-user-api">Minimal API</a>, <a href="#backwards-compatibility">Backwards Compatibility</a>, <a href="#consistent-public-surface">Consistent Surface</a>).
|
|
|
|
| 182 |
We amended the principle of <Tenet term="do-repeat-yourself" display="DRY*" position="top" /> by progressively removing all pieces of code that were "copied from" another file.
|
| 183 |
|
| 184 |
It works as follows. In order to contribute a model, `GLM` for instance, we define a `modular_` file that can inherit from _any function across all other modeling, configuration and processor files_ already existing in the library.
|
| 185 |
+
The modular file can use inheritance across models, and it is then unravelled into a fully functional, standalone modeling file.
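To make this concrete, here is a rough, illustrative sketch of what such a `modular_` file can look like (the class names mirror the real GLM/Llama pattern, but the body is a simplified assumption rather than the exact file shipped in the library):

```python
# modular_glm.py -- illustrative sketch, not the actual file in the repository.
# Classes inherit directly from another model's implementation; the converter then
# "unravels" this into a standalone modeling_glm.py with everything inlined.
from transformers.models.llama.modeling_llama import LlamaAttention, LlamaForCausalLM


class GlmAttention(LlamaAttention):
    # Reuse Llama's attention wholesale; only actual deviations would be overridden here.
    pass


class GlmForCausalLM(LlamaForCausalLM):
    # Same for the task head: the forward pass and generation plumbing come from Llama.
    pass
```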
|
| 186 |
|
| 187 |
<Reference name="generated-modeling" caption="<strong>Left:</strong> Clean modular definition with inheritance. <strong>Right:</strong> Auto-expanded version with all inherited functionality visible.">
|
| 188 |
<Wide>
|
|
|
|
| 348 |
|
| 349 |
What is the consequence? When adding a model, we do not need to go over the entire modeling file. The modular (left side above) is enough.
|
| 350 |
|
| 351 |
+
When `AutoModel.from_pretrained(...)` is called, it is indeed the modeling code (right side) that runs, and it is the modeling code that all the tests run against.
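As a quick, hedged illustration (the checkpoint id is a placeholder), you can verify that the object you load is defined in the expanded modeling file, which is also the code the tests exercise:

```python
# The loaded model comes from the generated modeling_glm.py, not from the modular shard.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/your-glm-checkpoint")  # placeholder checkpoint id
print(type(model).__module__)  # expected to point at transformers.models.glm.modeling_glm
```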
|
| 352 |
|
| 353 |
More importantly, the auto-generated modeling file is what users _read_ to understand the code, what they step through in their debuggers and what they hack for their needs.
|
| 354 |
|
|
|
|
| 383 |
|
| 384 |
The _attention computation_ itself happens at a _lower_ level of abstraction than the model itself.
|
| 385 |
|
| 386 |
+
However, we were adding specific torch operations for each backend (SDPA, the several flash-attention iterations, flex attention), and that is not a <Tenet term="minimal-user-api" display="minimal user api" position="top" />. The next section explains what we do instead.
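For orientation before that section, the user-visible outcome (shown here as a hedged example with a placeholder checkpoint) is that the backend is selected with a single argument at load time, while the modeling code stays backend-agnostic:

```python
# The attention backend is a loading-time choice, not a per-model code path.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-llama-checkpoint",  # placeholder checkpoint id
    attn_implementation="sdpa",        # or "flash_attention_2", "flex_attention", "eager"
)
```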
|
| 387 |
|
| 388 |
<Note variant="info">
|
| 389 |
Evidence: effective (i.e., maintainable) LOC growth drops ~15× when counting shards instead of expanded modeling files. Less code to read, fewer places to break.
|
|
|
|
| 411 |
|
| 412 |
Backend integrations sometimes require specific kwargs.
|
| 413 |
|
| 414 |
+
We know that kwargs are often a necessary evil in tools that aim for widespread compatibility; they are something we have aimed to reduce, and continue to reduce, in order to improve readability. Even with them, the current system remains a <Tenet term="minimal-user-api" display="minimal user api" position="top" />.
|
| 415 |
|
| 416 |
We reduce that surface and document expectations; where flexibility is necessary, we plan to use `typing.Annotated` to convey shapes and invariants without constraining integrations. Such an implementation could look like this in the future:
|
| 417 |
|
|
|
|
| 545 |
|
| 546 |
With `modular` transformers, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into the boundaries of engineering if this effort is made, and we're striving for it.
|
| 547 |
It's hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions.
|
| 548 |
+
So I want to take a look at the current **state of modularity** across the repository. How many models are defined using components of others?
|
| 549 |
|
| 550 |
+
To get this graph, I use modular inheritance as a heuristic; a rough code sketch of the traversal follows the list below.
|
| 551 |
1. Does this model have a `modular` file?
|
| 552 |
2. In this `modular` file, what models, configurations and processings are imported?
|
| 553 |
3. Recurse through the model list that way.
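Here is that rough sketch, under stated assumptions (a local checkout of the repository and naive handling of import paths); it illustrates the heuristic, not the exact script behind the figure:

```python
# Assumption-laden sketch: walk modular_*.py files and record which other model
# folders they import from, then assemble the dependency graph.
import ast
import re
from pathlib import Path

MODELS_DIR = Path("src/transformers/models")  # assumes a local clone of transformers


def modular_dependencies(model_name: str) -> set[str]:
    """Return the set of other model folders a modular file imports from."""
    modular_file = MODELS_DIR / model_name / f"modular_{model_name}.py"
    if not modular_file.exists():
        return set()
    tree = ast.parse(modular_file.read_text())
    deps = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module:
            # e.g. "transformers.models.llama.modeling_llama" -> "llama"
            match = re.search(r"models\.([a-z0-9_]+)\.", node.module + ".")
            if match and match.group(1) != model_name:
                deps.add(match.group(1))
    return deps


# Recurse over every model folder to build the graph of modular inheritance.
graph = {p.name: modular_dependencies(p.name) for p in MODELS_DIR.iterdir() if p.is_dir()}
```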
|
|
|
|
| 594 |
|
| 595 |
### Many models, but not enough yet, are alike
|
| 596 |
|
| 597 |
+
To find similarities across models, I look at the Jaccard index, which measures the overlap between two sets. Code is, of course, more than a set of characters strung together; we also tried code-embedding models, which rank candidates better in practice, but for this post we stick to the deterministic Jaccard index.
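For reference, the measure itself is tiny; a minimal sketch (naive token-level splitting, purely illustrative) looks like this:

```python
# Jaccard index between two source files: |A ∩ B| / |A ∪ B| over naive token sets.
import re


def jaccard_similarity(source_a: str, source_b: str) -> float:
    tokens_a = set(re.findall(r"\w+", source_a))
    tokens_b = set(re.findall(r"\w+", source_b))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# e.g. jaccard_similarity(open("modeling_llama.py").read(), open("modeling_glm.py").read())
```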
|
| 598 |
|
| 599 |
It is interesting, for our comparison, to look at _when_ we deployed the modular logic and what its ripple effect on the library was. Looking at the timeline makes it obvious: adding modular allowed us to connect more and more models to solid reference points.
|
| 600 |
|
|
|
|
| 691 |
return special_image_mask, special_video_mask
|
| 692 |
```
|
| 693 |
|
| 694 |
+
But this lives _within_ the modeling file, not in the `PreTrainedModel` base class. We do not move it out of the modeling file, because doing so would break the <Tenet term="one-model-one-file" display="One model, one file tenet." position="top" />
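For readers who want to see how such a mask is typically consumed, here is a simplified, hedged sketch of the common pattern (names and shapes are illustrative, not lifted from a specific modeling file):

```python
# Simplified pattern: scatter projected image features into the text embedding
# sequence at the placeholder positions marked by the mask.
import torch


def merge_image_features(
    inputs_embeds: torch.Tensor,       # (batch, seq_len, hidden)
    image_features: torch.Tensor,      # (num_image_tokens, hidden)
    special_image_mask: torch.Tensor,  # boolean, broadcastable to inputs_embeds
) -> torch.Tensor:
    image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
    return inputs_embeds.masked_scatter(special_image_mask, image_features)
```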
|
| 695 |
|
| 696 |
What do we conclude? Going forward, we should aim for VLMs to have a form of centrality similar to that of `Llama` for text-only models. This centrality should not be achieved at the cost of abstracting and hiding away crucial inner workings of said models.
|
| 697 |
|
|
|
|
| 705 |
|
| 706 |
Deciding to become a `torch`-first library meant shedding a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more lenient about the amount of torch-dependent utilities we accept. One of these is the _fast processing_ of images. Where inputs were once minimally assumed to be ndarrays, enforcing native `torch` and `torchvision` inputs allowed us to massively improve processing speed for each model.
|
| 707 |
|
| 708 |
+
The performance gains are immense: up to a 20x speedup for most models when using compiled torchvision ops. Furthermore, fast processors let us run the whole pipeline solely on the GPU.
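As a usage sketch (the checkpoint id is a placeholder, and the `device` argument assumes a CUDA-capable machine), the fast path is opted into when loading the processor, and the outputs come back as `torch` tensors:

```python
# Fast image processors are backed by torchvision ops and return torch tensors.
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained(
    "your-org/your-vision-checkpoint",  # placeholder checkpoint id
    use_fast=True,                      # select the torchvision-backed fast processor
)
inputs = processor(images=Image.open("cat.png"), return_tensors="pt", device="cuda")
```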
|
| 709 |
|
| 710 |
<Image src={fastImageProcessors} alt="Fast Image Processors Performance" zoomable caption="<strong>Figure 9:</strong> Performance gains of fast image processors, up to 20x acceleration with compiled torchvision." />
|
| 711 |
|
|
|
|
| 798 |
|
| 799 |
### Cooking faster CUDA warmups
|
| 800 |
|
| 801 |
+
Having a clean _external_ API allows us to work on the <Tenet term="code-is-product" display="true inner workings of transformers" position="top" />. One such recent addition is the _CUDA warmup_ via `caching_allocator_warmup`, which dramatically improves loading times by pre-allocating GPU memory and avoiding malloc bottlenecks during model loading. It can achieve a 7x loading speedup for an 8B model, or 6x for a 32B one, as you can check in [the PR](https://github.com/huggingface/transformers/pull/36380)!
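To give an intuition for why this helps, here is a conceptual sketch (not the actual `caching_allocator_warmup` implementation): touching the allocator once up front means later per-tensor allocations during weight loading are served from PyTorch's caching allocator instead of repeated `cudaMalloc` calls.

```python
# Conceptual illustration only: reserve roughly the memory the checkpoint will need,
# then release it. The bytes stay in torch's caching allocator, so subsequent weight
# loads reuse cached blocks instead of triggering malloc spikes.
import torch


def naive_allocator_warmup(expected_bytes: int, device: str = "cuda:0") -> None:
    warmup = torch.empty(expected_bytes, dtype=torch.uint8, device=device)
    del warmup  # memory is returned to the caching allocator, ready for reuse
```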
|
| 802 |
|
| 803 |
<Wide>
|
| 804 |
<HtmlEmbed title="Mem allocation patterns during model loading" align="center" src="transformers/warmup_demo.html" />
|
|
|
|
| 816 |
|
| 817 |
### Transformers-serve and continuous batching
|
| 818 |
|
| 819 |
+
Having all these models readily available and sharing the same interface allows us to implement transformers-serve, a CLI tool that exposes models through a standard, OpenAI-compatible HTTP API.
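Because the surface is OpenAI-compatible, any standard client can talk to it. As a hedged sketch (the default host and port are assumptions, and the model id is a placeholder), once the server shown below is running:

```python
# Point the standard OpenAI Python client at the local `transformers serve` endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="your-org/your-chat-model",  # placeholder model id the server can load
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```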
|
| 820 |
|
| 821 |
```bash
|
| 822 |
transformers serve
|
|
|
|
| 845 |
|
| 846 |
## Community reusability
|
| 847 |
|
| 848 |
+
The transformers-serve CLI is built on transformers, for sure, but the library is made first and foremost to be _reused_ at large by the open-source ecosystem.
|
| 849 |
|
| 850 |
Adding a model to transformers means:
|
| 851 |
|
|
|
|
| 869 |
|
| 870 |
We will lean further into a modular toolbox, not a framework. You should not be forced to rewrite modeling code. It's better when a model can inherit from `PreTrainedModel` and opt into Tensor Parallel, `from_pretrained`, sharding, `push_to_hub`, loss plumbing, and external stacks like PEFT/TRL/SGLang/vLLM.
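As a minimal, hedged sketch of what "opting in" looks like (the config fields and the toy module are invented for illustration), subclassing the base classes is enough to pick up `save_pretrained`, `from_pretrained`, and `push_to_hub`:

```python
# Toy model: inheriting from PretrainedConfig / PreTrainedModel opts it into
# serialization, loading and Hub integration without rewriting any plumbing.
import torch.nn as nn
from transformers import PretrainedConfig, PreTrainedModel


class TinyConfig(PretrainedConfig):
    model_type = "tiny"  # hypothetical model type

    def __init__(self, hidden_size: int = 64, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size


class TinyModel(PreTrainedModel):
    config_class = TinyConfig

    def __init__(self, config: TinyConfig):
        super().__init__(config)
        self.proj = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states):
        return self.proj(hidden_states)


model = TinyModel(TinyConfig())
model.save_pretrained("tiny-model")                 # serialization comes for free
reloaded = TinyModel.from_pretrained("tiny-model")  # ...and so does loading
```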
|
| 871 |
|
| 872 |
+
We write this to make our design philosophy legible. Transformers is built by thousands of contributors, but it only stays usable if its core principles are explicit and upheld. These tenets are our pact with you: they ensure that whether you are shipping a new model, contributing an optimized kernel, or simply debugging a forward pass, the code remains transparent and hackable.
|
| 873 |
|
| 874 |
This is a living document, not a stone tablet. Tell us where these tenets fall short or should evolve next. We’ll keep working, and we'll be here to share the journey with you all.
|