license: apache-2.0
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
      - Student
      - Research Graduate
      - AI researcher
      - AI developer/engineer
      - Reporter
      - Other
  geo: ip_location
  By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox
extra_gated_description: >-
  The information you provide will be collected, stored, processed and shared in
  accordance with the [Meta Privacy
  Policy](https://www.facebook.com/privacy/policy/).
extra_gated_button_content: Submit
language:
  - en
tags:
  - <relevant tags to be included in HF filters>
Physics of Language Models 4.2: LlamaCanon Release
Our paper, Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers, uses a synthetic pretraining playground to show that the Canon layer is a powerful architectural add-on that improves language model performance on multiple fronts, possibly for every architecture (the original Transformer as well as linear models).
In this release, we provide code and pre-trained models to showcase how these findings extend to real-world pretraining. Specifically, we compare the vanilla Llama architecture with our modified LlamaCanon variant, both pretrained under the same controlled settings.
Figure 1: Quick illustration of performance vs. model size/training time.
✨Highlights of the Release
- Broad Model Availability: We release 16 base models (1B, 3B, and 8B) pretrained on the open-sourced Nemotron-CC dataset for 1T or 2T tokens.
- Controlled Experiment: In each setting, we pretrain two versions of LlamaCanon (using two learning rates) and compare them against two corresponding versions of the original Llama pretrained with identical hyperparameters. This ensures a rigorous architectural comparison.
- Performance Gain: LlamaCanon consistently surpasses Llama in all eight controlled comparisons, achieving, for instance, a 2% gain on the MMLU benchmark.
- Comparison to Open Models: Our experiments are benchmarked against open-sourced models trained on similar datasets, ensuring that we study a realistic pretraining setup rather than an artificial scenario.
⚙️Model Configurations
A quick summary of the 16 released models and their parameters is shown below:
Figure 2: Names and parameters of the released models.
🔗Links
📊Performance Metrics
The table below compares LlamaCanon with the vanilla Llama models, as well as with several open-sourced models that serve as pretraining baselines.
Figure 3: Cross-benchmark performance evaluation of the released models.
📈Training Curves
To further showcase the advantage of Canon layers throughout pretraining, we provide detailed training-time performance curves. Interactive versions and additional benchmark metrics are available in our GitHub repository.
Figure 4: MMLU accuracy vs. training tokens.
📌Model Details
- Model Type: Llama Transformer + LlamaCanon Transformer
- Language: English
- License: Apache 2.0
- Type: Base model without any instruction fine-tuning or post-training.
- Context length: 4096 tokens (+ ~50% for LlamaCanon).
  - Note: The models were pretrained with context length 4096. However, unlike traditional RoPE transformers, LlamaCanon demonstrates strong length generalization, extending to ~50% more tokens (as detailed in our paper). While long-context fine-tuning could further enhance this capability, we have deliberately avoided it to maintain a clean and controlled comparison of base-model pretraining, highlighting the effectiveness of Canon layers.
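As a rough worked example of the figures above, 4096 tokens plus about 50% comes to roughly 6144 tokens. The sketch below turns that arithmetic into a prompt budget. The helper names `approx_usable_context` and `max_prompt_tokens` are hypothetical, and the factor 1.5 is only the approximate "~50%" figure quoted in this card, not a guaranteed limit.

```python
PRETRAIN_CONTEXT = 4096      # pretraining context length stated in this card
CANON_EXTRAPOLATION = 1.5    # "~50% more tokens"; an approximate figure, not a hard bound


def approx_usable_context(is_canon: bool) -> int:
    """Approximate context window: 4096 for Llama, about 6144 for LlamaCanon."""
    if is_canon:
        return int(PRETRAIN_CONTEXT * CANON_EXTRAPOLATION)
    return PRETRAIN_CONTEXT


def max_prompt_tokens(is_canon: bool, max_new_tokens: int) -> int:
    """Prompt tokens that still leave room for max_new_tokens generated tokens."""
    budget = approx_usable_context(is_canon) - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens exceeds the approximate context window")
    return budget


print(max_prompt_tokens(is_canon=True, max_new_tokens=50))
print(max_prompt_tokens(is_canon=False, max_new_tokens=50))
```

Treat the LlamaCanon number as an estimate and verify quality on your own long inputs before relying on it.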
 
🧩Installation and Dependencies
It is highly recommended to pip install causal-conv1d for CUDA efficiency, as our implementation of Canon layers relies on depth-wise conv1d. 
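For readers unfamiliar with the term, a depth-wise causal 1D convolution mixes each channel only with its own values at the current and earlier positions. The pure-Python sketch below illustrates that operation only. It is not the released Canon layer code: the function name `depthwise_causal_conv1d`, the kernel size of 3, and the weights are illustrative assumptions, and the released implementation may differ in kernel size, residual connections, and placement.

```python
from typing import List


def depthwise_causal_conv1d(
    x: List[List[float]], weights: List[List[float]]
) -> List[List[float]]:
    """Apply a per-channel causal convolution.

    x is indexed as [position][channel]; weights as [channel][kernel_index].
    weights[c][k] scales the value of channel c found k positions in the past.
    Positions before the start of the sequence are treated as zero.
    """
    seq_len = len(x)
    channels = len(weights)
    out = [[0.0] * channels for _ in range(seq_len)]
    for t in range(seq_len):
        for c in range(channels):
            acc = 0.0
            for k, w in enumerate(weights[c]):
                if t - k >= 0:  # causal: never look at future positions
                    acc += w * x[t - k][c]
            out[t][c] = acc
    return out


# Toy input: 3 positions, 2 channels (values chosen only for illustration).
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
# Channel 0 sums the last three positions; channel 1 halves the current value.
weights = [[1.0, 1.0, 1.0], [0.5, 0.0, 0.0]]
y = depthwise_causal_conv1d(x, weights)
print(y)  # [[1.0, 5.0], [3.0, 10.0], [6.0, 15.0]]
```

Because each channel has its own small kernel and no cross-channel mixing, the operation is cheap, which is why a fused CUDA kernel such as the one in causal-conv1d helps.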
The code is tested with transformers==4.47.1 and 4.53.3 but should be compatible with many earlier versions. Ensure you enable trust_remote_code=True to download the architecture code automatically.
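Before loading a model, it can help to check the local environment against the versions mentioned above. The snippet below is a minimal sketch using only the Python standard library. The helper names `installed_version` and `environment_report` are hypothetical, and it assumes the pip package causal-conv1d is importable as `causal_conv1d`.

```python
import importlib.util
from importlib import metadata
from typing import Optional

# transformers versions this card reports as tested
TESTED_TRANSFORMERS = ("4.47.1", "4.53.3")


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> dict:
    """Summarize whether the recommended dependencies are present."""
    tf_version = installed_version("transformers")
    return {
        "transformers_version": tf_version,
        "transformers_is_tested_version": tf_version in TESTED_TRANSFORMERS,
        # Assumption: the causal-conv1d package exposes the module causal_conv1d.
        "causal_conv1d_available": importlib.util.find_spec("causal_conv1d") is not None,
    }


print(environment_report())
```

A version outside the tested pair is not necessarily a problem, since the code should be compatible with many earlier versions; the report is only a quick sanity check.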
▶️Demo
The following sample demonstrates how to use our pre-trained models:
from transformers import AutoTokenizer, AutoModelForCausalLM
# Choose any of our 16 released models
# model_name = "facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.003"
model_name = "facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.005"
# model_name = "facebook/PhysicsLM4.2__Llama-3B-Nemo-1T-lr0.003"
# The tokenizer below is simply a wrapper around the Llama2 tokenizer (for <=3B models)
#   or the Llama3 tokenizer (for 8B models); alternatively, you can download your own
#   Hugging Face Llama2/3 tokenizer and use that instead
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).cuda()
input_text = "Galileo Galilei climbed the Leaning Tower of Pisa to conduct a controlled experiment."
inputs = tokenizer(input_text, return_tensors="pt")
output_ids = model.generate(inputs['input_ids'].cuda(), max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
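When scripting over several checkpoints, it can be convenient to read the architecture, size, token budget, and learning rate back out of a model name. The sketch below does this with a regular expression. The naming pattern is inferred only from the three example names in the demo and is therefore an assumption, and `describe_model` is a hypothetical helper, not part of the released code.

```python
import re
from typing import Dict

# Pattern inferred from the three example names in the demo (an assumption).
_NAME_RE = re.compile(
    r"^facebook/PhysicsLM4\.2__(?P<arch>LlamaCanon|Llama)-(?P<size>\d+B)"
    r"-Nemo-(?P<tokens>\d+T)-lr(?P<lr>[0-9.]+)$"
)


def describe_model(model_name: str) -> Dict[str, str]:
    """Split a released model name into its components."""
    match = _NAME_RE.match(model_name)
    if match is None:
        raise ValueError(f"Unrecognized model name: {model_name}")
    info = match.groupdict()
    # Per the demo comments: Llama2 tokenizer for <=3B models, Llama3 for 8B.
    size_in_billions = int(info["size"][:-1])
    info["tokenizer_family"] = "Llama2" if size_in_billions <= 3 else "Llama3"
    return info


print(describe_model("facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.005"))
```

Names that do not fit the inferred pattern raise a `ValueError` rather than being guessed at.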
⚠️Bias, Risks, and Limitations
The models are released for research purposes only (mainly for controlled experiments comparing Llama and LlamaCanon) and are not intended for applications requiring high factual accuracy, safety-critical use cases, or medical/health contexts. The models were pretrained on open datasets and are not safety- or alignment-tuned, meaning:
- They may generate content that is factually incorrect, biased, harmful, or offensive.
- Outputs may include objectionable content even if such outcomes weren't explicitly intended.
- Users are responsible for ensuring appropriate evaluation and implementing additional filtering or safety mechanisms suitable for their specific use cases.
📖Citation
Please cite the following if you use our models or findings in your research:
@article{Allenzhu2025-canon,
  author = {{Allen-Zhu}, Zeyuan},
  title = {{Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers}},
  year = {2025},
  month = {May},
  journal = {SSRN Electronic Journal},
  note = {\url{https://ssrn.com/abstract=5240330}}
}
Note: A technical report for this release will appear under Physics of Language Models: Part 4.2. Until then, please cite the above paper. Thank you!
Additional Resources
- GitHub Repository: includes full training recipes, model configurations, and interactive plots (on all benchmarks).
 
Model Card Author
- Zeyuan Allen-Zhu
