metadata

license: apache-2.0
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
      - Student
      - Research Graduate
      - AI researcher
      - AI developer/engineer
      - Reporter
      - Other
  geo: ip_location
  By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox
extra_gated_description: >-
  The information you provide will be collected, stored, processed and shared in
  accordance with the [Meta Privacy
  Policy](https://www.facebook.com/privacy/policy/).
extra_gated_button_content: Submit
language:
  - en
tags:
  - <relevant tags to be included in HF filters>

Physics of Language Models 4.2: LlamaCanon Release

Our paper, Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers, uses a synthetic pretraining playground to show that the Canon layer is a powerful architectural add-on that improves language model performance on multiple fronts, possibly for every architecture (original Transformers as well as linear models).

In this release, we provide code and pretrained models to show how these findings extend to real-world pretraining. Specifically, we compare the vanilla Llama architecture with our modified LlamaCanon variant, both pretrained under the same controlled settings.

Figure 1: Quick illustration of performance vs. model size/training time.

✨Highlights of the Release

- Broad Model Availability: We release 16 base models (1B, 3B, and 8B) pretrained on the open-sourced Nemotron-CC dataset for 1T or 2T tokens.
- Controlled Experiment: In each setting, we pretrain two versions of LlamaCanon (one per learning rate) and compare each against the corresponding original Llama pretrained with identical hyperparameters. This ensures a rigorous architectural comparison.
- Performance Gain: LlamaCanon surpasses Llama in all eight controlled comparisons, achieving, for instance, a 2% gain on the MMLU benchmark.
- Comparison to Open Models: We benchmark our models against open-sourced models trained on similar datasets, so that the study reflects a realistic pretraining setup rather than an artificial scenario.
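The controlled pairing described above can be sketched in a few lines of Python. This is only an illustration: the name pattern is inferred from the three model names that appear in the demo further down, and the helpers `parse_release_name` and `matched_baseline` are hypothetical, not part of the released code. It also assumes that each LlamaCanon model has a Llama counterpart whose name differs only in the architecture prefix.

```python
import re
from typing import NamedTuple


class ReleaseName(NamedTuple):
    arch: str    # "Llama" or "LlamaCanon"
    size: str    # e.g. "1B"
    tokens: str  # pretraining token budget, e.g. "2T"
    lr: str      # learning rate as written in the name, e.g. "0.005"


# Pattern inferred from the names shown in the demo; an assumption, not an official spec.
_NAME_RE = re.compile(
    r"^facebook/PhysicsLM4\.2__(LlamaCanon|Llama)-(\d+B)-Nemo-(\d+T)-lr([0-9.]+)$"
)


def parse_release_name(name: str) -> ReleaseName:
    """Split a released model name into architecture, size, tokens, and learning rate."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name}")
    return ReleaseName(*match.groups())


def matched_baseline(name: str) -> str:
    """Build the Llama name with the same size, token budget, and learning rate."""
    parsed = parse_release_name(name)
    return (
        f"facebook/PhysicsLM4.2__Llama-{parsed.size}"
        f"-Nemo-{parsed.tokens}-lr{parsed.lr}"
    )
```

Calling `matched_baseline` on a LlamaCanon name gives the candidate Llama name to compare against; check it against the released model list before use, since the pairing rule here is an assumption.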

⚙️Model Configurations

The figure below summarizes the 16 released models and their parameters:

Figure 2: Names and parameters of the released models.

🔗Links

📊Performance Metrics

The table below compares LlamaCanon with the vanilla Llama models, as well as with several open-sourced models trained on similar datasets.

Figure 3: Cross-benchmark performance evaluation of the released models.

📈Training Curves

To show the advantage of Canon layers throughout the pretraining process, we provide detailed training-time performance curves. Interactive versions and additional benchmark metrics are available in our GitHub repository.

Figure 4: MMLU accuracy vs. training tokens.

📌Model Details

- Model Type: Llama Transformer + LlamaCanon Transformer
- Language: English
- License: Apache 2.0
- Type: Base model without any instruction fine-tuning or post-training.
- Context length: 4096 tokens (+ ~50% for LlamaCanon).
  - Note: The models were pretrained with context length 4096. However, unlike traditional RoPE transformers, LlamaCanon demonstrates strong length generalization, extending to ~50% more tokens (as detailed in our paper). While long-context fine-tuning could further enhance this capability, we have deliberately avoided it to maintain a clean and controlled comparison of base-model pretraining, highlighting the effectiveness of Canon layers.
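As a rough illustration of what these context figures mean when budgeting a prompt, the sketch below turns them into a token budget. The constant names and both helper functions are hypothetical, and the 6144-token figure is simply 4096 scaled by the approximate 50% quoted above, not a measured or guaranteed limit.

```python
PRETRAIN_CONTEXT = 4096      # context length used during pretraining (from the model card)
CANON_EXTRA_FRACTION = 0.5   # "~50% more tokens" for LlamaCanon; approximate, not a hard limit


def usable_context(arch: str) -> int:
    """Return an approximate usable context length for "Llama" or "LlamaCanon"."""
    if arch == "LlamaCanon":
        return int(PRETRAIN_CONTEXT * (1 + CANON_EXTRA_FRACTION))
    return PRETRAIN_CONTEXT


def max_new_tokens_for(prompt_tokens: int, arch: str) -> int:
    """Return how many tokens can still be generated after a prompt of the given length."""
    return max(0, usable_context(arch) - prompt_tokens)
```

For example, a 4000-token prompt leaves 96 tokens for a vanilla Llama model under this estimate, and roughly 2144 for LlamaCanon; quality near the upper end should be checked empirically.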

🧩Installation and Dependencies

We highly recommend installing causal-conv1d (pip install causal-conv1d) for CUDA efficiency, since our implementation of Canon layers relies on depth-wise conv1d. The code is tested with transformers==4.47.1 and 4.53.3 and should be compatible with many earlier versions. Pass trust_remote_code=True when loading a model so that the architecture code is downloaded automatically.
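Before loading a model, it can help to confirm what is actually installed. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it assumes the causal-conv1d package is importable as `causal_conv1d`.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def has_module(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> dict:
    """Collect the facts relevant to the installation notes above."""
    return {
        "transformers_version": installed_version("transformers"),
        # Assumed import name for the causal-conv1d package.
        "has_causal_conv1d": has_module("causal_conv1d"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

A missing `causal_conv1d` is not necessarily fatal, since the package is described above as a recommendation for CUDA efficiency rather than a hard requirement.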

▶️Demo

The following sample shows how to load one of our pretrained models and generate text:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Choose any of our 16 released models
# model_name = "facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.003"
model_name = "facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.005"
# model_name = "facebook/PhysicsLM4.2__Llama-3B-Nemo-1T-lr0.003"

# The tokenizer below is simply a wrapper around the Llama2 tokenizer (for <=3B models)
#   or the Llama3 tokenizer (for 8B models); alternatively, you can download your own
#   Hugging Face Llama2/3 tokenizer and use that instead
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).cuda()

input_text = "Galileo Galilei climbed the Leaning Tower of Pisa to conduct a controlled experiment."
inputs = tokenizer(input_text, return_tensors="pt")
output_ids = model.generate(inputs['input_ids'].cuda(), max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

⚠️Bias, Risks, and Limitations

The models are released for research purposes only (mainly for controlled experiments comparing Llama and LlamaCanon) and are not intended for applications requiring high factual accuracy, safety-critical use cases, or medical/health contexts. The models were pretrained on open datasets and are not safety- or alignment-tuned, meaning:

- They may generate content that is factually incorrect, biased, harmful, or offensive.
- Outputs may include objectionable content even if such outcomes weren't explicitly intended.
- Users are responsible for ensuring appropriate evaluation and implementing additional filtering or safety mechanisms suitable for their specific use cases.

📖Citation

Please cite the following if you use our models or findings in your research:

@article{Allenzhu2025-canon,
  author = {{Allen-Zhu}, Zeyuan},
  title = {{Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers}},
  year = {2025},
  month = {May},
  journal = {SSRN Electronic Journal},
  note = {\url{https://ssrn.com/abstract=5240330}}
}

Note: A technical report for this release will appear under Physics of Language Models: Part 4.2. Until then, please cite the above paper. Thank you!

Additional Resources

The GitHub repository includes:
- Full training recipes, model configurations, and interactive plots (on all benchmarks).

Model Card Author

Zeyuan Allen-Zhu