Jan Kessler committed on
Commit 4567c5f · 1 Parent(s): 59a9145

add proper README

Files changed (2)
  1. LICENSE +0 -0
  2. README.md +36 -1
LICENSE DELETED
File without changes
README.md CHANGED
@@ -1,5 +1,40 @@
  ---
  license: other
  license_name: nvidia-open-model-license
- license_link: LICENSE
+ license_link: >-
+   https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
  ---
+
+ # Llama-3.3-Nemotron-Super-49B-v1-FP8-Dynamic
+
+ FP8-Dynamic quantization of https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1
+
+ Created with llmcompressor using the following code:
+
+ ```
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from llmcompressor.transformers import oneshot
+ from llmcompressor.modifiers.quantization import QuantizationModifier
+
+ MODEL_ID = "/models/Llama-3_3-Nemotron-Super-49B-v1"
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_ID, device_map="auto", torch_dtype="auto", trust_remote_code=True,
+ )
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+ # Configure the simple PTQ quantization recipe.
+ recipe = QuantizationModifier(
+     targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
+
+ # Apply the quantization algorithm.
+ oneshot(model=model, recipe=recipe, trust_remote_code_model=True)
+
+ # Save the quantized model and tokenizer.
+ SAVE_DIR = MODEL_ID + "-FP8-Dynamic"
+ model.save_pretrained(SAVE_DIR)
+ tokenizer.save_pretrained(SAVE_DIR)
+ ```
+
+ To run it with vLLM, use the latest version (0.8.2 as of now) and apply the following PR: https://github.com/vllm-project/vllm/pull/15008
+
+ Make sure to read the original model's README for further guidance.
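
For intuition on what the `FP8_DYNAMIC` scheme in the recipe above does, here is a toy, dependency-free sketch of per-token (dynamic) scaling. This is an illustration only, not llmcompressor's actual implementation: the only FP8-specific fact used is that the largest finite float8_e4m3 value is 448, and integer rounding stands in for the real FP8 cast.

```python
# Toy illustration of FP8 "dynamic" quantization: each row (token) gets its
# own scale, computed at runtime from that row's max absolute value, so no
# calibration data is needed. NOT llmcompressor's kernels -- just the idea.
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3


def quantize_row(row):
    """Scale a row into FP8 range and round; return (quantized, scale)."""
    amax = max(abs(x) for x in row)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = [round(x / scale) for x in row]  # rounding stands in for the FP8 cast
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate original values from quantized ints and scale."""
    return [v * scale for v in q]


row = [0.1, -2.0, 0.5, 0.448]
q, scale = quantize_row(row)
restored = dequantize_row(q, scale)
# restored approximates row; the error comes from the coarse rounding step.
```

Because the scale is derived from each row at inference time, no calibration dataset is required, which is why the `oneshot` call above runs without one.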