alexmarques committed · Commit c010780 · verified · 1 Parent(s): bd807ca

Update README.md

Files changed (1)
  1. README.md +25 -16
README.md CHANGED
@@ -68,32 +68,41 @@ This optimization reduces the number of bits per parameter from 16 to 8, reducin
 ### Use with vLLM
 
- This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
 
 ```python
- from transformers import AutoTokenizer
- from vllm import LLM, SamplingParams
-
- max_model_len, tp_size = 4096, 1
- model_name = "neuralmagic/Mistral-Small-24B-Instruct-2501-FP8-Dynamic"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
- sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
-
- messages_list = [
-     [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
- ]
-
- prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]
-
- outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)
-
- generated_text = [output.outputs[0].text for output in outputs]
 print(generated_text)
 ```
 
- vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
-
 <details>
 <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
 
 ### Use with vLLM
 
+ 1. Initialize vLLM server:
+ ```
+ vllm serve RedHatAI/Mistral-Small-24B-Instruct-2501-FP8-dynamic --tensor_parallel_size 1 --tokenizer_mode mistral
+ ```
+
+ 2. Send requests to the server:
 
 ```python
+ from openai import OpenAI
+
+ # Modify OpenAI's API key and API base to use vLLM's API server.
+ openai_api_key = "EMPTY"
+ openai_api_base = "http://<your-server-host>:8000/v1"
+
+ client = OpenAI(
+     api_key=openai_api_key,
+     base_url=openai_api_base,
+ )
+
+ model = "RedHatAI/Mistral-Small-24B-Instruct-2501-FP8-dynamic"
+
+ messages = [
+     {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
+ ]
+
+ outputs = client.chat.completions.create(
+     model=model,
+     messages=messages,
+ )
+
+ generated_text = outputs.choices[0].message.content
 print(generated_text)
 ```
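The request in step 2 returns the whole completion in a single response. As an illustrative aside (not part of this commit), the sketch below first lists the model ids exposed by the step 1 server as a quick liveness check, then streams a chat completion token by token through the same OpenAI-compatible endpoint; it assumes the server is reachable at `<your-server-host>:8000`.

```python
# Illustrative sketch only, not from the upstream README.
# Assumes the `vllm serve` command from step 1 is running with its default port.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://<your-server-host>:8000/v1")
model = "RedHatAI/Mistral-Small-24B-Instruct-2501-FP8-dynamic"

# Quick check that the server is up and exposes the expected model id.
print([m.id for m in client.models.list()])

# Stream tokens as they are generated instead of waiting for the full completion.
stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Explain quantum mechanics clearly and concisely."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta; content can be None on some chunks.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```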
 
 <details>
 <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>