Update README.md

README.md CHANGED

@@ -34,7 +34,7 @@ library_name: transformers
[Large Action Models (LAMs)](https://blog.salesforceairesearch.com/large-action-models/) are advanced language models designed to enhance decision-making by translating user intentions into executable actions. As the **brains of AI agents**, LAMs autonomously plan and execute tasks to achieve specific goals, making them invaluable for automating workflows across diverse domains.

**This model release is for research purposes only.**

- The new **xLAM-2** series, built on our most advanced data synthesis, processing, and training pipelines, marks a significant leap in **multi-turn conversation** and **tool usage**. Trained with our novel APIGen-MT framework, which generates high-quality training data through simulated agent-human interactions, our models achieve state-of-the-art performance on the **BFCL** and **τ-bench** benchmarks, outperforming frontier models like GPT-4o and Claude 3.5. Notably, even our smaller models demonstrate superior capabilities in multi-turn scenarios while maintaining exceptional consistency across trials.

We've also refined the **chat template** and **vLLM integration**, making it easier to build advanced AI agents. Compared to previous xLAM models, xLAM-2 offers superior performance and seamless deployment across applications.
@@ -46,42 +46,38 @@ We've also refined the **chat template** and **vLLM integration**, making it eas
## Table of Contents
- - [Model Series](#model-series)
- [Usage](#usage)
- [Basic Usage with Huggingface Chat Template](#basic-usage-with-huggingface-chat-template)
- [Benchmark Results](#benchmark-results)
- [Citation](#citation)

---

## Model Series

The [xLAM](https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2dade4) series is significantly better at many tasks, including general tasks and function calling.
For the same number of parameters, the models have been fine-tuned across a wide range of agent tasks and scenarios, all while preserving the capabilities of the original model.
- | Model | # Total Params | Context Length | Release Date | Category | Download Model | Download GGUF files |
- |-------|----------------|----------------|--------------|----------|----------------|---------------------|
- | Llama-xLAM-2-70b-fc-r | 70B | 128k | | | | |
- | Llama-xLAM-2-8b-fc-r | 8B | 128k | | | | |
- | xLAM-2-32b-fc-r | 32B | 32k (max 128k)* | | | | |
- | xLAM-2-3b-fc-r | 3B | 32k (max 128k)* | | | | |
- | xLAM-2-1b-fc-r | 1B | 32k (max 128k)* | | | | |
- | xLAM-7b-r | 7.24B | 32k | Sep. 5, 2024 | General, Function-calling | [🤗 Link](https://huggingface.co/Salesforce/xLAM-7b-r) | -- |
- | xLAM-8x7b-r | 46.7B | 32k | Sep. 5, 2024 | General, Function-calling | [🤗 Link](https://huggingface.co/Salesforce/xLAM-8x7b-r) | -- |
- | xLAM-8x22b-r | 141B | 64k | Sep. 5, 2024 | General, Function-calling | [🤗 Link](https://huggingface.co/Salesforce/xLAM-8x22b-r) | -- |
- | xLAM-1b-fc-r | 1.35B | 16k | July 17, 2024 | Function-calling | [🤗 Link](https://huggingface.co/Salesforce/xLAM-1b-fc-r) | [🤗 Link](https://huggingface.co/Salesforce/xLAM-1b-fc-r-gguf) |
- | xLAM-7b-fc-r | 6.91B | 4k | July 17, 2024 | Function-calling | [🤗 Link](https://huggingface.co/Salesforce/xLAM-7b-fc-r) | [🤗 Link](https://huggingface.co/Salesforce/xLAM-7b-fc-r-gguf) |
- | xLAM-v0.1-r | 46.7B | 32k | Mar. 18, 2024 | General, Function-calling | [🤗 Link](https://huggingface.co/Salesforce/xLAM-v0.1-r) | -- |

***Note:** The default context length for Qwen-2.5-based models is 32k, but you can use techniques like YaRN (Yet another RoPE extensioN) to achieve a maximum context length of 128k. Please refer to [here](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct#processing-long-texts) for more details.
- ✅ All models are fully compatible with vLLM, FastChat, and Transformers-based inference frameworks.

- ---

## Usage
@@ -137,17 +133,90 @@ generated_tokens = outputs[:, input_ids_len:] # Slice the output to get only the
```python
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
```

## Benchmark Results
@@ -155,7 +224,7 @@ And then interact with the model using your preferred method for querying a vLLM
<p align="center">
<img width="80%" alt="BFCL Results" src="https://github.com/apigen-mt/apigen-mt.github.io/blob/main/img/bfcl-result.png?raw=true">
<br>
- <small><i>Performance comparison of different models on the BFCL leaderboard. The rank is based on the overall accuracy, which is a weighted average of different evaluation categories. "FC" stands for function-calling mode, in contrast to using a customized "prompt" to extract the function calls.</i></small>
</p>

### τ-bench Benchmark
@@ -194,6 +263,9 @@ If you use our model or dataset in your work, please cite our paper:
}
```

```bibtex
@article{zhang2025actionstudio,
  title={ActionStudio: A Lightweight Framework for Data and Training of Action Models},
@@ -212,8 +284,6 @@ If you use our model or dataset in your work, please cite our paper:
}

```

- Additionally, please check our other related works regarding xLAM and consider citing them as well:

```bibtex
@article{liu2024apigen,
@@ -235,4 +305,3 @@ Additionally, please check our other related works regarding xLAM and consider c
}
```
[Large Action Models (LAMs)](https://blog.salesforceairesearch.com/large-action-models/) are advanced language models designed to enhance decision-making by translating user intentions into executable actions. As the **brains of AI agents**, LAMs autonomously plan and execute tasks to achieve specific goals, making them invaluable for automating workflows across diverse domains.

**This model release is for research purposes only.**

+ The new **xLAM-2** series, built on our most advanced data synthesis, processing, and training pipelines, marks a significant leap in **multi-turn conversation** and **tool usage**. Trained with our novel APIGen-MT framework, which generates high-quality training data through simulated agent-human interactions, our models achieve state-of-the-art performance on the [**BFCL**](https://gorilla.cs.berkeley.edu/leaderboard.html) and **τ-bench** benchmarks, outperforming frontier models like GPT-4o and Claude 3.5. Notably, even our smaller models demonstrate superior capabilities in multi-turn scenarios while maintaining exceptional consistency across trials.

We've also refined the **chat template** and **vLLM integration**, making it easier to build advanced AI agents. Compared to previous xLAM models, xLAM-2 offers superior performance and seamless deployment across applications.
## Table of Contents
- [Usage](#usage)
- [Basic Usage with Huggingface Chat Template](#basic-usage-with-huggingface-chat-template)
+ - [Using vLLM for Inference](#using-vllm-for-inference)
+ - [Setup and Serving](#setup-and-serving)
+ - [Testing with OpenAI API](#testing-with-openai-api)
- [Benchmark Results](#benchmark-results)
- [Citation](#citation)
---

## Model Series

The [xLAM](https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2dade4) series is significantly better at many tasks, including general tasks and function calling.
For the same number of parameters, the models have been fine-tuned across a wide range of agent tasks and scenarios, all while preserving the capabilities of the original model.
+ | Model | # Total Params | Context Length | Category | Download Model | Download GGUF files |
+ |-------|----------------|----------------|----------|----------------|---------------------|
+ | Llama-xLAM-2-70b-fc-r | 70B | 128k | Multi-turn Conversation, Function-calling | [🤗 Link](https://huggingface.co/Salesforce/Llama-xLAM-2-70b-fc-r) | NA |
+ | Llama-xLAM-2-8b-fc-r | 8B | 128k | Multi-turn Conversation, Function-calling | [🤗 Link](https://huggingface.co/Salesforce/Llama-xLAM-2-8b-fc-r) | [🤗 Link](https://huggingface.co/Salesforce/Llama-xLAM-2-8b-fc-r-gguf) |
+ | xLAM-2-32b-fc-r | 32B | 32k (max 128k)* | Multi-turn Conversation, Function-calling | [🤗 Link](https://huggingface.co/Salesforce/xLAM-2-32b-fc-r) | NA |
+ | xLAM-2-3b-fc-r | 3B | 32k (max 128k)* | Multi-turn Conversation, Function-calling | [🤗 Link](https://huggingface.co/Salesforce/xLAM-2-3b-fc-r) | [🤗 Link](https://huggingface.co/Salesforce/xLAM-2-3b-fc-r-gguf) |
+ | xLAM-2-1b-fc-r | 1B | 32k (max 128k)* | Multi-turn Conversation, Function-calling | [🤗 Link](https://huggingface.co/Salesforce/xLAM-2-1b-fc-r) | [🤗 Link](https://huggingface.co/Salesforce/xLAM-2-1b-fc-r-gguf) |
***Note:** The default context length for Qwen-2.5-based models is 32k, but you can use techniques like YaRN (Yet another RoPE extensioN) to achieve a maximum context length of 128k. Please refer to [here](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct#processing-long-texts) for more details.

+ You can also explore our previous xLAM series [here](https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2dade4).

+ The `-fc` suffix indicates that the models are fine-tuned for **function calling** tasks, while the `-r` suffix signifies a **research** release.

+ ✅ All models are fully compatible with vLLM and Transformers-based inference frameworks.
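To make the YaRN note above concrete, here is a minimal sketch of loading one of the Qwen-2.5-based checkpoints with a longer context in Transformers. The `rope_scaling` values mirror the static-YaRN example in the linked Qwen2.5 guide; the checkpoint name is just one row from the table, and depending on your transformers version the key may be `rope_type` rather than `type`:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Salesforce/xLAM-2-3b-fc-r"  # any Qwen-2.5-based row from the table

# Static YaRN settings following the linked Qwen2.5 guidance:
# a 4x factor stretches the default 32k window toward 128k.
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```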
## Usage

```python
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
```
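The lines above are the tail of the basic-usage example, whose setup is elided from this diff. For orientation, a self-contained sketch of the pattern that tail implies, assuming a recent transformers release; the `messages` and `inputs` names stand in for whatever the elided section actually defines. Slicing at the prompt length is what makes only the newly generated tokens get decoded:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup; the elided part of the section defines the real equivalents.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/xLAM-2-1b-fc-r")
model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/xLAM-2-1b-fc-r", torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

# Slice at the prompt length so only newly generated tokens are decoded.
input_ids_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=256)
generated_tokens = outputs[:, input_ids_len:]
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
```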
+ ### Using vLLM for Inference

+ The xLAM models can also be efficiently served using vLLM for high-throughput inference. Please use `vllm>=0.6.5`, since earlier versions will cause degraded performance for Qwen-based models.

+ #### Setup and Serving

+ 1. Install vLLM with the required version:
+ ```bash
+ pip install "vllm>=0.6.5"
+ ```

+ 2. Download the tool parser plugin to your local path:
```bash
+ wget https://huggingface.co/Salesforce/xLAM-2-1b-fc-r/raw/main/xlam_tool_call_parser.py
```

+ 3. Start the OpenAI API-compatible endpoint:
+ ```bash
+ vllm serve Salesforce/xLAM-2-1b-fc-r \
+     --enable-auto-tool-choice \
+     --tool-parser-plugin ./xlam_tool_call_parser.py \
+     --tool-call-parser xlam \
+     --tensor-parallel-size 1
+ ```

+ Note: Ensure that the tool parser plugin file is downloaded and that the path specified in `--tool-parser-plugin` correctly points to your local copy of the file. The xLAM series models all utilize the **same** tool call parser, so you only need to download it **once** for all models.
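Before moving on, it can help to confirm the endpoint is actually serving. A minimal sketch, assuming the default host and port of the `vllm serve` command above:

```python
import openai

# Point the client at the local vLLM server started above.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

# The served model id should appear here if startup succeeded.
for model in client.models.list():
    print(model.id)
```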
+ #### Testing with OpenAI API

+ Here's a minimal example to test tool usage with the served endpoint:

+ ```python
+ import openai
+ import json
+ 
+ # Configure the client to use your local vLLM endpoint
+ client = openai.OpenAI(
+     base_url="http://localhost:8000/v1",  # Default vLLM server URL
+     api_key="empty"  # Can be any string
+ )
+ 
+ # Define a tool/function
+ tools = [
+     {
+         "type": "function",
+         "function": {
+             "name": "get_weather",
+             "description": "Get the current weather for a location",
+             "parameters": {
+                 "type": "object",
+                 "properties": {
+                     "location": {
+                         "type": "string",
+                         "description": "The city and state, e.g. San Francisco, CA"
+                     },
+                     "unit": {
+                         "type": "string",
+                         "enum": ["celsius", "fahrenheit"],
+                         "description": "The unit of temperature to return"
+                     }
+                 },
+                 "required": ["location"]
+             }
+         }
+     }
+ ]
+ 
+ # Create a chat completion
+ response = client.chat.completions.create(
+     model="Salesforce/xLAM-2-1b-fc-r",  # Model name doesn't matter, vLLM uses the served model
+     messages=[
+         {"role": "system", "content": "You are a helpful assistant that can use tools."},
+         {"role": "user", "content": "What's the weather like in San Francisco?"}
+     ],
+     tools=tools,
+     tool_choice="auto"
+ )
+ 
+ # Print the response
+ print("Assistant's response:")
+ print(json.dumps(response.model_dump(), indent=2))
+ ```
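If the call succeeds, the assistant message should carry a `tool_calls` entry emitted by the xLAM parser. A hedged sketch of the usual follow-up, continuing the example above (`client`, `tools`, and `response` come from it; `get_weather_impl` is a hypothetical stand-in for a real implementation): execute the requested function, return its output in a `tool` message, and let the model produce the final answer.

```python
import json

# Hypothetical stand-in; replace with a real weather lookup.
def get_weather_impl(location, unit="fahrenheit"):
    return {"location": location, "temperature": 68, "unit": unit}

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = get_weather_impl(**args)

    # Send the tool result back so the model can answer in natural language.
    follow_up = client.chat.completions.create(
        model="Salesforce/xLAM-2-1b-fc-r",
        messages=[
            {"role": "user", "content": "What's the weather like in San Francisco?"},
            message,  # assistant turn containing the tool call
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
        ],
        tools=tools,
    )
    print(follow_up.choices[0].message.content)
```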
+ For more advanced configurations and deployment options, please refer to the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).
## Benchmark Results
<p align="center">
<img width="80%" alt="BFCL Results" src="https://github.com/apigen-mt/apigen-mt.github.io/blob/main/img/bfcl-result.png?raw=true">
<br>
+ <small><i>Performance comparison of different models on the [BFCL leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html). The rank is based on the overall accuracy, which is a weighted average of different evaluation categories. "FC" stands for function-calling mode, in contrast to using a customized "prompt" to extract the function calls.</i></small>
</p>

### τ-bench Benchmark
}
```

+ Additionally, please check our other related works on the xLAM series and consider citing them as well:

```bibtex
@article{zhang2025actionstudio,
  title={ActionStudio: A Lightweight Framework for Data and Training of Action Models},
}
```

```bibtex
@article{liu2024apigen,

}
```