Update README.md
README.md
---
license: apache-2.0
pipeline_tag: text-generation
tags:
- ONNX
- DML
- ONNXRuntime
- mistral
- conversational
- custom_code
inference: false
---

# Mistral-7B-Instruct-v0.2 ONNX models

<!-- Provide a quick summary of what the model is/does. -->
This repository hosts the optimized versions of [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) to accelerate inference with ONNX Runtime.

The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an instruct fine-tuned version of Mistral-7B-v0.2.

Optimized Mistral models are published here in [ONNX](https://onnx.ai) format to run with [ONNX Runtime](https://onnxruntime.ai/) on CPU and GPU across devices, including server platforms and Windows, Linux, and Mac desktops, with the precision best suited to each of these targets.

[DirectML](https://aka.ms/directml) support lets developers bring hardware acceleration to Windows devices at scale across AMD, Intel, and NVIDIA GPUs. Along with DirectML, ONNX Runtime provides cross-platform support for Mistral across a range of CPU and GPU devices.

To easily get started with Mistral, you can use [Olive](https://github.com/microsoft/Olive), our easy-to-use, hardware-aware model optimization tool. See [here](https://github.com/microsoft/Olive/tree/main/examples/mistral) for instructions on how to run it with Mistral.
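
As a rough illustration of what that flow looks like, here is a minimal sketch using Olive's Python entry point (`olive.workflows.run`); the config file name below is a placeholder, so follow the linked instructions for the actual workflow configs shipped with the Olive mistral example.

```python
# Minimal sketch: run an Olive optimization workflow from Python.
# "mistral_config.json" is a placeholder name -- use the config provided in the
# Olive repo's examples/mistral folder (it defines the input model, the
# optimization passes, and the target hardware).
from olive.workflows import run as olive_run

olive_run("mistral_config.json")
```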

## ONNX Models

Here are some of the optimized configurations we have added:

1. ONNX model for int4 DML: ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using [AWQ](https://arxiv.org/abs/2306.00978).
2. ONNX model for fp16 CUDA: ONNX model you can use to run on NVIDIA GPUs.
3. ONNX model for int4 CUDA: ONNX model for NVIDIA GPUs using int4 quantization via RTN.
4. ONNX model for int4 CPU: ONNX model for your CPU, using int4 quantization via RTN.
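
Whichever of these variants you download, the matching ONNX Runtime execution provider has to be selected when the session is created. The sketch below uses the `onnxruntime` Python package; the model path is a placeholder, and it only shows session creation (the full generation loop with KV-cache handling is not included here).

```python
# Sketch: create an ONNX Runtime session with the execution provider that
# matches the downloaded variant. "model.onnx" is a placeholder path.
import onnxruntime as ort

preferred = [
    "DmlExecutionProvider",   # int4 DML variant (Windows, onnxruntime-directml)
    "CUDAExecutionProvider",  # fp16/int4 CUDA variants (onnxruntime-gpu)
    "CPUExecutionProvider",   # int4 CPU variant (any onnxruntime build)
]
# Keep only the providers that the installed ONNX Runtime build actually offers.
providers = [p for p in preferred if p in ort.get_available_providers()]

session = ort.InferenceSession("model.onnx", providers=providers)
print("Session is using:", session.get_providers())
```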

## Hardware Supported

The models are tested on:
- GPU SKU: RTX 4090 (DirectML)
- GPU SKU: 1x A100 80GB GPU (Standard_ND96amsr_A100_v4, CUDA)
- CPU SKU: Standard F64s v2 (64 vcpus, 128 GiB memory)

Minimum Configuration Required:
- Windows: DirectX 12-capable GPU and a minimum of 4GB of combined RAM
- CUDA: Streaming Multiprocessor (SM) version >= 70, i.e. compute capability 7.0 or higher (V100 or newer)

### Model Description

- **Developed by:** Microsoft
- **Model type:** ONNX
- **Language(s) (NLP):** Python, C, C++
- **License:** Apache License Version 2.0
- **Model Description:** This is a conversion of the Mistral-7B-Instruct-v0.2 model for ONNX Runtime inference.

## Additional Details
- [**Mistral Model Announcement Link**](https://mistral.ai/news/announcing-mistral-7b/)
- [**Mistral Model Card**](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
- [**Mistral Technical Report**](https://arxiv.org/abs/2310.06825)

## Appendix

### Activation Aware Quantization

AWQ works by identifying the top 1% of weights that are most salient for maintaining accuracy and quantizing the remaining 99% of weights. This results in less accuracy loss from quantization than many other quantization techniques. For more on AWQ, see [here](https://arxiv.org/abs/2306.00978).
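
As a schematic illustration of that idea (not the implementation used to produce these models), the sketch below keeps a small fraction of "salient" input channels in full precision and applies round-to-nearest (RTN) int4 quantization to the rest; real AWQ instead rescales the salient channels so that every weight can still be quantized.

```python
# Schematic only: "protect the most salient weights, RTN-quantize the rest".
import numpy as np

def rtn_int4(w):
    """Symmetric round-to-nearest int4 quantization with per-column scales,
    returned in dequantized form so it can be compared against the original."""
    scale = np.maximum(np.abs(w).max(axis=0, keepdims=True) / 7.0, 1e-8)
    return np.clip(np.round(w / scale), -8, 7) * scale

def protect_salient_then_rtn(weight, act_magnitude, keep_ratio=0.01):
    """Keep the top `keep_ratio` input channels (ranked by average activation
    magnitude) at full precision and quantize everything else."""
    n_keep = max(1, int(keep_ratio * weight.shape[1]))
    salient = np.argsort(act_magnitude)[-n_keep:]   # most salient input channels
    w_q = rtn_int4(weight)                          # quantize all weights...
    w_q[:, salient] = weight[:, salient]            # ...then restore the salient ones
    return w_q
```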

## Model Card Contact
sschoenmeyer, sunghcho, kvaishnavi