Overview
This document presents the evaluation results for DeepSeek-LLM-67B-Chat, an 8-bit model quantized with GPTQ, evaluated with the Language Model Evaluation Harness on the ARC, GPQA, and IFEval benchmarks.
📊 Evaluation Summary
| Metric | Value | Description |
|---|---|---|
| ARC-Challenge | 58.11% | Raw (acc,none) |
| GPQA Overall | 25.44% | Averaged across GPQA-Diamond, GPQA-Extended, GPQA-Main (n-shot, zeroshot, CoT, Generative) |
| GPQA (n-shot acc) | 33.04% | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (acc,none) |
| GPQA (zeroshot acc) | 32.51% | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (acc,none) |
| GPQA (CoT n-shot) | 17.21% | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (exact_match, flexible-extract) |
| GPQA (CoT zeroshot) | 17.52% | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (exact_match, flexible-extract) |
| GPQA (Generative n-shot) | 26.49% | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (exact_match, flexible-extract) |
| IFEval Overall | 43.16% | Averaged across Prompt-level Strict, Prompt-level Loose, Inst-level Strict, Inst-level Loose |
| IFEval (Prompt-level Strict) | 36.23% | Prompt-level strict accuracy |
| IFEval (Prompt-level Loose) | 38.45% | Prompt-level loose accuracy |
| IFEval (Inst-level Strict) | 47.84% | Inst-level strict accuracy |
| IFEval (Inst-level Loose) | 50.12% | Inst-level loose accuracy |
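The "Overall" rows above are plain (unweighted) means of their sub-metrics; for example, the IFEval overall score can be reproduced from the four IFEval rows:

```python
# Sanity-check the IFEval Overall row: it is the unweighted mean of the
# four prompt-/instruction-level accuracies reported in the table.
ifeval_scores = {
    "prompt_level_strict": 36.23,
    "prompt_level_loose": 38.45,
    "inst_level_strict": 47.84,
    "inst_level_loose": 50.12,
}

overall = round(sum(ifeval_scores.values()) / len(ifeval_scores), 2)
print(overall)  # 43.16, matching the IFEval Overall row
```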
⚙️ Model Configuration
- Model: DeepSeek-LLM-67B-Chat
- Parameters: 67 billion
- Quantization: 8-bit GPTQ
- Source: Hugging Face (hf)
- Precision: torch.float16
- Hardware: NVIDIA A100 80GB PCIe
- CUDA Version: 12.4
- PyTorch Version: 2.6.0+cu124
- Batch Size: 1
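A run with this configuration would typically be launched through the harness CLI. As a sketch only, the snippet below assembles the corresponding `lm_eval` command string from the settings listed above; the repository id is assumed from the model tree at the end of this card, and the exact task names may differ between harness versions:

```python
# Build the lm-evaluation-harness CLI invocation matching the configuration
# above (hf source, float16 precision, batch size 1). The pretrained repo id
# and task names are assumptions, not taken from a logged command.
model_args = ",".join([
    "pretrained=empirischtech/DeepSeek-LLM-67B-Chat-gptq-8bit",
    "dtype=float16",
])
tasks = ["arc_challenge", "gpqa", "ifeval"]

cmd = (
    "lm_eval --model hf "
    f"--model_args {model_args} "
    f"--tasks {','.join(tasks)} "
    "--batch_size 1"
)
print(cmd)
```

For the few-shot variants, the harness accepts an additional `--num_fewshot N` flag on the same command line.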
📌 Interpretation:
- The evaluation was performed on a high-performance GPU (A100 80GB).
- GPTQ 8-bit quantization substantially reduces the memory footprint relative to the full-precision model; the parameter count itself is unchanged.
- A single-sample batch size was used, which likely slows evaluation but keeps memory usage predictable.
📈 Performance Insights
- Quantization Impact: 8-bit GPTQ quantization halves memory usage relative to float16 but may cost some accuracy.
- Zero-shot Limitation: several of the scores above are zero-shot; performance could improve with few-shot prompting (providing worked examples before the test question).
📌 Let us know if you need further analysis or model tuning! 🚀
Model tree for empirischtech/DeepSeek-LLM-67B-Chat-gptq-8bit
- Base model: deepseek-ai/deepseek-llm-67b-chat