Overview

This document presents the evaluation results for DeepSeek-LLM-67B-Chat, an 8-bit GPTQ-quantized model, evaluated with the Language Model Evaluation Harness (lm-evaluation-harness) on the ARC, GPQA, and IFEval benchmarks.

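
Such a run can be reproduced with the harness's Python API; the following is a minimal sketch only, where the task names, dtype, and device are illustrative assumptions rather than the recorded settings of this evaluation:

```python
import lm_eval

# Minimal sketch of an lm-evaluation-harness run; task names and model_args
# are illustrative assumptions, not the exact settings used for this card.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=empirischtech/DeepSeek-LLM-67B-Chat-gptq-8bit,"
        "dtype=float16"
    ),
    tasks=["arc_challenge", "ifeval", "gpqa_main_zeroshot"],
    batch_size=1,
    device="cuda:0",
)
print(results["results"])
```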
📊 Evaluation Summary

| Metric | Value | Description |
|--------|-------|-------------|
| ARC-Challenge | 58.11% | Raw accuracy (acc,none) |
| GPQA Overall | 25.44% | Averaged across GPQA-Diamond, GPQA-Extended, GPQA-Main (n-shot, zero-shot, CoT, generative) |
| GPQA (n-shot acc) | 33.04% | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (acc,none) |
| GPQA (zero-shot acc) | 32.51% | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (acc,none) |
| GPQA (CoT n-shot) | 17.21% | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (exact_match, flexible-extract) |
| GPQA (CoT zero-shot) | 17.52% | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (exact_match, flexible-extract) |
| GPQA (Generative n-shot) | 26.49% | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (exact_match, flexible-extract) |
| IFEval Overall | 43.16% | Averaged across Prompt-level Strict, Prompt-level Loose, Inst-level Strict, Inst-level Loose |
| IFEval (Prompt-level Strict) | 36.23% | Prompt-level strict accuracy |
| IFEval (Prompt-level Loose) | 38.45% | Prompt-level loose accuracy |
| IFEval (Inst-level Strict) | 47.84% | Instruction-level strict accuracy |
| IFEval (Inst-level Loose) | 50.12% | Instruction-level loose accuracy |
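
The overall scores are simple arithmetic means of their sub-metrics; for example, the IFEval Overall figure can be reproduced from the four reported accuracies:

```python
# Reproduce the IFEval Overall score as the mean of the four reported metrics
ifeval_scores = [36.23, 38.45, 47.84, 50.12]  # prompt/inst level x strict/loose
overall = sum(ifeval_scores) / len(ifeval_scores)
print(f"IFEval Overall: {overall:.2f}%")  # -> 43.16%
```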

⚙️ Model Configuration

  • Model: DeepSeek-LLM-67B-Chat
  • Parameters: 67 billion
  • Quantization: 8-bit GPTQ
  • Source: Hugging Face (hf)
  • Precision: torch.float16
  • Hardware: NVIDIA A100 80GB PCIe
  • CUDA Version: 12.4
  • PyTorch Version: 2.6.0+cu124
  • Batch Size: 1
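
For reference, a GPTQ checkpoint like this one can typically be loaded directly with transformers, which reads the quantization config from the checkpoint; a minimal sketch, assuming a single A100-class GPU:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "empirischtech/DeepSeek-LLM-67B-Chat-gptq-8bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# fp16 activations, matching the precision listed above; GPTQ weight
# quantization settings are picked up from the checkpoint itself
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("What is GPTQ quantization?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```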

📌 Interpretation:

  • The evaluation was performed on a high-performance GPU (A100 80GB).
  • GPTQ 8-bit quantization gives the model a significantly smaller memory footprint than the full-precision version, while keeping the same 67B parameter count.
  • A batch size of 1 was used, which reduces evaluation throughput.

📈 Performance Insights

  • Quantization Impact: The 8-bit GPTQ quantization reduces memory usage but may also impact accuracy slightly.
  • Zero-shot Limitation: Performance could improve with few-shot prompting, i.e. providing solved examples before the test question; see the sketch below.
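
The harness exposes this directly via num_fewshot; a minimal sketch, where the shot count is an arbitrary example rather than a recommended setting:

```python
import lm_eval

# Same kind of run as above, but with 5 in-context examples per prompt
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=empirischtech/DeepSeek-LLM-67B-Chat-gptq-8bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=5,  # the harness prepends 5 solved examples to each prompt
    batch_size=1,
)
```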

📌 Let us know if you need further analysis or model tuning! 🚀
