TroglodyteDerivations/MLX_GPT_OSS_120B_GSM8K_Evaluation

TroglodyteDerivations committed on Sep 2

Commit 6946ede (verified) · 1 parent: c781af4

Create README.md

README.md +119 -0

	@@ -0,0 +1,119 @@

+---
+license: apache-2.0
+datasets:
+- openai/gsm8k
+base_model:
+- openai/gpt-oss-120b
+- deepseek-ai/DeepSeek-V3.1
+tags:
+- gpt-oss-120b-gsm8k-evaluation
+---
+# Model Card for MLX GPT-OSS-120B GSM8K Evaluation
+## Model Description
+This model card documents the evaluation of the **MLX GPT-OSS-120B** model on the **GSM8K mathematical reasoning benchmark** using a few-shot testing methodology. The evaluation used a custom testing framework built on Apple's MLX framework for inference on Apple Silicon.
+- **Model Type:** Transformer-based language model
+- **Model Size:** 120 billion parameters
+- **Framework:** MLX (Apple Silicon optimized)
+- **Evaluation Method:** Few-shot testing with 2 demonstration examples
+- **Dataset:** GSM8K main test set (1,319 samples)
+## Evaluation Results
+The model was evaluated on the GSM8K mathematical reasoning benchmark using the following testing protocol:
+| Metric | Value |
+|--------|-------|
+| **Accuracy** | **Calculating...** |
+| Total Problems | 1,319 |
+| Few-shot Examples | 2 |
+| Max Tokens Generated | 512 |
+| Temperature | Default (0.7) |
+*Note: Final accuracy results will be populated after the evaluation completes.*
+## Usage
+The evaluation was run through the `MLXGPTGSM8KEvaluator` class from the evaluation script, invoked as follows:
+```python
+from mlx_gpt_oss_120b_few_shot_testing_gsm8k import MLXGPTGSM8KEvaluator
+# Initialize evaluator
+evaluator = MLXGPTGSM8KEvaluator(
+    model_path="/path/to/your/model",
+    data_path="/path/to/gsm8k_main_test_20250902_110036.json"
+)
+# Run evaluation
+results, accuracy = evaluator.evaluate_gsm8k(num_samples=1319)
+```
+## Evaluation Methodology
+The evaluation process follows this structured approach:
+```mermaid
+flowchart TD
+    A[Start Evaluation] --> B[Load MLX GPT-OSS-120B Model]
+    B --> C[Load GSM8K Dataset<br/>1319 samples]
+    C --> D[Create Few-Shot Prompts<br/>2 examples per question]
+    subgraph EvaluationLoop [Per-Sample Processing]
+        D --> E[Generate Model Response]
+        E --> F[Extract Numerical Answer]
+        F --> G[Compare with Expected Answer]
+        G --> H[Record Accuracy]
+    end
+    H --> I[Save Intermediate Results<br/>Every 10 samples]
+    EvaluationLoop --> J[Calculate Final Accuracy]
+    J --> K[Generate Comprehensive Reports<br/>JSON, TXT, Logs]
+    K --> L[End Evaluation]
+```
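The flow in the diagram can be sketched in Python roughly as follows. This is a minimal illustration, not the evaluator's actual code: `build_prompt`, `run_evaluation`, the `generate_fn` and `is_correct_fn` callbacks, the `question`/`answer` field names, and the JSON layout are all assumptions. The model call is passed in as a callback so the loop structure can be read on its own.

```python
import json
from pathlib import Path
from typing import Callable


def build_prompt(examples: list, question: str) -> str:
    """Prefix the target question with worked examples (hypothetical format)."""
    parts = [f"Question: {ex['question']}\nAnswer: {ex['answer']}\n" for ex in examples]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)


def run_evaluation(
    samples: list,
    examples: list,
    generate_fn: Callable[[str], str],
    is_correct_fn: Callable[[str, str], bool],
    out_dir: str,
    save_every: int = 10,
):
    """Run the per-sample loop, checkpointing every `save_every` samples."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    results = []
    for i, sample in enumerate(samples, start=1):
        prompt = build_prompt(examples, sample["question"])
        response = generate_fn(prompt)  # stand-in for the model call
        correct = bool(is_correct_fn(response, sample["answer"]))
        results.append({"index": i, "response": response, "correct": correct})
        if i % save_every == 0:
            # Periodic save so a long run can be inspected or resumed
            (out_path / "intermediate_results.json").write_text(json.dumps(results, indent=2))
    accuracy = sum(r["correct"] for r in results) / len(results) if results else 0.0
    (out_path / "final_results.json").write_text(
        json.dumps({"accuracy": accuracy, "results": results}, indent=2)
    )
    return results, accuracy
```

Passing the generation and scoring functions as arguments keeps the loop testable without loading a 120B-parameter model.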
+### Key Components:
+1. **Few-shot Prompting**: Each question is prefixed with 2 worked examples demonstrating the expected reasoning format
+2. **Answer Extraction**: Uses regex patterns to extract numerical answers from model responses
+3. **Accuracy Calculation**: Compares extracted answers with ground truth values
+4. **Comprehensive Logging**: Detailed logs and intermediate result saving
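The answer extraction and comparison steps above might look something like the sketch below. The card does not show the actual regex patterns, so the "take the last number in the response" heuristic, the function names, and the tolerance are assumptions. The one grounded detail is that GSM8K reference solutions end with a `#### <final answer>` line.

```python
import re
from typing import Optional

# Matches integers and decimals, allowing thousands separators (e.g. "1,234.5")
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_number(text: str) -> Optional[float]:
    """Return the last number found in `text`, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gold_answer(reference: str) -> Optional[float]:
    """Read the final answer that follows '####' in a GSM8K reference solution."""
    return parse_number(reference.split("####")[-1])


def is_correct(response: str, reference: str, tol: float = 1e-6) -> bool:
    """Compare the extracted prediction with the ground-truth value."""
    predicted = parse_number(response)
    expected = gold_answer(reference)
    return predicted is not None and expected is not None and abs(predicted - expected) <= tol
```

A last-number heuristic is simple but brittle: a response that states the answer and then mentions another figure would be scored on the wrong number, which is one way pattern matching can miss valid answers.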
+## Files Generated
+The evaluation script produces the following output files:
+- `gsm8k_evaluation_YYYYMMDD_HHMMSS.log` - Detailed execution log
+- `gpt_oss_output_YYYYMMDD_HHMMSS/` - Directory containing:
+  - `final_results.json` - Complete evaluation results
+  - `intermediate_results.json` - Periodic saves during evaluation
+  - `summary.json` - Evaluation metrics summary
+  - `results_summary.txt` - Human-readable summary
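The `YYYYMMDD_HHMMSS` suffix in these names corresponds to the `strftime` pattern `%Y%m%d_%H%M%S`. A small sketch of how such names could be derived is shown below; `output_paths` is a hypothetical helper, not a function from the evaluation script.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Tuple


def output_paths(now: Optional[datetime] = None) -> Tuple[Path, Path]:
    """Build the timestamped log file and output directory names."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    log_file = Path(f"gsm8k_evaluation_{stamp}.log")
    out_dir = Path(f"gpt_oss_output_{stamp}")
    return log_file, out_dir
```

Using one timestamp for both names keeps the log and the result directory of a single run easy to pair up.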
+## Limitations
+- Evaluation covers only the GSM8K main test set (1,319 problems); no other benchmarks were evaluated
+- Performance may vary based on the specific few-shot examples used
+- Answer extraction relies on pattern matching which may not capture all valid answer formats
+- Computational requirements are significant due to model size
+## Environmental Impact
+The evaluation was conducted on Apple Silicon hardware, which typically offers improved energy efficiency compared to traditional GPU setups. The MLX framework further optimizes resource utilization for Apple hardware.
+## Citation
+If you use this evaluation methodology or results in your research, please acknowledge:
+```
+Evaluation of GPT-OSS-120B using MLX framework on GSM8K mathematical reasoning benchmark.
+```
+## Contact
+For questions about this evaluation, please open an issue in this repository.
+---
+*This model card was generated based on the evaluation of MLX GPT-OSS-120B on the GSM8K dataset.*