---
license: apache-2.0
datasets:
- openai/gsm8k
base_model:
- openai/gpt-oss-120b
- deepseek-ai/DeepSeek-V3.1
tags:
- gpt-oss-120b-gsm8k-evaluation
---
# Model Card for MLX GPT-OSS-120B GSM8K Evaluation

## Model Description

This model card documents the evaluation of the **MLX GPT-OSS-120B** model on the **GSM8K mathematical reasoning benchmark** using a few-shot testing methodology. The evaluation was conducted with a custom testing framework that leverages Apple's MLX framework for efficient inference on Apple Silicon (a minimal inference sketch follows the list below).

- **Model Type:** Transformer-based language model
- **Model Size:** 120 billion parameters
- **Framework:** MLX (Apple Silicon optimized)
- **Evaluation Method:** Few-shot testing with 2 demonstration examples
- **Dataset:** GSM8K main test set (1,319 samples)
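
As a rough illustration of that inference path, the sketch below loads a local MLX-converted model with the `mlx-lm` package and generates one completion. The model path and the prompt are placeholders, and the actual evaluation script may drive the model differently.

```python
# Minimal MLX inference sketch (assumes the mlx-lm package is installed;
# the model path and prompt are placeholders, not this repository's real values).
from mlx_lm import load, generate

# Load an MLX-converted model and its tokenizer from a local directory.
model, tokenizer = load("/path/to/your/model")

# Generate a completion for one GSM8K-style question.
prompt = (
    "Q: A robe takes 2 bolts of blue fiber and half that much white fiber. "
    "How many bolts in total does it take?\nA:"
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```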

## Evaluation Results

The model was evaluated on the GSM8K mathematical reasoning benchmark using the following testing protocol:

| Metric | Value |
|--------|-------|
| **Accuracy** | **Calculating...** |
| Total Problems | 1,319 |
| Few-shot Examples | 2 |
| Max Tokens Generated | 512 |
| Temperature | Default (0.7) |

*Note: Final accuracy results will be populated after the evaluation completes.*

## Usage

The evaluation was conducted using the following Python script:

```python
from mlx_gpt_oss_120b_few_shot_testing_gsm8k import MLXGPTGSM8KEvaluator

# Initialize the evaluator with local model and dataset paths.
evaluator = MLXGPTGSM8KEvaluator(
    model_path="/path/to/your/model",
    data_path="/path/to/gsm8k_main_test_20250902_110036.json"
)

# Run the evaluation over the full 1,319-problem test split.
results, accuracy = evaluator.evaluate_gsm8k(num_samples=1319)
```

## Evaluation Methodology

The evaluation process follows this structured approach:

```mermaid
flowchart TD
    A[Start Evaluation] --> B[Load MLX GPT-OSS-120B Model]
    B --> C[Load GSM8K Dataset<br/>1319 samples]
    C --> D[Create Few-Shot Prompts<br/>2 examples per question]

    subgraph EvaluationLoop [Per-Sample Processing]
        D --> E[Generate Model Response]
        E --> F[Extract Numerical Answer]
        F --> G[Compare with Expected Answer]
        G --> H[Record Accuracy]
    end

    H --> I[Save Intermediate Results<br/>Every 10 samples]
    EvaluationLoop --> J[Calculate Final Accuracy]
    J --> K[Generate Comprehensive Reports<br/>JSON, TXT, Logs]
    K --> L[End Evaluation]
```

### Key Components

1. **Few-shot Prompting**: Each question is prefixed with 2 worked examples demonstrating the expected reasoning format
2. **Answer Extraction**: Uses regex patterns to extract numerical answers from model responses
3. **Accuracy Calculation**: Compares extracted answers with ground-truth values (components 1-3 are sketched together after this list)
4. **Comprehensive Logging**: Detailed logs and intermediate result saving
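
Below is a minimal, self-contained sketch of components 1-3. The helper names (`build_few_shot_prompt`, `extract_answer`, `is_correct`), the two worked examples, and the regex patterns are illustrative assumptions, not the repository's actual implementation.

```python
import re

# Two illustrative worked examples (hypothetical; the real script defines its own).
FEW_SHOT_EXAMPLES = [
    ("Natalia sold 48 clips in April and half as many in May. "
     "How many clips did she sell altogether?",
     "In May she sold 48 / 2 = 24 clips. In total she sold 48 + 24 = 72 clips. "
     "The answer is 72."),
    ("Weng earns $12 an hour. How much does she earn for 50 minutes of work?",
     "Per minute she earns 12 / 60 = 0.2 dollars. For 50 minutes she earns "
     "0.2 * 50 = 10 dollars. The answer is 10."),
]

def build_few_shot_prompt(question: str) -> str:
    """Prefix the question with the worked examples in a Q/A format."""
    parts = [f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def extract_answer(response: str) -> str | None:
    """Pull a number out of the response, preferring 'The answer is N'."""
    match = re.search(r"[Tt]he answer is\s*\$?(-?\d[\d,]*(?:\.\d+)?)", response)
    if match:
        text = match.group(1)
    else:
        numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", response)
        if not numbers:
            return None
        text = numbers[-1]  # fall back to the last number mentioned
    return text.replace(",", "")

def is_correct(response: str, expected: str) -> bool:
    """Compare the extracted answer with the ground-truth value numerically."""
    predicted = extract_answer(response)
    return predicted is not None and float(predicted) == float(expected.replace(",", ""))

# Example: build a prompt and score a mock response.
print(build_few_shot_prompt("Josh had 4 apples and ate 1. How many remain?"))
print(is_correct("4 - 1 = 3. The answer is 3.", "3"))  # True
```

A production extractor would need more patterns (currency, fractions, trailing units), which is exactly the failure mode the Limitations section flags.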

## Files Generated

The evaluation script produces the following output files:

- `gsm8k_evaluation_YYYYMMDD_HHMMSS.log` - Detailed execution log
- `gpt_oss_output_YYYYMMDD_HHMMSS/` - Directory containing:
  - `final_results.json` - Complete evaluation results
  - `intermediate_results.json` - Periodic saves during evaluation
  - `summary.json` - Evaluation metrics summary
  - `results_summary.txt` - Human-readable summary
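
As an illustration of the checkpointing pattern in the flowchart (intermediate saves every 10 samples) and the directory layout above, here is a minimal sketch. The record fields and the `evaluate_sample` stub are assumptions, not the script's actual schema.

```python
import json
import time
from pathlib import Path

# Hypothetical stand-ins for the script's real data and per-sample logic.
samples = [{"question": f"stub {i}", "expected": "42"} for i in range(25)]

def evaluate_sample(sample: dict) -> dict:
    # Placeholder: a real run would generate a response and score it.
    return {"question": sample["question"], "correct": True}

# Timestamped output directory, mirroring the gpt_oss_output_YYYYMMDD_HHMMSS/ layout.
out_dir = Path(f"gpt_oss_output_{time.strftime('%Y%m%d_%H%M%S')}")
out_dir.mkdir(exist_ok=True)

results = []
for i, sample in enumerate(samples, start=1):
    results.append(evaluate_sample(sample))
    # Checkpoint every 10 samples so long runs can be inspected or resumed.
    if i % 10 == 0:
        (out_dir / "intermediate_results.json").write_text(json.dumps(results, indent=2))

# Final artifacts: complete results plus a compact metrics summary.
accuracy = sum(r["correct"] for r in results) / len(results)
(out_dir / "final_results.json").write_text(json.dumps(results, indent=2))
(out_dir / "summary.json").write_text(
    json.dumps({"accuracy": accuracy, "total_problems": len(results)}, indent=2)
)
```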

## Limitations

- Results reported here cover the GSM8K main test split (1,319 problems); runs with a smaller `num_samples` evaluate only a subset of it
- Performance may vary based on the specific few-shot examples used
- Answer extraction relies on pattern matching, which may not capture all valid answer formats
- Computational requirements are significant due to the model's size

## Environmental Impact

The evaluation was conducted on Apple Silicon hardware, which typically offers improved energy efficiency compared to traditional GPU setups. The MLX framework further optimizes resource utilization on Apple hardware.

## Citation

If you use this evaluation methodology or results in your research, please acknowledge:

```
Evaluation of GPT-OSS-120B using MLX framework on GSM8K mathematical reasoning benchmark.
```

## Contact

For questions about this evaluation, please open an issue in the repository.

---
*This model card was generated based on the evaluation of MLX GPT-OSS-120B on the GSM8K dataset.*