vllm (pretrained=/root/autodl-tmp/Cydonia-R1-24B-v4.1,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.952	±	0.0135
		strict-match	5	exact_match	↑	0.944	±	0.0146

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.926	±	0.0117
		strict-match	5	exact_match	↑	0.924	±	0.0119
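
The Stderr column can be sanity-checked against the usual binomial standard error of a proportion. The sketch below is an illustration, not the harness's own code: the helper names are hypothetical, and the harness may divide by n - 1 rather than n, which only affects the last digit at these sample sizes. It also backs out an approximate sample count from a reported accuracy and standard error. That estimate suggests the second table above, which has no configuration line of its own, came from a run over roughly 500 examples rather than 250.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy measured on n independent examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


def implied_sample_count(acc: float, stderr: float) -> float:
    """Invert the formula above to estimate n from a reported accuracy and stderr."""
    return acc * (1.0 - acc) / stderr ** 2


# First table: limit 250, flexible-extract accuracy 0.952.
stderr_250 = binomial_stderr(0.952, 250)
print(f"stderr at n=250: {stderr_250:.4f}")

# Second table: accuracy 0.926 with reported stderr 0.0117, sample count unknown.
n_estimate = implied_sample_count(0.926, 0.0117)
print(f"implied sample count: {n_estimate:.0f}")
```

Because the reported standard errors are rounded to four decimals, the implied sample count is only approximate.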

vllm (pretrained=/root/autodl-tmp/Cydonia-R1-24B-v4.1,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.9), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto

Groups	Version	Filter	Metric		Value		Stderr
mmlu	2	none	acc	↑	0.7797	±	0.0033
- humanities	2	none	acc	↑	0.7114	±	0.0063
- other	2	none	acc	↑	0.8239	±	0.0065
- social sciences	2	none	acc	↑	0.8703	±	0.0060
- stem	2	none	acc	↑	0.7498	±	0.0074
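
Each results block in this log opens with a configuration line of the form `vllm (key=value,...), gen_kwargs: (...), limit: ..., num_fewshot: ..., batch_size: ...`. When comparing many runs it can help to turn that line into a dictionary. The following is a minimal sketch under the assumption that every header follows exactly this layout and that no argument value contains a comma; `parse_run_header` and `HEADER_RE` are hypothetical names, not part of any library.

```python
import re

# Pattern for the configuration line that precedes each results table.
HEADER_RE = re.compile(
    r"^(?P<model>\w+) \((?P<model_args>.*)\), gen_kwargs: \((?P<gen_kwargs>.*?)\), "
    r"limit: (?P<limit>[^,]+), num_fewshot: (?P<num_fewshot>[^,]+), "
    r"batch_size: (?P<batch_size>.+)$"
)


def parse_run_header(line: str) -> dict:
    """Split a run header into its fields; all values are kept as strings."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not a recognised run header")
    fields = match.groupdict()
    # model_args is a comma-separated list of key=value pairs.
    fields["model_args"] = dict(
        pair.split("=", 1) for pair in fields["model_args"].split(",")
    )
    return fields


EXAMPLE = (
    "vllm (pretrained=/root/autodl-tmp/Cydonia-R1-24B-v4.1,add_bos_token=true,"
    "max_model_len=3048,dtype=bfloat16,trust_remote_code=true,"
    "gpu_memory_utilization=0.9), gen_kwargs: (None), limit: None, "
    "num_fewshot: None, batch_size: auto"
)

header = parse_run_header(EXAMPLE)
print(header["model_args"]["pretrained"], header["limit"])
```

Values are left as strings on purpose, since fields such as `limit` appear both as `250.0` and as `None` in this log.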

vllm (pretrained=/root/autodl-tmp/85-256-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.908	±	0.0183
		strict-match	5	exact_match	↑	0.892	±	0.0197

vllm (pretrained=/root/autodl-tmp/90-128-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.908	±	0.0183
		strict-match	5	exact_match	↑	0.892	±	0.0197

vllm (pretrained=/root/autodl-tmp/90-512-256-99999999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.912	±	0.0180
		strict-match	5	exact_match	↑	0.896	±	0.0193

vllm (pretrained=/root/autodl-tmp/90-256-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.928	±	0.0164
		strict-match	5	exact_match	↑	0.916	±	0.0176

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.910	±	0.0128
		strict-match	5	exact_match	↑	0.900	±	0.0134

vllm (pretrained=/root/autodl-tmp/91-256-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.920	±	0.0172
		strict-match	5	exact_match	↑	0.908	±	0.0183

vllm (pretrained=/root/autodl-tmp/90-512-400-99999999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.932	±	0.0160
		strict-match	5	exact_match	↑	0.920	±	0.0172

vllm (pretrained=/root/autodl-tmp/90-512-512-99999999-seed41,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.936	±	0.0155
		strict-match	5	exact_match	↑	0.924	±	0.0168

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.924	±	0.0119
		strict-match	5	exact_match	↑	0.910	±	0.0128

Groups	Version	Filter	Metric		Value		Stderr
mmlu	2	none	acc	↑	0.7765	±	0.0033
- humanities	2	none	acc	↑	0.7050	±	0.0063
- other	2	none	acc	↑	0.8233	±	0.0065
- social sciences	2	none	acc	↑	0.8677	±	0.0060
- stem	2	none	acc	↑	0.7482	±	0.0075

vllm (pretrained=/root/autodl-tmp/90-512-512-99999999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.940	±	0.0151
		strict-match	5	exact_match	↑	0.920	±	0.0172

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.920	±	0.0121
		strict-match	5	exact_match	↑	0.910	±	0.0128

vllm (pretrained=/root/autodl-tmp/90-512-512-99999999,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.9), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto

Groups	Version	Filter	Metric		Value		Stderr
mmlu	2	none	acc	↑	0.7755	±	0.0033
- humanities	2	none	acc	↑	0.7054	±	0.0063
- other	2	none	acc	↑	0.8243	±	0.0065
- social sciences	2	none	acc	↑	0.8671	±	0.0060
- stem	2	none	acc	↑	0.7428	±	0.0075
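
To judge whether the MMLU gap between the `Cydonia-R1-24B-v4.1` run (0.7797 ± 0.0033) and the `90-512-512-99999999` run (0.7755 ± 0.0033) exceeds sampling noise, the two standard errors can be combined into a z-score. This is a rough sketch with a hypothetical helper name. It treats the two runs as independent samples, which is an assumption: both runs answer the same questions, so a paired comparison over per-question results would be more sensitive.

```python
import math


def difference_z_score(acc_a: float, se_a: float, acc_b: float, se_b: float) -> float:
    """z-score of (acc_a - acc_b), assuming the two estimates are independent."""
    return (acc_a - acc_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Overall MMLU accuracy and stderr taken from the two tables in this log.
z = difference_z_score(0.7797, 0.0033, 0.7755, 0.0033)
print(f"z = {z:.2f}")
```

A magnitude below about 1.96 means the difference is not distinguishable from noise at the 95% level under this independence assumption.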

vllm (pretrained=/root/autodl-tmp/90-512-600-99999999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.928	±	0.0164
		strict-match	5	exact_match	↑	0.916	±	0.0176

vllm (pretrained=/root/autodl-tmp/90-512-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.924	±	0.0168
		strict-match	5	exact_match	↑	0.908	±	0.0183

vllm (pretrained=/root/autodl-tmp/90-1024-256-99999999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.916	±	0.0176
		strict-match	5	exact_match	↑	0.900	±	0.0190

vllm (pretrained=/root/autodl-tmp/90-512-4096-9.9999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.900	±	0.0190
		strict-match	5	exact_match	↑	0.892	±	0.0197

vllm (pretrained=/root/autodl-tmp/90-1024-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.924	±	0.0168
		strict-match	5	exact_match	↑	0.908	±	0.0183

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.912	±	0.0127
		strict-match	5	exact_match	↑	0.902	±	0.0133

vllm (pretrained=/root/autodl-tmp/90-4096-512-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.920	±	0.0172
		strict-match	5	exact_match	↑	0.900	±	0.0190

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.914	±	0.0126
		strict-match	5	exact_match	↑	0.902	±	0.0133

vllm (pretrained=/root/autodl-tmp/40-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.888	±	0.0200
		strict-match	5	exact_match	↑	0.880	±	0.0206

vllm (pretrained=/root/autodl-tmp/60-512-1024-5,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.912	±	0.0180
		strict-match	5	exact_match	↑	0.908	±	0.0183

vllm (pretrained=/root/autodl-tmp/60-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.920	±	0.0172
		strict-match	5	exact_match	↑	0.920	±	0.0172

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.890	±	0.0140
		strict-match	5	exact_match	↑	0.890	±	0.0140

vllm (pretrained=/root/autodl-tmp/65-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.908	±	0.0183
		strict-match	5	exact_match	↑	0.908	±	0.0183

vllm (pretrained=/root/autodl-tmp/60-1024-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.884	±	0.0203
		strict-match	5	exact_match	↑	0.880	±	0.0206

vllm (pretrained=/root/autodl-tmp/70-512-1024-5,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.916	±	0.0176
		strict-match	5	exact_match	↑	0.900	±	0.0190

vllm (pretrained=/root/autodl-tmp/70-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.916	±	0.0176
		strict-match	5	exact_match	↑	0.912	±	0.0180

vllm (pretrained=/root/autodl-tmp/85-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.896	±	0.0193
		strict-match	5	exact_match	↑	0.896	±	0.0193

vllm (pretrained=/root/autodl-tmp/86-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.904	±	0.0187
		strict-match	5	exact_match	↑	0.904	±	0.0187

vllm (pretrained=/root/autodl-tmp/87-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.888	±	0.0200
		strict-match	5	exact_match	↑	0.888	±	0.0200

Safetensors
Model size: 24B params
Tensor type: BF16

Model tree for noneUsername/Cydonia-R1-24B-v4.1-W8A8
Base model: TheDrummer/Cydonia-R1-24B-v4.1
Quantized (11): this model