vllm (pretrained=/root/autodl-tmp/Cydonia-R1-24B-v4.1,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.952 | ± | 0.0135 |
| | | strict-match | 5 | exact_match | ↑ | 0.944 | ± | 0.0146 |
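
The Stderr column can be sanity-checked against the accuracy and the `limit` value. The sketch below assumes each sample is scored 0 or 1 and that the standard error is the sample standard deviation (with an n - 1 denominator) divided by sqrt(n); the helper name `binomial_stderr` is hypothetical and not part of any library. Under that assumption the reported figures for the 250-sample run above are reproduced to four decimals.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores.

    Assumption: sample variance uses the (n - 1) denominator, so
    stderr = sqrt(acc * (1 - acc) / (n - 1)).
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# Values taken from the gsm8k table above (limit: 250).
flexible = binomial_stderr(0.952, 250)
strict = binomial_stderr(0.944, 250)

print(f"flexible-extract stderr: {flexible:.4f}")  # table reports 0.0135
print(f"strict-match stderr:     {strict:.4f}")    # table reports 0.0146
```

Because the standard error shrinks roughly with 1/sqrt(n), the 500-sample runs report smaller values than the 250-sample runs at similar accuracy.
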
vllm (pretrained=/root/autodl-tmp/Cydonia-R1-24B-v4.1,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.926 | ± | 0.0117 |
| | | strict-match | 5 | exact_match | ↑ | 0.924 | ± | 0.0119 |
vllm (pretrained=/root/autodl-tmp/Cydonia-R1-24B-v4.1,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.9), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.7797 | ± | 0.0033 |
| - humanities | 2 | none | | acc | ↑ | 0.7114 | ± | 0.0063 |
| - other | 2 | none | | acc | ↑ | 0.8239 | ± | 0.0065 |
| - social sciences | 2 | none | | acc | ↑ | 0.8703 | ± | 0.0060 |
| - stem | 2 | none | | acc | ↑ | 0.7498 | ± | 0.0074 |
vllm (pretrained=/root/autodl-tmp/85-256-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.908 | ± | 0.0183 |
| | | strict-match | 5 | exact_match | ↑ | 0.892 | ± | 0.0197 |
vllm (pretrained=/root/autodl-tmp/90-128-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.908 | ± | 0.0183 |
| | | strict-match | 5 | exact_match | ↑ | 0.892 | ± | 0.0197 |
vllm (pretrained=/root/autodl-tmp/90-512-256-99999999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.912 | ± | 0.0180 |
| | | strict-match | 5 | exact_match | ↑ | 0.896 | ± | 0.0193 |
vllm (pretrained=/root/autodl-tmp/90-256-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.928 | ± | 0.0164 |
| | | strict-match | 5 | exact_match | ↑ | 0.916 | ± | 0.0176 |
vllm (pretrained=/root/autodl-tmp/90-256-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.91 | ± | 0.0128 |
| | | strict-match | 5 | exact_match | ↑ | 0.90 | ± | 0.0134 |
vllm (pretrained=/root/autodl-tmp/91-256-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.920 | ± | 0.0172 |
| | | strict-match | 5 | exact_match | ↑ | 0.908 | ± | 0.0183 |
vllm (pretrained=/root/autodl-tmp/90-512-400-99999999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.932 | ± | 0.0160 |
| | | strict-match | 5 | exact_match | ↑ | 0.920 | ± | 0.0172 |
vllm (pretrained=/root/autodl-tmp/90-512-512-99999999-seed41,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.936 | ± | 0.0155 |
| | | strict-match | 5 | exact_match | ↑ | 0.924 | ± | 0.0168 |
vllm (pretrained=/root/autodl-tmp/90-512-512-99999999-seed41,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.924 | ± | 0.0119 |
| | | strict-match | 5 | exact_match | ↑ | 0.910 | ± | 0.0128 |

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.7765 | ± | 0.0033 |
| - humanities | 2 | none | | acc | ↑ | 0.7050 | ± | 0.0063 |
| - other | 2 | none | | acc | ↑ | 0.8233 | ± | 0.0065 |
| - social sciences | 2 | none | | acc | ↑ | 0.8677 | ± | 0.0060 |
| - stem | 2 | none | | acc | ↑ | 0.7482 | ± | 0.0075 |
vllm (pretrained=/root/autodl-tmp/90-512-512-99999999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.94 | ± | 0.0151 |
| | | strict-match | 5 | exact_match | ↑ | 0.92 | ± | 0.0172 |
vllm (pretrained=/root/autodl-tmp/90-512-512-99999999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.92 | ± | 0.0121 |
| | | strict-match | 5 | exact_match | ↑ | 0.91 | ± | 0.0128 |
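
One way to read the gap between this run and the unquantized Cydonia-R1-24B-v4.1 run at the same limit of 500 is to compare the difference in accuracy with the combined standard error. The sketch below is only a rough check: it assumes the two runs can be treated as independent, which ignores that both were scored on the same questions, and the helper name `z_score` is hypothetical.

```python
import math


def z_score(acc_a: float, se_a: float, acc_b: float, se_b: float) -> float:
    """Difference in accuracy divided by the combined standard error.

    Assumption: the two estimates are independent, so their variances add.
    """
    return (acc_a - acc_b) / math.sqrt(se_a**2 + se_b**2)


# flexible-extract, limit 500: base model vs. 90-512-512-99999999
z = z_score(0.926, 0.0117, 0.92, 0.0121)
print(f"z = {z:.2f}")
```

A value well below 2 suggests that, under these assumptions, the difference is within the sampling noise of a 500-question subset.
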
vllm (pretrained=/root/autodl-tmp/90-512-512-99999999,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.9), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.7755 | ± | 0.0033 |
| - humanities | 2 | none | | acc | ↑ | 0.7054 | ± | 0.0063 |
| - other | 2 | none | | acc | ↑ | 0.8243 | ± | 0.0065 |
| - social sciences | 2 | none | | acc | ↑ | 0.8671 | ± | 0.0060 |
| - stem | 2 | none | | acc | ↑ | 0.7428 | ± | 0.0075 |
vllm (pretrained=/root/autodl-tmp/90-512-600-99999999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.928 | ± | 0.0164 |
| | | strict-match | 5 | exact_match | ↑ | 0.916 | ± | 0.0176 |
vllm (pretrained=/root/autodl-tmp/90-512-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.924 | ± | 0.0168 |
| | | strict-match | 5 | exact_match | ↑ | 0.908 | ± | 0.0183 |
vllm (pretrained=/root/autodl-tmp/90-1024-256-99999999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.916 | ± | 0.0176 |
| | | strict-match | 5 | exact_match | ↑ | 0.900 | ± | 0.0190 |
vllm (pretrained=/root/autodl-tmp/90-512-4096-9.9999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.900 | ± | 0.0190 |
| | | strict-match | 5 | exact_match | ↑ | 0.892 | ± | 0.0197 |
vllm (pretrained=/root/autodl-tmp/90-1024-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.924 | ± | 0.0168 |
| | | strict-match | 5 | exact_match | ↑ | 0.908 | ± | 0.0183 |
vllm (pretrained=/root/autodl-tmp/90-1024-1024-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.912 | ± | 0.0127 |
| | | strict-match | 5 | exact_match | ↑ | 0.902 | ± | 0.0133 |
vllm (pretrained=/root/autodl-tmp/90-4096-512-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.92 | ± | 0.0172 |
| | | strict-match | 5 | exact_match | ↑ | 0.90 | ± | 0.0190 |
vllm (pretrained=/root/autodl-tmp/90-4096-512-99999,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.914 | ± | 0.0126 |
| | | strict-match | 5 | exact_match | ↑ | 0.902 | ± | 0.0133 |
vllm (pretrained=/root/autodl-tmp/40-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.888 | ± | 0.0200 |
| | | strict-match | 5 | exact_match | ↑ | 0.880 | ± | 0.0206 |
vllm (pretrained=/root/autodl-tmp/60-512-1024-5,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.912 | ± | 0.0180 |
| | | strict-match | 5 | exact_match | ↑ | 0.908 | ± | 0.0183 |
vllm (pretrained=/root/autodl-tmp/60-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.92 | ± | 0.0172 |
| | | strict-match | 5 | exact_match | ↑ | 0.92 | ± | 0.0172 |
vllm (pretrained=/root/autodl-tmp/60-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.89 | ± | 0.014 |
| | | strict-match | 5 | exact_match | ↑ | 0.89 | ± | 0.014 |
vllm (pretrained=/root/autodl-tmp/65-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.908 | ± | 0.0183 |
| | | strict-match | 5 | exact_match | ↑ | 0.908 | ± | 0.0183 |
vllm (pretrained=/root/autodl-tmp/60-1024-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.884 | ± | 0.0203 |
| | | strict-match | 5 | exact_match | ↑ | 0.880 | ± | 0.0206 |
vllm (pretrained=/root/autodl-tmp/70-512-1024-5,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.916 | ± | 0.0176 |
| | | strict-match | 5 | exact_match | ↑ | 0.900 | ± | 0.0190 |
vllm (pretrained=/root/autodl-tmp/70-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.916 | ± | 0.0176 |
| | | strict-match | 5 | exact_match | ↑ | 0.912 | ± | 0.0180 |
vllm (pretrained=/root/autodl-tmp/85-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.896 | ± | 0.0193 |
| | | strict-match | 5 | exact_match | ↑ | 0.896 | ± | 0.0193 |
vllm (pretrained=/root/autodl-tmp/86-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.904 | ± | 0.0187 |
| | | strict-match | 5 | exact_match | ↑ | 0.904 | ± | 0.0187 |
vllm (pretrained=/root/autodl-tmp/87-512-1024,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.888 | ± | 0.02 |
| | | strict-match | 5 | exact_match | ↑ | 0.888 | ± | 0.02 |

Model tree for noneUsername/Cydonia-R1-24B-v4.1-W8A8

Base model: TheDrummer/Cydonia-R1-24B-v4.1