QVAC Genesis I: the Largest and Highest-Quality Multi-domain Educational Synthetic Dataset for Pre-training
KEY HIGHLIGHTS
- Tether Data’s AI research division releases QVAC Genesis I, the largest synthetic dataset ever released for pre-training Large Language Models (LLMs) with a focus on educational content.
- Designed for multi-domain educational coverage, the dataset includes curriculum-aligned topics and materials across high-school, undergraduate, and professional domains: Mathematics, Physics, Biology, Medicine.
- Validated across multiple educational benchmarks, QVAC Genesis I consistently outperforms existing synthetic datasets in reasoning, knowledge, and subject-specific QA tasks.
- This marks the first public release of a synthetic dataset purpose-built and rigorously validated for education-specific content, offering deep and comprehensive coverage across key STEM domains.
- By making QVAC Genesis I openly available to researchers, Tether aims to empower the global AI community to accelerate the development of open-source educational LLMs, closing the gap with closed-source/proprietary models and democratizing access to foundational AI capabilities.
1. Introduction
Overview
Recent advances in large language model (LLM) pretraining have increased the focus on curating high-quality web-scale datasets. Synthetic data, generated to emulate real-world text distributions, has become critical for LLM development. Microsoft’s Phi model [1] demonstrated the value of large-scale synthetic datasets, generating billions of tokens for pretraining. HuggingFace subsequently developed Cosmopedia [2] to replicate Phi-1.5. Despite these efforts, existing synthetic datasets remain insufficiently refined for training state-of-the-art LLMs that can compete with leading closed-source/proprietary models.
Furthermore, generating high-quality pre-training synthetic datasets is resource-intensive and expensive. As a result, dataset creation has been limited to well-funded corporations and major research institutions, restricting broader participation in the AI research community.
Motivation for Large-Scale Synthetic Data
There is a need for publicly available, large-scale synthetic datasets that are rigorously curated. Such datasets can lower the barrier to entry for academic institutions, small research labs, and public organizations, enabling wider experimentation and innovation. Moreover, synthetic data can be tailored to cover critical educational and scientific domains, supporting specialized training aligned with real-world learning objectives.
Key Contributions and Objectives
To address these challenges, Tether Data, S.A. de C.V. (Tether Data, we, us, our) introduces QVAC Genesis I, a large-scale multi-domain educational synthetic dataset designed to support open, high-quality LLM pretraining. Our contributions include:
- Largest publicly available synthetic dataset. We generated 41 billion text tokens using a pipeline seeded by domain-labelled text from high-quality sources across critical educational domains, including mathematics, medicine, physics, and biology. This is the largest public pre-training synthetic dataset to date. 
- Education domain-specific datasets. We created synthetic data covering critical educational topics, including college-level general medicine, professional medicine, college-level biology, college-level mathematics, college-level physics, high school-level biology, high school-level mathematics, high school-level physics, and conceptual physics.
- Validated top-quality dataset. Ablation studies on global benchmarks, such as MMLU, show that our datasets outperform existing synthetic datasets, achieving state-of-the-art accuracy across multiple educational topics. 
- Open-source contribution. QVAC Genesis I democratizes access to high-quality pretraining data, enabling participation from public institutions, small research labs, and the academic community, fostering a more inclusive AI research ecosystem. 
2. Methodology
Our methodology consists of a four-stage pipeline designed to generate high-quality synthetic educational content through systematic error analysis and correction. The approach leverages state-of-the-art language models to create domain-specific educational materials that address common misconceptions and learning gaps.
Learning From Failures Pipeline Diagram
Figure 1. Diagram of the pipeline for generating synthetic data. Seed data are the input to the Quality Filter, whose output feeds the Scaling QA phase, in which four questions (each with options and a target) are generated per seed. Each question then moves on to phase two (Model Answering), where an LLM generates a proposed solution. Finally, only proposed solutions that differ from the target (Compare to Gold Label) move on to the last phase (Failure Analysis), where an analysis of the incorrect answer and the correct solution is generated in four different styles (educational textbook, web article, Q&A, conversational dialogue).
2.1 Seed Data Acquisition
Web-Based Source Selection Criteria
- Seed corpus: We evaluated several open-source datasets, including DCLM and FineWeb-Edu, but they offered limited control over domain coverage, which was critical for our goals. After further analysis, we chose FineFineWeb [3], a state-of-the-art open-source dataset built on FineWeb that exposes 60+ curated categories. This let us target domain-specific slices (especially mathematics, physics, medicine, and biology) aligned with our objectives.
- Domain scope: From FineFineWeb, we extract STEM seeds exclusively in the following subdomains:
  - Biology
  - Medicine
  - Physics
  - Maths
Furthermore, to provide reasoning and commonsense data, we include small samples from Logical Deduction and HellaSwag datasets. These consist of 1,200 examples from the BIG-bench Logical Deduction training set and 40,000 examples from the Rowan HellaSwag training set.
Curation and Filtering of Seed Content
- Domain extraction: We subset FineFineWeb to the listed STEM subdomains (biology, medicine, physics, maths), which we then used to generate a total of 9 specific subdomains:
  - College Biology, High School Biology
  - College Medicine, Professional Medicine
  - College Mathematics, High School Mathematics
  - College Physics, High School Physics, Conceptual Physics
- Quality filtering: We score each document with the Ultra-FineWeb-classifier [4], a lightweight fastText model trained within Ultra-FineWeb’s verification-based filtering pipeline. During the construction of Ultra-FineWeb, the team applied this pipeline to FineWeb datasets. Documents that passed the pipeline’s verification checks became positive training examples; those filtered out served as negatives. The resulting classifier predicts the probability that a page is “high-quality” according to these verified labels and is optimized for throughput at web scale. In our setup, we run the classifier over our candidate subset and retain only seeds whose score exceeds the recommended high-quality threshold. 
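The threshold step described above can be sketched as follows. This is a minimal illustration, not the authors' code: `filter_seeds` and `toy_score` are hypothetical names, the scorer is a stand-in for the real classifier, and the threshold value is an assumption (the text only says "recommended high-quality threshold").

```python
from typing import Callable, Iterable, Iterator


def filter_seeds(
    documents: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> Iterator[str]:
    """Yield only documents whose quality score exceeds the threshold."""
    for doc in documents:
        if score_fn(doc) > threshold:
            yield doc


def toy_score(doc: str) -> float:
    """Stand-in scorer for illustration only; NOT the Ultra-FineWeb classifier."""
    return min(len(doc.split()) / 10.0, 1.0)


seeds = [
    "Mitochondria produce ATP through oxidative phosphorylation in eukaryotic cells.",
    "click here",
]
# threshold=0.5 is an assumed value chosen for this toy scorer.
kept = list(filter_seeds(seeds, toy_score, threshold=0.5))
print(kept)
```

In a real run, `score_fn` would wrap the classifier's predicted probability for the "high-quality" label, and the generator form keeps memory flat when streaming web-scale subsets.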
2.2 Prompt Engineering
Prompt Design Strategies
Objective. From the seed pools, generate multiple-choice questions per domain/level (e.g., college_biology) and then use a small SOTA model to produce an answer. Only incorrect model answers are forwarded to the failure-analysis stage, where a final text is generated in four different styles (educational textbook, question-answer, web article, and conversational dialogue). In each text, the incorrect solution is analysed first and the correct solution is then given.
Scaling QA Methodology
Our approach focuses on systematically generating large synthetic question–answer (QA) data from unstructured scientific text. We begin with domain-specific seed passages drawn from medical, biological, physical, and mathematical sciences, ensuring broad conceptual coverage across diverse knowledge areas. Using a scaled prompting strategy, a large-capacity language model is instructed to generate multiple-choice QA pairs inspired by the topics of each seed passage. Each pair consists of a question, four options, and one correct answer.
The prompting process is dynamically adjusted to produce different levels of conceptual complexity, ranging from high-school fundamentals to college-level analytical reasoning. By modifying the prompt design, the same framework can be extended to generate domain-specific data of varying difficulty, enabling rapid expansion of high-quality training material for any scientific discipline. The resulting synthetic QA corpus is employed as annealing data, helping to refine the model during late-stage pretraining or fine-tuning for task alignment. The same data are also used for inference-time evaluation and failure analysis across existing language models. By analyzing the types of questions where models consistently underperform, such as reasoning-intensive, multi-concept, or numerically grounded items, we can systematically identify the weaknesses of each model.
This methodology demonstrates how scalable prompting can be leveraged to create domain-balanced, complexity-controlled synthetic QA data that supports both model assessment and future pretraining efforts. It bridges the gap between raw scientific text and structured evaluation resources, helping to reveal capability gaps in large language models across critical scientific domains. For detailed information about the prompt used see Appendix (Prompt Templates).
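The row format requested by the Scaling QA template in the Appendix (six quoted, comma-separated fields per line, exactly four lines, one correct answer per letter) lends itself to a strict parser. The sketch below assumes the generator output is well-formed CSV; `parse_mcq_rows` and the field names are hypothetical, and the sample rows are invented for illustration.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["question", "A", "B", "C", "D", "answer"]


def parse_mcq_rows(raw_output: str) -> List[Dict[str, str]]:
    """Parse generator output into MCQ records and enforce the template's hard constraints."""
    reader = csv.reader(io.StringIO(raw_output.strip()), skipinitialspace=True)
    rows = [row for row in reader if row]  # skip blank lines
    if len(rows) != 4:
        raise ValueError(f"expected 4 questions, got {len(rows)}")
    records = []
    for row in rows:
        if len(row) != 6:
            raise ValueError(f"expected 6 fields, got {len(row)}")
        records.append(dict(zip(FIELDS, (field.strip() for field in row))))
    # The template asks for one correct answer per letter.
    letters = sorted(record["answer"] for record in records)
    if letters != ["A", "B", "C", "D"]:
        raise ValueError(f"answer letters not evenly distributed: {letters}")
    return records


sample = '''
"What is the SI unit of force?","joule","newton","watt","pascal","B"
"Calculate 7 x 8.","56","54","63","48","A"
"Which gas do plants absorb during photosynthesis?","oxygen","nitrogen","hydrogen","carbon dioxide","D"
"Complete the statement: the powerhouse of the cell is the ___.","nucleus","ribosome","mitochondrion","Golgi apparatus","C"
'''
records = parse_mcq_rows(sample)
print(records[0]["question"], "->", records[0]["answer"])
```

Rejecting malformed generations outright, rather than repairing them, is one possible design choice; the source does not say how such cases were handled.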
Answer Generation and Extraction Methodology
Our answer generation and extraction approach focuses on systematically identifying and analyzing where state-of-the-art models fail, providing valuable insights into model limitations and creating targeted training data. We employ a sophisticated LLM-as-a-Judge framework to extract answers from model responses, enabling comprehensive analysis of model performance across different problem types and complexity levels.
Objective: The primary goal is to observe the output of state-of-the-art models and systematically extract question-answer pairs where they fail, creating a rich dataset of model weaknesses and misconceptions that can be used for targeted training and improvement.
Methodology: We use a three-stage process for answer generation and extraction:
- Model Response Generation: State-of-the-art models generate complete responses to evaluation questions across multiple domains and complexity levels
- Answer Extraction: A specialized LLM judge extracts the final answer from the model's complete response using our sophisticated extraction framework
- Failure Identification: We systematically identify cases where model responses differ from ground truth, capturing various types of model failures
This methodology enables us to systematically capture model failures across different domains, creating a comprehensive dataset of model weaknesses that can be used for targeted training and improvement. For detailed information about response categories, extraction processes, and the complete LLM-as-a-Judge framework, see Section 4.2, and for evaluation prompt see Appendix (Prompt Templates).
Failure Analysis Methodology
Our failure analysis approach focuses on creating high-quality educational content by systematically analyzing where state-of-the-art models fail and generating comprehensive explanations that not only provide correct answers but also analyze the reasoning behind model failures. This creates rich, pedagogically valuable content that addresses common misconceptions and learning gaps.
Objective: The primary goal is to create high-quality synthetic data in four different styles where not only the correct answer is provided, but also a thorough analysis of state-of-the-art model failures is included, creating comprehensive educational content that addresses misconceptions and learning gaps.
Methodology: We employ a systematic approach to failure analysis that generates synthetic educational content in four distinct styles:
- Educational Textbook Style: Formal, comprehensive explanations that provide both correct solutions and analysis of common errors
- Question-Answer Format: Structured Q&A content that addresses specific failure patterns and misconceptions
- Web Articles Style: Accessible, engaging content that explains complex concepts through failure analysis
- Conversational Dialogue Style: Natural tutoring sessions that guide learners through error analysis and correct reasoning
All four styles are generated from the MCQ, the model's wrong answer, and the correct label.
This methodology demonstrates how systematic failure analysis can be leveraged to create domain-balanced, pedagogically-rich synthetic data that supports both model assessment and educational content generation. For detailed information about the four-style content generation process and specific prompt templates, see Appendix (Prompt Templates).
Diversity and Coverage Optimization
Domain-Level Balance:
- Per-domain/level generation. For each of the domains/levels listed above, generate items so that every domain/level is represented with equal weight to ensure comprehensive coverage across all educational domains.
Error Distribution Strategy:
- Balanced error collection. Select only incorrect model answers for the next stage, ensuring that errors are gathered across all domains/levels rather than concentrating on a single area. This approach maximizes learning opportunities by addressing misconceptions across the entire educational spectrum.
Format Standardization:
- MCQ format consistency. Keep the same four-option structure and answer label format across domains/levels to maintain comparable items and ensure consistent evaluation metrics.
Quality Assurance Measures:
- Content validation: Automated checks for answer key consistency, option overlap, and length ratios
- Semantic deduplication: Removal of near-duplicate content to prevent overfitting
- Difficulty calibration: Balanced distribution of question complexity within each domain/level
- Expert review: Manual validation of edge cases and ambiguous content
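As an illustration of the automated content-validation checks listed above (answer key consistency, option overlap, length ratios), a minimal sketch might look like this. The function name, the record layout, and the `max_length_ratio` default are all assumptions; the source does not give the actual rules or thresholds.

```python
from typing import Dict, List


def validate_mcq(item: Dict[str, str], max_length_ratio: float = 4.0) -> List[str]:
    """Return a list of problems found; an empty list means the item passed."""
    problems = []
    options = [item.get(key, "").strip() for key in "ABCD"]
    # Answer key consistency: the label must name one of the four options.
    if item.get("answer") not in ("A", "B", "C", "D"):
        problems.append("answer key is not one of A-D")
    if any(not option for option in options):
        problems.append("empty option")
    # Option overlap: case-insensitive duplicates make the item ambiguous.
    normalized = [option.casefold() for option in options]
    if len(set(normalized)) < len(normalized):
        problems.append("duplicate options")
    # Length ratio: one option far longer than the rest can leak the answer.
    lengths = [len(option) for option in options if option]
    if lengths and max(lengths) / min(lengths) > max_length_ratio:
        problems.append("option lengths too uneven")
    return problems


good = {"question": "Annual growth in kg/year at 13.6 kg/day?",
        "A": "5000", "B": "2500", "C": "1000", "D": "13,600", "answer": "A"}
bad = {"question": "What is the SI unit of force?",
       "A": "newton", "B": "Newton", "C": "watt", "D": "pascal", "answer": "E"}
print(validate_mcq(good))
print(validate_mcq(bad))
```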
2.3 Synthetic Data Generation
Tooling. We orchestrate the end-to-end pipeline using distilabel [5] running against a vLLM inference server (vLLM Team, 2024).
Pipeline Orchestration:
- distilabel (orchestration & AI feedback). We employ distilabel (Argilla, 2024), a framework for synthetic data and AI feedback designed to build fast, reliable, and scalable pipelines. It models workflows as a DAG of steps (e.g., generate → judge → filter), comes with ready-made tasks for common patterns like LLM-as-a-judge, and integrates smoothly with Argilla for storing datasets and optional human-in-the-loop review. In practice, we use distilabel to define prompt templates, spawn "generator" models, attach "judge" steps to rate outputs (helpfulness, correctness, etc.), and write back structured records plus scores for downstream filtering and evaluation. 
- vLLM (serving). We host the LLMs behind vLLM (vLLM Team, 2024), benefiting from its standard-compatible API, streaming responses, continuous batching, and PagedAttention for high-throughput, memory-efficient inference. This lets us scale generation and judging steps without changing our pipeline code. 
- Integration. distilabel sends generation and evaluation requests to vLLM; results flow back into the DAG where we apply rubric-based filters and retain only examples that meet target quality thresholds. The same setup lets us reuse judge steps to clean or re-rank data created in earlier runs, keeping the pipeline reproducible and easy to iterate. 
Model Architecture. We used the following open-source models in the various stages:
- Generation model: QwQ-32B [6] for question generation and failure analysis 
- Answer stage model: Qwen3-1.7B-Base [7] for model answering 
- Failure-analysis stage: QwQ-32B for generating educational content from incorrect responses 
Flow.
- Seed → Item generation (QwQ-32B). For each domain/level (e.g., college_biology), using distilabel + vLLM, QwQ-32B generates the question, options A–D, and the gold label from the seed. 
- Answer stage (Qwen3-1.7B-Base). The MCQ is posed to Qwen3-1.7B-Base with the fixed MCQ answer template (see Appendix, Prompt Templates). 
- Answer extraction and error routing (QwQ-32B). We use a sophisticated LLM-as-a-Judge framework for answer extraction that can handle various response patterns and edge cases. This approach represents a significant advancement over traditional log-likelihood-based evaluation methods. If the extracted answer ≠ gold label, we pass (problem, model response, correct label) to failure analysis. 
 Example:
 Question: "Calculate the approximate annual growth rate in kg/year of an organism that gains 13.6 kg per day during its peak growth phase. Assume 365 days/year."
 A. 5000
 B. 2500
 C. 1000
 D. 13,600
 Answer:
 Gold label: A
 Model output: "13,600\nThis question was last updated by Bob in July 2018."
 Since the model's answer (D) ≠ gold label (A), this question and incorrect answer are forwarded to failure analysis.
 For detailed information about the answer extraction framework, response categories, and extraction processes, see Section 4.2.
- Failure analysis (QwQ-32B). We prompt QwQ-32B to generate the analysis in one of the four styles using the problem, proposed solution, and correct answer. For detailed information about the four-style content generation process and specific prompt templates, see Appendix (Prompt Templates) and Section 4.2. 
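The error-routing rule in the flow above (forward an item only when the extracted answer differs from the gold label) can be sketched as a simple filter. `AnsweredItem` and `route_failures` are hypothetical names. Treating MULTIPLE_ANSWERS and NO_ANSWER as mismatches is an assumption that follows from plain inequality; the source does not say how those cases were routed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AnsweredItem:
    problem: str
    model_response: str
    extracted_answer: str  # "A".."D", "MULTIPLE_ANSWERS", or "NO_ANSWER"
    gold_label: str


def route_failures(items: List[AnsweredItem]) -> List[Dict[str, str]]:
    """Keep only items whose extracted answer differs from the gold label."""
    return [
        {
            "problem": item.problem,
            "model_response": item.model_response,
            "correct_label": item.gold_label,
        }
        for item in items
        if item.extracted_answer != item.gold_label
    ]


items = [
    AnsweredItem(
        problem="Annual growth rate in kg/year at 13.6 kg per day?",
        model_response="13,600\nThis question was last updated by Bob in July 2018.",
        extracted_answer="D",
        gold_label="A",
    ),
    AnsweredItem(
        problem="What is the SI unit of force?",
        model_response="B",
        extracted_answer="B",
        gold_label="B",
    ),
]
failures = route_failures(items)
print(len(failures))
```

Only the first item is forwarded, mirroring the worked example in the text where the model's answer (D) does not match the gold label (A).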
3. Pre-training Setup
3.1 Model Architecture and Parameters
We pre-train a 1.7B-parameter transformer (Qwen3 family) initialized from scratch with BF16 mixed precision and context length 4,096. Tokenization uses the Qwen3 tokenizer; data are stored in HuggingFace Datasets (Arrow). The corpus totals 41B tokens (multi-domain) and is traversed for 1 epoch via a PyTorch DataLoader. To improve stability and throughput, we enable activation checkpointing, fused kernels where available (fused attention/optimizer), FlashAttention2 on H100, and torch.compile (safe mode) once the run is stable.
Optimization follows AdamW (weight decay 0.01), learning rate 2e-4, warmup 600 steps, gradient clipping 1.0, and seed 42. Per-GPU micro-batch is 4 with gradient accumulation 8 across 480 GPUs, yielding an effective global batch of 4×8×480=15,360 samples/step. We log train metrics every 50 steps, validate every 500 steps (20 eval iters), create checkpoints every 1000 steps, and support resume with exact optimizer/state restoration. We achieved a training speed of 1.5 seconds per step. We note common failure modes and mitigations: BF16 overflow (addressed via dynamic loss scaling), NCCL stalls (timeouts and interface pinning), and fragmentation (CUDA max_split_size_mb=512, expandable segments, GC threshold 0.8).
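The effective batch arithmetic can be checked with a few lines. The variable names are illustrative, and the tokens-per-step figure is an upper bound that assumes every sample is packed to the full 4,096-token context, which the text does not state.

```python
# Values taken from the training setup described above.
micro_batch_per_gpu = 4
grad_accum_steps = 8
num_gpus = 480
context_length = 4096

# Samples consumed per optimizer step across all GPUs.
global_batch = micro_batch_per_gpu * grad_accum_steps * num_gpus

# Upper bound on tokens per step, assuming fully packed sequences.
max_tokens_per_step = global_batch * context_length

print(global_batch)         # 15360
print(max_tokens_per_step)  # 62914560
```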
3.2 Multi‑node GPU Setup
We made multiple training runs on 60 nodes with 8× NVIDIA H100 80GB per node (480 GPUs total), 8 CPUs per task, ~800 GB RAM per node, Slurm priority partition, exclusive allocation, and 72-hour time limit. We launch with srun using PyTorch DDP (world size 480), auto-detect the master from Slurm, and bind ranks to GPUs via Slurm’s environment. Stdout/stderr are streamed to logs_training/qvac_60node_training_%j.{out,err}; checkpoints are sharded and saved periodically for robust resume.
Networking is NCCL over InfiniBand with UCX transports. We set NCCL_IB_DISABLE=0, NCCL_IB_HCA="mlx5", NCCL_SOCKET_IFNAME, and NCCL_BLOCKING_WAIT=1, with a 720-second watchdog to fail fast on fabric issues. UCX is configured for multi-device transport; we also pin file system threads and enable asynchronous I/O prefetch to keep GPUs fed.
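One way to collect these settings in a launch script is sketched below. `nccl_environment` is a hypothetical helper, and the interface name is left as a required argument because the text does not give its value. How the 720-second watchdog was wired in is also not stated; passing the `timedelta` as the `timeout` argument of `torch.distributed.init_process_group` is one plausible option.

```python
from datetime import timedelta
from typing import Dict


def nccl_environment(socket_ifname: str) -> Dict[str, str]:
    """Build the NCCL environment variables named in the text."""
    return {
        "NCCL_IB_DISABLE": "0",  # keep InfiniBand enabled
        "NCCL_IB_HCA": "mlx5",
        "NCCL_SOCKET_IFNAME": socket_ifname,  # value not given in the source
        "NCCL_BLOCKING_WAIT": "1",
    }


# 720-second watchdog from the text.
WATCHDOG_TIMEOUT = timedelta(seconds=720)

# "IFNAME_PLACEHOLDER" is a placeholder, not a real interface name.
env = nccl_environment(socket_ifname="IFNAME_PLACEHOLDER")
for key, value in env.items():
    print(f"export {key}={value}")
```

In practice the dictionary would be merged into `os.environ` before the process group is initialized, so every rank sees the same settings.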
Reliability & observability: W&B captures metrics, system traces, and artifacts; we additionally export structured logs (throughput, TFLOPs/GPU, GPU/host memory, step time). For reproducibility, we fix seeds, log exact launch scripts and env, and report effective tokens/step and utilization.
4. Evaluation and Results
4.1 Dataset Statistics
Volume, Diversity, and Domain Coverage:
| Domain | Number of Samples | Number of Tokens (billions) | 
|---|---|---|
| High school biology | 3,818,070 | 4.511 | 
| College biology | 3,286,648 | 3.927 | 
| Professional medicine | 1,552,474 | 1.884 | 
| College medicine | 5,164,247 | 6.218 | 
| High school mathematics | 3,244,240 | 4.277 | 
| College mathematics | 5,895,052 | 8.243 | 
| High school physics | 2,277,880 | 3.061 | 
| College physics | 4,281,062 | 5.814 | 
| Conceptual physics | 2,354,184 | 2.973 | 
| Total | 31,873,857 | 40.906 | 
4.2 LLM-as-a-Judge Evaluation
Figure 2. Histogram of the results obtained with the LLM-as-a-Judge method using the OpenCompass framework. The x-axis shows the educational domains of the MMLU dataset and the y-axis shows the score. QVAC Genesis I performs better on average than the current largest synthetic dataset, Cosmopedia, and also in every individual topic and level domain except college physics.
Figure 3. Different representation of results obtained using LLM as a judge via the OpenCompass framework.
Methodology and Framework
We developed a robust and stable evaluation framework using OpenCompass [8] that leverages LLM-as-a-Judge methodology to extract answers from model outputs. This approach represents a significant advancement over traditional log-likelihood-based evaluation methods commonly used in benchmarking.
Traditional Log-Likelihood Limitations:
- Relies on next-token probability prediction, which may not capture the model's true reasoning capabilities
- Models may require multiple tokens to arrive at the correct answer or may self-correct during generation
- Cannot handle cases where the model fails to provide a clear answer or provides multiple conflicting responses
- Does not account for the model's ability to reason through complex problems step-by-step
Our LLM-as-a-Judge Approach: Our evaluation framework addresses these limitations by implementing a three-stage process:
- Response Generation: The model generates a complete response to the evaluation question
- Answer Extraction: A specialized LLM judge extracts the final answer from the model's complete response
- Exact Matching: The extracted answer is compared against the ground truth using exact string matching
This methodology provides several advantages:
- Captures the model's complete reasoning process rather than just next-token predictions
- Handles cases where models self-correct or require multiple reasoning steps
- Provides clear evaluation of cases where models cannot provide definitive answers
- Enables more nuanced assessment of model capabilities across different problem types
LLM as a Judge Evaluation Pipeline Diagram
Figure 4. Diagram of our evaluation pipeline. Stage 1: The model to be evaluated generates a complete response to the evaluation question. Stage 2: A specialized LLM judge extracts the final answer from the model's complete response. Stage 3: The extracted answer is compared against the ground truth using exact string matching. In the end, for each output we will have: Correct, Incorrect, Multiple Answer or No Answer.
Answer Extraction Framework
We implemented a sophisticated answer extraction system that can handle various response patterns and edge cases:
Response Categories:
- Valid Answer: Single, clear choice (A, B, C, or D)
- MULTIPLE_ANSWERS: When the model provides conflicting or multiple different answers
- NO_ANSWER: When no clear answer can be identified in the response
Extraction Process: The LLM judge analyzes the complete model response to identify:
- Explicit answer statements (e.g., "ANSWER: A", "The answer is B")
- Boxed format answers (e.g., \boxed{A})
- Standalone letter choices in conclusions
- Self-corrections and final settled answers
- Generated questions vs. original question answers
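Once the judge has responded in the format requested by the extraction template in the Appendix, its verdict still has to be read back programmatically. The sketch below is an assumption about how that could be done: `parse_judge_output` is a hypothetical helper, and falling back to NO_ANSWER when the expected line is missing is a design choice not confirmed by the source.

```python
import re

# Matches the final line requested by the extraction template.
_ANSWER_PATTERN = re.compile(
    r"Extracted Candidate's Answer:\s*\[?\s*(MULTIPLE_ANSWERS|NO_ANSWER|[ABCD])\b"
)


def parse_judge_output(judge_text: str) -> str:
    """Return 'A'-'D', 'MULTIPLE_ANSWERS', or 'NO_ANSWER' from the judge's reply."""
    matches = _ANSWER_PATTERN.findall(judge_text)
    # Assumption: use the last occurrence, and treat a missing line as NO_ANSWER.
    return matches[-1] if matches else "NO_ANSWER"


reply = (
    "Extraction Reasoning: The response opens with 13,600, which matches option D.\n"
    "Extracted Candidate's Answer: D"
)
print(parse_judge_output(reply))
print(parse_judge_output("Extracted Candidate's Answer: [MULTIPLE_ANSWERS]"))
print(parse_judge_output("The judge produced no structured verdict."))
```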
Evaluation Prompt Template
Our evaluation system uses a carefully designed prompt template that ensures consistent and reliable answer extraction. For detailed information about the complete prompt template, see Appendix (Prompt Templates).
Scoring Criteria and Metrics
Primary Metrics:
- Accuracy: Percentage of correctly answered questions (exact match between extracted answer and ground truth)
- No Answer Rate: Percentage of responses classified as NO_ANSWER
- Multiple Answer Rate: Percentage of responses classified as MULTIPLE_ANSWERS
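The three primary metrics reduce to simple counts over (extracted answer, ground truth) pairs. The sketch below assumes that representation; `score` is a hypothetical helper and the four sample pairs are invented for illustration.

```python
from typing import Dict, List, Tuple


def score(results: List[Tuple[str, str]]) -> Dict[str, float]:
    """Compute accuracy, no-answer rate, and multiple-answer rate as percentages."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to score")
    correct = sum(1 for extracted, gold in results if extracted == gold)
    no_answer = sum(1 for extracted, _ in results if extracted == "NO_ANSWER")
    multiple = sum(1 for extracted, _ in results if extracted == "MULTIPLE_ANSWERS")
    return {
        "accuracy": 100.0 * correct / total,
        "no_answer_rate": 100.0 * no_answer / total,
        "multiple_answer_rate": 100.0 * multiple / total,
    }


# (extracted answer, ground truth) pairs; invented sample data.
pairs = [("A", "A"), ("D", "A"), ("NO_ANSWER", "B"), ("MULTIPLE_ANSWERS", "C")]
metrics = score(pairs)
print(metrics)
```

Because NO_ANSWER and MULTIPLE_ANSWERS never equal a ground-truth letter, they count as incorrect under exact matching while still being reported separately.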
Quality Assurance:
- Inter-annotator agreement on answer extraction
- Manual validation of edge cases
- Consistency checks across different model outputs
- Robustness testing with various response formats
Advantages Over Traditional Methods
- Comprehensive Evaluation: Captures the full reasoning process rather than just next-token predictions
- Handles Edge Cases: Properly categorizes ambiguous, multiple, or missing answers
- Real-world Applicability: Reflects how models actually perform in practical scenarios
- Fair Comparison: Provides consistent evaluation across different model architectures and training approaches
- Interpretability: Clear categorization of model response types enables better understanding of model capabilities and limitations
This evaluation framework provides a more accurate and comprehensive assessment of model performance, particularly for complex reasoning tasks where traditional log-likelihood methods may not capture the full extent of model capabilities.
4.3 Next-Token Prediction Performance
- Benchmark Tasks and Datasets
- Accuracy and Generalization Analysis
Figure 5. Histogram of the results obtained with the log-likelihood method using the LM-Harness framework. Here too, QVAC Genesis I performs better on average than the current largest synthetic dataset, Cosmopedia, and also in every individual topic and level domain except college physics.
Figure 6. Different representation of results obtained using the Loglikelihood method and LM-Harness framework.
5. Conclusion
- Summary of Findings: We built the largest synthetic dataset to date, with 41 billion tokens covering 10 critical educational topics, and we intend to generate and publish more tokens to cover all other domains. Compared with Cosmopedia v2, the current state-of-the-art synthetic dataset, QVAC Genesis I achieves superior performance (i.e., accuracy, quality), as demonstrated on the MMLU benchmark across the 10 selected critical topics.
- Implications for Future Pre-training: The public, researchers, academics, research institutions, practitioners, and the wider AI community can use the dataset to build SOTA base models. This also lays a strong foundation for post-training.
- Limitations and Next Steps: We currently focus on four key educational areas: medicine, mathematics, physics, and biology. We plan to generate synthetic data for the remaining STEM domains in FineFineWeb.
6. References
[1] Li, Y., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., & Lee, Y. T. (2023). Textbooks Are All You Need II: phi-1.5 technical report. arXiv preprint. https://arxiv.org/abs/2309.05463
[2] Hugging Face. Cosmopedia: A synthetic dataset for pretraining language models. Hugging Face Hub. https://huggingface.co/datasets/HuggingFaceTB/Cosmopedia
[3] m-a-p. FineFineWeb: A curated web corpus with domain categorization. Hugging Face Hub. https://huggingface.co/datasets/m-a-p/FineFineWeb
[4] OpenBMB. Ultra-FineWeb-classifier: A quality classifier for web content. Hugging Face Hub. https://huggingface.co/openbmb/Ultra-FineWeb-classifier
[5] Argilla. Distilabel: A framework for synthetic data and AI feedback. GitHub. https://github.com/argilla-io/distilabel
[6] Qwen Team. QwQ-32B: A large language model for question answering. Hugging Face Hub. https://huggingface.co/Qwen/QwQ-32B
[7] Qwen Team. Qwen3-1.7B-Base: A compact base language model. Hugging Face Hub. https://huggingface.co/Qwen/Qwen3-1.7B-Base
[8] OpenCompass. OpenCompass: An LLM evaluation platform. GitHub. https://github.com/open-compass/opencompass
[9] vLLM Team. vLLM: Easy, fast, and cheap LLM serving for everyone. GitHub. https://github.com/vllm-project/vllm
7. Appendix
Ablation studies (experimental results on different independent + mixed datasets)
- Experimented with 1 epoch and 2 epochs on Cosmopedia v2.
- Assigned domain-wise weighting: each super-domain (maths, biology, medicine, and physics) is assigned a 25% weight.
- Assigned complexity-wise weighting.
- No upsampling or downsampling was done, to maintain the reproducibility of the trained models.
Prompt Templates
Scaling QA Template (for QwQ-32B):
You are a {{level}} {{domain}} tutor. I will give you a context passage.
IGNORE the details of the passage. Use it only as inspiration for topics, but DO NOT refer to the passage, book, text, or "discussed/mentioned" material in your questions.
Context:
{{output}}
Generate EXACTLY 4 independent, self-contained questions suitable for {{level}} level {{domain}}.
Output rules (MUST be followed strictly):
- Output EXACTLY 4 samples.
- Each row = ONE question in the following format:
  "<question>","<choiceA>","<choiceB>","<choiceC>","<choiceD>","<correct_choice_letter>"
- Each row MUST end with a newline (line break).
- Do NOT join multiple questions into one line.
- Do NOT output any extra text, headers, or blank lines.
Field definitions:
- <question> = a fully self-contained {{level}}-level {{domain}} problem. 
  It must NOT reference any passage, book, figure, table, or "as discussed" or "as mentioned" style wording. 
  Include all necessary values, constants, or definitions directly inside the question.
- <choiceA> ... <choiceD> = four mutually exclusive, plausible answer choices.
- <correct_choice_letter> = exactly one of A, B, C, D.
Question variety:
- Use a mix of imperative (e.g., "Calculate…"), interrogative, completion/fill-in, and true/false style.
- Do NOT start all questions with the same word.
- Include realistic values/constants when needed (e.g., g = 9.8 m/s²).
- Use standard SI units and symbols.
Hard constraints:
- Generate exactly 4 questions (4 samples).
- Correct answers must be evenly distributed: one A, one B, one C, one D.
- Randomize placement of correct answers; no clustering.
- Each row must contain exactly 6 comma-separated values.
- Do NOT add quotes around the entire block; only around individual fields.
Output ONLY the 4 samples in the required format, nothing else.
MCQ Answer Template (for Qwen3-1.7B-Base):
Question: {{question}}
A. {{option_a}}
B. {{option_b}}
C. {{option_c}}
D. {{option_d}}
Answer:
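Filling the MCQ answer template is a plain string substitution. The sketch below uses Python's `str.format` with single braces in place of the double-brace placeholders shown above; `render_mcq_prompt` is a hypothetical helper, not the authors' code.

```python
from typing import Sequence

# Same layout as the template above, rewritten with str.format placeholders.
MCQ_ANSWER_TEMPLATE = (
    "Question: {question}\n"
    "A. {option_a}\n"
    "B. {option_b}\n"
    "C. {option_c}\n"
    "D. {option_d}\n"
    "Answer:"
)


def render_mcq_prompt(question: str, options: Sequence[str]) -> str:
    """Fill the four-option template; the prompt ends at 'Answer:' for the base model to continue."""
    if len(options) != 4:
        raise ValueError(f"expected 4 options, got {len(options)}")
    option_a, option_b, option_c, option_d = options
    return MCQ_ANSWER_TEMPLATE.format(
        question=question,
        option_a=option_a,
        option_b=option_b,
        option_c=option_c,
        option_d=option_d,
    )


prompt = render_mcq_prompt(
    "Calculate the approximate annual growth rate in kg/year of an organism "
    "that gains 13.6 kg per day. Assume 365 days/year.",
    ["5000", "2500", "1000", "13,600"],
)
print(prompt)
```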
LLM-as-a-Judge Answer Extraction Template (for Quality Control):
MMLU_ANSWER_EXTRACTOR_TEMPLATE = """
You are an expert answer extractor. Your ONLY job is to extract the final answer from the candidate's response.
CRITICAL INSTRUCTIONS - READ CAREFULLY:
- DO NOT solve the question yourself
- DO NOT generate a new answer
- DO NOT think about what the correct answer should be
- DO NOT evaluate whether the candidate's answer is right or wrong
- ONLY extract what the candidate actually wrote as their final answer
Your task is purely extraction, not generation or evaluation.
Here are the extraction guidelines for MMLU multiple choice questions:
1. Look for the candidate's final answer in their response. This should be one of: A, B, C, or D
   - Look for explicit statements like "ANSWER: A", "The answer is B", "Final answer: C"
   - Look for \boxed{A} format (extract what's inside the braces)
   - Look for standalone letters A, B, C, or D in their conclusion
   - The final choice they settle on in their reasoning
2. If the candidate's response contains multiple different answers:
   - If they state one answer but then correct/fix themselves, extract the corrected/final answer they provided
   - If they state multiple different answers without choosing one or correcting themselves, return "MULTIPLE_ANSWERS"
3. If you cannot clearly identify a single letter choice (A, B, C, or D) in the response, return "NO_ANSWER"
4. If the candidate generates new questions:
   - DO NOT extract the answer from the new generated questions
   - ONLY extract the answer from the original question
RESPONSE FORMAT:
First, provide a brief explanation of why you are extracting that particular answer (what indicators you found in the candidate's response).
Then, provide the extracted answer as a single letter.
Use this exact format:
Extraction Reasoning: [Brief explanation of what indicators led you to extract this answer from the candidate's response]
Extracted Candidate's Answer: [A single letter: A, B, C, or D. Use MULTIPLE_ANSWERS if the candidate provided multiple different answers, or NO_ANSWER if no clear answer was found.]
<Original Question Begin>: 
Question: {input}
A. {A}
B. {B}
C. {C}
D. {D}
Answer: 
<Original Question End>
<Candidate's Response Begin>: 
{prediction}
<Candidate's Response End>
"""
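The judge is asked to reply in a fixed two-line format, so its output can be parsed with a short regular expression. This is a sketch under assumptions: the function name `parse_extraction` is hypothetical, and returning `None` for an unparseable reply is a design choice made here, not something the source specifies.

```python
import re
from typing import Optional

# Matches the final line of the judge's required response format.
# Optional square brackets are tolerated in case the judge echoes them.
_ANSWER_LINE = re.compile(
    r"Extracted Candidate's Answer:\s*\[?\s*(MULTIPLE_ANSWERS|NO_ANSWER|[ABCD])\b"
)


def parse_extraction(judge_output: str) -> Optional[str]:
    """Return 'A'-'D', 'MULTIPLE_ANSWERS', or 'NO_ANSWER'.

    Returns None when the judge did not follow the response format,
    so the caller can decide whether to retry or discard the sample.
    """
    matches = _ANSWER_LINE.findall(judge_output)
    return matches[-1] if matches else None
```

Taking the last match guards against a judge that restates the format line before giving its actual extraction.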
Educational Textbook Template (for Failure Analysis):
You are an educational content creator generating high-quality textbook explanations for academic learning.
Given:
• The problem: {{prompt}}
• A proposed solution: {{full_response}}
• The correct answer: {{target}}
**Content Requirements:**
- Generate comprehensive educational content of MAXIMUM 3000 words
- Final answers must appear within \boxed{…}
- Write in clear, pedagogical language suitable for textbooks
- Create appropriate section titles and structure that fit the specific topic and problem type
- Organize your explanation with logical sections that help students understand the concept, identify errors in the proposed solution, learn the correct approach, and apply these insights
**Your Task:**
Analyze the given problem and proposed solution. Create a comprehensive educational explanation that:
1. Includes the complete problem statement clearly within your explanation
2. Introduces the key concepts and principles relevant to this problem type
3. Examines the proposed solution, identifying where and why it goes wrong
4. Provides the correct solution approach that leads to the given target answer
5. Explains the underlying principles that make the correct method work
6. Discusses broader applications and common misconceptions
7. Concludes with actionable takeaways for students
**IMPORTANT:** Your textbook explanation must be completely self-contained. Include the full problem statement within your response so readers have all necessary information without needing external context.
Structure your response with appropriate headings and sections that naturally fit the subject matter and problem type.
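The two measurable requirements in this template (the 3000-word cap and the `\boxed{…}` final answer) lend themselves to an automatic post-generation check. The sketch below is illustrative only: `check_generated` is a hypothetical helper, words are counted by whitespace splitting, and requiring the target to appear among the boxed answers is an assumption rather than a documented filter.

```python
import re

MAX_WORDS = 3000  # limit stated in the template


def boxed_answers(text: str) -> list[str]:
    """Return the contents of every \\boxed{...} span (no nested braces)."""
    return re.findall(r"\\boxed\{([^{}]*)\}", text)


def check_generated(text: str, target: str) -> list[str]:
    """Return a list of problems; an empty list means the text passes."""
    problems = []

    n_words = len(text.split())  # whitespace-delimited word count
    if n_words > MAX_WORDS:
        problems.append(f"too long: {n_words} words")

    boxed = [b.strip() for b in boxed_answers(text)]
    if not boxed:
        problems.append("no \\boxed{...} answer")
    elif target.strip() not in boxed:
        # Assumption: the correct answer should be one of the boxed values.
        problems.append("target not among boxed answers")
    return problems
```

The same check applies unchanged to the other failure-analysis styles, since they share the word limit and the boxed-answer rule.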
Web Articles Template (for Failure Analysis):
You are a content creator specializing in engaging, informative web articles that break down complex problems and solutions.
Given:
• The problem: {{prompt}}
• A proposed solution: {{full_response}}
• The correct answer: {{target}}
**Content Requirements:**
- Generate engaging web content of MAXIMUM 3000 words
- Use a conversational yet informative tone suitable for online readers
- Final answers must appear within \boxed{…}
- Create compelling headings and subheadings that work well for web reading
- Include relatable examples and practical insights
- Structure content for easy scanning with shorter paragraphs and clear sections
**Your Task:**
Create an engaging web article that breaks down this problem and solution. Your article should:
1. Start with the complete problem statement presented in an engaging way
2. Hook readers by explaining why this type of problem matters in real life
3. Analyze the proposed solution, showing where it goes wrong in an accessible way
4. Walk through the correct solution step-by-step with clear explanations
5. Explain why the correct approach works using relatable analogies when helpful
6. Share practical tips and common pitfalls readers should watch out for
7. End with actionable takeaways and encourage further learning
**IMPORTANT:** Your article must be completely self-contained and include the full problem statement. Write for a general audience interested in learning. Use engaging language that makes complex concepts accessible.
Structure your response with compelling headings that would work well for web content and encourage readers to keep reading.
Conversational Dialogue Template (for Failure Analysis):
You are creating a natural conversational dialogue between a curious student and a knowledgeable assistant discussing a problem and its solution.
Given:
• The problem: {{prompt}}
• A proposed solution: {{full_response}}
• The correct answer: {{target}}
**Content Requirements:**
- Generate a comprehensive natural conversational dialogue of MAXIMUM 3000 words
- Use "User:" and "Assistant:" to clearly mark each speaker
- Final answers must appear within \boxed{…}
- Make the conversation flow naturally with realistic student questions
- Include follow-up questions and clarifications that feel authentic
- Create an engaging back-and-forth that teaches through dialogue
**Your Task:**
Create a natural conversation where a student asks about this problem and you provide helpful explanations. The dialogue should:
1. Start with the student presenting the complete problem they're working on
2. Include the student asking why this type of problem is important or interesting
3. Have the student share their attempted solution and ask for feedback
4. Show the assistant explaining what went wrong in a supportive way
5. Include the student asking for the correct approach step-by-step
6. Have natural follow-up questions about the reasoning and methods
7. End with the student asking for tips to avoid similar mistakes in the future
**IMPORTANT:** Create a completely self-contained dialogue that includes the full problem statement naturally within the conversation. Make it feel like an authentic tutoring session with realistic questions and responses.
Present the entire response as a natural dialogue using "User:" and "Assistant:" labels.
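Because the dialogue style requires explicit "User:" and "Assistant:" labels, generated samples can be split into turns and sanity-checked. This is a rough sketch with hypothetical helper names; requiring the student to speak first and the speakers to alternate strictly is an assumption about what a well-formed sample looks like.

```python
import re

# A speaker label at the start of a line opens a new turn.
_TURN = re.compile(r"^(User|Assistant):\s*", re.MULTILINE)


def split_turns(dialogue: str) -> list[tuple[str, str]]:
    """Split a dialogue into (speaker, text) pairs."""
    # With one capture group, re.split returns
    # [preamble, speaker1, text1, speaker2, text2, ...].
    parts = _TURN.split(dialogue)
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


def looks_like_dialogue(dialogue: str) -> bool:
    """True if the student opens and the speakers alternate."""
    speakers = [speaker for speaker, _ in split_turns(dialogue)]
    return (
        len(speakers) >= 2
        and speakers[0] == "User"
        and all(a != b for a, b in zip(speakers, speakers[1:]))
    )
```

Multi-line turns are kept together, since only a label at the start of a line begins a new turn.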
Question-Answer Template (for Failure Analysis):
You are an expert tutor providing clear, direct answers to questions about problem-solving and reasoning.
Given:
• The problem: {{prompt}}
• A proposed solution: {{full_response}}
• The correct answer: {{target}}
**Content Requirements:**
- Generate focused Q&A content of MAXIMUM 3000 words
- Use clear, direct language with a helpful tutoring tone
- Final answers must appear within \boxed{…}
- Structure as natural Q&A flow that addresses the key learning points
- Focus on practical understanding and clear explanations
- Prioritize clarity and directness over lengthy explanations
**Your Task:**
Create a focused Q&A response that addresses this problem and solution. Your response should:
1. Present the complete problem clearly as the main question
2. Identify what makes this type of problem important or challenging
3. Analyze what went wrong in the proposed solution with specific examples
4. Provide the correct solution with step-by-step reasoning
5. Explain the key principle or method that ensures the right approach
6. Give practical advice for avoiding similar mistakes
7. Summarize the main takeaway in a clear, memorable way
**IMPORTANT:** Your Q&A must be completely self-contained and include the full problem statement. Write as if directly answering a student's question, focusing on the most important insights they need to understand.
Structure your response in a natural Q&A format that flows logically from question to comprehensive answer.