Domain language model

What is the best option for building a language model for a particular (small) domain? Is it better to fine-tune an already-built small language model on the domain, or to build the SLM from scratch? Is there any data-size criterion for deciding between these? Thanks in advance.


If you ultimately need language skills and foundational math and logic abilities beyond the SLM's specialized knowledge, it's better to build on an existing model. Pre-training a new SLM from scratch is quite expensive, even at the million-parameter scale.
In most cases, fine-tuning with LoRA or QLoRA is preferable.

Even smaller models are becoming quite stable these days.


Background: what “domain language model” typically means (and why it matters)

When people say “build a model for my small domain”, they usually want one (or more) of these:

  1. Domain knowledge (answers should be correct w.r.t. your internal docs)
  2. Domain behavior (output format, tone, workflow steps, taxonomy, JSON schema)
  3. Domain language (jargon, abbreviations, notation, writing style)

The best approach depends on which of these you need most.


Your main choice: adapt an existing SLM vs train from scratch

In a small domain, the default best option is not training from scratch

Training from scratch requires massive token volumes to avoid an undertrained, brittle model.

  • Scaling work (Chinchilla) finds that, for compute-optimal training, parameters and training tokens should scale together; they demonstrate this by training a 70B parameter model on ~1.4T tokens.
  • A commonly used rule-of-thumb derived from this regime is ~20 tokens per parameter (e.g., 70B → 1.4T tokens). (Epoch AI)
  • Also, in practice, “tokens-per-parameter” in notable open models has been trending upward over time (data-heavy training), which pushes requirements even higher. (Epoch AI)

Implication: If your domain data is “small” (even a few million words), a model trained from scratch will almost always underperform a good existing base model that you adapt.


The options that usually win for small domains

Option A — RAG for domain knowledge (recommended starting point)

RAG (Retrieval-Augmented Generation) means: keep a general model, but retrieve relevant passages from your domain documents and feed them into the prompt at answer time.

Why it’s a strong default for “small domain”:

  • It directly addresses the “knowledge changes” and “provenance/citations” problems: you update the document index instead of retraining weights. (arXiv)

Use RAG when: your domain is mostly documents/policies/manuals/KB articles and you want grounded answers.
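
To make this concrete, here is a minimal RAG sketch using sentence-transformers for retrieval; the embedding model name, the example passages, and the prompt template are illustrative assumptions, not a specific recommended stack.

```python
# Minimal RAG sketch: embed domain passages, retrieve the top-k most similar
# ones for a question, and stuff them into the prompt of a general model.
# The embedding model and passages below are examples only.
from sentence_transformers import SentenceTransformer, util

passages = [
    "Policy 12.3: refunds are issued within 14 days of approval.",
    "Warranty claims require the original purchase receipt.",
    # ... your domain documents, chunked into passages
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
passage_embs = embedder.encode(passages, convert_to_tensor=True)

def build_prompt(question: str, top_k: int = 3) -> str:
    """Retrieve the most relevant passages and prepend them to the question."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, passage_embs, top_k=top_k)[0]
    context = "\n".join(passages[h["corpus_id"]] for h in hits)
    return (
        "Answer using only the context below, and cite the passage you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# The resulting prompt is sent to any general-purpose SLM at answer time.
print(build_prompt("How long do refunds take?"))
```

The retrieved context is what grounds the answer: swapping documents in and out of the passage list (or a proper vector index) updates knowledge without retraining any weights.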


Option B — Fine-tune an existing SLM (almost always better than scratch)

Fine-tuning is best when you want behavioral consistency:

  • consistent templates
  • strict JSON
  • correct taxonomy labels
  • specific reasoning workflow (“step 1–2–3”)
  • customer-support style

For small domains, you normally do parameter-efficient fine-tuning (PEFT) rather than full fine-tuning:

  • LoRA: freeze base weights and train small low-rank adapters; reduces trainable parameters and memory. (arXiv)
  • QLoRA: train LoRA adapters while the base model is quantized to 4-bit; makes fine-tuning feasible on limited hardware and was shown to preserve performance well. (arXiv)
  • Hugging Face’s PEFT overview emphasizes why PEFT helps with cost/storage and is often better in low-data settings (and can reduce catastrophic forgetting vs full fine-tuning). (Hugging Face)

Use fine-tuning when: you need the model to behave in a domain-specific way, not just “know facts.”
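
As a rough sketch of what the PEFT setup looks like in practice with Hugging Face transformers + peft (the base model name, target modules, and hyperparameters below are illustrative assumptions you would tune for your own SLM):

```python
# LoRA/QLoRA sketch with Hugging Face transformers + peft.
# Base model, target modules, and hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "Qwen/Qwen2.5-1.5B-Instruct"  # example SLM; swap in your own

# 4-bit quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)

# Small low-rank adapters are the only trainable parameters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```

From here you would run ordinary supervised fine-tuning on your instruction/output pairs (for example with trl’s SFTTrainer); dropping the BitsAndBytesConfig gives plain LoRA instead of QLoRA.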


Option C — Continued pretraining (DAPT/TAPT) for domain language mismatch

If the model struggles with domain text even when retrieved (RAG gives it the right paragraph but it still “doesn’t get it”), you may need domain-adaptive pretraining:

  • “Don’t Stop Pretraining” shows that continued pretraining on in-domain text (DAPT) improves downstream task performance across multiple domains and in both high- and low-resource settings. (ACL Anthology)

Use DAPT when: your domain has unusual language distribution (dense jargon, abbreviations, formulas, log syntax, biomedical/legal/engineering writing).
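
Mechanically, DAPT is just continued causal-LM training on raw, unlabeled domain text. A minimal sketch with transformers, assuming a plain-text corpus file (model name, file path, and hyperparameters are placeholders):

```python
# Continued pretraining (DAPT) sketch: keep training the base model on raw,
# unlabeled in-domain text with the ordinary next-token objective.
# Model name, file path, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "Qwen/Qwen2.5-1.5B"  # example base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding in the collator
model = AutoModelForCausalLM.from_pretrained(base_model)

raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False selects the standard causal-LM (next-token prediction) objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dapt-out",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        learning_rate=1e-5,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```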


Data size criteria you can actually use

1) If you’re thinking about training from scratch

Use a tokens-per-parameter sanity check:

  • A common compute-optimal reference point is ~20 tokens/parameter (e.g., 70B trained on ~1.4T tokens). (Epoch AI)

Examples (rule-of-thumb):

  • 1B params → ~20B tokens
  • 3B params → ~60B tokens
  • 7B params → ~140B tokens

If you are not in the tens of billions of tokens, training from scratch is rarely justified for quality.
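
The sanity check is simple arithmetic; a small helper makes the rule of thumb explicit (the 20 tokens/parameter ratio is the Chinchilla-style figure cited above, not a hard law):

```python
# Tokens-per-parameter sanity check (Chinchilla-style ~20 tokens/param rule of thumb).
def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rough token budget for compute-optimal pretraining from scratch."""
    return n_params * tokens_per_param

for n_params in (1e9, 3e9, 7e9):
    tokens = compute_optimal_tokens(n_params)
    print(f"{n_params / 1e9:.0f}B params -> ~{tokens / 1e9:.0f}B tokens")
# 1B params -> ~20B tokens
# 3B params -> ~60B tokens
# 7B params -> ~140B tokens
```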


2) If you’re thinking about fine-tuning (SFT/LoRA/QLoRA)

For supervised fine-tuning (instruction/input → ideal output), dataset size is measured in examples/pairs.

Practical, widely used heuristics:

  • Unsloth’s dataset guidance: minimum ~100 rows, >1,000 rows preferred for better outcomes. (unsloth.ai)
  • NVIDIA’s practical guidance for parameter-efficient fine-tuning: small-to-medium dataset (100–1,000 prompt-sample pairs). (NVIDIA Blog)

Interpretation:

  • If you only have 50–200 examples, you can still improve formatting and style, but expect brittleness.
  • At 1,000–10,000 good examples, you can usually get consistent behavior across a range of prompts.
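
For reference, an “example” here is usually a prompt/response record, often stored one JSON object per line; field names vary by training framework, so treat the schema below as illustrative:

```python
# Illustrative SFT records (field names vary by framework; this is one common shape).
import json

sft_examples = [
    {
        "instruction": "Classify the ticket into one of: billing, outage, feature_request.",
        "input": "My invoice for March was charged twice.",
        "output": '{"label": "billing"}',
    },
    {
        "instruction": "Summarize the incident report in two sentences.",
        "input": "At 02:14 UTC the primary database failed over to the standby node...",
        "output": "The primary database failed over to the standby node at 02:14 UTC. Service was restored with no data loss.",
    },
]

# One JSON object per line (JSONL) is the most common on-disk format.
with open("sft_train.jsonl", "w", encoding="utf-8") as f:
    for example in sft_examples:
        f.write(json.dumps(example) + "\n")
```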

3) If you’re thinking about continued pretraining (DAPT)

There isn’t a single universal “minimum,” but the key question is:

  • Do you have enough unlabeled in-domain text to noticeably shift the model’s language understanding?

DAPT is supported by evidence as beneficial even under low-resource conditions, but it still needs enough domain text to move the needle. (ACL Anthology)
In small domains, teams often try RAG + SFT first and add DAPT only if a language mismatch persists.


A clear recommendation for a “small domain” (most common scenario)

Best default stack

  1. RAG for domain knowledge (fast wins, easy updates). (arXiv)
  2. LoRA/QLoRA SFT for domain behavior (templates/taxonomy/schema). (arXiv)
  3. DAPT only if needed for domain-language mismatch. (ACL Anthology)
  4. Avoid scratch training unless you truly have foundation-scale data (typically billions–trillions of tokens).

Quick decision checklist

Choose the lowest-cost method that solves your real problem:

  • Need correct answers from internal docs? → RAG first. (arXiv)
  • Need consistent output format / workflow? → SFT with LoRA/QLoRA. (arXiv)
  • Model can’t interpret your jargon even with retrieved passages? → Consider DAPT. (ACL Anthology)
  • Considering scratch? → Check tokens-per-parameter; if you’re not in tens of billions+ tokens, it’s usually the wrong move. (Epoch AI)