Need advice on how to start training

Hi folks,

I need to start getting into training.

Right now, I’m totally lost on how to get started. I’m not specifically looking for services or frameworks or tools. I’m looking mostly for reading material so that I can *understand* all the important stuff and make good choices.

Questions that pop into my mind:

  • when should I use LoRA vs other techniques?

  • how to find the right balance between doing a lot of fine-tuning on a smaller model vs a lighter fine-tune on a bigger model?

  • how to know if I should quantize my fine-tuned model or if I should use full precision?

  • I will be dealing with long context (maybe 30–60k), with long documents and data extraction (including PDFs), and with a dynamic system prompt (changing variables and instructions). How can I best learn how to prepare a dataset for this scenario? ← this is my most pressing question

  • what are my unknown unknowns regarding all of this?

My real question is: where should I go to learn about this stuff?

thanks a lot!!

Well, you’re trying to take in too much knowledge all at once, so I think it’ll be a bit hard to learn unless you prioritize what to learn first…

Roughly speaking, this is how much detail is involved.


You are asking for the “theory of the terrain” so your choices stop feeling random. The right way to learn this is a layered reading path: first the mechanics, then adaptation methods, then long-context behavior, then document-to-text and dataset construction, then quantization and evaluation.


1) The basic background you need (in plain terms)

Pretraining vs fine-tuning

  • Pretraining teaches broad language ability by predicting next tokens at massive scale.
  • Fine-tuning mostly teaches behavior and task format. It teaches “do it this way” more than “learn new facts”.

That is why your dataset matters so much. The model will copy patterns.

Context length is not the same as context usage

A model can “accept” 60k tokens but still fail to use the right evidence. A classic result is that performance often drops when the relevant info is in the middle of the prompt rather than the beginning or end. (arXiv)

So long-context work is two problems:

  1. Can the model technically ingest 30–60k tokens?
  2. Can it reliably retrieve and apply the correct parts?

Quantization is about memory and speed, not “training quality”

Quantization compresses numbers (weights or KV cache) so inference is cheaper. It can be almost free or it can break edge cases. You only know by evaluation. GPTQ and AWQ are two well-cited approaches for weight-only quantization. (arXiv)
For long context, KV cache memory often dominates, so KV cache quantization becomes a real lever. (vLLM)
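
As a concrete illustration of the two knobs, here is a minimal sketch of serving a weight-quantized checkpoint with a quantized KV cache in vLLM. The model name is a placeholder, and the exact option names should be checked against your vLLM version.

```python
# Minimal sketch, assuming a recent vLLM; verify option names against the vLLM docs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-finetune-awq",  # placeholder: a weight-quantized (e.g. AWQ) checkpoint
    kv_cache_dtype="fp8",                # knob 2: quantize the KV cache; matters at 30-60k tokens
    max_model_len=61440,                 # make the long-context budget explicit
)

outputs = llm.generate(
    ["Extract the invoice total from the document below:\n..."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```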


2) The reading path that builds real understanding

I am listing fewer items on purpose. These are “spines” you can build around.

A. How transformers work (so everything else has hooks)

  • Scaling Laws for Neural Language Models (Kaplan et al.). Builds intuition for why bigger models help and how compute and data trade off. (arXiv)
  • Training Compute-Optimal LLMs (Chinchilla). Explains why many models are undertrained and why data and model size should scale together. (arXiv)

What to extract from these:

  • Bigger models can be more sample efficient.
  • Data quality and quantity matter as much as parameters, sometimes more.

B. Why fine-tuning works and what it changes

  • InstructGPT paper. Shows a smaller model can beat a much larger one after alignment-style fine-tuning. This directly informs your “small + more tuning vs big + less tuning” question. (arXiv)

What to extract:

  • Fine-tuning can change user-facing usefulness a lot.
  • Model size is not the only determinant of “quality”.

C. Parameter-efficient fine-tuning (LoRA and friends)

  • LoRA paper. Core idea: freeze base weights, learn low-rank updates. Great default for “teach behavior” without full retraining. (arXiv)
  • QLoRA paper. Core idea: keep base weights quantized (4-bit) during training, train LoRA adapters, save memory. (arXiv)
  • Adapters (Houlsby et al.). Another PEFT family. Good for understanding the design space LoRA sits in. (arXiv)
  • Prefix-tuning. “Virtual tokens” as trainable continuous prompts. Helps you reason about “prompt-like” alternatives to LoRA. (arXiv)
  • IA3 overview (PEFT conceptual guide). Shows that “PEFT” is broader than LoRA and why some methods can be cheaper. (Hugging Face)

What to extract:

  • PEFT methods mostly change behavior cheaply.
  • Full fine-tuning is heavier but sometimes needed if PEFT saturates.
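
To make the LoRA idea from the readings above concrete, here is a minimal sketch with Hugging Face PEFT: the base weights stay frozen and only small low-rank adapters train. The model name is a placeholder, and the target module names are an assumption that varies by architecture.

```python
# Minimal LoRA sketch with Hugging Face PEFT; hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-org/your-base-model")  # placeholder

lora_cfg = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumption: attention projection names differ per model
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)     # base weights frozen, adapters trainable
model.print_trainable_parameters()         # typically well under 1% of the base parameters
```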

D. Long-context behavior and how to measure it

  • Lost in the Middle. The key “don’t trust the window” paper. (arXiv)
  • LongBench. A benchmark suite that forces multi-task long-context evaluation. (arXiv)
  • RULER. Important nuance: “needle-in-a-haystack” tests can be too shallow, so you want multiple task types. (arXiv)

What to extract:

  • You need long-context evaluation protocols, not vibes.
  • Position effects are real and measurable.

E. If your base model does not truly support 30–60k

These are “context extension” readings. Separate project.

  • Position Interpolation (PI). Minimal fine-tuning approach for extending RoPE models. (arXiv)
  • YaRN. Another RoPE extension method focused on efficiency. (arXiv)

F. Document-to-text reality (PDFs are not text)

Your extraction performance is capped by your PDF conversion consistency.

  • Docling. Good overview of “PDF understanding” and conversion outputs you can standardize. (GitHub)
  • Unstructured partitioning docs. Shows common element-level decomposition and why strategy matters for PDFs. (docs.unstructured.io)
  • Optional academic angle: DocVQA. Useful if your PDFs are scanned or layout-heavy and you need to think in “document understanding” terms. (arXiv)

G. Structured outputs and schema fidelity

For extraction, “almost JSON” is failure.

  • JSONSchemaBench. A modern benchmark for constrained decoding and schema compliance. (arXiv)

H. Quantization fundamentals you actually need

  • GPTQ paper. Classic post-training weight quantization. (arXiv)
  • AWQ paper. Activation-aware weight-only quantization, widely referenced. (arXiv)
  • vLLM KV cache quantization docs. Long-context makes KV cache a first-class concern. (vLLM)

I. One “glue” concept worth learning early: RAG

Even if you plan to fine-tune, retrieval is often the cleanest way to handle long documents.

  • RAG paper (Lewis et al.). Grounding concept and the parametric vs non-parametric memory framing. (arXiv)

3) Your questions, turned into simple decision rules

When should you use LoRA vs other techniques?

Use LoRA when:

  • You want better instruction following, formatting, extraction behavior.
  • You have limited compute.
  • You want multiple task variants without storing full model copies. (That is a core LoRA motivation.) (arXiv)

Consider full fine-tuning when:

  • You have strong evidence LoRA plateaus on your eval set.
  • The task requires deeper capability shifts, not just behavior.

Consider prefix / prompt tuning when:

  • You want very small trainable parameter count.
  • You can accept that some tasks respond better than others. (arXiv)

Small model with lots of fine-tuning vs big model with lighter tuning

Two background facts help.

  • Scaling laws give you a baseline expectation: bigger often wins, especially when data is limited. (arXiv)
  • Fine-tuning can flip the ranking in user preference, as InstructGPT shows. (arXiv)

Practical rule:

  • If your PDFs are messy and your extraction rules are complex, start from a stronger base model.
  • Use LoRA or similar to enforce the “contract” and output structure.

Quantize or stay full precision?

Treat it as two separate knobs.

  • Weight quantization: GPTQ and AWQ are common references. (arXiv)
  • KV cache quantization: matters more as context grows. (vLLM)

Practical rule:

  • If your task is strict schema extraction, quantization often works well, but you must validate.
  • If your task depends on subtle cross-document reasoning inside 60k tokens, be more cautious and measure carefully.

4) The most pressing topic: how to prepare a dataset for long-context PDF extraction with dynamic system prompts

Think of each training example as a “mini production run”. The closer to production, the more it transfers.

Step 1. Define a stable output contract

Write down, explicitly:

  • The JSON schema or field list.
  • Normalization rules (dates, currency, units).
  • Missing data policy (null vs empty string).
  • Conflict policy (which source wins, or flag ambiguity).

This becomes your invariant “law”.
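
As a minimal sketch, the contract can literally be a JSON Schema you keep under version control. The field names below are hypothetical; the point is that null handling, formats, and required fields are decided once, here, not per prompt.

```python
# Illustrative output contract (all field names are hypothetical).
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id":   {"type": "string"},
        "issue_date":   {"type": ["string", "null"], "description": "ISO 8601, YYYY-MM-DD"},
        "currency":     {"type": ["string", "null"], "description": "ISO 4217 code"},
        "total_amount": {"type": ["number", "null"]},
        "ambiguous":    {"type": "boolean", "description": "true when sources conflict"},
    },
    "required": ["invoice_id", "issue_date", "currency", "total_amount", "ambiguous"],
    "additionalProperties": False,
}
```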

Step 2. Standardize document conversion and preserve structure

PDF ingestion is where many projects fail.

  • Pick a consistent extraction pipeline and stick to it.
  • Preserve layout cues in the text representation: page breaks, headings, table blocks, footnotes.
    Docling and Unstructured are worth reading here because they describe element-level decomposition and PDF strategies. (GitHub)

If you train on clean text but serve OCR-noisy text, you will get brittle behavior.

Step 3. Represent the runtime prompt shape exactly

Your prompt has “dynamic system prompt variables”. That is fine. The key is to make the model learn:

  • Variables change.
  • Contract and schema do not.

So in training:

  • Randomize the variables across examples.
  • Keep the invariant rules identical across examples.
  • Avoid leaving placeholders like {X} in the final rendered training text unless you want the model to output placeholders.
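
A minimal sketch of rendering one training example under these rules, assuming a chat-style `messages` format. The variable names and values are made up; the important part is that the variables are randomized while the contract string never changes, and the final text contains no unfilled placeholders.

```python
import json
import random

CONTRACT = "Return exactly one JSON object matching the schema. Use null for missing fields."

def render_example(document_text: str, target: dict) -> dict:
    # Dynamic variables: randomized per example so the model learns that they vary.
    variables = {
        "client_name": random.choice(["Acme Corp", "Globex", "Initech"]),  # hypothetical values
        "language": random.choice(["en", "de", "fr"]),
    }
    system_prompt = (
        f"You extract data for {variables['client_name']} (output language: {variables['language']}).\n"
        f"{CONTRACT}"  # invariant rules: identical string in every example
    )
    return {
        "messages": [
            {"role": "system", "content": system_prompt},          # fully rendered, no {X} left behind
            {"role": "user", "content": document_text},
            {"role": "assistant", "content": json.dumps(target)},  # strict JSON target from day one
        ]
    }
```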

Step 4. Train for position robustness, not just length

Long context failure is often position failure. “Lost in the Middle” is the canonical warning. (arXiv)

So you want training data where the relevant facts appear:

  • Early.
  • Middle.
  • Late.
  • Across page boundaries.
  • Inside and outside tables.

You can do this by augmentation:

  • Shuffle irrelevant sections.
  • Insert distractor sections.
  • Move the “needle” field location.

Then measure it with LongBench-style evaluation and RULER-style stress tests. (arXiv)
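
A minimal sketch of that augmentation, assuming the converted document is already split into section strings and you know which section carries the answer. All names here are illustrative, not a fixed API.

```python
import random

def augment_positions(sections: list[str], needle_idx: int, distractors: list[str]) -> list[str]:
    """Move the evidence-bearing section around and pad with distractor sections."""
    needle = sections[needle_idx]
    filler = [s for i, s in enumerate(sections) if i != needle_idx] + distractors
    random.shuffle(filler)                                   # shuffle irrelevant sections

    # Place the needle early, in the middle, or late with equal probability,
    # so position sensitivity can be measured explicitly later.
    insert_at = random.choice([0, len(filler) // 2, len(filler)])
    filler.insert(insert_at, needle)
    return filler
```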

Step 5. Use strict schema targets from day one

If you want strict JSON, always train strict JSON.
Then evaluate schema validity. JSONSchemaBench is directly about this problem and is useful even if you never use constrained decoding. (arXiv)
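
A minimal sketch of that check with the `jsonschema` package, reusing the illustrative contract from Step 1. Anything that fails to parse or validate is counted as a hard failure.

```python
import json
from jsonschema import Draft202012Validator  # pip install jsonschema

validator = Draft202012Validator(INVOICE_SCHEMA)  # the illustrative contract from Step 1

def is_valid_output(model_output: str) -> bool:
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False                               # "almost JSON" counts as failure
    return not any(validator.iter_errors(parsed))  # schema violations also count as failure
```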

Step 6. Build an evaluation set early and keep it sacred

You need:

  • Field-level accuracy.
  • Normalized accuracy.
  • “Missing vs wrong” metrics.
  • Position-sensitivity metrics.

Do not tune on the test set. Treat it like a product benchmark.
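
A minimal sketch of the field-level metrics, assuming predictions and gold labels are dicts that follow the contract. The normalization function is a stub to fill in with your own rules.

```python
def normalize(field: str, value):
    # Stub: apply your own date / currency / unit normalization rules per field.
    return value if value is None else str(value).strip().lower()

def field_metrics(pred: dict, gold: dict, fields: list[str]) -> dict:
    correct = missing = wrong = 0
    for f in fields:
        p, g = normalize(f, pred.get(f)), normalize(f, gold.get(f))
        if p == g:
            correct += 1
        elif p is None:      # the model said "not present" but the field was there
            missing += 1
        else:                # the model produced a value, just the wrong one
            wrong += 1
    n = len(fields)
    return {"field_accuracy": correct / n, "missing_rate": missing / n, "wrong_rate": wrong / n}
```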


5) The common “unknown unknowns” that derail learning

These are the recurring traps.

  1. Chat formatting mismatch. Training prompts differ from inference prompts. This silently ruins fine-tunes (see the sketch after this list). Hugging Face’s chat templates docs explain the mechanism. (Hugging Face)
  2. Assuming context window equals usable context. Lost-in-the-middle effects are real even for long-context models. (arXiv)
  3. Over-trusting needle tests. Simple retrieval tests can look good while real tasks fail. RULER is explicitly about this gap. (arXiv)
  4. Ignoring KV cache. Long context can be bottlenecked by KV cache memory, not just model weights. (vLLM)
  5. Document conversion drift. Small OCR or layout differences can dominate model behavior. The doc parsing layer is part of the model in practice. (docs.unstructured.io)
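
To see trap 1 concretely, here is a minimal sketch with Hugging Face’s `apply_chat_template`. If your training pipeline rendered prompts differently from the string this produces at inference time, the model is being tested on a format it never saw. The model name is a placeholder.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-org/your-base-model")  # placeholder

messages = [
    {"role": "system", "content": "You extract fields as strict JSON."},
    {"role": "user", "content": "Document text goes here..."},
]

# This is the exact string the model sees at inference time. If the fine-tuning
# data used different special tokens, dropped the system turn, or added extra
# whitespace, the fine-tune silently degrades.
rendered = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(rendered)
```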

Summary bullets

  • Read in layers: scaling and capability, then fine-tuning, then PEFT, then long-context measurement, then document parsing, then quantization. (arXiv)
  • Long context is a measurement problem first. Use Lost in the Middle, LongBench, and RULER to build intuition and tests. (arXiv)
  • For your dataset, match production exactly: same PDF-to-text pipeline, same prompt shape, randomized dynamic variables, strict schema outputs. (docs.unstructured.io)
  • Quantization is two knobs: weights and KV cache. Long context makes KV cache critical. (vLLM)

wow, this IS a lot to digest :slight_smile: thank you so much!

small side question, regarding pdfs, you mentioned extraction and OCR. I was heading towards a vision-capable model and converting the pages to images. Do you see any downsides to that?


PDFs are diverse… If it’s a structured document that maps well to Markdown, a VLM is fine. Alternatively, OCR → LLM/embeddings, or a hybrid with a lightweight PDF-specific OCR, would work. But if it’s just a PDF scanned from some random paper document… well, it becomes a difficult process.

However, if you can ignore such messy documents, using a VLM alone offers good maintainability.


A vision-capable model on page images is a valid direction. It also has real downsides that show up fast in production. The core tradeoff is simple:

  • PDF text extraction uses the PDF’s native text layer (when it exists). It is fast and can be exact, but reading order and layout can be messy. (PyMuPDF Documentation)
  • OCR or VLM-on-images re-derives text from pixels. It handles scans and tricky visuals, but it is costlier, more DPI-sensitive, and more likely to introduce subtle transcription and structure errors.

Below is the “background model” of PDFs, then the concrete downsides of image-first, then a practical hybrid strategy that usually beats both extremes.


Background: PDFs are not one thing

A PDF page can be:

1) Born-digital (real text exists)

The page contains a text layer (glyphs + coordinates). Extractors can often give you word-level bounding boxes. PyMuPDF even exposes words with coordinates and provides a sort option specifically because “natural reading order” is a known issue. (PyMuPDF Documentation)
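
A minimal sketch of pulling that text layer with PyMuPDF; `sort=True` requests an approximate reading order, which is exactly the caveat the docs flag. The file path is a placeholder.

```python
import fitz  # PyMuPDF

doc = fitz.open("example.pdf")   # placeholder path
page = doc[0]

# Plain text in (approximate) reading order.
text = page.get_text(sort=True)

# Word-level boxes: tuples of (x0, y0, x1, y1, word, block_no, line_no, word_no).
words = page.get_text("words")
print(words[:5])
```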

2) Scanned (no text exists)

The page is an image. There is no text layer. You must OCR or use a vision model.

3) Mixed

Many real PDFs are mixed. Some pages have text. Some are scans. OCRmyPDF documents --skip-text as a way to avoid OCR on pages that already have text, explicitly for mixed “born-digital + scanned” documents. (OCRmyPDF)

If you convert every page to an image, you treat all three types as scans. That is the main reason the downsides exist.


What you gain with “pages → images → vision model”

It is worth stating the upside clearly, because it explains why people do this:

  1. Uniformity
    One pipeline. Everything is “an image”. Mixed PDFs stop being special cases.

  2. Robustness to visual-only content
    Stamps, signatures, handwriting, checkboxes, weird fonts, rotated scans. These can break text extraction and classic OCR.

  3. End-to-end extraction is possible
    OCR-free document models like Donut exist precisely to avoid OCR error propagation and to handle document understanding directly from pixels. (arXiv)

If your corpus is mostly scanned, or you need to read visual marks, image-first is often the correct default.

Now the downsides.


Downsides of “image-first” (the practical ones)

Downside 1: You throw away perfect text when it exists

For born-digital pages, the PDF already contains:

  • exact characters
  • precise coordinates
  • consistent rendering independent of DPI

When you rasterize, you discard that and ask a model to re-infer it from pixels.

This matters most for:

  • small fonts
  • superscripts/subscripts
  • long tables with tight spacing
  • “similar-looking” characters (O vs 0, l vs 1, S vs 5)

If you care about exact invoice IDs, account numbers, totals, dates, and line items, discarding a clean text layer is often a quality regression.

Downside 2: DPI becomes a permanent accuracy vs cost knob

Once you rasterize, resolution controls everything.

  • Low DPI is cheaper but loses tiny text and thin separators.
  • High DPI increases compute, latency, and memory. It also increases image tokenization/tiling costs in many VLM stacks.

This is not a one-time choice. It becomes an ongoing operational problem: different documents require different DPI, but variable DPI complicates consistency and debugging.
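
A minimal sketch of that knob with PyMuPDF rasterization; the DPI values are illustrative, not recommendations.

```python
import fitz  # PyMuPDF

page = fitz.open("example.pdf")[0]   # placeholder path

# The same page at two resolutions: pixel count (and downstream vision cost)
# grows roughly with the square of the DPI.
low = page.get_pixmap(dpi=100)
high = page.get_pixmap(dpi=300)
print(low.width * low.height, high.width * high.height)
```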

Downside 3: Tables are the hardest case for “read the whole page”

Tables fail in two separate ways:

  1. Text recognition (what each cell says)
  2. Structure recognition (which text belongs to which row/column, and merged cells)

A lot of “OCR worked but extraction failed” is actually structure failure.

That is why Table Transformer exists as a dedicated model for table detection and table structure recognition, with metrics like GriTS for evaluating structure quality. (GitHub)
It is also why modern OCR stacks emphasize layout + structure pipelines, not just “turn pixels into text.” PaddleOCR 3.0’s PP-StructureV3 is explicitly a multi-module pipeline including layout analysis and postprocessing to output structured JSON/Markdown. (arXiv)

A general VLM can “describe a table,” but reliable cell-level extraction at scale usually needs structure-first logic.

Downside 4: Determinism and “verbatimness” are harder

Extraction wants:

  • verbatim transcription
  • stable omissions (ideally none)
  • stable formatting rules
  • stable numeric parsing

Generative vision models can be “helpful” in ways you do not want:

  • normalizing whitespace
  • correcting “obvious” typos
  • guessing a faint digit
  • silently dropping repeated headers/footers

This is not rare. It is a natural outcome of “generate the most likely text” when pixels are ambiguous.

If your downstream system is strict (JSON fields, validations), you end up building a lot of guardrails anyway.

Downside 5: Provenance and traceability become more work

In extraction workflows, you often want:

  • where did this value come from (page, region)
  • confidence
  • a highlighted crop for review

PDF text extraction can give you coordinates natively (words with bounding boxes). (PyMuPDF Documentation)
OCR pipelines also provide coordinates and confidence by design.

A VLM can be prompted to return boxes, but it is not inherently a “coordinate-native” system unless you choose a model/approach built for it. If you do not enforce provenance early, audits and debugging get painful.

Downside 6: Long documents compound cost and error

With 30–60k context, you are already fighting “use the right evidence” problems. “Lost in the Middle” shows performance can degrade significantly when relevant information sits in the middle of long contexts. (arXiv)

Now add vision:

  • more pages means more encodes
  • more encodes means more latency and variance
  • more variance means more edge-case failures

So “vision + long docs” often pushes you toward routing and selective processing rather than “full-page everything.”

Downside 7: You lose a very useful trick for mixed PDFs

OCRmyPDF’s --skip-text exists because mixed documents are common and you want to preserve existing text while OCR’ing only what needs it. (OCRmyPDF)

If you rasterize everything, you cannot “skip text.” You always pay the full price.


A better default architecture: hybrid routing + selective vision

This is the pattern that tends to win on accuracy, cost, and debuggability.

Step 1: Page-level routing

For each page:

  • If a usable text layer exists, extract it (plus word boxes).
  • If not, render and OCR (or render and vision).

This directly matches how OCRmyPDF treats mixed documents with --skip-text. (OCRmyPDF)
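
A minimal sketch of that routing with PyMuPDF: pages with a usable text layer are extracted directly, the rest are rasterized for OCR or a VLM. The character threshold is an arbitrary illustration; real routing logic needs tuning on your corpus.

```python
import fitz  # PyMuPDF

def route_pages(pdf_path: str, min_chars: int = 50):
    """Yield (page_number, kind, payload): native text when it exists, a rendered image otherwise."""
    doc = fitz.open(pdf_path)
    for i, page in enumerate(doc):
        text = page.get_text(sort=True)
        if len(text.strip()) >= min_chars:           # crude "has a text layer" check
            yield i, "text", text
        else:                                        # likely a scan: hand off to OCR or a VLM
            yield i, "image", page.get_pixmap(dpi=200)
```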

Step 2: Layout detection before heavy extraction

Run a lightweight layout detector to find:

  • tables
  • key-value blocks
  • headers/footers
  • figures

DocLayout-YOLO is an example of a modern “real-time layout detection” approach built for diverse documents. (GitHub)

Step 3: Table structure recognition for tables

Use a table structure model (Table Transformer or similar) to recover the grid and cell grouping. (GitHub)
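
A minimal sketch of table structure recognition with the Table Transformer checkpoints on the Hugging Face Hub; the checkpoint name and threshold are assumptions to verify against the model card.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

ckpt = "microsoft/table-transformer-structure-recognition"   # assumption: verify on the Hub
processor = AutoImageProcessor.from_pretrained(ckpt)
model = TableTransformerForObjectDetection.from_pretrained(ckpt)

image = Image.open("table_crop.png").convert("RGB")          # placeholder: a cropped table region
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Turn raw detections into labeled boxes (rows, columns, spanning cells, ...).
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.7, target_sizes=target_sizes)[0]
print(results["labels"], results["boxes"].shape)
```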

Step 4: Escalate to a VLM only on hard regions

Use the VLM for:

  • low-confidence OCR regions
  • stamps/signatures/handwritten notes
  • cells with unusual formatting
  • visually complex totals blocks

This keeps your VLM spend bounded and makes failures reviewable.


When “vision-first for every page” is actually the right choice

It can be the right default if most of these are true:

  • Your PDFs are mostly scans or photos.
  • Layout is inconsistent and the “text layer” is often missing or corrupted.
  • You need to read non-text marks (checkboxes, stamps, handwriting).
  • You care more about semantic extraction than verbatim transcription.
  • You can afford higher per-page compute and you have strong validation.

If you do this, treat it like an engineering system, not a prompt:

  • enforce schema and validations
  • require provenance fields when possible (page index, crop ID)
  • build a review loop for low-confidence outputs

Reading path focused on your decision

These are the sources that give the right background for “text vs OCR vs vision”:

  1. PyMuPDF text extraction and reading order
    Explains that reading order is non-trivial and shows sort options and word-level boxes. (PyMuPDF Documentation)

  2. OCRmyPDF mixed-document behavior (--skip-text)
    Concrete evidence that hybrid handling is a first-class need in practice. (OCRmyPDF)

  3. Donut (OCR-free document understanding)
    Clear articulation of why OCR pipelines can be expensive and how OCR errors propagate, plus a vision-first alternative. (arXiv)

  4. PaddleOCR 3.0 PP-StructureV3 pipeline
    Shows what “production document parsing” looks like: preprocessing, OCR, layout analysis, postprocessing, structured outputs. (arXiv)

  5. Table Transformer
    Shows that tables are a distinct subproblem with dedicated models and structure metrics. (GitHub)

  6. DocLayout-YOLO
    Modern layout detection framing and why zoning is a big lever for speed and accuracy. (GitHub)


Summary

  • Image-first is simplest and handles scans and visual marks well, but it discards clean PDF text when it exists and makes DPI a permanent cost-quality tradeoff. (PyMuPDF Documentation)
  • Tables are the biggest pain point. Table structure recognition exists because “read the page” is not enough. (GitHub)
  • The most robust default is hybrid routing: extract native text when available, OCR or vision when not, then use layout detection and escalate to VLM only for hard regions. (OCRmyPDF)