PDFs are diverse. If a document is structured and leans toward Markdown-like content, a VLM is fine; alternatively, OCR → LLM/embeddings, or a hybrid built around a lightweight PDF-specific OCR, also works. But if it is just a scan of some random paper document, things get difficult.
That said, if you can afford to ignore such messy documents, using a VLM alone offers good maintainability.
Running a vision-capable model on page images is a valid direction, but it has real downsides that show up fast in production. The core tradeoff is simple:
- PDF text extraction uses the PDF’s native text layer (when it exists). It is fast and can be exact, but reading order and layout can be messy. (PyMuPDF Documentation)
- OCR or VLM-on-images re-derives text from pixels. It handles scans and tricky visuals, but it is costlier, more DPI-sensitive, and more likely to introduce subtle transcription and structure errors.
Below is the “background model” of PDFs, then the concrete downsides of image-first, then a practical hybrid strategy that usually beats both extremes.
Background: PDFs are not one thing
A PDF page can be:
1) Born-digital (real text exists)
The page contains a text layer (glyphs + coordinates). Extractors can often give you word-level bounding boxes. PyMuPDF even exposes words with coordinates and provides a sort option specifically because “natural reading order” is a known issue. (PyMuPDF Documentation)
2) Scanned (no text exists)
The page is an image. There is no text layer. You must OCR or use a vision model.
3) Mixed
Many real PDFs are mixed. Some pages have text. Some are scans. OCRmyPDF documents --skip-text as a way to avoid OCR on pages that already have text, explicitly for mixed “born-digital + scanned” documents. (OCRmyPDF)
If you convert every page to an image, you treat all three types as scans. That is the main reason the downsides exist.
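To make the distinction concrete, here is a minimal page-classification sketch with PyMuPDF (recent versions import as `pymupdf`, older ones as `fitz`; the character threshold is an arbitrary heuristic, not anything the library defines):

```python
import pymupdf  # PyMuPDF 1.24+; older releases use `import fitz` instead

def classify_page(page, min_chars: int = 100) -> str:
    """Rough heuristic: does this page carry a usable text layer?"""
    text = page.get_text("text")           # native text layer; empty for pure scans
    has_images = bool(page.get_images())   # scans are usually one full-page image

    if len(text.strip()) >= min_chars:
        return "born-digital"
    if has_images:
        return "scanned"
    return "empty-or-unknown"

doc = pymupdf.open("mixed.pdf")
for i, page in enumerate(doc):
    print(i, classify_page(page))
```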
What you gain with “pages → images → vision model”
It is worth stating the upside clearly, because it explains why people do this:
- Uniformity: One pipeline. Everything is “an image”. Mixed PDFs stop being special cases.
- Robustness to visual-only content: Stamps, signatures, handwriting, checkboxes, weird fonts, rotated scans. These can break text extraction and classic OCR.
- End-to-end extraction is possible: OCR-free document models like Donut exist precisely to avoid OCR error propagation and to handle document understanding directly from pixels. (arXiv)
If your corpus is mostly scanned, or you need to read visual marks, image-first is often the correct default.
Now the downsides.
Downsides of “image-first” (the practical ones)
Downside 1: You throw away perfect text when it exists
For born-digital pages, the PDF already contains:
- exact characters
- precise coordinates
- consistent rendering independent of DPI
When you rasterize, you discard that and ask a model to re-infer it from pixels.
This matters most for:
- small fonts
- superscripts/subscripts
- long tables with tight spacing
- “similar-looking” characters (O vs 0, l vs 1, S vs 5)
If you care about exact invoice IDs, account numbers, totals, dates, and line items, discarding a clean text layer is often a quality regression.
Downside 2: DPI becomes a permanent accuracy vs cost knob
Once you rasterize, resolution controls everything.
- Low DPI is cheaper but loses tiny text and thin separators.
- High DPI increases compute, latency, and memory. It also increases image tokenization/tiling costs in many VLM stacks.
This is not a one-time choice. It becomes an ongoing operational problem: different documents require different DPI, but variable DPI complicates consistency and debugging.
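To see how quickly the cost side of the knob moves, here is a small sketch that renders the same page at a few DPI values (assuming a PyMuPDF version whose get_pixmap accepts a dpi keyword; the DPI values are arbitrary examples):

```python
import pymupdf

doc = pymupdf.open("report.pdf")
page = doc[0]

for dpi in (96, 150, 300):
    # PDF user space is 72 points per inch, so pixel count grows with (dpi / 72) ** 2.
    pix = page.get_pixmap(dpi=dpi)
    megapixels = pix.width * pix.height / 1e6
    print(f"dpi={dpi}: {pix.width}x{pix.height} ({megapixels:.1f} MP)")
```

Doubling DPI quadruples the pixel count, and every downstream encode pays for it.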
Downside 3: Tables are the hardest case for “read the whole page”
Tables fail in two separate ways:
- Text recognition (what each cell says)
- Structure recognition (which text belongs to which row/column, and merged cells)
A lot of “OCR worked but extraction failed” is actually structure failure.
That is why Table Transformer exists as a dedicated model for table detection and table structure recognition, with metrics like GriTS for evaluating structure quality. (GitHub)
It is also why modern OCR stacks emphasize layout + structure pipelines, not just “turn pixels into text.” PaddleOCR 3.0’s PP-StructureV3 is explicitly a multi-module pipeline including layout analysis and postprocessing to output structured JSON/Markdown. (arXiv)
A general VLM can “describe a table,” but reliable cell-level extraction at scale usually needs structure-first logic.
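To make “structure-first” concrete, here is a minimal sketch of the grid-reconstruction step: given row and column boxes from a structure model (Table Transformer or similar) and OCR words with boxes, intersect rows with columns to get cells, then drop each word into its cell. The Box type and helper names are illustrative, not a library API:

```python
from dataclasses import dataclass

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

def intersect(a: Box, b: Box) -> Box | None:
    x0, y0 = max(a.x0, b.x0), max(a.y0, b.y0)
    x1, y1 = min(a.x1, b.x1), min(a.y1, b.y1)
    return Box(x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None

def build_grid(rows: list[Box], cols: list[Box]) -> list[list[Box | None]]:
    """Cell (i, j) is the geometric intersection of row i and column j."""
    return [[intersect(r, c) for c in cols] for r in rows]

def assign_words(grid, words: list[tuple[Box, str]]) -> list[list[str]]:
    """Drop each OCR word into the cell whose box contains its centre."""
    table = [["" for _ in row] for row in grid]
    for wbox, text in words:
        cx, cy = (wbox.x0 + wbox.x1) / 2, (wbox.y0 + wbox.y1) / 2
        for i, row in enumerate(grid):
            for j, cell in enumerate(row):
                if cell and cell.x0 <= cx <= cell.x1 and cell.y0 <= cy <= cell.y1:
                    table[i][j] = (table[i][j] + " " + text).strip()
    return table
```

Merged cells, multi-line headers, and spanning columns all complicate this, which is exactly why dedicated structure models and metrics like GriTS exist.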
Downside 4: Determinism and “verbatimness” are harder
Extraction wants:
- verbatim transcription
- stable omissions (ideally none)
- stable formatting rules
- stable numeric parsing
Generative vision models can be “helpful” in ways you do not want:
- normalizing whitespace
- correcting “obvious” typos
- guessing a faint digit
- silently dropping repeated headers/footers
This is not rare. It is a natural outcome of “generate the most likely text” when pixels are ambiguous.
If your downstream system is strict (JSON fields, validations), you end up building a lot of guardrails anyway.
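Those guardrails tend to look the same regardless of which extractor produced the values; a minimal sketch, with a made-up invoice_id pattern and illustrative field names:

```python
import re
from decimal import Decimal, InvalidOperation

def validate_invoice(fields: dict) -> list[str]:
    """Return a list of problems; an empty list means the record may pass through."""
    problems = []

    # Verbatim-format check: catches "helpful" normalization or a guessed digit.
    if not re.fullmatch(r"INV-\d{6}", fields.get("invoice_id", "")):
        problems.append("invoice_id does not match the expected pattern")

    # Numeric cross-check: line items must add up to the stated total.
    try:
        total = Decimal(fields["total"])
        line_sum = sum(Decimal(x) for x in fields["line_item_amounts"])
        if total != line_sum:
            problems.append(f"total {total} != sum of line items {line_sum}")
    except (KeyError, InvalidOperation):
        problems.append("amounts missing or not parseable as decimals")

    return problems
```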
Downside 5: Provenance and traceability become more work
In extraction workflows, you often want:
- where did this value come from (page, region)
- confidence
- a highlighted crop for review
PDF text extraction can give you coordinates natively (words with bounding boxes). (PyMuPDF Documentation)
OCR pipelines also provide coordinates and confidence by design.
A VLM can be prompted to return boxes, but it is not inherently a “coordinate-native” system unless you choose a model/approach built for it. If you do not enforce provenance early, audits and debugging get painful.
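One way to keep provenance from becoming an afterthought is to refuse to emit a value without it; a minimal sketch of such a record (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractedValue:
    field: str                  # e.g. "total"
    value: str                  # verbatim text as extracted
    page: int                   # 0-based page index
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 in page coordinates
    source: str                 # "pdf-text" | "ocr" | "vlm"
    confidence: float | None    # None for sources that do not report one (e.g. a raw VLM)
```

Every consumer then sees the same shape whatever produced the value, and a review UI can always render a highlighted crop from (page, bbox).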
Downside 6: Long documents compound cost and error
With 30–60k context, you are already fighting “use the right evidence” problems. “Lost in the Middle” shows performance can degrade significantly when relevant information sits in the middle of long contexts. (arXiv)
Now add vision:
- more pages means more encodes
- more encodes means more latency and variance
- more variance means more edge-case failures
So “vision + long docs” often pushes you toward routing and selective processing rather than “full-page everything.”
Downside 7: You lose a very useful trick for mixed PDFs
OCRmyPDF’s --skip-text exists because mixed documents are common and you want to preserve existing text while OCR’ing only what needs it. (OCRmyPDF)
If you rasterize everything, you cannot “skip text.” You always pay the full price.
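OCRmyPDF also exposes this through its Python API, so the hybrid can live inside your pipeline rather than a shell step; a minimal sketch, assuming a recent ocrmypdf whose ocr() accepts a skip_text keyword mirroring the CLI flag:

```python
import ocrmypdf

# Add OCR text only to pages that have none; pages with an existing text layer are left alone.
ocrmypdf.ocr(
    "mixed.pdf",
    "mixed_with_text_layer.pdf",
    skip_text=True,  # mirrors --skip-text
)
```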
A better default architecture: hybrid routing + selective vision
This is the pattern that tends to win on accuracy, cost, and debuggability.
Step 1: Page-level routing
For each page:
- If a usable text layer exists, extract it (plus word boxes).
- If not, render and OCR (or render and vision).
This directly matches how OCRmyPDF treats mixed documents with --skip-text. (OCRmyPDF)
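A minimal routing loop in that spirit, with PyMuPDF for the text path and pytesseract standing in for “render and OCR” (the word-count threshold and DPI are arbitrary choices; swap in your own OCR or vision call):

```python
import pymupdf
import pytesseract
from PIL import Image

def extract_page(page) -> dict:
    words = page.get_text("words")  # tuples of (x0, y0, x1, y1, word, block, line, word_no)
    if len(words) >= 20:            # arbitrary threshold for "usable text layer"
        return {"source": "pdf-text", "words": words}

    # No usable text layer: render the page and OCR it instead.
    pix = page.get_pixmap(dpi=300)
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    return {"source": "ocr", "words": data}

doc = pymupdf.open("mixed.pdf")
pages = [extract_page(p) for p in doc]
```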
Step 2: Layout detection before heavy extraction
Run a lightweight layout detector to find:
- tables
- key-value blocks
- headers/footers
- figures
DocLayout-YOLO is an example of a modern “real-time layout detection” approach built for diverse documents. (GitHub)
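The exact detector matters less than what you do with its output: route each zone to the cheapest tool that can handle it. The region dicts below are a hypothetical output shape, not DocLayout-YOLO’s actual API:

```python
# Hypothetical layout-detector output: one labelled box per region, in image pixels.
regions = [
    {"label": "table",  "bbox": (50, 120, 560, 480), "score": 0.93},
    {"label": "figure", "bbox": (60, 500, 550, 700), "score": 0.88},
    {"label": "text",   "bbox": (50, 40, 560, 110),  "score": 0.97},
]

ROUTING = {
    "table":  "table-structure-model",  # Step 3
    "figure": "vlm",                    # Step 4, only if figure content matters
    "text":   "pdf-text-or-ocr",        # cheap path from Step 1
}

for region in regions:
    if region["score"] < 0.5:           # arbitrary threshold: uncertain zones go to review
        target = "review"
    else:
        target = ROUTING.get(region["label"], "review")
    print(region["label"], "->", target)
```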
Step 3: Table structure recognition for tables
Use a table structure model (Table Transformer or similar) to recover the grid and cell grouping. (GitHub)
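A hedged sketch of that step with the Hugging Face Table Transformer structure-recognition checkpoint (model name and post-processing as described in the Transformers docs; the 0.7 threshold is an arbitrary choice):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

ckpt = "microsoft/table-transformer-structure-recognition"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = TableTransformerForObjectDetection.from_pretrained(ckpt)

table_crop = Image.open("table_region.png").convert("RGB")  # crop produced by the layout step
inputs = processor(images=table_crop, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([table_crop.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]

for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    # Labels include rows, columns, headers, and spanning cells; intersecting row and
    # column boxes yields the cell grid (see the grid sketch in the tables section).
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())
```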
Step 4: Escalate to a VLM only on hard regions
Use the VLM for:
- low-confidence OCR regions
- stamps/signatures/handwritten notes
- cells with unusual formatting
- visually complex totals blocks
This keeps your VLM spend bounded and makes failures reviewable.
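The escalation itself can stay thin: crop the hard region, send only that crop, and record that a generative model produced the value. transcribe_crop_with_vlm below is a placeholder for whatever VLM client you use, and the confidence floor is an arbitrary threshold:

```python
from PIL import Image

CONFIDENCE_FLOOR = 60  # Tesseract-style 0-100 confidence; pick your own floor

def maybe_escalate(page_image: Image.Image, region: dict) -> dict:
    """Send a single region to the VLM only when the cheap path is not trusted."""
    if region["source"] == "ocr" and region["confidence"] < CONFIDENCE_FLOOR:
        crop = page_image.crop(region["bbox"])   # (x0, y0, x1, y1) in image pixels
        text = transcribe_crop_with_vlm(crop)    # placeholder: your VLM call goes here
        return {**region, "text": text, "source": "vlm", "needs_review": True}
    return region

def transcribe_crop_with_vlm(crop: Image.Image) -> str:
    # Placeholder. In practice: encode the crop, call your VLM with a strict
    # "transcribe verbatim, do not correct" prompt, and return the raw string.
    raise NotImplementedError
```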
When “vision-first for every page” is actually the right choice
It can be the right default if most of these are true:
- Your PDFs are mostly scans or photos.
- Layout is inconsistent and the “text layer” is often missing or corrupted.
- You need to read non-text marks (checkboxes, stamps, handwriting).
- You care more about semantic extraction than verbatim transcription.
- You can afford higher per-page compute and you have strong validation.
If you do this, treat it like an engineering system, not a prompt (a sketch of what that can look like follows the list below):
- enforce schema and validations
- require provenance fields when possible (page index, crop ID)
- build a review loop for low-confidence outputs
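A minimal sketch of those three bullets together, using pydantic v2 for the schema (field names and the confidence floor are illustrative):

```python
from pydantic import BaseModel, Field

class ExtractedField(BaseModel):
    value: str
    page_index: int = Field(ge=0)               # provenance: which page
    crop_id: str                                # provenance: which rendered crop it came from
    confidence: float = Field(ge=0.0, le=1.0)

class InvoiceExtraction(BaseModel):
    invoice_id: ExtractedField
    total: ExtractedField

def route_for_review(doc: InvoiceExtraction, floor: float = 0.8) -> list[str]:
    """Return the field names a human should check before the record is trusted."""
    return [
        name for name, field in doc.model_dump().items()
        if field["confidence"] < floor
    ]
```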
Reading path focused on your decision
These are the sources that give the right background for “text vs OCR vs vision”:
- PyMuPDF text extraction and reading order: Explains that reading order is non-trivial and shows sort options and word-level boxes. (PyMuPDF Documentation)
- OCRmyPDF mixed-document behavior (--skip-text): Concrete evidence that hybrid handling is a first-class need in practice. (OCRmyPDF)
- Donut (OCR-free document understanding): A clear articulation of why OCR pipelines can be expensive and how OCR errors propagate, plus a vision-first alternative. (arXiv)
- PaddleOCR 3.0 PP-StructureV3 pipeline: Shows what “production document parsing” looks like: preprocessing, OCR, layout analysis, postprocessing, structured outputs. (arXiv)
- Table Transformer: Shows that tables are a distinct subproblem with dedicated models and structure metrics. (GitHub)
- DocLayout-YOLO: Modern layout detection framing and why zoning is a big lever for speed and accuracy. (GitHub)
Summary
- Image-first is simplest and handles scans and visual marks well, but it discards clean PDF text when it exists and makes DPI a permanent cost-quality tradeoff. (PyMuPDF Documentation)
- Tables are the biggest pain point. Table structure recognition exists because “read the page” is not enough. (GitHub)
- The most robust default is hybrid routing: extract native text when available, OCR or vision when not, then use layout detection and escalate to VLM only for hard regions. (OCRmyPDF)