NeMo Retriever OCR v1

Model Overview

Description

The NeMo Retriever OCR v1 model is a state-of-the-art text recognition model designed for robust end-to-end optical character recognition (OCR) on complex real-world images. It integrates three core neural network modules: a detector for text region localization, a recognizer for transcription of detected regions, and a relational model for layout and structure analysis.

This model is optimized for a wide variety of OCR tasks, including multi-line, multi-block, and natural scene text, and it supports advanced reading order analysis via its relational model component. NeMo Retriever OCR v1 has been developed to be production-ready and commercially usable, with a focus on speed and accuracy on both document and natural scene images.

The NeMo Retriever OCR v1 model is part of the NVIDIA NeMo Retriever collection of NIM microservices, which provides state-of-the-art, commercially ready models and microservices optimized for the lowest latency and highest throughput. It features a production-ready information retrieval pipeline with enterprise support. The models that form the core of this solution have been trained using responsibly selected, auditable data sources. With multiple pre-trained models available as starting points, developers can readily customize them for domain-specific use cases, such as information technology help assistants, human resources help assistants, and research and development assistants.

This model is ready for commercial use.

We are excited to announce the open sourcing of this commercial model. For users interested in deploying this model in production environments, it is also available via the model API in NVIDIA Inference Microservices (NIM) at nemoretriever-ocr-v1.

License/Terms of use

Use of this model is governed by the NVIDIA Open Model License Agreement. The post-processing scripts are licensed under Apache 2.0.

Team

Mike Ranzinger
Bo Liu
Theo Viel
Charles Blackmon-Luca
Oliver Holworthy
Edward Kim
Even Oldridge

Deployment Geography

Global

Use Case

The NeMo Retriever OCR v1 model is designed for high-accuracy and high-speed extraction of textual information from images, making it ideal for powering multimodal retrieval systems, Retrieval-Augmented Generation (RAG) pipelines, and agentic applications that require seamless integration of visual and language understanding. Its robust performance and efficiency make it an excellent choice for next-generation AI systems that demand both precision and scalability across diverse real-world content.

Release Date

10/23/2025 via https://huggingface.co/nvidia/nemoretriever-ocr-v1

References

Technical blog: https://developer.nvidia.com/blog/approaches-to-pdf-data-extraction-for-information-retrieval/

Model Architecture

Architecture Type: Hybrid detector–recognizer with document-level relational modeling

The NeMo Retriever OCR v1 model integrates three specialized neural components:

Text Detector: Utilizes a RegNetY-8GF convolutional backbone for high-accuracy localization of text regions within images.
Text Recognizer: Employs a Transformer-based sequence recognizer to transcribe text from detected regions, supporting variable word and line lengths.
Relational Model: Applies a multi-layer global relational module to predict logical groupings, reading order, and layout relationships across detected text elements.

All components are trained jointly in an end-to-end fashion, providing robust, scalable, and production-ready OCR for diverse document and scene images.
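
As a rough illustration of how the three stages hand data to one another, the sketch below wires together placeholder functions. The names `Region`, `detect`, `recognize`, `relate`, and `run_pipeline` are hypothetical, the stubs return fixed values, and the ordering rule is a simple geometric sort standing in for the learned relational model. None of this is the model's actual code.

```python
from dataclasses import dataclass


@dataclass
class Region:
    box: tuple          # (left, upper, right, lower), illustrative values only
    text: str = ""
    order: int = -1     # position in reading order, filled in by relate()


def detect(image):
    # Placeholder for the detector: returns fixed boxes instead of running a network.
    return [Region(box=(298, 14, 357, 28)), Region(box=(15, 14, 233, 27))]


def recognize(image, regions):
    # Placeholder for the recognizer: attaches dummy text to each detected region.
    for index, region in enumerate(regions):
        region.text = f"text-{index}"
    return regions


def relate(regions):
    # Placeholder for the relational model: a plain top-to-bottom,
    # left-to-right sort. The real module predicts these relations.
    ranked = sorted(regions, key=lambda r: (r.box[1], r.box[0]))
    for position, region in enumerate(ranked):
        region.order = position
    return ranked


def run_pipeline(image):
    # Detection feeds recognition, and both feed the relational stage.
    return relate(recognize(image, detect(image)))


result = run_pipeline(image=None)
print([(r.order, r.text, r.box) for r in result])
```

The point of the sketch is only the data flow: each stage consumes the regions produced by the previous one and enriches them.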

Network Architecture: RegNetY-8GF

Parameter Counts:

Component	Parameters
Detector	45,268,472
Recognizer	4,944,346
Relational model	2,254,422
Total	52,467,240
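
The totals in the table can be checked with a few lines of arithmetic. The sketch below sums the per-component counts and prints each component's share. The weight-size estimate assumes 4 bytes per parameter (fp32), which is an assumption made here for illustration and is not stated in this card.

```python
# Parameter counts copied from the table above.
params = {
    "Detector": 45_268_472,
    "Recognizer": 4_944_346,
    "Relational model": 2_254_422,
}

total = sum(params.values())
assert total == 52_467_240  # matches the "Total" row

for name, count in params.items():
    print(f"{name}: {count:,} ({count / total:.1%} of total)")

# Assumption: 4 bytes per parameter (fp32 weights).
approx_megabytes = total * 4 / 1e6
print(f"Approximate fp32 weight size: {approx_megabytes:.0f} MB")
```

The detector accounts for roughly 86% of the parameters, so it is the component that dominates the model's size.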

Input

Property	Value
Input Type & Format	Image (RGB, PNG/JPEG, float32/uint8), aggregation level (word, sentence, or paragraph)
Input Parameters (Two-Dimensional)	3 x H x W (single image) or B x 3 x H x W (batch)
Input Range	[0, 1] (float32) or [0, 255] (uint8, auto-converted)
Other Properties	Handles both single images and batches. Automatic multi-scale resizing for best accuracy.
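
To make the layout and range conventions concrete, the following sketch converts a tiny H x W x 3 image of 0-255 integers into the channel-first 3 x H x W layout with values in [0, 1], then groups two such images into a batch. The helper names `to_chw_unit_range` and `stack_batch` are hypothetical, and pure Python lists are used only to keep the example dependency-free. According to the table, the model performs the uint8 conversion automatically, so this is an illustration of the conventions rather than a required preprocessing step.

```python
def to_chw_unit_range(image_hwc):
    """Convert an H x W x 3 nested list of 0-255 ints to 3 x H x W floats in [0, 1]."""
    height = len(image_hwc)
    width = len(image_hwc[0])
    return [
        [[image_hwc[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in range(3)
    ]


def stack_batch(images_chw):
    """Group same-sized 3 x H x W images into a B x 3 x H x W batch."""
    shapes = {(len(img), len(img[0]), len(img[0][0])) for img in images_chw}
    if len(shapes) != 1:
        raise ValueError("all images in a batch must share the same shape")
    return list(images_chw)


# A 1 x 2 RGB image: one dark-to-bright pixel and one white pixel.
tiny = [[[0, 128, 255], [255, 255, 255]]]
chw = to_chw_unit_range(tiny)
batch = stack_batch([chw, chw])

# Dimensions: B, C, H, W
print(len(batch), len(chw), len(chw[0]), len(chw[0][0]))
```

In practice the same conversion would normally be done with an array library rather than nested lists.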

Output

Property	Value
Output Type	Structured OCR results: a list of detected text regions (bounding boxes), recognized text, and confidence scores
Output Format	Bounding boxes: tuple of floats, recognized text: string, confidence score: float
Output Parameters	Bounding boxes: One-Dimensional (1D) list of bounding box coordinates, recognized text: One-Dimensional (1D) list of strings, confidence score: One-Dimensional (1D) list of floats
Other Properties	Please see the sample output for an example of the model output

Sample output

ocr_boxes = [[[15.552736282348633, 43.141815185546875],
  [150.00149536132812, 43.141815185546875],
  [150.00149536132812, 56.845645904541016],
  [15.552736282348633, 56.845645904541016]],
 [[298.3145751953125, 44.43315124511719],
  [356.93585205078125, 44.43315124511719],
  [356.93585205078125, 57.34814453125],
  [298.3145751953125, 57.34814453125]],
 [[15.44686508178711, 13.67985725402832],
  [233.15859985351562, 13.67985725402832],
  [233.15859985351562, 27.376562118530273],
  [15.44686508178711, 27.376562118530273]],
 [[298.51727294921875, 14.268900871276855],
  [356.9850769042969, 14.268900871276855],
  [356.9850769042969, 27.790447235107422],
  [298.51727294921875, 27.790447235107422]]]

ocr_txts = ['The previous notice was dated',
 '22 April 2016',
 'The previous notice was given to the company on',
 '22 April 2016']

ocr_confs = [0.97730815, 0.98834222, 0.96804602, 0.98499225]
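
The three parallel lists in the sample output are easier to work with once they are zipped into one record per region. The sketch below does that, using the sample values rounded to two decimals. Each box appears to be four corner points, which the hypothetical helper `quad_to_rect` reduces to an axis-aligned rectangle. The row-bucketing sort and the `ROW_TOLERANCE` value are naive illustrations and are not how the model's relational component determines reading order.

```python
# Sample output from above, rounded to two decimals for brevity.
ocr_boxes = [
    [[15.55, 43.14], [150.00, 43.14], [150.00, 56.85], [15.55, 56.85]],
    [[298.31, 44.43], [356.94, 44.43], [356.94, 57.35], [298.31, 57.35]],
    [[15.45, 13.68], [233.16, 13.68], [233.16, 27.38], [15.45, 27.38]],
    [[298.52, 14.27], [356.99, 14.27], [356.99, 27.79], [298.52, 27.79]],
]
ocr_txts = [
    "The previous notice was dated",
    "22 April 2016",
    "The previous notice was given to the company on",
    "22 April 2016",
]
ocr_confs = [0.97730815, 0.98834222, 0.96804602, 0.98499225]


def quad_to_rect(quad):
    """Reduce four (x, y) corner points to (left, upper, right, lower)."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    return min(xs), min(ys), max(xs), max(ys)


records = []
for quad, text, conf in zip(ocr_boxes, ocr_txts, ocr_confs):
    left, upper, right, lower = quad_to_rect(quad)
    records.append({"text": text, "confidence": conf,
                    "left": left, "upper": upper, "right": right, "lower": lower})

# Hypothetical tolerance: boxes whose top edges fall in the same 5-unit
# bucket are treated as one row, then sorted left to right within the row.
ROW_TOLERANCE = 5.0
records.sort(key=lambda r: (round(r["upper"] / ROW_TOLERANCE), r["left"]))

for record in records:
    print(f"{record['confidence']:.2f}  {record['text']}")
```

For this sample, the sort places the two regions near the top of the image before the two regions below them.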

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Usage

Prerequisites

OS: Linux amd64 with NVIDIA GPU
CUDA: CUDA Toolkit 12.8 and compatible NVIDIA driver installed (for PyTorch CUDA). Verify with nvidia-smi.
Python: 3.12 (both subpackages require python = ~3.12)
Build tools (when building the C++ extension):
- GCC/G++ with C++17 support
- CUDA toolkit headers (for building CUDA kernels)
- OpenMP (used by the C++ extension)
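
Some of these prerequisites can be checked quickly before installing. The sketch below uses only the Python standard library. The function name `check_prerequisites` is hypothetical, and the check is deliberately shallow: it looks for executables on `PATH` and does not verify the CUDA Toolkit version, C++17 support, or OpenMP.

```python
import shutil
import sys


def check_prerequisites():
    """Return a dict mapping each coarse prerequisite check to True or False."""
    return {
        "python_3_12": sys.version_info[:2] == (3, 12),
        "linux": sys.platform.startswith("linux"),
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "gxx_on_path": shutil.which("g++") is not None,
        "nvcc_on_path": shutil.which("nvcc") is not None,
    }


status = check_prerequisites()
for name, ok in status.items():
    print(f"{'OK' if ok else 'MISSING':8s}{name}")
```

A `MISSING` entry for `gxx_on_path` or `nvcc_on_path` only matters when building the C++ extension.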

Installation

The model requires torch and the custom code available in this repository.

Clone the repository

Make sure git-lfs is installed (https://git-lfs.com)

git lfs install

Using https

git clone https://huggingface.co/nvidia/nemoretriever-ocr-v1

Or using ssh

git clone git@hf.co:nvidia/nemoretriever-ocr-v1

Installation

With pip

Create and activate a Python 3.12 environment (optional)
Run the following command to install the package:

cd nemo-retriever-ocr
pip install hatchling
pip install -v .

With docker

Run the example end-to-end without installing anything on the host (besides Docker, docker compose, and NVIDIA Container Toolkit):

Ensure Docker can see your GPU:

docker run --rm --gpus all nvcr.io/nvidia/pytorch:25.09-py3 nvidia-smi

From the repo root, bring up the service to run the example against the provided image ocr-example-input-1.png:

docker compose run --rm nemo-retriever-ocr \
  bash -lc "python example.py ocr-example-input-1.png --merge-level paragraph"

This will:

Build an image from the provided Dockerfile (based on nvcr.io/nvidia/pytorch)
Mount the repo at /workspace
Run example.py with the model loaded from checkpoints

Output is saved next to your input image as <name>-annotated.<ext> on the host.
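
The naming pattern described above can be reproduced with `pathlib` when you need to locate the annotated file programmatically. This is a minimal sketch that assumes the pattern is exactly `<name>-annotated.<ext>`. The helper `annotated_path` is hypothetical and is not taken from example.py.

```python
from pathlib import Path


def annotated_path(input_path):
    """Return the sibling path '<name>-annotated.<ext>' for an input image."""
    path = Path(input_path)
    return path.with_name(f"{path.stem}-annotated{path.suffix}")


print(annotated_path("ocr-example-input-1.png"))
# ocr-example-input-1-annotated.png
```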

Run the model using the following code:

from nemo_retriever_ocr.inference.pipeline import NemoRetrieverOCR

ocr = NemoRetrieverOCR()

predictions = ocr("ocr-example-input-1.png")

for pred in predictions:
    print(
        f"  - Text: '{pred['text']}', "
        f"Confidence: {pred['confidence']:.2f}, "
        f"Bbox: [left={pred['left']:.4f}, upper={pred['upper']:.4f}, right={pred['right']:.4f}, lower={pred['lower']:.4f}]"
    )
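
A common next step is to drop low-confidence regions and serialize the rest. The sketch below works on a list of dictionaries with the same keys used in the loop above (`text`, `confidence`, `left`, `upper`, `right`, `lower`). The `sample` list is hand-written stand-in data rather than real model output, and the helpers `filter_predictions` and `to_json` and the 0.5 threshold are hypothetical choices.

```python
import json


def filter_predictions(predictions, min_confidence=0.5):
    """Keep only predictions whose confidence meets the threshold."""
    return [p for p in predictions if p["confidence"] >= min_confidence]


def to_json(predictions):
    """Serialize predictions, casting numeric fields to plain Python floats."""
    numeric_keys = ("confidence", "left", "upper", "right", "lower")
    plain = [
        {"text": str(p["text"]), **{key: float(p[key]) for key in numeric_keys}}
        for p in predictions
    ]
    return json.dumps(plain, indent=2, ensure_ascii=False)


# Stand-in for the list returned by ocr(...); all values are made up.
sample = [
    {"text": "22 April 2016", "confidence": 0.99,
     "left": 0.60, "upper": 0.05, "right": 0.72, "lower": 0.09},
    {"text": "smudge", "confidence": 0.31,
     "left": 0.10, "upper": 0.80, "right": 0.15, "lower": 0.83},
]

kept = filter_predictions(sample, min_confidence=0.5)
print(to_json(kept))
```

The explicit `float()` cast is there in case the real values are array or tensor scalars, which `json.dumps` cannot serialize directly.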

Model Version(s):

nemoretriever-ocr-v1

Training and Evaluation Datasets:

Training Dataset

Data Modality

Image

Image Training Data Size

Less than a Million Images

The model is trained on a large-scale, curated mix of public and proprietary OCR datasets, focusing on high diversity of document layouts and scene images. The training set includes synthetic and real images with varied noise and backgrounds, filtered for commercial use eligibility.

Data Collection Method: Hybrid (Automated, Human, Synthetic)
Labeling Method: Hybrid (Automated, Human, Synthetic)
Properties: Includes scanned documents, natural scene images, receipts, and business documents.

Evaluation Datasets

The NeMo Retriever OCR v1 model is evaluated on several NVIDIA internal datasets for various tasks, such as pure OCR, table content extraction, and document retrieval.

Data Collection Method: Hybrid (Automated, Human, Synthetic)
Labeling Method: Hybrid (Automated, Human, Synthetic)
Properties: Benchmarks include challenging scene images, documents with varied layouts, and multi-language data.

Evaluation Results

We benchmarked NeMo Retriever OCR v1 on internal evaluation datasets against PaddleOCR on various tasks, such as pure OCR (Character Error Rate), table content extraction (TEDS), and document retrieval (Recall@5).

Metric	NeMo Retriever OCR v1	PaddleOCR	Net change
Character Error Rate	0.1633	0.2029	-19.5% ✔️
Bag-of-character Error Rate	0.0453	0.0512	-11.5% ✔️
Bag-of-word Error Rate	0.1203	0.2748	-56.2% ✔️
Table Extraction TEDS	0.781	0.781	0.0% ⚖️
Public Earnings Multimodal Recall@5	0.779	0.775	+0.5% ✔️
Digital Corpora Multimodal Recall@5	0.901	0.883	+2.0% ✔️
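
For readers unfamiliar with these metrics, the sketch below shows one common way to compute a character error rate (Levenshtein edit distance divided by reference length) and how the "Net change" column follows from the two score columns. The card does not state the exact metric definitions used in the internal evaluation, so treat `character_error_rate` as the textbook definition and not necessarily the one used here. The bag-of-character and bag-of-word variants are not implemented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_change(new, baseline):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# One substituted character ("l" read as "1") in a 13-character reference.
print(character_error_rate("22 April 2016", "22 Apri1 2016"))

# Reproduce the "Net change" column from the first and last table rows.
print(round(relative_change(0.1633, 0.2029), 1))  # -19.5
print(round(relative_change(0.901, 0.883), 1))    # 2.0
```

For the error-rate rows a negative change is an improvement, while for TEDS and Recall@5 a positive change is an improvement.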

Detailed Performance Analysis

The model demonstrates robust performance on complex layouts, noisy backgrounds, and challenging real-world scenes. Reading order and block detection are powered by the relational module, supporting downstream applications such as chart-to-text, table-to-text, and infographic-to-text extraction.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Explainability, Bias, Safety & Security, and Privacy sections below.
Please report security vulnerabilities or NVIDIA AI Concerns here.

Bias

Field	Response
Participation considerations from adversely impacted groups (protected classes) in model design and testing	None
Measures taken to mitigate against unwanted bias	None

Explainability

Field	Response
Intended Task/Domain:	Optical Character Recognition (OCR) with a focus on retrieval application and documents.
Model Type:	Hybrid neural network with convolutional detector, transformer recognizer, and document structure modeling.
Intended Users:	Developers and teams building AI-driven search applications, retrieval-augmented generation (RAG) workflows, multimodal agents, or document intelligence applications. It is ideal for those working with large collections of scanned or photographed documents, including PDFs, forms, and reports.
Output:	Structured OCR results, including detected bounding boxes, recognized text, and confidence scores.
Describe how the model works:	The model first detects text regions in the image, then transcribes the text in each region, and finally analyzes document structure and reading order. It outputs structured, machine-readable results suitable for downstream search and analysis.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of:	Not Applicable
Technical Limitations:	This model version supports English only.
Verified to have met prescribed NVIDIA quality standards:	Yes
Performance Metrics:	Accuracy (e.g., character error rate), throughput, and latency.
Potential Known Risks:	The model may not always extract or transcribe all text with perfect accuracy, particularly in cases of poor image quality or highly stylized fonts.
Licensing & Terms of Use:	Use of this model is governed by the NVIDIA Open Model License Agreement. The post-processing scripts are licensed under Apache 2.0.

Privacy

Field	Response
Generatable or reverse engineerable personal data?	No
Personal data used to create this model?	None Known
How often is dataset reviewed?	The dataset is initially reviewed when added, and subsequent reviews are conducted as needed or in response to change requests.
Is there provenance for all datasets used in training?	Yes
Does data labeling (annotation, metadata) comply with privacy laws?	Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made?	No, not possible with externally-sourced data.
Applicable Privacy Policy	https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

Safety

Field	Response
Model Application Field(s):	Text recognition and structured OCR for multimodal retrieval. Inputs can include natural scene images, scanned documents, charts, tables, and infographics.
Use Case Restrictions:	Abide by the NVIDIA Open Model License Agreement. The post-processing scripts are licensed under Apache 2.0.
Model and dataset restrictions:	The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access only during training, and all dataset license constraints are adhered to.
Describe the life critical impact (if present):	Not applicable.
