Cernis-Thinking: Multi-Task Vision Language Model for Document Understanding
Cernis-Thinking is a reasoning-capable vision language model fine-tuned with reinforcement learning (GRPO/GSPO) for document understanding tasks. Built on Qwen2.5-VL-7B, it excels at mathematical reasoning, LaTeX OCR, invoice extraction, and handwriting transcription.
Model Details
- Base Model: Qwen2.5-VL-7B-Instruct
- Training Method: Group Relative Policy Optimization (GRPO) with GSPO extensions
- Training Data: ~2,000 samples across 4 document understanding tasks
- Model Size: 7B parameters
- License: Apache 2.0
Capabilities
Cernis-Thinking is trained on four distinct document understanding tasks (illustrative prompts for each follow the list):
- Mathematical Reasoning - Solves math problems from images with step-by-step reasoning
- LaTeX OCR - Converts mathematical notation images to LaTeX code
- Invoice Extraction - Extracts structured information from invoices and receipts
- Handwriting Transcription - Transcribes handwritten text from images
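The snippet below collects one example prompt per task. The invoice and LaTeX wordings mirror the usage examples later in this card; the math and handwriting wordings are illustrative assumptions, not the exact training prompts.

# Illustrative task prompts. The invoice and LaTeX wordings follow the usage
# examples below; the math and handwriting wordings are assumptions.
TASK_PROMPTS = {
    "math": (
        "Solve the problem shown in this image. First provide your reasoning "
        "between <REASONING> and </REASONING>, then your answer between "
        "<SOLUTION> and </SOLUTION>"
    ),
    "latex_ocr": (
        "What is the LaTeX code shown in this image? Provide your answer "
        "between <SOLUTION> and </SOLUTION>"
    ),
    "invoice": (
        "Extract the key information from this invoice. First provide your "
        "reasoning between <REASONING> and </REASONING>, then your answer "
        "between <SOLUTION> and </SOLUTION>"
    ),
    "handwriting": (
        "Transcribe the handwritten text in this image. Provide your answer "
        "between <SOLUTION> and </SOLUTION>"
    ),
}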
Training Details
Datasets
- AI4Math/MathVista - Mathematical reasoning (filtered for numeric answers)
- unsloth/LaTeX_OCR - LaTeX formula recognition
- mychen76/invoices-and-receipts_ocr_v1 - Invoice extraction
- corto-ai/handwritten-text - Handwriting transcription
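A minimal sketch of pulling these datasets with the Hugging Face datasets library is shown below; the split names and the numeric-answer filter are assumptions and may differ from the actual preprocessing pipeline.

from datasets import load_dataset

# Sketch only: split names and filtering logic are assumptions.
mathvista   = load_dataset("AI4Math/MathVista", split="testmini")
latex_ocr   = load_dataset("unsloth/LaTeX_OCR", split="train")
invoices    = load_dataset("mychen76/invoices-and-receipts_ocr_v1", split="train")
handwriting = load_dataset("corto-ai/handwritten-text", split="train")

# MathVista is filtered so that only samples with numeric answers remain.
def has_numeric_answer(example):
    answer = str(example.get("answer", "")).strip()
    return answer.replace("-", "", 1).replace(".", "", 1).isdigit()

mathvista = mathvista.filter(has_numeric_answer)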
Reinforcement Learning Approach
The model was trained using GRPO (Group Relative Policy Optimization) with custom reward functions, sketched in code after the list below:
1. Formatting Reward Function
- Rewards proper use of <REASONING> and <SOLUTION> tags
- Penalizes malformed outputs (e.g., excessive "addCriterion" artifacts)
- Encourages structured, parseable responses
2. Task-Specific Correctness Reward
- Math: Exact numeric matching (2.0 points)
- LaTeX/Handwriting: String similarity with word overlap scoring (0.75-2.0 points)
- Invoices: Partial credit for extracting key information (1.5 points)
3. ROUGE-like Word Overlap
- For text-heavy tasks, rewards based on word overlap ratio:
- 50% overlap: 1.5 points
- 30% overlap: 0.75 points
- Prevents wasted training on completely wrong outputs
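The sketch below condenses these three reward components. The signatures follow TRL's reward-function convention (completion strings in, list of floats out), but the column names (answer, task), thresholds wiring, and the word_overlap_score helper are assumptions based on the descriptions above, not the exact training code.

import re

def word_overlap_score(prediction: str, reference: str) -> float:
    # ROUGE-like word overlap: 1.5 points at >=50% overlap, 0.75 at >=30%.
    pred_words, ref_words = set(prediction.split()), set(reference.split())
    if not ref_words:
        return 0.0
    overlap = len(pred_words & ref_words) / len(ref_words)
    if overlap >= 0.5:
        return 1.5
    if overlap >= 0.3:
        return 0.75
    return 0.0

def formatting_reward(completions, **kwargs):
    # Reward well-formed <REASONING>/<SOLUTION> structure and penalize
    # malformed outputs such as repeated "addCriterion" artifacts.
    scores = []
    for completion in completions:
        score = 0.0
        if "<REASONING>" in completion and "</REASONING>" in completion:
            score += 0.5
        if "<SOLUTION>" in completion and "</SOLUTION>" in completion:
            score += 1.0
        if completion.count("addCriterion") > 2:
            score -= 1.0
        scores.append(score)
    return scores

def correctness_reward(completions, answer, task, **kwargs):
    # Task-specific correctness: exact numeric match for math (2.0 points),
    # word-overlap scoring for LaTeX/handwriting, partial credit for invoices.
    scores = []
    for completion, ref, kind in zip(completions, answer, task):
        match = re.search(r"<SOLUTION>(.*?)</SOLUTION>", completion, re.DOTALL)
        predicted = match.group(1).strip() if match else ""
        if kind == "math":
            scores.append(2.0 if predicted == str(ref).strip() else 0.0)
        elif kind in ("latex_ocr", "handwriting"):
            scores.append(word_overlap_score(predicted, str(ref)))
        else:  # invoice: partial credit for recovering key fields
            fields = [str(v) for v in ref.values()] if isinstance(ref, dict) else [str(ref)]
            hits = sum(1 for f in fields if f and f in predicted)
            scores.append(1.5 * hits / max(len(fields), 1))
    return scores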
Training Configuration
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate = 5e-6,
    num_train_epochs = 0.5,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    num_generations = 4,              # completions sampled per prompt
    max_prompt_length = 1024,
    max_completion_length = 1024,
    # GSPO settings
    importance_sampling_level = "sequence",
    loss_type = "dr_grpo",
)
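A minimal sketch of wiring this configuration into a trainer, assuming TRL's GRPOTrainer and the reward functions sketched above; model, processor, and train_dataset are placeholders for the Qwen2.5-VL model, its processor, and the mixed document dataset.

from trl import GRPOTrainer

# Sketch only: model, processor, and train_dataset come from the earlier steps.
trainer = GRPOTrainer(
    model = model,
    processing_class = processor,
    args = training_args,
    train_dataset = train_dataset,
    reward_funcs = [formatting_reward, correctness_reward],
)
trainer.train()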
Usage
With Transformers
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

# Load model and processor (Qwen2.5-VL uses the Qwen2_5_VL model class)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "coolAI/cernis-thinking",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("coolAI/cernis-thinking")

# Prepare image and prompt
image = Image.open("document.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract the key information from this invoice. First provide your reasoning between <REASONING> and </REASONING>, then your answer between <SOLUTION> and </SOLUTION>"}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True).to(model.device)

# Generate
output_ids = model.generate(**inputs, max_new_tokens=1024)
generated_text = processor.batch_decode(output_ids, skip_special_tokens=True)
print(generated_text[0])
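Because the model keeps the <REASONING>/<SOLUTION> structure, the final answer can be pulled out with a small regex. The helper below is a convenience sketch, not part of the model's API.

import re

def extract_solution(generated: str) -> str:
    # Return the text inside <SOLUTION> ... </SOLUTION>, or the full output if missing.
    match = re.search(r"<SOLUTION>(.*?)</SOLUTION>", generated, re.DOTALL)
    return match.group(1).strip() if match else generated.strip()

print(extract_solution(generated_text[0]))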
With vLLM (Recommended for Production)
from vllm import LLM, SamplingParams
from PIL import Image

# Initialize vLLM
llm = LLM(
    model="coolAI/cernis-thinking",
    max_model_len=16384,
    gpu_memory_utilization=0.8
)

# Prepare prompt (Qwen2.5-VL chat format with an image placeholder)
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>What is the LaTeX code shown in this image? Provide your answer between <SOLUTION> and </SOLUTION><|im_end|>\n<|im_start|>assistant\n"

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_k=50,
    max_tokens=1024
)

# Generate, passing the local image via multi_modal_data
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open("formula.png")},
    },
    sampling_params=sampling_params
)
print(outputs[0].outputs[0].text)
Example Outputs
Mathematical Reasoning
Input: Image of geometry problem
Output:
<REASONING>
To solve this parallelogram problem, I need to use the properties:
1. Opposite sides are equal in a parallelogram
2. Angle bisectors create specific relationships...
</REASONING>
<SOLUTION>
42
</SOLUTION>
LaTeX OCR
Input: Image of mathematical formula
Output:
<SOLUTION>
\frac{2}{3} < a^{2} \alpha^{2} \leq 1
</SOLUTION>
Invoice Extraction
Input: Invoice image
Output:
<SOLUTION>
Invoice No: 53553822
Date: 07/24/2012
Vendor: Leo Brown
Seller Address: 082 Christopher Club Apt. 771 Thomasberg, OH 42949
Seller Tax ID: 926-74-9803
Total: $247.50
</SOLUTION>
Citation
@misc{cernis-thinking-2025,
  title={Cernis-Thinking: Multi-Task Vision Language Model for Document Understanding},
  author={Your Name},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/coolAI/cernis-thinking}}
}
Acknowledgments
- Built with Unsloth for efficient VLM training
- Base model: Qwen2.5-VL-7B-Instruct
- Training datasets: AI4Math, Unsloth, mychen76, corto-ai
License
Apache 2.0 - Free for commercial and research use