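Qwen3.5-9B Korean Enterprise Document OCR Model

Model Overview

The Qwen3.5-9B Korean Enterprise Document OCR model is based on Qwen3.5-9B and was fine-tuned on
**roughly 300,000 documents (307,272 in total) that are actually used in Korean enterprise environments**.

The model targets image-to-text, OCR post-processing, and document structuring for Korean documents, and is designed to reduce the following problems:

Misrecognized Hangul and incorrect character output
Spacing and token-boundary errors
Missing numbers, units, fields, and cell values
Failure to reconstruct the structure of tables, charts, and mixed-layout documents
Unstructured output that contains OCR text but is hard to use in real work

In practice, it is most effective as a downstream refinement model:
an OCR engine extracts the raw text first, and this model then cleans, corrects, and structures the result.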

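Problems the Model Addresses

The base model, Qwen3.5-9B, performs well overall,
but the following limitations were observed repeatedly in Korean enterprise document processing:

Character errors in complex Korean documents
Omission of key fields or some rows/cells in the body text
Unstable structure preservation in tables, forms, and mixed documents
Reduced consistency between values and descriptions on chart/graph pages
Output that cannot be used directly for downstream automation (RAG, extraction, QA, JSON conversion)

This model was fine-tuned to reduce these problems
and to improve Korean readability, document fidelity, and downstream usability.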

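Main Use Cases

This model is suited to the following Korean enterprise document tasks:

Korean OCR post-processing and correction of typos and misrecognitions
Cleaning unstructured OCR output into tidy text
Structuring documents as Markdown
Reconstructing table, form, and list structure
Generating chart/graph-based descriptions
Extracting document information as JSON
Normalizing reports, internal documents, and office documents
Preprocessing for document QA and document automation

Typical input documents include:

Reports and general business documents
Table-centric documents
Pages containing graphs/charts
Forms and office documents
Mixed Korean/English documents
Academic papers and technical documents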

Training Data Overview

The fine-tuning data totals 307,272 samples, broken down by document type as follows.

Data Type	Count	Ratio
Graph	56,436	18.4%
Text	49,838	16.2%
English papers	44,160	14.4%
Documents	41,980	13.7%
Tables	40,275	13.1%
Papers	36,650	11.9%
Charts	29,903	9.7%
Office	4,460	1.5%
Visualization	3,570	1.2%
Total	307,272	100.0%
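The ratio column can be recomputed from the counts. The following is a minimal Python sketch; the counts are copied from the table above, while the English category labels and variable names are illustrative choices rather than part of the original card.

```python
# Counts copied from the table above; the English labels are illustrative translations.
counts = {
    "Graph": 56_436,
    "Text": 49_838,
    "English papers": 44_160,
    "Documents": 41_980,
    "Tables": 40_275,
    "Papers": 36_650,
    "Charts": 29_903,
    "Office": 4_460,
    "Visualization": 3_570,
}

total = sum(counts.values())

# Share of each category in percent, rounded to one decimal place as in the table.
ratios = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:<16}{n:>8,}{ratios[name]:>7.1f}%")
print(f"{'Total':<16}{total:>8,}")
```

Because each share is rounded independently, the rounded percentages add up to 100.1 rather than exactly 100.0.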

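Data Characteristics

The training data was built to reflect Korean enterprise document environments, with particular emphasis on:

Correction of Korean OCR noise
Recovery of numbers, dates, units, and proper nouns
Handling of table, chart, and mixed-layout documents
Generation of structured outputs such as Markdown and JSON
Document-understanding patterns that can be reused for downstream document QA/SFT data generation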

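Example Training Document Types

The images below are examples of the document styles used for fine-tuning.

1) Report page combining a chart, body text, and a summary table

2) Visualization page centered on a pie chart

3) Page mixing body text, bullets, diagrams, and section structure

With documents like these, plain OCR may extract part of the text,
but it often cannot reliably preserve the table, chart, and paragraph structure.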

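Recommended Usage

This model is most effective in a two-stage pipeline:

An OCR engine extracts raw text from the image/PDF
This model cleans, corrects, and structures it to produce the final output

This approach is especially useful for:

Recovering clean Korean text
Markdown conversion
Table-aware structuring
Chart/graph-based description generation
JSON field extraction
Preprocessing for document automation

The recommended pattern in practice is:

OCR → normalization → structured extraction → consistency validation → downstream automation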
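As an illustration of the normalization step in the pattern above, the sketch below applies conservative Unicode and whitespace cleanup to raw OCR text before it is handed to the model. The card does not specify what normalization should include, so the function name `normalize_ocr_text` and the particular rules are assumptions.

```python
import re
import unicodedata


def normalize_ocr_text(text: str) -> str:
    """Apply conservative Unicode and whitespace cleanup to raw OCR text (hypothetical helper)."""
    # Compose decomposed Hangul (conjoining jamo sequences) into precomposed syllables.
    text = unicodedata.normalize("NFC", text)
    # Replace non-breaking and ideographic spaces with ordinary spaces.
    text = text.replace("\u00a0", " ").replace("\u3000", " ")
    # Collapse runs of spaces and tabs, but keep line breaks because they carry layout.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line, then trim leading and trailing blank lines.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(lines).strip()
```

Spacing inside words is deliberately left alone here, since correcting Korean word spacing is the kind of task the fine-tuned model is meant to handle.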

Quick Start (Transformers)

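import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "SEOKDONG/Qwen3.5-9B-kor-enterprise"

# Load the tokenizer; fall back to EOS as the padding token if none is defined.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Use bfloat16 with automatic device placement on GPU, float32 on CPU.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None,
    low_cpu_mem_usage=True,
)
model.eval()

# Single-turn request: an image reference plus a Korean instruction.
# Note: the code below only tokenizes the rendered prompt; it does not download
# or encode the image at this URL.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://image.inblog.dev/?url=https%3A%2F%2Fsource.inblog.dev%2Fpost_image%2F2025-07-29T01%3A37%3A10.510Z-0de0e57f-ad77-4890-9c34-8b3653a1a59f&w=1920&q=85"
                }
            },
            {
                "type": "text",
                # English: "Correct typos and spacing in the text and extract it as clean
                # Markdown, preserving the original format as much as possible."
                "text": "텍스트의 오타와 띄어쓰기를 교정하고, 원문 형식을 최대한 유지하여 깔끔하게 Markdown 포맷으로 추출해 주세요."
            }
        ]
    }
]

# Render the chat template to a prompt string that ends with the assistant turn header.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt, return_tensors="pt")
if torch.cuda.is_available():
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Sampling with a very low temperature, which behaves close to greedy decoding.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=4094,
        temperature=0.01,
        do_sample=True,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens (everything after the prompt).
prompt_len = inputs["input_ids"].shape[-1]
result = tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
print(result)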

Qwen3.5-9B Korean Enterprise Document OCR

Model Overview

Qwen3.5-9B Korean Enterprise Document OCR is a fine-tune of Qwen3.5-9B for Korean enterprise documents.

The model was fine-tuned on 307,272 Korean enterprise document samples to improve Korean document reading quality in real-world business settings, especially where baseline OCR/image-to-text pipelines often suffer from:

incorrect Korean characters
spacing and tokenization errors
dropped fields or omitted rows/cells
weak reconstruction of tables, charts, and mixed-layout business documents

In practice, the model is most useful as a refinement and structuring step applied to text that an OCR engine (for example, EasyOCR) has already extracted. It can also serve as a text-structuring component in broader document-understanding workflows.

What This Model Is Designed For

This model is intended for Korean enterprise document workloads such as:

OCR post-correction for Korean business documents
converting noisy OCR output into cleaner Korean text
document-to-Markdown structuring
table and form reconstruction
chart/graph grounded description generation
field extraction to JSON
enterprise office document normalization for downstream automation

Typical document scenarios include:

reports and business documents
forms and office documents
tables and mixed-layout pages
charts and data visualizations
Korean and English mixed documents
academic papers and technical documents

Why This Model Exists

The foundation model already provides strong general performance, but in Korean enterprise document workflows we observed recurring issues in image-to-text and OCR-adjacent tasks:

Korean character corruption in dense documents
omission of important fields, rows, or numeric cells
weak handling of tables and form-like layouts
inconsistent structuring of charts, graphs, and report pages

This fine-tuned model was built to improve Korean readability, document fidelity, and downstream usability for enterprise document pipelines.

Training Data Summary

The fine-tuning set contains 307,272 samples across document-oriented categories.

Data Type	Count	Ratio
Graph	56,436	18.4%
Text	49,838	16.2%
English papers	44,160	14.4%
Documents	41,980	13.7%
Tables	40,275	13.1%
Papers	36,650	11.9%
Charts	29,903	9.7%
Office	4,460	1.5%
Visualization	3,570	1.2%
Total	307,272	100.0%

Training Data Characteristics

The dataset was constructed to reflect enterprise-style document workloads in Korean, with emphasis on:

Korean text fidelity under OCR noise
numbers, units, dates, and business entities
table-heavy and mixed-layout pages
graph/chart pages that require structure preservation and grounded interpretation
document normalization into formats such as Markdown and JSON

Example Document Types

Below are representative examples of the document styles used during fine-tuning.

1) Report page with charts, narrative text, and a summary table

2) Pie chart / infographic style page

3) Mixed page with bullets, diagram, section titles, and explanatory text

These examples illustrate the types of pages where naive OCR often fails to preserve full business meaning without additional refinement.

Recommended Usage Pattern

This model is best used in a two-stage pipeline:

An OCR engine extracts raw text from the image or PDF page.
This model corrects, restructures, normalizes, and converts the noisy OCR output into a more usable format.
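The hand-off between the two stages can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes the OCR engine returns `(box, text, confidence)` tuples, which is the shape EasyOCR's `readtext` returns by default, and the helper names `detections_to_text` and `build_messages` are hypothetical.

```python
from typing import List, Sequence, Tuple

# One OCR detection: (four corner points, text, confidence).
# This mirrors the tuples EasyOCR's `readtext` returns by default; treat the
# exact shape as an assumption if you use a different OCR engine.
Detection = Tuple[Sequence[Sequence[float]], str, float]


def detections_to_text(detections: List[Detection], min_conf: float = 0.0,
                       line_tol: float = 10.0) -> str:
    """Join OCR boxes into text in rough reading order (top-to-bottom, left-to-right)."""
    items = []
    for box, text, conf in detections:
        if conf < min_conf:
            continue
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        items.append((sum(ys) / len(ys), min(xs), text))
    items.sort()  # by vertical center, then by left edge

    lines, current, current_y = [], [], None
    for y, x, text in items:
        # Start a new line when the vertical center moves by more than
        # line_tol pixels from the previous box.
        if current_y is not None and abs(y - current_y) > line_tol:
            lines.append(" ".join(t for _, t in sorted(current)))
            current = []
        current.append((x, text))
        current_y = y
    if current:
        lines.append(" ".join(t for _, t in sorted(current)))
    return "\n".join(lines)


def build_messages(ocr_text: str, instruction: str) -> list:
    """Wrap an instruction and raw OCR text into a single-turn chat message list."""
    return [{"role": "user", "content": f"{instruction}\n\n{ocr_text}"}]
```

The resulting message list can be passed to a tokenizer's `apply_chat_template`. Grouping boxes by vertical center is a simple heuristic, and multi-column pages or dense tables would likely need a layout-aware ordering instead.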

This approach is particularly effective when you need:

cleaner Korean text
Markdown conversion
table-aware formatting
chart-aware grounded descriptions
JSON field extraction

Test Notebook

A ready-to-run Jupyter notebook is included in this repository for users who want to test the model end-to-end.

Notebook file: qwen35_ocr_pipeline.ipynb

The notebook includes:

environment setup and package installation
model download from Hugging Face
EasyOCR initialization (ko, en)
model loading with transformers
OCR utility functions
single-image test
batch-image test
document-type prompt comparison
OCR box visualization
export to JSON / CSV / TXT
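For readers who want a sense of what the export step might look like outside the notebook, here is a small stand-alone sketch. It is not taken from `qwen35_ocr_pipeline.ipynb`; the function name `export_results` and the record keys `image`, `raw_ocr`, and `refined` are assumptions made for illustration.

```python
import csv
import json
from pathlib import Path


def export_results(records: list, out_dir: str, stem: str = "ocr_results") -> dict:
    """Write records (dicts with 'image', 'raw_ocr', 'refined') to JSON, CSV, and TXT files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {ext: out / f"{stem}.{ext}" for ext in ("json", "csv", "txt")}

    # JSON: keep Korean text readable instead of \uXXXX escapes.
    paths["json"].write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    # CSV: utf-8-sig adds a BOM so spreadsheet tools such as Excel detect UTF-8.
    with paths["csv"].open("w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=["image", "raw_ocr", "refined"])
        writer.writeheader()
        writer.writerows(records)

    # TXT: refined text only, with a blank line between documents.
    paths["txt"].write_text(
        "\n\n".join(r["refined"] for r in records), encoding="utf-8"
    )
    return paths
```

The function returns the three output paths so a caller can log or upload them.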

Packages used in the notebook

pip install \
  "transformers>=4.45.0" \
  accelerate \
  torch \
  huggingface_hub \
  sentencepiece \
  pillow \
  requests \
  easyocr \
  opencv-python-headless \
  numpy

Notebook pipeline overview

Image
  │
  ▼
[ EasyOCR ]
  │
  ▼ raw OCR text
[ Qwen3.5-9B Korean Enterprise OCR ]
  │
  ▼
Refined text / Markdown / JSON / summary output

Example Prompts

1) OCR correction

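OCR 텍스트의 오타와 띄어쓰기를 교정하고, 원문 형식을 최대한 유지하여 깔끔하게 정리해 주세요.
(English: Correct the typos and spacing in the OCR text and tidy it up while preserving the original format as much as possible.)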

2) Markdown conversion

OCR 텍스트를 마크다운 형식(제목, 목록, 테이블 등)으로 구조화하세요.
(English: Structure the OCR text as Markdown, using headings, lists, tables, and so on.)

3) JSON extraction

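OCR 텍스트에서 다음 정보를 추출하여 JSON으로 반환하세요.
필드: supplier, supplier_reg_no, buyer, buyer_reg_no, item, supply_amount, tax, total, date
값이 없으면 null로 처리하고 JSON 코드블록만 출력하세요.
(English: Extract the following information from the OCR text and return it as JSON. Fields: supplier, supplier_reg_no, buyer, buyer_reg_no, item, supply_amount, tax, total, date. Use null for missing values and output only a JSON code block.)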
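A reply to the JSON extraction prompt above still has to be parsed before it can feed automation. The sketch below shows one way to do that, assuming the model returns a fenced JSON block or a bare JSON object; the helper name `parse_json_block` is hypothetical and the field list is copied from the prompt.

```python
import json
import re

# Field names copied from the prompt above.
FIELDS = ["supplier", "supplier_reg_no", "buyer", "buyer_reg_no",
          "item", "supply_amount", "tax", "total", "date"]

# A Markdown code fence, built this way to avoid nesting fences in this example.
FENCE = "`" * 3
FENCED_JSON = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(\{.*?\})\s*" + re.escape(FENCE), re.DOTALL
)


def parse_json_block(model_output: str) -> dict:
    """Extract the first JSON object from a model reply and align it to FIELDS."""
    fenced = FENCED_JSON.search(model_output)
    if fenced:
        # Prefer the contents of a fenced JSON block.
        payload = fenced.group(1)
    else:
        # Otherwise fall back to the outermost {...} span in the raw reply.
        start, end = model_output.find("{"), model_output.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in model output")
        payload = model_output[start:end + 1]
    data = json.loads(payload)
    # Missing keys become None, matching the prompt's "null if absent" rule.
    return {field: data.get(field) for field in FIELDS}
```

Keys that are not in `FIELDS` are dropped, so downstream code always sees the same nine fields regardless of what the model adds.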

4) Document classification

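이 OCR 텍스트가 어떤 종류의 문서인지 분류하고 (영수증/세금계산서/운송장/계약서/회의록/기타), 그 이유와 주요 내용을 설명하세요.
(English: Classify what kind of document this OCR text is (receipt / tax invoice / waybill / contract / meeting minutes / other), and explain the reason and the main content.)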

5) Chart-grounded description

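차트와 표에서 관측 가능한 항목명, 수치, 범례, 축 정보를 바탕으로 내용을 설명하세요.
보이지 않는 수치나 원인은 추측하지 마세요.
(English: Describe the content based on the item names, values, legends, and axis information observable in the chart and table. Do not guess at values or causes that are not visible.)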

Best Practices

To get the best results, we recommend the following:

use a stable upstream OCR engine for raw extraction
keep prompts explicit and task-specific
separate OCR transcription from grounded interpretation
validate numeric consistency when handling tables and charts
use JSON outputs for downstream automation where possible
perform post-processing and QA on high-value enterprise documents

For production workflows, a good pattern is:

OCR → normalization → structured extraction → validation → downstream automation
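The validation stage and the numeric-consistency recommendation above could look like the following sketch for invoice-style records. It reuses the field names from the JSON extraction prompt and assumes integer amounts (for example, Korean won) where `supply_amount + tax` should equal `total`; both the rule and the helper names `to_int` and `check_totals` are illustrative assumptions.

```python
import re
from typing import Optional


def to_int(value) -> Optional[int]:
    """Parse amounts such as '1,100,000원' or 1100000 into an int; None if not parseable."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return int(value)
    # Keep digits and a minus sign only; this assumes amounts without decimals.
    digits = re.sub(r"[^\d-]", "", str(value))
    return int(digits) if re.fullmatch(r"-?\d+", digits) else None


def check_totals(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    supply, tax, total = (to_int(record.get(k)) for k in ("supply_amount", "tax", "total"))
    if None in (supply, tax, total):
        problems.append("missing or unparseable amount field")
    elif supply + tax != total:
        problems.append(f"supply_amount + tax = {supply + tax}, but total = {total}")
    return problems
```

Records that return a non-empty list can be routed to human review instead of straight into downstream automation.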

Limitations

This model has important limitations:

It is not a standalone OCR engine. It works best when paired with an OCR front-end such as EasyOCR.
Complex tables with merged cells may still require downstream normalization.
Charts and visualizations should be handled with grounded prompts to avoid over-interpretation.
Very low-resolution scans, handwriting, stamps, skewed pages, or heavy noise can still degrade output quality.
Domain-specific terms, rare forms, and unusual layouts may require prompt engineering or additional task tuning.

Intended Users

This model is intended for:

enterprise AI teams
OCR pipeline developers
document automation engineers
Korean document processing teams
RAG / extraction / workflow automation teams

Out-of-Scope / Misuse Warnings

This model should not be treated as a substitute for:

legal review
financial audit review
compliance sign-off
human verification in high-risk document workflows

For critical business use, human review and validation are strongly recommended.

Evaluation Note

This repository currently focuses on practical document utility rather than publishing a single academic benchmark score.

Recommended evaluation dimensions for downstream users include:

Korean character accuracy improvement
field-level extraction accuracy
Markdown structure fidelity
table reconstruction quality
numeric consistency
chart-grounded description quality
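Korean character accuracy is commonly measured with character error rate (CER). The sketch below is one possible way to compute it in pure Python; it is offered as an evaluation aid under that assumption and is not a metric reported by this repository.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Computing `cer(reference, raw_ocr_text)` and `cer(reference, refined_text)` on the same pages gives a direct view of how much the refinement step improves character accuracy.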

If you publish this model publicly, consider adding a dedicated evaluation section with benchmark datasets, task-specific metrics, and baseline comparisons.

License

Please specify the final license for this repository and ensure that deployment and redistribution follow:

the base model license/policy
your organization's fine-tuned weights distribution policy
the data governance rules for enterprise document data

Citation

If you use this model in your work, please cite the repository or model page.

@misc{qwen35_kor_enterprise_ocr,
  title        = {Qwen3.5-9B Korean Enterprise Document OCR},
  author       = {AIDX},
  year         = {2026},
  howpublished = {Hugging Face model repository}
}

Contact

For enterprise collaboration, evaluation, or integration inquiries, please update this section with the appropriate project or team contact.

Safetensors
Model size: 9B params
Tensor type: F16

Model tree for SEOKDONG/Qwen3.5-9B-kor-enterprise

Qwen/Qwen3.5-9B-Base (base model) → Qwen/Qwen3.5-9B (fine-tuned) → this model (fine-tuned)