🤗 HuggingFace | 🖥️ Demo | 📄 Technical Report | 🐈 GitHub
Figure 1: Performance comparison on the OmniDocBench v1.5 benchmark. FireRed-OCR achieves state-of-the-art performance among end-to-end solutions, ranking first with a score above 92%.
FireRed-OCR is a systematic framework designed to specialize general Large Vision-Language Models (LVLMs) into high-performance, pixel-precise structural document parsing experts.
General LVLMs frequently suffer from "structural hallucination" (e.g., disordered table rows, invented formulas) when processing complex documents. FireRed-OCR addresses this by shifting the paradigm from "impressionist" text generation to "structural engineering," achieving state-of-the-art (SOTA) results on authoritative benchmarks such as OmniDocBench v1.5.
| Models | Base | Description | Download Link |
|---|---|---|---|
| FireRed-OCR-2B | Qwen3-VL-2B-Instruct | Lightweight version achieving 92.94% Overall on OmniDocBench v1.5. | 🤗 HuggingFace |
The FireRed-OCR framework transforms a general LVLM into a structural expert through a three-stage progressive training strategy.
FireRed-OCR is based on the Qwen3-VL architecture. You can use the following code snippets to generate structured Markdown from document images.
1. Install Dependencies

```shell
pip install transformers
pip install qwen-vl-utils
git clone https://github.com/FireRedTeam/FireRed-OCR.git
cd FireRed-OCR
```
2. Inference
```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from conv_for_infer import generate_conv

# Load the model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    dtype=torch.bfloat16,
    device_map="auto",
)

# We recommend enabling flash_attention_2 for better acceleration and memory
# saving, especially in multi-image and video scenarios.
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "FireRedTeam/FireRed-OCR",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("FireRedTeam/FireRed-OCR")

# Prepare the input conversation for a document image
image_path = "./examples/complex_table.png"
messages = generate_conv(image_path)

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate the output and strip the prompt tokens
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
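For reference, `generate_conv` presumably assembles a Qwen-VL-style chat message list for the processor's chat template. A minimal sketch of an equivalent helper is shown below; the function name `build_ocr_messages` and the default prompt wording are illustrative assumptions, not the model's official prompt, which is produced by the repository's `conv_for_infer.generate_conv`:

```python
def build_ocr_messages(image_path: str,
                       prompt: str = "Convert this document image to structured Markdown.") -> list:
    # Qwen-VL chat format: one user turn containing the image entry
    # followed by a text instruction.
    # NOTE: the default prompt above is a placeholder; use the official
    # generate_conv helper for FireRed-OCR's actual prompt.
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]

messages = build_ocr_messages("./examples/complex_table.png")
```

The resulting list can be passed directly to `processor.apply_chat_template` as in the snippet above.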
We evaluate FireRed-OCR on OmniDocBench v1.5 and FireRedBench. The first table below reports results on OmniDocBench v1.5, the second on FireRedBench, and the third compares representative models across additional benchmarks.
| Model | Overall ↑ | TextEdit ↓ | FormulaCDM ↑ | TableTEDs ↑ | TableTEDS_s ↑ | R-orderEdit ↓ |
|---|---|---|---|---|---|---|
| Pipeline | ||||||
| Dolphin | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| Dolphin-1.5 | 83.21 | 0.092 | 80.78 | 78.06 | 84.10 | 0.080 |
| PP-StructureV3 | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| MonkeyOCR-pro-1.2B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| MonkeyOCR-3B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| MonkeyOCR-pro-3B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| MinerU2.5 | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| PaddleOCR-VL | 92.86 | 0.035 | 91.22 | 90.89 | 94.76 | 0.043 |
| PaddleOCR-VL-1.5 | 94.50 | 0.035 | 94.21 | 92.76 | 95.79 | 0.042 |
| GLM-OCR | 94.60 | - | - | - | - | - |
| End-to-end | ||||||
| OCRFlux-3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| Mistral OCR | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
| InternVL3-76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
| POINTS-Reader | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| olmOCR-7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| Qwen3-VL-2B | 81.87 | 0.100 | 85.87 | 69.77 | 74.37 | 0.115 |
| InternVL3.5-241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| GPT-5.2 | 85.50 | 0.123 | 86.11 | 82.66 | 87.35 | 0.099 |
| MinerU2-VLM | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| Nanonets-OCR-s | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| Qwen2.5-VL-72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| DeepSeek-OCR | 87.36 | 0.073 | 84.14 | 85.25 | 89.01 | 0.085 |
| dots.ocr | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| OCRVerse | 88.56 | 0.058 | 86.91 | 84.55 | 88.45 | 0.071 |
| Qwen3-VL-235B-A22B | 89.15 | 0.069 | 88.14 | 86.21 | 90.55 | 0.068 |
| Gemini-3.0 Pro | 90.33 | 0.065 | 89.18 | 88.28 | 90.29 | 0.071 |
| Qwen3.5-397B-A17B | 90.80 | - | - | - | - | - |
| DeepSeek-OCR 2 | 91.09 | 0.048 | 90.31 | 87.75 | 92.06 | 0.057 |
| FireRed-OCR-2B | 92.94 | 0.032 | 91.71 | 90.31 | 93.81 | 0.041 |
| Model | Overall ↑ | TextEdit ↓ | FormulaCDM ↑ | TableTEDs ↑ | TableTEDS_s ↑ | R-orderEdit ↓ |
|---|---|---|---|---|---|---|
| GPT-5.2🔒 | 68.09 | 0.238 | 66.33 | 61.74 | 68.00 | 0.38 |
| Gemini-3.0 Pro🔒 | 79.68 | 0.169 | 80.11 | 75.82 | 82.73 | 0.353 |
| Pipeline | ||||||
| GLM-OCR | 74.33 | 0.309 | 82.53 | 71.35 | 79.93 | 0.456 |
| PaddleOCR-VL-1.5 | 76.47 | 0.291 | 92.37 | 66.15 | 74.39 | 0.453 |
| End-to-end | ||||||
| DeepSeek-OCR 2 | 61.61 | 0.290 | 58.78 | 55.06 | 59.42 | 0.437 |
| dots.ocr | 72.93 | 0.240 | 82.53 | 60.25 | 64.08 | 0.419 |
| Qwen3-VL-2B-Instruct | 65.58 | 0.283 | 75.19 | 49.85 | 55.66 | 0.388 |
| FireRed-OCR-2B | 74.62 | 0.248 | 83.02 | 65.63 | 72.30 | 0.430 |
| Model | OmniDocBench v1.5 | FireRedBench | OCRBench(TextRec) | TEDS_TEST | PubTabNet |
|---|---|---|---|---|---|
| GPT-5.2🔒 | 85.50 | 68.09 | 93.0 | 67.6 | 84.4 |
| Gemini-3.0 Pro🔒 | 90.33 | 79.68 | 91.9 | 81.8 | 91.4 |
| Pipeline | |||||
| MinerU2.5 | 90.67 | - | - | 85.4 | 88.4 |
| PaddleOCR-VL-1.5 | 94.50 | 76.47 | 53.5 / 87.0 | 83.3 | 84.6 |
| GLM-OCR | 94.60 | 74.33 | 61.0 / 95.0 | 86.0 | 85.2 |
| End-to-end | |||||
| dots.ocr | 88.41 | 72.93 | 92.1 | 62.4 | 71.0 |
| DeepSeek-OCR 2 | 91.09 | 61.61 | 48.5 | - | - |
| FireRed-OCR-2B | 92.94 | 74.62 | 93.5 | 80.6 | 77.0 |
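The TextEdit and R-orderEdit columns above are normalized edit distances, so lower is better. A minimal sketch of the character-level variant is given below, assuming the common Levenshtein normalization by the longer sequence's length; the exact normalization used by OmniDocBench may differ:

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length (0.0 = exact match)."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    # Standard two-row dynamic-programming Levenshtein computation.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / max(m, n)

# e.g. "kitten" vs "sitting" has edit distance 3 over max length 7
score = normalized_edit_distance("kitten", "sitting")
```

The table metrics (TableTEDS) instead use tree edit distance over the parsed HTML table structure, which this character-level sketch does not cover.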
The code and the weights of FireRed-OCR are licensed under Apache 2.0.
If you find our work useful, please consider citing:
```bibtex
@article{fireredocr,
  title={FireRed-OCR Technical Report},
  author={Super Intelligence Team, Xiaohongshu Inc.},
  year={202X},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://github.com/FireRedTeam/FireRed-OCR}
}
```
FireRed-OCR is a technical tool designed for document digitization and structural parsing.
We would like to thank the developers of the excellent open-source projects we build on, including Qwen-VL, PaddleOCR, and olmOCR, as well as the broader OCR community.