🤗 HuggingFace | 🖥️ Demo | 📄 Technical Report | 🐈 GitHub
Figure 1: Performance comparison on the OmniDocBench v1.5 benchmark. FireRed-OCR achieves state-of-the-art performance among end-to-end solutions, ranking first with a score above 92%.
FireRed-OCR is a systematic framework designed to specialize general Large Vision-Language Models (LVLMs) into high-performance, pixel-precise structural document parsing experts.
General LVLMs frequently suffer from "structural hallucination" (e.g., disordered table rows, invented formulas) when processing complex documents. FireRed-OCR addresses this by shifting the paradigm from "impressionist" text generation to "structural engineering," achieving state-of-the-art (SOTA) results on authoritative benchmarks such as OmniDocBench v1.5.
| Models | Base | Description | Download Link |
|---|---|---|---|
| FireRed-OCR-2B | Qwen3-VL-2B-Instruct | Lightweight version achieving 92.94% Overall on OmniDocBench v1.5. | 🤗 HuggingFace |
The FireRed-OCR framework transforms a general LVLM into a structural expert through a three-stage progressive training strategy.
FireRed-OCR is based on the Qwen3-VL architecture. You can use the following code snippets to generate structured Markdown from document images.
1. Install Dependencies

```shell
pip install transformers
pip install qwen-vl-utils
git clone https://github.com/FireRedTeam/FireRed-OCR.git
cd FireRed-OCR
```
2. Inference
```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from conv_for_infer import generate_conv

# Load the model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    dtype=torch.bfloat16,
    device_map="auto",
)

# We recommend enabling flash_attention_2 for better acceleration and memory
# saving, especially in multi-image and video scenarios.
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "FireRedTeam/FireRed-OCR",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("FireRedTeam/FireRed-OCR")

# Prepare the input conversation for a document image
image_path = "./examples/complex_table.png"
messages = generate_conv(image_path)

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate the output and strip the prompt tokens
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
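For reference, `generate_conv` presumably assembles a Qwen-VL-style chat message list for the processor's chat template. A minimal sketch of an equivalent helper is shown below; the function name `build_ocr_messages` and the default prompt wording are illustrative assumptions, not the model's official prompt, which is produced by the repository's `conv_for_infer.generate_conv`:

```python
def build_ocr_messages(image_path: str,
                       prompt: str = "Convert this document image to structured Markdown.") -> list:
    # Qwen-VL chat format: one user turn containing the image entry
    # followed by a text instruction.
    # NOTE: the default prompt above is a placeholder; use the official
    # generate_conv helper for FireRed-OCR's actual prompt.
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]

messages = build_ocr_messages("./examples/complex_table.png")
```

The resulting list can be passed directly to `processor.apply_chat_template` as in the snippet above.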
We evaluate FireRed-OCR on OmniDocBench v1.5 and FireRedBench. The first table below reports results on OmniDocBench v1.5, the second on FireRedBench, and the third compares representative models across additional benchmarks.
| Model | Overall ↑ | TextEdit ↓ | FormulaCDM ↑ | TableTEDs ↑ | TableTEDS_s ↑ | R-orderEdit ↓ |
|---|---|---|---|---|---|---|
| Pipeline | ||||||
| Dolphin | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| Dolphin-1.5 | 83.21 | 0.092 | 80.78 | 78.06 | 84.10 | 0.080 |
| PP-StructureV3 | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| MonkeyOCR-pro-1.2B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| MonkeyOCR-3B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| MonkeyOCR-pro-3B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| MinerU2.5 | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| PaddleOCR-VL | 92.86 | 0.035 | 91.22 | 90.89 | 94.76 | 0.043 |
| PaddleOCR-VL-1.5 | 94.50 | 0.035 | 94.21 | 92.76 | 95.79 | 0.042 |
| GLM-OCR | 94.60 | - | - | - | - | - |
| End-to-end | ||||||
| OCRFlux-3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| Mistral OCR | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
| InternVL3-76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
| POINTS-Reader | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| olmOCR-7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| Qwen3-VL-2B | 81.87 | 0.100 | 85.87 | 69.77 | 74.37 | 0.115 |
| InternVL3.5-241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| GPT-5.2 | 85.50 | 0.123 | 86.11 | 82.66 | 87.35 | 0.099 |
| MinerU2-VLM | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| Nanonets-OCR-s | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| Qwen2.5-VL-72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| DeepSeek-OCR | 87.36 | 0.073 | 84.14 | 85.25 | 89.01 | 0.085 |
| dots.ocr | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| OCRVerse | 88.56 | 0.058 | 86.91 | 84.55 | 88.45 | 0.071 |
| Qwen3-VL-235B-A22B | 89.15 | 0.069 | 88.14 | 86.21 | 90.55 | 0.068 |
| Gemini-3.0 Pro | 90.33 | 0.065 | 89.18 | 88.28 | 90.29 | 0.071 |
| Qwen3.5-397B-A17B | 90.80 | - | - | - | - | - |
| DeepSeek-OCR 2 | 91.09 | 0.048 | 90.31 | 87.75 | 92.06 | 0.057 |
| FireRed-OCR-2B | 92.94 | 0.032 | 91.71 | 90.31 | 93.81 | 0.041 |
| Model | Overall ↑ | TextEdit ↓ | FormulaCDM ↑ | TableTEDs ↑ | TableTEDS_s ↑ | R-orderEdit ↓ |
|---|---|---|---|---|---|---|
| GPT-5.2🔒 | 68.09 | 0.238 | 66.33 | 61.74 | 68.00 | 0.38 |
| Gemini-3.0 Pro🔒 | 79.68 | 0.169 | 80.11 | 75.82 | 82.73 | 0.353 |
| Pipeline | ||||||
| GLM-OCR | 74.33 | 0.309 | 82.53 | 71.35 | 79.93 | 0.456 |
| PaddleOCR-VL-1.5 | 76.47 | 0.291 | 92.37 | 66.15 | 74.39 | 0.453 |
| End-to-end | ||||||
| DeepSeek-OCR 2 | 61.61 | 0.290 | 58.78 | 55.06 | 59.42 | 0.437 |
| dots.ocr | 72.93 | 0.240 | 82.53 | 60.25 | 64.08 | 0.419 |
| Qwen3-VL-2B-Instruct | 65.58 | 0.283 | 75.19 | 49.85 | 55.66 | 0.388 |
| FireRed-OCR-2B | 74.62 | 0.248 | 83.02 | 65.63 | 72.30 | 0.430 |
| Model | OmniDocBench v1.5 | FireRedBench | OCRBench(TextRec) | TEDS_TEST | PubTabNet |
|---|---|---|---|---|---|
| GPT-5.2🔒 | 85.50 | 68.09 | 93.0 | 67.6 | 84.4 |
| Gemini-3.0 Pro🔒 | 90.33 | 79.68 | 91.9 | 81.8 | 91.4 |
| Pipeline | |||||
| MinerU2.5 | 90.67 | - | - | 85.4 | 88.4 |
| PaddleOCR-VL-1.5 | 94.50 | 76.47 | 53.5 / 87.0 | 83.3 | 84.6 |
| GLM-OCR | 94.60 | 74.33 | 61.0 / 95.0 | 86.0 | 85.2 |
| End-to-end | |||||
| dots.ocr | 88.41 | 72.93 | 92.1 | 62.4 | 71.0 |
| DeepSeek-OCR 2 | 91.09 | 61.61 | 48.5 | - | - |
| FireRed-OCR-2B | 92.94 | 74.62 | 93.5 | 80.6 | 77.0 |
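The TextEdit and R-orderEdit columns above are normalized edit distances, so lower is better. A minimal sketch of the character-level variant is given below, assuming the common Levenshtein normalization by the longer sequence's length; the exact normalization used by OmniDocBench may differ:

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length (0.0 = exact match)."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    # Standard two-row dynamic-programming Levenshtein computation.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / max(m, n)

# e.g. "kitten" vs "sitting" has edit distance 3 over max length 7
score = normalized_edit_distance("kitten", "sitting")
```

The table metrics (TableTEDS) instead use tree edit distance over the parsed HTML table structure, which this character-level sketch does not cover.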
The code and the weights of FireRed-OCR are licensed under Apache 2.0.
If you find our work useful, please consider citing:
```bibtex
@article{fireredocr,
  title={FireRed-OCR Technical Report},
  author={Super Intelligence Team, Xiaohongshu Inc.},
  year={202X},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://github.com/FireRedTeam/FireRed-OCR}
}
```
FireRed-OCR is a technical tool designed for document digitization and structural parsing.
We would like to thank the developers of the excellent open-source projects we build on, including Qwen-VL, PaddleOCR, and olmOCR, as well as the broader OCR community.