Sarashina2.2-OCR

Sarashina2.2-OCR by SB Intuitions is an image-to-text model with OCR capabilities. Understand and compare its OCR features, benchmarks, and capabilities.

Comparison

| Feature | Sarashina2.2-OCR | Interfaze |
| --- | --- | --- |
| Input Modalities | image | image, text, audio, video, document |
| Native OCR | Yes | Yes |
| Long Document Processing | No | Yes |
| Language Support | unknown | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | No | Yes |
| Context Input Size | unknown | 1M |
| Tool Calling | No | Tool calling supported, plus built-in browser, code execution, and web search |

OCR Capabilities

| Feature | Sarashina2.2-OCR | Interfaze |
| --- | --- | --- |
| Text Bounding Boxes | Yes | Yes |
| Confidence Scores | No | Yes |
| Dense Image Processing | Yes | Yes |
| Low Quality Images | No | Yes |
| Handwritten Text | No | Yes |
| Charts, Tables & Equations | Yes | Yes |

Scaling

| Feature | Sarashina2.2-OCR | Interfaze |
| --- | --- | --- |
| Scaling | Self-hosted / provider-hosted with quantization | Unlimited |

View model card on Hugging Face

About the model

Sarashina2.2-OCR is an end-to-end 3B-parameter OCR model developed by SB Intuitions, specifically tailored for parsing Japanese and English documents.

The model is refined by human preference optimization to enable intuitive document parsing, and excels at converting a wide range of documents—including vertical Japanese text—into Markdown format while maintaining a natural reading order.

Key Features

Beyond standard text extraction, Sarashina2.2-OCR reconstructs documents into naturally structured Markdown, accurately converting complex elements into the following formats:

  • 📊 Tables: Reconstructs tabular layouts into plain HTML format.

  • 📐 Math Formulas: Transcribes mathematical equations directly into standard LaTeX format.

  • 🖼️ Graphics: Detects visual components (e.g., images, charts) and indicates their positions using bounding boxes in the format <bbox>[(x1, y1), (x2, y2)]</bbox>, using normalized integer coordinates (0–1000) with a top-left origin.
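As a minimal sketch of how downstream code might consume this output, the helper below extracts `<bbox>[(x1, y1), (x2, y2)]</bbox>` spans and rescales the normalized 0–1000 coordinates to pixel coordinates. The function name is hypothetical and the regex assumes exactly the tag format described above:

```python
import re

# Matches <bbox>[(x1, y1), (x2, y2)]</bbox> with integer coordinates.
_BBOX_RE = re.compile(r"<bbox>\[\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)\]</bbox>")

def parse_bboxes(text: str, img_w: int, img_h: int) -> list[tuple[float, float, float, float]]:
    """Return pixel-space (x1, y1, x2, y2) boxes from model output.

    Coordinates in the output are normalized to 0-1000 with a top-left
    origin, so each value is rescaled by the actual image dimensions.
    """
    boxes = []
    for m in _BBOX_RE.finditer(text):
        x1, y1, x2, y2 = map(int, m.groups())
        boxes.append((x1 * img_w / 1000, y1 * img_h / 1000,
                      x2 * img_w / 1000, y2 * img_h / 1000))
    return boxes
```

For a 1000×500 image, `<bbox>[(100, 200), (500, 800)]</bbox>` would map to the pixel box `(100.0, 100.0, 500.0, 400.0)`.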

Training Summary

Sarashina2.2-OCR integrates a SigLIP2-based vision encoder with the Sarashina2.2-3B-Instruct language model, and was trained through the following pipeline:

1. Pre-training on a general image-text dataset:

To build basic image understanding and prepare for the high-resolution document parsing that boosts downstream OCR performance, we expanded Sarashina2.2-Vision-3B's pre-training into the following four-substage pipeline, upscaling the resolution to ~2.5M pixels:

  1. 🔌 Projector warmup: bridging the gap between the embedding spaces of the LLM and vision encoder.

  2. 👁️ Vision encoder pre-training: enhancing image comprehension, especially for understanding Japan-specific images and text.

  3. 🔥 Full-parameter pre-training: enhancing the model's unified understanding of images and language using interleaved data.

  4. 🔍 High-resolution continual pre-training: expanding the maximum resolution to capture fine-grained details and dense text in complex documents.

We also mixed in a large amount of OCR and grounding data from the start to build basic document understanding early on.

2. Supervised fine-tuning (SFT) on a large-scale OCR dataset:

We fine-tuned the model on diverse Japanese and English documents so it can recognize complex layouts and output naturally structured Markdown. We heavily used synthetic datasets alongside open OCR data, keeping the resolution at ~2.5M pixels.

3. Preference Optimization with manually annotated OCR datasets:

Finally, we applied Mixed Preference Optimization (MPO) to achieve more natural reading order understanding for documents with complex layouts. We identified two major challenges when applying it to long-context OCR tasks:

  1. 💸 High human-annotation costs: High-quality OCR data with dense, complex layouts requires a lot of time to annotate manually, making large-scale data collection difficult.

  2. 🧩 Lack of effective data for late-stage errors: A common approach to prepare negative examples is to sample directly from the model. In long-context OCR tasks, an early mistake in reading order can corrupt the entire remaining prediction, making it difficult to obtain preference pairs that target late-stage errors.

To overcome these, we augmented the negative examples of preference pairs by feeding the model the first N ground-truth paragraphs and having it complete the rest of the content. For each human-annotated sample, we applied this at every paragraph position N to obtain multiple negative examples.
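The augmentation described above can be sketched as follows. This is illustrative only: `complete_fn` is a hypothetical stand-in for sampling a continuation from the model, and each returned negative would be paired with the full ground-truth transcript as the positive example:

```python
def build_negatives(gt_paragraphs: list[str], complete_fn) -> list[list[str]]:
    """Build negative examples for preference optimization.

    For every prefix length N, feed the first N ground-truth paragraphs
    and let the model complete the rest. Because the prefix is correct,
    any errors in the completion occur at or after paragraph N, yielding
    negatives that target late-stage reading-order mistakes.
    """
    negatives = []
    for n in range(len(gt_paragraphs)):
        prefix = gt_paragraphs[:n]
        continuation = complete_fn(prefix)  # model-sampled completion
        negatives.append(prefix + continuation)
    return negatives
```

Each human-annotated document of P paragraphs thus yields P candidate negatives instead of one, amortizing the annotation cost.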

Benchmark Results

Sarashina2.2-OCR delivers highly competitive overall performance among the end-to-end OCR models, achieving the best scores on VJRODa despite its compact 3B-parameter size, while demonstrating strong capabilities in the Math and Table categories of olmOCR-bench.

VJRODa

VJRODa evaluates OCR capabilities for Japanese documents, particularly focusing on complex layouts and vertical text reading order.

| Model | CER (↓) | BLEU (↑) |
| --- | --- | --- |
| gpt-5-mini-2025-08-07 | 72.4 | 23.6 |
| Qwen3.5-4B (non-thinking) | 86.1 | 47.8 |
| KARAKURI VL 32B Instruct 2507 | 280 | 14.1 |
| LightOnOCR-2-1B | 158 | 28.9 |
| dots.ocr | 40.1 | 71.5 |
| Sarashina2.2-OCR | 22.6 | 79.9 |
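As context for the CER column: character error rate is character-level edit distance divided by reference length, reported here as a percentage, and it can exceed 100 when a model inserts large amounts of spurious text. A standard implementation might look like this (illustrative, not necessarily the benchmark's exact scorer):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a percentage: edit distance / len(reference) * 100."""
    m, n = len(reference), len(hypothesis)
    # Single-row Levenshtein dynamic program over characters.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                                    # deletion
                dp[j - 1] + 1,                                # insertion
                prev + (reference[i - 1] != hypothesis[j - 1]),  # substitution/match
            )
            prev = cur
    return dp[n] / max(m, 1) * 100
```

For example, `cer("abc", "abd")` is one substitution over three characters, about 33.3.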

olmOCR-bench

A comprehensive benchmark designed to evaluate document parsing capabilities across diverse and complex structures, such as mathematical equations, multi-column layouts, and tables.

| Model | arXiv Math | Headers Footers | Long Tiny Text | Multi Column | Old Scans | Old Scans Math | Table Tests | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LightOnOCR-2-1B | 0.890 | 0.204 | 0.889 | 0.848 | 0.418 | 0.865 | 0.887 | 0.773 |
| dots.ocr | 0.674 | 0.849 | 0.921 | 0.803 | 0.411 | 0.603 | 0.823 | 0.722 |
| Sarashina2.2-OCR | 0.778 | 0.291 | 0.846 | 0.752 | 0.323 | 0.528 | 0.829 | 0.683 |

Usage with Transformers

```shell
pip install transformers==4.57.1 torch torchvision pillow protobuf sentencepiece accelerate
```

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, TextStreamer, set_seed

model_path = "sbintuitions/sarashina2.2-ocr"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
set_seed(42)

image_url = "https://huggingface.co/sbintuitions/sarashina2.2-ocr/resolve/main/assets/sample1.jpeg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
message = [
    {
        "role": "user",
        "content": [{"type": "image", "image": image}],
    }
]
inputs = processor.apply_chat_template(
    message,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
streamer = TextStreamer(processor, skip_prompt=True, skip_special_tokens=True)

output_ids = model.generate(
    **inputs,
    max_new_tokens=6000,
    temperature=0.0,
    top_p=0.95,
    repetition_penalty=1.2,
    use_cache=True,
    streamer=streamer,
)
```

Examples

1. Vertical Japanese document parsing

The following image visualizes the output bounding boxes in red:

*https://warp.ndl.go.jp/web/20200609034301/www.town.suo-oshima.lg.jp/data/open/cnt/3/669/1/R01.10.P11.pdf?20200413190829

2. Complex business slide parsing

*https://www.aec.go.jp/kettei/kettei/20230220_3.pdf

3. Tabular layout parsing

*https://mhcc.maryland.gov/mhcc/pages/home/workgroups/documents/cardiac/Standing%20Advisory%20Committee%20Members%209-20-19.pdf

4. Mathematical formula parsing

*https://arxiv.org/pdf/2503.09208

LICENSE

MIT License

Citation

```
@misc{sarashinaOCR2026,
  title  = {Sarashina2.2-OCR: End-to-end OCR Model for Japanese Document Parsing},
  author = {Takumi Takada and Toshiyuki Tanaka and Kohei Uehara and Mikihiro Tanaka and Alexis Vallet and Aman Jain and Ryuichiro Hataya and Seitaro Shinagawa and Yuto Imai and Teppei Suzuki},
  year   = {2026},
  url    = {https://huggingface.co/sbintuitions/sarashina2.2-ocr}
}
```
