Sarashina2.2-OCR

Sarashina2.2-OCR by SB Intuitions is an image-to-text model with OCR capabilities. Understand and compare its OCR features, benchmarks, and capabilities.

Comparison

| Feature | Sarashina2.2-OCR | Interfaze |
| --- | --- | --- |
| Input Modalities | image | image, text, audio, video, document |
| Native OCR | Yes | Yes |
| Long Document Processing | No | Yes |
| Language Support | unknown | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | No | Yes |
| Context Input Size | unknown | 1M |
| Tool Calling | No | Tool calling supported, plus built-in browser, code execution, and web search |

OCR Capabilities

| Feature | Sarashina2.2-OCR | Interfaze |
| --- | --- | --- |
| Text Bounding Boxes | Yes | Yes |
| Confidence Scores | No | Yes |
| Dense Image Processing | Yes | Yes |
| Low Quality Images | No | Yes |
| Handwritten Text | No | Yes |
| Charts, Tables & Equations | Yes | Yes |

Scaling

| Feature | Sarashina2.2-OCR | Interfaze |
| --- | --- | --- |
| Scaling | Self-hosted / provider-hosted with quantization | Unlimited |

View model card on Hugging Face

About the model

Sarashina2.2-OCR is an end-to-end 3B-parameter OCR model developed by SB Intuitions, specifically tailored for parsing Japanese and English documents.

The model is refined by human preference optimization to enable intuitive document parsing, and excels at converting a wide range of documents—including vertical Japanese text—into Markdown format while maintaining a natural reading order.

Key Features

Beyond standard text extraction, Sarashina2.2-OCR reconstructs documents into naturally structured Markdown, accurately converting complex elements into the following formats:

  • 📊 Tables: Reconstructs tabular layouts into plain HTML format.

  • 📐 Math Formulas: Transcribes mathematical equations directly into standard LaTeX format.

  • 🖼️ Graphics: Detects visual components (e.g., images, charts) and indicates their positions using bounding boxes in the format <bbox>[(x1, y1), (x2, y2)]</bbox>, using normalized integer coordinates (0–1000) with a top-left origin.
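As a minimal sketch of how downstream code might consume this output, the helper below extracts `<bbox>[(x1, y1), (x2, y2)]</bbox>` spans and rescales the normalized 0–1000 coordinates to pixel coordinates. The function name is hypothetical and the regex assumes exactly the tag format described above:

```python
import re

# Matches <bbox>[(x1, y1), (x2, y2)]</bbox> with integer coordinates.
_BBOX_RE = re.compile(r"<bbox>\[\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)\]</bbox>")

def parse_bboxes(text: str, img_w: int, img_h: int) -> list[tuple[float, float, float, float]]:
    """Return pixel-space (x1, y1, x2, y2) boxes from model output.

    Coordinates in the output are normalized to 0-1000 with a top-left
    origin, so each value is rescaled by the actual image dimensions.
    """
    boxes = []
    for m in _BBOX_RE.finditer(text):
        x1, y1, x2, y2 = map(int, m.groups())
        boxes.append((x1 * img_w / 1000, y1 * img_h / 1000,
                      x2 * img_w / 1000, y2 * img_h / 1000))
    return boxes
```

For a 1000×500 image, `<bbox>[(100, 200), (500, 800)]</bbox>` would map to the pixel box `(100.0, 100.0, 500.0, 400.0)`.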

Training Summary

Sarashina2.2-OCR integrates a SigLIP2-based vision encoder with the Sarashina2.2-3B-Instruct language model, and was trained through the following pipeline:

1. Pre-training on a general image-text dataset:

To build basic image understanding and prepare for the high-resolution document parsing that boosts downstream OCR performance, we expanded Sarashina2.2-Vision-3B's pre-training into the following four-substage pipeline, upscaling the resolution to ~2.5M pixels:

  1. 🔌 Projector warmup: bridging the gap between the embedding spaces of the LLM and vision encoder.

  2. 👁️ Vision encoder pre-training: enhancing image comprehension, especially for understanding Japan-specific images and text.

  3. 🔥 Full-parameter pre-training: enhancing the model's unified understanding of images and language using interleaved data.

  4. 🔍 High-resolution continual pre-training: expanding the maximum resolution to capture fine-grained details and dense text in complex documents.

We also mixed in a large amount of OCR and grounding data from the start to build basic document understanding early on.

2. Supervised fine-tuning (SFT) on a large-scale OCR dataset:

We fine-tuned the model on diverse Japanese and English documents so it can recognize complex layouts and output naturally structured Markdown. We heavily used synthetic datasets alongside open OCR data, keeping the resolution at ~2.5M pixels.

3. Preference Optimization with manually annotated OCR datasets:

Finally, we applied Mixed Preference Optimization (MPO) to achieve more natural reading order understanding for documents with complex layouts. We identified two major challenges when applying it to long-context OCR tasks:

  1. 💸 High human-annotation costs: High-quality OCR data with dense, complex layouts requires a lot of time to annotate manually, making large-scale data collection difficult.

  2. 🧩 Lack of effective data for late-stage errors: A common approach to prepare negative examples is to sample directly from the model. In long-context OCR tasks, an early mistake in reading order can corrupt the entire remaining prediction, making it difficult to obtain preference pairs that target late-stage errors.

To overcome these, we augmented the negative examples of preference pairs by feeding the model the first N ground-truth paragraphs and having it complete the rest of the content. For each human-annotated sample, we applied this at every paragraph position N to obtain multiple negative examples.
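The augmentation described above can be sketched as follows. This is illustrative only: `complete_fn` is a hypothetical stand-in for sampling a continuation from the model, and each returned negative would be paired with the full ground-truth transcript as the positive example:

```python
def build_negatives(gt_paragraphs: list[str], complete_fn) -> list[list[str]]:
    """Build negative examples for preference optimization.

    For every prefix length N, feed the first N ground-truth paragraphs
    and let the model complete the rest. Because the prefix is correct,
    any errors in the completion occur at or after paragraph N, yielding
    negatives that target late-stage reading-order mistakes.
    """
    negatives = []
    for n in range(len(gt_paragraphs)):
        prefix = gt_paragraphs[:n]
        continuation = complete_fn(prefix)  # model-sampled completion
        negatives.append(prefix + continuation)
    return negatives
```

Each human-annotated document of P paragraphs thus yields P candidate negatives instead of one, amortizing the annotation cost.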

Benchmark Results

Sarashina2.2-OCR delivers highly competitive overall performance among the end-to-end OCR models, achieving the best scores on VJRODa despite its compact 3B-parameter size, while demonstrating strong capabilities in the Math and Table categories of olmOCR-bench.

VJRODa

VJRODa evaluates OCR capabilities for Japanese documents, particularly focusing on complex layouts and vertical text reading order.

| Model | CER (↓) | BLEU (↑) |
| --- | --- | --- |
| gpt-5-mini-2025-08-07 | 72.4 | 23.6 |
| Qwen3.5-4B (non-thinking) | 86.1 | 47.8 |
| KARAKURI VL 32B Instruct 2507 | 280 | 14.1 |
| LightOnOCR-2-1B | 158 | 28.9 |
| dots.ocr | 40.1 | 71.5 |
| Sarashina2.2-OCR | 22.6 | 79.9 |
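As context for the CER column: character error rate is character-level edit distance divided by reference length, reported here as a percentage, and it can exceed 100 when a model inserts large amounts of spurious text. A standard implementation might look like this (illustrative, not necessarily the benchmark's exact scorer):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a percentage: edit distance / len(reference) * 100."""
    m, n = len(reference), len(hypothesis)
    # Single-row Levenshtein dynamic program over characters.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                                    # deletion
                dp[j - 1] + 1,                                # insertion
                prev + (reference[i - 1] != hypothesis[j - 1]),  # substitution/match
            )
            prev = cur
    return dp[n] / max(m, 1) * 100
```

For example, `cer("abc", "abd")` is one substitution over three characters, about 33.3.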

olmOCR-bench

A comprehensive benchmark designed to evaluate document parsing capabilities across diverse and complex structures, such as mathematical equations, multi-column layouts, and tables.

| Model | arXiv Math | Headers Footers | Long Tiny Text | Multi Column | Old Scans | Old Scans Math | Table Tests | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LightOnOCR-2-1B | 0.890 | 0.204 | 0.889 | 0.848 | 0.418 | 0.865 | 0.887 | 0.773 |
| dots.ocr | 0.674 | 0.849 | 0.921 | 0.803 | 0.411 | 0.603 | 0.823 | 0.722 |
| Sarashina2.2-OCR | 0.778 | 0.291 | 0.846 | 0.752 | 0.323 | 0.528 | 0.829 | 0.683 |

Usage with Transformers

```shell
pip install transformers==4.57.1 torch torchvision pillow protobuf sentencepiece accelerate
```

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, TextStreamer, set_seed

model_path = "sbintuitions/sarashina2.2-ocr"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
set_seed(42)

image_url = "https://huggingface.co/sbintuitions/sarashina2.2-ocr/resolve/main/assets/sample1.jpeg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
message = [
    {
        "role": "user",
        "content": [{"type": "image", "image": image}],
    }
]
inputs = processor.apply_chat_template(
    message,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
streamer = TextStreamer(processor, skip_prompt=True, skip_special_tokens=True)

output_ids = model.generate(
    **inputs,
    max_new_tokens=6000,
    temperature=0.0,
    top_p=0.95,
    repetition_penalty=1.2,
    use_cache=True,
    streamer=streamer,
)
```

Examples

1. Vertical Japanese document parsing

The following image visualizes the output bounding boxes in red:

*https://warp.ndl.go.jp/web/20200609034301/www.town.suo-oshima.lg.jp/data/open/cnt/3/669/1/R01.10.P11.pdf?20200413190829

2. Complex business slide parsing

*https://www.aec.go.jp/kettei/kettei/20230220_3.pdf

3. Tabular layout parsing

*https://mhcc.maryland.gov/mhcc/pages/home/workgroups/documents/cardiac/Standing%20Advisory%20Committee%20Members%209-20-19.pdf

4. Mathematical formula parsing

*https://arxiv.org/pdf/2503.09208

LICENSE

MIT License

Citation

```
@misc{sarashinaOCR2026,
  title  = {Sarashina2.2-OCR: End-to-end OCR Model for Japanese Document Parsing},
  author = {Takumi Takada and Toshiyuki Tanaka and Kohei Uehara and Mikihiro Tanaka and Alexis Vallet and Aman Jain and Ryuichiro Hataya and Seitaro Shinagawa and Yuto Imai and Teppei Suzuki},
  year   = {2026},
  url    = {https://huggingface.co/sbintuitions/sarashina2.2-ocr}
}
```
