Sarashina2.2 Vision 3b

Sarashina2.2 Vision 3b, by sbintuitions, is an image-to-text model with multimodal capabilities. The comparison below covers multimodal features, benchmarks, and capabilities.

Comparison

| Feature | Sarashina2.2 Vision 3b | Interfaze |
| --- | --- | --- |
| Input Modalities | image, text | image, text, audio, video, document |
| Native OCR | No | Yes |
| Long Document Processing | No | Yes |
| Language Support | unknown | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | No | Yes |
| Context Input Size | unknown | 1M tokens |
| Tool Calling | No | Yes, plus built-in browser, code execution, and web search |

Scaling

| Feature | Sarashina2.2 Vision 3b | Interfaze |
| --- | --- | --- |
| Scaling | Self-hosted/provider-hosted with quantization | Unlimited |

View model card on Hugging Face

Sarashina2.2-Vision-3B is a Japanese Large Vision Language Model trained by SB Intuitions.

This model is based on Sarashina2.2-3B-Instruct and the image encoder of SigLIP.
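
To see how these components are wired together, one hedged option (standard transformers usage, not something the model card prescribes) is to inspect the checkpoint's configuration; trust_remote_code is required because the model ships custom code:

from transformers import AutoConfig

# Prints the config, including the LLM backbone and vision-encoder settings.
# The exact field names depend on how the authors structured the config,
# so treat this as a sketch rather than a documented interface.
config = AutoConfig.from_pretrained(
    "sbintuitions/sarashina2.2-vision-3b", trust_remote_code=True
)
print(config)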

Model Performance

Japanese Performance

| Model | Params (B) | Business Slide VQA *1 | Heron-Bench *1 | JDocQA *1 | JMMMU |
| --- | --- | --- | --- | --- | --- |
| Sarashina2.2-Vision-3B | 3.8 | 3.932 | 3.214 | 3.327 | 0.486 |
| Qwen2.5-VL-3B-Instruct | 3.8 | 3.516 | 2.000 | 3.019 | 0.450 |
| Qwen3-VL-4B-Instruct | 4.4 | 4.105 | 2.330 | 3.596 | 0.493 |
| InternVL3_5-4B | 4.7 | 3.311 | 1.893 | 2.626 | 0.437 |
| Sarashina2-Vision-14B | 14.4 | 3.110 | 2.184 | - *2 | 0.432 |
| Stockmark-2-VL-100B-beta | 96.5 | 3.973 | 2.563 | 3.168 | - *2 |

*1. gpt-oss-120b was used as the LLM-as-a-Judge.

*2. These scores cannot be measured because some input data exceeds the model's max_position_embeddings.

English Performance

| Model | Params (B) | DocVQA | InfoVQA | RealWorldQA |
| --- | --- | --- | --- | --- |
| Sarashina2.2-Vision-3B | 3.8 | 0.831 | 0.567 | 0.625 |
| Qwen2.5-VL-3B-Instruct | 3.8 | 0.924 | 0.750 | 0.586 |
| Qwen3-VL-4B-Instruct | 4.4 | 0.948 | 0.798 | 0.712 |
| InternVL3_5-4B | 4.7 | 0.823 | 0.541 | 0.553 |
| Sarashina2-Vision-14B | 14.4 | 0.729 | 0.490 | 0.519 |

How to use

1. Install dependencies

pip install transformers==4.57.1 torch torchvision pillow protobuf sentencepiece accelerate
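
Optionally, before loading the model you can sanity-check the environment (a generic check, not part of the original instructions; the inference script below assumes a CUDA device via device_map="cuda"):

import torch
import transformers

# The model card pins transformers==4.57.1.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())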

2. Inference

The following script loads the model and runs inference on a sample image.

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, set_seed


model_path = "sbintuitions/sarashina2.2-vision-3b"


processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
set_seed(42)

image_url = "https://huggingface.co/sbintuitions/sarashina2.2-vision-3b/resolve/main/sample.jpg"
message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_url,
            },
            {
                "type": "text",
                "text": "これはどこで撮った写真ですか?",
            },
        ],
    }
]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <|user|><|prefix|><|file|><|suffix|>これはどこで撮った写真ですか?</s><|assistant|>"""

image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)


output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.2,
)
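# Strip the prompt tokens from each sequence, keeping only the newly generated part.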
generated_ids = [
    output_ids[len(input_ids) :] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""
この写真は、**道後温泉本館(どうごおんせんほんかん)** の入り口を夜景で撮影した写真です。

---
 場所の詳細:
- **名称**:道後温泉本館(Dogo Onsen Honkan)
- **所在地**:〒790-0842 愛媛県松山市道後湯之町1丁目3番5号
- **アクセス**:JR松山駅から市内電車「道後温泉駅」下車すぐ
- **特徴**:日本最古の温泉の一つとして知られる「道後温泉」の中心的な施設。国の重要文化財にも指定されています。

---
 写真の特徴から判断した理由:
- 建物の屋根や装飾が伝統的な和風建築で、「道後温泉」の看板が目立つ。
- 入口の垂れ幕には「道後」「道後」と書かれており、白い鳳凰の模様が描かれている → 道後温泉の象徴的デザイン。
- 夜の照明と石灯籠、提灯風の灯りが日本の温泉地らしい雰囲気を醸し出している。
- 看板に「道後温泉」の文字が明確に表示されている。

---
 補足情報:
道後温泉本館は、夏目漱石の小説『坊っちゃん』の舞台としても有名で、多くの観光客が訪れる人気スポットです。また、2020年にリニューアルされ、現代的な設備も導入されていますが、外観は伝統を残しています。

---
よって、この写真は **愛媛県松山市にある「道後温泉本館」の夜景** です。
"""

Training

Sarashina2.2-Vision-3B was created through the following five-stage training process (a rough, hypothetical sketch of stage 1 follows the list):

PreTrain

  1. Projector Warmup: To bridge the gap between the text and image embedding spaces within the LLM
  2. Vision Encoder Pretraining: To enhance image comprehension, especially for understanding Japan-specific images and text
  3. Full Model Pretraining: To enhance the model's unified understanding of images and language using interleaved data

PostTrain

  1. Supervised Fine-Tuning (SFT): To improve the model's ability to follow instructions and respond appropriately to user prompts
  2. Mixed Preference Optimization (MPO): To align the model's outputs with user preferences, ensuring it generates more desirable responses
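
The training code is not public, so as a rough illustration only: in a typical LLaVA-style setup, stage 1 (Projector Warmup) freezes the vision encoder and the LLM and updates only the projector. The attribute names below (vision_encoder, language_model, projector) are hypothetical:

import torch

# Hypothetical module names; the real model's attributes may differ.
for p in model.vision_encoder.parameters():
    p.requires_grad = False  # SigLIP encoder stays frozen
for p in model.language_model.parameters():
    p.requires_grad = False  # Sarashina2.2-3B backbone stays frozen
for p in model.projector.parameters():
    p.requires_grad = True   # only the image-to-text projector is trained

optimizer = torch.optim.AdamW(model.projector.parameters(), lr=1e-3)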

Limitations

This model has undergone limited safety training. It may therefore generate meaningless sequences, inaccurate content, or biased/objectionable outputs. Before using it, developers should tune the model based on human preferences and safety considerations.

LICENSE

MIT License
