Sarashina2.2 Vision 3b

Sarashina2.2 Vision 3b, by sbintuitions, is an image-to-text model with multimodal capabilities. The comparison below covers multimodal features, benchmarks, and capabilities.

Comparison

| Feature | Sarashina2.2 Vision 3b | Interfaze |
| --- | --- | --- |
| Input Modalities | image, text | image, text, audio, video, document |
| Native OCR | No | Yes |
| Long Document Processing | No | Yes |
| Language Support | unknown | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | No | Yes |
| Context Input Size | unknown | 1M tokens |
| Tool Calling | No | Yes, plus built-in browser, code execution, and web search |

Scaling

| Feature | Sarashina2.2 Vision 3b | Interfaze |
| --- | --- | --- |
| Scaling | Self-hosted/provider-hosted with quantization | Unlimited |

View model card on Hugging Face

Sarashina2.2-Vision-3B is a Japanese Large Vision Language Model trained by SB Intuitions.

This model is based on Sarashina2.2-3B-Instruct and the image encoder of SigLIP.
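
To see how these components are wired together, one hedged option (standard transformers usage, not something the model card prescribes) is to inspect the checkpoint's configuration; trust_remote_code is required because the model ships custom code:

from transformers import AutoConfig

# Prints the config, including the LLM backbone and vision-encoder settings.
# The exact field names depend on how the authors structured the config,
# so treat this as a sketch rather than a documented interface.
config = AutoConfig.from_pretrained(
    "sbintuitions/sarashina2.2-vision-3b", trust_remote_code=True
)
print(config)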

Model Performance

Japanese Performance

| Model | Params (B) | Business Slide VQA *1 | Heron-Bench *1 | JDocQA *1 | JMMMU |
| --- | --- | --- | --- | --- | --- |
| Sarashina2.2-Vision-3B | 3.8 | 3.932 | 3.214 | 3.327 | 0.486 |
| Qwen2.5-VL-3B-Instruct | 3.8 | 3.516 | 2.000 | 3.019 | 0.450 |
| Qwen3-VL-4B-Instruct | 4.4 | 4.105 | 2.330 | 3.596 | 0.493 |
| InternVL3_5-4B | 4.7 | 3.311 | 1.893 | 2.626 | 0.437 |
| Sarashina2-Vision-14B | 14.4 | 3.110 | 2.184 | - *2 | 0.432 |
| Stockmark-2-VL-100B-beta | 96.5 | 3.973 | 2.563 | 3.168 | - *2 |

*1. gpt-oss-120b was used as the LLM-as-a-Judge.

*2. These scores cannot be measured because some input data exceeds the model's max_position_embeddings.

English Performance

| Model | Params (B) | DocVQA | InfoVQA | RealWorldQA |
| --- | --- | --- | --- | --- |
| Sarashina2.2-Vision-3B | 3.8 | 0.831 | 0.567 | 0.625 |
| Qwen2.5-VL-3B-Instruct | 3.8 | 0.924 | 0.750 | 0.586 |
| Qwen3-VL-4B-Instruct | 4.4 | 0.948 | 0.798 | 0.712 |
| InternVL3_5-4B | 4.7 | 0.823 | 0.541 | 0.553 |
| Sarashina2-Vision-14B | 14.4 | 0.729 | 0.490 | 0.519 |

How to use

1. Install dependencies

pip install transformers==4.57.1 torch torchvision pillow protobuf sentencepiece accelerate
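
Optionally, before loading the model you can sanity-check the environment (a generic check, not part of the original instructions; the inference script below assumes a CUDA device via device_map="cuda"):

import torch
import transformers

# The model card pins transformers==4.57.1.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())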

2. Inference

The following script loads the model and runs inference on a sample image.

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, set_seed


model_path = "sbintuitions/sarashina2.2-vision-3b"


processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
set_seed(42)

image_url = "https://huggingface.co/sbintuitions/sarashina2.2-vision-3b/resolve/main/sample.jpg"
message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_url,
            },
            {
                "type": "text",
                "text": "これはどこで撮った写真ですか?",
            },
        ],
    }
]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <|user|><|prefix|><|file|><|suffix|>これはどこで撮った写真ですか?</s><|assistant|>"""

image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)


output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.2,
)
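# Strip the prompt tokens from each sequence, keeping only the newly generated part.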
generated_ids = [
    output_ids[len(input_ids) :] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""
この写真は、**道後温泉本館(どうごおんせんほんかん)** の入り口を夜景で撮影した写真です。

---
 場所の詳細:
- **名称**:道後温泉本館(Dogo Onsen Honkan)
- **所在地**:〒790-0842 愛媛県松山市道後湯之町1丁目3番5号
- **アクセス**:JR松山駅から市内電車「道後温泉駅」下車すぐ
- **特徴**:日本最古の温泉の一つとして知られる「道後温泉」の中心的な施設。国の重要文化財にも指定されています。

---
 写真の特徴から判断した理由:
- 建物の屋根や装飾が伝統的な和風建築で、「道後温泉」の看板が目立つ。
- 入口の垂れ幕には「道後」「道後」と書かれており、白い鳳凰の模様が描かれている → 道後温泉の象徴的デザイン。
- 夜の照明と石灯籠、提灯風の灯りが日本の温泉地らしい雰囲気を醸し出している。
- 看板に「道後温泉」の文字が明確に表示されている。

---
 補足情報:
道後温泉本館は、夏目漱石の小説『坊っちゃん』の舞台としても有名で、多くの観光客が訪れる人気スポットです。また、2020年にリニューアルされ、現代的な設備も導入されていますが、外観は伝統を残しています。

---
よって、この写真は **愛媛県松山市にある「道後温泉本館」の夜景** です。
"""

Training

Sarashina2.2-Vision-3B was created through the following five-stage training process (a rough, hypothetical sketch of stage 1 follows the list):

PreTrain

  1. Projector Warmup: To bridge the gap between the text and image embedding spaces within the LLM
  2. Vision Encoder Pretraining: To enhance image comprehension, especially for understanding Japan-specific images and text
  3. Full Model Pretraining: To enhance the model's unified understanding of images and language using interleaved data

PostTrain

  1. Supervised Fine-Tuning (SFT): To improve the model's ability to follow instructions and respond appropriately to user prompts
  2. Mixed Preference Optimization (MPO): To align the model's outputs with user preferences, ensuring it generates more desirable responses
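
The training code is not public, so as a rough illustration only: in a typical LLaVA-style setup, stage 1 (Projector Warmup) freezes the vision encoder and the LLM and updates only the projector. The attribute names below (vision_encoder, language_model, projector) are hypothetical:

import torch

# Hypothetical module names; the real model's attributes may differ.
for p in model.vision_encoder.parameters():
    p.requires_grad = False  # SigLIP encoder stays frozen
for p in model.language_model.parameters():
    p.requires_grad = False  # Sarashina2.2-3B backbone stays frozen
for p in model.projector.parameters():
    p.requires_grad = True   # only the image-to-text projector is trained

optimizer = torch.optim.AdamW(model.projector.parameters(), lr=1e-3)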

Limitations

This model has undergone limited safety training. It may therefore generate meaningless sequences, inaccurate content, or biased/objectionable outputs. Before using it, developers should tune the model based on human preferences and safety considerations.

LICENSE

MIT License
