
HY Embodied 0.5

HY Embodied 0.5 by Tencent, an image-text-to-text model with multimodal capabilities. Understand and compare its multimodal features, benchmarks, and capabilities.

Comparison

| Feature | HY Embodied 0.5 | Interfaze |
|---|---|---|
| Input Modalities | image, text | image, text, audio, video, document |
| Native OCR | No | Yes |
| Long Document Processing | No | Yes |
| Language Support | unknown | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | No | Yes |
| Context Input Size | 32.8K | 1M |
| Tool Calling | No | Tool calling supported + built-in browser, code execution, and web search |

Scaling

| Feature | HY Embodied 0.5 | Interfaze |
|---|---|---|
| Scaling | Self-hosted/Provider-hosted with quantization | Unlimited |

View model card on Hugging Face

🔥 Updates

  • [2026-04-09] 🚀 We have released HY-Embodied-0.5, featuring the open-sourced HY-Embodied-0.5 MoT-2B weights on Hugging Face along with the official inference code!

📖 Abstract

We introduce HY-Embodied-0.5, a suite of foundation models tailored specifically for real-world embodied intelligence. To bridge the gap between general Vision-Language Models (VLMs) and the strict demands of physical agents, our models are engineered to excel in spatial-temporal visual perception and complex embodied reasoning (prediction, interaction, and planning).

The suite features an innovative Mixture-of-Transformers (MoT) architecture utilizing latent tokens for modality-specific computing, significantly enhancing fine-grained perception. It includes two primary variants: a highly efficient 2B model for edge deployment and a powerful 32B model for complex reasoning. Through a self-evolving post-training paradigm and large-to-small on-policy distillation, our compact MoT-2B outperforms state-of-the-art models of similar size across 16 benchmarks, while the 32B variant achieves frontier-level performance comparable to Gemini 3.0 Pro. Ultimately, HY-Embodied serves as a robust "brain" for Vision-Language-Action (VLA) pipelines, delivering compelling results in real-world physical robot control.

⭐️ Key Features

  • 🧠 Evolved MoT Architecture: Designed for maximum efficiency without sacrificing visual acuity. The MoT-2B variant contains 4B total parameters but requires only 2.2B activated parameters during inference. By emphasizing modality-specific computing in the vision pathway, it achieves the high inference speed of a dense 2B model while delivering superior, fine-grained perceptual representations.
  • 🔗 High-Quality Mixed Chain Reasoning: We introduce an advanced iterative, self-evolving post-training pipeline. By employing on-policy distillation, we successfully transfer the sophisticated step-by-step reasoning, planning, and high-quality "thinking" capabilities from our powerful 32B model directly to the compact 2B variant.
  • 🌍 Large-Scale Embodied Pre-training: Grounded in a massive, specially curated dataset comprising >100 million embodied and spatial-specific data points. Trained on a corpus exceeding 200 billion tokens, the model develops a deep, native understanding of 3D spaces, physical object interactions, and agent dynamics.
  • 🦾 Stronger VLA Application: Beyond standard academic benchmarks, HY-Embodied is engineered to be the core cognitive engine for physical robots. It seamlessly integrates into Vision-Language-Action (VLA) frameworks, acting as a highly robust and capable brain to drive high success rates in complex, real-world robotic control tasks.

📅 Roadmap

  • Transformers Inference
  • vLLM Inference
  • Online Gradio Demo

🛠️ Dependencies and Installation

Prerequisites

  • 🖥️ Operating System: Linux (recommended)
  • 🐍 Python: 3.12+ (recommended and tested)
  • CUDA: 12.6
  • 🔥 PyTorch: 2.8.0
  • 🎮 GPU: NVIDIA GPU with CUDA support

Installation

  1. Install the specific Transformers version required for this model:

     ```shell
     pip install git+https://github.com/huggingface/transformers@9293856c419762ebf98fbe2bd9440f9ce7069f1a
     ```

     Note: We will merge these improvements into the Transformers main branch later.

  2. Install the other dependencies:

     ```shell
     pip install -r requirements.txt
     ```

Quick Start

  1. Clone the repository:

     ```shell
     git clone https://github.com/Tencent-Hunyuan/HY-Embodied
     cd HY-Embodied/
     ```

  2. Install dependencies:

     ```shell
     pip install -r requirements.txt
     ```

  3. Run inference:

     ```shell
     python inference.py
     ```

The example script demonstrates both single generation and batch generation capabilities.

Model Download

The code automatically downloads the model tencent/HY-Embodied-0.5 from Hugging Face Hub. Ensure you have sufficient disk space (8 GB) for the model weights.
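Before triggering the download, you may want to sanity-check free disk space against the roughly 8 GB of weights mentioned above. A minimal, standard-library-only sketch (the `has_enough_space` helper is ours, not part of the repository):

```python
import shutil

def has_enough_space(path: str = ".", required_gb: float = 8.0) -> bool:
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

# Check the current directory; point this at your Hugging Face cache location if it differs.
print(has_enough_space("."))
```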

Hardware Requirements

  • GPU: Recommended for optimal performance (NVIDIA GPU with at least 16GB VRAM)
  • CPU: Supported but slower
  • Memory: At least 16GB RAM recommended
  • Storage: 20GB+ free space for model and dependencies

🚀 Quick Start with Transformers

Basic Inference Example

```python
import os
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_PATH = "tencent/HY-Embodied-0.5"
DEVICE = "cuda"
THINKING_MODE = False  # set True to enable step-by-step "thinking" output
TEMPERATURE = 0.8

processor = AutoProcessor.from_pretrained(MODEL_PATH)

# If a local checkout ships a custom chat template, prefer it.
chat_template_path = os.path.join(MODEL_PATH, "chat_template.jinja")
if os.path.exists(chat_template_path):
    with open(chat_template_path) as f:
        processor.chat_template = f.read()

model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
model.to(DEVICE).eval()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./figures/example.jpg"},
            {"type": "text", "text": "Describe the image in detail."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=THINKING_MODE,
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        use_cache=True,
        temperature=TEMPERATURE,
        do_sample=TEMPERATURE > 0,  # greedy decoding when TEMPERATURE == 0
    )

# Strip the prompt tokens so only the newly generated text is decoded.
output_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```
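The script samples with temperature 0.8 by default; setting `TEMPERATURE` to 0 makes `do_sample` False and yields deterministic greedy decoding. A small sketch of that switch (the `gen_kwargs` helper is hypothetical, not part of the repository):

```python
def gen_kwargs(temperature: float, max_new_tokens: int = 32768) -> dict:
    """Build generate() keyword arguments; temperature 0 means greedy decoding."""
    do_sample = temperature > 0
    kwargs = {"max_new_tokens": max_new_tokens, "use_cache": True, "do_sample": do_sample}
    if do_sample:
        kwargs["temperature"] = temperature  # only meaningful when sampling
    return kwargs

print(gen_kwargs(0.0)["do_sample"])  # greedy: prints False
```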

Batch Inference

```python
import os
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_PATH = "tencent/HY-Embodied-0.5"
DEVICE = "cuda"
THINKING_MODE = False
TEMPERATURE = 0.8

processor = AutoProcessor.from_pretrained(MODEL_PATH)

# If a local checkout ships a custom chat template, prefer it.
chat_template_path = os.path.join(MODEL_PATH, "chat_template.jinja")
if os.path.exists(chat_template_path):
    with open(chat_template_path) as f:
        processor.chat_template = f.read()

model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
model.to(DEVICE).eval()

messages_batch = [
    # Sample A: image + text
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./figures/example.jpg"},
                {"type": "text", "text": "Describe the image in detail."},
            ],
        }
    ],
    # Sample B: text only
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How do I open a fridge?"},
            ],
        }
    ],
]

# Tokenize each conversation separately, then pad them into one batch.
all_inputs = []
for msgs in messages_batch:
    inp = processor.apply_chat_template(
        msgs,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        enable_thinking=THINKING_MODE,
    )
    all_inputs.append(inp)

# Left padding keeps every prompt's final token adjacent to the generated tokens.
batch = processor.pad(all_inputs, padding=True, padding_side="left").to(model.device)

with torch.no_grad():
    batch_generated_ids = model.generate(
        **batch,
        max_new_tokens=32768,
        use_cache=True,
        temperature=TEMPERATURE,
        do_sample=TEMPERATURE > 0,
    )

# With left padding, every row shares the same padded prompt length.
padded_input_len = batch["input_ids"].shape[1]
for i, msgs in enumerate(messages_batch):
    out_ids = batch_generated_ids[i][padded_input_len:]
    print(f"\n--- Sample {i} ---")
    print(processor.decode(out_ids, skip_special_tokens=True))
```
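Left padding is what makes the single slice at `padded_input_len` work: it aligns every prompt's final token at the same position, so everything after that index is generated text. A toy illustration with plain lists (the `left_pad` helper and pad id 0 are illustrative assumptions, not the processor's actual implementation):

```python
def left_pad(seqs: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    """Pad token-id sequences on the left so all rows share one length."""
    width = max(len(s) for s in seqs)
    return [[pad_id] * (width - len(s)) + s for s in seqs]

padded = left_pad([[5, 6, 7], [9]])
print(padded)  # prints [[5, 6, 7], [0, 0, 9]]: every row ends with its real last token
```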

📊 Evaluation

Visual Perception

Note: We evaluated HY-Embodied-0.5 MoT-2B across 22 embodied-relevant benchmarks against models of similar size. For detailed performance metrics and methodology, please refer to our technical report.

Note: We observed that small models from the Qwen3.5 series produce repetitive thinking patterns in some benchmarks, which leads to lower overall results. Therefore, we compare against Qwen3-VL models in our evaluations.

| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
|---|---|---|---|---|---|
| CV-Bench | 89.2 | 80.0 | 85.7 | 86.9 | 88.8 |
| DA-2K | 92.3 | 69.5 | 76.5 | 79.4 | 72.2 |

Embodied Understanding

| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
|---|---|---|---|---|---|
| ERQA | 54.5 | 41.8 | 47.3 | 43.3 | 46.8 |
| EmbSpatial-Bench | 82.8 | 75.9 | 80.7 | 73.8 | 76.2 |
| RoboBench-MCQ | 49.2 | 36.9 | 45.8 | 44.4 | 43.6 |
| RoboBench-Planning | 54.2 | 36.2 | 36.4 | 39.2 | 58.7 |
| RoboSpatial-Home | 55.7 | 45.3 | 63.2 | 62.3 | 61.8 |
| ShareRobot-Aff. | 26.8 | 19.8 | 25.5 | 25.5 | 9.0 |
| ShareRobot-Traj. | 73.3 | 41.6 | 62.2 | 81.4 | 50.6 |
| Ego-Plan2 | 45.5 | 35.5 | 38.8 | 52.6 | 39.9 |

Spatial Understanding

| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
|---|---|---|---|---|---|
| 3DSRBench | 57.0 | 39.9 | 43.9 | 44.8 | 42.0 |
| All-Angles Bench | 55.1 | 42.3 | 46.7 | 43.8 | 49.0 |
| MindCube | 66.3 | 28.4 | 31.0 | 26.9 | 36.2 |
| MMSI-Bench | 33.2 | 23.6 | 25.1 | 20.5 | 31.9 |
| RefSpatial-Bench | 45.8 | 28.9 | 45.3 | 56.0 | 48.0 |
| SAT | 76.7 | 45.3 | 56.7 | 51.3 | 78.7 |
| SIBench-mini | 58.2 | 42.0 | 50.9 | 47.3 | 53.1 |
| SITE-Bench-Image | 62.7 | 52.3 | 61.0 | 57.9 | 49.9 |
| SITE-Bench-Video | 63.5 | 52.2 | 58.0 | 54.8 | 58.9 |
| ViewSpatial | 53.1 | 37.2 | 41.6 | 36.6 | 36.1 |
| VSIBench | 60.5 | 48.0 | 55.2 | 41.7 | 48.5 |
| Where2Place | 68.0 | 45.0 | 59.0 | 65.0 | 63.6 |

Note: Results for HY-Embodied-0.5 MoT-2B are reported in thinking mode, while for all other models, we report the better performance between non-thinking and thinking modes.
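A quick way to summarize such a table is an unweighted per-model mean over its benchmarks. The average below is our own illustration computed from the Spatial Understanding scores for the MoT-2B column, not a figure reported in the technical report:

```python
# HY-Embodied 0.5 MoT-2B scores, copied from the Spatial Understanding table.
mot_2b_spatial = [57.0, 55.1, 66.3, 33.2, 45.8, 76.7,
                  58.2, 62.7, 63.5, 53.1, 60.5, 68.0]

mean = sum(mot_2b_spatial) / len(mot_2b_spatial)
print(round(mean, 2))  # prints 58.34
```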

📚 Citation

If you find this work useful for your research and applications, please cite our paper using the following BibTeX:

```bibtex
@article{tencent2026hyembodied05,
    title={HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents},
    author={Tencent Robotics X and HY Vision Team},
    journal={arXiv preprint arXiv:2604.07430},
    year={2026}
}
```

🙏 Acknowledgements

We thank the Hugging Face community for their support and the open-source contributions that made this implementation possible.
