
HY Embodied 0.5

HY Embodied 0.5 by Tencent, an image-text-to-text model with multimodal capabilities. Understand and compare its multimodal features, benchmarks, and capabilities.

Comparison

| Feature | HY Embodied 0.5 | Interfaze |
|---|---|---|
| Input Modalities | image, text | image, text, audio, video, document |
| Native OCR | No | Yes |
| Long Document Processing | No | Yes |
| Language Support | unknown | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | No | Yes |
| Context Input Size | 32.8K | 1M |
| Tool Calling | No | Tool calling supported + built-in browser, code execution, and web search |

Scaling

| Feature | HY Embodied 0.5 | Interfaze |
|---|---|---|
| Scaling | Self-hosted/Provider-hosted with quantization | Unlimited |

View model card on Hugging Face

🔥 Updates

  • [2026-04-09] 🚀 We have released HY-Embodied-0.5, featuring the open-sourced HY-Embodied-0.5 MoT-2B weights on Hugging Face along with the official inference code!

📖 Abstract

We introduce HY-Embodied-0.5, a suite of foundation models tailored specifically for real-world embodied intelligence. To bridge the gap between general Vision-Language Models (VLMs) and the strict demands of physical agents, our models are engineered to excel in spatial-temporal visual perception and complex embodied reasoning (prediction, interaction, and planning).

The suite features an innovative Mixture-of-Transformers (MoT) architecture utilizing latent tokens for modality-specific computing, significantly enhancing fine-grained perception. It includes two primary variants: a highly efficient 2B model for edge deployment and a powerful 32B model for complex reasoning. Through a self-evolving post-training paradigm and large-to-small on-policy distillation, our compact MoT-2B outperforms state-of-the-art models of similar size across 16 benchmarks, while the 32B variant achieves frontier-level performance comparable to Gemini 3.0 Pro. Ultimately, HY-Embodied serves as a robust "brain" for Vision-Language-Action (VLA) pipelines, delivering compelling results in real-world physical robot control.

⭐️ Key Features

  • 🧠 Evolved MoT Architecture: Designed for maximum efficiency without sacrificing visual acuity. The MoT-2B variant contains 4B total parameters but requires only 2.2B activated parameters during inference. By emphasizing modality-specific computing in the vision pathway, it achieves the high inference speed of a dense 2B model while delivering superior, fine-grained perceptual representations.
  • 🔗 High-Quality Mixed Chain Reasoning: We introduce an advanced iterative, self-evolving post-training pipeline. By employing on-policy distillation, we successfully transfer the sophisticated step-by-step reasoning, planning, and high-quality "thinking" capabilities from our powerful 32B model directly to the compact 2B variant.
  • 🌍 Large-Scale Embodied Pre-training: Grounded in a massive, specially curated dataset comprising >100 million embodied and spatial-specific data points. Trained on a corpus exceeding 200 billion tokens, the model develops a deep, native understanding of 3D spaces, physical object interactions, and agent dynamics.
  • 🦾 Stronger VLA Application: Beyond standard academic benchmarks, HY-Embodied is engineered to be the core cognitive engine for physical robots. It seamlessly integrates into Vision-Language-Action (VLA) frameworks, acting as a highly robust and capable brain to drive high success rates in complex, real-world robotic control tasks.

📅 Roadmap

  • Transformers Inference
  • vLLM Inference
  • Online Gradio Demo

🛠️ Dependencies and Installation

Prerequisites

  • 🖥️ Operating System: Linux (recommended)
  • 🐍 Python: 3.12+ (recommended and tested)
  • CUDA: 12.6
  • 🔥 PyTorch: 2.8.0
  • 🎮 GPU: NVIDIA GPU with CUDA support

Installation

  1. Install the specific Transformers version required for this model:

     ```shell
     pip install git+https://github.com/huggingface/transformers@9293856c419762ebf98fbe2bd9440f9ce7069f1a
     ```

     Note: We will merge these improvements into the Transformers main branch later.

  2. Install the other dependencies:

     ```shell
     pip install -r requirements.txt
     ```

Quick Start

  1. Clone the repository:

     ```shell
     git clone https://github.com/Tencent-Hunyuan/HY-Embodied
     cd HY-Embodied/
     ```

  2. Install dependencies:

     ```shell
     pip install -r requirements.txt
     ```

  3. Run inference:

     ```shell
     python inference.py
     ```

The example script demonstrates both single generation and batch generation capabilities.

Model Download

The code automatically downloads the model tencent/HY-Embodied-0.5 from Hugging Face Hub. Ensure you have sufficient disk space (8 GB) for the model weights.
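Before triggering the download, you may want to sanity-check free disk space against the roughly 8 GB of weights mentioned above. A minimal, standard-library-only sketch (the `has_enough_space` helper is ours, not part of the repository):

```python
import shutil

def has_enough_space(path: str = ".", required_gb: float = 8.0) -> bool:
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

# Check the current directory; point this at your Hugging Face cache location if it differs.
print(has_enough_space("."))
```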

Hardware Requirements

  • GPU: Recommended for optimal performance (NVIDIA GPU with at least 16GB VRAM)
  • CPU: Supported but slower
  • Memory: At least 16GB RAM recommended
  • Storage: 20GB+ free space for model and dependencies

🚀 Quick Start with Transformers

Basic Inference Example

```python
import os
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_PATH = "tencent/HY-Embodied-0.5"
DEVICE = "cuda"
THINKING_MODE = False  # set True to enable step-by-step "thinking" output
TEMPERATURE = 0.8

processor = AutoProcessor.from_pretrained(MODEL_PATH)

# If a local checkout ships a custom chat template, prefer it.
chat_template_path = os.path.join(MODEL_PATH, "chat_template.jinja")
if os.path.exists(chat_template_path):
    with open(chat_template_path) as f:
        processor.chat_template = f.read()

model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
model.to(DEVICE).eval()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./figures/example.jpg"},
            {"type": "text", "text": "Describe the image in detail."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=THINKING_MODE,
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        use_cache=True,
        temperature=TEMPERATURE,
        do_sample=TEMPERATURE > 0,  # greedy decoding when TEMPERATURE == 0
    )

# Strip the prompt tokens so only the newly generated text is decoded.
output_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```
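The script samples with temperature 0.8 by default; setting `TEMPERATURE` to 0 makes `do_sample` False and yields deterministic greedy decoding. A small sketch of that switch (the `gen_kwargs` helper is hypothetical, not part of the repository):

```python
def gen_kwargs(temperature: float, max_new_tokens: int = 32768) -> dict:
    """Build generate() keyword arguments; temperature 0 means greedy decoding."""
    do_sample = temperature > 0
    kwargs = {"max_new_tokens": max_new_tokens, "use_cache": True, "do_sample": do_sample}
    if do_sample:
        kwargs["temperature"] = temperature  # only meaningful when sampling
    return kwargs

print(gen_kwargs(0.0)["do_sample"])  # greedy: prints False
```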

Batch Inference

```python
import os
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_PATH = "tencent/HY-Embodied-0.5"
DEVICE = "cuda"
THINKING_MODE = False
TEMPERATURE = 0.8

processor = AutoProcessor.from_pretrained(MODEL_PATH)

# If a local checkout ships a custom chat template, prefer it.
chat_template_path = os.path.join(MODEL_PATH, "chat_template.jinja")
if os.path.exists(chat_template_path):
    with open(chat_template_path) as f:
        processor.chat_template = f.read()

model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
model.to(DEVICE).eval()

messages_batch = [
    # Sample A: image + text
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./figures/example.jpg"},
                {"type": "text", "text": "Describe the image in detail."},
            ],
        }
    ],
    # Sample B: text only
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How do I open a fridge?"},
            ],
        }
    ],
]

# Tokenize each conversation separately, then pad them into one batch.
all_inputs = []
for msgs in messages_batch:
    inp = processor.apply_chat_template(
        msgs,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        enable_thinking=THINKING_MODE,
    )
    all_inputs.append(inp)

# Left padding keeps every prompt's final token adjacent to the generated tokens.
batch = processor.pad(all_inputs, padding=True, padding_side="left").to(model.device)

with torch.no_grad():
    batch_generated_ids = model.generate(
        **batch,
        max_new_tokens=32768,
        use_cache=True,
        temperature=TEMPERATURE,
        do_sample=TEMPERATURE > 0,
    )

# With left padding, every row shares the same padded prompt length.
padded_input_len = batch["input_ids"].shape[1]
for i, msgs in enumerate(messages_batch):
    out_ids = batch_generated_ids[i][padded_input_len:]
    print(f"\n--- Sample {i} ---")
    print(processor.decode(out_ids, skip_special_tokens=True))
```
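Left padding is what makes the single slice at `padded_input_len` work: it aligns every prompt's final token at the same position, so everything after that index is generated text. A toy illustration with plain lists (the `left_pad` helper and pad id 0 are illustrative assumptions, not the processor's actual implementation):

```python
def left_pad(seqs: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    """Pad token-id sequences on the left so all rows share one length."""
    width = max(len(s) for s in seqs)
    return [[pad_id] * (width - len(s)) + s for s in seqs]

padded = left_pad([[5, 6, 7], [9]])
print(padded)  # prints [[5, 6, 7], [0, 0, 9]]: every row ends with its real last token
```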

📊 Evaluation

Visual Perception

Note: We evaluated HY-Embodied-0.5 MoT-2B across 22 embodied-relevant benchmarks against models of similar size. For detailed performance metrics and methodology, please refer to our technical report.

Note: We observed that small models from the Qwen3.5 series produce repetitive thinking patterns in some benchmarks, which leads to lower overall results. Therefore, we compare against Qwen3-VL models in our evaluations.

| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
|---|---|---|---|---|---|
| CV-Bench | 89.2 | 80.0 | 85.7 | 86.9 | 88.8 |
| DA-2K | 92.3 | 69.5 | 76.5 | 79.4 | 72.2 |

Embodied Understanding

| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
|---|---|---|---|---|---|
| ERQA | 54.5 | 41.8 | 47.3 | 43.3 | 46.8 |
| EmbSpatial-Bench | 82.8 | 75.9 | 80.7 | 73.8 | 76.2 |
| RoboBench-MCQ | 49.2 | 36.9 | 45.8 | 44.4 | 43.6 |
| RoboBench-Planning | 54.2 | 36.2 | 36.4 | 39.2 | 58.7 |
| RoboSpatial-Home | 55.7 | 45.3 | 63.2 | 62.3 | 61.8 |
| ShareRobot-Aff. | 26.8 | 19.8 | 25.5 | 25.5 | 9.0 |
| ShareRobot-Traj. | 73.3 | 41.6 | 62.2 | 81.4 | 50.6 |
| Ego-Plan2 | 45.5 | 35.5 | 38.8 | 52.6 | 39.9 |

Spatial Understanding

| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
|---|---|---|---|---|---|
| 3DSRBench | 57.0 | 39.9 | 43.9 | 44.8 | 42.0 |
| All-Angles Bench | 55.1 | 42.3 | 46.7 | 43.8 | 49.0 |
| MindCube | 66.3 | 28.4 | 31.0 | 26.9 | 36.2 |
| MMSI-Bench | 33.2 | 23.6 | 25.1 | 20.5 | 31.9 |
| RefSpatial-Bench | 45.8 | 28.9 | 45.3 | 56.0 | 48.0 |
| SAT | 76.7 | 45.3 | 56.7 | 51.3 | 78.7 |
| SIBench-mini | 58.2 | 42.0 | 50.9 | 47.3 | 53.1 |
| SITE-Bench-Image | 62.7 | 52.3 | 61.0 | 57.9 | 49.9 |
| SITE-Bench-Video | 63.5 | 52.2 | 58.0 | 54.8 | 58.9 |
| ViewSpatial | 53.1 | 37.2 | 41.6 | 36.6 | 36.1 |
| VSIBench | 60.5 | 48.0 | 55.2 | 41.7 | 48.5 |
| Where2Place | 68.0 | 45.0 | 59.0 | 65.0 | 63.6 |

Note: Results for HY-Embodied-0.5 MoT-2B are reported in thinking mode, while for all other models, we report the better performance between non-thinking and thinking modes.
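A quick way to summarize such a table is an unweighted per-model mean over its benchmarks. The average below is our own illustration computed from the Spatial Understanding scores for the MoT-2B column, not a figure reported in the technical report:

```python
# HY-Embodied 0.5 MoT-2B scores, copied from the Spatial Understanding table.
mot_2b_spatial = [57.0, 55.1, 66.3, 33.2, 45.8, 76.7,
                  58.2, 62.7, 63.5, 53.1, 60.5, 68.0]

mean = sum(mot_2b_spatial) / len(mot_2b_spatial)
print(round(mean, 2))  # prints 58.34
```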

📚 Citation

If you find this work useful for your research and applications, please cite our paper using the following BibTeX:

```bibtex
@article{tencent2026hyembodied05,
    title={HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents},
    author={Tencent Robotics X and HY Vision Team},
    journal={arXiv preprint arXiv:2604.07430},
    year={2026}
}
```

🙏 Acknowledgements

We thank the Hugging Face community for their support and the open-source contributions that made this implementation possible.
