Object detection (NL prompts)
Visual grounding: given a free-form natural-language description, the model must return the bounding box of the object the description refers to. The task is to locate the object, not to classify it.
Acc@0.5: the fraction of predicted bounding boxes whose intersection-over-union (IoU) with the ground-truth box is ≥ 0.5. Higher is better.
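To make the metric concrete, here is a minimal Python sketch of how an Acc@0.5 score could be computed. It is not the evaluation code behind the numbers on this page. The `(x_min, y_min, x_max, y_max)` box format, the one-prediction-per-description pairing, and the function names `iou` and `acc_at_iou` are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (if any).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the paired ground truth is >= threshold.

    Assumes preds[i] and gts[i] belong to the same description.
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

Under these assumptions, a prediction that covers the ground-truth box exactly has an IoU of 1.0 and counts as a hit, while a prediction that does not overlap it at all has an IoU of 0.0 and counts as a miss.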
Every model evaluated on RefCOCO, ranked highest to lowest.
| # | Model | Acc@0.5 |
|---|---|---|
| 1 | Interfaze | 82.1% |
| 2 | Claude-Sonnet-4.6 | 75.5% |
| 3 | Gemini-3-Flash | 75.2% |
| 4 | GPT-5.4-Mini | 67.0% |
| 5 | Grok-4.3 | 25.0% |