
Let's start with a common question we use to test an LLM: "How many r's in strawberry?"
Is a model more intelligent if it counts the r's in strawberry on its own, or if it refuses to count, asks for a sandbox environment, and runs a Python script to count them?
Today, most reasoning models can solve this without any code. But are they truly counting the characters as required, or are we just throwing more compute at the problem?
GPT-4o would fail this test. Take a newer reasoning model, GPT-5.5, and the success rate shoots up. Why? Because now there's reasoning: more iterations of guesses, more compute spent on the same problem, and more tokens used to store mistakes.
Now take the latest model and increase the number of r's in strawberry to 40 or 50. It gets it wrong again, because it isn't truly counting characters; it's making the next best prediction. It sees tokens, not characters, and it predicts tokens.
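To make the token-versus-character gap concrete, here is a minimal sketch. The vocabulary, the token IDs, and the helper names `toyTokenize` and `countChar` are all hypothetical and chosen only for illustration; real tokenizers split words differently. The point is that the model's input is a short list of IDs, while the count lives in the raw characters.

```typescript
// Hypothetical subword vocabulary, for illustration only.
// Real tokenizers use much larger vocabularies and split words differently.
const toyVocab: Record<string, number> = { str: 101, aw: 102, berry: 103 };

// Greedy longest-match tokenizer over the toy vocabulary.
function toyTokenize(text: string): number[] {
  const pieces = Object.keys(toyVocab).sort((a, b) => b.length - a.length);
  const ids: number[] = [];
  let i = 0;
  while (i < text.length) {
    let matched: string | null = null;
    for (const piece of pieces) {
      if (text.substring(i, i + piece.length) === piece) {
        matched = piece;
        break;
      }
    }
    if (matched === null) throw new Error("no token covers: " + text.substring(i));
    ids.push(toyVocab[matched]);
    i += matched.length;
  }
  return ids;
}

// Deterministic character count over the raw characters.
function countChar(text: string, ch: string): number {
  let n = 0;
  for (let i = 0; i < text.length; i++) {
    if (text.charAt(i) === ch) n++;
  }
  return n;
}

console.log(toyTokenize("strawberry")); // three token IDs: 101, 102, 103
console.log(countChar("strawberry", "r")); // 3
```

In this toy setup, ten characters collapse into three IDs, and none of those IDs says how many r's it contains. A loop over the characters answers the question exactly, at any length.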
LLMs rarely reject a task outright. Even when they are unreliable, they will attempt an answer, which leads to hallucinations.
A human, on the other hand, would reject the task or ask for a tool: a scripting environment, or even just the basics, a pen and paper.
Yes.
A model that can recognize its own limitations and explicitly say “I need tool x to solve this.” is often more useful than a model that confidently improvises.
For tasks like character counting, Llama 3 plus a regex tool would easily be more accurate, faster, and cheaper than the Claude Opus 4.7 high-thinking model.
Character counting is just a good example to showcase that the right tool is required for the task, and the point scales to larger problems like OCR, document translation, audio understanding, and more.
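As a rough illustration of how small the "regex tool" side of that comparison can be, here is a sketch of a counting helper. The function names `escapeRegExp` and `countOccurrences` are made up for this example and are not part of any model's toolset.

```typescript
// Escape regex metacharacters so any literal character or substring can be counted.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count non-overlapping, case-sensitive occurrences of `needle` in `text`.
function countOccurrences(text: string, needle: string): number {
  if (needle.length === 0) throw new Error("needle must not be empty");
  const matches = text.match(new RegExp(escapeRegExp(needle), "g"));
  return matches ? matches.length : 0;
}

console.log(countOccurrences("strawberry", "r")); // 3
```

The result is exact whether the word has 3 r's or 50, and it costs no reasoning tokens. The model's only job is to decide to call it.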
Training the model's decoder on a rejection dataset so it learns its own limits. This is counterintuitive to most training practices. Instead of training reasoning tokens to teach the model how to count characters, we teach it that it's bad at counting and that it should refuse unless there's a scripting sandbox available.
Training Interfaze on the custom tools we serve with every request. Interfaze comes with a fully isolated sandbox environment, browsers, Web Index, and more at no extra cost. Each tool use is trained into the model alongside its own limits, the model's capabilities, and its rejection policy.
Interfaze's tools are baked right into the model datasets, which lets us cut tool-calling cost. Calls to sandboxes, browsers, and other built-in tools carry no separate cost beyond the tokens used.
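One way to picture a rejection dataset is as pairs of prompts and targets, where the target depends on which tools are available. The sketch below is purely hypothetical: the `RejectionExample` shape, the `buildCountingExample` helper, and the refusal wording are assumptions made for illustration, not Interfaze's actual training format. Only the `code_execute` tool name is taken from the responses shown in this post.

```typescript
// Hypothetical shapes for illustration only; not Interfaze's actual dataset format.
type Target =
  | { kind: "tool_call"; tool: string; code: string }
  | { kind: "refusal"; message: string };

interface RejectionExample {
  prompt: string;
  availableTools: string[];
  target: Target;
}

// Build one example: delegate to the sandbox when it exists, otherwise refuse.
function buildCountingExample(
  word: string,
  letter: string,
  availableTools: string[]
): RejectionExample {
  const prompt = `How many ${letter}'s are there in ${word}?`;
  const hasSandbox = availableTools.indexOf("code_execute") !== -1;
  const target: Target = hasSandbox
    ? {
        kind: "tool_call",
        tool: "code_execute",
        // JSON.stringify yields a quoted literal that is also valid Python for plain ASCII input.
        code: `print(${JSON.stringify(word)}.count(${JSON.stringify(letter)}))`,
      }
    : {
        kind: "refusal",
        message: "I need a scripting sandbox to count characters reliably.",
      };
  return { prompt, availableTools, target };
}

console.log(buildCountingExample("strawberry", "r", ["code_execute"]).target);
console.log(buildCountingExample("strawberry", "r", []).target);
```

In neither case is the target a guessed number, which is the behavior the training is meant to reinforce.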
OpenAI SDK
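import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

// Assumption: `interfaze` is an OpenAI SDK client pointed at the Interfaze API.
// The env var names below are placeholders, not documented settings.
const interfaze = new OpenAI({
  baseURL: process.env.INTERFAZE_BASE_URL,
  apiKey: process.env.INTERFAZE_API_KEY,
});

// Structured output: a single numeric answer.
const CountingSchema = z.object({
  answer: z.number(),
});

const response = await interfaze.chat.completions.create({
  model: "interfaze-beta",
  messages: [
    {
      role: "user",
      content: "How many r's are there in strawberry?",
    },
  ],
  response_format: zodResponseFormat(CountingSchema, "counting_schema"),
});

console.log(response.choices[0].message.content);

// precontext lists the tool calls the model made; the SDK types do not include it.
//@ts-expect-error precontext is not typed
const precontext = response.precontext;
console.log("Sandbox results:", precontext?.[0]?.result);

object carries the schema response. precontext carries the sandbox call the model made instead of guessing.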
{
  "object": {
    "answer": 3
  },
  "precontext": [
    {
      "name": "code_execute",
      "result": {
        "code_script": "print('strawberry'.count('r'))",
        "language": "python",
        "output": "3\n"
      }
    }
  ]
}

Same prompt, longer string. The model still delegates to the sandbox instead of guessing.
OpenAI SDK
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

// Assumes `interfaze` is an OpenAI SDK client configured for the Interfaze API.
const CountingSchema = z.object({
  answer: z.number(),
});

const response = await interfaze.chat.completions.create({
  model: "interfaze-beta",
  messages: [
    {
      role: "user",
      content:
"How many r's are there in strrrrrrrrrrrawberrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrry?",
    },
  ],
  response_format: zodResponseFormat(CountingSchema, "counting_schema"),
});

console.log(response.choices[0].message.content);

//@ts-expect-error precontext is not typed
const precontext = response.precontext;
console.log("Sandbox results:", precontext?.[0]?.result);

{
  "object": {
    "answer": 49
  },
  "precontext": [
    {
      "name": "code_execute",
      "result": {
"code_script": "print('strrrrrrrrrrrawberrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrry'.count('r'))",
"language": "python",
"output": "49\n",
"is_action_tool": true
      }
    }
  ]
}

Reading web pages is another task where models commonly hallucinate, depending on the quality of the tool provided. Browser tools might return incomplete information, or no information at all, depending on bot blocks and other factors. Many models will confidently fill the gaps with hallucinated guesses.
Knowing how to judge the quality of a result returned by a limited tool is key to rejecting hallucinated results and delegating to the right tool.
OpenAI SDK
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

// Assumes `interfaze` is an OpenAI SDK client configured for the Interfaze API.
// Fields to extract from the profile page.
const LinkedInProfileSchema = z.object({
  first_name: z.string(),
  last_name: z.string(),
  location: z.string(),
  latest_education: z.string(),
  current_job: z.string(),
  followers: z.number(),
});

const response = await interfaze.chat.completions.create({
  model: "interfaze-beta",
  messages: [
    {
      role: "user",
      content: "Extract profile info from https://www.linkedin.com/in/yoeven/",
    },
  ],
  response_format: zodResponseFormat(LinkedInProfileSchema, "linkedin_profile_schema"),
});

console.log(response.choices[0].message.content);

//@ts-expect-error precontext is not typed
const precontext = response.precontext;
console.log("Scraper results:", precontext?.[0]?.result);

{
  "object": {
    "first_name": "Yoeven",
    "last_name": "Khemlani",
    "location": "San Francisco, California, United States",
    "latest_education": "Imperial College London",
    "current_job": "Interfaze (Previously JigsawStack)",
    "followers": 4000
  },
  "precontext": [
    {
      "name": "web_extract",
      "result": {
        "scraped_content": {
          "first_name": ["Yoeven"],
          "last_name": ["Khemlani"],
          "location": ["San Francisco, California, United States"],
          "latest_education": ["Imperial College London"],
          "current_job": ["Interfaze (Previously JigsawStack)"]
        }
      }
    }
  ]
}

This example scrapes a LinkedIn profile. LinkedIn is notoriously difficult to scrape and commonly returns incomplete information, because much of it is hidden behind auth walls.
Interfaze differentiates between results of varying quality, rejects invalid data, and delegates to the right infrastructure to get the job done before returning a result.
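A simple version of that quality check can be sketched as a gate over the scraped fields. This is an illustrative assumption about how such a check might look, not Interfaze's internal logic; `checkScrape` and `GateResult` are hypothetical names, and the field names mirror the schema in the example above.

```typescript
// Hypothetical quality gate; scraped values arrive as arrays of strings per field.
type Scraped = Record<string, string[] | undefined>;

interface GateResult {
  accept: boolean;
  missing: string[];
}

// Accept a scrape only if every required field has at least one non-empty value.
function checkScrape(scraped: Scraped, requiredFields: string[]): GateResult {
  const missing: string[] = [];
  for (const field of requiredFields) {
    const values = scraped[field];
    const hasValue = values !== undefined && values.some((v) => v.trim().length > 0);
    if (!hasValue) missing.push(field);
  }
  return { accept: missing.length === 0, missing };
}

const gate = checkScrape(
  { first_name: ["Yoeven"], last_name: ["Khemlani"], current_job: [""] },
  ["first_name", "last_name", "current_job", "followers"]
);
console.log(gate); // accept is false; missing lists current_job and followers
```

When the gate fails, a caller can retry with a different tool or report the missing fields, rather than letting the model fill them in with a guess.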
The smartest answer a model can give is sometimes "I shouldn't be the one answering this."
Interfaze has four key tools built in: Isolated Sandbox, Smart proxied Browser, Web index and a SQL database.