This post covers why neither a CNN nor a VLM/LLM alone is good enough for OCR, why tool calling might close some of the gap but not all of it, and how fusing the two achieves the best of both worlds.
For the rest of this post, "CNN" means an OCR-specialized detect-and-recognize stack (DBNet + CRNN/SVTR style), not any convolutional model in general.
Both work. Both fail in very specific ways.
| Dimension | Pure CNN OCR | Pure VLM OCR |
|---|---|---|
| Bounding boxes | Pixel-tight polygons per word or line | Coarse regions, often imprecise |
| Confidence scores | Per-character probabilities from the decoder | Token probs on generated text, not visual confidence |
| Determinism | Same image in, same text out | Stochastic, output drifts between runs |
| Speed and cost | Milliseconds on CPU (e.g. PP-OCRv5) | Seconds on GPU for a 72B model |
| Schema flexibility | Rigid, needs custom post-processing per document type | Ask for JSON, markdown, summaries, filtered fields, all in one call |
| Semantic understanding | None, "Total" is just pixels | Knows "Total" = "Grand Total" = "Amount Due", resolves ambiguity |
| Unseen document types | Usually needs per-domain fine-tuning | Zero-shot on almost anything, including handwriting and messy scans |
| Reasoning | None | Multi-turn, e.g. "extract line items, flag duplicates" in one prompt |
| Hallucination | Can't hallucinate, only misreads | Fabricates plausible text on blurry, cropped, low-contrast inputs |
Pure CNN gives you precision without understanding. Pure VLM gives you understanding without precision.
Here's what that looks like in practice.
Pure CNN OCR: every word boxed perfectly, but the output is a giant flat array of text fragments. You know where each token is, not what any of it means.

```json
[
  { "text": "California", "bbox": [40, 30, 310, 90], "confidence": 0.994 },
  { "text": "USA", "bbox": [315, 55, 360, 85], "confidence": 0.981 },
  { "text": "DRIVER", "bbox": [430, 45, 570, 85], "confidence": 0.989 },
  { "text": "LICENSE", "bbox": [580, 45, 740, 85], "confidence": 0.992 },
  { "text": "Y4067081", "bbox": [350, 170, 560, 205], "confidence": 0.986 },
  { "text": "09/12/2027", "bbox": [350, 215, 540, 245], "confidence": 0.978 },
  { "text": "MUÑOZ", "bbox": [350, 260, 470, 295], "confidence": 0.984 },
  { "text": "ESTRADA", "bbox": [475, 260, 620, 295], "confidence": 0.972 },
  { "text": "IVAN", "bbox": [350, 300, 430, 330], "confidence": 0.988 },
  { "text": "ICHET", "bbox": [435, 300, 530, 330], "confidence": 0.974 },
  { "text": "14223", "bbox": [350, 340, 420, 368], "confidence": 0.991 },
  { "text": "BELGATE", "bbox": [425, 340, 540, 368], "confidence": 0.987 },
  { "text": "ST", "bbox": [545, 340, 575, 368], "confidence": 0.995 },
  { "text": "BALDWIN", "bbox": [350, 370, 475, 398], "confidence": 0.989 },
  { "text": "PARK", "bbox": [480, 370, 555, 398], "confidence": 0.993 },
  { "text": "09/12/1987", "bbox": [360, 430, 560, 462], "confidence": 0.982 }
  // ... 30+ more entries
]
```

No schema, no field names, no idea that "MUÑOZ ESTRADA" is a last name or that "09/12/1987" is the DOB vs. the expiration. You still have to write a parser.
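Stitching that flat array back into reading order is the first parser you end up writing. A minimal sketch, assuming the fragment shape shown above (the y-tolerance is a tuning assumption, not a standard value):

```typescript
// Mirrors the CNN output above: raw text fragment + box + confidence.
interface Fragment {
  text: string;
  bbox: number[]; // [x1, y1, x2, y2]
  confidence: number;
}

// Group fragments into lines: fragments whose vertical centers fall
// within a tolerance are treated as one line, then each line is
// sorted left to right and joined into a string.
function groupIntoLines(fragments: Fragment[], yTolerance = 15): string[] {
  const centerY = (f: Fragment) => (f.bbox[1] + f.bbox[3]) / 2;
  const sorted = [...fragments].sort((a, b) => centerY(a) - centerY(b));
  const lines: Fragment[][] = [];
  for (const frag of sorted) {
    const line = lines.find(
      (l) => Math.abs(centerY(frag) - centerY(l[0])) <= yTolerance
    );
    if (line) line.push(frag);
    else lines.push([frag]);
  }
  return lines.map((l) =>
    l.sort((a, b) => a.bbox[0] - b.bbox[0]).map((f) => f.text).join(" ")
  );
}
```

Even with that, you only get reading order, not meaning: the grouped line "MUÑOZ ESTRADA" is still just pixels to the CNN.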
Pure VLM OCR: clean structured JSON you can drop straight into your database, but the bounding boxes are sloppy, misaligned, or just invented.

```json
{
  "first_name": "Iván",
  "last_name": "Muñoz Estrada",
  "date_of_birth": "1987-09-12",
  "document_number": "Y4067081",
  "country_or_state": "California, USA",
  "expiration": "2027-09-12"
}
```

The schema is perfect. The bounding boxes on the image above are not: one is oversized, one is floating over the state seal with no text under it, one is on the portrait instead of the name, and one crops the address in half.
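One way to catch those sloppy boxes is to check each VLM-claimed region against the CNN's detections: a box that overlaps no detected text is floating over nothing. A minimal sketch, where the IoU threshold is an assumption you would tune:

```typescript
type Box = number[]; // [x1, y1, x2, y2]

// Intersection-over-union of two axis-aligned boxes.
function iou(a: Box, b: Box): number {
  const ix = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
  const iy = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
  const inter = ix * iy;
  const areaA = (a[2] - a[0]) * (a[3] - a[1]);
  const areaB = (b[2] - b[0]) * (b[3] - b[1]);
  return inter / (areaA + areaB - inter);
}

// A VLM-claimed box is "grounded" only if it overlaps at least one
// CNN-detected text region above the threshold.
function isGrounded(vlmBox: Box, cnnBoxes: Box[], threshold = 0.3): boolean {
  return cnnBoxes.some((b) => iou(vlmBox, b) >= threshold);
}
```

A box hovering over the state seal fails this check immediately, because no CNN detection sits under it.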
Most VLMs bolt a generic SigLIP 2 or ViT encoder onto a language model. These encoders were trained to align images and text broadly: cats, skateboards, memes.
Reading text needs different features: high spatial resolution, sharp edges, stroke-level detail, tolerance to rotation and noise.
Training methodology matters more than size here. A 400M SigLIP 2 encoder beats a 5.9B InternViT on most VLM benchmarks. But neither is purpose-built for OCR, which is why even frontier VLMs misread dense receipts and multi-column PDFs.
The obvious fix is tool calling: give the VLM an `ocr` tool backed by PaddleOCR or Tesseract, let it call the specialist, and feed results back into context.
This is a real improvement. Accuracy jumps because the actual transcription now comes from a CNN that can read pixels. Bounding boxes and confidence scores exist. Hallucination on clean text drops sharply.
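The dispatch side of that loop can be sketched in a few lines. This is a simplified, self-contained version: the tool-call shape is flattened relative to real SDK responses, and `runOcr` returns canned data standing in for an actual PaddleOCR or Tesseract call:

```typescript
// One tool call as it might come back from a chat-completions response
// (simplified; real SDKs nest name/arguments under a `function` key).
interface ToolCall { id: string; name: string; arguments: string }

// Stand-in for a real CNN OCR backend; canned output keeps this runnable.
function runOcr(imageUrl: string): { text: string; bbox: number[]; confidence: number }[] {
  return [{ text: "Y4067081", bbox: [350, 170, 560, 205], confidence: 0.986 }];
}

// Execute each `ocr` tool call and turn the result into a tool message
// the model reads back into context on its next turn.
function handleToolCalls(toolCalls: ToolCall[]) {
  const messages: { role: "tool"; tool_call_id: string; content: string }[] = [];
  for (const call of toolCalls) {
    if (call.name !== "ocr") continue;
    const { image_url } = JSON.parse(call.arguments);
    messages.push({
      role: "tool",
      tool_call_id: call.id,
      content: JSON.stringify(runOcr(image_url)),
    });
  }
  return messages;
}
```

Note where the CNN's output ends up: serialized into context as text, which the model must then re-emit token by token in its answer.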
But it still hits a ceiling. The transcription has to make a round trip through the LLM's context and be re-emitted token by token, so the formatting step can still drop, reorder, or paraphrase it, and you pay tool-call latency on every request. Tool calling is great for orchestration. It's the wrong pattern when the specialist's output needs to land byte-for-byte in the final response and the two models need to cross-check each other.
The real answer is one system where a CNN OCR specialist feeds directly into the VLM's decoder, alongside the generic vision encoder. Each component contributes what it's good at:
| Signal | Source | Why It Matters |
|---|---|---|
| Semantic visual tokens | Generic VLM encoder | Layout, diagrams, context |
| Character-level tokens | CNN OCR head | Pixel-perfect reads of small dense text |
| Bounding boxes | CNN detector | Grounded, structured metadata |
| Confidence scores | CNN decoder | Uncertainty and escalation signal |
| Reasoning + schema | LLM decoder | Flexible, structured output |
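The confidence row in the table above is what turns uncertainty into an escalation signal. A minimal sketch, assuming a flat list of extracted fields with CNN confidences attached (the field shape and threshold are illustrative assumptions):

```typescript
// One extracted field with the CNN decoder's confidence attached.
interface ExtractedField { name: string; value: string; confidence: number }

// Anything the CNN decoder was unsure about gets routed to review
// instead of silently landing in the database; the threshold is
// something you would tune per document type.
function fieldsToEscalate(fields: ExtractedField[], threshold = 0.9): string[] {
  return fields.filter((f) => f.confidence < threshold).map((f) => f.name);
}
```

A pure VLM can't give you this: token probabilities on generated text are not a visual confidence signal.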
The flow: the image goes to both the generic vision encoder and the CNN OCR stack, and the decoder attends to both streams. The CNN doesn't replace the vision encoder; it augments it. The decoder stops guessing at pixels and starts treating the CNN's output as a grounded source of truth, while also being able to flag or correct obvious CNN mistakes against the semantic signal from the vision encoder.
No tool call latency, no format-step hallucination, real bounding boxes and real confidence on the same response.
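The cross-check can be as simple as verifying that each field value the model emits is actually reconstructable from the CNN's fragments. A hedged sketch that handles verbatim string matches only (values in a different format than the fragments, like reformatted dates, would need format-aware comparison):

```typescript
// Flag a field value as a possible hallucination if no run of
// CNN-read fragments supports it. Normalization strips case,
// whitespace, and punctuation before comparing.
function isSupportedByOcr(value: string, fragments: string[]): boolean {
  const norm = (s: string) => s.toUpperCase().replace(/[^A-Z0-9ÑÁÉÍÓÚ]/g, "");
  const target = norm(value);
  const joined = fragments.map(norm).join("");
  return target.length > 0 && joined.includes(target);
}
```

A fabricated name fails this check even when it looks perfectly plausible, which is exactly the failure mode a pure VLM can't catch on its own.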
This is the architecture behind Interfaze. We trained a dedicated OCR CNN and adapted it to feed natively into the Interfaze decoder, not through a tool call.
Interfaze gives you both signals on one response: VLM-grade structured output plus CNN-grade bounding boxes and confidence scores. More detail is in our paper and the Interfaze beta writeup.
In practice, the VLM returns your structured JSON in `content`, and `precontext` carries the CNN-side OCR metadata on the same response:
OpenAI SDK
```typescript
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const interfaze = new OpenAI({
  apiKey: "<INTERFAZE_API_KEY>",
  baseURL: "https://api.interfaze.ai/v1",
});

const IDSchema = z.object({
  first_name: z.string().describe("First name on the ID"),
  last_name: z.string().describe("Last name on the ID"),
  dob: z.string().describe("Date of birth on the ID"),
  driver_licence_number: z.string().describe("Driver licence number on the ID"),
});

const response = await interfaze.chat.completions.create({
  model: "interfaze-beta",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract the details from this ID" },
        {
          type: "image_url",
          image_url: {
            url: "https://r2public.jigsawstack.com/interfaze/examples/id.jpg",
          },
        },
      ],
    },
  ],
  response_format: zodResponseFormat(IDSchema, "id_schema"),
});

console.log(response.choices[0].message.content);

// @ts-expect-error precontext is not typed on the SDK response
const precontext = response.precontext;
console.log("OCR bounding boxes + confidence:", precontext[0]?.result);
```

One call: VLM-grade flexibility in `content`, CNN-grade metadata in `precontext`, both signals cross-checking each other. If you're doing OCR in production and care about bounding boxes, confidence scores, or not hallucinating on bad inputs, a pure VLM isn't enough and tool calling isn't the finish line: fuse at the decoder.