CNN + VLM > VLM

This post covers why neither a CNN nor a VLM/LLM alone is good enough for OCR, why tool calling might close some of the gap but not all of it, and how fusing the two achieves the best of both worlds.

The Two Schools of OCR

  • Classical CNN pipelines. CNNs span classification (ResNet), segmentation (U-Net), and object detection (Faster R-CNN, Mask R-CNN). For OCR you want a detect-then-recognize stack: DBNet + CRNN/SVTR, usually via PaddleOCR. Small, fast, one job: find pixels of text, decode them into characters.
  • Vision-language models. Generalists like Qwen3-VL, GPT-5, Gemini 3 Pro, Claude Sonnet 4.6, plus OCR specialists like dots.ocr and olmOCR 2. They treat OCR as text generation conditioned on an image.

For the rest of this post, "CNN" means an OCR-specialized detect-and-recognize stack (DBNet + CRNN/SVTR style), not any convolutional model in general.

Both work. Both fail in very specific ways.

CNN vs VLM: The Tradeoff

| Dimension | Pure CNN OCR | Pure VLM OCR |
| --- | --- | --- |
| Bounding boxes | Pixel-tight polygons per word or line | Coarse regions, often imprecise |
| Confidence scores | Per-character probabilities from the decoder | Token probs on generated text, not visual confidence |
| Determinism | Same image in, same text out | Stochastic, output drifts between runs |
| Speed and cost | Milliseconds on CPU (e.g. PP-OCRv5) | Seconds on GPU for a 72B model |
| Schema flexibility | Rigid, needs custom post-processing per document type | Ask for JSON, markdown, summaries, filtered fields, all in one call |
| Semantic understanding | None, "Total" is just pixels | Knows "Total" = "Grand Total" = "Amount Due", resolves ambiguity |
| Unseen document types | Usually needs per-domain fine-tuning | Zero-shot on almost anything, including handwriting and messy scans |
| Reasoning | None | Multi-turn, e.g. "extract line items, flag duplicates" in one prompt |
| Hallucination | Can't hallucinate, only misreads | Fabricates plausible text on blurry, cropped, low-contrast inputs |

Pure CNN gives you precision without understanding. Pure VLM gives you understanding without precision.

Here's what that looks like in practice.

Pure CNN OCR: every word boxed perfectly, but the output is a giant flat array of text fragments. You know where each token is, not what any of it means.

[Image: pure CNN OCR — pixel-tight bounding boxes around every token on a driver license]
[
  { "text": "California", "bbox": [40, 30, 310, 90], "confidence": 0.994 },
  { "text": "USA", "bbox": [315, 55, 360, 85], "confidence": 0.981 },
  { "text": "DRIVER", "bbox": [430, 45, 570, 85], "confidence": 0.989 },
  { "text": "LICENSE", "bbox": [580, 45, 740, 85], "confidence": 0.992 },
  { "text": "Y4067081", "bbox": [350, 170, 560, 205], "confidence": 0.986 },
  { "text": "09/12/2027", "bbox": [350, 215, 540, 245], "confidence": 0.978 },
  { "text": "MUÑOZ", "bbox": [350, 260, 470, 295], "confidence": 0.984 },
  { "text": "ESTRADA", "bbox": [475, 260, 620, 295], "confidence": 0.972 },
  { "text": "IVAN", "bbox": [350, 300, 430, 330], "confidence": 0.988 },
  { "text": "ICHET", "bbox": [435, 300, 530, 330], "confidence": 0.974 },
  { "text": "14223", "bbox": [350, 340, 420, 368], "confidence": 0.991 },
  { "text": "BELGATE", "bbox": [425, 340, 540, 368], "confidence": 0.987 },
  { "text": "ST", "bbox": [545, 340, 575, 368], "confidence": 0.995 },
  { "text": "BALDWIN", "bbox": [350, 370, 475, 398], "confidence": 0.989 },
  { "text": "PARK", "bbox": [480, 370, 555, 398], "confidence": 0.993 },
  { "text": "09/12/1987", "bbox": [360, 430, 560, 462], "confidence": 0.982 }
  // ... 30+ more entries
]

No schema, no field names, no idea that "MUÑOZ ESTRADA" is a last name or that "09/12/1987" is the DOB vs. the expiration. You still have to write a parser.
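To make "you still have to write a parser" concrete, here's a sketch of the kind of glue code raw CNN output forces on you: group tokens into lines by vertical position, then fish for fields with regexes. The types, tolerance, and heuristics here are illustrative assumptions, not part of any real pipeline.

```typescript
// Hypothetical parser sketch on top of flat CNN OCR output.
type OcrToken = { text: string; bbox: [number, number, number, number]; confidence: number };

// Two tokens belong to the same line if their vertical centers are close.
function groupIntoLines(tokens: OcrToken[], tolerance = 12): OcrToken[][] {
  const sorted = [...tokens].sort((a, b) => a.bbox[1] - b.bbox[1]);
  const lines: OcrToken[][] = [];
  for (const tok of sorted) {
    const last = lines[lines.length - 1];
    const cy = (tok.bbox[1] + tok.bbox[3]) / 2;
    if (last) {
      const lastCy = (last[0].bbox[1] + last[0].bbox[3]) / 2;
      if (Math.abs(cy - lastCy) <= tolerance) {
        last.push(tok);
        continue;
      }
    }
    lines.push([tok]);
  }
  // Left-to-right reading order within each line.
  for (const line of lines) line.sort((a, b) => a.bbox[0] - b.bbox[0]);
  return lines;
}

// Brittle per-document heuristics: which of these dates is the DOB
// and which is the expiration? The parser alone can't know.
function guessDates(lines: OcrToken[][]): string[] {
  return lines.flat().map(t => t.text).filter(t => /^\d{2}\/\d{2}\/\d{4}$/.test(t));
}
```

And this only gets you lines and candidate fields; mapping them to a schema still needs per-document-type rules.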

Pure VLM OCR: clean structured JSON you can drop straight into your database, but the bounding boxes are sloppy, misaligned, or just invented.

[Image: pure VLM OCR — four clearly wrong red bounding boxes on the same driver license (misaligned, floating, wrong field, partial)]
{
  "first_name": "Iván",
  "last_name": "Muñoz Estrada",
  "date_of_birth": "1987-09-12",
  "document_number": "Y4067081",
  "country_or_state": "California, USA",
  "expiration": "2027-09-12"
}

Schema is perfect. Bounding boxes on the image above are not — one is oversized, one is floating over the state seal with no text under it, one is on the portrait instead of the name, one is cropping the address in half.
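One way to catch invented boxes, sketched under the assumption that you also run a CNN detector on the same image: compute intersection-over-union between each VLM box and every CNN detection, and treat any VLM box with no real overlap as a fabrication candidate. The function names and the 0.3 threshold are illustrative choices, not a standard.

```typescript
// Hypothetical sanity check: flag VLM boxes that no CNN detection supports.
type Box = [number, number, number, number]; // [x1, y1, x2, y2]

function iou(a: Box, b: Box): number {
  const x1 = Math.max(a[0], b[0]);
  const y1 = Math.max(a[1], b[1]);
  const x2 = Math.min(a[2], b[2]);
  const y2 = Math.min(a[3], b[3]);
  const inter = Math.max(0, x2 - x1) * Math.max(0, y2 - y1);
  const area = (r: Box) => (r[2] - r[0]) * (r[3] - r[1]);
  return inter / (area(a) + area(b) - inter);
}

// A VLM box that overlaps nothing the detector found is probably invented.
function flagInventedBoxes(vlmBoxes: Box[], cnnBoxes: Box[], minIou = 0.3): Box[] {
  return vlmBoxes.filter(v => !cnnBoxes.some(c => iou(v, c) >= minIou));
}
```

Note this only detects the failure; it can't produce the correct box, which is why the CNN has to be the source of coordinates in the first place.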

The Vision Encoder Problem

Most VLMs bolt a generic SigLIP 2 or ViT encoder onto a language model. These encoders were trained to align images and text broadly: cats, skateboards, memes.

Reading text needs different features: high spatial resolution, sharp edges, stroke-level detail, tolerance to rotation and noise.

Training methodology matters more than size here. A 400M SigLIP 2 encoder beats a 5.9B InternViT on most VLM benchmarks. But neither is purpose-built for OCR, which is why even frontier VLMs misread dense receipts and multi-column PDFs.

Tool Calling: Most of the Gap, But Not All

The obvious fix is tool calling: give the VLM an ocr tool backed by PaddleOCR or Tesseract, let it call the specialist, and feed results back into context.

This is a real improvement. Accuracy jumps because the actual transcription now comes from a CNN that can read pixels. Bounding boxes and confidence scores exist. Hallucination on clean text drops sharply.

But it still hits a ceiling, for a handful of reasons:

  • The VLM is still the final author. It can paraphrase, drop details, or hallucinate during the formatting step, even with perfect OCR input in context.
  • No auto-correction. The VLM can't tell whether the CNN misread a character, skipped a line, or merged two fields. It takes whatever lands in context as ground truth, which creates confident false positives.
  • Bounding boxes are fragile. The tool returns precise coordinates, but threading them through a text-generation model without corruption is unreliable.
  • Latency multiplies. VLM decides, tool runs, VLM formats. Two or three model passes per document.
  • Routing is unpredictable. The VLM may skip the tool when it shouldn't, or call it and override the result with its own guess.

Tool calling is great for orchestration. It's the wrong pattern when the specialist's output needs to land byte-for-byte in the final response and the two models need to cross-check each other.
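The shape of the problem is visible in the pattern itself. Here's a minimal sketch of the loop, with callVlm and runCnnOcr as stand-in functions rather than any real API: two model passes bracket one tool run, the VLM chooses whether the tool runs at all, and the VLM re-authors the text in the final step.

```typescript
// Hypothetical tool-calling loop; function names are stand-ins.
type OcrResult = { text: string; bbox: number[]; confidence: number }[];

async function ocrViaToolCalling(
  image: string,
  callVlm: (prompt: string) => Promise<string>,     // model passes 1 and 2
  runCnnOcr: (image: string) => Promise<OcrResult>, // the specialist tool
): Promise<string> {
  // Pass 1: the VLM decides whether to call the tool (routing is its choice).
  const decision = await callVlm(`Image: ${image}. Call the ocr tool if needed.`);
  if (!decision.includes("ocr")) return decision; // it may skip the tool entirely

  // Tool run: precise text + boxes from the CNN.
  const ocr = await runCnnOcr(image);

  // Pass 2: the VLM formats the result — and is free to paraphrase,
  // drop entries, or override the CNN's transcription.
  return callVlm(`Format as JSON: ${JSON.stringify(ocr)}`);
}
```

Every failure mode in the list above lives in one of these three steps, and no amount of prompting removes the steps themselves.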

The Fix: Fuse Them at the Model Level

The real answer is one system where a CNN OCR specialist feeds directly into the VLM's decoder, alongside the generic vision encoder. Each component contributes what it's good at:

| Signal | Source | Why It Matters |
| --- | --- | --- |
| Semantic visual tokens | Generic VLM encoder | Layout, diagrams, context |
| Character-level tokens | CNN OCR head | Pixel-perfect reads of small dense text |
| Bounding boxes | CNN detector | Grounded, structured metadata |
| Confidence scores | CNN decoder | Uncertainty and escalation signal |
| Reasoning + schema | LLM decoder | Flexible, structured output |

The flow looks like this: the image feeds both the generic vision encoder and the CNN OCR head in parallel; the encoder's semantic tokens and the CNN's text, boxes, and confidence scores are fused into one context; and the LLM decoder generates the structured output from both signals at once.

The CNN doesn't replace the vision encoder; it augments it. The decoder stops guessing at pixels and starts treating the CNN's output as a grounded source of truth, while also being able to flag or correct obvious CNN mistakes against the semantic signal from the vision encoder.

No tool call latency, no format-step hallucination, real bounding boxes and real confidence on the same response.
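As a toy illustration of the cross-check idea — assumed confusion pairs, not the actual fused mechanism — a semantic layer can repair a classic CNN character confusion when a read is one substitution away from a known label, and otherwise leave the grounded read untouched:

```typescript
// Toy semantic repair: classic digit/letter confusions a CNN recognizer makes.
// The confusion table and vocabulary-lookup approach are illustrative only.
const CONFUSIONS: [RegExp, string][] = [
  [/0/g, "O"], [/1/g, "I"], [/5/g, "S"], [/8/g, "B"],
];

function semanticRepair(read: string, vocabulary: string[]): string {
  if (vocabulary.includes(read)) return read;
  for (const [digit, letter] of CONFUSIONS) {
    const candidate = read.replace(digit, letter);
    // The semantic prior wins only when it lands on a known label.
    if (vocabulary.includes(candidate)) return candidate;
  }
  return read; // no safe correction — keep the grounded CNN read
}
```

The asymmetry is the point: the semantic signal corrects label-like text, but an ID number it doesn't recognize stays exactly as the CNN read it.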

How Interfaze Does This

This is the architecture behind Interfaze. We trained a dedicated OCR CNN and adapted it to feed natively into the Interfaze decoder, not through a tool call.

Interfaze gives you:

  • VLM-quality reasoning and schema flexibility
  • CNN-quality bounding boxes and confidence scores as first-class metadata
  • One model pass per document
  • No hallucination on degraded images, because the CNN head anchors the decoder to actual pixels

More detail in our paper and the Interfaze beta writeup.

In practice, the VLM returns your structured JSON, and precontext carries the CNN-side OCR metadata on the same response:

OpenAI SDK

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const interfaze = new OpenAI({
	apiKey: "<INTERFAZE_API_KEY>",
	baseURL: "https://api.interfaze.ai/v1",
});

const IDSchema = z.object({
	first_name: z.string().describe("First name on the ID"),
	last_name: z.string().describe("Last name on the ID"),
	dob: z.string().describe("Date of birth on the ID"),
	driver_licence_number: z.string().describe("Driver licence number on the ID"),
});

const response = await interfaze.chat.completions.create({
	model: "interfaze-beta",
	messages: [
		{
			role: "user",
			content: [
				{ type: "text", text: "Extract the details from this ID" },
				{
					type: "image_url",
					image_url: {
						url: "https://r2public.jigsawstack.com/interfaze/examples/id.jpg",
					},
				},
			],
		},
	],
	response_format: zodResponseFormat(IDSchema, "id_schema"),
});

console.log(response.choices[0].message.content);

//@ts-expect-error precontext is not typed
const precontext = response.precontext;
console.log("OCR bounding boxes + confidence:", precontext[0]?.result);

One call. CNN-grade metadata. VLM-grade flexibility.
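A natural follow-on, sketched here as hypothetical post-processing (the exact precontext entry shape may differ; this mirrors the CNN output format shown earlier): use the per-read confidence as an escalation signal, accepting high-confidence reads and routing the rest to human review.

```typescript
// Hypothetical confidence gating over precontext-style OCR metadata.
// The entry shape and 0.9 threshold are assumptions for illustration.
type PrecontextRead = { text: string; bbox: number[]; confidence: number };

function splitByConfidence(reads: PrecontextRead[], threshold = 0.9) {
  const accepted: PrecontextRead[] = [];
  const review: PrecontextRead[] = [];
  for (const r of reads) (r.confidence >= threshold ? accepted : review).push(r);
  return { accepted, review };
}
```

This is the kind of escalation logic that token probabilities from a pure VLM can't support, because they measure generation likelihood, not visual certainty.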

Takeaway

  • Pure CNN OCR: pixel-tight boxes and real confidence, but a flat bag of tokens with no schema and no semantics.
  • Pure VLM OCR: clean structured output and reasoning, but imprecise boxes, token-prob confidence, and hallucinations on degraded inputs.
  • Tool calling: closes the accuracy gap, but adds multi-pass latency, unpredictable routing, fragile bbox hand-off, and no way for the VLM to auto-correct CNN mistakes.
  • Model-level fusion (Interfaze): the CNN OCR head feeds text, boxes, and confidence directly into the VLM decoder alongside the vision encoder. One pass, structured JSON in content, CNN-grade metadata in precontext, both signals cross-checking each other.

If you're doing OCR in production and care about bounding boxes, confidence scores, or not hallucinating on bad inputs, a pure VLM isn't enough and tool calling isn't the finish line — fuse at the decoder.