Introducing Interfaze, a new model architecture that is deterministic by nature and designed for high-accuracy tasks.
Before 2018, DNNs and CNNs were the default for almost any ML task. You trained a model on a task-specific dataset, and it performed well on data that resembled what it had learned.
These models output confidence scores and other task-related data on every run. That lets developers build confidence-driven systems conditioned on the final probabilities and values the models return.
Here's an example output from AWS Textract, a traditional deep-learning OCR service.
{
"Type": {
"Text": "TOTAL",
"Confidence": 99.94717407226562
},
"LabelDetection": {
"Text": "Total:",
"Geometry": {
"BoundingBox": {
"Width": 0.09809663146734238,
"Height": 0.0234375,
"Left": 0.36822840571403503,
"Top": 0.8017578125
},
"Polygon": [
{ "X": 0.36822840571403503, "Y": 0.8017578125 },
{ "X": 0.466325044631958, "Y": 0.8017578125 },
{ "X": 0.466325044631958, "Y": 0.8251953125 },
{ "X": 0.36822840571403503, "Y": 0.8251953125 }
]
},
"Confidence": 97.10792541503906
},
"ValueDetection": {
"Text": "$55.64",
"Currency": {
"Code": "USD"
},
"Geometry": {
"BoundingBox": {
"Width": 0.10395314544439316,
"Height": 0.0244140625,
"Left": 0.66837477684021,
"Top": 0.802734375
},
"Polygon": [
{ "X": 0.66837477684021, "Y": 0.802734375 },
{ "X": 0.7723279595375061, "Y": 0.802734375 },
{ "X": 0.7723279595375061, "Y": 0.8271484375 },
{ "X": 0.66837477684021, "Y": 0.8271484375 }
]
},
"Confidence": 99.85165405273438
},
"PageNumber": 1
}

Metadata like BoundingBox and Confidence is generated by default on every run. It gives developers a per-output signal they can threshold on: auto-accept high-confidence results and route low-confidence ones to review. The scores are consistent and actionable, which is what production systems actually need.
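
As a rough illustration of threshold-based routing, the Python sketch below takes a trimmed copy of the Textract field above and auto-accepts it only if both the label and value confidences clear a cutoff. The `route_field` helper, the `Decision` type, and the 98.0 cutoff are assumptions made for this example; they are not part of Textract or of any system described in this post.

```python
from dataclasses import dataclass

# Hypothetical cutoff on Textract's 0-100 confidence scale; tune per field.
AUTO_ACCEPT_THRESHOLD = 98.0


@dataclass
class Decision:
    label: str
    value: str
    confidence: float
    action: str  # "accept" or "review"


def route_field(field: dict, threshold: float = AUTO_ACCEPT_THRESHOLD) -> Decision:
    """Decide whether one extracted field can be auto-accepted.

    The field is trusted only as much as its weakest signal, so the lower
    of the label and value confidences is compared to the threshold.
    """
    label = field["LabelDetection"]
    value = field["ValueDetection"]
    confidence = min(label["Confidence"], value["Confidence"])
    action = "accept" if confidence >= threshold else "review"
    return Decision(label["Text"], value["Text"], confidence, action)


# Trimmed copy of the field shown above (geometry omitted).
total_field = {
    "Type": {"Text": "TOTAL", "Confidence": 99.94717407226562},
    "LabelDetection": {"Text": "Total:", "Confidence": 97.10792541503906},
    "ValueDetection": {"Text": "$55.64", "Confidence": 99.85165405273438},
}

decision = route_field(total_field)
print(decision.action, round(decision.confidence, 2))
```

With the 98.0 cutoff, this field is routed to review because the label confidence is about 97.11. Lowering the cutoff to 95.0 would auto-accept it.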
The glaring problem with DNN/CNN models is that they only know what they were trained on. Introducing a new document type or format means retraining the entire model, which requires specialized ML engineers.
| Pros | Cons |
|---|---|
| Cheap to scale (CPU based or small GPU) | Breaks on new data it has never seen |
| High accuracy on trained data (98%-99%+) | High retraining cost with specialized ML engineers required |
| Per-output signals (confidence scores, etc) you can threshold on | Single-dimension task only, not multimodal or flexible |
| Deterministic metadata | Gets outdated quickly |
| Fast inference | |
Since 2022, with the launch of ChatGPT and particularly GPT-3.5, transformer-based models have become the standard, and AGI has become the market objective.
These models are great at Natural Language Processing (NLP) & Natural Language Inference (NLI) tasks. They handle human-level nuance and turn words into actions, even for tasks they never saw during training.
Here's an example output from GPT-4.1, a transformer-based model used for OCR tasks.
{
"total": "$55.64",
"currency": "USD"
}

Transformer-based models like GPT-4.1 use a user-defined schema to generate structured JSON and return only the required fields. Metadata isn't natively available, and it gets hallucinated if you request it in the output.
LLMs are great at understanding human-level inputs in natural languages and handling a wide range of document formats, even ones they haven't seen. But they're far from doing this consistently and accurately.
| Pros | Cons |
|---|---|
| Multi-modal by default | Expensive to scale (H100 GPU clusters at minimum) and bound by scaling laws: better results require more compute |
| Flexible on data type and filling in the gaps | Hallucinations with no accuracy measurements |
| Dynamic structured output | Inconsistency (same input doesn't give the same output) |
| Human-level language understanding | Slow to run |
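
One way to make the inconsistency visible is to run the same extraction several times and count how often each field comes back with the same value. The sketch below is only an illustration: `field_agreement` is a hypothetical helper, and the three `runs` are made-up sample outputs in the shape of the JSON above, not measured results. Agreement is not accuracy either, since a model can be consistently wrong.

```python
from __future__ import annotations

from collections import Counter


def field_agreement(runs: list[dict], field: str) -> tuple[str | None, float]:
    """Return the most common value for `field` and the share of runs that produced it."""
    values = [run.get(field) for run in runs]
    if not values:
        return None, 0.0
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Made-up outputs from three hypothetical runs on the same receipt.
runs = [
    {"total": "$55.64", "currency": "USD"},
    {"total": "$55.64", "currency": "USD"},
    {"total": "$56.64", "currency": "USD"},
]

total_value, total_share = field_agreement(runs, "total")
currency_value, currency_share = field_agreement(runs, "currency")
print(total_value, total_share)
print(currency_value, currency_share)
```

A share below 1.0 flags a field worth reviewing, but every extra run multiplies inference cost, which is part of why this is a workaround and not a native confidence score.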
Can we get the best of both worlds?
DNNs and CNNs for tasks that require high accuracy, paired with transformers for their general nuance and dynamic output generation.
To test this thesis, we built a prototype. We used the tool-calling feature of a standard LLM (Gemini 2.5 Flash) and let it call specialized DNN models: PaddleOCR for OCR, SAM 3 for object detection, Whisper for audio, and more.
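
The tool-calling pattern can be sketched in a few lines of Python. This is not the prototype's code: the tool functions are stubs that return canned, hypothetical values where a real system would call PaddleOCR or Whisper, and the `{"name": ..., "arguments": ...}` call shape is an assumption chosen for illustration, not the format of any specific LLM API.

```python
from typing import Callable


# Hypothetical stand-ins for specialized models. A real system would invoke
# an OCR or speech model here; these stubs only return canned results.
def run_ocr(source: str) -> dict:
    return {"text": "Total: $55.64", "confidence": 0.97}


def transcribe_audio(source: str) -> dict:
    return {"text": "hello world", "confidence": 0.91}


# Registry mapping the tool names exposed to the LLM to their implementations.
TOOLS: dict[str, Callable[[str], dict]] = {
    "ocr": run_ocr,
    "transcribe": transcribe_audio,
}


def dispatch(tool_call: dict) -> dict:
    """Execute one tool call and return the result plus its metadata."""
    name = tool_call["name"]
    if name not in TOOLS:
        return {"tool": name, "error": f"unknown tool: {name}"}
    result = TOOLS[name](tool_call["arguments"]["source"])
    return {"tool": name, **result}


# In a real loop the LLM would emit this call; here it is hard-coded.
call = {"name": "ocr", "arguments": {"source": "receipt.png"}}
observation = dispatch(call)
print(observation)
```

The point of the design is that the specialized model's metadata (here a confidence value) travels back alongside the text, so the LLM's final answer can carry a signal it could not produce on its own.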
We published v0.1 of the experiment and its results in our paper, accepted into IEEE CAI 2026.
We saw a jump in results, especially on vision-heavy tasks, outperforming both the base Gemini model and Claude Opus 4 on many of them.
This was promising, but we knew the limitations:
You can find the full prototype code on GitHub.
The prototype was a good indicator of what's possible: higher-quality output and more verifiable metadata, while keeping the flexibility of transformers.
It's not a perfect implementation: it still suffers from data loss, slow performance, high cost, and run-to-run inconsistency.
Even so, it hinted at what a model trained natively, from the ground up, on a hybrid architecture could do.
So what's next? The goal for v1 is a native hybrid model in which transformers and task-specific DNNs work hand in hand. They would be connected through a shared embedding space so the transformer can read and write task-model representations directly, though this is still an open design question we're prototyping.
To achieve this, we need to accomplish the following:
Our goal over the next 3-6 months is to ship a working v1 of our hybrid architecture in production. It should outperform both CNN/DNN models and low-cost production LLMs like the Gemini Flash series, which are commonly used for high-accuracy tasks.