
The vision of combating model hallucinations with formal verification

March 16, 2026

The emergence of post-ChatGPT GenAI models has triggered a fundamental shift in how software is built. There are flashy highlights. Spotify, for example, has announced that their best developers haven’t written a line of code since December. There have been layoffs attributed to AI — though it remains contentious whether this is merely a convenient excuse. AI has even solved previously open math problems, which suggests high potential for (at least pattern-matching-based) reasoning.

It seems fair to say that we are currently witnessing a period of adjustment. No one quite knows the exact capabilities and limits of AI. What is undeniable, however, is that AI models are notoriously unreliable: while they can stun with the occasional elegant solution, they fumble equally often on mundane tasks that should be trivial for an equally capable human programmer.

Medium-term predictions vary enormously. The vocal minorities seem to be the relentlessly bullish AI enthusiasts and their counterparts, the doomers, who still consider AI a temporary fad. The silent majority (or what I believe is the majority) are standing on the sidelines, trying to make sense of the conflicting opinions. The key unknown right now, I think, is trust. It’s clear that code (like most other products of knowledge work) is, in theory, almost free. The gist of the matter is whether the code generated by an AI model can be trusted, or what kind of actions should be taken to make the output trustworthy.

To give a recent example, Amazon has signaled that they are running into this exact issue, and they are tackling it in the most obvious way: manual review. This raises the question, though, of whether manual review is the only way forward, as it’s slow and tedious. To put it simply: reviewing code is often harder than writing it. What would we gain from outsourcing the easy part only to make the difficult part even more difficult?

Could we tackle this via formalization? There was an interesting post by Leonardo de Moura where he envisions software shipping with a “proof suite” that formally verifies its behaviour. I’m personally a fan of Lean and I find his vision interesting. In a world where formal verification were a standard part of a software development project, the role of the software developer would shift from writing the business logic itself to expressing the spec in Lean (or some other equivalent).
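As a minimal sketch of that division of labor (in Lean 4; all names here are purely illustrative, not from de Moura’s post): the human writes the spec as theorem statements, and the implementation — possibly AI-generated — is only accepted once the proofs go through.

```lean
-- Illustrative only: `myMax` stands in for AI-generated business logic.
def myMax (a b : Nat) : Nat := if a ≤ b then b else a

-- The spec, written by the human: the result bounds the inputs
-- and is one of them.
theorem myMax_ge_left (a b : Nat) : a ≤ myMax a b := by
  unfold myMax; split <;> omega

theorem myMax_eq_or (a b : Nat) : myMax a b = a ∨ myMax a b = b := by
  unfold myMax; split <;> simp
```

If a model later rewrites the implementation, the proofs either still close or the build fails — nothing gets silently trusted.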

This may be slightly futuristic, and I’m not sure how practical it is at the moment. The payoff, though, would be enormous: in theory, formal verification would be an absolute guarantee against the “trust issues” we currently have with AI. There would be mathematical certainty that the program behaves according to the spec, now and in the future. In this sense, formal verification would play a role akin to the test suites repos currently ship with, except that while tests are (potentially insufficient) empirical experiments, formal verification would rest on ironclad deductive reasoning.
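The contrast between the two can be made concrete in Lean itself (a toy example; `double` is just an illustrative stand-in):

```lean
def double (n : Nat) : Nat := n + n

-- A unit test: an empirical check at a single data point.
example : double 21 = 42 := rfl

-- A proof: a deductive guarantee covering every input, checked at build time.
theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double; omega
```

The `example` line is exactly what a test asserts; the `theorem` line is what a proof suite would demand instead.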


Article by

Matias Heikkilä