Reliable Software in the LLM Era

2026-03-128:1910835quint-lang.org

Quint - Executable specs for reliable systems

Written by: Gabriela Moreira

Quint was born from the concern of making software more reliable. This has been Informal Systems’ mission from its conception. We want a world where people can trust the software they use.

LLMs have transformed how we write code, but they’ve also created new frustrations. We’ve all been there: staring at a huge AI-generated diff with no clue if it’s actually right. AI fooling us with code that seemed to work but was subtly wrong. Tests that all passed but didn’t actually test anything meaningful. The whole point of LLMs is producing text that looks correct - and that’s exactly what makes validation so hard.

AI is overconfident and needs reality checks.

But here’s the hopeful part: LLMs also enable us to take care of reliability with a much smaller time investment. And no, this is not simply generating specs. It’s not about “putting AI into Quint”. It’s about how Quint fits in the new world of software development, with the same goal it was born with.

Let me tell you how we used Quint to guardrail LLMs and get them to make a significant core change on a complex codebase that we feel confident about.

How Quint fits into this story

The fundamental challenge with LLMs is validation. LLMs excel at code generation, but we humans struggle to validate correctness through code review alone. We typically use a combination of natural language documentation and code, and both are surprisingly hard to validate.

Quint solves this by sitting between English and code as an ideal validation point. It’s more abstract than code, which makes it easier to reason about, yet it’s executable unlike English, which makes it mechanically verifiable. Quint’s tooling - the simulator, model checker, and REPL - lets you build confidence through exploration and property checking.

But what about validating the actual code? Since Quint is executable, we can establish deterministic connections between the specification and the implementation through model-based testing: you run the same scenarios in both the spec and the code, and verify they behave identically. The confidence you build at the spec level transfers to the code level.

The Use Case: A Complex Change in a Complex System

We tested this on Malachite, a production-grade BFT consensus engine that implements Tendermint consensus protocol. This is not a toy example - if it works for this, it will work for simpler changes and simpler systems. Malachite is a complex system with several optimizations on top of Tendermint, and the consensus core alone is made of four different components.

Malachite was acquired by Circle (the company behind USDC) to build Arc, their new blockchain. Much of Malachite’s protocol design was done with Quint from the beginning.

The change: Switch to Fast Tendermint - a variant that requires 5F + 1 nodes to tolerate F Byzantine nodes (instead of the classical 3F + 1), with potential for better performance through one less communication step.

Traditional estimate for this change: a couple of months.

We challenged ourselves to do it with Quint and AI in a week. And we did.

Starting Point

We Already Had a Spec

We started with an existing Tendermint spec in Quint - specifically one using the Choreo framework (our framework for modeling distributed systems). We have many versions of Tendermint specs because it’s our favorite protocol to model.

Could AI write the first spec for you? It can definitely help you, and we are exploring more ways to expand this help. Even if you have to write the first spec by hand (a few days of work for a complex protocol), the ROI for complex systems is compelling: that initial investment pays off for every subsequent change, refactor, or optimization you make.

LLMs as Translators, Not Designers

Let me be clear: we are not delegating protocol design to AI. We’re not asking AI to design state of the art BFT consensus algorithms.

Manu, one of our best protocol designers, did the actual work: research, the English protocol description, draft proofs and paper. We wanted to validate that his protocol was correct. That’s where Quint comes in - but this time, with help from AI.

So we started with two artifacts: A Quint specification for the original Tendermint, and an English description of the new Fast Tendermint.

The Workflow: Four Steps to Confidence

Input: Existing Tendermint spec + Manu’s English protocol description
Process: AI modifies the Quint spec based on the new protocol
Output: First version of Fast Tendermint spec

After each change, we run Quint’s basic validation tools - quint parse to check syntax, and quint typecheck to verify types, references, and function signatures. This tight feedback loop catches basic errors immediately before moving forward, iterating until we have a structurally sound spec.

But structural soundness isn’t the same as behavioral correctness.

Spec Validation

This is where humans should spend most of their time. The time you save by not writing code can now go into deeper validation.

You query the spec interactively: “Can I reach a scenario where X happens?” “Does this property hold?” AI helps you leverage Quint’s tools while you bring your domain expertise. AI can suggest some scenarios, and you drive it to make yourself more confident.

For Fast Tendermint, we validated:

  • The re-broadcast mechanism: Can we trigger re-broadcasts? Can we receive and act on them?
  • Alternative decision propagation: When does this mechanism trigger instead of the normal flow? We found and documented those scenarios.
  • Byzantine tolerance assumptions: We specified a version with too many Byzantine nodes. In this model, disagreement should be possible. If it’s not, we’re making unrealistic assumptions in our model.

The goal: test that all flows are reachable and explore the state space systematically. Once we’re confident about coverage, we check properties. This brings connection back - the opposite of the disconnected feeling you get when AI generates code and you just hope it’s right.

Results: We found two small bugs in Manu’s English spec in one afternoon. They were easy to fix at this stage.

Once validation passes - all flows reachable, properties hold, assumptions verified - we have a spec we are confident about. This becomes our source of truth for code generation.

Code Change

Now we have a validated spec. Time to generate the actual implementation.

Good news: AI is great at code generation when you have a precise spec to translate from. We give the AI the old Quint spec, the new Quint spec, and the diff between them - formal models rather than ambiguous natural language.

Malachite has architectural complexity that’s not represented in the abstract spec (optimizations, storage decisions, etc.), so we need to guide AI in these decisions. That’s part of our expertise and profession.

Code Validation

We validated the spec, but what we really care about is the code. Model-based testing is the technique that gives us confidence that our code follows those validated behaviors from the spec.

AI generates “glue code” that connects the spec to the implementation. This glue code takes scenarios from Step 2 (witnesses - properties that demonstrate state reachability - or quint runs) and replays them in the code: it takes a scenario, calls the matching entry points in the implementation, and asserts that the results match what the spec predicts. This produces a test suite that lives in your CI.

Beyond model-based testing: We’re also working on trace validation - taking traces from real environments where your code runs (testing environments, testnets, production) and validating that those traces are compliant with the spec. This closes the loop: you’re not just checking that code can behave like the spec, but that it actually does in practice.

The Results

The refactor was successful. We modified Malachite’s core consensus mechanism - a change traditionally estimated at months of work - in about 2 days for the spec modifications and validation, plus ~1 week for code generation and testing (including refinement and debugging). There’s much to mature in this process, but we were very happy with the results.

A Key Benefit: Spec as Debugging Compass

One benefit we found: the validated Quint spec prevents you from diving into non-issues during debugging. Here’s a pattern we encountered repeatedly:

  1. AI (reading test logs): “I know what is wrong - we should be broadcasting X when doing Y!”
  2. AI (reads Quint spec): “Wait, the spec shows we don’t broadcast X during Y. That’s by design.”
  3. Result: Move on immediately to investigate other hypotheses.

With an English spec, you’d second-guess: “What if we forgot to document it? What if it’s incomplete?” You waste time investigating false leads. But with a validated Quint spec that you’ve already explored deeply, you know the spec is correct. If broadcasting X during Y was needed, the validation phase would have exposed it. The spec says we don’t do it, so we confidently rule out that hypothesis and move on.

This trust in the spec as a definitive reference saves significant debugging time. It keeps both humans and AI from suspecting the wrong things.

Conclusion

Remember those frustrations? The huge diffs with no clue if they’re right? The disconnected feeling of endless prompts with no structure? The tests that passed but tested nothing?

This workflow addresses them. Interactive spec validation brings connection back - you’re not just hoping the AI got it right, you’re exploring and building confidence. When AI generates code from a validated spec, you have a definitive reference to check against. The spec becomes your debugging compass, preventing you from diving into non-issues. And model-based testing ensures your tests actually validate what matters.

Executable specifications are the ideal validation point in LLM-assisted development. They sit at the sweet spot between English and code: abstract enough for human reasoning, precise enough for mechanical verification. This is what beats AI’s overconfidence - not more AI, but better validation tools.

Here’s what’s remarkable: Quint was built to help humans reason about complex systems. Giving Quint to AI helps guardrail it to work with more complexity, without making things impossible to validate. We use LLMs for what they excel at: translating between Quint specs, documentation, and implementation code. LLMs don’t think, they translate. Quint’s deterministic tools do the reasoning.

Our work is shifting from writing code to validating AI-generated code. Defining what correct means has become extremely important - and that’s exactly what executable specs do.

This workflow - spec modification, interactive exploration, guided code generation, and model-based testing - scales to real-world systems. We proved it on Malachite.

Want to see this in action? We’re doing initial demos for teams interested in applying this approach to their projects. If you’re working on distributed systems such as consensus or interoperability protocols, large scale databases, or other types of complex systems such as trading apps (for example DEX/CEX) or cryptographic protocols (ZKP, privacy preserving apps), or any other complex core logic where reliability matters, schedule a consultation and we’ll show you our demo and happily explore how we can help you.

Thanks for reading. We’re building the future of reliable software, one executable spec at a time.



Read the original article

Comments

  • By _pdp_ 2026-03-1212:515 reply

    Nothing changes in terms of how to make reliable software. You need the same things like unit tests, integration tests, monitoring tools, etc.

    Basically AI now makes every product operate as if it has a vibrant open-source community with hundreds of contributions per day and a small core team with limited capacity.

    • By flykespice 2026-03-1215:18

      > Nothing changes in terms of how to make reliable software. You need the same things like unit tests, integration tests, monitoring tools, etc.

      It just changes in terms of doubling the work you have to do in order verify your system rather than you writing the code from scratch, because you have to figure out whatever code your AI agent spitted out before beginning the formal verification process.

      With you having written the code from scratch, you already know it beforehand and the verification process is more smoother.

    • By joshribakoff 2026-03-1214:05

      While nothing fundamentally changes i have found an increased need for tests and their taxonomies — because the LLM can “hack” the tests. So, having more robust tests with more ways to organize and run the tests. For example instead of 200 tests maybe i have 1,200, along with some lightweight tools to run tests in different parts of the test taxonomies.

      A more concrete example is maybe you have tests that show you put a highlight on the active item tests that show you don’t put the highlight on the inactive items, but with an llm you might also want to have tests that wait a while and verify the highlight is not flickering on and off overtime (something so absurd you wouldn’t even test for it before AI).

      The value of these test is in catching areas of the code where things are drifting towards nonsense because humans aren’t reviewing as thoroughly. I don’t think that you can realistically have 100% data coverage and prevent every single bug and not review the code. It’s just that I found that slightly more tests are warranted if you do want to step back.

    • By hrmtst93837 2026-03-1213:071 reply

      The tough part is that the "core team" can't see inside most model updates so even if you have great tests, judgment calls by the model can change silently and break contracts you didn't even know you had. Traditional monitoring can catch obvious failures but subtle regressions or drift in LLM outputs need their own whole bag of tricks. If you treat LLM integration like any other code lib you'll be chasing ghosts every time the upstream swaps a training data set or tweaks a prompt template.

      • By _pdp_ 2026-03-1213:33

        This is no different than receiving PRs from anonymous users on the Internet. Some more successful open source projects are already doing this at scale.

    • By 2001zhaozhao 2026-03-1220:41

      > AI now makes every product operate as if it has a vibrant open-source community with hundreds of contributions per day and a small core team with limited capacity.

      Hmm this is an interesting analogy. It suggests that we could design autonomous AI coding scaffold tools based on the patterns found in open source development communities.

    • By ok123456 2026-03-1214:03

      Exactly. NO SILVER BULLET.

  • By dijit 2026-03-1217:09

    > But here’s the hopeful part:

    I hope this is a tongue in cheek jab at how AI writes prose, because Claude loves to prefix lines with this.

  • By sastraxi 2026-03-1212:451 reply

    The idea is interesting, but have some more respect for your potential readers and actually write the post. There’s so much AI sales drivel here it’s hard to see what’s interesting about your product. I’m more interested in the choices behind your design decisions than being told “trust me, it’ll work”.

    • By bugarela 2026-03-1218:422 reply

      Hey! We were not really sure how to pass on the information back when I wrote this in November, but since then we've packaged an opensourced all agents and AI stuff involved in that post: https://github.com/informalsystems/quint-llm-kit

      It's true what they say that it is easy to make a demo in AI, but super hard to turn demo into some product or thing other people can use. We are trying :) but also, most posts I read on this topic are just philosophical and give absolutely nothing you can learn and use. We are trying to provide concrete ideas on the things we are exploring, like in our newest post: https://quint-lang.org/posts/cognitive_debt

      I'm also a bit happy you see some sales drive in that post since I'm 100% technical and trying to be more sales-inclined. I'm learning to find the balance. If it helps, it's more like I'm so extremely hyped about this and want to convince people to use it. And everything we built so far is open source, so it's really about selling the cool idea of formal methods at this point.

      • By sastraxi 2026-03-1222:311 reply

        Yeah, I understand it's hard to find the balance here. I'd imagine you feel the need to ship at the same time as writing about it. For me, the post you shared in your reply as well as in the OP have been expanded to be about 3-4x as long as they need to be, and I'm going to assume you're using AI to generate them due to the writing style. My feedback is to consider writing shorter posts, but doing it by hand -- your prose style here is friendly, engaging and direct. I wish that your articles were the same.

        TL;DR: I feel like my time is being wasted when I read AI-written articles, so I stop reading. Do with my anecdote what you will!

        • By bugarela 2026-03-149:57

          Interesting, thank you! I feel the same about avoiding reading AI stuff, that's my worry here. In OP, I had way too much content and had AI help me reduce the size of if. But I learned a lot since then and some parts of the post bother me a bit now.

          For the second post, it was a wall of prose that my colleague, Josef, wrote (I'm pretty sure without AI). I added the more concrete example sections and then asked AI to suggest some heading names to break down the prose, and I'm assuming this was my mistake as the headings are the first thing people read and AI loves headings. I always broke down text in headings before AI, this suck haha.

          Anyway, I'll definitely increase my neurotic levels about this. It's not worth at all to rush things if it will make the post more empty to read in any way.

      • By ptak_dev 2026-03-1220:58

        [flagged]

HackerNews