Against vibes: When is a generative model useful?

www.williamjbowman.com

2026-03-05 :: academia, research

Let’s suppose I wanted to answer a question: is tool X useful for task Y? If I were scientific about this, I would analyze the properties of tool X and develop a model, analyze task Y and its requirements and develop a model, and then use my models to predict the behaviour of tool X in the context of task Y. “Can I use timber instead of stainless steel as a support beam for this structure?” “Will this acid be an appropriate solvent for this reaction?” “Will this programming language provide these real-time guarantees?”

The discourse on generative models is not like this. Instead, you get claims like “software engineering is dead”, and attempts to shove generative models into literally everything without a thought. Search? Generative models. Code completion? Generative models. Summarization? Generative models. Voice to text? Generative models. Stock images? Generative models.

Any attempt to criticize this tends to go in circles and/or have people arguing past each other. Is a generative model useful for internet search? Well, look, it produces text that is plausibly related to the input prompt, so. So… what? That doesn’t answer the question. But the newest models are so much better! Better at… what?

I was upset about this when it was being called “prompt engineering”, and I found no sign of engineering, but instead a series of vibes about how to phrase a prompt for a particular version of a particular model, which sometimes produced output that was plausibly related to the input prompt and therefore plausibly close to what you might have intended. I’m upset now when people claim that agents are so useful, but can’t tell me when or why or how they’re useful beyond vibes about feeling more productive (vibes that have been refuted by real science contrasting objective measures of productivity with subjective reports), or examples of having produced a lot of plausible output.

(Okay, there are some researchers doing actual science and writing papers; I’m talking about the arguments being made as this stuff is integrating into schools, workplaces, etc.)

I want to know when generative models are useful. I don’t want to feel like they’re useful; that’s just a vibe. I’ve been a generative model skeptic basically from the beginning. I could not convince myself that generative models were useful. But I was also skeptical of my own subjective experience. I could imagine that a model capable of producing code from natural language would be useful in some use cases that I had not found. I imagined there must be a model of when a generative model X is useful for task Y.

In this post, I’m not addressing ethical, political, or social questions. Those questions are important, and I want to address them separately from what the technology is capable of. Just for context: I think the widespread deployment of this technology is deeply problematic and irresponsible. I think further investment in it at its current scale is an almost criminal level of fiduciary negligence and will cause economic harm. I think the ethics of all of this are deeply troubling.

But for now, I just want to know what they’re technically capable of.

A model of generative model utility

I think the usefulness of a generative model is a function of three things:

  1. What is the cost of encoding a generative task in a prompt vs. directly producing the artifact? This is a function of the task, the model, and the user.
  2. What is the cost of verifying the generated artifact meets requirements vs. a directly produced artifact? This is mostly a function of the task and the user, but also the generative model.
  3. How much is the task dependent on the artifact vs. the process? This is a function of the task.

Each of these touches on things many others have said, but I think considering all three simultaneously is important. Together they make it possible to be scientific in an argument about the use of generative models.
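As a toy illustration, the three costs can be collapsed into a single comparison. Everything below is my own framing, not something the post prescribes: the field names, the numbers, and the idea of expressing all costs in one shared unit are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class TaskEstimate:
    """Hypothetical cost estimates for one (task, model, user) triple.
    Units are arbitrary but shared (hours, dollars, whatever)."""
    encode_cost: float       # cost of writing the prompt plus compute to run it
    direct_cost: float       # cost of producing the artifact by hand
    verify_generated: float  # cost of checking the generated artifact
    verify_direct: float     # cost of checking a hand-made artifact
    process_matters: bool    # is the task process-dependent?

def generation_useful(t: TaskEstimate) -> bool:
    """Predicted useful only when prompting plus verifying beats doing and
    checking the work directly, and only the artifact (not the process) matters."""
    if t.process_matters:
        return False
    return t.encode_cost + t.verify_generated < t.direct_cost + t.verify_direct

# Sketch of the package-install example from later in the post:
# trivial to prompt, trivial to verify, tedious to do by hand.
install = TaskEstimate(0.5, 10.0, 0.5, 0.5, False)

# Sketch of the DSL-interpreter example: verification of the generated
# code dwarfs the cost of just writing it directly.
dsl = TaskEstimate(8.0, 6.0, 12.0, 1.0, False)
```

The point of the sketch is only that a claim of usefulness is a claim about all of these variables at once, not about any one of them in isolation.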

If you want to claim a new model is “more useful”, you must specify all of these variables. You must specify a class of tasks and demonstrate that, for some set of users, the cost of encoding is lower than the cost of directly producing the artifacts; or perhaps that encoding costs are higher, but verification costs are lower.

More importantly, if I want to predict whether a generative model will be useful, I have a model to work with.

My model predicts that the usefulness of a generative model may decrease as task complexity increases. Generative models are probabilistic: the output will be less likely to satisfy complex requirements, particularly if those requirements differ from common patterns in the training data, or worse, differ subtly from common patterns. Verifying complex requirements is also hard, and harder than having a human follow good engineering processes that lead to more easily verified outputs.
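A crude way to see why usefulness may fall with complexity: under an independence assumption (mine, and a strong one), a model that satisfies each of n requirements with probability p satisfies all of them with probability p**n, which collapses quickly as n grows.

```python
def prob_all_ok(p: float, n: int) -> float:
    """Back-of-the-envelope: probability all n requirements are met, assuming
    each is independently satisfied with probability p. Real requirements are
    not independent; this is an illustration, not a measurement."""
    return p ** n

# Even a model that is right 95% of the time per requirement becomes
# unreliable once requirements compound.
print(round(prob_all_ok(0.95, 1), 3))    # 0.95
print(round(prob_all_ok(0.95, 20), 3))   # 0.358
print(round(prob_all_ok(0.95, 100), 3))  # 0.006
```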

On the other hand, generative models should be useful when directly creating the artifact is hard for the user, but verifying the artifact is trivial. This could be the case for artifacts that require cross-referencing extremely specific information that is time consuming for a user to do, but once done, is trivial to check. It could also be the case for generative models integrated into formal verification systems with extremely reliable and highly automated verification, where no knowledge of the artifact being generated is necessary. But in general, it is unlikely to be the case for a novice in some domain trying to generate a complex artifact, since the user will not have the expertise to ensure the output meets requirements. This predicts there will still be a need for users of generative models to have domain expertise.

The model also predicts that generative models are essentially useless for tasks that are highly process-dependent, since all a generative model can do is produce an artifact by a black-box process.

1. Relative Encoding Cost

A lot of arguments in favour of the usefulness of generative models make arguments about, in essence, the relative encoding cost. For a generative model to be useful, the total encoding cost must be lower than the total cost of directly producing the artifact.

The total encoding cost includes all the work that goes into writing a prompt, and all of the compute required to run the prompt. If the task is simple to express in a prompt, the total encoding cost is low. If the task is both simple to express in a prompt and tedious or difficult to produce directly, the relative encoding cost is low. As models get more capable, more complex prompts can be easily expressed: more semantically dense prompts can be used, referencing more information from the training data. An agent capable of refining or retrying a task after an initial prompt might succeed at a complex task after a single simple prompt. However, both of these also increase the compute cost of the prompt, sometimes substantially, driving up the total encoding cost. More “capable” models may have a higher probability of producing correct output, reducing the cost of reprompting with more information (“prompt engineering”), and possibly reducing verification costs.

Moreover, as a user gets more capable, they may be able to complete the tasks directly much faster than they can prompt a model to do it, driving up the relative encoding cost.

One may argue that the newest models are “more powerful” or “actually intelligent” or whatever. But these are unscientific claims.

The scientific version of these claims is “the total encoding cost (for some class of tasks) is lower than previous models”. Phrased this way, it’s clear this still doesn’t mean the new models are useful.

For most of my tasks, I think the relative encoding cost has been high. Many of my software engineering tasks are constructing small, semantically dense programs, with very specific design requirements, in a language much more concise than English that I can write fluently. I can systematically design and implement such software faster than I can encode the specification into a prompt for a generative model.

As one example, I tried using Claude Opus 4.6 to generate a program that would interpret a custom DSL I use for typesetting grammars and generate Haskell type definitions. After 8 hours of prompting and several million tokens, the code it generated was still absolutely useless. It passed the tests I had prompted it on, but just looking at the code, one could easily identify type errors and logic that special-cased specific identifiers from the tests. The logic for sanitizing identifiers was a mess and would occasionally generate empty strings. A correct implementation would take me 300-400 lines of code, which I can certainly write in less than 8 hours.

However, not all tasks involve writing semantically dense code with very tight design requirements. For example, I was recently trying to install a package whose name I forgot. I prompted the model to “install that x11 fake gui thing”, a trivial prompt. Actually completing the task myself would have required a lot of tedious work, with lots of accidental complexity. I would have needed to search the internet to identify the name of this software, cross-reference that with the distribution of the operating system I was running and the name used by its package manager, possibly cross-reference the installation command for this particular package manager, and then write and execute a shell script to perform the install. I was able to use the agent to do all of this with an extremely easy to write prompt. This task had a very low relative encoding cost.

2. Relative Verification Cost

Some arguments about generative models focus on verification: “formal verification will become more important as more code is generated”.

I think these arguments are also unscientific. Verification is not magic. Software designed one way may be easier to verify than software designed another way. A user who carefully designed and implemented the software may be able to verify it more easily than one dropped into a fully generated code base.

All of this depends on the task, the user, and what the model is capable of generating. For my prompt to install a package, verification is trivial. I will recognize the right command when I see it, and it’s one line long.

Relative verification cost also goes up with the size of the task. If you’re generating a one-line script, no problem. If you’re trying to generate a very large artifact, you’re going to get bored validating every command and every edit. You’re not going to be able to check every line of code that was generated. You’re going to need some other approach to verifying the output, increasing the cost.

Relative verification cost also depends on the user. If I’m prompting a model to produce Racket, a language I am very fluent in, I can quickly evaluate the design and implementation of the generated code. If I tried to prompt a model to produce C, I’d be far better off just writing the C myself, following a systematic approach that would result in safe C. And then running it in a sandbox. After running some sanitizers on it.

Relative verification cost somewhat depends on the capabilities of the model, too. Some of the early models I experimented with produced trash code. Not merely bad code with bad design, but errors so basic I wouldn’t think to look for them: it would produce Racket with mismatched parentheses, references to functions that didn’t exist, etc. Those are easy enough to detect by running the compiler, but what about the ones that aren’t so easy to detect?
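Errors that basic are mechanically checkable, which is exactly what makes them cheap to verify. A minimal sketch of such a first-pass filter for the mismatched-parenthesis case (my illustration, not a tool mentioned in the post) might look like:

```python
def parens_balanced(code: str) -> bool:
    """Cheap first-pass check for generated Lisp-family code: are the
    parentheses balanced? Simplification: ignores parentheses inside
    strings and comments, and says nothing about undefined functions
    or subtle logic errors."""
    depth = 0
    for ch in code:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing paren with no matching open
                return False
    return depth == 0

print(parens_balanced("(define (f x) (+ x 1))"))  # True
print(parens_balanced("(define (f x) (+ x 1)"))   # False
```

A check like this catches exactly one easy error class; the subtle, plausible errors are the ones no mechanical filter this simple can touch.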

One key part of this relative verification cost is that generative models produce plausible output. It’s not accurate to say a model produces “correct” or “incorrect” output, or “makes mistakes”. It does exactly what it’s designed to do: produce output that is statistically related to the input prompt, in some way. That doesn’t mean “statistically correct”, just “statistically related”. All output is correct, in the sense that all it’s supposed to be is a point in the distribution of things related to the prompt. Maybe it produces C code with memory errors most of the time, but most C code has memory errors. Maybe it mostly produces correct bash scripts for installing packages, because most bash scripts for installing packages on the internet are correct.

The plausibility of generative model output greatly increases the relative verification cost, since the output is essentially optimized to be close to correct. I’d predict that relative verification cost could go up as the models get more complex. The class of errors we’re likely to find in generated code will be very different from the class of errors we’re used to looking for in human-written code: generated code will have subtle errors. As the models get more capable, you might be more likely to trust the output, and less likely to spot these subtle errors. This cost can be reduced by formal methods, but formal methods aren’t necessarily cheap. You might be better off with an engineer following a design process.

For some tasks, verifying the output may be impossible, or at least impossible without redoing exactly the work you were trying to use a generative model for. I think internet search is a good example of this. A generated response to an internet search query that you did not know the answer to is essentially unverifiable, unless you go and search for credible sources to verify the summary. At which point, the generative model’s work was entirely wasted.

3. Artifact vs. Process

Some tasks aren’t about the output. Or maybe they aren’t just about the output, and require the output be created following a specific process.

Easy examples of this are in education. I don’t need students to implement factorial for the one billionth time because I need an implementation of factorial. They implement factorial because going through the process creates knowledge in their head. Writing code is a fundamentally different process than reading code, in the same way that writing this blog post is a fundamentally different process than reading it.

This blog post is an example of a process-driven task. I’m writing this post. My hands are typing the words that appear in this post. They are not merely typing prompts that cause a generative model to generate plausibly-related words. That’s because I’m not trying to create a blog post. I’m trying to create knowledge, within myself and then within others. Writing this post is me thinking through all the details.

Process-driven tasks also come up in engineering. Some engineering requires that specific processes be followed, because if they are, the end result will satisfy certain properties that may be difficult or impossible to verify by looking at the artifact alone. A large fruit company, for example, might forbid engineers from contributing to open source projects as part of the process by which they engineer software, in order to mitigate intellectual property risks. There is no way, just looking at the software the engineer writes, to guarantee freedom from those risks.

The process argument against generative models comes up a lot. The argument goes something like “X is about human communication, or creativity, so generative models cannot be used to create Y”. And I really sympathize with this argument, because I think far too much is produced by ignoring the process.

But there are certainly tasks for which I really only* care about the output. My shell script example is one: I don’t care how the package gets installed; I care that it’s installed. (* Well, assuming the output wasn’t produced through some truly problematic process, which, well… but that’s a future post.)

The same can be true of writing. I am writing this post manually, because the process matters, but some writing is functional. For example, I used a generative model to draft a policy document. Policy documents don’t have much creative structure to them; they express a set of rules. I’m still reading and redrafting the document, since I need the particulars to better suit me, but it was useful to start from a generated draft, in the same way I might start from a template.

But even when the output is boring and easily verified, the process may be important. Junior engineers might write lots of boring, easily verified code. It might be extremely cheap to replace them with agents. But junior engineers writing that code are going through a process by which they gain experience, knowledge, and skills. Generative models can replace their output, but nothing can replace that process.

For almost all software I write, I do care about the process. I’m typically designing software as part of research, and me doing the design and implementation work creates knowledge that I will then share. The software isn’t the important output, or not the only important output. I think this is another big reason I haven’t found these things useful, and why it’s been such a struggle to figure out how they could possibly be useful.

We just need people who can make a computer produce useful work

So when is a generative model useful? Just when (1) the relative cost of encoding the work in a prompt is low (compared to doing the work some other way); and/or (2) the relative cost of verifying that the output satisfies requirements is low; and (3) the process used to complete the work doesn’t matter. To judge all of this accurately, the user of the model needs to know quite a lot about the work being done, about verifying design requirements in the domain, and about working with generative models in general and/or the model in question.

Navigating these trade-offs is engineering. If you’re navigating those trade-offs to produce software, you’re doing software engineering. If you’re not considering these trade-offs, you’re just going on vibes and what you produce will be something between accidentally useful and extremely harmful.

These trade-offs aren’t unique to generative models, but one thing is: they’ve made it incredibly cheap to produce an immense amount of output that is plausibly described by a natural language description. But plausible doesn’t mean useful, and there’s nothing in generative models that could ever guarantee useful output. As the models get more sophisticated, the output and the prompts get more complex. That’s not necessarily more useful. As that complexity goes up, so do the costs: of compute, of verification, and of relying on output over process.

I understand the temptation of these tools. Sometimes useful work is incredibly complex and frustrating to do. Writing software, running scripts, and organizing all my notes can be very tedious. Sometimes that is accidental complexity, but much of the time it is essential. It is very easy to use a generative model to produce output. I don’t think it’s very easy to use them to produce useful output.


Comments

  • By andai 2026-03-12 3:32

    > What is the cost of verifying the generated artifact meets requirements vs. a directly produced artifact? This is mostly a function of the task and the user, but also the generative model.

    So this is the fun one for programming.

    I let AI agents do some programming on my codebases, but then I had to spend more time catching up with their changes.

    So first I was bored waiting for them to finish, and then I was confused and frustrated making sense of the result.

    Whereas, when I am asking AI small things like "edit this function so it does this instead", and accepting changes manually, my mental model stays synced the whole time. And I can stay active and in flow.

    (Also for such fine grained tasks, small fast cheap models are actually superior because they allow realtime usage. Even small latency makes a big difference.)

    • By lukan 2026-03-12 6:34

      Yes, the more you let agents loose, the less you are in control and the more time you spend later cleaning up their mess.

      It is tempting to let them loose after they have delivered unexpectedly good results for a while, but for me it is not worth it. Manually approve and actually read. (And manually edit CLAUDE.md etc. if necessary.)

      • By dreadnip 2026-03-12 9:23

        This is exactly why I don't like those "swarm" approaches with 8 Claude Code instances running in parallel. Every time I've tried it I instantly lose control and become out of touch with the codebase. The produced output is simply too fast and large to follow, so I tune out and it becomes a 100% vibe coded project.

        • By dominotw 2026-03-12 12:59

          start with good prompts and good intentions, drift into sloppy prompt vibecoding, finally a "still not working" prompt in a loop.

          this has been my story in every one of my personal projects.

  • By smilindave26 2026-03-12 1:48

    > For almost all software I write, I do care about the process. I’m typically designing software as part of research, and me doing the design and implementation work creates knowledge that I will then share.

    Similar here. For a lot of software I write, I don't really know what the essential "abstraction" I need is until I'm actively writing it. The answers, when I get them right, look obvious in retrospect. Sometimes, starting with Claude Code, I can get there, but my mindset is that I'm using this tool to generate software that helps me immerse myself in the problem space. It's a different pace to the process - sometimes it speeds me up, sometimes I end up taking bad concepts a lot further than I normally would before getting to the better path.

    • By neonstatic 2026-03-12 3:01

      I agree it's a different process. Personally, I do not enjoy it. If I get code wrong or the solution I came up with is clunky, I am okay to start over. At least I learned something valuable. With Claude, I get irritated, frustrated, and frankly just really tired. I feel like I've been burning hour after hour of my precious time trying to explain something to a machine, which just doesn't understand, cannot understand, and what comes out as output from that process is just disappointing. I feel that I don't trust the code it produces and I don't have it in me to even read that code. I never felt that way about code written by me or another person.

      I will admit that Claude has been helpful as an assistant (especially helping me with syntax I am not familiar with), but as a programmer that does things for me, it's been awful. YMMV.

      Btw, a week of doing that (treating Claude as a programmer who does things for me) did help me in a way. I now have an intuitive understanding of what it means that these things are not intelligent. I am now certain that an LLM doesn't understand anything. It seems to be able to map text to some representations and then see if these representations match or compose. I know this might sound like intelligence, but in practice it's just not enough. Pattern recognition, sure. Not intelligence. Not even close.

      • By lukan 2026-03-12 8:06

        " Pattern recognition, sure. Not intelligence. Not even close."

        To me it is a form of intelligence, just not general intelligence.

        And yes, the trick is not treat them as intelligent, but like an idiot. Explain every single detail. Document everything in detail. Remove anything distracting. And then it might work like a charm at times.

        • By neonstatic 2026-03-12 8:34

          Not to be nitpicky or difficult, but I find it strange that we don't really have a solid, agreed-upon definition of intelligence, yet suddenly we have variants of the non-definition: general, super, etc. I think it's just marketing fluff.

          If the model understood what it sees, it wouldn't need to be treated like someone who doesn't? And if it doesn't understand, how can it be intelligent?

          • By lukan 2026-03-12 9:48

            I don't have the answer here.

            I just know that if I pointed an average human to a messy old codebase, he or she would just shrug helplessly. Even most programmers.

            But if I tell Claude to start digging in, refactor, update outdated tools… it produces results. So there is some "understanding"; I don't know what else to call it. So surely it is not a general intelligence, but it is certainly useful.

            • By neonstatic 2026-03-12 19:15

              I think what you are describing is the actual usefulness of the tech. It can do some things and, contrary to humans, it doesn't get demotivated or uninterested; it's a machine.

              I will stick to my earlier statement (I hope I made it in this thread): it seems to be treating blocks of text as concepts and tries to compose those concepts like Lego blocks. It is quite amazing that it can transform characters into meanings, even if it doesn't really understand those meanings, then compose them. I just don't think that's enough to call it intelligent (but it's certainly enough to find it useful for some tasks, as you point out).

  • By janalsncm 2026-03-12 6:31

    LLMs have significantly reduced the time I’ve spent chasing down cryptic errors on stack overflow, old github issues, or asking in random slack channels about it. Even if that’s all they did, they would be very valuable.

    If that means I’m actually coding instead of figuring out why xyz random plugin isn’t doing its job right now, some subsystem that I need but don’t care to learn the internals of, then I am happy.

HackerNews