Talking About LLMs
This is a blog-post version of a toot thread I wrote as part of a discussion of LLMs–“Large Language Models”, the type of technology behind the current AI hype.
Let me start by saying this: criticism of LLMs tends to come in three different flavors. First, LLMs are bad at what they claim to do. Second, LLMs are bad for ethical reasons, because of how they’re built and how they’re deployed. Third, LLMs are bad due to their environmental impact. It’s tricky, because arguments against LLMs tend to blur those flavors together a bit, and because defenders often shift their defense to a flavor the argument wasn’t addressing.
In this post, I’m going to talk about criticism #1 (“LLMs are bad at what they claim to do”), because without that the rest is irrelevant. That said, I’m also going to assume criticisms #2 and #3 are at least somewhat valid, if only from a cost/benefits perspective.
“LLMs are bad at what they claim to do” is particularly hard to argue because it’s straightforwardly, obviously true to anybody with a background in machine learning (or psychology, or several other disciplines), and it seems just as clearly false to someone without those backgrounds who interacts with an LLM. It’s further complicated by “what they claim to do” covering a very wide range, from “decent grammar-checker” to “lets you fire everybody who isn’t a venture capitalist” to “on the cusp of being a general intelligence greater than any human and capable of destroying civilization”.
Now, in some ways, LLMs are decent grammar-checkers. In fact, they can be seen as a victory for the “grammar is descriptive not prescriptive” crowd–they spot patterns in human-generated text and can point out where your text matches or deviates from those patterns. However, if that’s their use, they’re absurdly expensive from the perspectives of criticisms #2 and #3. Indeed, even accepting the downside of mining all copyrightable material on the internet, you could use that same data to build a purpose-built grammar-checker that boils far fewer oceans and ends up small enough to run on a toaster. Without accepting that, you could build one from scratch and get an even smaller checker that is very nearly as good (word processors have been shipping these for decades, and the underlying work predates even them).
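To make the “spot patterns” idea concrete, here’s a toy sketch of the from-scratch approach. The tiny corpus and the bigram trick are invented purely for illustration–a real checker is far more sophisticated–but the point stands that nothing about this requires an ocean:

```python
# Toy pattern-based checker: learn bigrams from a reference corpus, then
# flag word pairs in new text that the corpus has never seen. Everything
# here (corpus, tokenization, the bigram idea itself) is a deliberately
# tiny stand-in for what a real grammar-checker would do.
from collections import Counter

reference = "the cat sat on the mat . the dog sat on the rug .".split()
bigrams = Counter(zip(reference, reference[1:]))

def flag_unusual(text):
    words = text.lower().split()
    return [pair for pair in zip(words, words[1:]) if bigrams[pair] == 0]

print(flag_unusual("the cat sat on mat"))  # [('on', 'mat')] -- deviates from the corpus
```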
Further, an LLM has weaknesses even as a grammar-checker that make human checking a requirement. LLMs do no actual analysis, for example, so they cannot tell when somebody is code-switching to a different register within a given language. The rules of grammar often change during a code switch, so your LLM will end up making mistakes in those situations. Similarly, as pointed out elsewhere, an LLM is by its nature a snapshot of the past, and cannot be updated for new changes to grammar (which we are accepting as part and parcel of the “grammar as descriptive” approach). Instead, you have to boil an ocean to build a new model every time it needs an update.
Even supposing that the LLM is better than the alternatives for checking grammar, how much better must it be to justify its outrageously larger cost? If a from-scratch decision tree can get you 90% accuracy, and a purpose-built derived-from-everything grammar checker can get you to 94%, how much are we willing to pay to get our LLM to 97%? Are we still willing to pay it even knowing it will never hit 100%?
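To put some invented numbers on that trade-off (none of these costs or accuracies are real measurements; they only exist to make the shape of the question visible):

```python
# Hypothetical accuracy and relative-cost figures, invented for illustration.
checkers = [
    ("from-scratch decision tree", 0.90, 1),
    ("purpose-built checker",      0.94, 20),
    ("general-purpose LLM",        0.97, 2000),
]

for name, accuracy, relative_cost in checkers:
    errors = round((1 - accuracy) * 1000)
    print(f"{name}: ~{errors} errors per 1000 sentences at {relative_cost}x the baseline cost")
```

Even with made-up numbers, the pattern is the one that matters: the last few points of accuracy are the ones you pay the most for, and you still need a human pass at the end.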
And that’s just one very low-end claim. Other low-end claims include:
“LLMs are better search engines”. It’s not at all clear to me that this is true. We know, for example, that Google has intentionally made its search engine worse in order to increase time spent on the site. We also know that people have used LLMs (and, to be fair, less costly but similarly spammy text-generation methods) to populate the web with increasing amounts of un-useful text, so building a search engine is harder than it used to be. Critically, given the latter, we’re looking at LLMs-as-search-engines being trained on text generated by non-humans, which is a feedback loop guaranteed to degrade quality¹.
In other words, even if LLMs are better than Google as a search engine right now (which is questionable, as far as I’m concerned), they’re not necessarily better than non-enshittified search engines, and they’re also doomed given the nature of current internet content.
“LLMs are great for making software”. There are a couple of points to make here. First, I’ll say that I expect programming languages, being as strict and formal as they are, to be easier for LLMs to generate “correctly” than natural-language text. And yet, LLMs will still sometimes generate code that isn’t even syntactically correct. That’s before you get into semantic correctness, which is what you need for the code to work as you expect and intend.
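As a small Python illustration of the gap between the two kinds of correctness (the “generated” snippet below is something I made up, not real model output):

```python
import ast

# Pretend this string came back from a model; it's invented for illustration.
generated = "def reverse(s):\n    return s[::-1"   # note the missing bracket

# Syntactic correctness is cheap to check mechanically...
try:
    ast.parse(generated)
    print("parses fine")
except SyntaxError as err:
    print(f"not even syntactically valid: {err.msg}")

# ...but semantic correctness only shows up when someone who knows the
# intent writes a test. Even a syntactically valid version still has to pass this:
fixed = "def reverse(s):\n    return s[::-1]"
namespace = {}
exec(fixed, namespace)
assert namespace["reverse"]("abc") == "cba"
```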
And that’s well before you get to the second point, which is that making software isn’t just about generating code. A professional software developer spends time coding, yes, but also deciding what code shouldn’t be added. And how new code should be added. The process of code review wasn’t created to show off all the new code; it was created because it matters what code we add, and how we add it.
Every large codebase has its own semantics and standards of style and organization, often unwritten. Unless the LLM is trained specifically on that codebase, and not on similar codebases with conflicting standards, it’s going to generate bad code. Not necessarily code that doesn’t work, but code that makes maintaining the whole system much more difficult by using different definitions and formats than the rest of the system.
If you work with a codebase long enough, you can look at some code and say with a reasonable degree of confidence “oh, that’s Jem’s code, it’s in their style”. That is, in itself, an added complexity like the ones I mention above, but with an LLM it becomes so much worse: every single contribution is potentially from a different “coder” with a different “style”.
And all that is before we get to bugs. Not the sort that prevents compiling, but the sort that happens when the author has no domain knowledge of the business. Or, worse, the sort that can be exploited to compromise data. That domain knowledge is something the LLM simply cannot have, barring an exhaustive prompt that your legal team would gasp at letting out of the company. The latter sort of bug runs afoul of two problems: one, most code that the LLM is trained on is unlikely to be as strict about security as you need it to be, and two, that field is ever-evolving, so your LLM will never be up-to-date (and will require constant ocean-boiling to update).
“Okay, sure, but ‘vibecoding’”. Let’s say you don’t care about correctness, or security, and you’re working on an entirely new codebase so you don’t care about standards or style. You just want to have fun and learn a new library or framework.
The old way to do this: Work through the tutorial. Work through a couple of howtos. Trial-and-error your way to adding what you want as you figure out the new library or framework.
Vibecoding approach: Ask the LLM to get you past the tutorial and howtos and directly to the trial-and-error phase, ideally without the errors. Add what you want without the base of knowledge from the tutorial or howtos.
Are LLMs good for this? *shrug* Sure, maybe? In my opinion there are better ways of having fun while doing tinker-coding (notably: using a language or environment that is more fun by design (e.g. a modern BASIC, or Racket, or Scratch, or whatever)). But fun varies, so if it works for you, then, yeah, I’ll say “LLMs are good for this”.
But that brings us back to criticisms #2 and #3–is the fun of vibecoding worth the informational and environmental cost of LLMs?
And there remains a major limitation: vibecoding is going to be much better supported for libraries and frameworks that already have a large body of code out there to digest. By its nature, an LLM is simply unlikely to give you useful results for a brand-new framework. And if there are already plenty of code examples out there, vibecoding gets closer and closer to “LLMs are better search engines”, which we’ve established is at best questionably true.
All this leads to a really big question: if LLMs are good at only a few of the very low-end claims, why would we even discuss the high-end claims? And without the high-end claims, how can we justify the costs in criticisms #2 and #3?
-
¹ In short: let’s assume that human-generated content is higher quality than LLM-generated content (a reasonable assumption given that LLMs are built to mimic human-generated content). Let X be the probability that any given LLM output will be un-human-like. It’s safe to infer (based on how LLMs work) that X will vary with the quality of the training data. If the training data is “the whole internet”, and we’re seeing an increase in non-human-generated content in that training data, then X will increase over time. Based on that, it’s reasonable to believe that LLMs may be at their peak quality right now.
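For what it’s worth, here’s a toy sketch of that dynamic. Every number in it is invented, and it’s a cartoon of the argument rather than a model of any real system:

```python
# Cartoon of the feedback loop: model output leaks back onto the web,
# the human share of training text shrinks, and X creeps upward.
human_share = 0.95   # invented: fraction of scraped text written by humans
base_error = 0.05    # invented: chance of un-human-like output even on human text
x = base_error

for generation in range(1, 6):
    human_share *= (1 - x)                    # model output dilutes the human share
    x = 1 - human_share * (1 - base_error)    # assume X tracks the non-human share
    print(f"generation {generation}: human share {human_share:.2f}, X {x:.2f}")
```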