Why LLMs Alone Can’t Fix the Documentation Problem

My co-founder recently wrote about how technical documentation became infrastructure in the agentic coding era — and why manual writing can’t meet the bar that emerging agent consumers of technical documentation require. The thesis was that documentation has shifted from a human-readable artifact to a machine-executable one, and that the accuracy floor has risen accordingly.

That post named the problem. This one is about the technical path through it.

There are two approaches to generating technical documentation: LLM-first and structural reasoning-first. An LLM-first approach uses an LLM to do it all: read the code, decide what’s true about it, and write the documentation. Think: asking Claude, ChatGPT, or Claude Code to look at your repo and your docs to generate doc updates. A structural reasoning-first approach starts with deterministic deep static analysis to extract the formal properties of the code — symbols, signatures, types, call relationships, inheritance chains. Then it employs an LLM to write prose grounded in those extracted facts.

Both approaches involve LLMs producing prose, which is the kind of task LLMs are unambiguously good at. Where they fundamentally differ is in how they decide what is true about the code. In the LLM-first approach, establishing ground truth about the code is done by the LLM using probabilistic generation. In the structural reasoning-first approach, establishing ground truth about the code is done by deterministic extraction.

This post is the argument for why a structural reasoning-first approach is the only viable path forward.

Shortcomings of establishing ground truth with a probabilistic model

The failure mode of LLM-first documentation isn’t that it produces bad output most of the time. It’s that the output reads as uniformly confident, while a fraction of it is factually wrong in ways the system itself cannot detect. That fraction is the visible symptom of a deeper mismatch: a probabilistic method applied to a deterministic source.

Phantom references are the most prominent failure. An LLM generating a code sample sometimes produces a function signature that doesn’t exist in the actual codebase — drawn from a similar pattern in its training data rather than from the code in front of it. The output looks plausible. The function it references doesn’t exist, or the argument order is wrong, or the type is incorrect. The model isn’t lying; it’s interpolating. Interpolation is a normal property of probabilistic methods. It’s just not a property we should accept when the source is actually deterministic.
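A phantom reference is the kind of error a deterministic check can catch mechanically. The following is a minimal sketch, assuming Python source and the standard-library ast module. The module source, the function names charge_card and refund_card, and the helper find_phantom_calls are all hypothetical, made up for illustration; this is not a description of how any particular tool works.

```python
import ast

# Hypothetical module source standing in for "the actual codebase".
SOURCE = '''
def charge_card(customer_id: str, amount_cents: int) -> bool:
    return True
'''

# Hypothetical code sample, as an LLM might write it for the docs.
DOC_SAMPLE = '''
ok = charge_card("cus_123", 500)
refund_card("cus_123", 500)
'''


def defined_functions(source: str) -> set[str]:
    """Names of every function defined anywhere in the source."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def find_phantom_calls(doc_sample: str, known: set[str]) -> list[str]:
    """Plain-name calls in the doc sample that match no known definition.

    Simplification: builtins and imported names are not resolved, so a real
    checker would need to account for them before flagging a call.
    """
    tree = ast.parse(doc_sample)
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    return sorted(called - known)


phantoms = find_phantom_calls(DOC_SAMPLE, defined_functions(SOURCE))
print(phantoms)  # ['refund_card']
```

The check never asks whether refund_card is plausible. It asks only whether a definition exists, which has a yes-or-no answer.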

Missed cross-file relationships are the opposite failure. Where phantom references involve the model claiming things that aren’t there, missed cross-file relationships involve the model missing things that are. Both are consequences of LLM generation operating on partial information. An LLM looking at a single file, or even a small set of retrieved files, sees only a fraction of the relevant context. It writes documentation based on what it can see and infers the rest by pattern-matching against similar codebases. The function being documented may delegate to a helper in another module, which wraps a third call that ultimately determines behavior. An agent without a dependency graph misses this chain because it is doing local pattern recognition. The documentation describes what the function appears to do, but the actual behavior emerges from interactions the model never traced.
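The delegation chain described above can be traced with a small call graph. This is a rough sketch, assuming Python and the standard-library ast module. The three modules and their function names are hypothetical, and calls are matched by bare name across files, which is a deliberate simplification; a real tool would resolve imports and scopes.

```python
import ast

# Hypothetical three-module codebase: api -> helpers -> tax.
MODULES = {
    "api.py": (
        "from helpers import finalize\n"
        "def create_invoice(order):\n"
        "    return finalize(order)\n"
    ),
    "helpers.py": (
        "from tax import apply_tax\n"
        "def finalize(order):\n"
        "    return apply_tax(order)\n"
    ),
    "tax.py": (
        "def apply_tax(order):\n"
        "    return order * 1.2\n"
    ),
}


def build_call_graph(modules: dict[str, str]) -> dict[str, set[str]]:
    """Map each function name to the plain names it calls directly."""
    graph: dict[str, set[str]] = {}
    for source in modules.values():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                graph[node.name] = {
                    call.func.id
                    for call in ast.walk(node)
                    if isinstance(call, ast.Call)
                    and isinstance(call.func, ast.Name)
                }
    return graph


def transitive_callees(graph: dict[str, set[str]], start: str) -> list[str]:
    """Everything reachable from `start`, in discovery order."""
    seen: list[str] = []
    stack = sorted(graph.get(start, ()))
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.append(name)
        stack.extend(sorted(graph.get(name, ())))
    return seen


graph = build_call_graph(MODULES)
chain = transitive_callees(graph, "create_invoice")
print(chain)  # ['finalize', 'apply_tax']
```

Reading api.py alone shows only the call to finalize. The graph shows that apply_tax, two files away, is what determines the result.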

Incorrect pattern matching is the most subtle of the three. The first two failures involve the model getting something wrong about what exists. This one involves the model getting something wrong about how existing code behaves. LLMs are trained to recognize patterns and produce output consistent with what those patterns usually mean. When a function looks like other functions the model has seen, the documentation describes it as if it behaves like them. Most of the time it does. Some of the time it doesn’t, as when the function does something subtly different from what its name and signature suggest. The documentation will still confidently assert the expected behavior rather than the actual one.
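As one illustration of a name that misleads, consider a function called is_valid. Most readers, and most models, would describe it as a pure predicate. The sketch below, assuming Python and the standard-library ast module, checks structurally whether the function assigns to attributes of its own parameters. The function is_valid, its body, and the helper mutated_parameters are hypothetical, and the check covers only direct attribute assignment, not every form of side effect.

```python
import ast

# Hypothetical function: the name suggests a read-only check,
# but the body also writes to its argument.
SOURCE = '''
def is_valid(user):
    user.last_checked = "now"
    return bool(user.email)
'''


def mutated_parameters(source: str, func_name: str) -> list[str]:
    """Attributes of the function's own parameters that it assigns to."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            params = {arg.arg for arg in node.args.args}
            mutated = set()
            for stmt in ast.walk(node):
                if isinstance(stmt, ast.Assign):
                    targets = stmt.targets
                elif isinstance(stmt, (ast.AugAssign, ast.AnnAssign)):
                    targets = [stmt.target]
                else:
                    continue
                for target in targets:
                    if (
                        isinstance(target, ast.Attribute)
                        and isinstance(target.value, ast.Name)
                        and target.value.id in params
                    ):
                        mutated.add(f"{target.value.id}.{target.attr}")
            return sorted(mutated)
    raise LookupError(f"no function named {func_name!r}")


side_effects = mutated_parameters(SOURCE, "is_valid")
print(side_effects)  # ['user.last_checked']
```

A documentation pipeline with this fact in hand can state that is_valid updates user.last_checked instead of asserting the behavior the name implies.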

These three failure modes are all manifestations of using LLMs to establish ground truth about code.

Why structural reasoning-first is the necessary path

The two approaches to establishing ground truth about code are philosophically different.

LLM-first reads the codebase’s textual surface and generates the most probable account of what the code does. LLMs are interpretive, and interpretation is probabilistic, so the answers an LLM provides can change as the model or its inputs change, even when the code doesn’t. LLM agents equipped with grep and similar tools still miss important context, because simple text search finds matching strings without a grounded understanding of what those strings refer to.

Structural reasoning-first uses deterministic extraction to read the code’s formal structure, which means the ground truth it establishes changes only when the code itself changes.
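That stability property can be demonstrated directly. The sketch below assumes Python 3.9 or later (for ast.unparse) and uses only the standard library. The function area, the three source variants, and the helpers extract_facts and fingerprint are hypothetical. Only signatures are extracted here, so this fingerprint tracks signature changes and nothing else; that narrow scope is a simplification for the example.

```python
import ast
import hashlib
import json


def extract_facts(source: str) -> list[dict]:
    """Signature facts for every function, in a stable order."""
    facts = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            facts.append({
                "name": node.name,
                "params": [arg.arg for arg in node.args.args],
                "returns": ast.unparse(node.returns) if node.returns else None,
            })
    return sorted(facts, key=lambda fact: fact["name"])


def fingerprint(source: str) -> str:
    """Hash of the extracted facts, not of the raw text."""
    payload = json.dumps(extract_facts(source), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same code, different surface text (comments and blank lines only).
V1 = "def area(w: int, h: int) -> int:\n    return w * h\n"
V1_REFORMATTED = (
    "# compute area\n"
    "def area(w: int, h: int) -> int:\n"
    "\n"
    "    return w * h  # product\n"
)
# The signature itself changed.
V2 = "def area(w: int, h: int, scale: int) -> int:\n    return w * h * scale\n"

same = fingerprint(V1) == fingerprint(V1_REFORMATTED)
changed = fingerprint(V1) != fingerprint(V2)
print(same, changed)  # True True
```

Running the extraction twice on the same code yields the same facts, and cosmetic edits leave them untouched. Neither property can be assumed of sampled model output.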

LLMs — large language models — were designed to model patterns in natural language and produce plausible text in response to prompts. They’re probability machines: they generate output by sampling from learned distributions of what’s likely to come next. Natural human language does require interpretation. Sentences have multiple plausible meanings; words have context-dependent definitions; reference depends on shared knowledge between writer and reader. A probabilistic approach matches that ambiguity — given the same input, multiple outputs can all be reasonable, and the model’s job is to produce one of the reasonable ones.

But using an LLM’s probabilistic approach to establish ground truth about code, in effect interpreting the code the way you’d interpret a poem, introduces uncertainty where the code never had any. LLMs cannot be the layer that establishes the ground truth that technical documentation is based on. That layer should be deterministic, because the source it draws from is deterministic.

When you build documentation from the abstract syntax tree (AST) up, you start with a deterministic representation of what the code is. Symbols, function signatures, type information, import relationships, and call structure are extracted from the code’s actual structure, not inferred from its surface text. So in technical documentation grounded in the AST, the symbols it references actually exist. The function signatures it describes actually match the code. The cross-file relationships it represents actually hold.
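To make "from the AST up" concrete, here is a minimal sketch of what extraction can look like, assuming Python 3.9 or later and the standard-library ast module. The Invoice class, its base class BaseDocument, and the helper extract_structure are hypothetical. A production extractor would cover far more (nested scopes, decorators, resolved types), so treat this as an outline of the idea, not an implementation.

```python
import ast

# Hypothetical source file to document.
SOURCE = '''
from decimal import Decimal
import json

class Invoice(BaseDocument):
    def total(self, tax_rate: Decimal) -> Decimal:
        return self.subtotal * (1 + tax_rate)
'''


def extract_structure(source: str) -> dict:
    """Collect imports, class bases, and function signatures from the AST."""
    imports: list[str] = []
    classes: dict[str, list[str]] = {}
    functions: dict[str, dict] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            module = node.module or "."  # relative imports have no module name
            imports.extend(f"{module}.{alias.name}" for alias in node.names)
        elif isinstance(node, ast.ClassDef):
            # Base classes give the inheritance chain, as written in the code.
            classes[node.name] = [ast.unparse(base) for base in node.bases]
        elif isinstance(node, ast.FunctionDef):
            functions[node.name] = {
                "params": ast.unparse(node.args),
                "returns": ast.unparse(node.returns) if node.returns else None,
            }
    return {"imports": sorted(imports), "classes": classes, "functions": functions}


structure = extract_structure(SOURCE)
print(structure["functions"]["total"])
# {'params': 'self, tax_rate: Decimal', 'returns': 'Decimal'}
```

An LLM asked to write prose about Invoice.total can then be handed this dictionary as fixed input, so the signature in the docs is copied from the code and never guessed.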

Etchblok’s market thesis is that technical documentation must be accurate — both because inaccuracy has a real business cost and because AI agents are increasingly consuming docs as ground truth. Etchblok solves documentation drift by tethering the docs to the code, so the docs change with the code. If we won’t tolerate inaccuracy from doc drift, why would we tolerate inaccuracy from an LLM’s probabilistic estimate of the code’s ground truth?

The counterargument

Some may argue: “LLMs are improving quickly. The grounding problem will be solved by better models, longer context windows, and improved retrieval.”

Better models and inputs will produce more accurate documentation. But what kinds of errors will persist?

The errors cluster on patterns that are rare in training data — which means they concentrate on the codebases that have distinctive architectural choices, unusual frameworks, or custom abstractions. The codebases most worth documenting accurately are the ones where LLM accuracy is worst. Capability improvements that come from larger general-purpose models reduce average error rates but don’t fix this clustering — the rare patterns remain rare relative to the training distribution. Capability improvements that come from training on more code address the average but not the specific: your codebase contains combinations of choices that exist only in your codebase, and those combinations are new to any model regardless of how much code it was trained on.

You could argue that with enough training data, even rare patterns will have been seen, and that eventually the model will have encountered every relevant combination. This is true in theory but unlikely in practice. Codebase-specific patterns aren’t drawn from a finite catalog of possible patterns; new combinations of architectural choices, conventions, and abstractions unique to each team emerge every day. Models handle novel combinations through generalization rather than memorization, and generalizing well on rare combinations requires either much more training data, with diminishing returns, or much larger models, with costs that grow with model size. Neither option is economical.

The path that would address codebase-specific accuracy directly is per-customer fine-tuning, which is its own architectural commitment with its own costs: infrastructure, maintenance, model drift from the base model over time. Most teams don’t actually do this, and the few that do are buying themselves into a category of ongoing operational complexity that the structural reasoning-first approach avoids.

Even with better inputs and models, LLM errors remain a black box: they cannot be audited. When an LLM produces wrong documentation, there’s no way to ask the system “why did you say this?” and get a deterministic answer. The error is the output of a probabilistic process, so it can’t be reproduced reliably or fixed at the source. Errors from deterministic extraction can be: every line of documentation traces back to a specific extraction from a specific AST node, and bugs can be found and fixed at the layer that produced them. As technical docs become production infrastructure, that maintainability matters more.
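One way to picture that audit trail is to attach provenance to each extracted fact. This is a sketch under assumptions, using Python's standard-library ast and dataclasses modules. The Fact record, the file path billing/io.py, and the function load are hypothetical names invented for the example.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    claim: str      # what the docs are allowed to assert
    file: str       # where the claim came from
    line: int       # line of the AST node that produced it
    node_type: str  # kind of AST node that produced it


def extract_with_provenance(filename: str, source: str) -> list[Fact]:
    """Extract function signatures, each tagged with its source location."""
    facts = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.FunctionDef):
            params = ", ".join(arg.arg for arg in node.args.args)
            facts.append(
                Fact(
                    claim=f"{node.name}({params})",
                    file=filename,
                    line=node.lineno,
                    node_type=type(node).__name__,
                )
            )
    return facts


# Hypothetical file contents; `load` is defined on line 3.
SOURCE = (
    "import os\n"
    "\n"
    "def load(path, strict):\n"
    "    return os.path.exists(path)\n"
)

facts = extract_with_provenance("billing/io.py", SOURCE)
print(facts[0])
```

If a documented signature turns out to be wrong, the record points to the file, line, and node type that produced it, so the fix goes into the extractor and applies to every future run.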

What this means for the category

Documentation is becoming part of the production engineering stack — alongside testing, deployment, and dependency management. Each of those went through the same evolution: from manual to rule-based to automated. Testing went from QA teams to test frameworks to CI on every commit. Deployment went from manual releases to scripted deploys to declarative infrastructure-as-code. Dependency management went from manually copying libraries to package managers to lockfiles with reproducible builds.

Each of these shifts happened when the cadence of the underlying work outran what humans could sustain manually, and when the tooling matured to make automated approaches reliable.

Technical documentation is following the same arc. Agentic coding is enabling engineering teams to ship at record velocity, and it is no longer possible for a human to read and understand the entire surface area of what shipped and translate that into tech docs.

There is another catalyst driving technical documentation toward full automation: coding agents are fetching technical documentation, treating it as ground truth, and writing code against it. This raises the floor for accuracy. Technical documentation has become infrastructure, and it cannot afford to be inaccurate, whether that inaccuracy comes from doc drift or from hallucinations.

The era of manual documentation ended when code started shipping faster than humans could write about it and when docs became infrastructure. Continuous Documentation is what comes next, and it’s grounded in structural understanding of the code. That’s why I’ve spent the last year building Etchblok: to ensure your code and your context remain continuously synchronized.
