How a (small) language model walks through its training text

Language models are not magic. They are mathematical games for exploring the text on which they are trained. This particular game, the simplest of them, dates from 1945 (see the About / History tab) and shows two things at once: how much the training text matters, and how a plain random walk through that text can be generative. Start with the Designed tab, then try your own text. Explore and enjoy.

Paste/write any text. It is analyzed as-is: the real bigram graph is built with frequencies, and the walk samples likelier continuations more often. Out-degrees vary wildly, the opposite of the designed demo's uniform branching.

Enter one or more URLs (comma- or newline-separated). The local backend fetches each page and uses BeautifulSoup to strip it to plain text, then analyzes the combined result. Requires running python3 src/server.py.

Text excerpts bundled with the app. Pick one to build its real bigram graph and walk it. Eugene Onegin is the text Andrey Markov used in 1913 for the first Markov chain — see the About / History tab.

The "Designed" tab builds a training text by construction, so the bigram model's branching is small and controlled instead of whatever a real corpus happens to give. Here is the shared machinery, the five ways the dropdown authors the text, and the two checks that keep it honest.

A bigram graph

A bigram model reads a token sequence and records, for each word, the set of words seen immediately after it. Write that as a directed graph: vertices are the distinct words, and there is an edge u → v whenever v follows u somewhere in the text. A word's out-degree is the number of distinct words that follow it. The generative step is a random walk on this graph, sampling each next word in proportion to how often it followed the current one; the out-degree is the branching factor of that choice. The point of the Designed tab is to make that branching uniform and low (mostly 2) so the generative process is legible.
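The graph and the walk described above can be sketched in a few lines of Python. This is a minimal illustration, not the demo's own code: the function names and the toy sentence are assumptions made for the example, and tokenization is plain whitespace splitting.

```python
import random
from collections import Counter, defaultdict


def build_bigram_counts(tokens):
    """Map each word to a Counter of the words seen immediately after it."""
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows


def out_degree(follows, word):
    """Number of distinct words that follow `word` (0 for a dead end)."""
    return len(follows.get(word, ()))


def walk(follows, start, steps, rng=None):
    """Random walk: sample each next word in proportion to its count."""
    rng = rng or random.Random()
    word, output = start, [start]
    for _ in range(steps):
        successors = follows.get(word)
        if not successors:  # dead end: nothing ever followed this word
            break
        words = list(successors)
        weights = [successors[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        output.append(word)
    return output


# Toy training text (hypothetical); the period is an ordinary token.
tokens = "the cat saw the dog . the dog saw the cat .".split()
follows = build_bigram_counts(tokens)
print(out_degree(follows, "the"))
print(" ".join(walk(follows, "the", 12, random.Random(0))))
```

Every adjacent pair in the generated output is an edge of the graph, which is all that "generative" means here.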

Constructing the lattice

Lay out a cyclic part-of-speech template (DET → ADJ → NOUN → VERB → DET → …) and fill each slot with a column of m distinct words. Number each column's words 0…m-1. Word i points to the words in the next column at indices 2i mod m and (2i+1) mod m; for m ≥ 2 those are distinct, so each word gets out-degree 2. There is no displayed "END" token: the period "." is a real token that plays the role of end-of-sequence. The value m (words per slot) is what the Length slider sets, so the slider is really a vocabulary control.
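As a rough sketch of the index rule, the following Python builds the designed edge set for a four-slot template. It is a simplification under stated assumptions: the column words are made up for the example, the period token and sentence-enders are left out, and `lattice_edges` is a hypothetical name rather than anything from the demo's source.

```python
from collections import Counter


def lattice_edges(columns):
    """Designed edges for a cyclic template of equal-sized word columns.

    Word i of each column points to words (2i mod m) and ((2i+1) mod m)
    of the next column, wrapping from the last column back to the first.
    """
    m = len(columns[0])
    if m < 2 or any(len(col) != m for col in columns):
        raise ValueError("every column needs the same m >= 2 words")
    edges = set()
    for c, column in enumerate(columns):
        nxt = columns[(c + 1) % len(columns)]
        for i, word in enumerate(column):
            edges.add((word, nxt[(2 * i) % m]))
            edges.add((word, nxt[(2 * i + 1) % m]))
    return edges


# Hypothetical columns, one per slot: DET, ADJ, NOUN, VERB (m = 3).
columns = [
    ["the", "a", "this"],
    ["small", "plain", "random"],
    ["model", "walk", "text"],
    ["reads", "samples", "builds"],
]
edges = lattice_edges(columns)
out_degrees = Counter(u for u, _ in edges)
print(sorted(set(out_degrees.values())))
```

Because the columns share no words, the out-degree counted per string matches the out-degree assigned per slot.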

Out-degree counts strings, not slots

The subtle hazard is that out-degree aggregates over every occurrence of a surface string, not over a grammatical slot. Put "the" in two columns and the graph unions its successors, silently landing it at out-degree 4. The per-slot regularity is invisible; only string identity counts. The fix is a discipline: give every column a disjoint word pool, so each surface string lives in exactly one slot. Then the empirical out-degree equals the assigned one — provided every assigned edge actually appears in the displayed text. That proviso drives the covering walk below.
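A small Python illustration of the hazard, with made-up slot names and words: the same four edges give out-degree 2 when counted per slot and out-degree 4 when counted per surface string. The helper `shared_words` is a hypothetical check for the disjoint-pool discipline.

```python
from collections import Counter, defaultdict


def out_degrees(edges):
    """Out-degree of every vertex that has at least one outgoing edge."""
    successors = defaultdict(set)
    for u, v in edges:
        successors[u].add(v)
    return {u: len(vs) for u, vs in successors.items()}


def shared_words(columns):
    """Words that appear in more than one column (ideally an empty set)."""
    seen = Counter(w for column in columns.values() for w in set(column))
    return {w for w, n in seen.items() if n > 1}


# "the" is placed in two different slots (hypothetical slot names).
slot_edges = [
    (("DET", "the"), ("ADJ", "small")),
    (("DET", "the"), ("ADJ", "plain")),
    (("OBJ_DET", "the"), ("NOUN", "model")),
    (("OBJ_DET", "the"), ("NOUN", "walk")),
]
per_slot = out_degrees(slot_edges)
# A bigram model only sees strings, so drop the slot labels.
per_string = out_degrees([(u[1], v[1]) for u, v in slot_edges])

print(per_slot[("DET", "the")], per_string["the"])
print(shared_words({"DET": ["the", "a"], "OBJ_DET": ["the", "this"]}))
```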

Five strategies (the Strategy dropdown)

All five keep a word-level bigram model and a one-word reading head; they differ only in how the training text is authored, which changes the out-degree profile:

The strict lattice asserts a hard pass/fail on a flat 2/3/2 out-degree profile (2 for ordinary words, 3 for sentence-enders, 2 for the period); the other four run in report mode, showing the real out-degree histogram instead, because their whole point is to relax the flat k=2 claim for readability. (Agreement and tense are not varied, and there are no relative clauses: those would either need more word forms or balloon the branching past what you can follow by eye.)

Covering every edge in the walk

The displayed text is a walk, so it shows at most (tokens − 1) edges; to exhibit all the graph's edges it needs at least (edges + 1) tokens, achieved only by an Eulerian path. But the lattice is not balanced (the first determiners are pointed at more often than they point out), so no single Eulerian pass covers every edge once, and a naive routine would "jump" at the seams and emit pairs that are not real edges. So we solve the directed route-inspection (Guan's route) problem: measure each vertex's imbalance, add the fewest duplicate edges (along shortest paths) to balance it, then take an Eulerian circuit with Hierholzer's algorithm and rotate it to end on a period. The leftover phrase repetition is structural: the imbalanced nodes sit on opposite sides of the cycle, so the duplicate paths are long.
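The pipeline can be sketched in Python on a toy graph. Two caveats, both assumptions of this sketch rather than facts about the demo: surplus vertices are paired greedily, which balances the graph but is not guaranteed to add the fewest duplicates (an exact route-inspection solution would use a minimum-cost flow or matching), and instead of rotating the circuit it simply starts at the period, so the circuit also ends on one. The graph is assumed strongly connected.

```python
from collections import Counter, defaultdict, deque


def shortest_path(adj, src, dst):
    """Breadth-first path from src to dst, returned as a list of edges."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    path, node = [], dst
    while parent[node] is not None:
        path.append((parent[node], node))
        node = parent[node]
    return path[::-1]


def balance(edges):
    """Add duplicate edges until every vertex has in-degree == out-degree."""
    adj = defaultdict(list)
    need = Counter()  # in-degree minus out-degree
    for u, v in edges:
        adj[u].append(v)
        need[u] -= 1
        need[v] += 1
    starts = [v for v, d in need.items() for _ in range(d)]   # too few exits
    ends = [v for v, d in need.items() for _ in range(-d)]    # too few entries
    multi = list(edges)
    for s, t in zip(starts, ends):  # greedy pairing, not provably minimal
        multi.extend(shortest_path(adj, s, t))
    return multi


def eulerian_circuit(multi_edges, start):
    """Hierholzer's algorithm on a balanced, connected directed multigraph."""
    out = defaultdict(list)
    for u, v in multi_edges:
        out[u].append(v)
    stack, circuit = [start], []
    while stack:
        u = stack[-1]
        if out[u]:
            stack.append(out[u].pop())
        else:
            circuit.append(stack.pop())
    return circuit[::-1]


# Toy unbalanced graph: "the" has out 2 / in 1, "sat" has in 2 / out 1.
edges = [("the", "cat"), ("the", "dog"), ("cat", "sat"),
         ("dog", "sat"), ("sat", "."), (".", "the")]
multi = balance(edges)
circuit = eulerian_circuit(multi, ".")
print(" ".join(circuit))
```

In the toy graph the only repair is the duplicated path sat → . → the, which is why that phrase shows up twice in the resulting walk.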

Two checks that keep it honest

None of this is taken on faith. First, a build-time coverage guard runs for every strategy: it recomputes the bigram graph from the emitted tokens and requires it to equal the designed graph — no spurious adjacency, no uncovered edge. If they differ it shows a "GRAPH/TEXT MISMATCH" banner and disables generation, so the text you read can never silently diverge from the graph the panel draws. Second, the self-check panel reports the out-degrees: a hard PASS/FAIL for the strict lattice (2 for ordinary words, 3 for sentence-enders, 2 for the period), or the real histogram for the report-mode strategies. The model is Shannon's 1948 first-order word model throughout; the construction only rigs the training text so its branching is something you can follow by eye.
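Both checks reduce to recomputing the graph from the emitted tokens. The Python below is a hedged sketch of that idea; the function names, the report format, and the toy graph are illustrative assumptions, not the demo's implementation.

```python
from collections import Counter, defaultdict


def coverage_report(tokens, designed_edges):
    """Compare the bigram graph of the text with the designed graph."""
    seen = set(zip(tokens, tokens[1:]))
    designed = set(designed_edges)
    return {
        "spurious": seen - designed,    # adjacent pairs that are not edges
        "uncovered": designed - seen,   # edges the text never exhibits
    }


def is_covered(tokens, designed_edges):
    """True only if text and design describe exactly the same graph."""
    report = coverage_report(tokens, designed_edges)
    return not report["spurious"] and not report["uncovered"]


def out_degree_histogram(tokens):
    """Map out-degree -> number of distinct words with that out-degree."""
    successors = defaultdict(set)
    for w1, w2 in zip(tokens, tokens[1:]):
        successors[w1].add(w2)
    return Counter(len(successors.get(w, ())) for w in set(tokens))


designed = [("the", "cat"), ("the", "dog"), ("cat", "sat"),
            ("dog", "sat"), ("sat", "."), (".", "the")]
good = ". the dog sat . the cat sat .".split()
bad = "the dog sat . cat sat .".split()  # jumps from "." to "cat"

print(is_covered(good, designed))
print(coverage_report(bad, designed))
print(out_degree_histogram(good))
```

A failing report like the second one is the situation in which the page shows its mismatch banner rather than a walk.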

How lively is the generated text? Scale is not the whole story

The generative liveliness of a bigram model is how much its next word is in doubt: the per-step conditional entropy of the source, weighted by how often each word actually occurs, H̄ = ∑_{w₁} (n(w₁)/N) · H[p(w₂ | w₁)], in bits, computed from the text's bigram counts; n(w₁)/N is the relative frequency of w₁. A word with one successor (out-degree k=1) contributes 0 bits; a word with no successor at all (k=0, a dead end, such as a final word that never recurs) also contributes 0 and drags the average down. This is the conditional entropy of Shannon's n-gram model (1948; and Prediction and Entropy of Printed English, 1951). You might expect a bigger vocabulary to mean a livelier model, but the curves below show that liveliness depends as much on the structure of the training text as on its size: each curve is the most entropy reachable at each vocabulary size, over every contiguous whole-sentence excerpt of one short text.
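The entropy formula can be written out as a short Python function. This sketch makes its conventions explicit because the text leaves them open: n(w₁) is taken to be the token count of w₁, N the total number of tokens, and p(w₂|w₁) the bigram counts normalized over the times w₁ had a successor. The demo may weight or tokenize differently, so figures computed this way need not match the plotted curves.

```python
import math
from collections import Counter, defaultdict


def liveliness(tokens):
    """Per-step conditional entropy in bits, weighted by token frequency."""
    follows = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        follows[w1][w2] += 1

    occurrences = Counter(tokens)   # n(w1), assumed to be token counts
    total = len(tokens)             # N, assumed to be the token total
    h_bar = 0.0
    for w1, n in occurrences.items():
        counts = follows.get(w1)
        if not counts:
            continue  # dead end (k=0): contributes 0 bits but keeps its weight
        seen = sum(counts.values())
        h = -sum((c / seen) * math.log2(c / seen) for c in counts.values())
        h_bar += (n / total) * h
    return h_bar


# "a" is followed by "b" and "c" equally often (1 bit); "b" and "c" are
# always followed by "a" (0 bits). "a" is half the tokens, so H = 0.5.
print(liveliness("a b a c a b a c".split()))
# A strictly alternating text leaves nothing in doubt.
print(liveliness("a b a b a b".split()))
```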

Three repetitive texts, same axes. Aristophanes' frog chorus (The Frogs, 405 BC) wrings real entropy out of a 3-word vocabulary (0.67 bits/step): tiny but structured. The cumulative rhyme The House That Jack Built (1755) climbs highest because its high-traffic words ("that", "the") branch heavily as the verse accumulates. Same scale, very different liveliness, depending on how the words connect.

Appendix: the texts

The three public-domain texts behind the curves above. (The House That Jack Built and Ring a Ring o' Roses are also in the Standard-texts tab, to walk directly.)

The Frogs (Aristophanes, 405 BC):
Brekekekex koax koax. Brekekekex koax koax. We children of the fountain and the lake. Let us raise our harmonious song, koax koax. The fair notes of our music, brekekekex koax koax.

Ring a Ring o' Roses (traditional):
Ring a ring o' roses. A pocket full of posies. A-tishoo. A-tishoo. We all fall down.

The House That Jack Built (1755):
This is the house that Jack built. This is the malt that lay in the house that Jack built. This is the rat that ate the malt that lay in the house that Jack built. This is the cat that killed the rat that ate the malt that lay in the house that Jack built. This is the dog that worried the cat that killed the rat that ate the malt that lay in the house that Jack built. This is the cow with the crumpled horn that tossed the dog that worried the cat that killed the rat that ate the malt that lay in the house that Jack built.

Shannon, 1945

The bigram model you are watching first appears, fully formed, in a classified Bell Labs war memorandum by Claude Shannon, A Mathematical Theory of Cryptography (1 September 1945). It is all there: transition probabilities p_i(j), the probability that letter i is followed by letter j; digram and trigram tables; and the famous "series of approximations to English," where you generate text by sampling each letter (then each word) from the frequencies with which it followed what came before. The bigram walk in this demo is exactly Shannon's first-order word approximation. The version most people cite, A Mathematical Theory of Communication (1948), reused these passages three years later; the language model itself is wartime work. (Earlier still, Andrey Markov had in 1913 counted vowel/consonant transitions across 20,000 letters of Pushkin's Eugene Onegin, the first chain of its kind. Markov counted letters; Shannon's step was to make those counts generative: a model that does not just measure language but produces it. You can run the bigram walk on Markov's own text in the Standard texts tab.)

Turing and Shannon, over tea at Bell Labs

Shannon wrote it as a codebreaker. A cipher is attacked through the statistical structure of the underlying language — its redundancy, the predictability that a letter constrains the next. The same n-gram statistics that let Shannon's model generate English-like text let a cryptanalyst recognize it emerging from a partly-broken cipher. The 1945 memo treats the language model and the attack as two sides of one mathematics. It was written after the Allied codebreaking effort had industrialized: Colossus, the electronic machine built to break the German Lorenz cipher at Bletchley Park, was running by the end of 1943.

The Allied lead in this mathematics had itself come from Poland. Three mathematicians of the Polish Cipher Bureau — Marian Rejewski, Henryk Zygalski, and Jerzy Różycki — had broken Enigma years before the British, treating the cipher as a problem in the statistics and structure of its messages. With engineers from the AVA Radio Company they built the bomba to search Enigma settings, and at a secret meeting in the Pyry forest south of Warsaw, on 25 July 1939, handed reconstructed machines and their methods to French and British intelligence — weeks before the war began.

For about two months from January 1943, Alan Turing — who had reported to Bletchley on 4 September 1939, the day after Britain declared war, and who led the attack on the naval Enigma — was at Bell Labs on SIGSALY secure-speech work. He and Shannon took tea together, by Shannon's account daily.

Language, communication, and stochastic walks

The walk samples each next word in proportion to how often it followed the current word in the training text. That is all Shannon's model does; there is no temperature or other sampling knob (those came later, with neural language models). The "Designed" tab rigs the training text so every word has exactly two choices, to make the branching visible; the other tabs show what real text looks like, where a word like "the" may have dozens of continuations.

Shannon's bigram model already made a machine "use language," after a fashion — a small generative engine — even before the term artificial intelligence was coined, in a 1955 proposal for the 1956 Dartmouth summer workshop on artificial intelligence (which Shannon co-authored).

McCarthy, Shannon, and “AI”: the brand

The man who coined artificial intelligence, John McCarthy — a Princeton mathematics instructor, having taken his PhD there in 1951 — spent the summer of 1952 working with Shannon at Bell Labs, where the two went on to co-edit Automata Studies (Princeton University Press, 1956). In his own account:

"I invented the term artificial intelligence. I invented it because when we were trying to get money for a summer study, I had a previous bad experience. In 1952 Claude Shannon and I decided to collect a batch of studies which we hoped would contribute to launching this field, and Shannon thought that artificial intelligence was too flashy a term and might attract unfavorable notice. So we agreed to call it Automata studies, and I was terribly disappointed when the papers we received were about automata, and very few of them had anything to do with the goal that at least I was interested in. So I decided not to fly any false flags anymore, but to say that this is study aimed at the long-term goal of achieving human-level intelligence."

John McCarthy, Lighthill debate, 1973 (video, from 2:33)

Credits

Vibecoded/spellcast by Chris Wiggins (chris.wiggins@gmail.com), inspired by the work in How Data Happened: A History from the Age of Reason to the Age of Algorithms and the course Data: Past, Present, and Future, authored and designed, respectively, with Matthew L. Jones.

Source code (MIT-licensed) is on GitHub: github.com/chrishwiggins/shannon-language-model. Clone it and run make to serve the site locally.

last updated 2026-06-01T11-11-59 NYC time

Where the words come from

The "Designed" tab's vocabulary is not random. Its words are drawn from a small canon of texts about language, machines, and whether fluent output amounts to understanding — so even the nonsense the model emits carries an echo of that argument.

Appendix: the words, by source

The designed tab draws from a fixed pool of words, each evoking one of the sources above. Any single training text uses only a shuffled slice (its size set by the tokens control), so a given walk shows a subset; raise the token count to see more. That is why a word like "delusional" or "semantics" appears only sometimes.

Every word the demo can emit is verbatim from one of these sources; there are no invented filler words. (Some words appear in more than one source, e.g. "machine.")

The words are slotted by true part of speech so the lattice still composes into grammatical clauses; the verbs are third person singular.

Trained language model (T to randomize); each column is a part of speech

Legend: current word, choice 0, choice 1, STOP (hidden token)

Generated language (keyboard controls: next word, space to play/pause, slower, faster)

Next-word choice

Bright word = where we are. The chips are its choices in the graph; one is sampled.

Training text (T to randomize; token-count controls: fewer, more)

Self-check — out-degrees recomputed from the displayed text

This is a property of the whole training text above, computed once when the text is generated. It does not change as the walk steps, because it checks the text, not your current position. Out-degree is the number of different words that follow a given word. The check rebuilds the bigram graph straight from the displayed tokens and confirms that every word has the out-degree the design promises (in the Designed tab, two for ordinary words).

[Histogram: how many words (vertical axis) have each out-degree, the number of distinct next-words (horizontal axis)]