How a (small) language model walks through its training text

The "Designed" tab builds a training text by construction, so the bigram model's branching is small and controlled instead of whatever a real corpus happens to give. Here is the shared machinery, the five ways the dropdown authors the text, and the two checks that keep it honest.

A bigram graph

A bigram model reads a token sequence and records, for each word, the set of words seen immediately after it. Write that as a directed graph: vertices are the distinct words, and there is an edge u → v whenever v follows u somewhere in the text. A word's out-degree is the number of distinct words that follow it. The generative step is a random walk on this graph, sampling each next word in proportion to how often it followed the current one; the out-degree is the branching factor of that choice. The point of the Designed tab is to make that branching uniform and low (mostly 2) so the generative process is legible.
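The graph described above can be sketched in a few lines. This is a minimal illustration, not the demo's own code: the function names `bigram_graph` and `out_degree` and the toy token list are assumptions made for the example.

```python
from collections import defaultdict

def bigram_graph(tokens):
    """Map each word to {successor: count}, one entry per adjacent pair."""
    graph = defaultdict(lambda: defaultdict(int))
    for u, v in zip(tokens, tokens[1:]):
        graph[u][v] += 1
    return {u: dict(succ) for u, succ in graph.items()}

def out_degree(graph, word):
    """Number of distinct words seen immediately after `word`."""
    return len(graph.get(word, {}))

# Toy training text (hypothetical); the period is an ordinary token.
tokens = "the idea sleeps . the machine sleeps . the idea yearns .".split()
graph = bigram_graph(tokens)
```

Here `graph["the"]` records two distinct successors ("idea" twice, "machine" once), so the out-degree of "the" is 2 even though it occurs three times.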

Constructing the lattice

Lay out a cyclic part-of-speech template (DET → ADJ → NOUN → VERB → DET → …) and fill each slot with a column of m distinct words. Number each column's words 0…m−1. Word i points to the words in the next column at indices (2i) mod m and (2i+1) mod m; for m ≥ 2 those two indices are distinct, so each word gets out-degree 2. There is no displayed "END" token: the period "." is a real token that plays the role of end-of-sequence. The Length slider sets m (words per slot), so the slider is really a vocabulary control.
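The indexing rule can be checked with a short sketch. It assumes equal-size columns and leaves out the period token for brevity; `build_lattice` and the particular grouping of words into columns are illustrative assumptions, not the demo's implementation.

```python
def build_lattice(columns):
    """Return {word: [successor, successor]} for a cyclic list of columns.

    Word i of each column points to words (2i) mod m and (2i+1) mod m
    of the next column, where m is the common column size.
    """
    m = len(columns[0])
    assert m >= 2 and all(len(col) == m for col in columns)
    edges = {}
    for c, column in enumerate(columns):
        nxt = columns[(c + 1) % len(columns)]  # the template is cyclic
        for i, word in enumerate(column):
            edges[word] = [nxt[(2 * i) % m], nxt[(2 * i + 1) % m]]
    return edges

# Hypothetical columns with m = 3 and disjoint word pools.
columns = [
    ["the", "this", "every"],        # DET
    ["green", "tender", "formal"],   # ADJ
    ["idea", "machine", "heart"],    # NOUN
    ["sleeps", "yearns", "adores"],  # VERB
]
lattice = build_lattice(columns)
```

With m = 3, "this" (index 1) points to indices 2 and 0 of the adjective column, so the wrap-around from the modulus is already visible at this size.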

Out-degree counts strings, not slots

The subtle hazard is that out-degree aggregates over every occurrence of a surface string, not over a grammatical slot. Put "the" in two columns and the graph unions its successors, silently landing it at out-degree 4. The per-slot regularity is invisible; only string identity counts. The fix is a discipline: give every column a disjoint word pool, so each surface string lives in exactly one slot. Then the empirical out-degree equals the assigned one — provided every assigned edge actually appears in the displayed text. That proviso drives the covering walk below.
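A small sketch of the hazard, under the assumption that each slot assigns exactly two successors. The slot numbers, word choices, and the helper `successors_by_string` are hypothetical and only serve to show the union effect.

```python
def successors_by_string(slot_edges):
    """Union successors over every slot that a surface string occupies."""
    merged = {}
    for (_slot, word), succs in slot_edges.items():
        merged.setdefault(word, set()).update(succs)
    return merged

# "the" reused in two slots: the graph sees one vertex with four successors.
shared = {
    (0, "the"): {"green", "tender"},
    (4, "the"): {"formal", "wistful"},
}
# Disjoint pools: each surface string lives in exactly one slot.
disjoint = {
    (0, "the"): {"green", "tender"},
    (4, "this"): {"formal", "wistful"},
}

shared_degree = len(successors_by_string(shared)["the"])
disjoint_degrees = {w: len(s) for w, s in successors_by_string(disjoint).items()}
```

In the shared case the out-degree of "the" comes out as 4; with disjoint pools every string keeps its assigned out-degree of 2.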

Five strategies (the Strategy dropdown)

All five keep a word-level bigram model and a one-word reading head; they differ only in how the training text is authored, which changes the out-degree profile:

lattice (strict k=2) — every word has exactly 2 successors; a verb can also take the period (out-degree 3) to end a sentence. The cleanest case, and the one the math above describes exactly. The self-check asserts a hard pass/fail here.
mixed (k = 2 or 3) — each word gets 2 or 3 successors at random, so the branching reads less mechanically while staying tightly bounded.
prose (objects + conjunctions) — clauses take objects (DET ADJ NOUN VERB NOUN) and join mid-sentence with "and"/"yet". Verbs reach out-degree 3-4 (a real word, an object, the period, a conjunction); the model still just learns the adjacencies.
natural (optional adjectives + prepositions) — the default. Two edges relax the rigid slot order toward real English: a determiner can go straight to a noun (so "the machine" appears alongside "the tender machine"), and a noun can open a prepositional phrase instead of taking its verb ("the idea in the meadow sleeps"). Most words still have out-degree 2; the graph stays one strongly-connected piece so the covering walk and guard still hold.
free function words — content words stay at 2, but determiners and the period branch widely, the way real function words do, giving natural connective tissue.

The strict lattice asserts a flat 2/3/2 pass/fail; the other four run in report mode, showing the real out-degree histogram instead, because their whole point is to relax the flat-k=2 claim for readability. (Agreement and tense are not varied, and there are no relative clauses — those would either need more word forms or balloon the branching past what you can follow by eye.)

Covering every edge in the walk

The displayed text is a walk, so it shows at most (tokens − 1) edges; to exhibit all the graph's edges it needs at least (edges + 1) tokens, achieved only by an Eulerian path. But the lattice is not balanced (the first determiners are pointed at more often than they point out), so no single Eulerian pass covers every edge once, and a naive routine would "jump" at the seams and emit pairs that are not real edges. So we solve the directed route-inspection (Guan's route) problem: measure each vertex's imbalance, add the fewest duplicate edges (along shortest paths) to balance it, then take an Eulerian circuit with Hierholzer's algorithm and rotate it to end on a period. The leftover phrase repetition is structural: the imbalanced nodes sit on opposite sides of the cycle, so the duplicate paths are long.
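The two mechanical pieces of that recipe, measuring imbalance and running Hierholzer's algorithm, can be sketched as follows. This is a toy under stated assumptions: the graph is hypothetical, the balancing edge is added by hand rather than by a shortest-path search, and the final rotation to end on a period is omitted.

```python
from collections import Counter, defaultdict

def imbalance(edges):
    """Out-degree minus in-degree for each vertex of a directed multigraph."""
    delta = defaultdict(int)
    for u, v in edges:
        delta[u] += 1
        delta[v] -= 1
    return dict(delta)

def eulerian_circuit(edges, start):
    """Hierholzer's algorithm; assumes a balanced, strongly connected graph."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    stack, circuit = [start], []
    while stack:
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())   # follow an unused edge
        else:
            circuit.append(stack.pop())  # no unused edge left: emit the vertex
    return circuit[::-1]

# Toy graph (hypothetical, not the demo's lattice).
edges = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "b")]
deltas = imbalance(edges)        # a has a surplus out-edge, b a surplus in-edge
balanced = edges + [("b", "a")]  # duplicate the shortest path b -> a by hand
walk = eulerian_circuit(balanced, "a")
covered = Counter(zip(walk, walk[1:]))
```

Every adjacent pair in `walk` is an edge of the balanced multigraph, each used exactly once, so the walk never "jumps"; the price is that the duplicated edge b → a is traversed twice.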

Two checks that keep it honest

None of this is taken on faith. First, a build-time coverage guard runs for every strategy: it recomputes the bigram graph from the emitted tokens and requires it to equal the designed graph — no spurious adjacency, no uncovered edge. If they differ it shows a "GRAPH/TEXT MISMATCH" banner and disables generation, so the text you read can never silently diverge from the graph the panel draws. Second, the self-check panel reports the out-degrees: a hard PASS/FAIL for the strict lattice (2 for ordinary words, 3 for sentence-enders, 2 for the period), or the real histogram for the report-mode strategies. The model is Shannon's 1948 first-order word model throughout; the construction only rigs the training text so its branching is something you can follow by eye.
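The coverage guard amounts to a set comparison. The sketch below assumes the designed graph is available as a set of (word, next word) pairs; the function names and the toy edges are illustrative, not taken from the demo's source.

```python
def text_edges(tokens):
    """Set of adjacent pairs that actually occur in the emitted text."""
    return set(zip(tokens, tokens[1:]))

def coverage_report(designed_edges, tokens):
    """Compare the graph recomputed from the text with the designed graph."""
    seen = text_edges(tokens)
    designed = set(designed_edges)
    return {
        "spurious": seen - designed,   # adjacencies the design never assigned
        "uncovered": designed - seen,  # assigned edges the text never shows
    }

def guard_passes(designed_edges, tokens):
    report = coverage_report(designed_edges, tokens)
    return not report["spurious"] and not report["uncovered"]

# Hypothetical designed graph and three candidate texts.
designed = {("a", "b"), ("b", "a"), ("a", "c"), ("c", "b")}
ok = guard_passes(designed, "a b a c b".split())
missing = coverage_report(designed, "a b a c".split())
extra = coverage_report(designed, "a b c b a c b".split())
```

A failing report says which side diverged: `missing["uncovered"]` holds the edge the text never reached, and `extra["spurious"]` holds the adjacency the text invented.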

How lively is the generated text? Scale is not the whole story

The generative liveliness of a bigram model is how much its next word is in doubt: the per-step conditional entropy of the source, weighted by how often each word actually occurs. In bits, from the text's bigram counts, H̄ = Σ_{w₁} (n(w₁)/N) · H[p(w₂ | w₁)], where n(w₁)/N is the relative frequency of w₁ and H[p(w₂ | w₁)] is the entropy of its successor distribution. A word with one successor (out-degree k=1) contributes 0 bits; a word with no successor in the text (k=0, a dead end) also contributes 0 and drags the average down. This is the conditional entropy of Shannon's n-gram model (1948; and Prediction and Entropy of Printed English, 1951). You might expect a bigger vocabulary to mean a livelier model, but the curves below show that liveliness depends as much on the structure of the training text as on its size: each curve is the most entropy reachable at each vocabulary size, over every contiguous whole-sentence excerpt of one short text.

Three repetitive texts, same axes. Aristophanes' frog chorus (The Frogs, 405 BC) wrings real entropy out of a 3-word vocabulary (0.67 bits/step): tiny but structured. The cumulative rhyme The House That Jack Built (1755) climbs highest because its high-traffic words ("that", "the") branch heavily as the verse accumulates. Same scale, very different liveliness, depending on how the words connect.
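The entropy formula can be evaluated directly from bigram counts. The sketch below makes one assumption worth flagging: it takes n(w₁) as the number of bigrams that start with w₁ and N as the total number of bigrams, so a final token with no successor gets no weight. The demo may normalize differently; the function name `liveliness` is also an assumption.

```python
import math
from collections import Counter

def liveliness(tokens):
    """Per-step conditional entropy, in bits, from a token list's bigram counts."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    pair_counts = Counter(pairs)
    first_counts = Counter(u for u, _ in pairs)  # n(w1), counted over bigrams
    total = len(pairs)                           # N, the number of bigrams
    h = 0.0
    for (u, _v), n_uv in pair_counts.items():
        p = n_uv / first_counts[u]               # p(w2 | w1)
        h -= (first_counts[u] / total) * p * math.log2(p)
    return h

# One sentence of the frog chorus, with the period as a token.
frogs = ["brekekekex", "koax", "koax", "."]
h_frogs = liveliness(frogs)
```

Under this normalization the single sentence gives 2/3 of a bit per step: "koax" starts two of the three bigrams and splits evenly between "koax" and ".", while "brekekekex" has one successor and contributes nothing. That is consistent with the 0.67 bits/step quoted for the frog chorus.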

Appendix: the texts

The three public-domain texts behind the curves above. (The House That Jack Built and Ring a Ring o' Roses are also in the Standard-texts tab, to walk directly.)

The Frogs (Aristophanes, 405 BC):
Brekekekex koax koax. Brekekekex koax koax. We children of the fountain and the lake. Let us raise our harmonious song, koax koax. The fair notes of our music, brekekekex koax koax.

Ring a Ring o' Roses (traditional):
Ring a ring o' roses. A pocket full of posies. A-tishoo. A-tishoo. We all fall down.

The House That Jack Built (1755):
This is the house that Jack built. This is the malt that lay in the house that Jack built. This is the rat that ate the malt that lay in the house that Jack built. This is the cat that killed the rat that ate the malt that lay in the house that Jack built. This is the dog that worried the cat that killed the rat that ate the malt that lay in the house that Jack built. This is the cow with the crumpled horn that tossed the dog that worried the cat that killed the rat that ate the malt that lay in the house that Jack built.

Shannon, 1945

The bigram model you are watching first appears, fully formed, in a classified Bell Labs war memorandum by Claude Shannon, A Mathematical Theory of Cryptography (1 September 1945). It is all there: transition probabilities p_i(j), the probability that letter i is followed by letter j; digram and trigram tables; and the famous "series of approximations to English," where you generate text by sampling each letter (then each word) from the frequencies with which it followed what came before. The bigram walk in this demo is exactly Shannon's first-order word approximation. The version most people cite, A Mathematical Theory of Communication (1948), reused these passages three years later; the language model itself is wartime work. (Earlier still, Andrey Markov had in 1913 counted vowel/consonant transitions across 20,000 letters of Pushkin's Eugene Onegin — the first chain of its kind. Markov counted letters; Shannon's step was to make those counts generative — a model that does not just measure language but produces it. You can run the bigram walk on Markov's own text in the Standard texts tab.)

Turing and Shannon, over tea at Bell Labs

Shannon wrote it as a codebreaker. A cipher is attacked through the statistical structure of the underlying language — its redundancy, the predictability that a letter constrains the next. The same n-gram statistics that let Shannon's model generate English-like text let a cryptanalyst recognize it emerging from a partly-broken cipher. The 1945 memo treats the language model and the attack as two sides of one mathematics. It was written after the Allied codebreaking effort had industrialized: Colossus, the electronic machine built to break the German Lorenz cipher at Bletchley Park, was running by the end of 1943.

The Allied lead in this mathematics had itself come from Poland. Three mathematicians of the Polish Cipher Bureau — Marian Rejewski, Henryk Zygalski, and Jerzy Różycki — had broken Enigma years before the British, treating the cipher as a problem in the statistics and structure of its messages. With engineers from the AVA Radio Company they built the bomba to search Enigma settings, and at a secret meeting in the Pyry forest south of Warsaw, on 25 July 1939, handed reconstructed machines and their methods to French and British intelligence — weeks before the war began.

For about two months from January 1943, Alan Turing — who had reported to Bletchley on 4 September 1939, the day after Britain declared war, and who led the attack on the naval Enigma — was at Bell Labs on SIGSALY secure-speech work. He and Shannon took tea together, by Shannon's account daily.

Language, communication, and stochastic walks

The walk samples each next word in proportion to how often it followed the current word in the training text. That is all Shannon's model does; there is no temperature or other sampling knob (those came later, with neural language models). The "Designed" tab rigs the training text so every word has exactly two choices, to make the branching visible; the other tabs show what real text looks like, where a word like "the" may have dozens of continuations.
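The sampling rule is short enough to write out. This is a minimal sketch assuming whitespace-separated tokens and a seeded `random.Random` for repeatability; `train`, `walk`, and the toy text are illustrative names, not the demo's code.

```python
import random
from collections import Counter, defaultdict

def train(tokens):
    """Count, for each word, how often each other word followed it."""
    counts = defaultdict(Counter)
    for u, v in zip(tokens, tokens[1:]):
        counts[u][v] += 1
    return counts

def walk(counts, start, steps, rng):
    """Random walk: each next word is drawn in proportion to its follow count."""
    word, out = start, [start]
    for _ in range(steps):
        successors = counts.get(word)
        if not successors:  # dead end: the word was never followed by anything
            break
        words = list(successors)
        weights = [successors[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        out.append(word)
    return out

# Toy training text (hypothetical); the period is an ordinary token.
tokens = "the idea sleeps . the machine sleeps . the idea yearns .".split()
counts = train(tokens)
sample = walk(counts, "the", 20, random.Random(0))
```

The only randomness is the weighted draw; there is no temperature or smoothing, so the walk can only emit adjacent pairs that occur in the training text.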

Shannon's bigram model already made a machine "use language," after a fashion — a small generative engine — even before the term artificial intelligence was coined, in a 1955 proposal for the 1956 Dartmouth summer workshop on artificial intelligence (which Shannon co-authored).

McCarthy, Shannon, and “AI”: the brand

The man who coined artificial intelligence, John McCarthy — a Princeton mathematics instructor, having taken his PhD there in 1951 — spent the summer of 1952 working with Shannon at Bell Labs, where the two went on to co-edit Automata Studies (Princeton University Press, 1956). In his own account:

I invented the term artificial intelligence. I invented it because when we were trying to get money for a summer study, I had a previous bad experience. In 1952 Claude Shannon and I decided to collect a batch of studies which we hoped would contribute to launching this field, and Shannon thought that artificial intelligence was too flashy a term and might attract unfavorable notice. So we agreed to call it Automata studies, and I was terribly disappointed when the papers we received were about automata, and very few of them had anything to do with the goal that at least I was interested in. So I decided not to fly any false flags anymore, but to say that this is study aimed at the long-term goal of achieving human-level intelligence.
John McCarthy, Lighthill debate, 1973 (video, from 2:33)

Credits

Vibecoded/spellcast by Chris Wiggins (chris.wiggins@gmail.com), inspired by the work in How Data Happened: A History from the Age of Reason to the Age of Algorithms and the course Data: Past, Present, and Future, authored and designed, respectively, with Matthew L. Jones.

Source code (MIT-licensed) is on GitHub: github.com/chrishwiggins/shannon-language-model. Clone it and run make to serve the site locally.

last updated 2026-06-01T11-11-59 NYC time

Where the words come from

The "Designed" tab's vocabulary is not random. Its words are drawn from a small canon of texts about language, machines, and whether fluent output amounts to understanding — so even the nonsense the model emits carries an echo of that argument.

Christopher Strachey's love letters (1952). On the Ferranti Mark 1 at Manchester, Strachey filled a fixed template — "YOU ARE MY [adjective] [noun]..." — from word lists drawn from Roget's Thesaurus, signed "M.U.C." It is the direct ancestor of this demo: a grammatical template filled from pools, twelve years before ELIZA. The romantic words here (darling, adoring, tender, heart, yearns, adores) are its register.
Chomsky, Syntactic Structures (1957). The specimen sentence colorless green ideas sleep furiously — perfectly grammatical, perfectly meaningless. Exactly what a bigram chain produces; "colorless," "green," "idea," "sleeps" come from it.
Brautigan, All Watched Over by Machines of Loving Grace (1967). The poem's vision of machines and nature in harmony gives "meadow" and "machine."
"Daisy Bell" (Harry Dacre, 1892). Sung by an IBM mainframe at Bell Labs in 1961, the first recording of a computer singing — a touchstone for the whole lineage (the demo cites it by title; its full public-domain lyric is not used).
Weizenbaum, on ELIZA (1976). "What I had not realized is that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people." A bigram walk is such a program.
Searle, the Chinese Room (1980). "Syntax is not sufficient for semantics." The model has the syntax (which word follows which) and none of the semantics (no idea what any of it means).
Orwell, "Politics and the English Language" (1946). As Strunk and White put it in The Elements of Style: to show what happens when strong writing is deprived of its vigor, Orwell took a passage from Ecclesiastes and drained it of its blood. The result begins "Objective considerations of contemporary phenomena compel the conclusion that success or failure in competitive activities exhibits no tendency to be commensurate with innate capacity..." That is grammar smothering sense, the failure mode this demo makes visible.

Appendix: the words, by source

The "Designed" tab draws from a fixed pool of words, each evoking one of the sources above. Any single training text uses only a shuffled slice (its size set by the ↑↓ tokens control), so a given walk shows a subset; raise the token count to see more. That is why a word like "delusional" or "semantics" appears only sometimes.

Chomsky (colorless green ideas sleep furiously): colorless, green, idea, sleeps
Brautigan (All Watched Over...): cybernetic, meadow, loving; and the words "machine," "the," "and"
White / Orwell (the Ecclesiastes parody, quoted in The Elements of Style): objective, machine; and the function words this, every, that, and, yet
McCarthy / Dartmouth (the 1955 proposal): intelligent, abstract, intelligence, conjecture, concept, language, simulates, reasons; and some, each, then
Weizenbaum (the ELIZA "delusional thinking" line): delusional, program
Searle (the Chinese Room): formal, syntax, semantics
Strachey (his love-letter word lists): tender, wistful, passionate, darling, heart, yearns, adores, loves, treasures, woos
prepositions (for the "natural" strategy's prepositional phrases, each verbatim in the sources above): in, of, with

Every word the demo can emit is verbatim from one of these sources; there are no invented filler words. (Some words appear in more than one source, e.g. "machine.")

The words are slotted by true part of speech so the lattice still composes into grammatical clauses; the verbs are third person singular.

References

A. A. Markov, "An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains," lecture 23 Jan 1913 (Eng. trans. Science in Context 19(4), 2006).
C. E. Shannon, A Mathematical Theory of Cryptography (classified Bell Labs memo, 1 Sep 1945). IACR scan
C. E. Shannon, A Mathematical Theory of Communication (Bell System Technical Journal, 1948).
A. Hodges, Alan Turing: The Enigma (1983) — the source for Turing's stay at Bell Labs from January 1943, consulting on the SIGSALY speech-encryption system and taking tea with Shannon: the wartime link between the Bletchley Park codebreaking effort and Bell Labs.
Polish Cipher Bureau — Rejewski, Zygalski, Różycki; AVA Radio Company; the Pyry forest meeting, 25 Jul 1939.
N. Chomsky, Syntactic Structures (1957) — the example sentence "colorless green ideas sleep furiously".
R. Brautigan, All Watched Over by Machines of Loving Grace (1967).
The 1955 proposal for the Dartmouth Summer Research Project on Artificial Intelligence (McCarthy, Minsky, Rochester, Shannon).
J. McCarthy, Automata Studies (co-edited with C. E. Shannon; Princeton University Press, 1956), and McCarthy's account of coining "artificial intelligence" after their 1952 Bell Labs collaboration, from the Lighthill debate, 1973 (video, from 2:33).
"Daisy Bell" (Harry Dacre, 1892), sung by an IBM mainframe at Bell Labs, 1961.
J. Weizenbaum, Computer Power and Human Reason (1976); Weizenbaum, ELIZA.
M. Kranzberg, "Technology and History: Kranzberg's Laws," Technology and Culture (1986).

Trained language model (`T` to randomize); each column is a part of speech

Generated language (`↵` for next, `space` to play/pause, `←→` for slower, faster)

Next-word choice

Training text (`T` to randomize, `↑↓` tokens: fewer, more)

Self-check — out-degrees recomputed from the displayed text