Watch AI solve Wordle using constraint satisfaction, entropy maximization, and backtracking search
The solver reformulates Wordle as a classic CSP 〈X, D, C〉. Variables X = {P₀, P₁, …, Pₙ₋₁} represent each letter slot. Domains D(Pi) ⊆ {a–z} hold the currently legal letters per position. Constraints C encode every piece of feedback: exact matches (green), misplaced letters (yellow), and absent letters (grey). A solution is a complete assignment σ: X → D that satisfies all constraints simultaneously — i.e., a valid candidate word. This formulation unlocks the entire CSP algorithmic toolkit: propagation, arc consistency, heuristic ordering, and backtracking — turning a word game into a rigorous search problem.
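The formulation above can be sketched in a few lines. This is an illustrative model, not the app's actual code; the names (`initial_domains`, `satisfies`, the `must_contain` set) are assumptions, and it fixes n = 5.

```python
# Minimal sketch of the Wordle CSP state: per-position domains plus a
# global "must-contain" set from yellow clues. Illustrative names only.
from string import ascii_lowercase

N = 5  # letter positions P0..P4

def initial_domains():
    """Every D(Pi) starts as the full alphabet."""
    return [set(ascii_lowercase) for _ in range(N)]

def satisfies(word, domains, must_contain):
    """A complete assignment is a solution iff each letter lies in its
    position's domain and every must-contain letter appears somewhere."""
    return (len(word) == N
            and all(c in domains[i] for i, c in enumerate(word))
            and must_contain <= set(word))
```

With a green 'c' at P0 and a yellow 'r', `satisfies("crane", …)` holds while any word missing 'r' or starting with another letter fails.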
Rather than passively storing constraints and checking them during search, the solver actively propagates their consequences to shrink variable domains before any guessing begins. A green clue collapses D(Pi) to a singleton {c}. A yellow clue removes c from D(Pi) while adding a global “must-contain” constraint guaranteeing c appears in at least one other slot. A grey clue eliminates c from every position's domain (unless another instance is green/yellow). Crucially, these propagations cascade: shrinking one domain may trigger further eliminations in neighboring variables, producing a fixpoint — a maximally reduced state before any search is invoked. This often cuts the candidate list by 80–95% after a single guess.
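One round of this propagation can be sketched as follows. This is a simplified version that assumes no duplicate letters within a single guess (real Wordle feedback needs the two-pass duplicate handling described later); the function name and the 'g'/'y'/'b' pattern encoding are assumptions.

```python
def propagate_feedback(domains, guess, pattern):
    """Shrink domains from one guess's feedback, mutating `domains`.
    pattern: one char per position -- 'g' green, 'y' yellow, 'b' grey.
    Simplified: assumes the guess has no repeated letters."""
    must_contain = set()
    for i, (c, f) in enumerate(zip(guess, pattern)):
        if f == "g":
            domains[i] = {c}          # green collapses D(Pi) to {c}
        elif f == "y":
            domains[i].discard(c)     # yellow: not here...
            must_contain.add(c)       # ...but somewhere else
        else:
            for d in domains:         # grey: gone from every position
                d.discard(c)
    return must_contain
```

Guessing "crane" against feedback "gybbb" pins P0 to 'c', bans 'r' from P1 while recording it as must-contain, and strips 'a', 'n', 'e' from every domain.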
The backbone search strategy is depth-first backtracking over the CSP's assignment space. It assigns one variable at a time — choosing a letter c ∈ D(Pi) — then checks whether the partial assignment is consistent with all active constraints. If consistent, it recurses to the next unassigned variable; if not, it backtracks, undoing the assignment and trying the next value. This avoids the combinatorial explosion of generate-and-test (which would need to enumerate up to 26ⁿ assignments). Backtracking guarantees completeness — if a valid word exists, the solver will find it — and its efficiency depends heavily on the heuristics (MRV, LCV, forward checking) layered on top, which reduce the effective branching factor from ~26 to often <3.
The solver doesn't pick guesses at random — it uses Shannon entropy as a heuristic evaluation function, analogous to how A* search uses h(n) to prioritize the most promising frontier node. For each candidate guess g, the solver computes the expected information gain: how evenly g partitions the remaining candidate set across all 3ⁿ possible feedback patterns. A guess that creates many small, equally-sized partitions has high entropy ≈ high discriminative power. This transforms the “which word next?” decision from blind trial-and-error into a principled best-first selection, consistently reducing the expected solve depth from 5–6 guesses (random) to 3–4 (entropy-guided).
Minimum remaining values (MRV), also called the fail-first heuristic. When selecting which variable to assign next during backtracking, always pick the position Pi with the smallest |D(Pi)|. Intuition: if P3 has only {a, e} while P0 still has 18 candidates, assigning P3 first means we'll discover a dead end (if one exists) after at most 2 branches — not 18. This exponentially reduces the number of nodes explored by ensuring failures surface at shallow depths. In practice, MRV combined with constraint propagation often reduces backtracking to near-zero for typical Wordle states, because the tightly constrained positions get resolved first and cascade domain reductions to the rest.
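MRV itself is one line of selection logic. A sketch (function name assumed):

```python
def mrv_select(domains, assigned):
    """Fail-first: among unassigned positions, return the index of the one
    with the fewest remaining legal letters."""
    return min((i for i in range(len(domains)) if i not in assigned),
               key=lambda i: len(domains[i]))
```

Given domains of sizes 10, 2, and 3, it picks the 2-letter position first; once that position is assigned, the 3-letter one is next.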
Least constraining value (LCV), the fail-last principle for value ordering — the complement of MRV. Once we've selected which position to assign (via MRV), we must decide which letter to try first. LCV scores each candidate letter c by counting how many values it would eliminate from neighboring variables' domains if assigned. We try the letter that rules out the fewest options elsewhere first. Why? If we're going to succeed, we want to maximize the flexibility left for unassigned positions, giving them the best chance of finding a consistent value without backtracking. MRV asks “where is failure most likely?” and goes there first; LCV asks “which choice is most likely to succeed?” and tries it first. Together, they prune the tree from both ends.
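A sketch of LCV scoring, using a hypothetical no-duplicate constraint as the "neighboring eliminations" being counted (assigning c removes c from every other domain that still holds it); the function name is assumed.

```python
def lcv_order(domains, i):
    """Fail-last value ordering for position i: sort D(Pi) so the letter
    that would eliminate the fewest values elsewhere comes first."""
    def eliminations(c):
        # under a no-duplicate constraint, assigning c to Pi would
        # remove c from every other domain that contains it
        return sum(1 for j, d in enumerate(domains) if j != i and c in d)
    return sorted(domains[i], key=eliminations)
```

If 'a' appears in two other domains and 'b' in none, 'b' is tried first: it leaves the unassigned positions with the most flexibility.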
A tiebreaker for MRV. When two or more positions share the same smallest domain size, the degree heuristic selects the one involved in the most active constraints with other unassigned variables. A position constrained by multiple yellow/green interactions with other slots is harder to satisfy downstream — resolving it early propagates more information and avoids deep, wasted backtracking branches. In Wordle terms: if P1 and P4 both have |D| = 3, but P1 participates in a “must-contain” constraint, a “not-duplicate” constraint, and an arc-consistency constraint with P2, while P4 has only one constraint — assign P1 first. Degree + MRV together implement a robust variable-ordering policy that minimizes expected tree size.
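MRV and the degree tiebreak compose into a single variable-ordering key. In this sketch, `degree` is a hypothetical precomputed map from position to active-constraint count; the function name is assumed.

```python
def select_variable(domains, assigned, degree):
    """MRV with degree-heuristic tiebreak: smallest domain wins; among
    ties, the position with the most active constraints (highest degree)
    is chosen, so resolving it propagates the most information."""
    return min((i for i in range(len(domains)) if i not in assigned),
               key=lambda i: (len(domains[i]), -degree[i]))
```

With two positions tied at |D| = 3 and degrees 1 vs 2, the higher-degree position is assigned first, mirroring the P1-vs-P4 example above.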
A lightweight look-ahead strategy that catches dead ends one step before they happen. Immediately after assigning Pi = c, forward checking inspects every unassigned variable Pj that shares a constraint with Pi and removes any value from D(Pj) that is now inconsistent with the assignment. If any D(Pj) becomes empty (∅), we know the current path is unsolvable — backtrack immediately without exploring further. This avoids the “thrashing” behavior of naive backtracking, where an inconsistency at depth k isn't discovered until depth k+3. Forward checking is O(n · d) per assignment but saves exponential work by preventing doomed subtrees from ever being entered. It's the minimum viable pruning — cheap, effective, and always on.
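The look-ahead step can be sketched directly. Here the illustrative constraint being checked is again a hypothetical no-duplicate rule; the function name is assumed, and `domains` is mutated in place as described.

```python
def forward_check(domains, assigned_pos, letter):
    """After assigning Pi = letter, remove now-inconsistent values from
    every other domain. Returns False the moment any domain empties,
    signalling an immediate backtrack instead of a doomed descent."""
    for j, d in enumerate(domains):
        if j != assigned_pos:
            d.discard(letter)   # no-duplicate constraint, for illustration
            if not d:
                return False    # wiped-out domain: dead end caught early
    return True
```

Assigning 'r' when another position's domain is exactly {'r'} fails right away, one step before naive backtracking would blunder into the subtree.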
A stronger consistency enforcement than forward checking, operating on all variable pairs, not just those adjacent to the latest assignment. AC-3 maintains a queue of arcs (Pi, Pj). For each arc, it checks: does every value a ∈ D(Pi) have at least one supporting value b ∈ D(Pj) such that the pair (a, b) satisfies all constraints between Pi and Pj? If not, a is removed from D(Pi). Whenever a domain shrinks, all arcs pointing into that variable are re-queued for re-examination. The algorithm terminates at a fixpoint — a state where every remaining value is pairwise consistent. Worst-case complexity is O(n²d³), but in practice it converges quickly on Wordle's small (n=5) problem and often eliminates values that forward checking alone would miss.
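A compact AC-3 over binary constraints, following the queue-of-arcs description above. The representation is an assumption: `constraints` maps an arc (i, j) to a predicate ok(a, b), and a value a survives in D(Pi) only if some b ∈ D(Pj) supports it.

```python
from collections import deque

def ac3(domains, constraints):
    """AC-3: revise each arc (i, j); when D(Pi) shrinks, re-queue the
    arcs pointing into Pi. Returns False if any domain empties."""
    queue = deque(constraints)
    while queue:
        i, j = queue.popleft()
        ok = constraints[(i, j)]
        # values of Pi with no supporting value in D(Pj) are removed
        removed = {a for a in domains[i]
                   if not any(ok(a, b) for b in domains[j])}
        if removed:
            domains[i] -= removed
            if not domains[i]:
                return False  # inconsistent: some domain wiped out
            queue.extend(arc for arc in constraints if arc[1] == i)
    return True
```

With a not-equal constraint between two positions and D(P1) = {b}, the unsupported 'b' is pruned from D(P0) — a removal plain forward checking would only find after an assignment.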
Inspired by unit propagation in SAT solvers. When any consistency technique (forward checking or AC-3) reduces a domain to exactly one value — |D(Pi)| = 1 — that value is a forced assignment. Singleton propagation treats it as decided and immediately propagates its consequences: removing the assigned letter from other positions' domains wherever a no-duplicate constraint applies (Wordle answers can repeat letters, so duplicate-handling constraints govern when elimination is valid). This can trigger a chain reaction: fixing P2 = 'r' shrinks D(P0) and D(P4); if D(P4) drops to 1, it propagates again. The cascade continues until a fixpoint is reached — sometimes solving multiple positions for free without ever invoking backtracking. This is the solver's most satisfying optimization to watch in the domain grid.
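The cascade-to-fixpoint loop can be sketched as below, again using an illustrative no-duplicate constraint (the function name is assumed):

```python
def propagate_singletons(domains):
    """Repeatedly treat any singleton domain as decided and strip its
    letter from the other domains; loop until nothing changes (fixpoint)."""
    changed = True
    while changed:
        changed = False
        for i, d in enumerate(domains):
            if len(d) == 1:
                (c,) = d  # forced assignment
                for j, other in enumerate(domains):
                    if j != i and c in other:
                        other.discard(c)
                        changed = True
    return domains
```

Starting from [{r}, {r, s}, {s, t}], fixing 'r' forces the middle domain to {s}, which in turn forces the last to {t} — two positions solved with zero search.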
After every round of constraint propagation and consistency enforcement, the solver performs a full pass over the candidate word list. A word w = c₀c₁…cₙ₋₁ survives only if ci ∈ D(Pi) for every position i. This is a direct application of the CSP domains as a filter — any word containing even one letter that has been eliminated from its position's domain is immediately discarded. The filtering is O(|W| × n) where |W| is the current word list size, and it typically shrinks the candidate pool by 60–95% per guess round. Combined with propagation, it ensures backtracking search only ever considers words that are domain-consistent — a necessary (though not sufficient) condition for being the answer.
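The filter is a one-pass comprehension. A sketch (function name and the optional `must_contain` parameter are assumptions):

```python
def filter_words(words, domains, must_contain=frozenset()):
    """Keep only domain-consistent words: each letter must sit in its
    position's domain, and every must-contain letter must appear."""
    return [w for w in words
            if all(c in domains[i] for i, c in enumerate(w))
            and all(c in w for c in must_contain)]
```

Restricting D(P0) to {c, t} with 'r' as must-contain drops "slate" but keeps "crane" and "trace" — exactly the domain-consistency check described above, run over the whole list in O(|W| × n).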
The information-theoretic heart of the solver's guess selection. For each candidate guess g, compute the entropy of the feedback distribution: H(g) = −∑ₖ pₖ · log₂(pₖ), where pₖ is the fraction of remaining candidates that would produce feedback pattern k. High entropy means the guess splits the candidate pool into many nearly equal groups — each outcome carries maximum information, like a perfectly balanced binary tree. Low entropy means most candidates cluster into one or two feedback patterns — the guess is uninformative. The solver always selects the highest-entropy guess, directly optimizing the expected bits of information gained per turn. This is the single most impactful technique: it's the difference between average solve depths of ~3.5 (entropy-optimal) vs ~5.5 (random selection). Grounded in Shannon's 1948 mathematical theory of communication.
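The entropy computation is a direct transcription of that formula. The feedback simulator is passed in as a parameter here so the sketch stands alone; the function name is an assumption.

```python
from collections import Counter
from math import log2

def entropy(guess, candidates, feedback):
    """H(g) = -sum_k p_k * log2(p_k), where p_k is the fraction of
    candidate answers producing feedback pattern k for this guess."""
    counts = Counter(feedback(guess, a) for a in candidates)
    total = len(candidates)
    return -sum((n / total) * log2(n / total) for n in counts.values())
```

With a toy feedback function that splits four candidates into two equal groups, H = 1 bit; a function that lumps everything into one pattern scores 0 bits — the uninformative case described above.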
To compute entropy, the solver must first build the feedback distribution for each guess. For a candidate guess g and each remaining answer a, it simulates the Wordle comparison function to produce a feedback pattern — a string of green/yellow/grey signals. With n letter positions, there are 3ⁿ theoretically possible patterns (243 for n=5). The solver groups all remaining candidates by the pattern they'd produce, yielding a probability distribution {p₁, p₂, …, pₘ} over observed patterns. This is conceptually a one-level lookahead, akin to minimax without an adversary: “If I guess g, and the world responds with pattern k, how many candidates remain?” The guess that minimizes the expected surviving candidate count (equivalently, maximizes entropy) is chosen. The simulation is O(|W|² × n) in the worst case but is heavily optimized via early termination and pattern caching.
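The comparison function itself is the one subtle piece: duplicate letters must be scored with two passes (greens first, then yellows capped by each letter's remaining count in the answer), matching Wordle's actual rules. A sketch using 'g'/'y'/'b' for green/yellow/grey:

```python
from collections import Counter

def feedback(guess, answer):
    """Simulate Wordle's comparison. Pass 1 marks greens; pass 2 marks
    yellows, limited by how many unmatched copies of each letter the
    answer still has, so duplicates are scored correctly."""
    pattern = ["b"] * len(guess)
    # answer letters not consumed by an exact match
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = "g"
    for i, g in enumerate(guess):
        if pattern[i] == "b" and remaining[g] > 0:
            pattern[i] = "y"
            remaining[g] -= 1
    return "".join(pattern)
```

For example, "speed" against "erase" yields "ybyyb": both 'e's in the guess earn yellows because the answer contains two, while the 'p' and 'd' are grey.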
The Cytoscape.js graph renders the solver's actual search tree as an interactive directed graph. The root node represents the initial state (full candidate list, unconstrained domains). Each child node represents the state after a guess + feedback round: its label shows the guessed word, remaining candidate count, and domain sizes. Edges encode the constraints applied between states (color-coded feedback). The green path traces the solver's chosen trajectory. Red branches show pruned/rejected alternatives. The teal node marks the goal state (answer found, |candidates| = 1). This is the textbook search tree from AIMA Ch. 3.3 made tangible — you can click nodes, trace decisions, and see exactly how constraint propagation + entropy ranking navigated the space from ~2,300 candidates to a single answer in 3–4 steps.