The Grammar of Structure
What the Langlands Program Might Tell Us About Learning Machines
Copyright: Sanjay Basu
I. Introduction
There is a persistent mystery at the heart of mathematics. Objects that appear entirely unrelated, defined in different languages, studied by different communities, sometimes turn out to encode the same information. A question about prime numbers becomes equivalent to a question about symmetries of certain functions. A problem in algebra transforms into a problem in analysis. The translation is not metaphorical. It is exact.
This phenomenon troubles people who encounter it for the first time. Mathematics is supposed to be about definitions and consequences. If you define two things differently, why should they be the same? And yet they are. Again and again.
Robert Langlands, working at the Institute for Advanced Study in the late 1960s, proposed something ambitious: that these scattered coincidences were not accidents but symptoms of a deeper unity. His conjectures suggested that vast families of mathematical objects, representations of Galois groups on one side and automorphic forms on the other, were in correspondence. Not approximately. Precisely. The program he initiated has shaped number theory for more than fifty years and remains largely unproven.
Meanwhile, a different kind of structure-discovery has emerged. Transformer-based neural networks, trained on enormous datasets, develop internal representations that capture patterns their designers never explicitly specified. These models transfer across tasks in ways that surprise researchers. They seem to find something general, something structural, in data that should have no inherent organization.
The parallel is worth taking seriously, though it requires care. The Langlands program is a framework of precise mathematical conjectures. Machine learning is an empirical enterprise. One concerns necessary truths about arithmetic objects. The other concerns contingent patterns learned from data. Any comparison must acknowledge this gap.
Still, both domains circle the same question: What does it mean for different representations to encode the same underlying structure? And what would it take to translate between them?
II. What the Langlands Program Actually Is
To understand what Langlands proposed, we need to understand what representation means in mathematics.
A representation is a way of realizing an abstract algebraic object as concrete transformations. If you have a group, which is a set with a multiplication rule satisfying certain axioms, you can study it by watching how it acts on a vector space. Each group element becomes a matrix. The group multiplication becomes matrix multiplication. This translation preserves structure while making the group tractable.
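In symbols, an $n$-dimensional complex representation of a group $G$ is a homomorphism into invertible matrices, one that turns group multiplication into matrix multiplication:

$$\rho : G \longrightarrow \mathrm{GL}_n(\mathbb{C}), \qquad \rho(g_1 g_2) = \rho(g_1)\,\rho(g_2) \quad \text{for all } g_1, g_2 \in G.$$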
The insight that representations are often more revealing than the objects themselves runs through modern mathematics. You study a thing by studying how it acts.
The Langlands program concerns two families of representations that, at first glance, have nothing to do with each other.
On one side are Galois representations. The Galois group of a number field encodes all the symmetries of its algebraic closure, all the ways you can permute roots of polynomials while respecting arithmetic. This group is enormous, complicated, and fundamental to number theory. A Galois representation is a homomorphism from this group into matrices, a way of watching how arithmetic symmetries act on a vector space.
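Written schematically, a Galois representation over the rationals is a continuous homomorphism

$$\rho : \mathrm{Gal}(\overline{\mathbb{Q}}/\mathbb{Q}) \longrightarrow \mathrm{GL}_n(\overline{\mathbb{Q}}_\ell),$$

with $\mathbb{C}$ in place of $\overline{\mathbb{Q}}_\ell$ in the classical Artin setting; the choice of coefficient field carries real technical weight, but not for the picture sketched here.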
On the other side are automorphic representations. These arise from analysis, from the study of functions on certain symmetric spaces that satisfy invariance conditions. The theory of automorphic forms generalizes classical objects like modular forms, which appear throughout number theory, from the proof of Fermat’s Last Theorem to the theory of elliptic curves.
The Langlands program conjectures that these two families are in correspondence. Given an irreducible representation of a Galois group, there should exist a corresponding automorphic representation, and vice versa. The correspondence is mediated by L-functions, complex analytic objects that serve as invariants. If two representations correspond, their L-functions match.
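In the simplest schematic form, the Artin L-function of an $n$-dimensional Galois representation $\rho$ is assembled one prime at a time from Frobenius elements, and the conjectured correspondence with an automorphic representation $\pi$ is detected by an exact equality of these invariants (written here only over the unramified primes, to keep the notation light):

$$L(s, \rho) = \prod_{p\ \mathrm{unramified}} \det\!\bigl(1 - \rho(\mathrm{Frob}_p)\, p^{-s}\bigr)^{-1}, \qquad L(s, \rho) = L(s, \pi).$$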
A. W. Knapp, in his introduction to the program, describes this precisely: the goal is to show that “Artin L functions are Hecke L functions” in the abelian case, and more generally that Artin L-functions attached to Galois representations equal the L-functions attached to automorphic representations. The abelian case is already deep. It essentially amounts to class field theory, which describes how the arithmetic of a number field governs the structure of its abelian extensions. The nonabelian case, which is most of the program, remains largely open.
The program is not a single conjecture but a web of interlocking statements. The Local Langlands Conjecture concerns what happens at each prime and at the infinite places. The global conjectures concern how local information assembles into something coherent. The principle of functoriality asserts that homomorphisms between certain dual groups should induce correspondences between automorphic representations.
None of this is easily summarized. Knapp’s exposition runs to fifty-eight dense pages and is explicitly an introduction. The actual literature fills thousands of pages.
What matters for our purposes is the animating vision: that fundamentally different mathematical domains, arithmetic on one side and analysis on the other, are not merely analogous but the same, once you know how to translate.
III. Representation as the Central Theme
The word representation does serious work in both mathematics and machine learning. But the meanings diverge, and the divergence is instructive.
In mathematics, a representation is a structure-preserving map. The original object’s relationships must be respected in the target space. If two group elements multiply to give a third, their representing matrices must multiply accordingly. The representation is faithful if no information is lost, if distinct elements map to distinct matrices. The representation is irreducible if the vector space has no proper subspaces that are themselves preserved by the action.
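Both conditions have compact formulations. For $\rho : G \to \mathrm{GL}(V)$ with $V$ a nonzero finite-dimensional vector space:

$$\rho \ \text{is faithful} \iff \ker\rho = \{e\}, \qquad \rho \ \text{is irreducible} \iff \text{the only subspaces } W \subseteq V \text{ with } \rho(g)W \subseteq W \text{ for all } g \in G \text{ are } \{0\} \text{ and } V.$$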
These are precise conditions with sharp consequences. Representation theory is not about finding useful embeddings. It is about classifying all possible ways a structure can manifest.
In machine learning, representation learning names something different. A neural network learns to transform inputs into internal activations, vectors in high-dimensional spaces. These learned representations are useful insofar as they support downstream tasks. A good representation for image classification puts similar images near each other. A good representation for language modeling captures semantic and syntactic regularities.
The constraints are empirical, not algebraic. There is no requirement that certain relationships be preserved. There is no classification theorem saying which representations are possible. The learned representation is whatever the optimization process discovers.
And yet something structural is happening.
When a transformer model is trained on text, it learns representations that encode relationships never specified in the training objective. Words with similar meanings cluster together. Syntactic roles emerge in activation patterns. Models trained on different languages develop partially aligned representational spaces, as if finding the same underlying structure through different surface forms.
This is not representation in the mathematical sense. But it suggests that neural networks discover some kind of organizing principles in data, principles that were not built in but emerged.
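One crude way to see the clustering described above is to compare learned word vectors by cosine similarity. The sketch below is illustrative only: the four-dimensional vectors are invented for the example rather than taken from any trained model, and the function name is ours.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy "embeddings"; a real model would produce vectors with
# hundreds or thousands of dimensions, learned rather than hand-written.
embeddings = {
    "king":   np.array([0.9, 0.8, 0.1, 0.0]),
    "queen":  np.array([0.8, 0.9, 0.1, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9, 0.8]),
}

# In a well-trained model, related words score higher than unrelated ones.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))   # ~0.99 here
print(cosine_similarity(embeddings["king"], embeddings["banana"]))  # ~0.12 here
```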
The Langlands program asks: given that two representational systems exist, can we establish that they encode the same information? Machine learning faces a related question: given that different models trained on different data develop similar representational structures, what does that tell us about the structure being represented?
The questions are not the same. But they rhyme.
IV. Attention and Correspondence
The transformer architecture computes through attention. At each layer, every position in a sequence attends to every other position, computing weighted sums based on learned compatibility functions. The weights determine which other positions influence the representation at a given position.
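A minimal sketch of that computation, in plain NumPy, with the multi-head projections, batching, and masking of a real implementation stripped away:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention for a single head and sequence.

    Q, K, V each have shape (sequence_length, d). Every output row is a
    weighted sum of the rows of V, the weights measuring how well that
    position's query matches every position's key.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # pairwise compatibilities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sums of value vectors

# Toy example: three positions, four-dimensional vectors. In a transformer,
# Q, K, and V are learned linear projections of the same layer input.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(attention(Q, K, V).shape)  # (3, 4)
```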
Attention is often described as capturing relationships. A word attends to another word because they are syntactically related, or because one helps predict the other, or because they participate in the same semantic pattern. The mechanism learns these relationships from data.
Is this a correspondence in anything like the Langlands sense?
Functoriality, in the Langlands program, asserts that homomorphisms between L-groups induce correspondences between automorphic representations. Knapp describes this carefully: if you have an L-homomorphism from one dual group to another, there should be an induced map on representations. The correspondence is not arbitrary. It must respect L-functions. It must be compatible with local and global structure.
The key word is must. Functoriality is a constraint. It says that if a correspondence exists, it must behave in specific ways. The content of the conjecture is that such correspondences actually exist and that they satisfy the required conditions.
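Written schematically, with every technical hypothesis suppressed, the required condition looks like this: an L-homomorphism $\varphi : {}^L H \to {}^L G$ should transfer an automorphic representation $\pi$ of $H$ to an automorphic representation $\Pi$ of $G$ in a way the L-functions can see,

$$L(s, \pi, r \circ \varphi) = L(s, \Pi, r) \quad \text{for every finite-dimensional representation } r \text{ of } {}^L G.$$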
Attention mechanisms have no such constraints. They learn whatever patterns minimize training loss. There is no requirement that the learned attention weights form a structure-preserving map. There is no L-function serving as an invariant. There is no theorem saying what the attention must do, only an empirical observation of what it tends to do.
This is where the analogy breaks down, and the breakdown is important.
Mathematical correspondences are constrained by their definitions. They exist or they do not. When they exist, they satisfy precise conditions. The Langlands correspondence, if it holds, does not approximately match L-functions. It matches them exactly.
Learned correspondences in neural networks are contingent. They depend on architecture, training data, optimization hyperparameters, and random initialization. They are approximate, shifting, and task-dependent. A model might learn attention patterns that capture meaningful relationships on one distribution and fail completely on another.
So the structural parallel has clear limits. Attention is not functoriality. The learned relationships in a transformer are not correspondences in the mathematical sense.
But there is a weaker claim worth considering. Perhaps what attention learns is not a correspondence but an approximation to one. Perhaps the training process discovers rough shadows of structural relationships that a principled theory would make precise.
This is speculative. It is also the kind of speculation that resists easy testing. But it suggests a research direction: can we characterize what attention learns in terms of what it would need to learn to satisfy stronger constraints?
V. Universality and Transfer
The Langlands program aims at a kind of universality. The correspondence between Galois representations and automorphic representations is supposed to be general, not limited to special cases. The reciprocity laws that motivated the program, from quadratic reciprocity onward, were always hints at something larger. Langlands proposed the larger thing.
Knapp emphasizes that the program concerns reductive groups in general, not just specific examples. The L-group construction works for any such group. The conjectures apply across the full family. If they hold, they reveal that arithmetic and analysis are unified at the level of structure, not just in scattered instances.
Transformer models exhibit their own version of universality through transfer learning. A model trained on one task, typically next-token prediction on large text corpora, develops representations useful for entirely different tasks. Classification, translation, question answering, and code generation all benefit from pre-trained representations. The transfer works even when the downstream task looks nothing like the pre-training objective.
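A minimal sketch of the standard recipe, assuming PyTorch, with a tiny randomly initialized module standing in for the large pre-trained encoder one would use in practice; only the small task-specific head is trained:

```python
import torch
from torch import nn

# Stand-in for a pre-trained encoder. In practice this would be a large
# transformer whose weights came from next-token prediction on text; here
# it is a tiny randomly initialized module, just to show the wiring.
encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))

# Freeze the "pre-trained" representation and attach a small task head.
for p in encoder.parameters():
    p.requires_grad = False
head = nn.Linear(256, 3)  # e.g. a three-way classification task

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One toy training step on random tensors standing in for downstream data.
x = torch.randn(16, 128)        # a batch of 16 "inputs"
y = torch.randint(0, 3, (16,))  # their labels
logits = head(encoder(x))       # reuse the frozen representations
loss = loss_fn(logits, y)
loss.backward()
optimizer.step()
```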
This generalization was not obvious in advance. Early neural language models were good at the specific tasks they were trained on. They did not transfer well. Transformers changed this. Scale helped, but architecture mattered more. The attention mechanism, with its ability to learn flexible long-range dependencies, seems to produce representations that capture something general.
What is that something?
One hypothesis is that natural language has deep regularities, statistical patterns reflecting underlying linguistic and conceptual structure. Pre-training discovers these regularities. Because the regularities are general, the representations transfer.
Another hypothesis emphasizes inductive bias. Transformers are architecturally suited to learning certain kinds of patterns, perhaps compositional patterns, perhaps patterns involving long-range dependency. The architecture, not just the data, shapes what is learned.
A third hypothesis points to scale. With enough parameters and enough data, a model can memorize sufficient patterns to approximate general transfer. The universality is not principled but brute-forced.
These hypotheses are not mutually exclusive. All three probably contribute. But they have different implications for how we understand what transformers are doing.
The Langlands perspective would favor a principled explanation. If the Langlands program is true, the universality of its correspondences is not accidental. It reflects deep structural facts about arithmetic and representation. The universality is a consequence of the mathematics.
Transfer learning in transformers might be similar. Perhaps the generalization reflects structural features of language and reasoning that the architecture is suited to capture. Or perhaps it reflects something else entirely, something that has nothing to do with principled structure.
We do not know. That is an honest assessment of where the field stands. Transfer works. Why it works is disputed. Whether it works for principled reasons, reasons that could be theorized in advance, or whether it works for contingent reasons that just happen to hold for current models and current data, remains unclear.
VI. Why This Might All Be Wrong
Here is an uncomfortable possibility. The analogy between Langlands and transformers might be not just limited but actively misleading.
The Langlands program resists shortcuts. It cannot be solved by collecting data. The conjectures concern necessary truths about mathematical objects, and these truths must be established by proof. No amount of experimental verification substitutes. You cannot train a model to discover the Langlands correspondence. You can only prove it or not.
Modern machine learning takes the opposite approach. Structure is discovered empirically. Models are trained on data and evaluated on held-out data. If the metrics improve, the model has learned something useful. What exactly it has learned may remain opaque. The standard is predictive performance, not principled understanding.
This difference matters.
When Andrew Wiles proved Fermat’s Last Theorem, he did so by establishing a modularity result, a case of the Langlands correspondence for elliptic curves. The proof took seven years and runs to well over a hundred pages. It required deep engagement with the specific structures involved. It could not have been brute-forced.
When a transformer learns to perform a task, the process is essentially brute-forced. Gradient descent on a loss function, applied step after step across billions of training tokens, adjusts parameters until performance improves. The model does not understand what it is doing in any sense that would satisfy a mathematician. It approximates.
The Langlands program suggests that mathematical universality comes from understanding why correspondences hold, not just that they hold. The proof is the understanding. Without the proof, you have observed coincidences, not established truths.
Transformer universality might be different in kind. The models generalize, but they do not understand why they generalize. Their representations might capture structure, but they capture it implicitly, in ways that resist extraction and interpretation. The universality, if it exists, is not demonstrated but merely exhibited.
This is not a criticism of machine learning. Empirical science often works by observing patterns before explaining them. Newton’s laws described gravity long before general relativity explained it. But the comparison to Langlands should give pause.
If the analogy is apt, it suggests that current transformer performance might be fragile, dependent on contingent features of current architectures and datasets, and subject to failure in ways we cannot anticipate. Principled understanding would be more robust. We do not have principled understanding.
If the analogy is inapt, it is misleading to draw inspiration from Langlands at all. The program might tell us nothing useful about learning machines, because learning machines operate in a fundamentally different regime.
I lean toward the view that the analogy is suggestive but limited. It points toward questions worth asking, such as how to characterize what representations capture, and what structural constraints would make learned correspondences more principled. But it does not provide answers. And it might seduce us into thinking we understand more than we do.
VII. Endnotes
What would a principled theory of learned representation look like?
This question is not rhetorical. It names a gap in current understanding. We have powerful models. We have empirical results showing that transfer works. We do not have a theory explaining why, in a way that would let us predict when it will fail.
The Langlands program offers one model for what such a theory might look like. It specifies precise conditions that correspondences must satisfy. It provides invariants, the L-functions, that detect when correspondences hold. It connects local structure to global structure through the adelic formalism. It makes predictions that can be checked.
Nothing comparable exists for representation learning in neural networks. We have heuristics. We have empirical regularities. We do not have a framework that tells us what representations must satisfy to support transfer, or what invariants would detect structural equivalence between different models.
Maybe such a framework is impossible. Neural networks might be too flexible, too contingent, too dependent on specifics to admit general theory. The representations they learn might be genuinely arbitrary, useful only because the optimization process happened to find them.
Or maybe a framework is possible but not yet discovered. Perhaps there are structural constraints on learned representations that we have not identified, constraints that would explain transfer and predict generalization. Perhaps the right mathematics exists and waits to be applied.
The Langlands program took decades to formulate and remains mostly unproven after more than fifty years. If a comparable theory of learned representation is possible, it might take equally long to develop.
In the meantime, we work with what we have. Models that work for reasons we do not fully understand. Analogies that suggest questions but do not answer them. A sense that structure matters, without a precise account of what that means.
This is not a satisfying place to end. But it is an honest one.
VIII. References
- Knapp AW. Introduction to the Langlands program. Proc Symp Pure Math. 1997;61:245–302.
- Langlands RP. Problems in the theory of automorphic forms. Lect Notes Math. 1970;170:18–61.
- Langlands RP. On the classification of irreducible representations of real algebraic groups. In: Sally PJ, Vogan DA, eds. Representation Theory and Harmonic Analysis on Semisimple Lie Groups. American Mathematical Society; 1989:101–170.
- Gelbart SS. An elementary introduction to the Langlands program. Bull Am Math Soc. 1984;10:177–219.
- Tate JT. Number theoretic background. Proc Symp Pure Math. 1979;33(2):3–26.
- Borel A. Automorphic L-functions. Proc Symp Pure Math. 1979;33(2):27–61.
- Jacquet H, Langlands RP. Automorphic Forms on GL(2). Springer; 1970. Lecture Notes in Mathematics; vol 114.
- Artin E, Tate J. Class Field Theory. W.A. Benjamin Inc; 1967.
- Serre JP. Local Fields. Springer-Verlag; 1979.
- Godement R, Jacquet H. Zeta-Functions of Simple Algebras. Springer; 1972. Lecture Notes in Mathematics; vol 260.
- Cassels JWS, Fröhlich A, eds. Algebraic Number Theory. Academic Press; 1967.
- Arthur J, Clozel L. Simple Algebras, Base Change, and the Advanced Theory of the Trace Formula. Princeton University Press; 1989.
