The Compiler Nobody Sees


Copyright: Sanjay Basu

How the software stack between PyTorch and GPU silicon became the most strategically important layer in AI infrastructure

There is a peculiar blindness in the AI industry, and it looks something like this.

A machine learning engineer writes a PyTorch model. They define layers, specify activations, wire up attention mechanisms. They call model.to('cuda') and trainer.fit() and watch loss curves descend toward enlightenment. When the model trains successfully, they celebrate. When it fails, they blame the data, the hyperparameters, the architecture. What they almost never think about is the extraordinary tower of software that transforms their Python incantations into actual instructions that execute on actual silicon.

Between torch.nn.Linear(512, 256) and the voltage patterns that propagate through NVIDIA's tensor cores lies a compiler stack of remarkable sophistication. This stack takes high-level mathematical operations, fuses them, optimizes their memory access patterns, tiles them to fit hardware constraints, schedules them across thousands of parallel execution units, and emits highly optimized machine code. It does all of this invisibly. And that invisibility has made it the single most strategically consequential layer in AI infrastructure.

The compiler is where lock-in actually lives.

The Invisible Stack

When you write model(input) in PyTorch, what actually happens? The answer reveals an entire universe of software engineering that most practitioners never encounter. Your model definition gets traced into a computational graph, which is an abstract representation of the mathematical operations you have specified. This graph then descends through multiple levels of transformation, each level bringing it closer to the hardware while preserving its mathematical semantics.

At the top sits your framework of choice. PyTorch, TensorFlow, JAX. These frameworks provide the user-facing API, the automatic differentiation machinery, the tensor abstractions that make deep learning tractable for humans. But they are not compilers in any meaningful sense. They are orchestrators that delegate the real work to lower layers.

Below the framework sits the graph capture and optimization layer. In PyTorch 2.x, this role belongs to TorchDynamo and TorchInductor. TorchDynamo intercepts Python bytecode execution and captures the computational graph without requiring users to rewrite their code in a restricted subset of Python. This was a genuine breakthrough. Earlier approaches like TorchScript required developers to annotate their code or avoid dynamic Python features. TorchDynamo captures arbitrary Python, including control flow and data-dependent branching, by observing actual execution paths.
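The core idea of trace-based graph capture can be sketched in a few lines of plain Python: proxy objects overload arithmetic operators and record each operation into a graph as the program runs. This is a deliberately naive sketch with hypothetical names; TorchDynamo actually works at the bytecode level precisely so that it can also capture control flow that simple operator-overloading tracers miss.

```python
# Toy trace-based graph capture: proxies record ops as they execute.
# (Hypothetical sketch; TorchDynamo intercepts Python bytecode instead.)

class Node:
    def __init__(self, op, inputs):
        self.op = op          # operation name, e.g. "mul", "add"
        self.inputs = inputs  # upstream Node objects or constants

class Proxy:
    def __init__(self, node, graph):
        self.node = node
        self.graph = graph

    def _record(self, op, other):
        other_node = other.node if isinstance(other, Proxy) else other
        node = Node(op, [self.node, other_node])
        self.graph.append(node)
        return Proxy(node, self.graph)

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)

def trace(fn, num_inputs):
    """Run fn on proxies and return the recorded graph plus output node."""
    graph = []
    inputs = []
    for i in range(num_inputs):
        node = Node(f"input_{i}", [])
        graph.append(node)
        inputs.append(Proxy(node, graph))
    out = fn(*inputs)
    return graph, out.node

graph, output = trace(lambda x, y: x * y + y, 2)
ops = [n.op for n in graph]   # ["input_0", "input_1", "mul", "add"]
```

Running the function once against proxies yields the dataflow graph that the lower compiler layers then optimize.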

The captured graph then passes to TorchInductor, which performs optimizations that transform high-level operations into efficient low-level kernels. The most important optimization is operator fusion. Consider a simple sequence like ReLU(MatMul(x, W) + b). In naive eager execution, this would launch three separate GPU kernels with three separate memory round-trips. Inductor fuses these into a single kernel that loads data once, computes everything, and writes results once. For memory-bound workloads, which describes most inference scenarios, this fusion can deliver two to three times speedup with zero changes to user code.
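The memory-traffic argument behind fusion can be made concrete in pure Python. The sketch below contrasts the naive three-kernel path, which materializes an intermediate buffer after every operation, with a fused single-pass version of the same ReLU(MatMul(x, W) + b) computation; on a GPU, the fused form is one generated kernel rather than three launches.

```python
# Toy illustration of operator fusion: the unfused path materializes an
# intermediate buffer after every op; the fused path makes one pass.
# Pure-Python sketch -- real fusion happens inside generated GPU kernels.

def matmul(x, W):
    rows, inner, cols = len(x), len(W), len(W[0])
    return [[sum(x[i][k] * W[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def add_bias(y, b):
    return [[y[i][j] + b[j] for j in range(len(b))] for i in range(len(y))]

def relu(y):
    return [[max(0.0, v) for v in row] for row in y]

def fused_linear_relu(x, W, b):
    # One pass: multiply, add bias, clamp -- no intermediate buffers.
    rows, inner, cols = len(x), len(W), len(W[0])
    return [[max(0.0, sum(x[i][k] * W[k][j] for k in range(inner)) + b[j])
             for j in range(cols)] for i in range(rows)]

x = [[1.0, -2.0]]
W = [[0.5, -1.0], [2.0, 3.0]]
b = [0.1, -0.1]

unfused = relu(add_bias(matmul(x, W), b))   # three passes over memory
fused = fused_linear_relu(x, W, b)          # one pass
assert unfused == fused
```

Both paths compute the same values; the fused one simply never writes the intermediates back to memory, which is where the speedup comes from on memory-bound workloads.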

Below Inductor sits the layer that actually generates GPU code. For NVIDIA GPUs, this role increasingly belongs to OpenAI's Triton. Triton is a Python-based domain-specific language for writing GPU kernels that abstracts away the most painful aspects of CUDA programming while still allowing fine-grained control over performance-critical details. Where CUDA requires managing threads, warps, shared memory, and memory coalescing manually, Triton operates on blocks of data and lets its compiler figure out the parallel scheduling. The result is code that is dramatically more accessible than hand-written CUDA while achieving eighty to one hundred percent of hand-tuned performance in many cases.

The Players in the Compiler Wars

Understanding the AI compiler landscape requires understanding the distinct philosophies and strategic positions of the major players. Each has made different bets about the future of heterogeneous computing.

XLA and the Google Philosophy. XLA, the Accelerated Linear Algebra compiler, emerged from Google's need to compile TensorFlow graphs to TPU hardware. But its ambitions grew far beyond TPU support. Today XLA serves as the compilation backend for JAX, Google's functional array computing framework, and through PyTorch/XLA it supports PyTorch workloads on TPUs as well. The XLA philosophy is compiler-centric. Rather than relying on hand-optimized kernel libraries for each hardware platform, XLA performs whole-program analysis to fuse operators and optimize memory layouts automatically. This approach trades some peak performance for generality. When a new model architecture emerges, XLA can often deliver reasonable performance immediately, without waiting for specialized kernels to be written.

Google's recent vLLM TPU work demonstrates this philosophy in action. Rather than maintaining separate code paths for TPU execution, the team unified PyTorch and JAX models under a single XLA compilation pipeline. The result was a twenty percent throughput improvement simply from switching to JAX's more mature HLO graph generation. The model code remained identical. Only the lowering path changed.

Triton and the OpenAI Vision. OpenAI's Triton takes a different approach. Rather than hiding the GPU programming model entirely, Triton exposes it at a higher level of abstraction that remains productive for non-experts while still allowing experts to reason about performance. A Triton kernel operates on blocks of data, and the compiler handles the translation to warps, threads, and memory transactions. This blocked programming model maps naturally to how modern AI workloads want to compute. Matrix multiplications and attention mechanisms both fundamentally operate on tiles of data, so Triton's abstraction aligns with the computation.
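The blocked model can be sketched in plain Python: a grid of independent programs, each identified by a program id, processes one BLOCK_SIZE tile, with a mask guarding the ragged tail. This mirrors the structure of a Triton vector-add kernel, but simulated sequentially, so there is no Triton dependency here; on a GPU these programs run in parallel.

```python
# Pure-Python sketch of Triton's blocked execution model: a grid of
# independent "programs," each identified by pid, handles one tile.
BLOCK_SIZE = 4

def add_kernel(pid, x, y, out, n):
    start = pid * BLOCK_SIZE
    for i in range(start, start + BLOCK_SIZE):
        if i < n:                 # mask: skip out-of-bounds lanes
            out[i] = x[i] + y[i]

def launch(x, y):
    n = len(x)
    out = [0] * n
    grid = (n + BLOCK_SIZE - 1) // BLOCK_SIZE   # ceil-div, like triton.cdiv
    for pid in range(grid):       # on a GPU, these run concurrently
        add_kernel(pid, x, y, out, n)
    return out

result = launch([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
```

The programmer reasons about one tile; the compiler decides how tiles map to warps, threads, and memory transactions.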

The Triton ecosystem has expanded rapidly. The third Triton Developer Conference in October 2025 drew participation from NVIDIA, AMD, Intel, Meta, and others. NVIDIA has invested heavily in Triton support for Blackwell, ensuring that the automatic MMA pipelining and Tensor Core exploitation work seamlessly. AMD supports Triton through ROCm, allowing the same kernel code to target both NVIDIA and AMD hardware with appropriate autotuning. This cross-vendor portability makes Triton increasingly central to hardware-agnostic AI development.

MLIR and the Infrastructure Layer. MLIR, the Multi-Level Intermediate Representation, represents yet another philosophy. Created by Chris Lattner at Google in 2018 and now part of the LLVM project, MLIR is not itself a compiler. It is compiler infrastructure. Where LLVM provides a single low-level IR suitable for traditional languages, MLIR supports multiple levels of abstraction through its dialect mechanism. A dialect defines a set of operations, types, and transformations appropriate for a particular domain or abstraction level.

This design allows compilation pipelines to maintain high-level semantics until as late as possible, enabling domain-specific optimizations that would be impossible if everything were immediately lowered to LLVM IR. TensorFlow's XLA uses MLIR internally. The Mojo programming language, also created by Lattner at Modular, is built entirely on MLIR. TPU-MLIR, ONNX-MLIR, and torch-mlir all leverage the infrastructure to target different domains and hardware. MLIR's role is foundational rather than visible. It is the substrate on which many other compilers are built.

TVM and the Portable Promise. Apache TVM emerged from research at the University of Washington with a specific focus on deployment portability. TVM takes trained models from any major framework, optimizes them through a combination of graph-level and operator-level transformations, and generates efficient code for diverse hardware targets including CPUs, GPUs, FPGAs, and specialized accelerators. Its AutoTVM and AutoScheduler components automate the search through optimization parameter spaces, learning cost models that predict kernel performance without exhaustive profiling.
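The search AutoTVM automates can be caricatured in a few lines: evaluate candidate tile sizes under a cost function and keep the cheapest. The cost model below is entirely hypothetical, invented for illustration; real systems learn cost models from measured kernel runtimes so they can rank candidates without profiling each one.

```python
# Toy autotuner in the spirit of AutoTVM: score candidate tile sizes
# with a cost model and keep the cheapest. (Hypothetical cost terms.)

def cost(tile, n=1024, cache_lines=64):
    waste = (-n) % tile              # padding needed to cover n exactly
    spill = max(0, tile - cache_lines)  # tile exceeds the cache budget
    overhead = n // tile             # per-tile launch overhead
    return waste + 4 * spill + overhead

def autotune(candidates):
    return min(candidates, key=cost)

best = autotune([8, 16, 32, 48, 64, 96, 128])   # -> 64
```

Here 64 wins because it divides the problem exactly, fits the cache budget, and amortizes launch overhead; the real parameter spaces are vastly larger, which is why learned cost models matter.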

TVM's promise is write once, run anywhere. A model compiled through TVM can target x86 CPUs with AVX512, ARM processors with NEON, NVIDIA GPUs with CUDA, AMD GPUs with ROCm, and various embedded accelerators. This portability comes with tradeoffs. TVM's generated code rarely matches hand-tuned vendor libraries on any single platform. But for organizations deploying to heterogeneous fleets, that portability may matter more than peak single-platform performance.

NVIDIA's Proprietary Fortress

And then there is NVIDIA. While the open-source compiler ecosystem grows increasingly sophisticated, NVIDIA maintains the most consequential closed-source compilation assets in AI infrastructure.

cuDNN, the CUDA Deep Neural Network library, provides highly tuned implementations for convolutions, attention mechanisms, normalizations, and the other building blocks of neural networks. These implementations encode deep knowledge of NVIDIA hardware, exploiting undocumented microarchitectural features and using optimization tricks discovered through years of engineering. When PyTorch executes a convolution on CUDA, it typically dispatches to cuDNN. The user sees an innocuous function call. Under the hood, cuDNN selects from dozens of algorithm variants based on tensor shapes, datatypes, and hardware generation, then executes code that represents hundreds of engineer-years of optimization.
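The flavor of that selection logic can be suggested with a toy dispatcher, keeping in mind that cuDNN's actual heuristics are proprietary and far richer, covering datatypes, layouts, and hardware generations. Every rule below is invented for illustration, not a description of cuDNN's real decision tree.

```python
# Hypothetical sketch of shape-based convolution algorithm selection,
# the kind of dispatch cuDNN performs internally. The thresholds and
# rules here are illustrative, not cuDNN's actual heuristics.

def select_conv_algorithm(batch, channels, kernel_size):
    if kernel_size == 1:
        return "implicit_gemm"   # a 1x1 convolution is just a matmul
    if kernel_size == 3 and channels >= 64:
        return "winograd"        # fewer multiplies for small kernels
    if batch * channels > 4096:
        return "fft"             # large problems amortize the transforms
    return "direct"

algo = select_conv_algorithm(8, 128, 3)   # -> "winograd"
```

The point is not the specific rules but the architecture: the caller sees one convolution API, while shape-dependent dispatch to specialized implementations happens invisibly underneath.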

cuBLAS provides similar magic for linear algebra. Matrix multiplication, the beating heart of transformer models, gets routed through cuBLAS kernels that squeeze every available FLOP from tensor cores. These kernels use internal heuristics to select optimal tile sizes, memory access patterns, and arithmetic pipelines. The selection logic itself is proprietary. Competitors can benchmark the outputs but cannot see the decision-making process that achieves those results.

The strategic implications are profound. When you use PyTorch on NVIDIA hardware, you are not merely using open-source software that happens to run on proprietary hardware. You are using an open-source orchestration layer that delegates performance-critical work to closed-source libraries that encode NVIDIA's accumulated competitive knowledge. Every optimization trick that NVIDIA discovers goes into cuDNN and cuBLAS. Every hardware feature gets exposed through these libraries before it appears in any open documentation. The libraries are the moat.

This is why alternative hardware vendors struggle. AMD can ship GPUs with competitive raw specifications. They can open-source their entire software stack, as they have with ROCm. They can provide rocBLAS and MIOpen as open-source equivalents to cuBLAS and cuDNN. But matching NVIDIA's library performance requires matching NVIDIA's accumulated optimization knowledge, which represents two decades of focused investment. The gap has narrowed significantly in recent years. ROCm 7.0 brought substantial improvements, and AMD's MI300X achieves competitive performance on many workloads. But the library performance differential remains the primary reason why identical model code runs faster on NVIDIA silicon.

Where Lock-In Actually Lives

The AI infrastructure discourse obsesses over hardware. H100 allocation. GPU hours. Cluster topology. But the deepest lock-in is not in the hardware. It is in the software layers that make hardware useful.

Consider a production inference deployment. The model itself is hardware-agnostic. A PyTorch model can be exported to ONNX, which can theoretically target any backend. But the moment you optimize that model for production, you start accumulating platform-specific dependencies. TensorRT quantization schemes. Custom CUDA kernels for novel attention patterns. Flash Attention implementations tuned for specific GPU memory hierarchies. Each optimization decision embeds assumptions about the target hardware.

Training is worse. Distributed training code uses NCCL for collective communications. NCCL is NVIDIA's library, optimized for NVLink and NVSwitch topologies. The open-source alternatives exist. AMD has RCCL. Intel has oneCCL. But switching communication libraries means retuning all-reduce algorithms, gradient synchronization patterns, and pipeline parallelism schedules. The model weights are portable. The training code is tied to its communication fabric.
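The algorithm at the heart of these libraries, ring all-reduce, is worth seeing once. Each rank passes one chunk per step around a ring, so per-rank bandwidth stays constant as the cluster grows. The sketch below simulates the two phases sequentially in pure Python, with one element standing in for one chunk; the retuning cost mentioned above comes from the fact that the optimal variant of this pattern depends on the physical interconnect topology.

```python
# Sequential simulation of ring all-reduce, the core pattern behind
# NCCL, RCCL, and oneCCL collectives. One element stands in for one
# chunk; the number of ranks equals the number of chunks in this toy.

def ring_allreduce(buffers):
    n = len(buffers)
    bufs = [list(b) for b in buffers]
    # Reduce-scatter: after n-1 steps, rank r holds the full sum of
    # chunk (r + 1) % n.
    for step in range(n - 1):
        for rank in range(n):
            chunk = (rank - step) % n
            bufs[(rank + 1) % n][chunk] += bufs[rank][chunk]
    # All-gather: circulate the reduced chunks so every rank has all.
    for step in range(n - 1):
        for rank in range(n):
            chunk = (rank + 1 - step) % n
            bufs[(rank + 1) % n][chunk] = bufs[rank][chunk]
    return bufs

out = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# every rank ends with the elementwise sum [12, 15, 18]
```

Vendor libraries select among ring, tree, and other variants based on message size and topology, which is exactly the tuning knowledge that does not transfer when you switch fabrics.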

The compiler layer is where all of this comes together. A compiler that targets multiple backends provides genuine hardware portability. A compiler that targets only CUDA, no matter how efficient, is a lock-in mechanism disguised as a productivity tool. This framing helps explain the strategic importance of projects like Triton, TVM, and XLA. They are not merely convenience tools. They are infrastructure for optionality.

The Open Alternative

AMD has made a deliberate strategic choice to compete on openness. ROCm is open-source from the driver level up. HIP, the Heterogeneous-compute Interface for Portability, allows code to compile against either AMD or NVIDIA backends. HIPIFY can mechanically translate CUDA source code to HIP. The message is clear. AMD cannot out-invest NVIDIA in proprietary libraries. But they can make the open alternative good enough that the proprietary advantage becomes insufficient to justify the lock-in cost.
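Much of what makes HIPIFY feasible is that the runtime APIs map almost one-to-one by renaming. The toy below illustrates that mechanical flavor with a tiny hand-picked mapping; the real tool covers thousands of symbols and parses the source properly rather than doing string replacement.

```python
# Toy illustration of mechanical CUDA-to-HIP translation. These five
# mappings are real API correspondences; the real HIPIFY tool handles
# thousands of symbols with proper source parsing.

CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaStream_t": "hipStream_t",
}

def hipify(source):
    # Longest names first, so cudaMemcpyHostToDevice is rewritten
    # before the shorter cudaMemcpy prefix can match inside it.
    for cuda_name in sorted(CUDA_TO_HIP, key=len, reverse=True):
        source = source.replace(cuda_name, CUDA_TO_HIP[cuda_name])
    return source

cuda_src = "cudaMalloc(&d_x, n); cudaMemcpy(d_x, h_x, n, cudaMemcpyHostToDevice);"
hip_src = hipify(cuda_src)
```

What this sketch cannot translate is the hard part: hand-tuned kernels whose performance assumptions are baked into NVIDIA's warp sizes and memory hierarchy, which is precisely where the remaining gap lives.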

The ROCm ecosystem has matured significantly. PyTorch offers official ROCm packages. vLLM runs on AMD hardware with comparable performance on many workloads. The Frontier supercomputer, the first exascale system, runs on AMD GPUs with ROCm. This represents validation at the highest performance tier. If ROCm can handle exascale scientific computing, it can handle enterprise AI workloads.

Intel has taken yet another approach with oneAPI and SYCL. Rather than providing CUDA compatibility, Intel has bet on standards. SYCL is a Khronos Group standard for heterogeneous computing in C++. oneAPI builds on SYCL to provide a complete development toolkit that targets CPUs, GPUs, FPGAs, and accelerators from any vendor. The SYCLomatic tool migrates CUDA code to SYCL, providing a path away from vendor lock-in.

These open alternatives matter because the AI hardware landscape is diversifying. Google's TPUs. Amazon's Trainium and Inferentia. Microsoft's Maia. Cerebras's wafer-scale engines. Graphcore's IPUs. SambaNova's reconfigurable dataflow units. Each represents a different bet about optimal AI acceleration architecture. The compiler layer determines whether applications can move between these platforms or whether they remain captive to their original target.

The Path to Hardware-Agnostic AI

This is the thesis that animates work on hardware-agnostic AI deployment. Making AI agent deployment portable across NVIDIA, AMD, TPU, and Inferentia accelerators requires solving the compiler problem. You can abstract GPUs behind software layers. You can translate between CUDA and ROCm at the source level. But if your performance-critical paths depend on vendor-specific kernel libraries, the abstraction remains leaky.

True hardware-agnostic deployment requires compilers that generate efficient code for multiple backends without sacrificing performance on any of them. This is hard. Different architectures have different memory hierarchies, different execution models, different optimal tile sizes. A kernel that flies on H100 may crawl on MI300X if it assumes NVIDIA-specific memory access patterns.

The solution involves multiple layers. At the top, framework-level abstractions like PyTorch's torch.compile provide hardware-agnostic entry points. Users write standard PyTorch code. The framework traces it into graphs and delegates to backend compilers. At the middle layer, compiler infrastructure like MLIR allows optimizations to be written once and applied across backends. At the bottom, backend-specific code generation exploits each target's unique capabilities.
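The shape of that layering can be sketched as a registry of backend code generators behind one hardware-agnostic entry point. Every name in this sketch is illustrative, not a real API; the point is only the structure: the same graph, lowered differently per target, with the choice made at compile time.

```python
# Minimal sketch of layered compilation dispatch: one entry point, a
# registry of backend lowerings, target chosen at compile time.
# All names are illustrative, not a real framework API.

BACKENDS = {}

def register_backend(target):
    def wrap(fn):
        BACKENDS[target] = fn
        return fn
    return wrap

@register_backend("cuda")
def lower_cuda(graph):
    return [f"cuda_kernel<{op}>" for op in graph]

@register_backend("rocm")
def lower_rocm(graph):
    return [f"hip_kernel<{op}>" for op in graph]

def compile_model(graph, target):
    if target not in BACKENDS:
        raise ValueError(f"no backend for {target}")
    return BACKENDS[target](graph)   # same graph, target-specific lowering

graph = ["matmul", "add", "relu"]
cuda_kernels = compile_model(graph, "cuda")
rocm_kernels = compile_model(graph, "rocm")
```

The user-facing graph never mentions a vendor; only the lowering does, which is what makes the portability claim structural rather than aspirational.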

The key insight is that portability and performance are not inherently opposed. A well-designed compilation stack can specialize aggressively for each target while maintaining source-level portability. The specialization happens at compile time, invisible to the user. This is the model that traditional language compilers have used for decades. The same C++ code compiles to efficient x86 and ARM binaries. AI compilers are learning to do the same for heterogeneous accelerators.

The Future Is Compiled

The eager execution model that made PyTorch beloved for research is giving way to compiled execution for production. torch.compile is not an optional optimization anymore. It is increasingly the expected deployment path. The TorchBench benchmark suite shows 1.8 to 2x geometric mean speedups across eighty models. These are not marginal gains. They represent a phase transition in how AI software relates to AI hardware.

JAX has always been compiled-first, and its adoption trajectory shows where PyTorch is heading. Leading AI labs including Anthropic, xAI, and Apple train frontier models on JAX. Google's internal AI workloads run on JAX and XLA. The combination of functional programming semantics and aggressive compiler optimization enables parallelism strategies that are difficult to express in eager frameworks.

The compiler stack is where the next decade of AI infrastructure competition will play out. NVIDIA's moat is not the GPU. It is cuDNN and cuBLAS and the accumulated optimization knowledge they embody. AMD's opportunity is not to build better chips. It is to build compilers that close the performance gap. Google's TPU advantage is not the hardware. It is XLA and the compiler-centric architecture that makes TPU viable.

For practitioners, the implication is that compiler literacy is becoming essential. Understanding how torch.compile works, when to use it, how to debug graph breaks, how to write Triton kernels for custom operations. These skills will increasingly separate high-performance from merely functional deployments. The days when you could ignore everything below the PyTorch API are ending.

For infrastructure architects, the implication is that compiler choice is platform choice. Betting on TensorRT means betting on NVIDIA. Betting on XLA means betting on Google or on the broader OpenXLA ecosystem. Betting on Triton means betting on portable GPU computing. These are strategic decisions that will compound over years as optimization investments accumulate in the chosen stack.

The compiler nobody sees is becoming the compiler everybody needs to understand. The invisible layer is becoming visible. And in that visibility lies the path beyond vendor lock-in, toward an AI infrastructure ecosystem where hardware choices are hardware choices and not lifetime commitments.

The real question is not which GPU to buy. It is which compiler stack to invest in. The hardware decision follows from that, not the other way around.

