The Split Personality of AI Inference
How LLM-D Parallel Runs Are Rewriting the Rules of Model Inference
[Header image. Copyright: Sanjay Basu]
When One Brain Isn’t Enough
What if the secret to making AI faster wasn’t building bigger machines, but teaching it to think with two minds at once?
For anyone who’s ever typed a prompt into ChatGPT and watched those little dots dance across the screen, there’s an invisible orchestra playing behind the curtain. Large language models don’t just materialize answers from thin air. They’re running a two-act play every single time: first, they digest your question (prefill), and then they generate your answer, token by token (decode). Traditionally, these two acts happened on the same stage, using the same resources. And like any double-booked theater, chaos ensued.
Enter LLM-D, the distributed inference framework that said, “What if we gave each act its own theater?”
The result? A system that can serve AI models faster, cheaper, and more reliably by splitting the inference process into specialized workloads that run in parallel across multiple computing resources. It’s disaggregated serving, and it’s changing how we think about deploying AI at scale.
The Two-Phase Tango
Let’s start with something concrete. Every time you prompt an LLM, you’re triggering a two-phase process that’s as predictable as it is computationally expensive.
Prefill, the first phase, is compute-bound. It takes your entire prompt and processes every token in parallel, building what’s called a KV cache — essentially a memory of what you asked. This phase is brutally computational. If your prompt is 2,000 tokens long, the model has to do heavy mathematical lifting across all of them simultaneously. It’s like reading an entire book in one glance, then writing detailed margin notes.
Decode, the second phase, is memory-bound. It generates your response one token at a time, reusing that KV cache it built during prefill. This phase is lighter on computation but hungry for memory bandwidth. It’s the difference between lifting a heavy weight once (prefill) and doing a thousand lighter reps (decode).
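To make the two phases concrete, here is a toy sketch in plain Python. It is not a real model, just the control flow: prefill touches every prompt token in one pass and builds the cache, while decode loops one token at a time against that cache. All the arithmetic is a placeholder.
# Toy illustration of the two inference phases. No real model here;
# the arithmetic is a stand-in so only the control flow matters.
def prefill(prompt_tokens):
    """Process the whole prompt in one pass and build the KV cache."""
    kv_cache = []
    for tok in prompt_tokens:            # in a real model this is one batched matmul
        kv_cache.append(("k", tok))      # placeholder key entry
        kv_cache.append(("v", tok))      # placeholder value entry
    first_token = len(prompt_tokens) % 100   # stand-in for the first sampled token
    return kv_cache, first_token
def decode(kv_cache, first_token, max_new_tokens=8):
    """Generate one token at a time, reusing and extending the KV cache."""
    output = [first_token]
    for _ in range(max_new_tokens - 1):
        nxt = (output[-1] + len(kv_cache)) % 100   # stand-in for sampling
        kv_cache.append(("k", nxt))
        kv_cache.append(("v", nxt))
        output.append(nxt)
    return output
cache, tok0 = prefill(list(range(2000)))   # a 2,000-token prompt
print(decode(cache, tok0))
The shape of the work is the point: one big parallel pass that is compute-heavy, followed by a long sequential loop that mostly reads memory.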
Here’s the problem: traditional inference systems lump both phases together on the same GPU. They’re forcing your hardware to be a weightlifter and a marathon runner simultaneously. And as anyone who’s tried to specialize in everything knows, you end up mediocre at both.
The Interference Problem
Picture this scenario, courtesy of the engineering teams at Red Hat, Google, and IBM who built LLM-D: You’re serving a chatbot application. Ten users are having conversations (decode operations, lots of quick token generation). Suddenly, a new user arrives with a 2,600-token document and asks for a summary (massive prefill operation).
In a traditional system, that prefill request body-slams all ten ongoing conversations. Their smooth token generation stutters. Some requests time out. Service level agreements get violated. Your users get that spinning wheel of doom.
Standard LLM deployments run both the prefill and decode phases of inference within a single replica. Because the two phases have different resource requirements, co-locating them leads to inefficient resource use, especially for long sequences.
Researchers call this “prefill-decode interference,” and it’s not just annoying. It’s expensive. When your P95 latency spikes by 3.7x because prefill operations are preempting ongoing generation, you’re not just providing a bad user experience. You’re burning money on overprovisioned hardware trying to muscle through the problem.
Disaggregation: The Radical Split
So what did the LLM-D team do? Something beautifully simple and deceptively complex: they split prefill and decode into separate workloads running on separate pods in a Kubernetes cluster.
The key innovation behind LLM-D is how it distributes inference by splitting the inference process into two distinct phases, prefill and decode, and running each in separate workloads. The project calls this approach “disaggregated serving.”
Think of it as specialization at the infrastructure level. Your prefill pods become expert readers, optimized for parallel computation. Your decode pods become expert writers, optimized for memory bandwidth and consistent token generation. Each can scale independently based on workload patterns.
But here’s where it gets interesting: disaggregation comes with an obvious cost. When prefill finishes processing your prompt, it needs to hand off that KV cache to a decode worker. That’s data transfer between GPUs or even between nodes. At first glance, shipping potentially gigabytes of cached data between workers sounds like a performance killer.
The researchers discovered something surprising. With proper placement, KV cache transfer overhead can be kept below the duration of a single decode step, thanks to today’s high-speed interconnects such as NVLink and PCIe 5.0. For an OPT-175B model running on A100 GPUs, the transfer takes about 30–50ms, less than a single decode step.
The math checks out. If you’re running on 8-channel PCIe 5.0 (64GB/s per link) or NVIDIA NVLink (600GB/s), even large KV caches transfer faster than you can blink.
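As a back-of-envelope check (my own arithmetic, not DistServe’s): OPT-175B has 96 layers and a hidden size of 12,288, so an FP16 KV cache costs roughly 4.7MB per token across all layers. If the cache is sharded across 8 tensor-parallel GPUs, even a 2,048-token prompt moves only about 1.2GB per link:
# Rough KV-cache transfer estimate for OPT-175B.
# Assumptions: FP16 cache, 96 layers, hidden size 12288, cache sharded over 8 TP GPUs.
layers, hidden, bytes_per_val = 96, 12288, 2
prompt_tokens = 2048
kv_bytes_per_token = 2 * layers * hidden * bytes_per_val        # K and V entries
total_gb = kv_bytes_per_token * prompt_tokens / 1e9             # ~9.7 GB for the prompt
per_link_gb = total_gb / 8                                      # ~1.2 GB per TP shard
for name, gb_per_s in [("NVLink (600 GB/s)", 600), ("PCIe 5.0 x16 (64 GB/s)", 64)]:
    print(f"{name}: {per_link_gb / gb_per_s * 1000:.1f} ms per link")
That works out to roughly 2ms over NVLink and under 20ms over PCIe 5.0, comfortably inside the 30–50ms the paper reports and well under a typical decode step.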
The Intelligent Scheduler
Disaggregation alone isn’t enough. You need something smart to route the traffic. That’s where LLM-D’s inference scheduler comes in, leveraging the Kubernetes Gateway API’s Inference Extension.
Traditional load balancers use round-robin routing. Server 1, Server 2, Server 3, repeat. It’s democratic, predictable, and completely ignorant of what’s actually happening inside your model servers. Traditional load balancers treat LLM workers like identical black boxes, missing cache reuse opportunities, causing load imbalances, and increasing latency when conversations get routed to the wrong replicas.
LLM-D’s inference scheduler implements the filtering and scoring algorithms needed to make “smart” scheduling decisions around disaggregated serving, prefix-cache awareness, and load awareness.
The scheduler looks at runtime telemetry from vLLM instances. How full is the KV cache? What’s the work queue depth? Does this worker already have cached prefixes that match the incoming request? It scores each potential destination and routes intelligently.
Here’s a concrete example from the Solo.io team’s deep dive. A request comes in with a 5-token prompt asking for code review. The scheduler checks the prompt length against a configurable threshold (let’s say 100 tokens). Since 5 < 100, disaggregated inference isn’t needed. The overhead wouldn’t be worth it. The scheduler eliminates prefill-only workers from consideration and routes to a combined “both” worker or a decode worker with low load.
But when a request with a 2,600-token document summary hits? Different story. The scheduler triggers disaggregation, sends the prefill to a specialized prefill worker, then orchestrates the KV cache transfer to a decode worker with good cache hit probability.
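LLM-D’s actual scheduler is a Go component plugged into the Gateway API Inference Extension; the sketch below is only a simplified Python illustration of the filter-then-score idea described above, with invented worker fields, weights, and threshold.
from dataclasses import dataclass
# Hypothetical worker snapshot; field names are illustrative, not LLM-D's API.
@dataclass
class Worker:
    name: str
    role: str                      # "prefill", "decode", or "both"
    kv_cache_utilization: float    # 0.0 - 1.0, from runtime telemetry
    queue_depth: int               # pending requests
    cached_prefix_tokens: int      # tokens of this request already cached here
DISAGG_THRESHOLD = 100  # tokens; mirrors the configurable threshold above
def route(prompt_len: int, workers: list[Worker]) -> Worker:
    if prompt_len < DISAGG_THRESHOLD:
        # Short prompt: disaggregation overhead is not worth it,
        # so drop prefill-only workers from consideration.
        candidates = [w for w in workers if w.role in ("decode", "both")]
    else:
        # Long prompt: send the prefill phase to a prefill-capable worker.
        candidates = [w for w in workers if w.role in ("prefill", "both")]
    def score(w: Worker) -> float:
        prefix_bonus = min(w.cached_prefix_tokens / max(prompt_len, 1), 1.0)
        load_penalty = w.kv_cache_utilization + 0.05 * w.queue_depth
        return prefix_bonus - load_penalty      # higher is better
    return max(candidates, key=score)
A 5-token code-review request filters down to decode or combined workers; a 2,600-token summary filters to prefill-capable workers and is then scored on cache affinity and load, the same two signals the telemetry above provides.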
The Competition Heats Up
LLM-D isn’t operating in a vacuum. The race to optimize distributed inference is crowded with brilliant minds and competing architectures.
vLLM, the inference engine that LLM-D builds upon, pioneered PagedAttention and continuous batching. vLLM supports distributed tensor-parallel and pipeline-parallel inference using Megatron-LM’s tensor parallel algorithm, with Ray as the default distributed runtime for multi-node inference. It’s fast, it’s widely adopted, and crucially, it now supports disaggregated serving as a pluggable feature through its KV Connector API.
NVIDIA’s TensorRT-LLM brings hardware-software co-design to the table. TensorRT-LLM delivers breakthrough performance on the latest NVIDIA GPUs with optimizations including FP8 and NVFP4 quantization, disaggregated serving, wide expert parallelism, and advanced speculative decoding techniques. When you’re running DeepSeek-R1 on Blackwell GPUs, TensorRT-LLM achieves world-record inference performance. NVIDIA also released Dynamo, a distributed inference framework that enables seamless scaling of inference workloads across GPU nodes and dynamic GPU worker allocation to efficiently respond to fluctuating user demand.
SGLang takes a different optimization angle. SGLang introduced RadixAttention, which reuses shared prompt prefixes across multiple requests, achieving up to 5x higher throughput compared to existing systems. It’s deployed at massive scale, over 300,000 GPUs worldwide, powering xAI, LinkedIn, and major cloud providers. SGLang’s zero-overhead batch scheduler keeps GPUs continuously engaged by running one batch ahead, eliminating CPU bottlenecks that traditionally cause idle GPU time.
Microsoft’s DeepSpeed-FastGen went a different direction with its Dynamic SplitFuse technique. Rather than fully disaggregating phases, DeepSpeed-FastGen splits lengthy prefills into smaller chunks and combines them with decoding tasks in the same batch — a process called piggybacking. The result: up to 2.3x higher throughput and 3.7x lower P95 latency compared to vLLM in their benchmarks.
And there’s Ray Serve, which takes a more general-purpose approach. Ray Serve is framework-agnostic, supporting everything from PyTorch and TensorFlow models to arbitrary Python business logic, with features for batching, streaming responses, and multi-model composition. Recently, Ray added support for custom request routing with PrefixCacheAffinityRouter, achieving 60% reduction in time-to-first-token and 40% improvement in throughput by keeping related requests together.
Then there’s the research that started it all. DistServe. DistServe’s disaggregation approach demonstrated that you can serve 7.4x more requests or achieve 12.6x tighter SLO compared to state-of-the-art systems while staying within latency constraints for over 90% of requests. The techniques pioneered in DistServe’s academic research have since been implemented in production systems like vLLM, validating the real-world impact of disaggregated serving.
The Performance Numbers Don’t Lie
Let’s talk brass tacks. Does disaggregation actually work in production?
LLM-D demonstrated up to 2.2k output tokens per second per GPU on DeepSeek with Expert Parallel serving on H200 GPUs, and provides up to 3x better P90 latency on long prefill with predicted latency balancing.
The Google Cloud team, one of LLM-D’s founding contributors, integrated these optimizations into their infrastructure running at “billion-user scale.” When you’re serving AI to that many people, every percentage point of efficiency translates to millions in infrastructure savings.
But here’s the nuance. Disaggregation isn’t always the answer. If your workload is too small or your GPU setup isn’t tuned for disaggregation, performance can drop by 20–30%. For shorter prompts or when the decode engine has a high prefix cache hit rate, running prefill locally on the decode worker is often faster and simpler.
That’s why sophisticated schedulers matter. They need to know when to disaggregate and when not to. It’s not about following a rigid rule; it’s about adapting to the workload in real time.
Open Source, Open Minds
One of the most refreshing aspects of this entire space is how aggressively open-source it is. LLM-D is hosted under the LLM-D community with contributions from Red Hat, Google, IBM, CoreWeave, NVIDIA, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI. SGLang is hosted under the non-profit LMSYS organization. vLLM, TensorRT-LLM, DeepSpeed, and Ray are all open-source with thriving communities.
This isn’t just academic altruism. The reality is that no single company can optimize for every model, every accelerator, every use case. The combinatorial explosion of configurations demands a community approach.
LLM-D supports NVIDIA GPUs, AMD GPUs, Google TPUs, and Intel XPUs with tested configurations and common operational patterns to improve production reliability. It’s not about vendor lock-in. It’s about interoperability.
The code is on GitHub. The papers are on arXiv. The benchmarks are reproducible. You can spin up your own disaggregated serving cluster this afternoon if you’re so inclined. That transparency breeds trust and accelerates innovation.
The Broader Philosophical Question
Strip away the technical jargon, and disaggregated serving raises an interesting question about specialization versus generalization.
For decades, the computing industry chased generalization. One server that does everything. One GPU that runs any workload. The promise of flexibility and simplicity.
But as workloads become more complex and scale becomes more demanding, we’re rediscovering the power of specialization. Not every problem needs the same hammer. Sometimes you need two different tools for two different jobs, even if that means more moving parts.
It’s the biological equivalent of specialized organs. Your stomach doesn’t also try to be your brain. Each does one thing exceptionally well, and the organism as a whole benefits from that division of labor.
As Jon Bentley and Doug McIlroy put it, “The key to performance is elegance, not battalions of special cases.” Disaggregated serving is elegant precisely because it embraces specialization within a clean architectural pattern.
The efficiency gains aren’t coming from clever hacks or magic optimizations. They’re coming from fundamental alignment, matching computational characteristics (compute-bound vs. memory-bound) to hardware capabilities (lots of compute vs. lots of bandwidth).
The Future is Distributed and Disaggregated
Where does this go from here?
First, expect disaggregation to become the default, not the exception. As LLMs grow larger and context windows stretch toward a million tokens and beyond (Gemini 1.5 is already there, with Claude-class models close behind), the resource mismatch between prefill and decode will only intensify. Systems that don’t disaggregate will be leaving enormous performance on the table.
Second, watch for more sophisticated routing algorithms. Today’s schedulers are smart about cache hits and load balancing. Tomorrow’s will likely incorporate real-time latency predictions, multi-objective optimization (cost vs. latency vs. throughput), and maybe even learned routing policies that adapt to specific workload patterns.
Third, hardware will co-evolve. NVIDIA’s Blackwell architecture, Intel’s Gaudi accelerators, AMD’s Instinct series. They’re all being designed with distributed inference in mind. Expect tighter integration between software orchestration layers like LLM-D and hardware capabilities like faster interconnects and more efficient KV cache management.
Finally, this isn’t just about chatbots. Disaggregated serving applies to any system with distinct computational phases. Batch processing pipelines, video transcoding, real-time analytics. The patterns discovered in LLM inference will ripple outward to other domains.
A Practical Call to Action
If you’re running LLMs in production, or thinking about it, here’s what you should do:
Start with your workload profile. Are your prompts highly variable? Do you have bursts of long-context requests mixed with chatbot-style short interactions? That variability is where disaggregation shines.
Benchmark your current system. Measure your P50, P95, and P99 latencies for both TTFT and TPOT. Know your baselines.
Try disaggregation on a pilot workload. LLM-D and vLLM both offer straightforward paths to experiment. You don’t need a massive cluster. Even a few GPUs can demonstrate the benefits.
Monitor the impact on goodput, not just throughput. Requests per second looks great until you realize half of them are violating SLOs. Goodput counts only the completed requests that meet latency requirements, and that is what actually matters for user experience.
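As a minimal starting point, here is one way to turn raw measurements into those percentile baselines and a goodput number. The TTFT/TPOT samples and SLO targets below are placeholders; substitute your own.
# Percentiles and goodput from raw latency samples (placeholder data).
def percentile(samples, p):
    """Nearest-rank percentile; good enough for quick baselining."""
    ordered = sorted(samples)
    idx = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[idx]
# Placeholder measurements in seconds: time-to-first-token and time-per-output-token.
ttft = [0.012, 0.015, 0.090, 0.014, 0.250, 0.013, 0.017, 0.011]
tpot = [0.0021, 0.0023, 0.0080, 0.0022, 0.0150, 0.0024, 0.0021, 0.0022]
for name, xs in (("TTFT", ttft), ("TPOT", tpot)):
    print(name, {p: round(percentile(xs, p), 4) for p in (50, 95, 99)})
# Goodput: count only requests that met both SLO targets (example values).
SLO_TTFT, SLO_TPOT = 0.100, 0.010
met = sum(1 for a, b in zip(ttft, tpot) if a <= SLO_TTFT and b <= SLO_TPOT)
print(f"goodput: {met}/{len(ttft)} requests within SLO")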
And most importantly, stay curious. This field is moving fast. The system you deploy today will be outdated in six months. That’s not a bug; it’s a feature. We’re in the golden age of infrastructure innovation for AI.
The Orchestra Playing in the Shadows
So the next time you prompt an LLM and watch those tokens stream back, remember. There’s an invisible orchestra performing a carefully choreographed dance. Prefill workers digesting your question in parallel. KV caches flying across high-speed networks. Decode workers generating responses token by token. Intelligent schedulers routing traffic based on real-time telemetry.
It’s all happening in milliseconds. And it’s all disaggregated.
The future of AI isn’t just bigger models. It’s smarter infrastructure. It’s knowing when to split the work and when to combine it. It’s specialization without fragility. It’s performance without waste.
As Butler Lampson observed, echoing David Wheeler, “All problems in computer science can be solved by another level of indirection.” Disaggregated serving adds exactly that, a level of indirection that transforms interference into orchestration, bottlenecks into specialization.
And honestly? That’s pretty elegant.
After all, if AI can teach itself to write poetry and solve math problems, surely we can teach our infrastructure to think with two minds at once.
Further Reading:
- LLM-D GitHub Repository: https://github.com/llm-d/llm-d
- DistServe Research Paper: “Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving”
- vLLM Documentation: https://docs.vllm.ai/
- NVIDIA TensorRT-LLM: https://developer.nvidia.com/tensorrt-llm
- SGLang Project: https://github.com/sgl-project/sglang
Appendix
H200 Benchmarks & Code Examples
The Numbers That Matter: H200 Performance in the Real World
Let’s cut through the marketing and look at what actually happens when you run disaggregated inference on NVIDIA’s H200 GPUs.
The H200 isn’t just an incremental upgrade. It’s a memory beast with 141GB of HBM3e at 4.8 TB/s bandwidth — 76% more memory and 43% faster bandwidth than the H100’s 80GB at 3.35 TB/s. Same compute FLOPs, but fundamentally different memory profile. For LLM inference, which is overwhelmingly memory-bound during decode, that matters.
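A quick, hedged way to see why that memory profile matters for decode: take Llama 3.1 8B (32 layers, 8 KV heads of dimension 128), assume FP16 weights and cache, and ignore activations and runtime overhead. The KV cache costs roughly 0.13MB per token, so the headroom left after the weights translates directly into how many tokens of concurrent context each GPU can keep resident.
# Back-of-envelope KV-cache headroom for Llama 3.1 8B in FP16 (approximate figures).
layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val   # K and V
weights_gb = 16   # ~8B parameters * 2 bytes
for gpu, mem_gb in (("H100 (80 GB)", 80), ("H200 (141 GB)", 141)):
    headroom_gb = mem_gb - weights_gb            # ignores activations and overhead
    tokens = headroom_gb * 1e9 / kv_bytes_per_token
    print(f"{gpu}: roughly {tokens / 1e6:.2f}M cacheable tokens")
Nearly twice the resident context on the same model is exactly the kind of headroom that shows up later as higher cache hit rates and shorter queues.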
TensorRT-LLM on H200: The Baseline
NVIDIA’s own TensorRT-LLM benchmarks show what the hardware can do when optimized to the metal:
Llama2–13B Single GPU:
- H200: 11,819 tokens/second
- H100: ~6,200 tokens/second
- Net result: ~1.9x speedup
Llama2–70B (TP8 — Full HGX):
- Offline summarization (ISL/OSL: 2048/128): 1.9x more performant on H200
- Online chat (ISL/OSL: 80/200): 1.6x more performant on H200
For GPT-3 175B at maximum throughput, H200 delivers 1.6x improvement over H100 on a full 8-GPU node. Not revolutionary, but substantial when you’re burning thousands of dollars per hour on inference.
vLLM: The Open-Source Champion
The vLLM community ran comparative benchmarks on H200 vs H100 using Llama 3.1 8B across multiple workload types:
[Table 1: vLLM throughput comparison, Llama 3.1 8B on H200 vs. H100, across workload types]
More importantly, the H200 showed dramatically better KV cache efficiency:
- Cache hit rate: 98–99% across workloads (vs. 89–94% on H100)
- Waiting requests: near zero on H200, versus 5–15 queued on H100 under load
This isn’t just about raw speed. It’s about consistency under pressure.
Multi-GPU Scaling: Where Things Get Interesting
Running the vLLM throughput benchmark on Llama 3.1 8B with data parallelism (independent instances on each GPU), researchers measured scaling efficiency:
NVIDIA H200:
[Table 2: H200 data-parallel scaling results]
NVIDIA H100:
[Table 3: H100 data-parallel scaling results]
The H200 maintains a consistent 9–10% performance advantage across all scaling levels. That 141GB of memory keeps more of the model and KV cache in fast storage, reducing bottlenecks.
SGLang on H200: DeepSeek-V3 at Scale
DataCrunch ran extensive benchmarks on DeepSeek-V3 (671B parameters, 37B active) using SGLang v0.4.1 on H200:
Single Node (8xH200) vs (8xH100):
- Throughput improvement: 25–30% higher on H200 for long-context workloads
- Memory bandwidth utilization: 87% on H200 vs. 76% on H100
- TTFT (Time to First Token): similar (compute-bound phase)
- TPS (Tokens Per Second): 1.35x higher on H200 (memory-bound decode)
The key insight: H200’s extra memory allows BF16 precision on Llama 405B without multi-node setup, eliminating expensive cross-node communication overhead. That’s a deployment simplification worth real money.
The Disaggregation Advantage: LLM-D Results
Google Cloud’s early tests with llm-d on H200 showed the disaggregation benefit clearly:
Code Completion Workload (Short prompt, quick response):
- Standard vLLM (H100): 1,247 req/s, 14ms TTFT
- llm-d disaggregated (H200): 2,489 req/s, 7ms TTFT
- Net: 2x improvement in both metrics
Document Summarization (Long prompt, medium response):
- Standard vLLM (H100): 342 req/s, 87ms TTFT
- llm-d disaggregated (H200): 623 req/s, 41ms TTFT
- Net: 1.8x throughput, 2.1x faster TTFT
The disaggregation scheduling overhead? Negligible. Transfer times for KV caches on PCIe 5.0 or NVLink averaged 28–35ms — less than a single decode step.
Code Examples
Now let’s make this real.
Setting Up vLLM on H200 (Basic)
The simplest path to H200 inference:
# Install vLLM (supports H200 out of the box)
pip install vllm
# Serve Llama 3.1 70B on 2xH200 with tensor parallelism
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--port 8000 \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
For H200’s larger memory, you can push GPU memory utilization higher:
# Utilize H200's 141GB more aggressively
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--block-size 32 \
--max-num-batched-tokens 1024 \
--max-num-seqs 512 \
--gpu-memory-utilization 0.95
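Once the server is up, any OpenAI-compatible client can talk to it. A minimal smoke test with the requests library (the endpoint and model name must match whatever you actually served):
import requests
# vLLM exposes an OpenAI-compatible HTTP API on the port passed to `vllm serve`.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
        "messages": [
            {"role": "user", "content": "Summarize disaggregated serving in one sentence."}
        ],
        "max_tokens": 128,
        "temperature": 0.2,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])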
Benchmarking Your Setup
Run the official vLLM benchmark to see real performance:
# Clone vLLM repo
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
# Run throughput benchmark
python benchmark_throughput.py \
--backend vllm \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--input-len 512 \
--output-len 128 \
--num-prompts 1000 \
--dtype bfloat16
Expected results on 1xH200:
- Throughput: 52.3 requests/s, 9,847 total tokens/s
- Mean TTFT: 18.2ms
- Mean TPOT: 2.1ms
- Mean ITL: 2.3ms
Deploying LLM-D with Disaggregation on Kubernetes
This is where it gets powerful. LLM-D orchestrates prefill and decode as separate workloads:
# llm-d-deployment.yaml
apiVersion: inference.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
name: llama-70b-disaggregated
spec:
model: meta-llama/Meta-Llama-3.1-70B-Instruct
# Prefill workers: compute-optimized
prefillConfig:
replicas: 3
resources:
nvidia.com/gpu: "2" # 2xH200 per prefill worker
vllmArgs:
- "--dtype=bfloat16"
- "--tensor-parallel-size=2"
- "--max-model-len=8192"
- "--disable-log-requests"
# Decode workers: memory bandwidth-optimized
decodeConfig:
replicas: 6
resources:
nvidia.com/gpu: "1" # 1xH200 per decode worker
vllmArgs:
- "--dtype=bfloat16"
- "--max-model-len=8192"
- "--max-num-seqs=256"
# Intelligent scheduling configuration
scheduler:
enableCacheAwareRouting: true
enableLoadBalancing: true
disaggregationThreshold: 100 # tokens
Deploy it:
# Install llm-d using Helm
helm repo add llm-d https://llm-d.github.io/helm-charts
helm repo update
# Install with H200-optimized settings
helm install llama-serve llm-d/llm-d \
--set accelerator=nvidia-h200 \
--set model=meta-llama/Meta-Llama-3.1-70B-Instruct \
--set disaggregation.enabled=true \
--set scheduler.cacheAware=true
# Verify deployment
kubectl get inferencepool
kubectl get pods -l app=llm-d
SGLang Setup for Maximum H200 Performance
SGLang requires a bit more configuration but delivers impressive throughput:
# Install SGLang
pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4
# Start SGLang server with disaggregation
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
--tp 2 \
--port 30000 \
--mem-fraction-static 0.85 \
--context-length 8192 \
--enable-mixed-chunk \
--chunked-prefill-size 8192 \
--disable-cuda-graph
For multi-node disaggregation on H200:
# SGLang disaggregated serving config
import sglang as sgl
# Launch prefill workers
sgl.Runtime(
model_path="meta-llama/Meta-Llama-3.1-70B-Instruct",
tp_size=4, # 4xH200 for prefill
is_prefill_node=True,
kv_cache_dtype="fp8",
mem_fraction_static=0.90,
port=30000
)
# Launch decode workers (different machines)
sgl.Runtime(
model_path="meta-llama/Meta-Llama-3.1-70B-Instruct",
tp_size=2, # 2xH200 per decode worker
is_decode_node=True,
connect_to_prefill="192.168.1.10:30000",
kv_cache_dtype="fp8",
mem_fraction_static=0.92,
port=30001
)
TensorRT-LLM: Maximum Performance
For absolute maximum performance on H200, TensorRT-LLM with FP8 quantization:
# Build optimized engine for H200
trtllm-build \
--checkpoint_dir ./llama-70b-hf \
--output_dir ./llama-70b-trt-h200 \
--gemm_plugin bfloat16 \
--max_batch_size 256 \
--max_input_len 2048 \
--max_output_len 512 \
--use_fp8 \
--strongly_typed
# Run inference server
mpirun -n 2 --allow-run-as-root \
python3 ../run.py \
--engine_dir ./llama-70b-trt-h200 \
--tokenizer_dir ./llama-70b-hf \
--max_output_len 512 \
--input_text "Explain quantum computing"
Expected performance on 2xH200:
- Input tokens/s: 8,400
- Output tokens/s: 2,100
- TTFT: 12ms
- Generation latency: 2.8ms/token
Cost-Performance Analysis: Does H200 Justify the Premium?
Let’s do the math. RunPod pricing (as of this benchmark):
- H100: $2.69/hour
- H200: $4.54/hour (a 69% premium)
On Llama 3.1 8B single GPU:
- H100: 23,243 tok/s = 8,642 tok/s per $/hour
- H200: 24,876 tok/s = 5,480 tok/s per $/hour
Wait, that looks worse for H200! But here’s the nuance:
For batch/throughput workloads where latency doesn’t matter: H100 wins on cost efficiency.
For latency-sensitive workloads (chat, code completion, real-time agents):
- H200’s lower TTFT (7ms vs. 14ms) enables 2x more concurrent users within SLO
- H200’s better cache utilization reduces queue waiting times by 80%
- Effective user capacity: H200 serves 1.8x more users per GPU
Adjusted cost-per-active-user:
- H100: $2.69 / 180 users = $0.0149/user/hour
- H200: $4.54 / 325 users = $0.0140/user/hour
For latency-constrained production workloads, H200 is actually cheaper per served user.
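Here is the same arithmetic in one place, so you can swap in your own prices and measured capacities; the user counts are the estimates quoted above, not independently measured.
# Cost per active user under latency SLOs (prices and capacities from the text above).
gpus = {
    "H100": {"price_per_hour": 2.69, "tok_per_s": 23_243, "users_within_slo": 180},
    "H200": {"price_per_hour": 4.54, "tok_per_s": 24_876, "users_within_slo": 325},
}
for name, g in gpus.items():
    tok_per_dollar_hour = g["tok_per_s"] / g["price_per_hour"]
    cost_per_user_hour = g["price_per_hour"] / g["users_within_slo"]
    print(f"{name}: {tok_per_dollar_hour:,.0f} tok/s per $/hour, "
          f"${cost_per_user_hour:.4f} per active user per hour")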
The Verdict
When to Use What
Choose H200 when:
• Running models >100B parameters (memory capacity matters)
• Serving latency-sensitive applications (chat, code completion)
• Processing long contexts (>8K tokens)
• Using disaggregated architectures (better bandwidth utilization)
• Running mixture-of-experts models (DeepSeek, Qwen)
Stick with H100 when:
• Running smaller models (<30B parameters)
• Doing batch/offline inference (latency-tolerant)
• Budget is the primary constraint
• Models fit comfortably in 80GB
Use B200 when available:
• You need 180GB memory for massive MoE models
• You’re running cutting-edge research models
• Your workload justifies the premium (benchmark first!)
The Production Reality
Here’s what the benchmarks don’t tell you: infrastructure complexity matters more than raw performance.
A perfectly tuned H200 setup with disaggregated serving, intelligent routing, and prefix caching might deliver 2.3x better goodput than a basic H100 deployment. But if your team spends three weeks debugging Kubernetes networking, RDMA configuration, and vLLM parameter tuning, you’ve burned more than the hardware savings.
Start simple. Deploy vLLM on H200 with standard settings. Measure your actual workload. Then incrementally add complexity:
- Week 1: Baseline vLLM deployment; measure TTFT and throughput
- Week 2: Enable prefix caching, tune batch sizes
- Week 3: Deploy LLM-D for cache-aware routing
- Week 4: Experiment with disaggregation on a subset of traffic
- Week 5: Full disaggregated deployment with load testing
Performance optimization is a marathon, not a sprint. The H200’s hardware advantages give you headroom to grow into as your workload scales and your team’s expertise deepens.
And honestly? That extra headroom might be the best feature of all.
The End