JEPA does not need a qubit. The robot might need a spike.
A note on substrates, world models, and what is actually shipping.
Copyright: Sanjay Basu
Why I am writing this
There has been a small swell of press and LinkedIn threads in the last few weeks coupling JEPA-family world models with quantum computing as their training substrate. I have read enough of these to want to write something straight.
The pairing is not absurd. Quantum and JEPA both feel non-classical in flavor, both invoke prediction over physical reality, and both have well-funded research programs that promise the future. Fine. But the technical case for running JEPA training on a quantum computer does not exist yet, will not exist in three to five years, and probably will not exist in the form people are imagining.
What does exist, and what the same kind of buyer might actually find useful inside the same decade, is neuromorphic silicon for inference at the edge. That is the bet I would make if forced to put money on a non-classical substrate doing real work for world models in the near term.
Let me lay out why.
What JEPA actually is
JEPA is Yann LeCun’s bet on what scaling looks like after autoregressive next-token prediction stops paying. V-JEPA 2, released by Meta in mid-2025 and updated to V-JEPA 2.1 since, is a 1.2 billion parameter encoder-predictor trained on more than a million hours of internet video, with an action-conditioned post-training phase that uses about 62 hours of robot trajectories from the DROID dataset. The trained system does zero-shot pick-and-place on Franka arms in labs it has never seen, hitting between 65% and 80% success on unfamiliar objects. The architectural commitment is that prediction happens in a learned latent space, not in pixel space, because pixel-level prediction wastes capacity on textures and shadows that do not matter for planning.
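To make "prediction in latent space" concrete, here is a minimal sketch in plain Python. The toy `encode` and `predict` functions, the two-number latent, and the choice of an L1 loss are all illustrative assumptions, not Meta's implementation. The only point is that the loss compares predicted and target embeddings of masked patches and never touches pixels.

```python
# Toy sketch of a JEPA-style objective. Every function here is hypothetical.

def encode(patch):
    """Stand-in encoder: maps a patch (list of pixel values) to a 2-d latent."""
    mean = sum(patch) / len(patch)
    spread = max(patch) - min(patch)
    return [mean, spread]

def predict(context_latents):
    """Stand-in predictor: guesses a masked latent as the mean of the context latents."""
    n = len(context_latents)
    dim = len(context_latents[0])
    return [sum(z[d] for z in context_latents) / n for d in range(dim)]

def latent_l1_loss(patches, masked_idx):
    """Mean absolute error between predicted and target latents of masked patches."""
    context = [encode(p) for i, p in enumerate(patches) if i not in masked_idx]
    guess = predict(context)
    total, count = 0.0, 0
    for i in masked_idx:
        target = encode(patches[i])  # assumption: same encoder reused as the target
        total += sum(abs(g - t) for g, t in zip(guess, target))
        count += len(target)
    return total / count

patches = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9], [0.4, 0.6]]
loss = latent_l1_loss(patches, masked_idx={3})
print(f"latent L1 loss on the masked patch: {loss:.3f}")
```

The masked patch has the same mean as its neighbors but a much smaller spread, so the error comes entirely from the second latent dimension. A pixel-space loss would instead penalize every pixel difference, whether or not it matters for planning.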
That is the actual workload. A Vision Transformer encoder, a transformer-style predictor, masked latent prediction objectives, large-scale self-supervised pretraining on video, light fine-tuning on interaction data, and runtime inference that runs model-predictive control over imagined latent rollouts. Everything in the dominant compute path is dense matmul over learned embeddings. The bottlenecks are data efficiency in self-supervised video pretraining, the choice of which latent abstraction is worth predicting, embodied data collection at scale, and the energy cost of running these things in the wild on robots that have batteries.
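The "model-predictive control over imagined latent rollouts" step can be sketched as follows. This is a toy under stated assumptions: the additive `predictor`, the 2-d integer latent, the four-action set, and the exhaustive enumeration are all hypothetical stand-ins. A real system would use a learned predictor and some sampling-based optimizer rather than brute force.

```python
from itertools import product

def predictor(latent, action):
    """Hypothetical action-conditioned predictor: next latent = latent + action."""
    return (latent[0] + action[0], latent[1] + action[1])

def goal_cost(latent, goal):
    """L1 distance between an imagined latent and the goal latent."""
    return abs(latent[0] - goal[0]) + abs(latent[1] - goal[1])

def plan(latent, goal, actions, horizon):
    """Imagine every action sequence of the given length and keep the best one."""
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        z = latent
        for a in seq:                 # imagined rollout, no real-world step taken
            z = predictor(z, a)
        cost = goal_cost(z, goal)
        if cost < best_cost:          # strict '<' keeps the first sequence on ties
            best_seq, best_cost = seq, cost
    return best_seq, best_cost

ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
best_seq, best_cost = plan((0, 0), (2, 1), ACTIONS, horizon=3)
print(best_seq, best_cost)
```

In a receding-horizon loop the robot would execute only `best_seq[0]`, observe, re-encode, and plan again. Note that the inner loop is all predictor calls, which is why the cost of one imagined step dominates the deployment-time budget.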
None of those bottlenecks map onto quantum primitives. I will say what I mean by that.
Why quantum does not help JEPA training
Three concrete problems, not vibes.
First, the matmul accounting does not close. A V-JEPA 2 forward pass is dominated by ViT attention and MLP blocks. To run that on quantum hardware you would need to embed high-dimensional dense activations into quantum states, perform unitary operations that approximate matmul, and read out values. The data loading step, where you encode a classical vector into amplitudes of a quantum state, costs you most of the advantage you were hoping for even in the textbook regime. This is the input problem that has eaten most quantum machine learning proposals for ten years. Ewin Tang and others have shown that for the recommendation and low-rank linear algebra tasks where quantum advantage looked strongest, classical algorithms can match the quantum ones up to polynomial overhead once you account for honest input access. Dequantization is now a craft.
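A rough sketch of the data-loading arithmetic, assuming amplitude encoding of an arbitrary dense vector with no exploitable structure and no QRAM-style oracle. The activation shape below is a hypothetical round number, not V-JEPA 2's actual shape, and the gate count is an order-of-magnitude scaling with constants ignored.

```python
import math

def amplitude_encoding_cost(n_values):
    """Return (qubits, generic_gates) for loading n_values amplitudes.

    Qubits grow as log2(n_values), but preparing an arbitrary state
    generically takes on the order of 2**qubits gates, i.e. about as many
    operations as there are values to load.
    """
    qubits = math.ceil(math.log2(n_values))
    generic_gates = 2 ** qubits   # scaling only, constants ignored
    return qubits, generic_gates

tokens, hidden = 1024, 1024       # hypothetical activation shape
qubits, gates = amplitude_encoding_cost(tokens * hidden)
print(f"{tokens * hidden:,} activations -> {qubits} qubits, ~{gates:,} gates to load")
```

The qubit count looks cheap, which is where the appeal comes from. The gate count is the catch: loading costs about as much as touching every activation classically, and that is before any readout.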
Second, the hardware is not close. Physical qubit counts on Google’s Willow, Quantinuum’s H-series, and IBM’s Heron generation range from a few dozen to under 200. Logical qubits, the ones doing reliable computation after error correction, are in the single digits to low tens depending on whose accounting you trust. The gate depths and coherence times needed to execute even one transformer block are not on any roadmap I find credible for the late 2020s. A 1.2 billion parameter model has, give or take, a billion parameters more than current hardware suggests is feasible. I want to be careful here because surprise is real in this field. But surprise large enough to compress that gap by six orders of magnitude is not a planning assumption.
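The size of that gap can be checked on the back of an envelope. The sketch below deliberately uses the same crude comparison as the paragraph above, parameters against qubits, which is not a like-for-like resource count. The value of 10 logical qubits is an assumed stand-in for "low tens".

```python
import math

PARAMS = 1.2e9  # V-JEPA 2 parameter count, from the text

def orders_of_magnitude_gap(params, qubits):
    """log10 of the ratio between model parameters and available qubits."""
    return math.log10(params / qubits)

gap_physical = orders_of_magnitude_gap(PARAMS, 200)  # upper end of the physical range
gap_logical = orders_of_magnitude_gap(PARAMS, 10)    # assumed stand-in for "low tens"

print(f"vs physical qubits: ~{gap_physical:.1f} orders of magnitude")
print(f"vs logical qubits:  ~{gap_logical:.1f} orders of magnitude")
```

Against physical qubits the gap is between six and seven orders of magnitude. Against logical qubits, the ones that matter for reliable computation, it is about eight.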
Third, variational circuits as standalone learners have underdelivered. Barren plateaus, where the loss landscape becomes exponentially flat as you add qubits, remain a real problem and not a solved one. The papers showing trainability under specific structural constraints usually rely on structure that also makes the circuit classically simulable. The honest 2026 state of pure quantum ML is that the most defensible commercial work is quantum-inspired classical, especially tensor networks, not quantum hardware.
What quantum will plausibly contribute to AI inside this decade is narrower and different. Sampling from distributions with hard-to-simulate correlations. Combinatorial optimization inside training and serving pipelines, which is where my hybrid scheduling work sits. Chemistry and materials discovery that feeds back into better classical substrates. Video world models are not on that list.
Why neuromorphic at the edge is a different story
Now flip the substrate question to inference, and to the edge. The picture changes.
V-JEPA 2 wants to run on robots. Robots have batteries. The Meta paper is explicit about wanting to plan in the physical world, and planning in the physical world means latency budgets and thermal budgets that data center GPUs do not see. The interesting substrate question for a world model is not how to train it cheaper. It is how to run a competent imagination loop inside a 30 watt power envelope on a moving platform.
Here neuromorphic silicon has stopped being a research toy.
Intel’s Hala Point, built from 1,152 Loihi 2 processors, runs 1.15 billion neurons at Sandia, a neuron count Intel compares to an owl brain or the cortex of a capuchin monkey. The published energy efficiency numbers for SNN workloads on Loihi 2 are over 100x against CPU and around 30x against GPU for the event-driven sparse regimes the architecture is built for. A November 2025 paper out of the Loihi 2 community demonstrated a fall detection system using a Sony IMX636 event camera feeding a sparse SNN on a single Loihi 2 chip, hitting 84% F1 at 90 milliwatts total system power. Sub-millisecond inference. Always-on. This is not a benchmark. It is a smart camera you could ship.
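To put the 90 milliwatt figure next to the 30 watt platform envelope mentioned earlier, a quick calculation helps. This compares numbers from two different contexts (a fall detection camera and a generic robot budget), so treat it as a sense of scale rather than a system design.

```python
def watt_hours_per_day(power_watts):
    """Energy drawn over 24 hours of continuous operation, in watt-hours."""
    return power_watts * 24.0

sensor_w = 0.090    # 90 mW total system power, from the fall detection result
envelope_w = 30.0   # 30 W platform envelope, from the text

share = sensor_w / envelope_w
daily_wh = watt_hours_per_day(sensor_w)
print(f"share of envelope: {share:.2%}, energy per day: {daily_wh:.2f} Wh")
```

An always-on perception front end at that power level takes a fraction of a percent of the budget, leaving nearly all of it for actuation and for whatever dense compute the world model still needs.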
BrainChip’s Akida 2 is shipping in volume inside automotive and IoT today, with sub-microwatt standby power and microsecond-class latency on event-driven vision. SynSense’s Speck SoC packs around 328,000 neurons with roughly 3 microsecond per-spike latency at milliwatt power. Innatera, GrAI Matter, and Prophesee, which was paired with Akida at Embedded World 2025, are all real silicon that real OEMs are buying. The 2026 market sits around $125 million in chip revenue, projected at over 50% CAGR through 2034. Small now, real soon.
The technical reason this matters for world models is sparsity and event-driven temporal structure. JEPA-style imagination, especially in the V-JEPA 2-AC action-conditioned setting, is a sequence of latent rollouts evaluated against a goal. Most of the latent state does not change between time steps. A spiking architecture only computes on change. If you can map the predictor head, or even the action-evaluation loop, onto an event-driven substrate, you trade dense matmul that the GPU loves for sparse update that the GPU is bad at. There is early work on this already. The Loihi 2 team published a MatMul-free LLM at 370M parameters running with about 3x lower energy than an edge GPU, and they are open about how much more headroom remains.
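The "only computes on change" idea can be shown on a single linear layer. This is a minimal sketch, not how Loihi 2 or any specific chip implements it: the function names, the tiny integer matrix, and the multiply-accumulate counter are all illustrative assumptions.

```python
def dense_matvec(W, x):
    """Ordinary dense y = W @ x, touching every weight."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def event_driven_update(W, y_prev, x_prev, x_new, threshold=0):
    """Update y = W @ x by touching only the columns whose input changed.

    Returns the new output and the number of multiply-accumulates spent.
    """
    y = list(y_prev)
    macs = 0
    for j, (old, new) in enumerate(zip(x_prev, x_new)):
        delta = new - old
        if abs(delta) > threshold:        # an "event": input j changed
            for i in range(len(W)):
                y[i] += W[i][j] * delta   # propagate only the change
                macs += 1
    return y, macs

W = [[1, 2, 3, 4],
     [0, 1, 0, 1],
     [2, 0, 2, 0]]
x_prev = [1, 1, 1, 1]
x_new = [1, 1, 3, 1]                      # only one latent component changed

y_prev = dense_matvec(W, x_prev)
y_new, macs = event_driven_update(W, y_prev, x_prev, x_new)
dense_macs = len(W) * len(W[0])
print(y_new, f"{macs} MACs instead of {dense_macs}")
```

With one changed input out of four, the update costs 3 multiply-accumulates instead of 12 and gives the same result as recomputing densely. One design caveat: with a nonzero threshold, `x_prev` should hold the last value that actually triggered an event, otherwise small suppressed changes accumulate as drift.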
I am not claiming the encoder of V-JEPA 2 belongs on neuromorphic silicon today. It does not. ViT is matmul-dense and lives happily on a GPU or on an NPU. What I am claiming is that the parts of the world model loop that touch the physical world, the sensor fusion in front and the model-predictive control behind, are exactly where event-driven sparse compute looks like a real architectural answer.
Topology, not feature
The reason I keep coming back to topology over feature is that it forces a different question. The feature question is “does quantum make my training faster.” The answer for video world models is no. The topology question is “what shape of compute matches the shape of the problem.” A world model is a particular topology. Dense self-supervised pretraining over web-scale video. Sparse interactive fine-tuning. Intermittent rollout-based planning at deployment. Each of those segments has a different right answer.
Dense pretraining is GPU-shaped. We have known this for a decade and the curve has not bent.
Interactive fine-tuning on robot trajectories is also GPU-shaped, just on smaller GPUs, because the volume of action-conditioned data is tiny by design.
Deployment-time imagination is the interesting one. It is bursty, latency-bounded, energy-bounded, and acts on sensor streams that are themselves sparse and event-driven. This is where the topology of the problem stops matching dense floating-point matmul and starts matching something closer to what brains actually do. Whether that something is exactly Loihi 2 or Akida or whatever ships in 2028 is not the point. The point is the shape, and the shape is wrong for both the GPU and the qubit.
Honest uncertainty
A few things I am not sure about, stated plainly so the reader does not have to guess.
I do not know that neuromorphic silicon will scale to the parameter counts a useful deployed world model needs. Hala Point at 1.15 billion neurons is impressive, but the synaptic structure those neurons express is far simpler than what a 1.2 billion parameter ViT expresses. The mapping is not one-to-one and probably never will be.
I do not know that the ANN-to-SNN conversion story will hold for transformer architectures the way it has held for convnets. The MatMul-free LLM result is promising but small, and the larger you push it, the more attention-shaped your model becomes, and attention is not naturally spiking.
I am uncertain whether quantum will surprise us inside the decade in a way that affects training. I think it is unlikely. I do not think it is impossible. If a credible fault-tolerant device at scale arrives sooner than expected, the conversation changes, but the data-loading problem still has to be solved separately and that is not a hardware question.
I do not know if Meta or Anthropic or DeepMind will keep treating world models as the path forward. JEPA could lose to a different paradigm in a year. If that happens, the substrate analysis still holds, because the next world model will also be dense in training and event-driven at the edge.
The practitioner’s bet
If I had to allocate research and product dollars right now toward a non-classical substrate that affects world models inside five years, here is where I would put them, ordered by risk-adjusted return.
Edge neuromorphic for sensor fusion in front of the world model. This is shipping today and the energy story is real.
Edge neuromorphic for the action evaluation and MPC loop behind the world model. This is research today but the architectural fit is good and the demo silicon exists.
Quantum-inspired tensor network methods for compressing latent representations inside the model. This is software, not hardware, but it is the place where quantum thinking has actually shipped useful classical results.
Quantum hardware for the training loop of a video world model. I would not put money here. The arithmetic does not close.
The press and LinkedIn pairing of JEPA with quantum makes sense as a vibe. Two non-classical-flavored ideas, both ambitious, both promising prediction over the world. As a technical proposition it is the wrong coupling. The right coupling is JEPA with neuromorphic, and the work that matters is not at the qubit, it is at the spike.
