Do Neural Networks Dream of Strictly Convex Sheep?

A parting thought before leaving Herndon, written on a flight to Dallas after a week at Amazon Machine Learning University. About what we wanted from optimization landscapes and what we got instead.

Courtesy: Amazon MLU

There is a moment in training a deep network on a g2.2xlarge when the loss does something I can only call insolent. It drops for a few epochs, plateaus for what feels like forever, jitters sideways, then drops again into a region where the gradient is essentially noise around a slow downward drift. My textbooks call this descending. My eyes call it wandering. The optimizer is moving through a country we do not have good language for, because the language we have was built for a country that does not exist. That country is convex.

I am writing this somewhere over Tennessee after a week in a windowless room learning what Amazon thinks its engineers should know about machine learning. The instructors were good. The math was rigorous. But there was a recurring tension between what the foundational chapters teach and what the deep learning chapters show. The foundational chapters teach convexity. The deep learning chapters teach what to do when convexity is gone and the textbook will not help you.

When I learned optimization in the late 1990s the moral lesson was clear. Convex problems are tractable. Non-convex problems are the wilderness. The whole apparatus of support vector machines, kernel methods, logistic regression, and the entire body of work that won machine learning its early respectability was built on a dream. If you could only formulate your problem as a convex one, the rest was bookkeeping. The dream had a sheep in it. A strictly convex sheep. One global minimum, smooth curvature, guarantees you could put into a paper without hedging.

That sheep wandered off somewhere around 2012, when a convolutional network won ImageNet by a margin that made it impossible to keep pretending convex models were the future, and we have been arguing about what we did ever since.

The pun in the title is not idle

Philip K. Dick asked whether androids dream of electric sheep because he wanted to know what the substitute consciousness of an artificial being would reach for in its sleep. He was asking about the texture of an inner life. When DeepDream amplifies feature maps until a cloud turns into a thousand dog faces, what is the network reaching toward? The strictly convex sheep is the ideal optimization landscape, the world the textbooks promised, where every problem has a unique answer and the answer is reachable from anywhere. Do our networks long for that? Do their loss landscapes secretly want to be convex?

I think the honest answer is the opposite. The loss landscapes do not want to be convex. We do. And the gap between what we want and what they are is doing real work for us, even if we cannot quite say what.

Topology, not features

Here is the framing I keep returning to. The geometry of the loss landscape is not a feature of the network. It is the topology the network lives inside. Filters come and go. Architectures come and go. AlexNet became VGG became Inception became ResNet, and what persists across these is something about the shape of the country the optimizer walks through.

Dauphin and colleagues argued in their 2014 paper on saddle points that the main obstacles in high-dimensional non-convex optimization are not local minima but saddle points. The intuition is simple once you see it. At a critical point in d dimensions the Hessian has d eigenvalues, and the point is a local minimum only if every one of them is positive. Treat the sign of each eigenvalue as a coin flip and a minimum needs d heads in a row, which under that crude heuristic happens with probability 2^-d. Saddle points are vastly more common. Local minima of a deep network's loss are rare, and when you find them, they are usually decent.
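The coin-flip intuition is easy to check numerically. What follows is a minimal sketch of the heuristic only, not of anything in the Dauphin paper: it assumes each Hessian eigenvalue sign is an independent fair coin, which real Hessians need not satisfy, and the function names are my own.

```python
import random

def exact_min_probability(d):
    """Chance that d independent fair coin flips all come up heads."""
    return 0.5 ** d

def simulated_min_fraction(d, trials=20000, seed=0):
    """Fraction of simulated critical points whose d eigenvalue signs are all positive.

    Assumption for illustration: signs are independent and equally likely.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A "local minimum" here means every one of the d flips is heads.
        if all(rng.random() < 0.5 for _ in range(d)):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    for d in (1, 2, 5, 10, 20):
        print(d, exact_min_probability(d), simulated_min_fraction(d))
```

Even at d = 20 the exact probability is already below one in a million, and a deep network has millions of dimensions, not twenty.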

Choromanska, Henaff, Mathieu, Ben Arous, and LeCun extended this argument in 2015 using tools from spin glass theory. Their picture suggests that for large enough networks the local minima cluster at roughly the same loss value. There is not one valley. There is a band of low-loss configurations, and which one your optimizer settles into matters less than the fact that it settled into the band at all.

This is not what the convex dream looked like. The convex dream had one valley. The picture emerging from these papers has many valleys, all of comparable depth, scattered through a vast space, with saddles as the connective tissue between them.

Honest uncertainty

I want to be careful here. I have not been doing this long enough to claim I understand what is happening. The spin glass analogy is suggestive. The saddle point story is well-argued. But neither tells me what the loss landscape of the specific network I am training actually looks like. The visualizations I have seen are projections onto two or three dimensions, which is like trying to understand the shape of a continent from the shadow it casts at noon.

There are also things that work that should not work, and things that should work that do not. ResNets train deeper than I would have predicted six months ago, and the explanation that skip connections smooth the loss surface is plausible but not yet proven the way I want it proven. Batch normalization helps in ways the original paper explained in terms of internal covariate shift, and I find myself unconvinced we have the right story about why it helps.

So when I say the loss landscapes do not want to be convex, I am stating a working belief, not a theorem. The networks I train on g2 instances do not have a convex loss. They have something else. And the something else is doing real work for us.

A QBist parenthesis

There is a parallel here to something I keep arguing about in quantum foundations. A common textbook reading of quantum mechanics treats probabilities as objective frequencies in some ensemble, and the wavefunction as a thing in the world. QBism reads them as the bets of an agent who has to act under uncertainty. The wavefunction is not the territory. It is the agent's map, updated by Bayesian conditioning on measurement outcomes.

The convex dream is the dream of a territory whose map is the territory. One valley, one answer, one path down. The non-convex reality is closer to QBism's picture. The optimizer is an agent navigating with local information, building a representation that is good enough for its next bet, never having access to the full landscape, never able to verify it found the best place. The loss it reports is its own degree of belief about how well it is doing, not a measurement against some ground truth the universe is hiding from it.

I am not claiming the analogy is exact. I am saying that when you spend enough time in both rooms, you start to suspect that the mathematics we needed for fundamental physics and the mathematics we are stumbling into for deep learning are not as far apart as the textbook divisions suggest. Both involve agents with bounded access to a high-dimensional state space, making probabilistic commitments, updating on partial evidence, never closing the gap to certainty.

What this changes in practice

If you asked me what I do differently as a result of taking this view seriously, the answer is concrete.

I stop treating the loss curve as a measurement and start treating it as a diary. The optimizer is writing down what it believes about the country it is in. Plateaus are not stalls. They are passages along ridges where the network is renegotiating which features matter. Hochreiter and Schmidhuber argued in 1997 that flat minima generalize better than sharp ones, and the working folklore agrees. The shape of the basin is the model's prior over inputs it has not seen yet.
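One way to make "flat versus sharp" concrete is to ask how much the loss rises when the weights are nudged at random. The sketch below is a toy under stated assumptions: two one-dimensional quadratic basins that I made up, with curvatures 1 and 100, and a perturbation-based proxy for sharpness that is not the definition Hochreiter and Schmidhuber use.

```python
import random

def expected_loss_increase(loss, w_star, sigma, samples=20000, seed=0):
    """Average rise in loss when the minimizer w_star is jittered by Gaussian noise."""
    rng = random.Random(seed)
    base = loss(w_star)
    total = 0.0
    for _ in range(samples):
        total += loss(w_star + rng.gauss(0.0, sigma)) - base
    return total / samples

def flat_basin(w):
    # Hypothetical basin with curvature 1.
    return 0.5 * 1.0 * w ** 2

def sharp_basin(w):
    # Hypothetical basin with curvature 100.
    return 0.5 * 100.0 * w ** 2

if __name__ == "__main__":
    sigma = 0.1
    print("flat :", expected_loss_increase(flat_basin, 0.0, sigma))
    print("sharp:", expected_loss_increase(sharp_basin, 0.0, sigma))
```

For a quadratic basin the expected rise is half the curvature times sigma squared, so the sharp basin pays about a hundred times more for the same jitter. Both minima sit at zero loss, and only the shape around them differs.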

I stop trusting any one initialization. If I run the same network from ten different seeds and they land in valleys of comparable depth, the comparable depth is the signal. The specific solution is one draw from a distribution of equivalent solutions. The distribution is what I should be reasoning about.
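The ten-seeds habit can be sketched on a toy problem. This is not a neural network: the loss below is a hypothetical one I chose because it has 2^d minima of exactly equal depth, one for each pattern of signs, so it only illustrates the bookkeeping of comparing seeds, not the behavior of a real training run.

```python
import random

def loss(w):
    """Toy non-convex loss: each coordinate has minima at +1 and -1, both at depth 0."""
    return sum((wi * wi - 1.0) ** 2 for wi in w)

def grad(w):
    """Gradient of the toy loss, coordinate by coordinate."""
    return [4.0 * wi * (wi * wi - 1.0) for wi in w]

def train(seed, dim=8, lr=0.05, steps=500):
    """Plain gradient descent from a seed-dependent random start."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.5, 1.5) for _ in range(dim)]
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def summarize(seeds=range(10)):
    """Return (seed, final loss, sign pattern of the solution) for each run."""
    runs = []
    for s in seeds:
        w = train(s)
        pattern = "".join("+" if wi > 0 else "-" for wi in w)
        runs.append((s, loss(w), pattern))
    return runs

if __name__ == "__main__":
    for s, final_loss, pattern in summarize():
        print(s, f"{final_loss:.2e}", pattern)
```

The final losses agree to many decimal places while the sign patterns differ from seed to seed. The depth is shared, and the particular solution is a draw.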

I stop asking models to be convex and start asking what kind of non-convex they are. This is the topology question. A network whose loss landscape is a band of comparable minima is a different kind of object from one with a single deep well and many shallow ones. The deployment characteristics differ. So does the retraining behavior, and how the network responds to inputs from outside the distribution it was trained on.

Back to the sheep

The strictly convex sheep is a creature from a world the textbooks promised us and the universe did not deliver. The networks we actually train do not dream of it. They dream, if they dream of anything, of the band of valleys they walk along during training, of the ridges and saddles and flat regions they pass through, of the high-dimensional country we have mapped only enough of to know we are not its first visitors and will not be its last.

When I sit with this long enough, I find I am not disappointed by the loss of the convex sheep. The non-convex country is more interesting. Whatever representation or proto-intelligence we eventually decide these systems possess, it lives in the topology of that country. The filters are local. The topology is what persists.

The plane is descending. Dallas in twenty minutes. Time to close the notebook.

Written on a flight from Washington Dulles to Dallas, the week of Amazon Machine Learning University. All errors of judgement and arithmetic are mine.
