Why use Deep Learning (DL) for Artificial Intelligence
While we are concentrating on Ethics in AI, this short piece focuses on why Deep Learning is the preferred training method for AI systems.
Yoshua Bengio, Yann LeCun, and Geoffrey Hinton are recipients of the 2018 ACM A.M. Turing Award for breakthroughs that have made deep neural networks a critical component of computing. This piece is based on their Turing Lecture, published in Communications of the ACM - https://cacm.acm.org/magazines/2021/7/253464-deep-learning-for-ai/fulltext (requires ACM membership)
The basic principle of deep learning is that, to learn complex internal representations for difficult tasks, parallel networks of relatively simple neurons must learn by adjusting the strengths of their connections. Deep learning uses many layers of activity vectors as representations, and learns the connection strengths that produce these vectors by following the stochastic gradient of an objective function that measures how well the network is performing. This conceptually simple approach has proved very effective when applied to large training sets using enormous amounts of computation. In the Turing Lecture (based on their paper), the authors briefly describe the origins, recent advances, and future challenges of deep learning, such as learning without external supervision and using deep learning for system 2 tasks.
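There are two contrasting paradigms for AI. In the logic-inspired paradigm, sequential reasoning is the essence of intelligence, and the aim is to achieve it with hand-designed rules of inference that operate on hand-designed symbolic representations. In the brain-inspired paradigm, the essence of intelligence is learning representations from data, using hand-designed or evolved rules for learning.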
The brain-inspired paradigm converts external symbols into neural activity vectors. These vectors model the structure of a set of symbol strings by learning an appropriate activity vector for each symbol, and by learning non-linear transformations that allow the vectors for missing elements of a string to be filled in.
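Neural activity vectors can be used to represent concepts, and weight matrices to capture relationships between concepts. This leads to automatic generalization: concepts represented by similar vectors have similar effects on other activity vectors. It also supports the view that immediate, intuitive analogical reasoning is our primary mode of reasoning.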
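The generalization argument can be illustrated with a minimal sketch. The three concept vectors and the weight matrix W below are made-up values chosen for illustration, not learned representations from the lecture; the point is only that two nearby input vectors produce nearby activity in the next layer.

```python
import numpy as np

# Hypothetical activity vectors: two similar concepts and one dissimilar one.
tuesday = np.array([0.9, 0.1, 0.3])
thursday = np.array([0.8, 0.2, 0.3])
other = np.array([-0.7, 0.9, -0.5])

# Hypothetical weight matrix standing in for a learned relationship.
W = np.array([
    [0.5, -0.3, 0.8],
    [-0.6, 0.4, 0.1],
    [0.2, 0.7, -0.5],
    [0.9, -0.2, 0.3],
])

def effect(v):
    """Activity produced in the next layer by concept vector v."""
    return np.tanh(W @ v)

# Distance between downstream effects: small for similar concepts.
d_similar = np.linalg.norm(effect(tuesday) - effect(thursday))
d_different = np.linalg.norm(effect(tuesday) - effect(other))
print(d_similar, d_different)
```

Whatever the network has learned about one of the two similar concepts therefore transfers to the other without any explicit rule.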
When labeled training data is scarce, it makes sense to create layers of feature detectors from some other source of information and then fine-tune them using the limited supply of labels. In transfer learning, that source is another supervised learning task that has plenty of labels. It is also possible to create layers of feature detectors without any labels at all, by stacking auto-encoders. We first learn a hidden layer of feature detectors whose activities allow us to reconstruct the input, then learn a second layer of feature detectors whose activities allow us to reconstruct the activities of the first layer. Pre-training initializes the weights in a way that makes a deep neural network easy to fine-tune. This makes it possible to train very large models by leveraging large quantities of unlabeled data.
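The layer-by-layer procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tanh hidden units, linear decoder, mean squared error, plain gradient descent, layer sizes, and random stand-in data are all choices made here for brevity.

```python
import numpy as np

def train_autoencoder(data, n_hidden, epochs=300, lr=0.2, seed=0):
    """Train a one-hidden-layer auto-encoder with plain gradient descent.

    Returns encoder weights, encoder bias, and the per-epoch reconstruction loss.
    """
    rng = np.random.default_rng(seed)
    n_in = data.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))
    b_dec = np.zeros(n_in)
    losses = []
    for _ in range(epochs):
        hidden = np.tanh(data @ W_enc + b_enc)   # feature detector activities
        recon = hidden @ W_dec + b_dec           # linear reconstruction
        err = recon - data
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        g_recon = 2.0 * err / err.size
        g_W_dec = hidden.T @ g_recon
        g_b_dec = g_recon.sum(axis=0)
        g_pre = (g_recon @ W_dec.T) * (1.0 - hidden ** 2)  # tanh derivative
        g_W_enc = data.T @ g_pre
        g_b_enc = g_pre.sum(axis=0)
        W_enc -= lr * g_W_enc
        b_enc -= lr * g_b_enc
        W_dec -= lr * g_W_dec
        b_dec -= lr * g_b_dec
    return W_enc, b_enc, losses

rng = np.random.default_rng(1)
inputs = rng.normal(size=(256, 8))   # stand-in for unlabeled data

# Layer 1 learns to reconstruct the input.
W1, b1, losses1 = train_autoencoder(inputs, n_hidden=6)
h1 = np.tanh(inputs @ W1 + b1)

# Layer 2 learns to reconstruct the activities of layer 1.
W2, b2, losses2 = train_autoencoder(h1, n_hidden=4)

print(losses1[0], losses1[-1], losses2[0], losses2[-1])
```

In a full pipeline, W1, b1, W2, and b2 would initialize the first two layers of a deep network, which is then fine-tuned on the labeled data.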
Rectified linear units (ReLUs) made it easy to train deep networks by backprop and stochastic gradient descent, without the need for layerwise pre-training.
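For reference, a rectified linear unit outputs max(0, x). The sketch below also compares its derivative with that of a logistic sigmoid. The comparison is an illustrative addition, not an argument taken from the lecture: it shows one commonly cited reason why gradients pass through ReLU layers without shrinking, namely that the ReLU derivative is exactly 1 for positive inputs while the sigmoid derivative never exceeds 0.25.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid, which peaks at 0.25 when x = 0."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x))
print(relu_grad(x))
print(sigmoid_grad(x).max())
```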
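In 2009, two graduate students used GPUs to show that pre-trained deep neural nets could slightly outperform the state of the art on the TIMIT dataset. In 2010, essentially the same deep network was shown to beat the state of the art for large-vocabulary speech recognition, and by 2012 Google had engineered a production version that significantly improved voice search on Android.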
Deep learning scored a dramatic victory in the 2012 ImageNet competition, almost halving the error rate. The key was the large number of labeled images and the efficient use of multiple GPUs. Deep learning re-energized neural network research in the early 2000s by making it easy to train deeper networks, and its rise was enabled by GPUs, large datasets, and open source software. Deeper networks generalize better for the types of input-output relationships we are interested in modeling. For example, the most popular convolutional network architecture for computer vision is ResNet-50, which has 50 layers.
The general belief is that deep neural networks excel at perception because they use compositionality to combine features in one layer into more abstract ones in the next layer.
In many applications, a self-attention mechanism uses scalar products to compute matches between query and key vectors. The matches are normalized to sum to 1 and then used to compute a convex combination of the value vectors produced in the previous layer. Transformers, which are built on this mechanism, have revolutionized natural language processing and are now routinely used in industry. Transformers are used to predict missing words in a segment of text, and they have even been used to solve integral and differential equations symbolically.
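The query, key, and value computation described above can be sketched in a few lines. This is a minimal single-head version under stated assumptions: the projection matrices W_q, W_k, and W_v are random stand-ins for learned parameters, and dividing the scalar products by the square root of the key dimension is a convention borrowed from common Transformer implementations rather than something stated in this piece.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over the rows of X (one row per position)."""
    Q = X @ W_q                      # query vectors
    K = X @ W_k                      # key vectors
    V = X @ W_v                      # value vectors
    scores = Q @ K.T / np.sqrt(K.shape[1])        # scaled scalar products
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # each row sums to 1
    return weights @ V, weights      # convex combination of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # 5 positions, 8 features each
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.sum(axis=1))
```

Because every row of weights is non-negative and sums to 1, each output row is a convex combination of the value vectors, which is what lets a layer choose dynamically which inputs to draw on.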
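The deep convolutional neural net that won the 2012 ImageNet competition contained a few novelties, such as the use of ReLUs to speed up learning and dropout to prevent over-fitting, but it was basically a feed-forward convolutional neural net of the kind that had been developed for many years.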
Soft attention is a significant development in neural nets that changes them from purely vector transformation machines into architectures that can operate on different data structures.
Soft attention can be used in a layer to dynamically select which vectors from the previous layer to use when computing outputs.
The problem of determining whether a continuation is compatible with a video can be approached with latent variable models that assign an energy (a measure of badness) to a video paired with a proposed continuation. Writing X for the video and Y for the continuation, a deep neural net is trained to give low energy to values of Y that are compatible with X, and high energy to values of Y that are not.
In contrastive learning, the key difficulty is how to pick good negative samples. In a real-valued high-dimensional space there are many ways a sample can differ from the data, and the most useful negative samples are those that should have high energy but currently have low energy. Generative Adversarial Networks train a generative neural net to produce contrastive samples by applying a neural network to latent samples drawn from a known distribution, such as a Gaussian.
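One way to make the choice of negative samples concrete is a margin-based loss. The sketch below is an assumption-laden illustration, not the method from the lecture: the energy function (squared distance between a linear prediction W @ x and y), the hinge loss, the margin, and all numeric values are hypothetical. It picks the negative with the lowest current energy, since that is the one the model most needs to push up.

```python
import numpy as np

def energy(x, y, W):
    """Hypothetical energy: squared distance between the prediction W @ x and y."""
    return float(np.sum((W @ x - y) ** 2))

def hinge_contrastive_loss(x, y_pos, negatives, W, margin=1.0):
    """Hinge loss against the hardest negative (the one with the lowest energy)."""
    e_pos = energy(x, y_pos, W)
    e_negs = [energy(x, y_neg, W) for y_neg in negatives]
    hardest = int(np.argmin(e_negs))
    loss = max(0.0, margin + e_pos - e_negs[hardest])
    return loss, hardest

W = np.eye(2)
x = np.array([1.0, 0.0])
y_pos = np.array([1.0, 0.0])                 # compatible continuation
negatives = [np.array([1.2, 0.1]),           # wrong, but currently low energy
             np.array([-3.0, 2.0])]          # wrong, and already high energy

loss, hardest = hinge_contrastive_loss(x, y_pos, negatives, W)
easy_loss, _ = hinge_contrastive_loss(x, y_pos, negatives[1:], W)
print(loss, hardest, easy_loss)
```

The negative that already has high energy contributes no loss, so all of the training signal comes from the near miss.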
Humans and animals seem to learn vast amounts of background knowledge by observation alone.
In supervised learning, a label for one of N categories conveys, on average, at most log2(N) bits of information about the world, and in model-free reinforcement learning a reward similarly conveys only a few bits. Self-supervised learning instead trains a network to predict masked or corrupted portions of the data. Supervised learning requires too much labeled data and model-free reinforcement learning requires far too many trials. Deep learning is also more successful at perception tasks than at system 2 tasks that require deliberate sequences of steps.
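To put a number on how little a label conveys: a label that selects one of N categories carries at most log2(N) bits. The short calculation below uses N = 1000, the number of classes in the ImageNet classification benchmark, as an illustrative choice.

```python
import math

n_categories = 1000  # illustrative: class count of the ImageNet classification benchmark
bits_per_label = math.log2(n_categories)
print(round(bits_per_label, 2))  # 9.97
```

Even a thousand-way label therefore carries fewer than 10 bits of information.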
Machine learning systems tend to perform worse in the field than in the lab, since the distribution of test cases is not the same as the distribution of training examples.
When learning a new task, supervised learning and reinforcement learning systems require many more examples than humans do. Humans can also correctly interpret novel combinations of existing concepts, a form of generalization that current systems struggle with. Using contrastive learning, a neural network can be trained to produce similar output vectors for crops from the same image, but dissimilar output vectors for crops from different images.
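The crop-based objective described above can be sketched with an InfoNCE-style loss, one common choice for this kind of contrastive learning. The loss name, the temperature value, and the random vectors standing in for network outputs are assumptions made for illustration; the piece does not specify a particular loss.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss where z_a[i] and z_b[i] come from two crops of image i."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching crops sit on the diagonal; maximize their log-probability.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 16))                      # stand-in output vectors

# Crops of the same image map to nearly identical vectors: low loss.
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))
# Each "positive" is actually a different image: high loss.
mismatched = info_nce(z, z[::-1])
print(aligned, mismatched)
```

Minimizing this loss pulls the output vectors of crops from the same image together and pushes those from different images apart.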
Recent papers have produced promising results in visual feature learning using convolutional neural networks. The hidden activity vector of one of the higher-level layers of the network is used as input to a linear classifier trained in a supervised manner.
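A Variational Auto-Encoder limits the information capacity of its latent code by adding Gaussian noise to the encoder output before it is passed to the decoder. This is akin to packing small noisy spheres into a larger sphere of minimal radius: the capacity of the latent code is limited by how many noisy spheres fit inside the larger sphere. The noisy spheres repel each other, because good reconstruction requires little overlap between the codes of different samples, and the system as a whole minimizes a free energy.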
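A minimal sketch of the noise-injection step follows. It assumes the standard VAE formulation, in which the encoder outputs a mean and a log-variance per latent dimension. The KL term shown is the usual penalty that keeps the noisy codes packed close to the origin; that framing, the function names, and the example values are illustrative assumptions and not details given in this piece.

```python
import numpy as np

def noisy_code(mu, log_var, rng):
    """Add Gaussian noise to the encoder output (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I)."""
    return 0.5 * float(np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var))

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])        # hypothetical encoder mean
log_var = np.zeros(2)             # unit-variance noise in each dimension

z = noisy_code(mu, log_var, rng)  # what the decoder would receive
penalty = kl_to_standard_normal(mu, log_var)
print(z, penalty)
```

In this formulation the training objective adds the penalty to the reconstruction error, so codes far from the origin or with very little noise cost more.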
The performance of deep learning systems often improves markedly as they are scaled up. The language model GPT-3, with 175 billion parameters, generates noticeably better text than GPT-2, with 1.5 billion parameters.
Humans can recombine pieces of existing knowledge in new ways to handle novel situations, such as driving in a city with unusual traffic rules, or even driving on the moon, and can then improve further with practice. This ability is a hallmark of higher-level cognition.
System 1 processing abilities may allow us to guide search and planning at the higher (system 2) level.
Machine learning relies on inductive biases, or priors, to encourage learning in directions that are compatible with assumptions about the world. How can we design deep learning architectures which incorporate such biases?
Young children can discover causal dependencies in the world. Recent work suggests that neural networks can be trained to discover causal dependencies by optimizing for out-of-distribution generalization.
The symbolic AI research program from the 20th century aimed at achieving system 2 abilities, such as reasoning, factorizing knowledge into pieces that can be recombined in a sequence of steps, and manipulating abstract variables. We would like to design neural networks which can do these things while working with real-valued vectors. Recent studies help us to understand how different neural net architectures fare in terms of this kind of systematic generalization.
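Neuroscience suggests that groups of nearby neurons (hyper-columns) are tightly connected and may act as higher-level units that send a vector of coordinated values, rather than a single scalar, to other neurons. This idea is at the heart of capsules architectures, where the vector carries information such as pose, and it is also inherent in the soft attention used by transformer-like architectures. Most neural nets only have two timescales for adaptation: weights that adapt slowly over many examples and activities that change rapidly with each new input. Adding fast weights, which adapt and decay rapidly, introduces a high-capacity short-term memory.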