Newsletter February 4th, 2024 — AI advancements this week

I am traveling from January to March, attending and presenting at Oracle Data and AI forums hosted in various cities. I have already presented in Dallas and Houston and met with numerous Oracle customers in Seattle and Boston. The next speaking engagements are in Orlando, Atlanta, and Philly.

Alongside the Data and AI forums, I have delivered webinars at ODSC (https://app.aiplus.training/bundles/odsc-webinars) and presented a paper at NeurIPS (nips.cc) 2023.

While preparing for upcoming presentations at NVIDIA GTC (https://www.nvidia.com/gtc/), both recorded and in-person, I am keeping an eye on the exciting developments in AI and quantum computing. This week I would like to summarize two interesting developments in Artificial Intelligence (AI). I will cover recent developments in quantum computing next week.

The first one is Taipy, an open-source Python library for quickly turning data and AI algorithms into production-ready web applications. It is designed to simplify end-to-end application development for data scientists and machine learning engineers, enabling full-stack applications without needing to learn front-end languages like HTML, CSS, or JavaScript. Taipy offers a user interface generation tool, pre-built components for data pipeline management, and features for scenario and data management. It also includes version management and pipeline orchestration tools, making it suitable for collaborative projects. By streamlining the move from development to deployment, it aims to save time in MLOps and data operations and let professionals focus on their core competencies in data and AI. For more details, visit [Avaiga/taipy on GitHub](https://github.com/Avaiga/taipy).
I have included a code snippet that trains a linear regression model, connects it to a Taipy GUI, and displays a prediction based on user input. Replace the model and data with your own. For detailed examples and advanced usage, refer to the Taipy documentation.

from taipy.gui import Gui
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample data: x and y are independent random numbers, so the fit is only a placeholder
df = pd.DataFrame({
    'x': np.random.rand(100),
    'y': np.random.rand(100)
})

# Model training
model = LinearRegression().fit(df[['x']], df['y'])

# Variables bound to the page; Taipy looks them up in this module's scope
slider_value = 0.5
predicted = 0.0

# Button callback: run the model on the current slider value
def predict_and_visualize(state):
    features = pd.DataFrame({'x': [state.slider_value]})
    state.predicted = float(model.predict(features)[0])

# Page written in Taipy's Markdown-style visual element syntax
# (assumes a Taipy 3.x release; check the documentation for your version)
page = """
# Linear regression demo

<|{slider_value}|slider|min=0|max=1|step=0.01|>

<|Predict|button|on_action=predict_and_visualize|>

Predicted value: <|{predicted}|text|>
"""

if __name__ == "__main__":
    Gui(page=page).run()

The second development, which I will discuss in detail, is the paper “Formal Algorithms for Transformers” by Mary Phuong and Marcus Hutter of DeepMind. It provides a comprehensive and mathematically precise overview of transformer architectures and algorithms, deliberately leaving out experimental results. It aims to fill a gap in the literature by offering detailed pseudocode for transformer models, which has been lacking despite the popularity and success of transformers in natural language processing and other domains. The paper covers the basics of transformers, including their architectural components, tokenization methods, and the typical tasks they are used for, such as sequence modeling and sequence-to-sequence prediction. It works through specific transformer models such as BERT and GPT, explaining their training and inference processes. Practical considerations for implementing transformers are also discussed, highlighting details often omitted in other publications. This paper is a valuable resource for both theoreticians and practitioners in machine learning. It encourages a more formal and precise approach to describing machine learning models, which can facilitate better understanding, implementation, and innovation in the field. Let’s jump in and read through my notes:

Introduction

The introduction section of the paper discusses the significance of Transformers, which are deep feed-forward artificial neural networks with a (self)attention mechanism. It notes the success of Transformers in natural language processing tasks and other domains and emphasizes the lack of published pseudocode for any variant, despite their popularity. The paper aims to address this gap by providing a self-contained and precise overview of transformer architectures and formal algorithms. The report covers the nature of Transformers, their training, usage, key architectural components, tokenization, practical considerations, and prominent models. It highlights that the pseudocode provided is about 50 lines, intended to be useful for theoreticians, experimental researchers, and authors looking to incorporate formal Transformer algorithms into their work. The intended audience is readers familiar with basic machine learning terminology and simpler neural network architectures. The paper aims to equip readers with a solid understanding of transformers, enabling them to contribute to the literature and implement their own Transformer models using the pseudocode as templates.

Motivation

The motivation section of the research paper discusses the lack of precision and detail in Deep Learning (DL) publications, highlighting the absence of pseudocode, equations, and precise explanations for neural network models. The authors argue that the DL community is hesitant to provide formal algorithms, despite the tremendous success of DL in recent years and the thousands of papers published annually. They emphasize the importance of formal algorithms for theoreticians and the need for precise descriptions of model changes, as well as the importance of proper explanations of how the networks are trained and used. The paper points out the lack of clarity regarding inputs, outputs, and potential side-effects in the experimental section of publications, and the disconnect between the methods and experimental sections. The authors also stress the importance of accompanying wrapper algorithms for core algorithms, such as (pre)training, fine-tuning, prompting, inference, and deployment. This section highlights the need for formal algorithms and clear explanations in DL publications to support both theoreticians and practitioners, as well as to ensure scientific rigor in the field.

Source code vs pseudocode

The research paper highlights the importance of formal algorithms as opposed to open source code. It emphasizes that there is a significant disparity between a (partial) Python dump and well-crafted pseudocode. The paper stresses the need for abstraction and clean-up in pseudocode, including the removal of boilerplate code, use of single-letter variable names, replacement of code with mathematical expressions wherever possible, and the elimination of some optimizations. It suggests that a well-crafted pseudocode, often less than a page in length, can be essentially complete compared to thousands of lines of real source code. The paper notes that despite its significance, the process of creating well-crafted pseudocode is often neglected due to the perceived difficulty. This paper highlights that while the process of first designing algorithms and writing up pseudocode on paper before implementation is beneficial, few DL practitioners seem to follow this approach. The authors underscore the importance of formal algorithms and well-crafted pseudocode, emphasizing the need for more recognition and implementation of these practices in the field.

Examples of good neural network pseudocode and mathematics and explanations

The paper discusses the absence of pseudocode in deep learning (DL) research and the potential benefits of providing pseudocode for neural network architectures. The authors note that while there are many papers describing Multi-Layer Perceptrons (MLPs) and other models, they often lack pseudocode. This absence raises the question of whether pseudocode is actually necessary and useful for DL research. The authors argue that providing pseudocode can serve as templates for future variations of neural network architectures and can set a new standard in DL publishing. They encourage readers to adapt the pseudocode to their specific needs and cite the original source. The paper emphasizes the importance of providing pseudocode for transformer architectures, as well as training and inference processes. The authors also mention existing resources that explain attention mechanisms and transformers with mathematical precision but without pseudocode. This paper highlights the potential utility of pseudocode in deep learning research and encourages its inclusion as a standard practice in DL publishing.

Transformers and Typical Tasks

Transformers are highly effective neural network models for natural language processing and, more generally, for modeling sequential data. They are commonly used for two main types of tasks: sequence modeling and sequence-to-sequence prediction. The predominant paradigm in machine learning is still learning from independent and identically distributed (i.i.d.) data, and for practical reasons this tradition is upheld even in sequence modeling. The training data may naturally be a collection of independent articles, but some of them may exceed the maximum context length that transformers can handle. In such cases, an article is crudely split into shorter chunks, each no longer than the maximum context length, so that even lengthy articles can be processed within the limits of the model.
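
The chunking step described above can be sketched in a few lines of Python. This is only an illustration: the function name `chunk_tokens` and the context length of 4 are my own assumptions, not something taken from the paper.

```python
def chunk_tokens(tokens, max_len):
    """Split a token list into consecutive chunks of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

article = list(range(10))          # stand-in for a tokenized article
chunks = chunk_tokens(article, 4)  # hypothetical maximum context length of 4
print(chunks)
```

Note that this crude split ignores sentence and paragraph boundaries, which is exactly the trade-off the paper points out.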

Notation

The notation section of the paper introduces a vocabulary 𝑉 and a dataset of sequences 𝒙ₙ ∈ 𝑉*, sampled i.i.d. from a distribution 𝑃 over 𝑉*. For sequence modeling (DTransformer), the goal is to learn an estimate P̂ of the distribution 𝑃(𝒙) with a neural network parameterized by 𝜽. In practice, this means learning a distribution over a single token 𝑥[𝑡] given its preceding tokens 𝑥[1 : 𝑡−1] as context. Example applications include language modeling, RL policy distillation, and music generation.

The section also covers sequence-to-sequence prediction (EDTransformer), using a vocabulary 𝑉 and an i.i.d. dataset of sequence pairs (𝒛ₙ, 𝒙ₙ) ∼ 𝑃, where 𝑃 is a distribution over 𝑉* × 𝑉*. The goal is to learn an estimate of the conditional distribution 𝑃(𝒙|𝒛). Example applications include translation (𝒛 = a sentence in English, 𝒙 = the same sentence in German), question answering (𝒛 = question, 𝒙 = the corresponding answer), and text-to-speech (𝒛 = a piece of text, 𝒙 = a voice recording of someone reading the text).

Together, these definitions fix the datasets and the distributions to be estimated, which the rest of the paper builds on.
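
To make the "one token given its preceding tokens" idea concrete, here is a minimal sketch of how a sequence probability decomposes into per-token conditionals. The function `next_token_probs` is a hypothetical stand-in for a trained model (it simply returns a uniform distribution), and the vocabulary size is an assumption for illustration.

```python
import math

VOCAB_SIZE = 4  # hypothetical toy vocabulary {0, 1, 2, 3}

def next_token_probs(context):
    """Hypothetical stand-in for a trained model: uniform over the vocabulary."""
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

def sequence_log_prob(x):
    """Chain rule: log P(x) is the sum over t of log P(x[t] | tokens before t)."""
    total = 0.0
    for t, token in enumerate(x):
        probs = next_token_probs(x[:t])   # condition on the preceding tokens
        total += math.log(probs[token])
    return total

log_p = sequence_log_prob([2, 0, 3])
```

With a real model, `next_token_probs` would depend on the context, and training would adjust the parameters to increase this log-probability on the dataset.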

Classification (ETransformer) and tokenization

For classification, the paper uses the ETransformer. Given a vocabulary 𝑉 and a set of classes, the goal is to learn an estimate of the conditional distribution 𝑃(𝑐|𝒙) of a class 𝑐 given a sequence 𝒙. Examples include sentiment classification, spam filtering, and toxicity classification. All of these natural language tasks depend on tokenization, which represents text as a sequence of vocabulary elements called tokens. The paper compares three approaches. Character-level tokenization generates very long sequences. Word-level tokenization requires a large vocabulary and cannot handle new words at test time. Subword tokenization, particularly Byte Pair Encoding, is the method most commonly used in practice: a set of commonly occurring word segments, which includes common whole words as well as single characters, is used to express all words.
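
The paper names Byte Pair Encoding but does not give code for it, so the following is only a toy sketch of the merge-learning loop as I understand it. The corpus, the helper names, and the choice of three merges are all assumptions for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges = []
for _ in range(3):                 # learn three merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)
print(merges)
```

Each learned merge becomes a new vocabulary entry, which is how frequent segments such as a common suffix end up as single tokens while rare words are still expressible from smaller pieces.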

Final vocabulary and text representation

This part of the paper describes how the final vocabulary is assembled and how text is represented. Each vocabulary element is assigned a unique index. Special tokens such as mask_token, bos_token, and eos_token are then added to the vocabulary, each serving a different purpose in language modeling. A piece of text is represented as the sequence of token IDs corresponding to its subwords, with a bos_token at the beginning and an eos_token at the end.
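
As a small illustration of this representation, the sketch below builds a toy vocabulary, appends the three special tokens, and encodes a pre-split piece of text. The subword list and the ID assignment are assumptions I made for the example.

```python
# Hypothetical subword vocabulary, indexed in order of appearance
subwords = ["my", "grand", "ma", "makes", "best", "cookie", "s"]
vocab = {tok: i for i, tok in enumerate(subwords)}

# Special tokens are appended after the subwords
for special in ["mask_token", "bos_token", "eos_token"]:
    vocab[special] = len(vocab)

def encode(pieces):
    """Map subword pieces to token IDs, wrapped in bos_token and eos_token."""
    return [vocab["bos_token"]] + [vocab[p] for p in pieces] + [vocab["eos_token"]]

ids = encode(["my", "grand", "ma", "makes", "best", "cookie", "s"])
print(ids)
```

The splitting of text into pieces is assumed to have been done already by a subword tokenizer.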

Architectural Components

This section covers the key neural network building blocks that serve as the foundation for transformers. First, the token embedding represents each vocabulary element as a 𝑑ₑ-dimensional real vector. The positional embedding then represents a token’s position in a sequence as another 𝑑ₑ-dimensional real vector. Positional embeddings are essential because, without them, a transformer has no notion of word order. The paper notes that they can be either learned or hardcoded mappings. A token’s initial embedding in a sequence is formed by combining its token embedding with the positional embedding of its position.
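
Here is a minimal NumPy sketch of token plus positional embeddings, assuming learned positional embeddings that are added to the token embeddings. The matrix names, sizes, random initialization, and 0-based positions are my assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, max_len, d_e = 10, 8, 4       # hypothetical sizes

W_e = rng.normal(size=(d_e, n_vocab))  # token embeddings, one column per token
W_p = rng.normal(size=(d_e, max_len))  # positional embeddings, one column per position

def embed(token_ids):
    """Initial embedding: token embedding plus positional embedding at each position."""
    positions = np.arange(len(token_ids))
    return W_e[:, token_ids] + W_p[:, positions]   # shape (d_e, sequence length)

X = embed([8, 0, 1, 9])
print(X.shape)
```

In a trained model both matrices would be learned parameters rather than random draws.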

The attention mechanism is then detailed. The token currently being predicted is mapped to a query vector, and the tokens in the context are mapped to key and value vectors. Comparing the query with the keys yields a distribution over the context tokens, which is used to combine the value vectors. The paper gives the basic single-query attention algorithm and then the common variants used in transformers: bidirectional (unmasked) self-attention, unidirectional (masked) self-attention, and cross-attention. It also defines the softmax function for matrix arguments and the mask matrix that distinguishes bidirectional from unidirectional attention. Together, these components lay the groundwork for the full transformer architectures presented in the following section.
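
The following NumPy sketch shows masked and unmasked attention under some simplifying assumptions of mine: tokens are stored as rows, bias terms are omitted, and the weight matrices are random. It is meant to illustrate the mechanism, not to reproduce the paper's pseudocode line by line.

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Z, W_q, W_k, W_v, causal=False):
    """X: (len_x, d) primary tokens; Z: (len_z, d) context tokens (rows are tokens)."""
    Q, K, V = X @ W_q, Z @ W_k, Z @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (len_x, len_z) query-key similarities
    if causal:
        # Unidirectional mask: token t may attend only to context positions <= t
        allowed = np.tril(np.ones(scores.shape, dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    return softmax(scores) @ V                # weighted combination of value vectors

rng = np.random.default_rng(0)
d, d_attn = 4, 3
X = rng.normal(size=(5, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_attn)) for _ in range(3))
out_bi = attention(X, X, W_q, W_k, W_v)                # bidirectional self-attention
out_uni = attention(X, X, W_q, W_k, W_v, causal=True)  # unidirectional self-attention
```

Passing a different sequence as `Z` turns the same function into cross-attention, since the context then comes from a second sequence.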

Cross-attention

The research paper discusses the concept of cross-attention, which involves applying attention to each token of a primary token sequence while treating a second token sequence as the context. The paper also introduces multi-head attention, where transformers run multiple attention heads in parallel and combine their outputs. The paper covers layer normalization, which controls the mean and variance of individual neural network activations, and introduces the concept of root mean square layer normalization. The paper discusses unembedding, which involves converting a vector representation of a token and its context into a distribution over the vocabulary elements. The unembedding matrix is independently learned, but it is noted that in some cases, it may be fixed to be the transpose of the embedding matrix. These concepts and algorithms are essential for understanding the mechanisms and operations involved in sequence-to-sequence tasks and are key components in the design and implementation of transformer models for various natural language processing tasks.
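
Layer normalization, RMS normalization, and unembedding are each short enough to sketch directly. The code below is a rough NumPy illustration; the small `eps` constant, the parameter names, and the random unembedding matrix are my assumptions.

```python
import numpy as np

def layer_norm(e, gamma, beta, eps=1e-5):
    """Normalize a vector to zero mean and unit variance, then scale and shift."""
    return gamma * (e - e.mean()) / np.sqrt(e.var() + eps) + beta

def rms_norm(e, gamma, eps=1e-5):
    """RMS layer norm: rescale by the root mean square only, no centering or offset."""
    return gamma * e / np.sqrt(np.mean(e ** 2) + eps)

def unembed(e, W_u):
    """Map a d_e-dimensional vector to a probability distribution over the vocabulary."""
    logits = W_u @ e
    logits = logits - logits.max()   # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
d_e, n_vocab = 4, 10
e = rng.normal(size=d_e)
normed = layer_norm(e, gamma=np.ones(d_e), beta=np.zeros(d_e))
probs = unembed(normed, W_u=rng.normal(size=(n_vocab, d_e)))
```

Setting `W_u` to the transpose of the token embedding matrix would give the tied variant mentioned above.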

Transformer Architectures

This section presents several prominent transformer architectures in historical order: the original sequence-to-sequence / Encoder-Decoder Transformer (EDT), BERT (an encoder-only transformer), and GPT (a decoder-only transformer). The main architectural difference between BERT and GPT lies in attention masking. They also differ in less important ways, such as activation functions and layer-norm positioning, and the paper keeps these differences in the pseudocode to remain true to the original algorithms. The EDT was the first transformer and was initially used for sequence-to-sequence tasks such as machine translation. It first encodes the context sequence using bidirectional multi-head attention and then encodes the primary sequence. Each token in the primary sequence is allowed to use information from the encoded context sequence and from the preceding primary sequence tokens. The section then goes through the detailed algorithms and the key aspects of attention and normalization used in these architectures.

Encoder-only transformer: BERT [DCLT19]

The paper then turns to BERT (Bidirectional Encoder Representations from Transformers), an encoder-only transformer trained on masked language modeling. BERT is designed to learn useful text representations that can be applied to various downstream NLP tasks. During training, each input token is replaced, with probability p_mask, by the dummy token mask_token, and the model is evaluated on the reconstruction probability of these masked tokens. The BERT architecture is similar to the encoder part of the seq2seq transformer, but it uses the GELU nonlinearity instead of ReLU.
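
The masking step can be sketched as follows. The token ID used for mask_token, the value of p_mask, and the helper name are assumptions for illustration; this follows the simple replace-with-mask description above.

```python
import random

MASK_ID = 7  # hypothetical ID of mask_token

def mask_tokens(token_ids, p_mask, rng):
    """Replace each token with MASK_ID independently with probability p_mask."""
    masked, targets = [], {}
    for t, tok in enumerate(token_ids):
        if rng.random() < p_mask:
            masked.append(MASK_ID)
            targets[t] = tok      # the model is trained to reconstruct these tokens
        else:
            masked.append(tok)
    return masked, targets

tokens = [0, 1, 2, 3, 4, 5, 6]
masked, targets = mask_tokens(tokens, p_mask=0.15, rng=random.Random(0))
```

The training loss would then be computed only at the positions recorded in `targets`.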

Decoder-only transformers: GPT-2 [RWC+19]

The research paper discusses three decoder-only transformers: GPT-2, GPT-3, and Gopher. These are large language models developed by OpenAI and DeepMind, trained using autoregressive language modeling to predict the next token in an incomplete sentence or paragraph. The main distinction from BERT is the use of unidirectional attention and a different order of layernorms. GPT-3 is similar to GPT-2 but larger, with the replacement of dense attention by sparse attention, where each token uses only a subset of the full context. Gopher, also based on the GPT-2 architecture, differs by replacing layer norms with RMSnorm and using different positional embeddings. The paper provides Algorithm 10, which contains the pseudocode for GPT-2, offering a detailed insight into its functioning. These decoder-only transformers showcase variations in architecture, attention mechanisms, and positional embeddings, contributing to the ongoing research and development of language models.
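
Next-token prediction turns into text generation by feeding each predicted token back in as context. Below is a minimal sketch using greedy decoding, which is my simplification. The function `next_token_probs` is a hypothetical stand-in for a trained decoder-only model.

```python
def next_token_probs(context, n_vocab=5):
    """Hypothetical stand-in for a trained decoder-only model.
    It puts most of the probability on token (last + 1) mod n_vocab."""
    probs = [0.1 / (n_vocab - 1)] * n_vocab
    probs[(context[-1] + 1) % n_vocab] = 0.9
    return probs

def generate(prompt, eos_id, max_new_tokens):
    """Greedy autoregressive decoding: repeatedly append the most likely next token."""
    x = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(x)
        token = max(range(len(probs)), key=probs.__getitem__)
        x.append(token)
        if token == eos_id:       # stop once the end-of-sequence token is produced
            break
    return x

out = generate(prompt=[0], eos_id=4, max_new_tokens=10)
print(out)
```

Sampling from `probs` instead of taking the maximum would give more varied output at the cost of determinism.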

Multi-domain decoder-only transformer: Gato [RZP]

Gato is a multi-modal, multi-task transformer developed by DeepMind. It is a single neural network capable of playing Atari, navigating 3D environments, controlling a robotic arm, captioning images, holding conversations, and more. Each modality is turned into a sequence prediction problem using its own tokenization and embedding method. Images, for example, are divided into non-overlapping 16 × 16 patches, and each patch is processed by a ResNet block to obtain a vector representation. The Gato architecture is a decoder-only transformer resembling Algorithm 10, but with modality-specific embedding code replacing Line 2.
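
The patching step for images can be sketched with NumPy as below. The image size and the function name are assumptions, and the ResNet block that turns each patch into a vector is not shown.

```python
import numpy as np

def to_patches(image, patch=16):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles, in raster order."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    return tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)

image = np.arange(32 * 48 * 3).reshape(32, 48, 3)   # hypothetical 32 x 48 RGB image
patches = to_patches(image)
print(patches.shape)
```

Each resulting patch plays the role of one token in the sequence fed to the transformer.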
