On ‘Compression Algorithms for Large Language Models’ — shrinking the model size and reducing the cost of the hardware accelerators


Last week, I came across a paper published by esteemed researchers at Seoul National University, South Korea. It is a comprehensive, 35-page survey of compression algorithms for language models. Here is a link to the paper. It took me a few days to read and take notes. This post is my humble effort to summarize the paper, focusing on the compression techniques. I have also added a few snippets of Python code, as I am trying out these techniques on a GPU VM with the Data Science image on an Oracle Cloud Infrastructure NVIDIA A100 node.

The paper provides a thorough examination of algorithms that compress language models while preserving as much of their accuracy as possible. It covers pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design. Each category is explored through in-depth analyses of representative algorithms, discussing their advantages, challenges, and applications. The goal is to make large language models more efficient, and therefore more accessible and cost-effective, while maintaining their performance. The paper concludes with a discussion of current trends in compression techniques, the importance of low-cost algorithms for large models, and suggestions for future research.

I’ll summarize the main sections of the paper:


Introduction

The paper begins by highlighting the importance of language models in various applications and the challenges posed by their large size, including high computational and storage requirements.

Compression Techniques Overview

The paper groups the compression techniques into six categories: pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design.


Pruning

This section explains how pruning reduces model size by removing the less important parts of a model, such as individual weights or whole neurons. A simple example is magnitude pruning, which removes the weights with the smallest absolute values, i.e. those closest to zero.

Structured pruning removes entire channels, filters, or layers rather than individual weights. The pruned network keeps a regular, dense structure, which simplifies hardware implementation and can improve computational efficiency. Unstructured pruning, by contrast, may produce sparse matrices that need specialized hardware or software to compute efficiently, whereas structured pruning yields smaller, dense matrices that work well with existing hardware accelerators.
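To make the difference between unstructured and structured pruning concrete, here is a minimal NumPy sketch. The matrix shape, the 50% sparsity target, and the helper names `magnitude_prune` and `prune_rows` are my own illustrative assumptions, not something taken from the paper.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(round(sparsity * weights.size))
    pruned = weights.copy()
    if k > 0:
        # Flat indices of the k smallest |w|
        idx = np.argsort(np.abs(weights), axis=None)[:k]
        pruned.flat[idx] = 0.0
    return pruned

def prune_rows(weights, keep):
    """Structured pruning: keep only the `keep` rows (output neurons) with the largest L2 norm."""
    norms = np.linalg.norm(weights, axis=1)
    kept = np.sort(np.argsort(norms)[-keep:])
    return weights[kept]

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))

W_unstructured = magnitude_prune(W, 0.5)  # same shape, half the entries are zero
W_structured = prune_rows(W, 4)           # smaller dense matrix

print(W_unstructured.shape, int((W_unstructured == 0).sum()))
print(W_structured.shape)
```

The unstructured result keeps its original shape and only becomes sparse, while the structured result is a genuinely smaller dense matrix, which is why it maps more easily onto existing accelerators.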


Quantization

This section covers reducing the precision of a model's weights and activations from floating-point to lower-bit representations, such as 16-bit, 8-bit, or even binary formats. Lower precision shrinks the model and speeds up inference by reducing the memory and compute needed.

There are two main ways to apply quantization. Post-training quantization quantizes a model after it has been fully trained. Quantization-aware training incorporates quantization into the training process to minimize the loss of accuracy. Both make it possible to deploy complex neural networks on devices with limited memory and computational power, such as mobile phones and embedded devices.

The paper also discusses uniform and non-uniform quantization. Uniform quantization applies the same step size to all values, so the quantized levels are equally spaced; it is simpler and more hardware-friendly. Non-uniform quantization uses variable step sizes, allowing finer granularity where needed, for example for values close to zero in a weight distribution. This can preserve more information in those ranges and potentially improve model performance, at the cost of a more complex implementation.
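As a rough illustration of uniform quantization, the sketch below maps float32 weights to 8-bit integers with a single step size and then maps them back. The symmetric per-tensor scheme, the matrix size, and the helper names `quantize_int8` and `dequantize` are assumptions I chose for simplicity; real toolkits offer several variants.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric uniform quantization: one step size (scale) for the whole tensor."""
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)
max_err = float(np.abs(W - W_hat).max())

print(W.nbytes, q.nbytes)   # int8 storage is a quarter of float32 storage
print(max_err, scale / 2)   # rounding error is bounded by half a step
```

Because every value is rounded to the nearest multiple of `scale`, the reconstruction error never exceeds about half a step, which is the trade-off uniform quantization makes in exchange for a 4x smaller tensor.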

Knowledge Distillation

This part describes how a smaller model (student) is trained to mimic the behavior of a larger model (teacher), preserving performance while reducing size. The student model learns from the outputs of the teacher model, aiming to achieve comparable performance with significantly fewer parameters, making it more efficient for deployment in environments with limited computational resources. This process involves guiding the student model not just with the final classification outputs but often with the soft probabilities or intermediate representations learned by the teacher, enriching the student’s learning process.
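The idea of learning from soft probabilities can be sketched with a commonly used form of the distillation loss: a KL-divergence term between temperature-softened teacher and student distributions, blended with the usual cross-entropy on the hard labels. The logits, the temperature of 2.0, and the mixing weight `alpha` below are made-up values for illustration only.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) on soft targets + (1 - alpha) * cross-entropy on labels."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1).mean()
    hard = softmax(student_logits)
    ce = -np.log(hard[np.arange(len(labels)), labels]).mean()
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

teacher_logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student_logits = np.array([[2.0, 1.5, 0.5], [0.5, 1.0, 0.3]])
labels = np.array([0, 1])

loss = distillation_loss(student_logits, teacher_logits, labels)
# With alpha=1.0 only the soft-target term remains; it vanishes when the student matches the teacher
loss_match = distillation_loss(teacher_logits, teacher_logits, labels, alpha=1.0)
print(loss, loss_match)
```

A higher temperature flattens the teacher's distribution, so the student also sees how the teacher ranks the wrong classes, which is the extra signal the hard labels do not carry.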

Low-Rank Approximation

The paper explains how matrix decomposition can simplify a network's weight matrices to reduce complexity and model size. Low-rank approximation replaces a large matrix with a product of smaller matrices, exploiting the idea that the information in a matrix often lies in a lower-dimensional space. For example, Singular Value Decomposition (SVD) factors a matrix into three matrices; keeping only the largest singular values and their corresponding singular vectors yields smaller factors that capture the most significant aspects of the original matrix while discarding less important information. This can significantly reduce the number of parameters in a neural network, leading to more efficient storage and computation without a substantial loss in performance.
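To show where the parameter savings come from, the sketch below factors a weight matrix with a truncated SVD and stores two thin factors instead of the full matrix. The dimensions, the rank of 32, and the synthetic nearly-low-rank matrix are assumptions chosen to make the effect visible; real weight matrices will not compress this cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 512, 256, 32

# Synthetic weight matrix: rank-k signal plus a little noise (an assumption for this demo)
W = rng.normal(size=(m, k)) @ rng.normal(size=(k, n)) + 0.01 * rng.normal(size=(m, n))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]   # (m, k) left factor with the singular values folded in
B = Vt[:k, :]          # (k, n) right factor

original_params = m * n        # parameters in the full matrix
factored_params = k * (m + n)  # parameters in the two factors
rel_error = float(np.linalg.norm(W - A @ B) / np.linalg.norm(W))

print(original_params, factored_params)
print(rel_error)
```

A layer that computed `x @ W` can instead compute `(x @ A) @ B`, which needs `k * (m + n)` parameters rather than `m * n`; the saving only pays off when `k` is much smaller than both dimensions.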

Parameter Sharing

This section discusses methods for reusing weights across different parts of the model to cut down the total number of parameters, which in turn lowers memory usage and computational requirements. The same parameters (weights) serve more than one function in the model. A common example is the convolutional neural network (CNN), where the same filter (set of weights) is applied across different parts of the input image. This exploits the spatial structure of images and allows features to be detected regardless of their position in the input, significantly reducing the model size while maintaining effectiveness.
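The convolution example can be sketched in a few lines: one small kernel is reused at every position of the input, so the layer has only as many parameters as the kernel. The input, the three-tap kernel, and the helper name `conv1d_shared` are illustrative assumptions; the comparison is against a dense layer with the same input and output sizes.

```python
import numpy as np

def conv1d_shared(x, kernel):
    """Slide one kernel over the input (valid padding); the same weights are used at every position."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

x = np.arange(10, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])  # a simple difference filter
y = conv1d_shared(x, kernel)

shared_params = kernel.size    # parameters when weights are shared
dense_params = x.size * y.size # parameters of a dense layer mapping 10 inputs to 8 outputs

print(y)
print(shared_params, dense_params)
```

The same filter responds identically wherever the pattern occurs in the input, which is the position-independence described above.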

Efficient Architecture Design

This section highlights how new model architectures are designed to be inherently less resource-intensive, offering the same or improved performance with fewer parameters or less computation. This involves structural choices such as attention mechanisms that focus computation on the relevant parts of the input, or architectures built for efficiency from the ground up, like MobileNets or EfficientNets. These models are particularly suited to devices with limited computational capabilities, such as smartphones or IoT devices, enabling advanced AI functionality without extensive hardware resources.
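As one concrete example of efficiency by design, MobileNets replace standard convolutions with depthwise separable convolutions. The short sketch below only compares parameter counts for a single layer; the kernel size and channel counts are assumed values, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """A standard k x k convolution has one k x k filter per (input, output) channel pair."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k filter per input channel, followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
ratio = std / sep

print(std, sep)          # 294912 33920
print(round(ratio, 2))   # 8.69
```

For these sizes the separable layer needs roughly one ninth of the parameters, which is the kind of saving that comes from the architecture itself rather than from compressing a trained model afterwards.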

Discussion and Future Directions: It concludes with a discussion on the implications of the findings, challenges in model compression, and potential areas for future research.

I am using the AI ‘all-in-one’ Data Science Image for GPU from Oracle Cloud Marketplace, deployed on an NVIDIA A100 bare metal shape running in Oracle Cloud Infrastructure. Here are the code snippets to get you started. I will put the notebook on my GitHub.

Pruning — torch.nn.utils.prune module in PyTorch:

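import torch
import torch.nn.utils.prune as prune
import torchvision.models as models

# Load a pretrained ResNet-18 (the `weights` argument replaces the deprecated `pretrained=True`)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Layers and parameter names to prune
parameters_to_prune = ((model.conv1, 'weight'), (model.layer1[0].conv1, 'weight'))

# Zero out the 20% of weights with the smallest magnitude, ranked across both layers together
prune.global_unstructured(parameters_to_prune, pruning_method=prune.L1Unstructured, amount=0.2)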

Quantization — Post-training quantization in TensorFlow:

import tensorflow as tf

# Load a pretrained Keras model
model = tf.keras.applications.MobileNetV2(weights='imagenet', input_shape=(224, 224, 3))

# Convert to TensorFlow Lite; Optimize.DEFAULT enables post-training quantization of the weights
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

Knowledge Distillation — Hugging Face transformers:

from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, BertForSequenceClassification
from transformers import Trainer, TrainingArguments

teacher = BertForSequenceClassification.from_pretrained('bert-base-uncased')
student = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

# Define training arguments
training_args = TrainingArguments(
    ...
)

# Initialize Trainer
trainer = Trainer(
    ...
)

# Train and save
...

Low-Rank Approximation — Using SVD in NumPy:

import numpy as np

# Example matrix
A = np.random.rand(10, 10)

# Full SVD: A = U @ diag(s) @ VT, with the singular values in s sorted in descending order
U, s, VT = np.linalg.svd(A)

# Reconstruct with only the top-5 singular values (a rank-5 approximation of A)
k = 5
A_reduced = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]

Parameter Sharing — Shared LSTM layers in Keras:

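from tensorflow.keras.layers import Input, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model

# Two inputs with the same shape: 10 time steps of 64 features each
input_a = Input(shape=(10, 64))
input_b = Input(shape=(10, 64))

# A single LSTM layer whose weights are reused for both inputs
shared_lstm = LSTM(64)
processed_a = shared_lstm(input_a)
processed_b = shared_lstm(input_b)

merged = Concatenate()([processed_a, processed_b])
prediction = Dense(10, activation='softmax')(merged)
model = Model(inputs=[input_a, input_b], outputs=prediction)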




Efficient Architecture Design — Using EfficientNet in TensorFlow:

import tensorflow as tf

# Load EfficientNetB0 with ImageNet weights and compile it for training or evaluation
model = tf.keras.applications.EfficientNetB0(weights='imagenet')
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

These snippets serve as starting points for experimenting with compression techniques in machine learning models. Adjustments may be necessary depending on your specific dataset and requirements.

The following is not in the paper. I am creating a generalized formula that encapsulates the relationship between compression algorithms and cost efficiency in deploying large language models. This involves considering several factors, including model size, computational complexity, and the specific compression techniques applied. A precise mathematical formula might be too complex and context-dependent, but we can conceptualize a high-level relationship as follows:

Copyright: Sanjay Basu


Performance of Compressed Model measures the effectiveness of the compressed model in achieving its intended task (e.g., accuracy, speed).

Computational Resources Required encompasses the computational power and memory needed to train and deploy the model.

Model Size Reduction Factor represents the degree to which the model’s size has been reduced through compression.

The goal of compression algorithms is to maximize cost efficiency by:

1. Minimizing the Computational Resources Required, which makes models cheaper to train and deploy.

2. Maximizing the Model Size Reduction Factor, which allows the model to run on devices with limited memory and processing power without significantly compromising performance.
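One simple way to write a relationship consistent with the three definitions and the two goals above is the ratio below. The symbols P (Performance of Compressed Model), R (Model Size Reduction Factor), and C (Computational Resources Required), as well as the proportional form itself, are assumptions made for illustration; a real deployment would weight each factor differently.

```latex
\text{Cost Efficiency} \;\propto\; \frac{P \times R}{C}
```

Read this way, cost efficiency rises when performance is preserved and the model shrinks, and falls as the compute and memory needed to train and serve the model grow.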

Different compression techniques (pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design) contribute variably to these factors. For example, quantization may greatly reduce model size and computational requirements with a minor impact on performance, while knowledge distillation focuses on preserving or even enhancing performance with substantially smaller model architectures.

The ideal formula for a specific scenario would need to account for the unique characteristics and requirements of the application, including the balance between performance, cost, and operational constraints.
