When a Mathematician Dreamt in Parallel

Copyright: Sanjay Basu


GPU-Accelerated Ramanujan Visualizations on DGX Spark and OCI A100

There’s something almost cosmically appropriate about using modern GPU architecture to visualize the mathematics of Srinivasa Ramanujan. Here was a man who claimed his theorems came to him in dreams, delivered by the goddess Namagiri. A kind of divine parallel processing, if you will. Now we’re using thousands of CUDA cores to render the very structures he intuited a century ago.

I’ve been playing with a visualization suite that brings Ramanujan’s most beautiful discoveries to life. The partition function, modular forms, taxicab numbers, and that insane π series that converges at eight digits per term. The interesting part isn’t just the math. It’s the architecture. We’ve built this to leverage both WebGPU compute shaders in the browser and a CUDA backend for the heavy lifting. The result is something that feels almost alive, with results streaming in real-time as the GPU churns through pentagonal number recurrences.

My idol

Before we dive into the technical bits, let me explain why this project matters beyond mere visualization.

Ramanujan worked without formal proofs. He saw mathematical truths the way you or I see a chair across the room. His notebooks are filled with results that took other mathematicians decades to verify. The partition function alone, counting the ways to write an integer as a sum of positive integers, seems trivial until you realize p(100) equals 190,569,292. The growth is explosive, governed by an asymptotic formula Hardy and Ramanujan derived together:

p(n) ~ (1 / (4n√3)) · exp(π√(2n/3))

What strikes me about this is the presence of π and e (transcendental numbers) lurking inside what appears to be a purely combinatorial problem. There’s something here about the deep structure of mathematics that GPUs let us see rather than merely prove.
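Curious how tight that asymptotic is? Here's a quick Python check against the exact value quoted above:

import math

# Hardy-Ramanujan asymptotic for p(n)
def p_asymptotic(n):
    return math.exp(math.pi * math.sqrt(2 * n / 3)) / (4 * n * math.sqrt(3))

print(f"{p_asymptotic(100):,.0f}")  # ≈ 1.99e8 vs. exact p(100) = 190,569,292 (within about 5%)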

The Architecture

Browser Meets Backend

The system operates on two parallel tracks:

Track 1: WebGPU Compute Shaders

For modular forms visualization, we’re computing |η(τ)|² directly on whatever GPU lives in your machine. Yes, even in the browser. The Dedekind eta function is this beautiful infinite product:

η(τ) = e^(πiτ/12) ∏(1 − e^(2πinτ)),  n = 1, 2, 3, …

Each pixel in our visualization represents a point in the complex upper half-plane, and each pixel requires evaluating that product. Classic embarrassingly parallel problem. The WGSL compute shader dispatches 16×16 workgroups, and on a decent GPU, a 512×512 grid with 100 product terms renders in about 50 milliseconds.
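If WGSL isn't your thing, the per-pixel math looks like this as a vectorized NumPy sketch; the grid bounds and truncation depth are illustrative choices, not the shader's exact parameters:

import numpy as np

# |η(τ)|² on a grid in the upper half-plane, truncating the product at `terms`
def eta_magnitude_sq(resolution=512, terms=100):
    x = np.linspace(-0.5, 0.5, resolution)   # Re(τ)
    y = np.linspace(0.05, 1.0, resolution)   # Im(τ) > 0, kept off the real axis
    tau = x[None, :] + 1j * y[:, None]
    q = np.exp(2j * np.pi * tau)             # nome q = e^(2πiτ), |q| < 1
    eta = np.exp(1j * np.pi * tau / 12)      # prefactor e^(πiτ/12)
    for n in range(1, terms + 1):
        eta = eta * (1 - q**n)               # truncated infinite product
    return np.abs(eta)**2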

Track 2: CUDA Backend via WebSocket

For the partition function and taxicab searches, we need more muscle. And more importantly, we need to stream results as they’re computed. Nobody wants to stare at a blank screen for five seconds while p(100,000) finishes.

The Python backend uses CuPy for GPU-accelerated arrays and Numba for custom CUDA kernels. Results flow back to the browser via WebSocket in batches of 500, updating the Plotly charts in real-time. There’s something deeply satisfying about watching the curve grow as the computation proceeds.
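The streaming loop looks roughly like the sketch below; `compute_partitions` stands in for the real CuPy/Numba generator, and the JSON message shape is illustrative, not the project's exact protocol:

import json
import websockets  # the backend's streaming library

BATCH_SIZE = 500

async def stream_partitions(websocket, max_n):
    """Send partition values to the browser in batches as they are computed."""
    batch = []
    for n, p_n in compute_partitions(max_n):  # hypothetical generator yielding (n, p(n))
        batch.append([n, str(p_n)])           # stringified: p(n) overflows JSON numbers fast
        if len(batch) == BATCH_SIZE:
            await websocket.send(json.dumps({"type": "partitions", "values": batch}))
            batch = []
    if batch:                                 # flush the final partial batch
        await websocket.send(json.dumps({"type": "partitions", "values": batch}))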

Running on DGX Spark

The DGX Spark with its Grace Blackwell architecture is frankly overkill for this, but that’s part of the joy. The launch script auto-detects the GB10 and adjusts thread blocks accordingly. Here’s what you get:

[Figure: launch output on the DGX Spark. Created by Sanjay Basu]

The partition computation uses Euler's pentagonal number theorem as a recurrence relation. It's inherently sequential in one sense: you need the earlier values to compute p(n). But within each step, the summation over generalized pentagonal numbers parallelizes nicely.
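Concretely, the recurrence is p(n) = Σ (−1)^(k+1) [p(n − k(3k−1)/2) + p(n − k(3k+1)/2)], summed over k = 1, 2, … for as long as the pentagonal numbers stay ≤ n. A plain-Python reference version (exact big-integer arithmetic, no GPU; the CUDA kernel parallelizes the inner loop):

def partitions(max_n):
    """Exact p(0..max_n) via Euler's pentagonal number recurrence."""
    p = [1] + [0] * max_n
    for n in range(1, max_n + 1):
        total, k = 0, 1
        while True:
            g1 = k * (3 * k - 1) // 2   # generalized pentagonal numbers
            g2 = k * (3 * k + 1) // 2
            if g1 > n:
                break
            sign = 1 if k % 2 else -1   # both terms for a given k share the sign
            total += sign * p[n - g1]
            if g2 <= n:
                total += sign * p[n - g2]
            k += 1
        p[n] = total
    return p

assert partitions(100)[100] == 190569292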

# On DGX Spark
chmod +x launch_ramanujan.sh
./launch_ramanujan.sh
# Open http://localhost:8000/ramanujan_gpu_enhanced.html

The script handles dependency installation, runs a quick TFLOPS benchmark to verify your GPU is healthy, and launches both the CUDA WebSocket server and the HTTP frontend.
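The benchmark step boils down to timing a large matmul with CUDA events. I won't reproduce the launch script here, but a minimal CuPy sketch of the idea (matrix size is an arbitrary choice) looks like:

import cupy as cp

def tflops_benchmark(n=8192):
    """Time one large FP32 matmul with CUDA events and report TFLOPS."""
    a = cp.random.rand(n, n, dtype=cp.float32)
    b = cp.random.rand(n, n, dtype=cp.float32)
    _ = a @ b                              # warm-up (kernel selection, memory pools)
    start, end = cp.cuda.Event(), cp.cuda.Event()
    start.record()
    c = a @ b
    end.record()
    end.synchronize()
    ms = cp.cuda.get_elapsed_time(start, end)
    return (2 * n**3) / (ms / 1e3) / 1e12  # a matmul costs ~2n^3 FLOPs

print(f"{tflops_benchmark():.1f} TFLOPS")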

Scaling to OCI Bare Metal A100 (BM.GPU.A100-v2.8)

Now here’s where it gets interesting for those of us who want to push beyond single-GPU. Oracle Cloud Infrastructure offers bare metal A100 instances with eight A100 GPUs (80GB HBM2e each). 640GB of aggregate GPU memory connected via NVLink. This is the BM.GPU.A100-v2.8 shape, and it’s perfect for scaling these visualizations into territory that would make Ramanujan himself smile.

Why Multi-GPU for Mathematical Visualization?

Single GPU is fine for p(100,000). But what about p(10,000,000)? The partition numbers grow so fast that they exceed 64-bit integers almost immediately. You need arbitrary precision arithmetic. And the modular form visualization? At 4K resolution with 500 product terms, you’re looking at computation that benefits enormously from spreading across eight A100s.

More practically, the taxicab search to 10¹² requires parallel cube generation across multiple GPUs, with reduction operations to find collisions.
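Stripped of the GPU machinery, the search amounts to bucketing cube sums and keeping any sum that appears twice. A minimal single-threaded sketch; the multi-GPU version shards the range of a across devices and merges buckets in a reduction:

from collections import defaultdict

def taxicab_candidates(limit):
    """Bucket a³ + b³ (a ≤ b) up to `limit`; any bucket with two pairs is a hit."""
    sums = defaultdict(list)
    a = 1
    while 2 * a**3 <= limit:
        b = a
        while a**3 + b**3 <= limit:
            sums[a**3 + b**3].append((a, b))
            b += 1
        a += 1
    return {s: pairs for s, pairs in sums.items() if len(pairs) > 1}

print(taxicab_candidates(2000))   # {1729: [(1, 12), (9, 10)]}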

Setting Up the OCI A100 Instance

Step 1: Provision the Instance

# Using OCI CLI
oci compute instance launch \
  --availability-domain "AD-1" \
  --compartment-id $COMPARTMENT_ID \
  --shape "BM.GPU.A100-v2.8" \
  --image-id $GPU_IMAGE_ID \
  --subnet-id $SUBNET_ID \
  --display-name "ramanujan-a100-cluster"

Use the NVIDIA GPU Cloud (NGC) marketplace image or Oracle Linux 8 with CUDA pre-installed. The A100 bare metal shapes come with CUDA 12.x drivers ready.

Step 2: Verify Multi-GPU Setup

# SSH into your instance
ssh opc@<instance-ip>

# Check all 8 GPUs
nvidia-smi

# Verify NVLink topology
nvidia-smi topo -m

You should see eight A100-SXM4-80GB devices, and the NVLink topology matrix should show NV12 between every GPU pair, since all eight are fully connected through NVSwitch.

Step 3: Install Dependencies

# System packages
sudo dnf install -y python3.11 python3.11-pip git

# CUDA Python stack
pip3.11 install cupy-cuda12x numba websockets numpy mpi4py

# For multi-GPU communication
pip3.11 install "cupy-cuda12x[all]" # Includes NCCL bindings; quoted so the shell doesn't glob the brackets

Step 4: Deploy the Multi-GPU Backend

Here’s an enhanced version of the CUDA backend that distributes work across all eight A100s:

# ramanujan_multi_gpu.py
import cupy as cp
from cupy.cuda import nccl  # NCCL bindings, used for collectives in the full backend
import numpy as np

class MultiGPURamanujan:
    def __init__(self):
        self.n_gpus = cp.cuda.runtime.getDeviceCount()
        print(f"Initialized with {self.n_gpus} GPUs")

    def compute_modular_form_distributed(self, resolution=4096, terms=500):
        """Distribute modular form computation across all GPUs."""
        rows_per_gpu = resolution // self.n_gpus
        results = []

        for gpu_id in range(self.n_gpus):
            with cp.cuda.Device(gpu_id):
                start_row = gpu_id * rows_per_gpu
                end_row = start_row + rows_per_gpu

                # Each GPU computes its slice of the image
                # (_compute_eta_slice is the per-GPU eta kernel, defined elsewhere)
                local_result = self._compute_eta_slice(
                    resolution, start_row, end_row, terms
                )
                results.append(local_result)

        # Gather the slices on GPU 0. Staging through host memory with .get()
        # is the simple route; peer-to-peer copies over NVLink would be faster.
        with cp.cuda.Device(0):
            full_result = cp.asarray(np.vstack([r.get() for r in results]))

        return full_result

    def compute_partitions_distributed(self, max_n=10_000_000):
        """
        Distribute partition computation using a parallel prefix approach.
        Each GPU handles a range of n values, with synchronization
        for dependencies.
        """
        # This is more complex due to sequential dependencies: p(n) needs
        # the earlier values. We use a wavefront approach where GPUs work
        # on independent pentagonal term evaluations.
        pass
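Driving it from the server side is then a two-liner (illustrative usage):

viz = MultiGPURamanujan()
image = viz.compute_modular_form_distributed(resolution=4096, terms=500)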

Step 5: Launch with Multi-GPU Support

# Set CUDA visible devices (usually all 8)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Launch with increased WebSocket buffer for 8x throughput
python3.11 ramanujan_cuda_backend.py --multi-gpu --batch-size 4000

# In another terminal, start the frontend
python3.11 -m http.server 8000

Network Configuration for Remote Access

OCI bare metal instances need security list rules for WebSocket and HTTP access:

# Add ingress rules via OCI CLI
oci network security-list update \
  --security-list-id $SECURITY_LIST_ID \
  --ingress-security-rules '[
    {"protocol":"6","source":"0.0.0.0/0","tcpOptions":{"destinationPortRange":{"min":8000,"max":8000}}},
    {"protocol":"6","source":"0.0.0.0/0","tcpOptions":{"destinationPortRange":{"min":8765,"max":8765}}}
  ]'

Then access via http://<public-ip>:8000/ramanujan_gpu_enhanced.html

Performance on 8x A100

[Figure: performance numbers on 8x A100. Created by Sanjay Basu]

The scaling isn’t perfectly linear due to synchronization overhead, but for embarrassingly parallel workloads like the modular form visualization, you’re looking at near-8x speedup.

Cost Optimization

BM.GPU.A100-v2.8 runs about $30/hour on OCI, so a visualization demo or a few hours of exploratory play costs on the order of $100. Use preemptible capacity if available in your region for a ~60% discount.

# Preemptible instance (if available)
oci compute instance launch \
--shape "BM.GPU.A100-v2.8" \
--preemptible-instance-config '{"preemptionAction":{"type":"TERMINATE"}}' \
...

The Deeper Point

I started this project as a weekend experiment, but it’s grown into something that feels meaningful. Ramanujan worked with pencil and paper, producing results that we now verify with thousands of parallel processors. There’s a kind of technological reverence in that, using our most advanced hardware to explore the structures a self-taught genius intuited in colonial India.

The partition function, the modular forms, the taxicab numbers. These aren’t just mathematical curiosities. They’re windows into the deep structure of number theory, connected to elliptic curves, string theory, and the Langlands program. When you watch |η(τ)|² render in real-time on an A100, you’re seeing symmetries that took mathematicians a century to understand.

Ramanujan said an equation meant nothing to him unless it expressed a thought of God. I don't know about the theology, but I know this: watching GPU cores churn through his mathematics feels like participating in something larger than code.

Quick Reference: File Manifest

[Figure: file manifest. Created by Sanjay Basu]

Links

• OCI GPU Shapes Documentation

• CuPy Multi-GPU Guide

My Visualization

[Figure: the visualization. Copyright: Sanjay Basu]


Dr. Sanjay Basu is Senior Director of GPU & Gen AI Solutions at Oracle Cloud Infrastructure, Chief Technology Advisor for Stem Practice, and co-founder of MisaLabs. He writes hard science fiction when the GPUs are idle.

 
