Run multiple LLMs on your DGX Spark with flashtensors

 Leverage 128GB unified memory for instant model hot-swapping


The Model Loading Problem

Waiting for a large AI model to initialize is a long, frustrating delay: the GPU sits idle while weights crawl through a chain of bottlenecks between disk, CPU, and GPU memory. For anyone running a local AI setup, this startup latency is the difference between a system that feels quick and responsive and one that feels sluggish.

Now imagine running multiple large models on a single GPU, and switching between them in seconds. That’s exactly what flashtensors enables, and on the DGX Spark’s 128GB unified memory architecture, this capability becomes particularly powerful.

Why DGX Spark is Ideal for flashtensors

The DGX Spark’s Grace Blackwell architecture provides unique advantages for flashtensors’ direct memory streaming approach:


The shared memory architecture removes the traditional bottleneck of copying data between the SSD, CPU, and GPU. flashtensors streams weights directly into the shared memory pool, which makes near-instant hot-swapping between models possible.

How flashtensors Works

Traditional loaders like safetensors or PyTorch’s native methods move data in stages: disk to host RAM, then RAM to GPU VRAM, with each hop adding latency and occupying an intermediate buffer. flashtensors takes a different approach, streaming model weights directly from NVMe storage to GPU memory in optimized chunks and skipping the intermediate copies entirely.
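
To make the difference concrete, here is a rough timing sketch of the two paths. It assumes a safetensors checkpoint and a copy of the same weights already saved with flash.save_dict (both paths below are placeholders); treat it as an illustration rather than a benchmark:

import time
from safetensors.torch import load_file
from flashtensors import flash

# Conventional two-hop load: disk -> host RAM -> GPU VRAM.
# "model.safetensors" and "/data/models/my-model" are placeholder paths.
start = time.time()
cpu_state = load_file("model.safetensors")                   # disk -> RAM
gpu_state = {k: v.to("cuda") for k, v in cpu_state.items()}  # RAM -> VRAM
print(f"two-hop load: {time.time() - start:.2f}s")

# flashtensors path: weights stream from NVMe into GPU memory in chunks,
# using the dict-level API shown later in this post.
start = time.time()
flash_state = flash.load_dict("/data/models/my-model", device_map={"": 0})
print(f"flashtensors load: {time.time() - start:.2f}s")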

Cold starts are up to 10x faster than with safetensors, with most models loading in under two seconds. You can host 100+ models on a single GPU and hot-swap between cached models almost instantly.

Installation on DGX Spark

Setting up flashtensors on your DGX Spark requires only a single command:

pip install git+https://github.com/leoheuler/flashtensors.git

For optimal performance, ensure your DGX Spark’s NVMe storage is configured for direct I/O. The library automatically detects CUDA availability and configures itself for the Blackwell GPU architecture.
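
A quick sanity check before loading anything large never hurts. The sketch below uses only standard PyTorch and the Python standard library; /data/models is the example storage path used throughout this post:

import shutil
import torch

# Confirm the GPU is visible to PyTorch.
assert torch.cuda.is_available(), "CUDA device not detected"
print("GPU:", torch.cuda.get_device_name(0))

# Confirm the NVMe-backed model store exists and has headroom.
usage = shutil.disk_usage("/data/models")
print(f"Free space on /data/models: {usage.free / 1024**3:.0f} GB")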


Quick Start: CLI Usage

flashtensors includes a convenient CLI for quick testing. Start the service, pull a model, and generate text:

# Start the flashtensors service
flash start
# Pull and cache a model
flash pull Qwen/Qwen3-14B
# Generate text
flash run Qwen/Qwen3-14B "Explain quantum entanglement"

The first load takes a few seconds as flashtensors converts and caches the model. Subsequent loads are nearly instant.

Python Integration with vLLM

For production applications, flashtensors integrates seamlessly with vLLM. Here’s a complete example optimized for the DGX Spark’s 128GB unified memory:

import flashtensors as ft
from vllm import SamplingParams
import time

# DGX Spark optimized configuration
ft.configure(
    storage_path="/data/models",       # NVMe path
    mem_pool_size=1024**3 * 100,       # 100GB model cache
    chunk_size=1024**2 * 64,           # 64MB chunks
    gpu_memory_utilization=0.85        # Leave headroom
)
ft.activate_vllm_integration()

# Register multiple models for hot-swapping
models = [
    "Qwen/Qwen3-14B",
    "Qwen/Qwen3-32B",
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3"
]
for model_id in models:
    ft.register_model(model_id, backend="vllm", torch_dtype="bfloat16")

# Load and benchmark
start = time.time()
llm = ft.load_model("Qwen/Qwen3-32B", backend="vllm", dtype="bfloat16")
print(f"Loaded 32B model in {time.time() - start:.2f}s")

# Generate
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain the DGX Spark architecture"], params)
print(outputs[0].outputs[0].text)
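
Once the other models are registered, hot-swapping is just another load_model call. Continuing the script above (ft, params, and time are already defined), a swap might look like this; timings depend on whether the target model is already cached:

# Hot-swap to another registered model and time the switch.
start = time.time()
llm = ft.load_model("meta-llama/Llama-3.1-8B-Instruct", backend="vllm", dtype="bfloat16")
print(f"Swapped to Llama-3.1-8B in {time.time() - start:.2f}s")

outputs = llm.generate(["Summarize the benefits of unified memory"], params)
print(outputs[0].outputs[0].text)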

Expected Performance on DGX Spark

Based on the project’s H100 benchmarks and the DGX Spark’s unified memory advantages, expect sub-3-second cold starts for models up to roughly 32B parameters, and near-instant loads once a model is cached.


70B models require 4-bit quantization on the DGX Spark, or BF16 with memory-efficient attention. The unified memory architecture accommodates models that wouldn’t fit in a traditional 80GB VRAM budget.
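
One low-friction way to get a 70B-class model onto the DGX Spark is to point flashtensors at a checkpoint that is already quantized, so the register/load calls shown earlier need no extra arguments. The AWQ repository name below is only an example; substitute whichever 4-bit checkpoint you actually use:

import flashtensors as ft

# Example: a pre-quantized 4-bit (AWQ) checkpoint fits comfortably in
# unified memory. AWQ kernels typically pair with fp16 activations.
ft.register_model("Qwen/Qwen2.5-72B-Instruct-AWQ", backend="vllm", torch_dtype="float16")
llm = ft.load_model("Qwen/Qwen2.5-72B-Instruct-AWQ", backend="vllm", dtype="float16")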

Loading Custom Fine-Tuned Models

If you’ve fine-tuned your own models (perhaps using the DGX Spark’s capabilities for LoRA or full fine-tuning), flashtensors can load them just as easily:

from flashtensors import flash
import torch
# Save your model in flashtensors format
model = YourFineTunedModel()
model.load_state_dict(torch.load('finetuned_checkpoint.pt'))
flash.save_dict(model.state_dict(), "/data/models/my-custom-model")
# Later: instant loading
loaded_weights = flash.load_dict("/data/models/my-custom-model", device_map={"": 0})

Multi-Model Serving Architecture

The real power of flashtensors on DGX Spark emerges when serving multiple models. With 100GB+ available for model caching, you can keep dozens of models warm and swap between them in milliseconds:

import flashtensors as ft
from flask import Flask, request, jsonify

app = Flask(__name__)
model_registry = {}

# Pre-register all models at startup
MODEL_CATALOG = {
    "general": "Qwen/Qwen3-14B",
    "coding": "Qwen/Qwen3-Coder-32B",
    "creative": "meta-llama/Llama-3.1-8B-Instruct",
    "fast": "Qwen/Qwen3-0.6B"
}

@app.route("/generate", methods=["POST"])
def generate():
    model_type = request.json.get("model_type", "general")
    model_id = MODEL_CATALOG[model_type]
    # Hot-swap to requested model (sub-second if cached)
    llm = ft.load_model(model_id, backend="vllm")
    # ... generate and return response
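
For completeness, here is one way the elided handler could be filled in, reusing the SamplingParams pattern from the vLLM example above. The JSON payload shape and the port are assumptions, and this sketch is meant as a drop-in replacement for the stubbed generate() above:

from vllm import SamplingParams

@app.route("/generate", methods=["POST"])
def generate():
    model_type = request.json.get("model_type", "general")
    prompt = request.json.get("prompt", "")
    model_id = MODEL_CATALOG[model_type]

    # Hot-swap to the requested model (sub-second if cached).
    llm = ft.load_model(model_id, backend="vllm")

    params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
    outputs = llm.generate([prompt], params)
    return jsonify({"model": model_id, "text": outputs[0].outputs[0].text})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)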

Practical Use Cases

1. Agentic AI Systems

Multi-agent frameworks like AgentSpec benefit enormously from instant model switching. Route simple queries to fast 0.6B models while sending complex reasoning tasks to 32B+ models, all from the same GPU.
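
A minimal router sketch, assuming a crude length-based heuristic and the ft.load_model call shown earlier; a real agent framework would route on richer signals:

import flashtensors as ft
from vllm import SamplingParams

def route_and_generate(prompt: str) -> str:
    # Crude heuristic: short prompts go to the small model, longer or
    # reasoning-heavy prompts to the larger one. Purely illustrative.
    model_id = "Qwen/Qwen3-0.6B" if len(prompt) < 200 else "Qwen/Qwen3-32B"
    llm = ft.load_model(model_id, backend="vllm", dtype="bfloat16")
    params = SamplingParams(temperature=0.7, max_tokens=256)
    return llm.generate([prompt], params)[0].outputs[0].text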

2. Local AI Development

Prototype with multiple models without the usual loading delays. Test Qwen, Llama, Mistral, and your fine-tuned variants in rapid succession. The DGX Spark becomes a local AI development powerhouse.

3. Personalized Model Serving

Host user-specific fine-tuned models for multiple users on a single GPU. Each user gets their personalized model loaded on-demand in under 3 seconds.

4. Edge Robotics

For robotics applications where latency matters, flashtensors ensures models are ready when needed. Swap between perception, planning, and control models without the usual warm-up delays.


DGX Spark Configuration Tips

1. Maximize memory pool: Set mem_pool_size to 100GB+ to fully leverage unified memory

2. Use NVMe storage: Store models on the fastest available NVMe for optimal streaming

3. Increase chunk size: 64MB chunks work well with the DGX Spark’s bandwidth

4. Pre-warm critical models: Register and load your most-used models at service startup (see the sketch after this list)

5. Monitor unified memory: Use nvidia-smi to track memory usage across CPU+GPU
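
As a sketch for tip 4, a service can pre-warm its most-used models at startup. The model list here is an example; the register/load calls are the ones shown earlier in this post:

import flashtensors as ft

# Models to keep warm from the moment the service starts (example list).
CRITICAL_MODELS = ["Qwen/Qwen3-14B", "Qwen/Qwen3-0.6B"]

for model_id in CRITICAL_MODELS:
    ft.register_model(model_id, backend="vllm", torch_dtype="bfloat16")
    ft.load_model(model_id, backend="vllm", dtype="bfloat16")  # populates the cache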


What’s Next for flashtensors

The project roadmap includes Docker support, a built-in inference server, and integrations with SGLang, LlamaCPP, Dynamo, and Ollama. These additions will make flashtensors even more valuable for local AI deployments.


Conclusion

flashtensors represents a fundamental rethink of how models should transfer from storage to GPU. On the DGX Spark’s 128GB unified memory architecture, this approach enables capabilities that weren’t practical before: hosting 100+ models on a single GPU, sub-3-second cold starts for models up to 32B parameters, and near-instant hot-swapping between cached models.

For those of us building local AI infrastructure, this eliminates one of the biggest friction points in model deployment. Combined with the DGX Spark’s exceptional memory capacity, flashtensors transforms what’s possible with a single GPU system.


Resources

• flashtensors GitHub: github.com/leoheuler/flashtensors

• NVIDIA DGX Spark: nvidia.com/dgx-spark

• vLLM Documentation: docs.vllm.ai

— 

One command. One GPU. A hundred models. No waiting.

