Fine-Tuning Language Models on NVIDIA DGX Spark

 Complete How-To Guide

Copyright: Sanjay Basu

Overview

This guide provides comprehensive instructions for fine-tuning open-source language models on the NVIDIA DGX Spark personal AI supercomputer. The DGX Spark’s unique 128GB unified memory architecture enables local training of models that would traditionally require cloud infrastructure.

Fine-tuning allows you to customize pre-trained models for specific tasks, domains, or response styles while preserving their general capabilities. This guide covers three fine-tuning strategies: Full fine-tuning for maximum customization, LoRA for memory-efficient adaptation, and QLoRA for training even larger models within memory constraints.

DGX Spark Hardware Advantages

The NVIDIA DGX Spark provides several key advantages for local AI development:

  • 128GB Unified Memory: CPU and GPU share the same memory pool via NVLink-C2C, eliminating memory transfer bottlenecks

  • Grace Blackwell Architecture: Purpose-built for AI workloads with up to 1 PFLOPS performance (FP4)

  • 900 GB/s NVLink-C2C Bandwidth: Ultra-fast CPU-GPU communication for seamless model loading

  • Local Execution: Complete privacy, no cloud dependencies, predictable costs

  • Large Model Support: Train 7B-70B parameter models locally with appropriate methods

Fine-Tuning Methods

Choose the appropriate method based on your model size, available memory, and quality requirements:

  • Full fine-tuning: updates every weight for maximum customization; practical only for small models

  • LoRA: trains small adapter layers on top of frozen base weights for memory-efficient adaptation of mid-size models

  • QLoRA: combines LoRA with a 4-bit quantized base model, allowing the largest models to fit within memory constraints
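
As a rough back-of-envelope check (the per-parameter byte counts below are common rules of thumb, not measured DGX Spark figures), you can estimate why each method fits where it does:

# Rule-of-thumb bytes per parameter: ~16 for full fine-tuning with Adam in bf16,
# ~2 for LoRA (frozen bf16 base plus a small adapter), ~0.5 for QLoRA (4-bit base).
for name, billions in [("7B", 7), ("13B", 13), ("70B", 70)]:
    full, lora, qlora = billions * 16, billions * 2, billions * 0.5  # GB, excluding activations
    print(f"{name}: full ~{full:.0f} GB, LoRA ~{lora:.0f} GB, QLoRA ~{qlora:.0f} GB")

On a 128GB machine this is why full fine-tuning is realistic only for the smallest models, while a 70B model is reachable only through QLoRA.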

Recommended Models

The following open-source models are excellent choices for fine-tuning on DGX Spark, sorted by size:

Small Models (Under 4B Parameters)

Ideal for experimentation, fast iteration, and full fine-tuning:

  • SmolLM 135M/360M/1.7B: HuggingFace’s efficient small models, perfect for testing
  • Qwen 2.5 1.5B: Excellent multilingual capabilities in a small package
  • Phi-3 Mini (3.8B): Microsoft’s compact but capable model

Medium Models (3B-13B Parameters)

Best balance of capability and trainability with LoRA:

  • Qwen 2.5 3B/7B: Strong reasoning and coding abilities
  • Llama 3.2 3B: Meta’s latest efficient model
  • Llama 3.1 8B: Excellent general-purpose model
  • Mistral 7B: Strong performance with fast inference
  • Gemma 2 9B: Google’s high-quality open model

Large Models (13B+ Parameters)

Maximum capability, requires QLoRA for training:

  • Mistral Nemo 12B: Excellent for complex tasks
  • Llama 3.1 70B: State-of-the-art open model (QLoRA required)
  • Qwen 2.5 72B: Powerful multilingual model (QLoRA required)

Quick Start Guide

Follow these steps to fine-tune your first model:

Step 1: Environment Setup

Clone or download the fine-tuning scripts, then run the setup:

chmod +x setup.sh && ./setup.sh

This creates a virtual environment and installs all dependencies.

Step 2: Prepare Your Data

Create a JSON file with your training examples in Alpaca format:

[{"instruction": "Your task", "input": "Optional context", "output": "Expected response"}]

Or use the provided dataset preparation script:

python scripts/prepare_dataset.py --create-sample
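
If you prefer to hand-roll a tiny dataset instead, a minimal sketch in Python (the records are placeholders; the data/sample_data.json path matches the command in Step 3):

import json, os

# Two hypothetical Alpaca-format records; replace with your own task data.
examples = [
    {"instruction": "Summarize the text in one sentence.",
     "input": "The DGX Spark shares 128GB of memory between CPU and GPU over NVLink-C2C.",
     "output": "The DGX Spark gives the CPU and GPU a single 128GB memory pool."},
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
]

os.makedirs("data", exist_ok=True)
with open("data/sample_data.json", "w", encoding="utf-8") as f:
    json.dump(examples, f, ensure_ascii=False, indent=2)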

Step 3: Run Fine-Tuning

Execute fine-tuning with your chosen model and method:

python scripts/finetune_dgx_spark.py --model qwen2.5-3b --method lora --dataset data/sample_data.json

Or use the convenience script with presets:

./run_finetune.sh small # Qwen 3B with LoRA
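
Under the hood, LoRA fine-tuning boils down to loading the base model and attaching small trainable adapter matrices while the original weights stay frozen. The sketch below is illustrative, not the repository script itself; the model id and LoRA settings are example values:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-3B-Instruct"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Wrap the frozen base model with LoRA adapters; only the adapters are trained.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()  # typically well under 1% of all parameters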

Step 4: Test Your Model

Run inference with your fine-tuned model:

python scripts/finetune_dgx_spark.py --inference --model-path output/merged_model --prompt "Your test prompt"
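
Equivalently, you can load the merged model directly with the Transformers pipeline API; the prompt and sampling settings here are placeholders:

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="output/merged_model",   # path produced by the fine-tuning run
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
result = pipe("Your test prompt", max_new_tokens=200, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])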

Dataset Preparation

High-quality training data is the most important factor in fine-tuning success. This section covers data formats and best practices.

Supported Formats

Alpaca Format (Recommended)

The standard format for instruction-following datasets:

{"instruction": "task description", "input": "optional context", "output": "expected response"}
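
During training, each record is rendered into a single prompt string. A common convention is the original Alpaca template shown below; the repository script may use a slightly different wording:

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

def render(example: dict) -> str:
    # Choose the variant based on whether the optional "input" field is present.
    template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
    return template.format(**example)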

ShareGPT Format

For conversational/chat-style data:

{"conversations": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
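
Chat-style records map directly onto a tokenizer's chat template, so you can render them with apply_chat_template; the model id below is just an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")  # example model

conversation = [
    {"role": "user", "content": "What is unified memory?"},
    {"role": "assistant", "content": "A single memory pool shared by the CPU and GPU."},
]

# Render the turns with the model's own chat template before tokenization.
text = tokenizer.apply_chat_template(conversation, tokenize=False)
print(text)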

Data Quality Guidelines

  • Aim for 1,000–10,000 high-quality examples for domain adaptation
  • Ensure diverse examples covering the full range of desired behaviors
  • Include both simple and complex examples for robust learning
  • Validate that outputs match instructions accurately
  • Remove duplicates and low-quality examples (see the cleaning sketch after this list)
  • Balance categories if doing multi-task training
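
A minimal cleaning pass for the duplicate and validity checks above might look like this; the file path matches the sample from Step 2, and the checks should be adapted to your own schema:

import json

REQUIRED = ("instruction", "output")

def clean(records):
    # Keep only records with non-empty required fields, dropping exact duplicates.
    seen, kept = set(), []
    for r in records:
        if any(not str(r.get(k, "")).strip() for k in REQUIRED):
            continue
        key = (r["instruction"].strip(), str(r.get("input", "")).strip(), r["output"].strip())
        if key in seen:
            continue
        seen.add(key)
        kept.append(r)
    return kept

with open("data/sample_data.json", encoding="utf-8") as f:
    data = json.load(f)
print(f"kept {len(clean(data))} of {len(data)} examples")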

Training Configuration

Optimal hyperparameters vary by model size and method. The following recommendations are tuned for DGX Spark’s 128GB unified memory:

Recommended Batch Sizes

Batch size depends on model size, method, and sequence length: larger models need smaller per-device batches, with gradient accumulation used to preserve the effective batch size (the TrainingArguments sketch after the Training Parameters list shows the relationship).

LoRA Hyperparameters

  1. Rank (r): 16 for LoRA, 64 for QLoRA — higher rank = more capacity
  2. Alpha: Typically 2x the rank (32 for r=16)
  3. Dropout: 0.05 for LoRA, 0.1 for QLoRA
  4. Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
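
These values translate directly into a PEFT LoraConfig; a sketch (swap in r=64 and dropout 0.1 for QLoRA):

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                  # rank; use 64 for QLoRA
    lora_alpha=32,         # roughly 2x the rank
    lora_dropout=0.05,     # 0.1 for QLoRA
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)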

Training Parameters

  • Learning Rate: 2e-4 (adjust based on loss curves)
  • Epochs: 3 for domain adaptation, 1–2 for instruction tuning
  • Warmup Ratio: 0.03 (3% of training steps)
  • Weight Decay: 0.01
  • Scheduler: Cosine with warmup
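
Together with a batch size that fits your model, the parameters above map onto Transformers TrainingArguments roughly as follows; the batch and accumulation values are placeholders chosen to show how the effective batch size is formed:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=4,   # placeholder; shrink for larger models
    gradient_accumulation_steps=8,   # effective batch size = 4 x 8 = 32
    learning_rate=2e-4,
    num_train_epochs=3,
    warmup_ratio=0.03,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=10,
)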

Troubleshooting

Out of Memory Errors

  • Reduce batch size by half
  • Increase gradient accumulation to maintain effective batch size
  • Switch from full fine-tuning to LoRA, or from LoRA to QLoRA
  • Reduce sequence length (max_length parameter)
  • Enable gradient checkpointing (enabled by default)

Training Loss Not Decreasing

  • Check data quality and format
  • Increase learning rate by 2–5x
  • Verify tokenization is working correctly
  • Ensure sufficient training examples (1000+ recommended)

Model Produces Nonsense

  • Training may have diverged — reduce learning rate
  • Check for data formatting issues
  • Ensure proper tokenizer configuration
  • Train for more epochs if loss is still high

Next Steps

After successfully fine-tuning your model:

  1. Evaluate on held-out test data to measure improvements
  2. Deploy using LM Studio, Ollama, or vLLM for inference (a vLLM sketch follows this list)
  3. Compare with cloud alternatives to quantify DGX Spark advantages
  4. Iterate on data quality for continued improvement
  5. Consider RLHF or DPO for further alignment
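
For the vLLM option in step 2, a minimal Python sketch might look like this; the model path comes from the Quick Start and the sampling settings are placeholders:

from vllm import LLM, SamplingParams

llm = LLM(model="output/merged_model")          # merged model from the fine-tuning run
params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Your test prompt"], params)
print(outputs[0].outputs[0].text)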

The DGX Spark’s unified memory architecture provides unique advantages for local AI development, enabling training of large models without cloud dependencies while maintaining full control over your data and models.

GitHub: https://github.com/sanjbasu/dgxsparkfinetune

