Fine-Tuning Language Models on NVIDIA DGX Spark
Complete How-To Guide
Overview
This guide provides comprehensive instructions for fine-tuning open-source language models on the NVIDIA DGX Spark personal AI supercomputer. The DGX Spark’s unique 128GB unified memory architecture enables local training of models that would traditionally require cloud infrastructure.
Fine-tuning lets you customize pre-trained models for specific tasks, domains, or response styles while preserving their general capabilities. This guide covers three strategies: full fine-tuning for maximum customization, LoRA (Low-Rank Adaptation) for memory-efficient adaptation, and QLoRA (quantized LoRA) for fitting even larger models within memory constraints.
DGX Spark Hardware Advantages
The NVIDIA DGX Spark provides several key advantages for local AI development:
- 128GB Unified Memory: CPU and GPU share the same memory pool via NVLink-C2C, eliminating memory transfer bottlenecks
- Grace Blackwell Architecture: Purpose-built for AI workloads with up to 1 PFLOPS performance (FP4)
- 900 GB/s NVLink-C2C Bandwidth: Ultra-fast CPU-GPU communication for seamless model loading
- Local Execution: Complete privacy, no cloud dependencies, predictable costs
- Large Model Support: Train 7B-70B parameter models locally with appropriate methods
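Before starting a run, it is worth confirming that PyTorch can see the GPU and how much memory it reports. Below is a minimal sanity check, assuming a CUDA-enabled PyTorch install; on DGX Spark the reported device memory should reflect the unified pool.

import torch

# Confirm the GPU is visible and report the memory PyTorch can address.
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))
props = torch.cuda.get_device_properties(0)
print(f"Memory visible to the GPU: {props.total_memory / 1e9:.0f} GB")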
Fine-Tuning Methods
Choose the appropriate method based on your model size, available memory, and quality requirements:
[Figure: comparison of full fine-tuning, LoRA, and QLoRA trade-offs (Copyright: Sanjay Basu)]
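As a rough illustration of what the QLoRA path changes, the usual approach loads the base model in 4-bit NF4 precision via bitsandbytes before attaching LoRA adapters. A minimal sketch follows; the model ID is only an example.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization keeps a 70B-class base model within the 128GB unified memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",   # example base model
    quantization_config=bnb_config,
    device_map="auto",
)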
Recommended Models
The following open-source models are excellent choices for fine-tuning on DGX Spark, sorted by size:
Small Models (Under 3B Parameters)
Ideal for experimentation, fast iteration, and full fine-tuning:
- SmolLM 135M/360M/1.7B: HuggingFace’s efficient small models, perfect for testing
- Qwen 2.5 1.5B: Excellent multilingual capabilities in a small package
- Phi-3 Mini (3.8B): Microsoft’s compact but capable model (slightly above the 3B cutoff, but comparable in training footprint)
Medium Models (3B-13B Parameters)
Best balance of capability and trainability with LoRA:
- Qwen 2.5 3B/7B: Strong reasoning and coding abilities
- Llama 3.2 3B: Meta’s latest efficient model
- Llama 3.1 8B: Excellent general-purpose model
- Mistral 7B: Strong performance with fast inference
- Gemma 2 9B: Google’s high-quality open model
Large Models (13B+ Parameters)
Maximum capability, requires QLoRA for training:
- Mistral Nemo 12B: Excellent for complex tasks
- Llama 3.1 70B: State-of-the-art open model (QLoRA required)
- Qwen 2.5 72B: Powerful multilingual model (QLoRA required)
Quick Start Guide
Follow these steps to fine-tune your first model:
Step 1: Environment Setup
Clone or download the fine-tuning scripts, then run the setup:
chmod +x setup.sh && ./setup.sh
This creates a virtual environment and installs all dependencies.
Step 2: Prepare Your Data
Create a JSON file with your training examples in Alpaca format:
[{"instruction": "Your task", "input": "Optional context", "output": "Expected response"}]
Or use the provided dataset preparation script:
python scripts/prepare_dataset.py --create-sample
Step 3: Run Fine-Tuning
Execute fine-tuning with your chosen model and method:
python scripts/finetune_dgx_spark.py --model qwen2.5-3b --method lora --dataset data/sample_data.json
Or use the convenience script with presets:
./run_finetune.sh small # Qwen 3B with LoRA
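For context on what such a run involves under the hood, the sketch below shows a minimal LoRA fine-tune written directly against transformers, peft, and datasets. It is an illustration rather than the repository script itself; the model ID, prompt template, and hyperparameters are placeholders.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Qwen/Qwen2.5-3B"  # example base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Attach LoRA adapters; only a small fraction of weights become trainable.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"))
model.print_trainable_parameters()

# Turn Alpaca-format records into single training strings, then tokenize.
def to_text(ex):
    ctx = f"\n{ex['input']}" if ex.get("input") else ""
    return {"text": f"### Instruction:\n{ex['instruction']}{ctx}\n\n### Response:\n{ex['output']}"}

dataset = load_dataset("json", data_files="data/sample_data.json", split="train").map(to_text)
tokenized = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
                        remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="output", per_device_train_batch_size=4,
                           gradient_accumulation_steps=4, learning_rate=2e-4,
                           num_train_epochs=3, bf16=True, logging_steps=10),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("output/lora_adapter")  # saves adapter weights only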
Step 4: Test Your Model
Run inference with your fine-tuned model:
python scripts/finetune_dgx_spark.py --inference --model-path output/merged_model --prompt "Your test prompt"
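Equivalently, the merged model can be loaded directly with transformers for a quick smoke test. A short sketch, using the same output path as the command above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "output/merged_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Your test prompt", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))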
Dataset Preparation
High-quality training data is the most important factor in fine-tuning success. This section covers data formats and best practices.
Supported Formats
Alpaca Format (Recommended)
The standard format for instruction-following datasets:
{"instruction": "task description", "input": "optional context", "output": "expected response"}
ShareGPT Format
For conversational/chat-style data:
{"conversations": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
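Whichever format you choose, load the file once before training and confirm every record carries the expected fields. A minimal check for the Alpaca layout (the path follows the quick-start example):

import json

with open("data/sample_data.json") as f:
    examples = json.load(f)

for i, ex in enumerate(examples):
    missing = {"instruction", "output"} - ex.keys()   # "input" is optional
    if missing:
        raise ValueError(f"example {i} is missing fields: {missing}")
print(f"{len(examples)} examples passed the format check")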
Data Quality Guidelines
- Aim for 1,000–10,000 high-quality examples for domain adaptation
- Ensure diverse examples covering the full range of desired behaviors
- Include both simple and complex examples for robust learning
- Validate that outputs match instructions accurately
- Remove duplicates and low-quality examples
- Balance categories if doing multi-task training
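De-duplication, mentioned above, can be as simple as keying each record on its full content. A sketch, again assuming the Alpaca layout; the output path is only an example:

import json

with open("data/sample_data.json") as f:
    examples = json.load(f)

seen, unique = set(), []
for ex in examples:
    key = (ex["instruction"], ex.get("input", ""), ex["output"])
    if key not in seen:
        seen.add(key)
        unique.append(ex)

with open("data/clean_data.json", "w") as f:   # example output path
    json.dump(unique, f, indent=2, ensure_ascii=False)
print(f"kept {len(unique)} of {len(examples)} examples")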
Training Configuration
Optimal hyperparameters vary by model size and method. The following recommendations are tuned for DGX Spark’s 128GB unified memory:
Recommended Batch Sizes
[Table: recommended batch sizes by model size and fine-tuning method (Copyright: Sanjay Basu)]
LoRA Hyperparameters
- Rank (r): 16 for LoRA, 64 for QLoRA — higher rank = more capacity
- Alpha: Typically 2x the rank (32 for r=16)
- Dropout: 0.05 for LoRA, 0.1 for QLoRA
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
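Expressed as a peft LoraConfig, the settings above map roughly to:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # 64 for QLoRA
    lora_alpha=32,             # typically 2x the rank
    lora_dropout=0.05,         # 0.1 for QLoRA
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)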
Training Parameters
- Learning Rate: 2e-4 (adjust based on loss curves)
- Epochs: 3 for domain adaptation, 1–2 for instruction tuning
- Warmup Ratio: 0.03 (3% of training steps)
- Weight Decay: 0.01
- Scheduler: Cosine with warmup
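The same values expressed as transformers TrainingArguments, as a sketch; batch size and gradient accumulation depend on model size and method:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    learning_rate=2e-4,
    num_train_epochs=3,          # 1-2 for instruction tuning
    warmup_ratio=0.03,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    bf16=True,                   # bfloat16 is well supported on Grace Blackwell
    logging_steps=10,
)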
Troubleshooting
Out of Memory Errors
- Reduce batch size by half
- Increase gradient accumulation to maintain effective batch size
- Switch from full fine-tuning to LoRA, or from LoRA to QLoRA
- Reduce sequence length (max_length parameter)
- Enable gradient checkpointing (enabled by default)
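The first two adjustments above preserve the effective batch size, since it equals per_device_train_batch_size times gradient_accumulation_steps. For example:

from transformers import TrainingArguments

# 8 x 2 and 4 x 4 both give an effective batch size of 16,
# but the second configuration roughly halves activation memory.
training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=4,   # halved from 8
    gradient_accumulation_steps=4,   # doubled from 2
    gradient_checkpointing=True,     # trades recompute time for memory
)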
Training Loss Not Decreasing
- Check data quality and format
- Increase learning rate by 2–5x
- Verify tokenization is working correctly
- Ensure sufficient training examples (1000+ recommended)
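A quick way to verify tokenization (third point above) is to round-trip one formatted training example and inspect the result; the model ID is an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")   # example model
sample = "### Instruction:\nSummarize the text.\n\n### Response:\nA short summary."
ids = tokenizer(sample)["input_ids"]
print(len(ids), "tokens")
print(tokenizer.decode(ids))   # should reproduce the text, plus any special tokens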
Model Produces Nonsense
- Training may have diverged — reduce learning rate
- Check for data formatting issues
- Ensure proper tokenizer configuration
- Train for more epochs if loss is still high
Next Steps
After successfully fine-tuning your model:
- Evaluate on held-out test data to measure improvements
- Deploy using LM Studio, Ollama, or vLLM for inference
- Compare with cloud alternatives to quantify DGX Spark advantages
- Iterate on data quality for continued improvement
- Consider RLHF or DPO for further alignment
The DGX Spark’s unified memory architecture provides unique advantages for local AI development, enabling training of large models without cloud dependencies while maintaining full control over your data and models.