CompFP4 compresses optimizer states to sub-4-bit precision with error-feedback compensation. Train larger models on the GPUs you already have.
Up to 59% VRAM reduction. 7 domains. Zero loss penalty.
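The compression itself follows the standard error-feedback recipe: quantize the optimizer state, then carry the rounding error into the next step instead of discarding it. Below is a minimal sketch of that general mechanism only; the uniform quantizer, the int8 storage, and the function name are illustrative assumptions, not CompFP4's actual kernels.

```python
import torch

def quantize_with_error_feedback(state: torch.Tensor,
                                 error: torch.Tensor,
                                 bits: int = 4):
    """Quantize an optimizer-state tensor to a few levels, carrying the
    rounding error into the next step so it is not lost for good.
    Illustrative only: a real kernel would pack values below one byte
    and typically use a non-uniform code."""
    levels = 2 ** (bits - 1) - 1            # symmetric range, e.g. +/-7 for 4 bits
    corrected = state + error               # fold in last step's residual
    scale = corrected.abs().max().clamp_min(1e-12) / levels
    q = torch.round(corrected / scale).clamp(-levels, levels)
    error.copy_(corrected - q * scale)      # what the quantizer lost this step
    return q.to(torch.int8), scale          # int8 container for simplicity
```

Because the residual is added back before the next quantization, rounding errors cannot accumulate in one direction, which is what keeps low-precision optimizer states from drifting or diverging.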
| Pain point | With CompFP4 |
|---|---|
| OOM crashes mid-training | Up to 59% VRAM reduction, proven across 7 domains |
| Can't afford H100 clusters | Train on an A100 or RTX 4090 what previously required an H100 |
| Gradient instability at low precision | Error-feedback compensation, no divergence |
| Weeks of optimizer tuning per domain | Domain-aware Hybrid++ auto-splits parameters |
| Complex multi-node setup | `pip install compfp4`, then 3 lines to integrate (see the sketch below) |
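What the three-line swap could look like in a plain PyTorch loop. The package name matches `pip install compfp4`, but the `CompFP4Optimizer` class name and its constructor arguments are assumptions, not a documented API.

```python
import torch
import torch.nn as nn
from compfp4 import CompFP4Optimizer   # hypothetical class name

model = nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda()
optimizer = CompFP4Optimizer(model.parameters(), lr=1e-4)  # assumed drop-in for AdamW

x = torch.randn(16, 32, 512, device="cuda")  # (seq, batch, d_model)
loss = model(x).pow(2).mean()                # stand-in loss for illustration
loss.backward()
optimizer.step()
optimizer.zero_grad()
```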
The CLI connects to any cloud GPU: RunPod, Lambda, or your own cluster.
| Models | What you get |
|---|---|
| GPT, LLaMA, BERT, Gemma | LLM training at half the VRAM. GPT-scale on consumer GPUs. |
| ViT, DeiT, ResNet | Larger batches, faster convergence on classification and detection. |
| Whisper, HuBERT, Wav2Vec2 | ASR and audio-understanding training without VRAM walls. |
| VITS, Bark, SpeechT5, Tortoise | Text-to-speech and voice synthesis at full model scale. |
| Stable Diffusion, DiT, DDPM | Train UNets and diffusion transformers with 55% less VRAM. |
| CLIP, LLaVA, BLIP-2, Flamingo | Vision+language training in one pass without OOM. |
| Mixtral, Switch, DeepSeek-MoE | Mixture-of-Experts training with 4x fewer optimizer bytes per expert. |
| Any PyTorch model | Bring your own architecture and auto-tune with the `--auto-tune` flag (see the sketch below this table). |
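One way to picture what auto-splitting parameters can mean for an arbitrary architecture: compress the large matmul weights and keep small, sensitive tensors in full precision. The grouping heuristic and the `compress_bits` option below are illustrative assumptions, not the actual Hybrid++ policy.

```python
import torch.nn as nn

def split_param_groups(model: nn.Module):
    """Toy split: compress large 2-D+ weights, keep norms, biases and
    embeddings uncompressed. Illustrative, not the Hybrid++ policy."""
    compress, keep = [], []
    for name, param in model.named_parameters():
        if param.ndim >= 2 and "embed" not in name and "norm" not in name:
            compress.append(param)
        else:
            keep.append(param)
    return [
        {"params": compress, "compress_bits": 4},   # hypothetical option name
        {"params": keep, "compress_bits": None},    # left in full precision
    ]
```

The resulting groups could then be passed to the optimizer the same way ordinary `torch.optim` parameter groups are.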
Real training runs. Real VRAM savings. No asterisks.
| Domain | Model | Baseline VRAM | CompFP4 VRAM | Savings |
|---|---|---|---|---|
| Language | LLaMA-7B | 14.2 GB | 6.4 GB | 55% |
| Vision | ViT-L/14 | 8.1 GB | 3.8 GB | 53% |
| Audio | Whisper-Large | 6.4 GB | 3.0 GB | 53% |
| Multimodal | LLaVA-1.5 | 16.8 GB | 7.9 GB | 53% |
| MoE | Mixtral-8x7B | 94 GB | 42 GB | 55% |
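For reference, the Savings column is just the relative reduction, 1 - (CompFP4 VRAM / baseline VRAM). The values below are copied from the table and reproduce the rounded percentages:

```python
# Reproduce the Savings column from the table above (values in GB).
runs = {
    "LLaMA-7B":      (14.2, 6.4),
    "ViT-L/14":      (8.1, 3.8),
    "Whisper-Large": (6.4, 3.0),
    "LLaVA-1.5":     (16.8, 7.9),
    "Mixtral-8x7B":  (94.0, 42.0),
}
for model, (baseline, compressed) in runs.items():
    print(f"{model}: {1 - compressed / baseline:.0%}")
```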
CompFP4 runs wherever PyTorch runs. Local workstation, cloud VM, or managed cluster — install and go.
SSH in, pip install, train.
SM 7.0+ (Volta and newer).
Drop-in optimizer replacement.
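To confirm a GPU meets the SM 7.0+ requirement before installing, PyTorch's built-in device-capability query is enough:

```python
import torch

# Quick check for the SM 7.0+ (Volta and newer) requirement.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible to PyTorch.")

major, minor = torch.cuda.get_device_capability()
name = torch.cuda.get_device_name()
ok = (major, minor) >= (7, 0)
print(f"{name}: SM {major}.{minor} ({'supported' if ok else 'below the SM 7.0 minimum'})")
```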
Start free. Scale when you're ready.