CompFP4 compresses optimizer states to sub-4-bit precision with error-feedback compensation. Train larger models on the GPUs you already have.
Up to 59% VRAM reduction. 7 domains. Zero loss penalty.
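The compression itself follows the standard error-feedback recipe: quantize the optimizer state, then carry the rounding error into the next step instead of discarding it. Below is a minimal sketch of that general mechanism only; the uniform quantizer, the int8 storage, and the function name are illustrative assumptions, not CompFP4's actual kernels.

```python
import torch

def quantize_with_error_feedback(state: torch.Tensor,
                                 error: torch.Tensor,
                                 bits: int = 4):
    """Quantize an optimizer-state tensor to a few levels, carrying the
    rounding error into the next step so it is not lost for good.
    Illustrative only: a real kernel would pack values below one byte
    and typically use a non-uniform code."""
    levels = 2 ** (bits - 1) - 1            # symmetric range, e.g. +/-7 for 4 bits
    corrected = state + error               # fold in last step's residual
    scale = corrected.abs().max().clamp_min(1e-12) / levels
    q = torch.round(corrected / scale).clamp(-levels, levels)
    error.copy_(corrected - q * scale)      # what the quantizer lost this step
    return q.to(torch.int8), scale          # int8 container for simplicity
```

Because the residual is added back before the next quantization, rounding errors cannot accumulate in one direction, which is what keeps low-precision optimizer states from drifting or diverging.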
| Pain point | With CompFP4 |
|---|---|
| OOM crashes mid-training | Up to 59% VRAM reduction, proven across 7 domains |
| Can't afford H100 clusters | Train on an A100 or RTX 4090 what previously required an H100 |
| Gradient instability at low precision | Error-feedback compensation, no divergence |
| Weeks of optimizer tuning per domain | Domain-aware Hybrid++ auto-splits parameters |
| Complex multi-node setup | `pip install compfp4`, then 3 lines to integrate (see the sketch below) |
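What the three-line swap could look like in a plain PyTorch loop. The package name matches `pip install compfp4`, but the `CompFP4Optimizer` class name and its constructor arguments are assumptions, not a documented API.

```python
import torch
import torch.nn as nn
from compfp4 import CompFP4Optimizer   # hypothetical class name

model = nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda()
optimizer = CompFP4Optimizer(model.parameters(), lr=1e-4)  # assumed drop-in for AdamW

x = torch.randn(16, 32, 512, device="cuda")  # (seq, batch, d_model)
loss = model(x).pow(2).mean()                # stand-in loss for illustration
loss.backward()
optimizer.step()
optimizer.zero_grad()
```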
The CLI connects to any cloud GPU: RunPod, Lambda, or your own cluster.
| Models | What you get |
|---|---|
| GPT, LLaMA, BERT, Gemma | LLM training at half the VRAM. GPT-scale on consumer GPUs. |
| ViT, DeiT, ResNet | Larger batches, faster convergence on classification and detection. |
| Whisper, HuBERT, Wav2Vec2 | ASR and audio-understanding training without VRAM walls. |
| VITS, Bark, SpeechT5, Tortoise | Text-to-speech and voice synthesis at full model scale. |
| Stable Diffusion, DiT, DDPM | Train UNets and diffusion transformers with 55% less VRAM. |
| CLIP, LLaVA, BLIP-2, Flamingo | Vision+language training in one pass without OOM. |
| Mixtral, Switch, DeepSeek-MoE | Mixture-of-Experts training with 4x fewer optimizer bytes per expert. |
| Any PyTorch model | Bring your own architecture and auto-tune with the `--auto-tune` flag (see the sketch below this table). |
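One way to picture what auto-splitting parameters can mean for an arbitrary architecture: compress the large matmul weights and keep small, sensitive tensors in full precision. The grouping heuristic and the `compress_bits` option below are illustrative assumptions, not the actual Hybrid++ policy.

```python
import torch.nn as nn

def split_param_groups(model: nn.Module):
    """Toy split: compress large 2-D+ weights, keep norms, biases and
    embeddings uncompressed. Illustrative, not the Hybrid++ policy."""
    compress, keep = [], []
    for name, param in model.named_parameters():
        if param.ndim >= 2 and "embed" not in name and "norm" not in name:
            compress.append(param)
        else:
            keep.append(param)
    return [
        {"params": compress, "compress_bits": 4},   # hypothetical option name
        {"params": keep, "compress_bits": None},    # left in full precision
    ]
```

The resulting groups could then be passed to the optimizer the same way ordinary `torch.optim` parameter groups are.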
Real training runs. Real VRAM savings. No asterisks.
| Domain | Model | Baseline VRAM | CompFP4 VRAM | Savings |
|---|---|---|---|---|
| Language | LLaMA-7B | 14.2 GB | 6.4 GB | 55% |
| Vision | ViT-L/14 | 8.1 GB | 3.8 GB | 53% |
| Audio | Whisper-Large | 6.4 GB | 3.0 GB | 53% |
| Multimodal | LLaVA-1.5 | 16.8 GB | 7.9 GB | 53% |
| MoE | Mixtral-8x7B | 94 GB | 42 GB | 55% |
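For reference, the Savings column is just the relative reduction, 1 - (CompFP4 VRAM / baseline VRAM). The values below are copied from the table and reproduce the rounded percentages:

```python
# Reproduce the Savings column from the table above (values in GB).
runs = {
    "LLaMA-7B":      (14.2, 6.4),
    "ViT-L/14":      (8.1, 3.8),
    "Whisper-Large": (6.4, 3.0),
    "LLaVA-1.5":     (16.8, 7.9),
    "Mixtral-8x7B":  (94.0, 42.0),
}
for model, (baseline, compressed) in runs.items():
    print(f"{model}: {1 - compressed / baseline:.0%}")
```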
CompFP4 runs wherever PyTorch runs. Local workstation, cloud VM, or managed cluster — install and go.
SSH in, pip install, train.
SM 7.0+ (Volta and newer).
Drop-in optimizer replacement.
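To confirm a GPU meets the SM 7.0+ requirement before installing, PyTorch's built-in device-capability query is enough:

```python
import torch

# Quick check for the SM 7.0+ (Volta and newer) requirement.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible to PyTorch.")

major, minor = torch.cuda.get_device_capability()
name = torch.cuda.get_device_name()
ok = (major, minor) >= (7, 0)
print(f"{name}: SM {major}.{minor} ({'supported' if ok else 'below the SM 7.0 minimum'})")
```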
Start free. Scale when you're ready.