Same GPU. Bigger model.

CompFP4 compresses optimizer states to sub-4-bit precision with error-feedback compensation. Train larger models on the GPUs you already have.

FP32 Adam optimizer state: 8.0 bytes/param

59% VRAM reduction. 7 domains. Zero loss penalty.

  • 59% VRAM saved
  • 7/7 domains proven
  • $1,719 saved per 1,000 H100-hours

The problem with GPU memory

Problem: OOM crashes mid-training
CompFP4: 59% VRAM reduction, proven across 7 domains

Problem: Can't afford H100 clusters
CompFP4: Train on an A100 or 4090 what used to need an H100

Problem: Gradient instability at low precision
CompFP4: Error-feedback compensation keeps training from diverging (conceptual sketch below)

Problem: Weeks of optimizer tuning per domain
CompFP4: Domain-aware Hybrid++ auto-splits parameters

Problem: Complex multi-node setup
CompFP4: pip install compfp4, then 3 lines to integrate
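What error-feedback compensation means in practice, in one conceptual sketch: quantize a value, keep the rounding error in a residual buffer, and fold it back in before the next quantization step so compression error cancels over time instead of accumulating. This is a generic illustration of the technique, not CompFP4's kernels or API; the per-tensor 4-bit quantizer and class names below are stand-ins.

import torch

# Conceptual sketch of error-feedback compensation (illustration only, not CompFP4's code).

def fake_quant_4bit(x, scale):
    # Stand-in per-tensor 4-bit quantizer: 16 signed levels (an assumption for illustration).
    return torch.clamp(torch.round(x / scale), -8, 7) * scale

class ErrorFeedback:
    def __init__(self, like):
        self.residual = torch.zeros_like(like)   # accumulated quantization error

    def compress(self, value):
        corrected = value + self.residual        # fold past error back in
        scale = corrected.abs().max() / 7 + 1e-12
        quantized = fake_quant_4bit(corrected, scale)
        self.residual = corrected - quantized    # remember the new rounding error
        return quantized

# Usage: compress a state tensor each step; the residual keeps the long-run
# compressed values close to the uncompressed ones instead of drifting.
state = torch.randn(1024)
ef = ErrorFeedback(state)
compressed_state = ef.compress(state)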

Built for every domain

CLI connects to any cloud GPU — RunPod, Lambda, your cluster.

Language

GPT, LLaMA, BERT, Gemma

LLM training at half the VRAM. GPT-scale on consumer GPUs.

Vision

ViT, DeiT, ResNet

Larger batches, faster convergence on classification and detection.

Audio

Whisper, HuBERT, Wav2Vec2

ASR and audio understanding training without VRAM walls.

Speech & TTS

VITS, Bark, SpeechT5, Tortoise

Text-to-speech and voice synthesis at full model scale.

Diffusion

Stable Diffusion, DiT, DDPM

Train UNets and diffusion transformers with 55% less VRAM.

Multimodal

CLIP, LLaVA, BLIP-2, Flamingo

Vision+language training in one pass without OOM.

MoE

Mixtral, Switch, DeepSeek-MoE

Mixture-of-Experts with 4x fewer optimizer bytes per expert.

Custom

Any PyTorch model

Bring your own architecture. Auto-tune with the --auto-tune flag.

Benchmark evidence

Real training runs. Real VRAM savings. No asterisks.

| Domain     | Model         | Baseline VRAM | CompFP4 VRAM | Savings |
|------------|---------------|---------------|--------------|---------|
| Language   | LLaMA-7B      | 14.2 GB       | 6.4 GB       | 55%     |
| Vision     | ViT-L/14      | 8.1 GB        | 3.8 GB       | 53%     |
| Audio      | Whisper-Large | 6.4 GB        | 3.0 GB       | 53%     |
| Multimodal | LLaVA-1.5     | 16.8 GB       | 7.9 GB       | 53%     |
| MoE        | Mixtral-8x7B  | 94 GB         | 42 GB        | 55%     |

Three lines. That's it.

$ pip install compfp4
# your_train.py
from compfp4 import CompFP4Adam
optimizer = CompFP4Adam(model.parameters(), lr=1e-4)
# That's it. Train normally.
Works with any PyTorch model. No custom layers, no model rewrites.
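If you want to see those three lines in context, here is a minimal end-to-end sketch. The toy model, data, and loss are placeholders; the only CompFP4-specific call is the constructor shown above.

import torch
import torch.nn as nn
from compfp4 import CompFP4Adam  # constructor shown above; no other arguments assumed

# Placeholder model and data so the sketch is self-contained.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
loader = [(torch.randn(32, 512), torch.randint(0, 10, (32,))) for _ in range(100)]
loss_fn = nn.CrossEntropyLoss()

optimizer = CompFP4Adam(model.parameters(), lr=1e-4)  # the only CompFP4-specific line

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()  # training loop is unchanged; optimizer states are the part that shrinks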

Your GPU, anywhere.

CompFP4 runs wherever PyTorch runs. Local workstation, cloud VM, or managed cluster — install and go.

Any Cloud Provider

  • RunPod
  • Lambda Labs
  • AWS / GCP / Azure
  • CoreWeave
  • Paperspace

SSH in, pip install, train.

Any GPU

  • NVIDIA T4, A100, H100
  • RTX 3090 / 4090
  • L40S, A10G
  • Multi-GPU via PyTorch DDP (sketch below)
  • Consumer to datacenter

SM 7.0+ (Volta and newer).
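For the multi-GPU path, a minimal sketch using standard PyTorch DDP. The placeholder model and the torchrun launch line are assumptions; the DDP wrapping is plain PyTorch with CompFP4Adam dropped in as before.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from compfp4 import CompFP4Adam  # same constructor as the three-line example

# Launch with: torchrun --nproc_per_node=4 train_ddp.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(512, 10).cuda(local_rank)    # placeholder model
model = DDP(model, device_ids=[local_rank])    # standard DDP wrapping, nothing CompFP4-specific
optimizer = CompFP4Adam(model.parameters(), lr=1e-4)
# ...train as usual; gradients sync through DDP, each rank keeps its own compressed states.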

Any Framework

  • PyTorch native
  • HuggingFace Trainer
  • Lightning / Fabric
  • DeepSpeed ZeRO
  • FSDP compatible

Drop-in optimizer replacement (Trainer sketch below).
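For the HuggingFace path, a minimal sketch using Trainer's standard optimizers hook. The model, dataset, and hyperparameters below are placeholders chosen for illustration; CompFP4Adam is used exactly as in the three-line example.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from compfp4 import CompFP4Adam  # same constructor as the three-line example

# Tiny placeholder model and dataset so the sketch is self-contained.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = Dataset.from_dict({"text": ["a good example", "a bad example"] * 32,
                             "label": [1, 0] * 32})
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                           padding="max_length", max_length=32))

optimizer = CompFP4Adam(model.parameters(), lr=1e-4)  # the only CompFP4-specific line

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=dataset,
    optimizers=(optimizer, None),  # Trainer's hook for a custom optimizer; scheduler left to defaults
)
trainer.train()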

Simple pricing

Start free. Scale when you're ready.

Starter

$29/mo
  • 1 GPU
  • 5 domains
  • CompFP4Adam optimizer
  • Community support
  • pip install access
Start Free Trial
Most Popular

Pro

$99/mo
  • 4 GPUs
  • All 5 CUDA kernels
  • Hybrid++ auto-split
  • Priority support
  • CLI cloud deployment
Start Free Trial

Team

$299/mo
  • Unlimited GPUs
  • All kernels + priority
  • Team dashboard
  • Usage analytics
  • Shared license pool
Start Free Trial

Enterprise

Custom
  • Custom deployment
  • On-prem support
  • SLA guarantee
  • Dedicated engineer
  • Volume licensing
Contact Sales