Most teams renting a GPU server either burn money on a card that's twice the size they need, or grab a too-small one and hit OOM on the first real batch. The decision is rarely made on data — it's "we got a recommendation from someone on Discord."
This post is the actual math we walk every prospect through: how to pick the right card for inference, fine-tuning, batch generation or rendering, when to stay on cloud per-hour pricing, and when dedicated bare-metal pays back inside two months.
What kind of GPU workload do you have?
Four flavours, each bottlenecked by something different. Getting this right is more important than the GPU model.
| Workload | Bottleneck | Typical session length | Idle gaps |
|---|---|---|---|
| Inference / serving | Memory bandwidth | Continuous | 0 % |
| Fine-tuning / training | Compute (TFLOPS) | Hours to days | < 10 % |
| Batch generation (Stable Diffusion, video) | Memory + compute | Bursty | 30–70 % |
| 3D rendering / video transcode | VRAM + compute | Continuous | 10–30 % |
If you're serving an LLM to users in real time — you're memory-bandwidth limited. The card that wins is the one with the fastest HBM/GDDR throughput, not the one with the most TFLOPS.
If you're fine-tuning on a fixed dataset — compute matters most, and you want enough VRAM to fit batch × sequence × activations without offloading.
If you're running Stable Diffusion XL in batches — you need both, but you can get away with smaller cards because each request runs in 5–20 seconds and frees the memory.
If you're rendering Blender, Cinema 4D or transcoding video with NVENC — VRAM is the hard wall, then compute.
Don't pick a card for "AI/ML" generically. Pick it for the workload tier above.
VRAM math — the only number that actually matters first
Run out of VRAM and your job dies, period. So size for VRAM first, then check compute.
For inference of LLMs, the rule of thumb:
VRAM ≈ params × bytes_per_param + kv_cache + activations
params × bytes_per_param = the model weights
kv_cache ≈ 2 × layers × hidden × seq_len × batch × bytes_per_param
activations ≈ negligible at inference, dominant at training
In practice you can use this quick table:
| Model | FP32 (4 bytes) | FP16 / BF16 (2 bytes) | INT8 (1 byte) | INT4 (0.5 bytes) |
|---|---|---|---|---|
| Llama 3 8B | 32 GB | 16 GB | 8 GB | 4 GB |
| Llama 3 70B | 280 GB | 140 GB | 70 GB | 35 GB |
| Llama 3.1 405B | 1620 GB | 810 GB | 405 GB | 202 GB |
| Mistral 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| Mixtral 8×7B | 188 GB | 94 GB | 47 GB | 23 GB |
| Mixtral 8×22B | 564 GB | 282 GB | 141 GB | 70 GB |
| SD 1.5 (UNet only) | 3.4 GB | 1.7 GB | n/a | n/a |
| SDXL 1.0 (UNet only) | 14 GB | 7 GB | n/a | n/a |
| Flux.1 dev (transformer only) | 48 GB | 24 GB | n/a | n/a |
These numbers are weights only. Add 10–30 % for KV cache at long context, and another 10 % for framework overhead. So Llama 70B at INT4 realistically needs ~42–49 GB, not 35 GB.
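As a rough sketch of the sizing rule above, the helper below starts from the weights-only figure and adds the 10–30 % KV cache share and the 10 % framework overhead. The function names and the choice to add the percentages (rather than compound them) are assumptions made for illustration.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes, as in the table)."""
    return params_billion * bytes_per_param


def realistic_vram_gb(params_billion, bytes_per_param,
                      kv_fraction=(0.10, 0.30), overhead_fraction=0.10):
    """Return a (low, high) estimate: weights + KV cache share + framework overhead.

    Assumption: both percentages are applied to the weights-only figure and summed.
    """
    w = weights_gb(params_billion, bytes_per_param)
    low = w * (1 + kv_fraction[0] + overhead_fraction)
    high = w * (1 + kv_fraction[1] + overhead_fraction)
    return low, high


# Llama 70B at INT4 (0.5 bytes per parameter)
low, high = realistic_vram_gb(70, 0.5)
print(f"weights: {weights_gb(70, 0.5):.0f} GB, realistic: {low:.0f}-{high:.0f} GB")
```

Applied literally, the rule puts Llama 70B at INT4 well above its 35 GB of weights, which is why a card that only just matches the weights figure is not enough.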
For fine-tuning with LoRA / QLoRA, multiply by ~1.5× over inference. For full-parameter fine-tuning, multiply by ~3–4× because you need optimiser states (Adam = 2× weights, gradients = 1× weights).
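The multipliers above can be written down as two small helpers. This is a minimal sketch: the function names are hypothetical, activations are left out, and the full fine-tune case uses the upper end of the 3–4× range (weights + gradients + two Adam moment buffers).

```python
def full_finetune_gb(weights_gb: float) -> float:
    """Weights (1x) + gradients (1x) + Adam moment estimates (2x). Activations excluded."""
    return weights_gb * (1 + 1 + 2)


def lora_finetune_gb(inference_gb: float, factor: float = 1.5) -> float:
    """LoRA / QLoRA rule of thumb from the text: ~1.5x the inference footprint."""
    return inference_gb * factor


# Mistral 7B at FP16: 14 GB of weights per the table
print(f"full fine-tune: ~{full_finetune_gb(14):.0f} GB before activations")
print(f"LoRA on a 14 GB inference footprint: ~{lora_finetune_gb(14):.0f} GB")
```

The point of the comparison: the same 7B model that serves from a 24 GB card needs multi-card memory for a full-parameter run, while the adapter-based run stays on one card.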
Compute (TFLOPS) — when it bottlenecks
Compute matters when you're training, batch-generating, or rendering. For real-time LLM inference, you're usually waiting on memory bandwidth long before you saturate compute.
| Card | VRAM | Memory bandwidth | FP16 TFLOPS | INT8 TOPS | Real price/hr (dedicated) |
|---|---|---|---|---|---|
| RTX 4090 (consumer) | 24 GB | 1008 GB/s | 165 | 661 | not allowed in datacenter EULA |
| RTX 6000 Ada | 48 GB | 960 GB/s | 91 | 1457 | $0.95–1.20 |
| RTX PRO 4500 Blackwell | 32 GB | 896 GB/s | 113 | 904 | $0.55–0.85 |
| RTX PRO 6000 Blackwell | 96 GB | 1792 GB/s | 252 | 2016 | $1.40–1.80 |
| A100 80 GB | 80 GB | 2039 GB/s | 312 | 624 | $1.80–2.40 (older silicon) |
| H100 80 GB | 80 GB | 3350 GB/s | 989 | 1978 | $2.50–4.50 |
(Prices are EU dedicated bare-metal, monthly contract amortised per hour. Cloud on-demand typically costs 1.5–4× more; see the TCO section below.)
The two Blackwell PRO cards in our fleet (RTX PRO 4500 and 6000) are the sweet spot for most workloads:
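- PRO 4500 (32 GB): fits Llama 3 8B at FP16, SDXL with comfortable batch sizes, and most open-weight image models. Llama 70B at INT4 does not fit, because its 35 GB of weights alone exceed the card. $0.55–0.85/hr dedicated.
- PRO 6000 Blackwell (96 GB): fits Llama 70B at INT8 (70 GB of weights) with room for KV cache, Mixtral 8×22B at INT4, Flux.1 at FP16. Llama 70B at FP16 needs 140 GB, so it takes more than one card. $1.40–1.80/hr dedicated.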
The H100 is overkill for most production inference unless you're serving a 70B+ model at high QPS. It shines for training and high-throughput batch — and you pay for it.
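To make "size for VRAM first" concrete, here is a small lookup over the VRAM and bandwidth columns of the table above. It is only a sketch: the 10 % default headroom and the function name are assumptions, and it ignores compute and price entirely.

```python
CARDS = [
    # (name, vram_gb, bandwidth_gb_s), copied from the comparison table
    ("RTX 4090", 24, 1008),
    ("RTX 6000 Ada", 48, 960),
    ("RTX PRO 4500 Blackwell", 32, 896),
    ("RTX PRO 6000 Blackwell", 96, 1792),
    ("A100 80 GB", 80, 2039),
    ("H100 80 GB", 80, 3350),
]


def cards_that_fit(required_gb: float, headroom: float = 0.10):
    """Cards whose VRAM covers required_gb plus a safety margin, smallest first."""
    need = required_gb * (1 + headroom)
    fits = [card for card in CARDS if card[1] >= need]
    return sorted(fits, key=lambda card: card[1])


for name, vram, bandwidth in cards_that_fit(47):
    print(f"{name}: {vram} GB, {bandwidth} GB/s")
```

Once the VRAM filter has narrowed the list, the bandwidth column is the next thing to compare for inference, and the TFLOPS column for training.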
Worked example 1: serving Llama 3 8B to 50 concurrent users
You have an internal chatbot for support staff, expecting 50 simultaneous sessions, each averaging 1 message every 30 seconds. Average prompt 800 tokens, average response 300 tokens.
Model weights (FP16): 16 GB
KV cache per session (FP16):
2 × 32 layers × 4096 hidden × 1100 tokens × 2 bytes ≈ 0.58 GB/session
50 concurrent sessions: 50 × 0.58 ≈ 29 GB
Framework + activations: ~2 GB
Total VRAM needed: 47 GB
You can't run this on a 24 GB card. PRO 4500 (32 GB) gets you partway with INT8 quantisation. The clean answer is PRO 6000 Blackwell (96 GB): comfortable headroom, room for batching, and it can scale to 100+ concurrent users without re-architecting.
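The budget above can be reproduced with the post's KV cache formula. One caveat, stated as an assumption of this sketch: the formula treats every layer as caching full-width keys and values. Llama 3 uses grouped-query attention with fewer key/value heads than query heads, so the real cache is smaller and these numbers are best read as an upper bound.

```python
def kv_cache_gb(layers, hidden, tokens, bytes_per_value=2, sessions=1):
    """KV cache in GB: keys and values (the factor 2) per layer, per token, per session."""
    return 2 * layers * hidden * tokens * bytes_per_value * sessions / 1e9


WEIGHTS_GB = 16      # Llama 3 8B at FP16
FRAMEWORK_GB = 2     # framework + activations, as budgeted above

per_session = kv_cache_gb(layers=32, hidden=4096, tokens=800 + 300)
all_sessions = kv_cache_gb(layers=32, hidden=4096, tokens=800 + 300, sessions=50)
total = WEIGHTS_GB + all_sessions + FRAMEWORK_GB

print(f"KV cache per session: {per_session:.2f} GB")
print(f"KV cache, 50 sessions: {all_sessions:.1f} GB")
print(f"Total VRAM: {total:.0f} GB")
```

Changing `tokens` or `sessions` shows how quickly the cache, not the weights, becomes the dominant term.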
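Throughput on PRO 6000 for Llama 8B FP16: roughly 80–120 tokens/sec per session, which is about what 1792 GB/s of bandwidth allows when every generated token reads 16 GB of weights. At 300 tokens average response, a reply takes ~2.5–4 seconds to generate. The load itself is modest: 50 users sending one message every 30 seconds is ~1.7 completions per second, or ~500 output tokens/sec in aggregate, which a batching server can cover because one pass over the weights serves every session in the batch.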
If you instead run Llama 3 8B quantised to INT4, weights drop to 4 GB. If the KV cache is also stored at 8 bits it halves to ~14 GB, for a total of ~20 GB. That fits on PRO 4500 or RTX 6000 Ada with room to spare. Throughput is similar or slightly higher (memory-bound workload, less data to move). For most production deployments, INT4 quantisation is the unsung performance optimisation.
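For a quick sanity check on throughput, two numbers are worth computing: what the users actually demand, and the bandwidth ceiling for a single decode stream. The sketch below assumes the simplest memory-bound model (each generated token reads all weights once) and ignores KV cache reads, prompt processing and kernel overheads, so real figures will be lower than the ceiling.

```python
def required_tokens_per_sec(users, seconds_between_messages, tokens_per_reply):
    """Aggregate output tokens/sec needed to keep up with the stated traffic."""
    return users / seconds_between_messages * tokens_per_reply


def single_stream_ceiling(bandwidth_gb_s, weights_gb):
    """Upper bound on decode tokens/sec for one stream in a memory-bound regime."""
    return bandwidth_gb_s / weights_gb


demand = required_tokens_per_sec(users=50, seconds_between_messages=30, tokens_per_reply=300)
ceiling = single_stream_ceiling(bandwidth_gb_s=1792, weights_gb=16)  # PRO 6000, 8B at FP16

print(f"demand: ~{demand:.0f} tokens/sec aggregate")
print(f"one stream: at most ~{ceiling:.0f} tokens/sec")
```

Demand exceeding the single-stream ceiling is not a problem in itself: batching lets one read of the weights produce a token for every active session, which is how aggregate throughput climbs well above the per-stream rate.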
Worked example 2: fine-tuning Mistral 7B on a 100K-example dataset
You're customising Mistral 7B on internal docs. Standard QLoRA setup, 4-bit base model, rank-16 LoRA adapters, batch size 4, sequence length 2048.
Base model (4-bit): 3.5 GB
LoRA adapters (rank 16, FP16): ~0.05 GB
Gradients (FP32): ~0.2 GB (LoRA only)
Optimiser states (AdamW): ~0.5 GB
Activations (batch 4, seq 2048): 12–18 GB
Total: ~17–22 GB
Fits comfortably on a 24 GB card, but stress test before you commit: actual VRAM peaks during backprop can be 1.3× the steady-state. PRO 4500 (32 GB) gives you a safety margin and lets you bump batch size to 8 for faster epochs.
100K examples × 3 epochs × 1.5 sec/example ≈ 125 hours on PRO 4500. At the dedicated rate of $0.55–0.85/hr that's ~$70–105 total. On AWS p5.xlarge (one H100), the same job is 2× faster but at $4–12/hr on-demand costs $250–750. Spot prices are unpredictable for H100s right now.
For fine-tuning runs lasting 1–10 days, dedicated rental wins. For one-shot 1-hour experiments, cloud spot is fine.
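The QLoRA budget and the run-time estimate in this example are plain arithmetic, sketched below with the figures from the text. The variable and function names are illustrative, and the 1.3× peak factor is the stress-test margin mentioned above, applied to the whole budget as a simplifying assumption.

```python
FIXED_GB = {
    "base model (4-bit)": 3.5,
    "LoRA adapters": 0.05,
    "gradients (LoRA only)": 0.2,
    "optimiser states": 0.5,
}
ACTIVATIONS_GB = (12, 18)  # batch 4, sequence length 2048

fixed = sum(FIXED_GB.values())
steady = (fixed + ACTIVATIONS_GB[0], fixed + ACTIVATIONS_GB[1])
peak = (steady[0] * 1.3, steady[1] * 1.3)  # assumed backprop peak


def training_hours(examples, epochs, sec_per_example):
    """Wall-clock hours for a run at a fixed per-example step time."""
    return examples * epochs * sec_per_example / 3600


hours = training_hours(100_000, 3, 1.5)
cost = (hours * 0.55, hours * 0.85)  # PRO 4500 dedicated, $/hr range

print(f"steady state: {steady[0]:.1f}-{steady[1]:.1f} GB")
print(f"possible peak: {peak[0]:.1f}-{peak[1]:.1f} GB")
print(f"run time: {hours:.0f} h, cost: ${cost[0]:.0f}-${cost[1]:.0f}")
```

The peak line is the reason to stress test: the upper end of the steady-state range times 1.3 no longer fits in 24 GB, while it still fits in 32 GB.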
Worked example 3: serving SDXL at 200 images/hour
You're powering an internal image generation tool. Target: 200 images per hour, average 30-step DPM++ sampler at 1024×1024.
SDXL UNet weights (FP16): 7 GB
VAE + text encoders: 2 GB
Activations during sampling: 4–6 GB
Total active VRAM: 13–15 GB
Per-image latency on PRO 4500: ~8 seconds at 30 steps, or ~450 images/hour, so one card sustains 200/hour with headroom. PRO 6000 cuts that to ~4 seconds → 900/hour, useful if you have bursty traffic or run multiple concurrent requests.
For 200/hour steady, PRO 4500 (32 GB) is the right size. You'd run the model with one warm worker, no fancy batching needed.
If you need to fan out to thousands of images per hour for a SaaS product, run multiple PRO 6000 nodes behind a queue (SQS, Redis Streams, or NATS), fronted by FastAPI. Our customers running SDXL services at scale typically run 4× PRO 6000 in active rotation with one warm spare.
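Capacity planning for image generation follows directly from per-image latency. A minimal sketch, assuming one image at a time per card, no batching, and a hypothetical target of 3,000 images/hour for the scaled-out case:

```python
import math


def images_per_hour(seconds_per_image: float) -> float:
    """Sustained throughput of one card processing images back to back."""
    return 3600 / seconds_per_image


def cards_needed(target_per_hour: float, seconds_per_image: float) -> int:
    """Smallest number of cards that meets the target, before any warm spare."""
    return math.ceil(target_per_hour / images_per_hour(seconds_per_image))


print(f"PRO 4500 at 8 s/image: {images_per_hour(8):.0f} images/hour")
print(f"PRO 6000 at 4 s/image: {images_per_hour(4):.0f} images/hour")
print(f"cards for 200/hour on PRO 4500: {cards_needed(200, 8)}")
print(f"cards for 3000/hour on PRO 6000: {cards_needed(3000, 4)}")
```

Whatever the result, add spare capacity on top for bursts and for taking a node out of rotation.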
Worked example 4: Blender 4K animation render
300-frame 4K render with Cycles, 256 samples, complex scene (5 GB of textures).
Scene + textures in VRAM: 8–12 GB
Cycles tile buffers: 2–3 GB
Denoiser working memory: 1 GB
Total: 11–16 GB
PRO 4500 (32 GB) fits easily with 50% headroom. Render time per frame at 256 samples on PRO 4500 ≈ 90 sec. 300 frames = 7.5 hours. Same on PRO 6000 Blackwell ≈ 4 hours due to higher CUDA core count and OptiX RT cores.
For studios doing multiple projects in parallel, 2× PRO 6000 nodes beat one H100 on cost, and you can render two independent shots simultaneously.
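The render-time figures are frames times seconds per frame, and splitting frames across nodes divides the wall-clock time. A small sketch under the assumption that frames are independent and distributed evenly (the function names are illustrative):

```python
import math


def render_hours(frames: int, seconds_per_frame: float) -> float:
    """Total render time on a single card."""
    return frames * seconds_per_frame / 3600


def wall_clock_hours(frames: int, seconds_per_frame: float, nodes: int) -> float:
    """Wall-clock time when frames are split evenly across identical nodes."""
    return math.ceil(frames / nodes) * seconds_per_frame / 3600


print(f"300 frames at 90 s on one card: {render_hours(300, 90)} h")
print(f"same job split over two cards: {wall_clock_hours(300, 90, 2)} h")
# Seconds per frame implied by a 4-hour render of 300 frames
print(f"implied per-frame time for a 4 h render: {4 * 3600 / 300:.0f} s")
```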
Cloud per-hour vs dedicated monthly — TCO crossover
Here's where it gets honest. Cloud GPU per-hour is great if you actually run intermittently. The crossover happens at much lower utilisation than people think.
| GPU | Cloud spot ($/hr) | Cloud on-demand ($/hr) | Dedicated monthly ($/mo) | Dedicated hourly equivalent |
|---|---|---|---|---|
| A100 80 GB | $1.20 – 2.20 | $3.20 – 5.50 | $1300 – 1700 | ~$2.00/hr |
| H100 80 GB | $2.50 – 4.50 | $4.50 – 12.00 | $1800 – 2600 | ~$3.00/hr |
| RTX PRO 4500 | not on cloud | not on cloud | $400 – 620 | ~$0.65/hr |
| RTX PRO 6000 Blackwell | not on cloud | not on cloud | $1050 – 1300 | ~$1.55/hr |
Crossover hours per month:
- A100 dedicated beats spot at ~700 hours/month (you're always-on).
- A100 dedicated beats on-demand at ~400 hours/month (~13 hours/day).
- H100 dedicated beats on-demand at ~250 hours/month (~8 hours/day).
- RTX PRO 4500 dedicated has no cloud equivalent — but at ~$0.65/hr equivalent it's the cheapest 32 GB GPU you can rent anywhere.
If you're running a model in production with > 8-hour daily utilisation, dedicated wins. Inference and fine-tuning over a few weeks both blow past the crossover.
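The crossover is the dedicated monthly price divided by the cloud hourly price. Because both sides of the table are ranges, the sketch below returns a break-even range rather than a single number. The function names are illustrative, and 730 is the usual approximation for hours in a month.

```python
HOURS_PER_MONTH = 730


def breakeven_hours(monthly_usd: float, hourly_usd: float) -> float:
    """GPU hours per month above which the flat monthly price is cheaper."""
    return monthly_usd / hourly_usd


def breakeven_range(monthly_range, hourly_range):
    """Best case (cheap dedicated vs pricey cloud) to worst case (the reverse)."""
    low = breakeven_hours(monthly_range[0], hourly_range[1])
    high = breakeven_hours(monthly_range[1], hourly_range[0])
    return low, high


scenarios = {
    "A100 vs on-demand": ((1300, 1700), (3.20, 5.50)),
    "H100 vs on-demand": ((1800, 2600), (4.50, 12.00)),
    "A100 vs spot": ((1300, 1700), (1.20, 2.20)),
}
for label, (monthly, hourly) in scenarios.items():
    low, high = breakeven_range(monthly, hourly)
    note = " (may never cross within a month)" if high > HOURS_PER_MONTH else ""
    print(f"{label}: {low:.0f}-{high:.0f} h/month{note}")
```

The crossover figures quoted above sit inside these ranges. The spot case is the one to watch: at the cheap end of spot pricing the break-even exceeds the number of hours in a month, so dedicated only wins there on predictability, not price.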
Cloud per-hour wins when:
- You're prototyping, < 4 hours of GPU time per day.
- Your workload is genuinely spiky (one big training job per month).
- You need a specific instance type for a few hours (e.g. 8× H100 for a 70B fine-tune).
For everything else — dedicated rental is the boring correct answer.
PCIe passthrough vs vGPU/MIG slicing
Two ways to get a "GPU server":
PCIe passthrough — the whole physical card is yours. Full CUDA, no slicing, no shared memory bandwidth, deterministic performance. This is what we ship on our GPU VPS. You see the device as if it were in your own machine.
vGPU / MIG slicing — the card is sliced into smaller "instances" by NVIDIA's vGPU software or H100's MIG feature. You get fractional VRAM (e.g. 1g.10gb on H100 = 10 GB VRAM slice). Cheaper, but:
- Performance is non-deterministic (noisy neighbours)
- Not all features work (some inference frameworks check for full card)
- You don't get full memory bandwidth
For serious workloads, passthrough is the right answer. Slicing makes sense if you're hosting 50 light internal users on one big card — but that's a niche.
Linux + driver setup, the gotchas
We pre-install Ubuntu 24.04 LTS with the NVIDIA driver pinned to the latest stable for your card generation, CUDA 12.4+, cuDNN, NVIDIA Container Toolkit, and nvidia-smi working out of the box. That said, things customers commonly trip on:
- Pin your driver version in production. Auto-upgrades have broken inference setups twice this year (sm_75 deprecation drama in November). `apt-mark hold nvidia-driver-XXX`.
- Use `nvidia-docker2` / Container Toolkit, not generic Docker GPU mode. The latter breaks on multi-GPU setups.
- PyTorch nightly is fine for research; pin a stable release for production. Blackwell needs a CUDA 12.8 build (PyTorch 2.7 or later with `cu128` wheels); older builds such as `2.5.1+cu124` do not include kernels for it.
- For Triton Inference Server, set `--shm-size=2g` minimum. The Docker default is 64 MB, which kills multi-process serving.
- Persistence mode on: `nvidia-smi -pm 1`. Otherwise the card cold-boots between calls and you eat 200 ms of latency.
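Two of the checks above (pinned driver, persistence mode) are easy to automate. The sketch below parses the CSV output of `nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv,noheader` and reports mismatches. The helper names, the sample line and the driver version in it are hypothetical; `read_gpus()` only works on a machine with the NVIDIA driver installed.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,persistence_mode",
    "--format=csv,noheader",
]


def parse_gpu_report(text: str):
    """Turn 'name, driver, Enabled|Disabled' lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        # rsplit keeps any comma inside the GPU name intact
        name, driver, persistence = [f.strip() for f in line.rsplit(",", 2)]
        gpus.append({"name": name, "driver": driver,
                     "persistence": persistence == "Enabled"})
    return gpus


def find_problems(gpus, pinned_driver: str):
    """List human-readable issues: driver drift or persistence mode off."""
    problems = []
    for index, gpu in enumerate(gpus):
        if gpu["driver"] != pinned_driver:
            problems.append(f"GPU {index}: driver {gpu['driver']} != pinned {pinned_driver}")
        if not gpu["persistence"]:
            problems.append(f"GPU {index}: persistence mode is off")
    return problems


def read_gpus():
    """Query the local driver. Requires nvidia-smi on PATH."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_report(result.stdout)


# Hypothetical sample output, used here so the example runs without a GPU
sample = "NVIDIA RTX PRO 6000 Blackwell, 570.00.00, Disabled\n"
for problem in find_problems(parse_gpu_report(sample), pinned_driver="570.00.00"):
    print(problem)
```

Run as a deploy-time or cron check, calling `find_problems(read_gpus(), pinned_driver=...)` with the version you actually pinned.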
What we sell, and how to pick
We run two Blackwell tiers across our GPU fleet:
- RTX PRO 4500 (32 GB): from $400/mo. The default for fine-tuning open-weight models, SDXL serving, Blender render farms.
- RTX PRO 6000 Blackwell (96 GB): from $1050/mo. For Llama 70B at INT8, Mixtral 8×22B at INT4, large-context inference, multi-tenant model serving.
Both ship with:
- Full PCIe passthrough — no slicing
- NVMe SSD for model storage (no slow networked weights)
- 10 Gbit/s unmetered uplink (cheap to download datasets, push checkpoints)
- 64–256 GB system RAM, EPYC CPU
- Pre-installed CUDA + Container Toolkit
- Provisioning in 1–4 hours
For workloads that need more storage (training datasets, video archives), pair the GPU node with a storage server over our private VLAN — no per-GB egress.
For large-bandwidth inference (image generation served to users, real-time video), the GPU sits behind our 10–400 G unmetered uplinks — no surprise egress bill.
If your model genuinely needs an H100 (70B+ at high QPS, multi-modal frontier models) — tell us. We have a small H100 pool on demand but it's typically more economical to fit your workload onto Blackwell PRO and run more cards in parallel.
TL;DR
- Size for VRAM first, then compute.
- LLM inference is memory-bandwidth-bound; INT4 quantisation is your friend.
- Fine-tuning with QLoRA fits on a 32 GB card for models up to ~13B.
- Cloud per-hour wins under ~8 h/day; dedicated wins past that.
- Don't pick H100 unless you actually need it — Blackwell PRO is 2–4× cheaper for 80 % of workloads.
- PCIe passthrough > vGPU slicing for serious work.
- Pin your driver version. Persistence mode on. Use Container Toolkit.
If your sizing doesn't fit cleanly into the worked examples, talk to us — we'll spec it together. No upsell unless you actually need the bigger card.