Rent an A100 40GB as the canonical 7B pretraining and 13B–34B fine-tuning workhorse. 40 GB HBM2e at 1,555 GB/s, 3rd-gen NVLink, MIG with up to 7 instances per card. 8-GPU NVSwitch pods sustain ~27K tok/s aggregate on Mistral-7B-class pretraining runs. This is the exact training silicon FSDP, DeepSpeed ZeRO-3, and Megatron-LM were originally tuned against. Billed per-minute, paid in BTC, USDT/USDC or CLORE.
A100 40GBs are what you rent when minutes matter. BF16 mixed precision halves memory and roughly doubles throughput over TF32 on Transformer workloads; HBM2e keeps the SMs fed; NVLink lets eight cards train as one.
FSDP, DeepSpeed ZeRO-3, and Megatron-LM were tuned against this exact silicon. 40 GB HBM2e + 1.55 TB/s + NVLink — the spec on which the public 7B-class pretraining recipes were originally validated, still the cheapest HBM + NVLink path with full MIG support in 2026.
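A minimal sketch of the FSDP wrap those recipes build on, sized for one 8x A100 40GB node; the checkpoint name and hyperparameters below are placeholders, not a tuned recipe.

```python
# Minimal FSDP wrap for a 7B-class model on an 8x A100 40GB node.
# Checkpoint name and hyperparameters are illustrative; launch with:
#   torchrun --nproc_per_node 8 train.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",          # any 7B-class checkpoint
    torch_dtype=torch.bfloat16,
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,   # ZeRO-3-style sharding
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
    # production runs add a transformer auto-wrap policy and activation checkpointing
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# ... standard loop: forward, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
```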
TensorRT-LLM, vLLM, SGLang — all tuned for Ampere. Serve 13B in FP16 or 34B in 4-bit from a single card with margin to spare for KV cache.
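A single-card vLLM sketch of that serving path; the model name, context length, and memory fraction below are illustrative, not a tuned config.

```python
# Single-card serving sketch with vLLM on an A100 40GB.
# Model name, context length, and memory fraction are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # swap in your 13B FP16 / 34B 4-bit checkpoint
    dtype="float16",
    gpu_memory_utilization=0.90,   # leave headroom for the KV cache
    max_model_len=8192,
)

outputs = llm.generate(
    ["Summarize the A100 40GB spec sheet in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```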
40 GB HBM2e + FlashAttention-2 means 128k-token inference on GQA 7B-class models runs without offload. Ideal for long-doc RAG and agent loops with tool use.
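A sketch of that long-context path, assuming a GQA 7B-class checkpoint and the flash-attn package installed; the model name and settings are placeholders.

```python
# Long-context inference sketch with FlashAttention-2 on a single A100 40GB.
# Assumes the flash-attn package is installed; model name and settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"   # any GQA 7B-class long-context model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # memory-efficient attention on Ampere
    device_map="auto",
)

long_doc = open("report.txt").read()            # long document for RAG-style summarization
inputs = tok(long_doc, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```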
Ampere is the architecture the first wave of open 70B-class foundation models was trained on. Specs from Nvidia's A100 SXM4 datasheet; pricing reflects the lowest live on-demand floor.
// prices are spot-market lows · refreshed every 60 s
Every server is priced by its host. These are the live floors across the marketplace — you'll see hundreds of variants once you're in.
No sales call. No quota request. No three-week procurement. The first four commands are all you need.
Filter the marketplace by GPU model (A100 40GB or 80GB), country, GPU count, reliability score, and network speed.
Choose a Docker image — PyTorch, vLLM, ComfyUI, Blender — or paste your own.
You get a public endpoint, an SSH key, and Jupyter on port 8888 in under 90 s.
Per-minute billing rounds to the second. Stop the instance and the meter stops with it.
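Once the instance is up, a ten-second check from Jupyter or SSH confirms the card and VRAM you're paying for; this is plain PyTorch, nothing CLORE-specific.

```python
# Quick sanity check over SSH or in the Jupyter session before the real job starts:
# confirm the card, VRAM, and driver-visible GPU count you are paying for.
import torch

assert torch.cuda.is_available(), "CUDA not visible - check the container image"
for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {p.name}, {p.total_memory / 1e9:.1f} GB")
```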
When you're pretraining 7B from scratch or fine-tuning 13B with offload — 40 GB is plenty. Step up to 80 GB for 34B+ pretraining, 70B fine-tuning, or LongRoPE / 128k-context work that exhausts the smaller card's KV-cache budget.
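The 40 vs 80 GB call is mostly arithmetic; a rough sketch, using Llama-2-13B-like shapes as an illustrative example:

```python
# Back-of-envelope VRAM arithmetic for the 40 GB vs 80 GB call.
# Shapes are illustrative (roughly Llama-2-13B: 40 layers, 40 KV heads, head dim 128).
def kv_cache_gb(layers, kv_heads, head_dim, context, batch=1, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per / 1e9  # 2x for K and V

def weights_gb(params_billion, bytes_per=2):
    return params_billion * bytes_per  # bf16/fp16 weights

print("13B weights (bf16):", weights_gb(13), "GB")                            # ~26 GB
print("KV cache @ 4k:  ", round(kv_cache_gb(40, 40, 128, 4_096), 1), "GB")    # fits alongside weights in 40 GB
print("KV cache @ 128k:", round(kv_cache_gb(40, 40, 128, 128_000), 1), "GB")  # blows past even 80 GB without GQA
```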
Single-card, no - 70B pretraining needs an 8-GPU node minimum. CLORE.AI lists 8x H100, 8x H200, and 8x B200 pods with NVLink fabric for exactly this. A100 80GB pods run 70B FSDP training but at lower throughput than Hopper-class. For multi-week training, contact host operators for reserved-instance terms - listed in the marketplace under 'Reserved'.
A100 80GB has no FP8 - peak precision is BF16/TF32. H100 introduces FP8 via Transformer Engine and roughly 3x the BF16 training throughput (up to ~6x with FP8) at about 2x the rental price - roughly 1.5x perf-per-dollar on BF16 and up to ~3x on FP8-eligible workloads. H200 matches H100 compute but adds 141 GB HBM3e. B200 roughly doubles H100 FP8 throughput again with 192 GB HBM3e. Pick by VRAM and bandwidth ceiling, not just sticker FLOPS.
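Those perf-per-dollar figures are simple ratios, restated here as a sketch rather than measured benchmarks:

```python
# Perf-per-dollar reduces to (relative throughput) / (relative price).
# Multipliers restate the rough ratios above; they are not benchmarks.
def perf_per_dollar(speedup_vs_a100, price_vs_a100):
    return speedup_vs_a100 / price_vs_a100

print("H100 BF16:", perf_per_dollar(3.0, 2.0))   # ~1.5x an A100 per dollar
print("H100 FP8 :", perf_per_dollar(6.0, 2.0))   # ~3x where FP8 applies
```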
8-GPU H100 SXM, H200 SXM, and B200 nodes ship with NVSwitch fabric - 900 GB/s peer bandwidth on H100/H200, 1.8 TB/s 5th-gen NVLink on B200. PCIe variants (H100 PCIe, A100 PCIe) have NVLink Bridge in pairs only. Multi-node fabric (NVLink-Switch across racks) is available on B200 hyperscale pods - filter by 'NVSwitch' in the marketplace.
Yes. Multi-GPU listings expose all cards in a single rental as a coherent node with NVSwitch (where present), shared NVMe scratch, and InfiniBand or 100 GbE fabric for multi-node training. The standard PyTorch torchrun, DeepSpeed, and Megatron-LM launchers run unmodified. Filter the marketplace by GPU count to find 8x A100, 8x H100, 8x H200 nodes.
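As a concrete example of "runs unmodified": a minimal ZeRO-3 config of the kind the standard deepspeed launcher consumes on an 8-GPU node; batch and bucket sizes are illustrative.

```python
# Minimal DeepSpeed ZeRO-3 config written out for the standard deepspeed launcher
# (e.g. with a Hugging Face Trainer script). Batch and bucket sizes are illustrative.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                      # shard params, grads, and optimizer state
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_param_persistence_threshold": 1e5,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
# launch: deepspeed --num_gpus 8 train.py --deepspeed ds_config.json
```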
V100 (HBM2, 900 GB/s) -> A100 40GB (HBM2e, 1,555 GB/s) -> A100 80GB (HBM2e, 1,935 GB/s) -> H100 (HBM3, 3,350 GB/s) -> H200 (HBM3e, 4,800 GB/s) -> B200 (HBM3e, 8,000 GB/s). Each generation roughly doubles bandwidth or VRAM; KV-cache-bound serving and bandwidth-bound training scale almost linearly with this number.
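A rough roofline sketch of why that scaling holds for decode; the bandwidths are the datasheet figures above and the 14 GB model size is an illustrative 7B in bf16.

```python
# Rough roofline: single-stream decode re-reads the whole model every token,
# so tokens/s is bounded by (memory bandwidth) / (model size in bytes).
# Bandwidths are the datasheet figures listed above; 14 GB ~= a 7B model in bf16.
BANDWIDTH_GBS = {
    "V100": 900, "A100 40GB": 1555, "A100 80GB": 1935,
    "H100": 3350, "H200": 4800, "B200": 8000,
}
MODEL_GB = 14.0

for gpu, bw in BANDWIDTH_GBS.items():
    print(f"{gpu:10s} ~{bw / MODEL_GB:4.0f} tok/s single-stream upper bound")
```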
40 GB HBM2e + 1.55 TB/s + NVLink — the canonical 7B pretraining and 13B–34B fine-tuning workhorse.
8× A100 40GB node hits ~27K tok/s aggregate — a Mistral-7B-class pretraining run completes in weeks of spot-priced compute.
Standard 13B SFT pipeline — 40 GB fits FSDP-sharded weights + activations at 4K context. Read the guide →
Hard isolation for multi-tenant ML platforms — each MIG slice gets dedicated SMs and HBM. Read the guide →
Side-by-side specs across the datacenter tier. Click any row to see that GPU.
Step-by-step guides verified on CLORE.AI hardware. Pick a workload, copy the docker image, ship in minutes.
Per-minute payouts in BTC, USDT, USDC or CLORE. No listing fee, no contracts, withdraw any time.
Hosts around the world are accepting workloads right now. Sign up, top up your wallet, and the next hour is yours.