Rent an NVIDIA A10 as the AWS g5-equivalent inference card. 24 GB Ampere at 150 W — vLLM Llama-3 8B FP16 at ~1,050 tok/s, time-sliced vGPU partitioning for SaaS multi-tenancy, DeepStream YOLO + NVENC pipelines on a single card. Same Triton/vLLM configs that run on hyperscaler reference setups, deployed identically. Billed per-minute, paid in BTC, USDT/USDC or CLORE.
A10 is what happens when NVIDIA takes RTX 3090-class GA102 silicon, adds ECC to its 24 GB of GDDR6, caps the board at 150 W, locks the clocks for 24/7 operation, and ships it in a passive datacenter form factor.
Same GA102 silicon as the A10G behind AWS g5 inference instances — meaning your existing Triton, vLLM, and TensorRT configs run unchanged. Drop-in target for hyperscaler-trained reference deployments, with vGPU partitioning for multi-tenant SaaS APIs at proper p99 budgets.
SDXL, Flux, Stable Video Diffusion, HunyuanVideo. 24 GB of VRAM plus 2nd-gen RT cores and OptiX make it a strong price-per-image non-HBM card for diffusion and rendering.
13B–34B QLoRA on a single card. 24 GB ECC means stable long runs without the corruption risk of consumer GDDR6X.
A10 delivers datacenter-grade FP16/INT8 inference at a fraction of the H100 hourly rate. The right card when latency matters but you're not training from scratch.
// prices are spot-market lows · refreshed every 60 s
Every server is priced by its host. These are the live floors across the marketplace — you'll see hundreds of variants once you're in.
No sales call. No quota request. No three-week procurement. The four steps below are all you need.
Filter the marketplace by A10 24GB, country, GPU count, reliability score, network speed.
Choose a Docker image — PyTorch, vLLM, ComfyUI, Blender — or paste your own.
You get a public endpoint, an SSH key, and Jupyter on port 8888 in under 90 s — see the connection sketch after these steps.
Billing is metered per minute. Stop the instance and the meter stops with it.
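Once the instance is up, a minimal first-connection sketch — the host, port, and key path here are placeholders taken from your instance page, not fixed values:

# SSH in with the key shown on the instance page (host and port are examples)
ssh -p 2222 -i ~/.ssh/clore_key root@203.0.113.7

# confirm the card and driver before launching work
nvidia-smi --query-gpu=name,memory.total,power.limit --format=csv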
A10 has slightly more raw FP16 tensor throughput (125 TFLOPS vs 121) and the same 24 GB VRAM. L4 is much more power-efficient (72 W vs 150 W) and Ada-class with FP8. A10 is the card behind AWS g5 — the better choice if you're matching a hyperscaler reference setup or need the extra FP16 headroom; neither card supports MIG.
An L40S can serve Llama-3 70B single-card only as a 4-bit quant (AWQ/GPTQ) — FP8 weights alone are ~70 GB, more than the 48 GB card holds. With vLLM and continuous batching at batch saturation, the cost arithmetic is simply hourly rate divided by tokens per hour: at a $0.78/hr spot rate, every 1,000 tok/s of sustained output works out to roughly $0.22 per million output tokens before the 2.5% spot fee. PoH staking knocks the fee in half; reserved spot floors push the rate lower. Throughput varies heavily with model size, prompt length, and batch shape - benchmark on your traffic.
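A quick back-of-envelope check of that arithmetic — the rate and throughput below are placeholders; plug in your own spot price and benchmarked tok/s:

# $/1M output tokens = hourly_rate / (tokens_per_second * 3600) * 1,000,000
awk 'BEGIN { rate=0.78; tps=1000; printf "$%.2f per 1M output tokens\n", rate/(tps*3600)*1e6 }'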
Yes. The inference tier (T4, L4, L40S, A10) is exactly what vLLM's PagedAttention and continuous batching are tuned for. L40S fits a 4-bit 70B single-card with modest KV-cache headroom; A10 and L4 serve 7B-13B at high throughput; T4 covers Whisper, embeddings, and 7B INT8. Pull the official vLLM Docker image, point it at your model, expose port 8000.
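A minimal launch sketch, assuming the stock vllm/vllm-openai image and a Llama-3 8B Instruct checkpoint — swap in your own model and Hugging Face token:

# start an OpenAI-compatible server on port 8000 (hf_xxx is a placeholder token)
docker run --gpus all -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=hf_xxx \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-model-len 8192

# sanity check from your laptop (endpoint is a placeholder)
curl http://<instance-endpoint>:8000/v1/models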
L40S has Ada FP8 tensor cores - the same FP8 data type H100 uses for inference math, at a fraction of the rental price. L4 also supports FP8. T4 and A10 predate FP8 but have INT8 tensor cores (Turing and Ampere respectively) and excel at quantized 7B-13B serving. Pick L40S when FP8 throughput matters; pick A10 or T4 when $/request matters more.
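Illustrative vLLM flags for the two paths, appended to the docker run line sketched above — checkpoint names are placeholders, not recommendations:

# L40S / L4 (Ada): quantize weights and activations to FP8 on the fly
--model meta-llama/Meta-Llama-3-8B-Instruct --quantization fp8

# A10 / T4 (pre-FP8): serve a pre-quantized AWQ or GPTQ checkpoint instead
--model <your-awq-checkpoint> --quantization awq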
On A10 or L4 with vLLM and batch-1, time-to-first-token for a 7B FP16 model lands around 80-150 ms; p99 inter-token latency is 25-40 ms. L40S with FP8 cuts both roughly in half. T4 doubles them. Real numbers depend on prompt length and concurrent batch size - low-batch interactive serving is fastest, high-batch saturation maximizes throughput.
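A crude first-token probe against the server from the sketch above — time_starttransfer measures time to the first streamed byte, which tracks time-to-first-token; the model name and endpoint are placeholders:

curl -N -s -o /dev/null -w 'time to first byte: %{time_starttransfer}s\n' \
  http://<instance-endpoint>:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct","prompt":"Hello","max_tokens":64,"stream":true}'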
MIG (Multi-Instance GPU) is supported on A100, A30, and H100/H200 - not on L4, L40S, T4, or A10. For multi-tenancy on the inference tier, run multiple model replicas inside a single Docker container or use container-level resource limits. If you need hardware-isolated MIG slices, rent an A100 40GB and partition it into up to 7 instances.
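If you do go the A100 route, a hedged sketch of the partitioning commands — these run on the host, not inside a rented container, and profile IDs vary by card (list them with nvidia-smi mig -lgip):

# enable MIG mode on GPU 0, then carve seven 1g.5gb slices (A100 40GB profile ID 19)
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C

# each MIG device now shows up with its own UUID
nvidia-smi -L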
24 GB Ampere at 150 W — the AWS g5-equivalent inference card for hyperscaler-reference deployments.
Reference card on AWS g5 (as the A10G) — deploys with the same Triton/vLLM configs as production hyperscaler stacks.
Per-tenant memory isolation for SaaS multi-tenant inference — each tenant gets a dedicated vGPU slice. Read the guide →
Edge ML video analytics pipeline — inference + encode on a single 150 W card. Read the guide →
Side-by-side specs across the inference tier. Click any row to see that GPU.
Step-by-step guides verified on CLORE.AI hardware. Pick a workload, copy the docker image, ship in minutes.
Per-minute payouts in BTC, USDT, USDC or CLORE. No listing fee, no contracts, withdraw any time.
Hosts around the world are accepting workloads right now. Sign up, top up your wallet, and the next hour is yours.