Rent NVIDIA L4
L4 24GB · live marketplace — from $0.32 / hr

Rent an NVIDIA L4
by the minute.
Not the month.

Rent an NVIDIA L4 for modern Ada-class inference with FP8 and AV1 NVENC. 24 GB Ada at 72 W passive — vLLM serving Llama-3 8B INT8 at ~950 tok/s with 16-request concurrency, three simultaneous 4K60 AV1 transcode streams, Florence-2 captioning at 28 images per second. Continuous-batching headroom prior-gen inference cards never had. Billed per minute, paid in BTC, USDT/USDC, or CLORE. The streaming and embedding card of choice for 2026 datacenter ops.

Per-minute billing · SSH + Docker + Jupyter · Spot & on-demand · 15 regions
INFERENCE · LIVE POD #82144 · L4 ×1 · us-east-2 · UPTIME 14h 22m
THROUGHPUT      1,166 tok/s    ▲ 4.2% vs 5m
LATENCY p50     38 ms          ▬ steady
LATENCY p99     112 ms         ▲ 6 ms
REQUESTS / SEC  12 RPS
GPU UTIL        87%
VRAM            21.4 / 24 GB
QUEUE           3 reqs
MODEL           Llama-3.1-8B · INT8
BATCH           32 · cont.
RATE            $0.42/hr
COST/1M tok     $0.10
$0.32/hr
Starting spot price
24GB
GDDR6 VRAM per card
15
Regions with L4 hosts
<90s
Cold-start to ready
workloads

Datacenter Ada,
built to serve.

L4 is what happens when NVIDIA takes Ada silicon, adds ECC, caps the clocks for 24/7 operation, and ships it in a passive 72 W datacenter form factor.

T4 replacement with FP8 + AV1 NVENC

24 GB Ada at 72 W passive — same form factor economics as a T4 but with FP8 tensor paths, AV1 NVENC, and enough VRAM to run vLLM with proper concurrency. The card datacenter operators standardized on for production inference and streaming transcode in 2026.

Llama-3 8B INT8 ~950 tok/s aggregate

Generative media

SDXL, Flux, Stable Video Diffusion, HunyuanVideo. 3rd-gen RT cores + OptiX at 72 W make it an efficient low-power choice for diffusion pipelines.

SDXL 1024² batch-4 within 24 GB

Mid-scale fine-tuning

13B–34B QLoRA on a single card. 24 GB ECC means stable long runs without the corruption risk of consumer GDDR6X.

Llama-3 8B QLoRA SFT ~1.7k tok/s
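
A typical launch shape for a run like that, using axolotl as one example stack; the config filename is a placeholder, and any QLoRA-capable trainer works:

$ pip install axolotl              # or install from the axolotl GitHub repo
$ accelerate launch -m axolotl.cli.train qlora-13b.yml   # 4-bit base + LoRA adapters fits in 24 GB
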
why L4

FP8 throughput
without HBM pricing.

L4 runs the same FP8 inference math as Hopper at a fraction of the H100 hourly rate. The right card when latency matters but you're not training from scratch.

                     L4 24GB        A100 80GB      RTX 4090       H100 80GB
Architecture         Ada Lovelace   Ampere         Ada Lovelace   Hopper
CUDA cores           7,424          6,912          16,384         14,592
VRAM                 24 GB GDDR6    80 GB HBM2e    24 GB GDDR6X   80 GB HBM3
Memory bandwidth     300 GB/s       1,935 GB/s     1,008 GB/s     2,000 GB/s
FP16 / BF16 (dense)  ~121 TFLOPS    312 TFLOPS     ~165 TFLOPS    756 TFLOPS
From / hr (spot)     $0.32          $0.92          $0.31          $1.89

// prices are spot-market lows · refreshed every 60 s

pricing

Two ways to rent.
Pay only for the minutes you use.

Every server is priced by its host. These are the live floors across the marketplace — you'll see hundreds of variants once you're in.

Spot

$0.32 / hr
≈ 0.0000034 BTC · 244 CLORE
  • Lowest possible rate
  • Per-minute billing
  • Can be interrupted by on-demand renter
  • Best for batch training, rendering
Browse spot L4s

On-demand · MOST RENTED

$0.42 / hr
≈ 0.0000044 BTC · 320 CLORE
  • Guaranteed availability
  • No preemption, ever
  • Per-minute billing
  • Best for inference, dev work, demos
Rent on-demand
Pay with
Bitcoin on-chain
CLORE native token
USDT / USDC ERC-20 · BEP-20
workflow

Four steps to a running L4.

No sales call. No quota request. No three-week procurement. The first four commands are all you need.

01 / FILTER

Pick your card

Filter the marketplace by L4 24GB, country, GPU count, reliability score, and network speed.

02 / RENT

Click rent

Choose a Docker image — PyTorch, vLLM, ComfyUI, Blender — or paste your own.

$ clore rent --gpu "L4 24GB"
03 / CONNECT

SSH or Jupyter

You get a public endpoint, an SSH key, and Jupyter on port 8888 in under 90 s.
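
If you prefer a local browser, a standard SSH port-forward brings the pod's Jupyter to your machine. Key path, port, and host below are placeholders from your rental dashboard:

$ ssh -i ~/.ssh/clore_key -p 30022 -L 8888:localhost:8888 root@203.0.113.7
# then open http://localhost:8888 locally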

04 / STOP

Stop anytime

Billing is metered by the minute. Stop the instance and the meter stops with it.

faq

Questions hosts and renters ask.

Why pick L4 over a 4090 for inference?

Power-efficient (72 W vs 450 W), passively cooled, designed for 24/7 multi-tenant inference. Datacenter-validated for serving stacks like vLLM and Triton. The 4090 is faster per-card; the L4 is cheaper per-request at scale.

What's the rough cost-per-million-tokens for Llama-3 70B FP8 on L40S?

An L40S serving Llama-3 70B FP8 with vLLM and continuous batching pushes roughly 3,000-4,500 output tokens/second at batch saturation. At a $0.78/hr spot rate, that lands near $0.05-$0.07 per million output tokens before the 2.5% spot fee. PoH staking knocks the fee in half; reserved spot floors land you closer to $0.04/M. Numbers vary with prompt length and batch shape - benchmark on your traffic.
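
That arithmetic is a one-liner to sanity-check against your own measured throughput; taking the midpoint of the range above:

$ python3 -c "print(0.78 / (3750 * 3600 / 1e6))"   # $/hr divided by Mtok/hr
0.057777777777777775                                # ≈ $0.06 per 1M output tokens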

Can I run vLLM with continuous batching on these GPUs?

Yes. The inference tier (T4, L4, L40S, A10) is exactly what vLLM's PagedAttention and continuous batching are tuned for. L40S handles 70B FP8 single-card with KV-cache headroom; A10 and L4 serve 7B-13B at high throughput; T4 covers Whisper, embeddings, and 7B INT8. Pull the official vLLM Docker image, point it at your model, expose port 8000.
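
As a concrete starting point, a minimal launch of the official vLLM OpenAI-compatible image; model choice and memory fraction here are illustrative, size them to your card:

$ docker run --gpus all -p 8000:8000 \
    -e HUGGING_FACE_HUB_TOKEN=<your token> \
    vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --max-model-len 8192 --gpu-memory-utilization 0.90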

Does the GPU support FP8 and INT8 inference?

L40S has Ada FP8 tensor cores - the same FP8 formats H100 uses for inference math - at a fraction of the rental price. L4 also supports FP8. T4 and A10 are pre-FP8 but have fast INT8 paths (Turing and Ampere tensor cores respectively) and excel at quantized 7B-13B serving. Pick L40S when FP8 throughput matters; pick A10 or T4 when $/request matters more.

What's typical p99 latency for 7B-class models on this tier?

On A10 or L4 with vLLM and batch-1, time-to-first-token for a 7B FP16 model lands around 80-150 ms; p99 inter-token latency is 25-40 ms. L40S with FP8 cuts both roughly in half. T4 doubles them. Real numbers depend on prompt length and concurrent batch size - low-batch interactive serving is fastest, high-batch saturation maximizes throughput.
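
Time-to-first-token is easy to eyeball once an endpoint is up: stream a completion and watch when the first chunk arrives. Endpoint and model assume the vLLM launch sketched above:

$ curl -N http://localhost:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct",
         "prompt": "The quick brown fox", "max_tokens": 64, "stream": true}'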

Can I MIG-partition this card for multi-tenant serving?

MIG (Multi-Instance GPU) is supported on A100, A30, and H100/H200 - not on L4, L40S, T4, or A10. For multi-tenancy on the inference tier, run multiple model replicas on the same card or use container-level resource limits, as sketched below. If you need hardware-isolated MIG slices, rent A100 40GB and partition into up to 7 instances.
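
A sketch of that container-level route: two vLLM replicas pinned to the same card, each capped below half the VRAM so their KV caches don't collide. Ports and memory fractions are assumptions to tune, and two replicas of an 8B model only fit in 24 GB with a quantized build:

$ docker run -d --gpus '"device=0"' -p 8000:8000 vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3-8B-Instruct --gpu-memory-utilization 0.45
$ docker run -d --gpus '"device=0"' -p 8001:8000 vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3-8B-Instruct --gpu-memory-utilization 0.45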

workload spotlight

Real numbers on the NVIDIA L4.

24 GB Ada at 72 W passive — the modern replacement for T4 with FP8, AV1 NVENC, and continuous-batching headroom.

vLLM serving Llama-3 8B INT8
vLLM + GPTQ 8-bit + chunked prefill
~950 tok/s aggregated, p50 45 ms

24 GB Ada is the cheapest stable card for production 8B serving at ≥16-request concurrency.

Read the guide →
AV1 NVENC video transcode
FFmpeg + AV1 NVENC
~3 simultaneous 4K60 AV1 streams

L4's AV1 encoder and 72 W envelope make it the streaming transcode card of choice in 2026.
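
A representative invocation; bitrate and preset are starting points, and av1_nvenc needs an FFmpeg build compiled with NVENC support:

$ ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input_4k60.mp4 \
    -c:v av1_nvenc -preset p5 -b:v 8M -c:a copy output_av1.mp4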

Read the guide →
Florence-2 captioning at scale
Transformers + fp16
~28 images/s @ 768²

Vision-language captioning for stock-photo libraries — 24 GB fits Florence-2-large at batch 16.
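
The serving loop is a few lines of Transformers; a minimal single-image sketch, with the image path as a placeholder and batching left to you:

$ python3 - <<'PY'
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.float16, trust_remote_code=True).to("cuda")
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)

image = Image.open("photo.jpg")                     # your input image
inputs = processor(text="<CAPTION>", images=image,
                   return_tensors="pt").to("cuda", torch.float16)
ids = model.generate(input_ids=inputs["input_ids"],
                     pixel_values=inputs["pixel_values"],
                     max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
PY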

Read the guide →
inference comparison

Inference-tier comparison.

Side-by-side specs across the inference tier. Click any row to see that GPU.

GPU                     VRAM          TDP (W)   FP8   MIG   Perf/W (FP16 TFLOPS/W)   Llama-3 8B tok/$ (vLLM INT8)   Spot $/hr
Tesla T4                16 GB GDDR6   70        no    no    ~1.16                    ~9,800                         $0.08
NVIDIA L4 (this page)   24 GB GDDR6   72        yes   no    ~1.68                    ~14,000                        $0.32
NVIDIA L40S             48 GB GDDR6   350       yes   no    ~1.05                    ~20,000                        $0.65
A10                     24 GB GDDR6   150       no    no    ~0.83                    ~12,500                        $0.16
workload guides

Run these on your rented NVIDIA L4.

Step-by-step guides verified on CLORE.AI hardware. Pick a workload, copy the docker image, ship in minutes.

Language Models
vLLM serving
High-throughput LLM serving with PagedAttention.
Audio / Voice
Whisper transcription
OpenAI Whisper-large for speech-to-text.
Video Processing
FFmpeg + NVENC
Hardware-accelerated video transcoding.
Vision Models
Florence-2
Microsoft's compact vision-language model.
Language Models
llama.cpp server
GGUF quantized inference with HTTP/OpenAI-compatible API.
Computer Vision
YOLOv8 detection
Real-time object detection with YOLOv8.
Advanced
CLORE API integration
Programmatic order creation via the public API.
See all guides →
other gpus

Compare with similar cards.

Tesla T4
16 GB · from $0.08/hr
Rent →
NVIDIA L40S
48 GB · from $0.65/hr
Rent →
A100 40GB
40 GB · from $0.78/hr
Rent →

Your training run
is 90 seconds away.

Hosts around the world are accepting workloads right now. Sign up, top up your wallet, and the next hour is yours.