Achieve Higher Sustained LLM Throughput with SupremeRAID KV Cache for Rack
Whitepaper
June 19, 2026

Achieve Higher Sustained LLM Throughput with SupremeRAID™ KV Cache for Rack

Read the White Paper to Learn How a Shareable External KV Cache Tier Improves Total Throughput by 53%

Download the white paper here.

When KV Cache Runs Out of Room

Long-context LLM inference has a storage problem that rarely shows up in model benchmarks. As context windows grow and multi-turn workloads become standard, the KV cache working set expands — and when it outgrows GPU memory, the serving stack has nowhere to go. Most GPU servers carry local SSDs, but tying KV cache capacity to isolated SSD pools inside each compute node creates a scaling trap: add cache headroom, and you're adding compute nodes whether you need them or not.

The SupremeRAID™ KV Cache for Rack is built to solve exactly that problem. Instead of local SSDs per GPU server, a dedicated SupremeRAID™ storage appliance connects over high-speed Ethernet and exports a shared NFS cache path to the inference fleet. GPU servers stay focused on model serving. Cache capacity scales independently — on its own hardware, on its own timeline.

What the Numbers Show

Graid Technology validated this architecture with an EvalScope application benchmark — a high-concurrency, multi-turn workload running Qwen3-235B on four NVIDIA H200 GPUs with vLLM and LMCache. The test compared no KV cache offload against a cache path backed by the SupremeRAID™ KV Cache for Rack on a Supermicro SSG-221E-DN2R24R storage server with ten KIOXIA CM7-V NVMe SSDs in RAID 5, across prefix lengths from 512 to 8,192 tokens.

Total Throughput improved at every tested prefix length — and the gains grew with context:

  • +32.7% at 512 tokens
  • +47.0% at 2,048 tokens
  • +53.4% at 8,192 tokens

All 512 requests completed with zero failures, under 128-way parallelism and 3-to-5-turn conversation depth.

The pattern is significant: the longer the reusable prefix, the more the external cache tier contributes. That's precisely the direction inference workloads are heading — longer contexts, higher concurrency, more multi-turn depth.

Why Architecture Matters as Much as Performance

Beyond throughput, the design gives infrastructure teams deployment flexibility. GPU server selection can be driven by accelerator density and inference performance rather than local SSD footprint. Cache capacity expands on the storage tier as workload demands shift. And because the cache namespace is shareable, multiple GPU servers can mount a common cache path when the serving stack configuration is aligned.

Graid Technology has qualified SupremeRAID™ KV Cache for Rack on 20 storage server platforms across AIC, Dell, Giga Computing, Lenovo, and Supermicro — giving architects a broad set of validated configurations to match their deployment requirements.

Read the full white paper — including complete benchmark methodology, configuration details, and results — here.

Learn More

News & Resources

SupremeRAID™ KV Cache for Rack delivers up to 53.4% higher LLM throughput by externalizing the KV cache tier. Read our latest white paper to get the benchmark results.
We are honored to share that Graid Technology’s SupremeRAID™ HE has received the Special Prize in the Server & Storage Category at the Interop Tokyo 2026 Best of Show Award.
Graid Technology officially announces VROC™ by Graid Technology — an actively developed, OEM-backed evolution of Intel® VROC with a 24-month roadmap, Intel® Xeon® 6 support, and no-cost upgrades for existing customers. Q3 2026 availability through OEM and channel partners.