Achieve Higher Sustained LLM Throughput with SupremeRAID™ KV Cache for Rack

Read the White Paper to Learn How a Shareable External KV Cache Tier Improves Total Throughput by Up to 53%

When KV Cache Runs Out of Room

Long-context LLM inference has a storage problem that rarely shows up in model benchmarks. As context windows grow and multi-turn workloads become standard, the KV cache working set expands, and once it outgrows GPU memory the serving stack has to evict cached state and recompute it when a conversation returns. Most GPU servers carry local SSDs, but tying KV cache capacity to isolated SSD pools inside each compute node creates a scaling trap: add cache headroom, and you're adding compute nodes whether you need them or not.
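To get a feel for how quickly the working set grows, here is a rough sizing sketch. It assumes a standard decoder-only transformer that stores one key and one value vector per layer and per KV head, with no KV quantization. The layer count, head count, and head dimension below are illustrative placeholders, not the configuration of any model named in this article.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    """Approximate KV cache size in bytes for `tokens` cached tokens.

    The leading factor of 2 accounts for storing both a key and a value
    vector per layer and per KV head. dtype_bytes=2 assumes FP16/BF16.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens


# Illustrative placeholder model shape (an assumption, not a real config).
SHAPE = dict(layers=64, kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **SHAPE)           # bytes per cached token
per_session = kv_cache_bytes(8192, **SHAPE)      # one 8,192-token context
fleet = 128 * per_session                        # 128 concurrent contexts

GIB = 1024 ** 3
print(f"per token:    {per_token / 1024:.0f} KiB")
print(f"per session:  {per_session / GIB:.0f} GiB")
print(f"128 sessions: {fleet / GIB:.0f} GiB")
```

With these placeholder values, 128 concurrent 8,192-token contexts already need more KV cache than a handful of GPUs can hold alongside model weights, which is the pressure an external tier is meant to absorb.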

The SupremeRAID™ KV Cache for Rack is built to solve exactly that problem. Instead of local SSDs per GPU server, a dedicated SupremeRAID™ storage appliance connects over high-speed Ethernet and exports a shared NFS cache path to the inference fleet. GPU servers stay focused on model serving. Cache capacity scales independently — on its own hardware, on its own timeline.
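Because the cache path is an NFS export, a simple pre-flight check on each GPU server is to confirm that the directory the serving stack writes to really sits on the NFS mount and not on a local disk. The sketch below parses text in the Linux `/proc/mounts` format. The mount point `/mnt/kvcache` and the host name `storage01` are hypothetical names used only for illustration.

```python
import posixpath


def mount_for_path(path, mounts_text):
    """Return (mount_point, fs_type) for the deepest mount containing `path`.

    `mounts_text` uses the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    path = posixpath.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Keep the longest matching mount point (the deepest mount).
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


def is_nfs_backed(path, mounts_text):
    """True if `path` lives on a mount whose type starts with 'nfs'."""
    found = mount_for_path(path, mounts_text)
    return found is not None and found[1].startswith("nfs")


# Hypothetical sample; on a real host, read /proc/mounts instead.
SAMPLE = """\
/dev/nvme0n1p2 / ext4 rw,relatime 0 0
storage01:/export/kvcache /mnt/kvcache nfs4 rw,relatime,vers=4.2 0 0
"""

print(is_nfs_backed("/mnt/kvcache/model-a", SAMPLE))  # True
print(is_nfs_backed("/var/tmp/cache", SAMPLE))        # False
```

On a Linux host the same function can be called with the contents of `/proc/mounts`. Mount points containing spaces are escaped there, which this sketch does not handle.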

What the Numbers Show

Graid Technology validated this architecture with an EvalScope application benchmark — a high-concurrency, multi-turn workload running Qwen3-235B on four NVIDIA H200 GPUs with vLLM and LMCache. The test compared no KV cache offload against a cache path backed by the SupremeRAID™ KV Cache for Rack on a Supermicro SSG-221E-DN2R24R storage server with ten KIOXIA CM7-V NVMe SSDs in RAID 5, across prefix lengths from 512 to 8,192 tokens.

Total throughput improved at every tested prefix length, and the gains grew with context:

+32.7% at 512 tokens
+47.0% at 2,048 tokens
+53.4% at 8,192 tokens
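The percentages above are relative gains over the no-offload baseline. A short sketch of the arithmetic, using only the three figures listed, converts each gain into a throughput multiplier and confirms that the gain rises with prefix length. The function and variable names are illustrative.

```python
# Reported total-throughput gains (percent) keyed by prefix length in tokens.
GAINS_PCT = {512: 32.7, 2048: 47.0, 8192: 53.4}


def speedup(gain_pct):
    """Convert a percentage gain into a throughput multiplier."""
    return 1 + gain_pct / 100


multipliers = {prefix: round(speedup(g), 3) for prefix, g in GAINS_PCT.items()}

# Check that the gain increases as the prefix gets longer.
ordered = [GAINS_PCT[p] for p in sorted(GAINS_PCT)]
gains_grow = all(a < b for a, b in zip(ordered, ordered[1:]))

for prefix in sorted(multipliers):
    print(f"{prefix:>5} tokens: {multipliers[prefix]}x baseline throughput")
print("gain grows with prefix length:", gains_grow)
```

Read this way, the 8,192-token result means the offloaded configuration delivered roughly 1.53 times the baseline's total throughput.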

All 512 requests completed with zero failures, under 128-way parallelism and 3-to-5-turn conversation depth.

The pattern is significant: the longer the reusable prefix, the more the external cache tier contributes. That's precisely the direction inference workloads are heading — longer contexts, higher concurrency, more multi-turn depth.

Why Architecture Matters as Much as Performance

Beyond throughput, the design gives infrastructure teams deployment flexibility. GPU server selection can be driven by accelerator density and inference performance rather than local SSD footprint. Cache capacity expands on the storage tier as workload demands shift. And because the cache namespace is shareable, multiple GPU servers can mount a common cache path when the serving stack configuration is aligned.
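One way to picture what "aligned" means for a shared namespace: two servers can only reuse each other's entries if they derive the same key for the same prefix. The sketch below is a hypothetical keying scheme, not the one used by LMCache or vLLM. It hashes the model identifier, chunk size, and KV dtype together with the running token prefix, so any mismatch in configuration produces disjoint keys.

```python
import hashlib


def chunk_keys(token_ids, model_id, chunk_size=256, kv_dtype="bf16"):
    """Return one hex key per full chunk of `token_ids`.

    Each key commits to the serving configuration and to every token up to
    the end of that chunk, so equal keys imply an identical prefix and config.
    Hypothetical scheme for illustration only.
    """
    running = hashlib.sha256(f"{model_id}|{chunk_size}|{kv_dtype}".encode())
    keys = []
    full = len(token_ids) - len(token_ids) % chunk_size  # drop the partial tail
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        running.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(running.copy().hexdigest())
    return keys


# Two requests sharing a 512-token prefix, served with the same config.
server_a = chunk_keys(list(range(600)), model_id="model-a")
server_b = chunk_keys(list(range(512)) + [9] * 300, model_id="model-a")
shared = [k for k in server_b if k in set(server_a)]
print(len(shared))  # 2 chunks reusable across servers

# Same tokens, different model: nothing is reusable.
other = chunk_keys(list(range(600)), model_id="model-b")
print(set(server_a).isdisjoint(other))  # True
```

In this sketch, changing the chunk size or KV dtype on one server has the same effect as changing the model: the keys no longer match, and the shared path stops producing hits for that server.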

Graid Technology has qualified SupremeRAID™ KV Cache for Rack on 20 storage server platforms across AIC, Dell, Giga Computing, Lenovo, and Supermicro — giving architects a broad set of validated configurations to match their deployment requirements.

Read the full white paper — including complete benchmark methodology, configuration details, and results — here.


