Do You Really Need H100s for RAG? Let’s Do the Math.
Are you falling for the marketing that says you need liquid-cooled racks for your AI projects?
Everyone says you need racks of GPUs to run enterprise AI.
That may be true—if you’re training foundation models. But what about the use case most enterprises are actually working on?
Let’s break down a realistic architecture for RAG (Retrieval-Augmented Generation) and show why the bottleneck isn’t GPU capacity. It’s hybrid sprawl.
🎯 The Use Case: RAG for Customer Support
Imagine this:
You’re building an internal AI assistant to help your support team answer product questions. You’ve got:
A dataset of 15,000 product documents
Customers asking questions via chatbot or internal portal
A need to return clear, reliable answers in real time
🧠 The Workflow (Simplified)
User asks a question
RAG system retrieves the top 5–10 relevant documents (~1,200–1,500 tokens)
Prompt is constructed: question + context
LLM generates a concise answer (~300 tokens)
Response is returned to user or integrated into CRM
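In code, that whole loop is only a few lines. Here's a minimal sketch, assuming a generic retriever and LLM client; both objects are stand-ins, not any specific library's API:

```python
# Minimal RAG request path (illustrative; `retriever` and `llm` are duck-typed stand-ins)
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str            # ~250 tokens per chunk in this example

def answer_question(question: str, retriever, llm) -> str:
    # 1. Retrieve the top 5 most relevant chunks (~1,250 tokens of context)
    chunks: list[Chunk] = retriever.search(question, top_k=5)

    # 2. Build the prompt: question + retrieved context
    context = "\n\n".join(c.text for c in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 3. Generate a concise answer (~300 tokens), return it to the user / CRM
    return llm.generate(prompt, max_tokens=300)
```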
🧮 Token Breakdown
Let’s do some light math.
User query: 20 tokens
Context (retrieved docs): ~5 chunks × 250 tokens = 1,250
Output: 300 tokens
Total per request: ~1,570 tokens; call it ~1,600–1,700 once you add the system prompt and template overhead
Now let’s scale that.
Say you’re handling 5 requests per second (300 per minute):
5 × 1,700 = 8,500 tokens per second
300 × 1,700 = ~510,000 tokens per minute
That’s ~30 million tokens per hour, or ~720 million/day if sustained.
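The same arithmetic as a quick script, rounding the per-request figure up to ~1,700 to cover prompt scaffolding:

```python
# Back-of-the-envelope token budget for the RAG workload above
query_tokens   = 20
context_tokens = 5 * 250                 # 5 retrieved chunks x ~250 tokens
output_tokens  = 300
per_request    = query_tokens + context_tokens + output_tokens   # 1,570; round up to ~1,700

requests_per_second = 5
tokens_per_second = requests_per_second * 1_700   # 8,500
tokens_per_minute = tokens_per_second * 60        # 510,000
tokens_per_hour   = tokens_per_minute * 60        # ~30.6 million
tokens_per_day    = tokens_per_hour * 24          # roughly 720–735 million if sustained

print(per_request, tokens_per_second, tokens_per_minute, tokens_per_hour, tokens_per_day)
```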
🧱 Infrastructure Needs
So, what kind of hardware do you need to handle this?
🔹 CPU-only
Quantized models like Phi-2, Gemma 2B, DeepSeek, etc.
Modern 64-core CPUs (e.g., AMD EPYC, Intel Xeon)
Inference throughput: ~100–150 tokens/sec per server
The pacing factor is generation: 300 output tokens × 5 RPS ≈ 1,500 tokens/sec, so you'll need 8–10 CPU boxes to handle the load comfortably
🔹 Modest GPUs (A10s, L4s)
Throughput: 500–1,000 tokens/sec per GPU (with 7B–13B models)
2–3 GPUs total could handle this load with headroom
🔻 H100s?
You’d be provisioning for thousands of RPS.
This workload doesn’t justify it—you’re paying for power you don’t need.
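A rough capacity check under the assumptions above, using mid-points of the quoted throughput ranges; real numbers vary with model size, quantization, and batching:

```python
import math

# Generation is the pacing factor: 300 output tokens per answer at 5 RPS
output_tokens_per_sec = 300 * 5              # 1,500 tokens/sec to generate

# Assumed per-unit throughput (illustrative mid-points of the ranges quoted above)
cpu_server_tps   = 150                       # quantized small model on a 64-core box
midrange_gpu_tps = 750                       # 7B-13B model on an A10/L4-class GPU

cpu_servers_needed = math.ceil(output_tokens_per_sec / cpu_server_tps)    # 10
gpus_needed        = math.ceil(output_tokens_per_sec / midrange_gpu_tps)  # 2

print(cpu_servers_needed, gpus_needed)
```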
💰 TCO and Complexity
The total cost of 10 CPU servers or 2 mid-range GPUs is a fraction of a DGX H100 system.
Not to mention:
No vendor lock-in to CUDA
Easier integration with traditional apps
Lower power and cooling requirements
Far less operational risk
🧩 What Is Hard Then?
It’s not the LLM. It’s the workflow:
Documents might live in SharePoint or Box
The chatbot UI is in Salesforce or ServiceNow
Auth flows through Okta or Entra ID
Output logs back into Zendesk or Jira
You’re stitching together cloud, SaaS, and on-prem assets. That’s the hard part.
Not the tokens. Not the GPUs.
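Here's what that stitching looks like in practice. A sketch of one request path, where every client argument is a hypothetical stand-in for a real SDK or connector, and the inference call is the least complicated hop:

```python
# Illustrative request path for the support assistant.
# None of the client objects below are real APIs -- they are placeholders for
# whatever connectors your SaaS and on-prem systems actually expose.
def handle_support_question(question: str, user_token: str,
                            sso, doc_index, llm, crm, ticket_log) -> str:
    # 1. SaaS identity: validate the caller against Okta / Entra ID
    identity = sso.verify(user_token)

    # 2. SaaS content: search chunks indexed out of SharePoint / Box, filtered by ACL
    chunks = doc_index.search(question, top_k=5, acl=identity.groups)

    # 3. Inference: the part everyone sizes GPUs for, and the cheapest hop here
    context = "\n\n".join(c.text for c in chunks)
    answer = llm.generate(f"Context:\n{context}\n\nQuestion: {question}", max_tokens=300)

    # 4. More SaaS: surface the answer in Salesforce / ServiceNow, log it to Zendesk / Jira
    crm.post_reply(identity.user_id, answer)
    ticket_log.record(question=question, answer=answer,
                      sources=[c.doc_id for c in chunks])
    return answer
```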
✏️ TL;DR
A Fortune 100 company is already supporting 20,000 developers using AI-assisted coding tools—running on eight servers with no H100s.
That’s what real-world inference looks like.
So before you burn your budget on the AI Factory, do the math.
You probably need better orchestration—not bigger silicon.
👀 Want the broader context?
Check out the full blog:
📖 Forget the AI Factory. The Real Battle Is Hybrid AI Infrastructure
🛠️ Need Help Thinking Through It?
Keith on Call is my async advisory service. You send me the architecture, doc, or problem—
I send you a plainspoken, no-BS read on what’s real, what’s noise, and where you need to focus.
Whether it’s inferencing scale, hybrid architecture, or budget sanity—I’ve got your back.
🎯 Book now → $3,500 flat via Gumroad