Shadow Stack — Local vs API Cost Analysis
Hardware TCO
Sample Build: RTX 3090 24GB (used market, ~$700–800)
| Component | Cost |
|---|---|
| RTX 3090 (used) | $750 |
| Host server (Proxmox, used workstation) | $400 |
| 32 GB RAM | $80 |
| 1 TB NVMe | $80 |
| Power (600W system × 8h/day × $0.12/kWh) | ~$17/month |
| One-time hardware | ~$1,310 |
| Monthly power | ~$17 |
Whether this build breaks even depends on the API costs you would otherwise pay.
RTX 4060 Ti 16GB Build (Budget)
| Component | Cost |
|---|---|
| RTX 4060 Ti 16GB | $450 |
| Mini PC / used workstation | $200 |
| 16 GB RAM | $40 |
| 512 GB NVMe | $50 |
| Power (250W × 8h/day × $0.12/kWh) | ~$7/month |
| One-time hardware | ~$740 |
| Monthly power | ~$7 |
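The power rows in both tables come from the same formula: kilowatts × hours per day × days × price per kWh. Here is a minimal sketch of that arithmetic, assuming a 30-day month and that the system draws its full rated wattage for all 8 hours (real draw is usually lower at idle). The function name is illustrative.

```python
def monthly_power_cost(watts: float, hours_per_day: float,
                       usd_per_kwh: float, days: int = 30) -> float:
    """Electricity cost in USD for a machine drawing `watts` while running."""
    kwh_per_month = watts / 1000 * hours_per_day * days
    return kwh_per_month * usd_per_kwh


# Inputs taken from the two build tables; the 30-day month is an assumption.
rtx_3090_build = monthly_power_cost(600, 8, 0.12)
rtx_4060ti_build = monthly_power_cost(250, 8, 0.12)

print(f"RTX 3090 build:    ${rtx_3090_build:.2f}/month")
print(f"RTX 4060 Ti build: ${rtx_4060ti_build:.2f}/month")
```

Swap in your own electricity rate and duty cycle. A machine that runs 24 hours a day triples these figures.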
API Cost Comparison (2026 pricing, approximate)
Input/Output Costs (per 1M tokens)
| Model | Input | Output | Notes |
|---|---|---|---|
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast, cost-effective |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Strong all-rounder |
| Claude Opus | $15.00 | $75.00 | Frontier |
| GPT-4o mini | $0.15 | $0.60 | Cheap but limited |
| GPT-4o | $2.50 | $10.00 | Strong general |
| Gemini 1.5 Flash | $0.075 | $0.30 | Very cheap, long context |
| Local (Ollama/vLLM) | $0.00 | $0.00 | Power cost only |
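Because input and output tokens are priced separately, the cost of a request has to be computed per side rather than from a single blended rate. The sketch below does that with the prices from the table. The dictionary keys are hypothetical labels chosen for readability, not official API model identifiers, and the prices are the approximate ones listed above.

```python
# USD per 1M tokens as (input, output), copied from the table above.
# Keys are illustrative labels, not official API model IDs.
PRICES = {
    "claude-3.5-haiku": (0.80, 4.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "gemini-1.5-flash": (0.075, 0.30),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request, pricing input and output tokens separately."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def monthly_cost(model: str, requests_per_day: int, input_tokens: int,
                 output_tokens: int, days: int = 30) -> float:
    """USD cost of a steady daily workload over `days` days."""
    return request_cost(model, input_tokens, output_tokens) * requests_per_day * days


# Cost of a single 500-in / 200-out request on each model.
for name in PRICES:
    print(f"{name:18s} ${request_cost(name, 500, 200):.6f} per request")
```

Output tokens cost several times more than input tokens on every model listed, so workloads that generate long responses cost more than their total token count suggests.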
Break-Even Analysis
Assume you use Claude 3.5 Haiku for lightweight tasks ($0.80/$4.00 per 1M tokens):
At 50 requests/day, avg 500 input + 200 output tokens each:
- Daily tokens: 50 × 700 = 35,000 tokens (25,000 input + 10,000 output)
- Daily cost: 25K × $0.80/1M + 10K × $4.00/1M = ~$0.06/day
- Monthly API cost: ~$1.80/month (less than the monthly power bill alone)
- Break-even for $740 hardware: ~34 years ← not worth it at this usage
At 2,000 requests/day (automation, batch processing):
- Daily tokens: 2,000 × 700 = 1.4M tokens (1.0M input + 0.4M output)
- Daily cost: 1.0M × $0.80/1M + 0.4M × $4.00/1M = ~$2.40/day
- Monthly API cost: ~$72/month
- Break-even for $740 hardware: ~10 months (hardware only, before power) ← compelling
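The break-even figures above come from one division: hardware cost over the monthly saving. The sketch below assumes a constant monthly API bill and optionally subtracts a monthly power cost from that saving. The function name and example values are hypothetical, so substitute your own.

```python
import math


def break_even_months(hardware_usd: float, monthly_api_usd: float,
                      monthly_power_usd: float = 0.0) -> float:
    """Months until avoided API spend pays for the hardware.

    Returns math.inf when power costs eat the entire saving.
    """
    monthly_saving = monthly_api_usd - monthly_power_usd
    if monthly_saving <= 0:
        return math.inf
    return hardware_usd / monthly_saving


# Hypothetical inputs: $740 of hardware replacing a $50/month API bill.
print(break_even_months(740, 50))        # hardware only
print(break_even_months(740, 50, 10))    # with $10/month of electricity
print(break_even_months(740, 5, 10))     # power exceeds API bill: never pays off
```

Including power matters most at low volume, where the electricity bill can exceed the API bill it replaces and the hardware never pays for itself.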
The Real Cost Drivers
Local inference makes economic sense when:
- Volume — >500 API calls/day with average-length prompts
- Data privacy — you can’t send data to external APIs (legal, compliance, IP concerns)
- Latency — you need <100ms first-token response (local on good GPU = 50–80ms)
- Offline — air-gapped environments, no internet dependency
- Experimentation — unlimited free iterations on model behavior
Local inference is a poor fit when:
- Usage is low and sporadic
- You need frontier-class reasoning (local 7B ≠ Claude Sonnet)
- You have no ops capacity to maintain GPU infrastructure
- You cannot accommodate the hardware's power draw, heat, or noise
Decision Framework: Local vs Cloud API
START: What kind of task?
│
├─ Frontier reasoning required?
│   (complex code, nuanced writing, multi-step agents)
│   └─ YES → Cloud API (Claude/GPT-4o). Period.
│
├─ Data sovereignty / privacy required?
│   └─ YES → Local (mandatory, regardless of cost)
│
├─ What's your volume?
│   ├─ <100 calls/day → Cloud API (break-even takes years)
│   ├─ 100–500 calls/day → Hybrid (cloud for complex, local for simple)
│   └─ >500 calls/day → Local likely cost-justified
│
└─ Do you have the hardware?
    ├─ YES, already built → Use it
    └─ NO → Calculate break-even before buying
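The tree can also be written as a small function. This is one possible reading, not a definitive encoding: the tree does not say how the hardware question interacts with volume, so the sketch assumes hardware you already own is a sunk cost and gets used for anything that does not need frontier reasoning. The function name and return labels are illustrative.

```python
def recommend_backend(needs_frontier: bool, needs_privacy: bool,
                      calls_per_day: int, has_hardware: bool = False) -> str:
    """Walk the decision tree top to bottom and return a recommendation."""
    if needs_frontier:
        return "cloud"  # frontier reasoning is checked first, as in the tree
    if needs_privacy:
        return "local"  # mandatory, regardless of cost
    if has_hardware:
        return "local"  # assumption: already-built hardware is a sunk cost
    if calls_per_day < 100:
        return "cloud"  # break-even takes years
    if calls_per_day <= 500:
        return "hybrid"  # cloud for complex tasks, local for simple ones
    return "local (verify break-even first)"


print(recommend_backend(False, False, 50))
print(recommend_backend(False, False, 300))
print(recommend_backend(False, False, 2000))
```

The function follows the tree's order, so a task that needs both frontier reasoning and data sovereignty returns "cloud". In practice that conflict needs a human decision, not a default.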
Hybrid Routing Pattern
The most practical homelab setup isn’t “all local” — it’s a router that classifies tasks:
# Pseudo-code for task router.
# local_model() and cloud_api() are placeholders for your own client wrappers.
def route_request(prompt, task_type):
    if task_type in ["classification", "extraction", "simple_qa"]:
        return local_model("llama3.1:8b", prompt)  # free (power cost only)
    elif task_type in ["code_generation", "document_analysis"]:
        return local_model("mistral:7b", prompt)  # free, good quality
    else:
        return cloud_api("claude-haiku-4-5", prompt)  # paid, stronger model
When most requests fall into the simple categories, this pattern can cut roughly 60–80% of API spend while maintaining quality on the tasks that need it.
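To see how much traffic the router actually keeps local, you can inject the backends and count where each request goes. The sketch below uses stub backends in place of real Ollama or cloud clients. `make_router`, `fake_local`, `fake_cloud`, and the `research_report` task type are hypothetical names introduced for illustration, while the model names and task types are the ones used in the router above.

```python
from collections import Counter
from typing import Callable

Backend = Callable[[str, str], str]  # (model_name, prompt) -> completion

# Task type -> local model, mirroring the router above.
LOCAL_TASKS = {
    "classification": "llama3.1:8b",
    "extraction": "llama3.1:8b",
    "simple_qa": "llama3.1:8b",
    "code_generation": "mistral:7b",
    "document_analysis": "mistral:7b",
}


def make_router(local_model: Backend, cloud_api: Backend,
                cloud_model: str = "claude-haiku-4-5"):
    """Return a route() function plus a Counter of where requests went."""
    stats: Counter = Counter()

    def route(prompt: str, task_type: str) -> str:
        model = LOCAL_TASKS.get(task_type)
        if model is not None:
            stats["local"] += 1
            return local_model(model, prompt)
        stats["cloud"] += 1  # anything unrecognized goes to the paid API
        return cloud_api(cloud_model, prompt)

    return route, stats


# Stub backends so the sketch runs without any model server or API key.
def fake_local(model: str, prompt: str) -> str:
    return f"[local:{model}] {prompt}"


def fake_cloud(model: str, prompt: str) -> str:
    return f"[cloud:{model}] {prompt}"


route, stats = make_router(fake_local, fake_cloud)
for task in ["classification", "extraction", "code_generation", "research_report"]:
    route("hello", task)

local_share = stats["local"] / sum(stats.values())
print(f"{local_share:.0%} of requests stayed local")
```

The share of requests kept local is not the same as the share of spend saved. Cloud-bound tasks tend to be longer and more expensive, so weigh each request by its token cost before quoting a savings figure.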
Quick Reference: When to Use What
| Scenario | Use |
|---|---|
| RAG over private docs | Local (Ollama + nomic-embed-text) |
| Code autocomplete | Local (Mistral 7B / Codestral) |
| Summarizing emails | Local (Phi-3.5 3.8B, fast) |
| Complex research report | Cloud API |
| Agentic workflows with 10+ steps | Cloud API |
| Data pipeline ETL enrichment | Local (high volume, structured) |
| Legal/sensitive document analysis | Local (mandatory) |
| Customer-facing chatbot (quality matters) | Cloud API |
Related Reading
For implementation details on setting up local inference, see the companion guide: Shadow Stack Complete Guide