Shadow Stack — Local vs API Cost Analysis

Hardware TCO

Sample Build: RTX 3090 24GB (used market, ~$700–800)

Component	Cost
RTX 3090 (used)	$750
Host server (Proxmox, used workstation)	$400
32 GB RAM	$80
1 TB NVMe	$80
Power (600W system × 8h/day × $0.12/kWh)	~$17/month
One-time hardware	~$1,310
Monthly power	~$17

Whether this hardware pays for itself depends on the API costs you would otherwise pay, so the break-even point has to be calculated against your actual usage.

RTX 4060 Ti 16GB Build (Budget)

Component	Cost
RTX 4060 Ti 16GB	$450
Mini PC / used workstation	$200
16 GB RAM	$40
512 GB NVMe	$50
Power (250W system × 8h/day × $0.12/kWh)	~$7/month
One-time hardware	~$740
Monthly power	~$7
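
The power rows in both build tables follow from the same formula: kilowatts × hours per day × price per kWh × days per month. Below is a minimal sketch of that arithmetic. The function names are illustrative, and it assumes a 30-day month and that the system draws its full rated wattage for every hour of use, which overstates real consumption for a machine that idles between requests.

```python
def monthly_power_cost(watts: float, hours_per_day: float,
                       usd_per_kwh: float, days: int = 30) -> float:
    """Electricity cost in USD for one month of use (assumes constant draw)."""
    kwh_per_day = watts / 1000 * hours_per_day
    return kwh_per_day * usd_per_kwh * days


def first_year_cost(hardware_usd: float, monthly_power_usd: float) -> float:
    """One-time hardware cost plus twelve months of power."""
    return hardware_usd + 12 * monthly_power_usd


# Hardware totals and wattages taken from the two build tables.
builds = {
    "RTX 3090": {"hardware": 1310, "watts": 600},
    "RTX 4060 Ti": {"hardware": 740, "watts": 250},
}

for name, build in builds.items():
    power = monthly_power_cost(build["watts"], hours_per_day=8, usd_per_kwh=0.12)
    total = first_year_cost(build["hardware"], power)
    print(f"{name}: ${power:.2f}/month power, ${total:.2f} first-year total")
```

Swap in your own electricity rate and duty cycle; at 24h/day operation the power line triples and starts to matter in the break-even math.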

API Cost Comparison (2026 pricing, approximate)

Input/Output Costs (per 1M tokens)

Model	Input	Output	Notes
Claude 3.5 Haiku	$0.80	$4.00	Fast, cost-effective
Claude 3.5 Sonnet	$3.00	$15.00	Strong all-rounder
Claude Opus	$15.00	$75.00	Frontier
GPT-4o mini	$0.15	$0.60	Cheap but limited
GPT-4o	$2.50	$10.00	Strong general
Gemini 1.5 Flash	$0.075	$0.30	Very cheap, long context
Local (Ollama/vLLM)	$0.00	$0.00	Power cost only
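
Because input and output tokens are billed at different rates, the cost of a request depends on its input/output mix, not just its total token count. The sketch below turns the table into a per-request cost function. The `PRICES` dict and `request_cost` name are illustrative, the rates are copied from the table and should be checked against current provider pricing, and the 500-input/200-output request size is only an example.

```python
# USD per 1M tokens as (input, output), copied from the pricing table.
PRICES = {
    "Claude 3.5 Haiku": (0.80, 4.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude Opus": (15.00, 75.00),
    "GPT-4o mini": (0.15, 0.60),
    "GPT-4o": (2.50, 10.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, billing input and output separately."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Compare models on an example request of 500 input + 200 output tokens.
for model in PRICES:
    per_thousand = request_cost(model, 500, 200) * 1000
    print(f"{model}: ${per_thousand:.2f} per 1,000 requests")
```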

Break-Even Analysis

Assume you use Claude 3.5 Haiku for lightweight tasks ($0.80/$4.00 per 1M tokens):

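At 50 requests/day, avg 500 input + 200 output tokens each:

Daily tokens: 50 × 700 = 35,000 tokens (25,000 input + 10,000 output)
Daily cost: 25K/1M × $0.80 + 10K/1M × $4.00 = ~$0.06/day
Monthly API cost: ~$1.80/month
Break-even for $740 hardware: ~34 years, and the build's monthly power bill alone exceeds the API bill ← not worth it at this usage

At 2,000 requests/day (automation, batch processing):

Daily tokens: 2,000 × 700 = 1.4M tokens (1M input + 0.4M output)
Daily cost: 1M/1M × $0.80 + 0.4M/1M × $4.00 = ~$2.40/day
Monthly API cost: ~$72/month
Break-even for $740 hardware: ~10 months before power, roughly 11–12 months once power is included ← compelling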
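
The two scenarios above can be reproduced with a short calculator. This is a sketch under stated assumptions: the function names are illustrative, a month is 30 days, input and output tokens are each billed at their own rate rather than at a single averaged rate, and power is an optional monthly cost subtracted from the API savings. If power costs more than the API bill, the hardware never pays for itself, which the function reports as infinity.

```python
import math


def monthly_api_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                     in_rate: float, out_rate: float, days: int = 30) -> float:
    """Monthly API bill in USD; rates are USD per 1M tokens."""
    per_request = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    return requests_per_day * per_request * days


def break_even_months(hardware_usd: float, monthly_api_usd: float,
                      monthly_power_usd: float = 0.0) -> float:
    """Months until avoided API spend covers the hardware, net of power."""
    savings = monthly_api_usd - monthly_power_usd
    if savings <= 0:
        return math.inf  # local power costs more than the API it replaces
    return hardware_usd / savings


# Claude 3.5 Haiku rates, 500 input + 200 output tokens, $740 build.
for requests_per_day in (50, 2000):
    api = monthly_api_cost(requests_per_day, 500, 200, in_rate=0.80, out_rate=4.00)
    months = break_even_months(740, api)
    print(f"{requests_per_day} req/day: ${api:.2f}/month API, "
          f"break-even in {months:.1f} months (power excluded)")
```

Pass your own `monthly_power_usd` to see how electricity shifts the result; the effect is negligible at high volume and decisive at low volume.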

The Real Cost Drivers

Local inference makes economic sense when:

Volume — >500 API calls/day with average-length prompts
Data privacy — you can’t send data to external APIs (legal, compliance, IP concerns)
Latency — you need <100ms first-token response (local on good GPU = 50–80ms)
Offline — air-gapped environments, no internet dependency
Experimentation — unlimited free iterations on model behavior

Local inference is a poor fit when:

Usage is low and sporadic
You need frontier-class reasoning (local 7B ≠ Claude Sonnet)
You have no ops capacity to maintain GPU infrastructure
Your environment can't accommodate GPU hardware (power draw, heat, noise)

Decision Framework: Local vs Cloud API

START: What kind of task?
│
├─ Frontier reasoning required?
│   (complex code, nuanced writing, multi-step agents)
│   └─ YES → Cloud API (Claude/GPT-4o). Period.
│
├─ Data sovereignty / privacy required?
│   └─ YES → Local (mandatory, regardless of cost)
│
├─ What's your volume?
│   ├─ <100 calls/day → Cloud API (break-even takes years)
│   ├─ 100–500 calls/day → Hybrid (cloud for complex, local for simple)
│   └─ >500 calls/day → Local likely cost-justified
│
└─ Do you have the hardware?
    ├─ YES, already built → Use it
    └─ NO → Calculate break-even before buying
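
The tree above can be encoded as a small function so the same rules are applied consistently across projects. This is one possible reading of the tree, not a definitive one: it assumes the questions are checked top to bottom (so a frontier-reasoning requirement is decided before the privacy check is reached; reorder the first two checks if privacy is a hard constraint for you), and it treats the hardware question as a qualifier on the volume result rather than as an override. The function name and return strings are illustrative.

```python
def recommend(frontier_required: bool, privacy_required: bool,
              calls_per_day: int, has_hardware: bool) -> str:
    """Walk the decision tree top to bottom and return a recommendation."""
    if frontier_required:
        return "cloud"
    if privacy_required:
        return "local"

    # Volume thresholds from the tree: <100, 100-500, >500 calls/day.
    if calls_per_day < 100:
        choice = "cloud"
    elif calls_per_day <= 500:
        choice = "hybrid"
    else:
        choice = "local"

    # Assumption: owning hardware only matters when local inference is in play.
    if choice == "cloud" or has_hardware:
        return choice
    return f"{choice} (calculate break-even before buying)"


print(recommend(False, False, calls_per_day=50, has_hardware=False))
print(recommend(False, False, calls_per_day=300, has_hardware=True))
print(recommend(False, False, calls_per_day=2000, has_hardware=False))
```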

Hybrid Routing Pattern

The most practical homelab setup isn’t “all local” — it’s a router that classifies tasks:

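# Task router sketch. local_model() and cloud_api() are placeholders for
# your own Ollama/vLLM and cloud client wrappers; they are not defined here.
LOCAL_SIMPLE = {"classification", "extraction", "simple_qa"}
LOCAL_HEAVY = {"code_generation", "document_analysis"}

def route_request(prompt: str, task_type: str) -> str:
    if task_type in LOCAL_SIMPLE:
        return local_model("llama3.1:8b", prompt)   # local: power cost only
    if task_type in LOCAL_HEAVY:
        return local_model("mistral:7b", prompt)    # local: power cost only
    return cloud_api("claude-haiku-4-5", prompt)    # paid: everything else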

This pattern can cut roughly 60–80% of API spend, depending on how much of your traffic the local models can handle, while keeping cloud quality for the tasks that need it.
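
To check how much traffic actually stays local in your own workload, the router can count where each request goes. Below is a self-contained sketch of that idea. The `Router` class, the `Backend` callable signature, and the stub backends are all hypothetical stand-ins: in a real setup the two callables would wrap your local inference server and your cloud client. The model names and task types are the ones used in the routing example above.

```python
from dataclasses import dataclass
from typing import Callable

# A backend takes (model_name, prompt) and returns the completion text.
Backend = Callable[[str, str], str]

# Task types served locally, mapped to the local model that handles them.
LOCAL_ROUTES = {
    "classification": "llama3.1:8b",
    "extraction": "llama3.1:8b",
    "simple_qa": "llama3.1:8b",
    "code_generation": "mistral:7b",
    "document_analysis": "mistral:7b",
}
CLOUD_MODEL = "claude-haiku-4-5"


@dataclass
class Router:
    local: Backend
    cloud: Backend
    local_calls: int = 0
    cloud_calls: int = 0

    def route(self, prompt: str, task_type: str) -> str:
        """Send known task types to the local model, everything else to cloud."""
        model = LOCAL_ROUTES.get(task_type)
        if model is not None:
            self.local_calls += 1
            return self.local(model, prompt)
        self.cloud_calls += 1
        return self.cloud(CLOUD_MODEL, prompt)

    def local_share(self) -> float:
        """Fraction of requests so far that were served locally."""
        total = self.local_calls + self.cloud_calls
        return self.local_calls / total if total else 0.0


# Stub backends for demonstration; replace with real client wrappers.
def fake_local(model: str, prompt: str) -> str:
    return f"[local:{model}] {prompt}"


def fake_cloud(model: str, prompt: str) -> str:
    return f"[cloud:{model}] {prompt}"


router = Router(local=fake_local, cloud=fake_cloud)
print(router.route("Is this email spam?", "classification"))
print(router.route("Draft a migration plan", "research"))
print(f"Served locally: {router.local_share():.0%}")
```

Passing the backends in as callables keeps the routing logic testable without a GPU or an API key. Note that the share of requests kept local is not the same as the share of spend saved, since cloud-bound requests tend to be the longer ones.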

Quick Reference: When to Use What

Scenario	Use
RAG over private docs	Local (Ollama + nomic-embed-text)
Code autocomplete	Local (Mistral 7B / Codestral)
Summarizing emails	Local (Phi-3.5 3.8B, fast)
Complex research report	Cloud API
Agentic workflows with 10+ steps	Cloud API
Data pipeline ETL enrichment	Local (high volume, structured)
Legal/sensitive document analysis	Local (mandatory)
Customer-facing chatbot (quality matters)	Cloud API

For implementation details on setting up local inference, see the companion guide: Shadow Stack Complete Guide