Hardware TCO

Sample Build: RTX 3090 24GB (used market, ~$700–800)

| Component | Cost |
|---|---|
| RTX 3090 (used) | $750 |
| Host server (Proxmox, used workstation) | $400 |
| 32 GB RAM | $80 |
| 1 TB NVMe | $80 |
| Power (600 W system × 8 h/day × 30 days × $0.12/kWh) | ~$17/month |
| One-time hardware | ~$1,310 |
| Monthly power | ~$17 |

To work out break-even, compare these costs against the API spend you would otherwise pay.

RTX 4060 Ti 16GB Build (Budget)

| Component | Cost |
|---|---|
| RTX 4060 Ti 16GB | $450 |
| Mini PC / used workstation | $200 |
| 16 GB RAM | $40 |
| 512 GB NVMe | $50 |
| Power (250 W × 8 h/day × 30 days × $0.12/kWh) | ~$7/month |
| One-time hardware | ~$740 |
| Monthly power | ~$7 |
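The power rows in both tables follow the same arithmetic: watts, hours per day, and a price per kWh. The sketch below reproduces it so you can plug in your own figures. The function name and the 30-day month are assumptions made for illustration, not part of the original builds.

```python
def monthly_power_cost(watts: float, hours_per_day: float,
                       usd_per_kwh: float, days_per_month: int = 30) -> float:
    """Estimate monthly electricity cost in USD for a machine under load."""
    kwh_per_day = watts / 1000 * hours_per_day  # W -> kW, then kWh per day
    return kwh_per_day * usd_per_kwh * days_per_month


if __name__ == "__main__":
    # System draw and duty cycle taken from the two sample builds above.
    for name, watts in [("RTX 3090 build", 600), ("RTX 4060 Ti build", 250)]:
        cost = monthly_power_cost(watts, hours_per_day=8, usd_per_kwh=0.12)
        print(f"{name}: ~${cost:.2f}/month")
```

Real draw varies with load and idle time, so treat the result as an upper-end estimate for an 8-hour busy day.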

API Cost Comparison (2026 pricing, approximate)

Input/Output Costs (per 1M tokens)

| Model | Input | Output | Notes |
|---|---|---|---|
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast, cost-effective |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Strong all-rounder |
| Claude Opus | $15.00 | $75.00 | Frontier |
| GPT-4o mini | $0.15 | $0.60 | Cheap but limited |
| GPT-4o | $2.50 | $10.00 | Strong general |
| Gemini 1.5 Flash | $0.075 | $0.30 | Very cheap, long context |
| Local (Ollama/vLLM) | $0.00 | $0.00 | Power cost only |

Break-Even Analysis

Assume you use Claude 3.5 Haiku for lightweight tasks ($0.80/$4.00 per 1M tokens):

At 50 requests/day, avg 500 input + 200 output tokens each:

  • Daily tokens: 50 × 700 = 35,000 tokens (25,000 input + 10,000 output)
  • Daily cost: 25K/1M × $0.80 + 10K/1M × $4.00 = ~$0.06/day
  • Monthly API cost: ~$1.80/month
  • Break-even for $740 hardware: ~34 years, before electricity ← not worth it at this usage

At 2,000 requests/day (automation, batch processing):

  • Daily tokens: 2,000 × 700 = 1.4M tokens (1M input + 0.4M output)
  • Daily cost: 1M/1M × $0.80 + 0.4M/1M × $4.00 = ~$2.40/day
  • Monthly API cost: ~$72/month
  • Break-even for $740 hardware: ~10 months, before electricity ← compelling
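The same calculation can be scripted so you can try your own volumes. This is a minimal sketch that prices input and output tokens separately and optionally subtracts monthly electricity from the savings. The function names, the 30-day month, and the `power_cost_per_month` parameter are assumptions made for illustration.

```python
from typing import Optional


def monthly_api_cost(requests_per_day: float, input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float,
                     days_per_month: int = 30) -> float:
    """Monthly API spend in USD. Prices are USD per 1M tokens."""
    per_request = (input_tokens * input_price
                   + output_tokens * output_price) / 1_000_000
    return requests_per_day * per_request * days_per_month


def break_even_months(hardware_cost: float, api_cost_per_month: float,
                      power_cost_per_month: float = 0.0) -> Optional[float]:
    """Months until the hardware pays for itself, or None if it never does."""
    monthly_savings = api_cost_per_month - power_cost_per_month
    if monthly_savings <= 0:
        return None  # electricity alone costs more than the API would
    return hardware_cost / monthly_savings


if __name__ == "__main__":
    # Claude 3.5 Haiku prices from the table above; $740 budget build.
    for requests in (50, 2000):
        api = monthly_api_cost(requests, 500, 200, input_price=0.80, output_price=4.00)
        months = break_even_months(740, api)
        print(f"{requests} req/day: ${api:.2f}/month API, break-even {months:.1f} months")
```

Passing a non-zero `power_cost_per_month` shows how quickly low-volume cases stop breaking even at all.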

The Real Cost Drivers

Local inference makes economic sense when:

  1. Volume — >500 API calls/day with average-length prompts
  2. Data privacy — you can’t send data to external APIs (legal, compliance, IP concerns)
  3. Latency — you need <100ms first-token response (local on good GPU = 50–80ms)
  4. Offline — air-gapped environments, no internet dependency
  5. Experimentation — unlimited free iterations on model behavior

Local inference is a poor fit when:

  1. Usage is low and sporadic
  2. You need frontier-class reasoning (local 7B ≠ Claude Sonnet)
  3. You have no ops capacity to maintain GPU infrastructure
  4. You can't accommodate the hardware itself (power draw, heat, noise)

Decision Framework: Local vs Cloud API

START: What kind of task?
│
├─ Frontier reasoning required?
│   (complex code, nuanced writing, multi-step agents)
│   └─ YES → Cloud API (Claude/GPT-4o). Period.
│
├─ Data sovereignty / privacy required?
│   └─ YES → Local (mandatory, regardless of cost)
│
├─ What's your volume?
│   ├─ <100 calls/day → Cloud API (break-even takes years)
│   ├─ 100–500 calls/day → Hybrid (cloud for complex, local for simple)
│   └─ >500 calls/day → Local likely cost-justified
│
└─ Do you have the hardware?
    ├─ YES, already built → Use it
    └─ NO → Calculate break-even before buying
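
The tree above can also be written as a small function, which is handy if you want to apply it across a list of workloads. This is one possible reading, not a definitive encoding: it assumes privacy is checked before frontier quality (the tree does not say which wins when both apply), treats exactly 500 calls/day as hybrid, and lets already-built hardware override a low-volume "cloud" answer. The `Recommendation` type and `recommend` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    choice: str  # "cloud", "local", or "hybrid"
    note: str


def recommend(needs_privacy: bool, needs_frontier: bool,
              calls_per_day: int, has_hardware: bool) -> Recommendation:
    """Apply the decision tree; hard constraints first, then volume, then hardware."""
    if needs_privacy:  # assumption: a legal constraint outranks quality
        return Recommendation("local", "mandatory, regardless of cost")
    if needs_frontier:
        return Recommendation("cloud", "frontier reasoning required")

    if calls_per_day < 100:
        choice, note = "cloud", "break-even takes years"
    elif calls_per_day <= 500:
        choice, note = "hybrid", "cloud for complex, local for simple"
    else:
        choice, note = "local", "likely cost-justified"

    if has_hardware and choice == "cloud":
        return Recommendation("local", "hardware already built, use it")
    if not has_hardware and choice != "cloud":
        note += "; calculate break-even before buying"
    return Recommendation(choice, note)


if __name__ == "__main__":
    print(recommend(needs_privacy=False, needs_frontier=False,
                    calls_per_day=300, has_hardware=False))
```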

Hybrid Routing Pattern

The most practical homelab setup isn’t “all local” — it’s a router that classifies tasks:

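# Pseudo-code for a task router. local_model() and cloud_api() are placeholders
# for your own wrappers around a local server (Ollama/vLLM) and a cloud client.
LOCAL_SIMPLE = {"classification", "extraction", "simple_qa"}
LOCAL_HEAVIER = {"code_generation", "document_analysis"}

def route_request(prompt, task_type):
    if task_type in LOCAL_SIMPLE:
        return local_model("llama3.1:8b", prompt)    # free, small and fast
    if task_type in LOCAL_HEAVIER:
        return local_model("mistral:7b", prompt)     # free, good quality
    # Anything unrecognized falls through to the paid, higher-quality model
    return cloud_api("claude-haiku-4-5", prompt)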

Depending on how much of your traffic falls into the locally served task types, this pattern can cut roughly 60–80% of API spend while keeping cloud quality for the tasks that need it.
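
To try the routing logic without a GPU or an API key, the backends can be passed in as plain functions. The sketch below is a runnable variant under that assumption. `make_router`, `fake_local`, and `fake_cloud` are hypothetical names, and the stubs only echo their input; in a real setup you would replace them with calls to your local server and your cloud client.

```python
from typing import Callable

# A backend takes (model_name, prompt) and returns the completion text.
Backend = Callable[[str, str], str]

# Task types served locally, mapped to the model used for each.
LOCAL_ROUTES = {
    "classification": "llama3.1:8b",
    "extraction": "llama3.1:8b",
    "simple_qa": "llama3.1:8b",
    "code_generation": "mistral:7b",
    "document_analysis": "mistral:7b",
}
CLOUD_MODEL = "claude-haiku-4-5"


def make_router(local_model: Backend, cloud_api: Backend) -> Callable[[str, str], str]:
    """Build a route_request(prompt, task_type) function around two backends."""
    def route_request(prompt: str, task_type: str) -> str:
        model = LOCAL_ROUTES.get(task_type)
        if model is not None:
            return local_model(model, prompt)
        return cloud_api(CLOUD_MODEL, prompt)  # unknown task types go to the cloud
    return route_request


# Stub backends for illustration only; they make no network calls.
def fake_local(model: str, prompt: str) -> str:
    return f"[local:{model}] {prompt}"


def fake_cloud(model: str, prompt: str) -> str:
    return f"[cloud:{model}] {prompt}"


route = make_router(fake_local, fake_cloud)

if __name__ == "__main__":
    print(route("Is this email spam?", "classification"))
    print(route("Draft a market analysis.", "research_report"))
```

Injecting the backends keeps the routing table testable on its own and makes it easy to log which share of requests actually stays local.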

Quick Reference: When to Use What

| Scenario | Use |
|---|---|
| RAG over private docs | Local (Ollama + nomic-embed-text) |
| Code autocomplete | Local (Mistral 7B / Codestral) |
| Summarizing emails | Local (Phi-3.5 3.8B, fast) |
| Complex research report | Cloud API |
| Agentic workflows with 10+ steps | Cloud API |
| Data pipeline ETL enrichment | Local (high volume, structured) |
| Legal/sensitive document analysis | Local (mandatory) |
| Customer-facing chatbot (quality matters) | Cloud API |

For implementation details on setting up local inference, see the companion guide: Shadow Stack Complete Guide