Shadow Stack — Running Local AI Models on Consumer Hardware
What Is the Shadow Stack?#
The “shadow stack” is a local inference layer that runs alongside your cloud API usage. Instead of every prompt hitting OpenAI or Anthropic, lightweight or private workloads run on GPUs you already own. You choose the right tier per task.
Three deployment tiers:
- Cloud APIs — Claude, GPT-4o, Gemini. Highest quality, per-token cost, zero ops.
- Local inference — Llama 3, Mistral, Phi-3 on your hardware. Fixed cost after setup, full data sovereignty.
- Hybrid routing — Simple or private tasks go local; complex reasoning hits cloud.
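Hybrid routing can be as simple as a heuristic in front of two OpenAI-compatible endpoints. A minimal sketch, assuming an Ollama instance on the default port — the keyword lists, length threshold, and model names are illustrative assumptions, not recommendations:

```python
# Hybrid router sketch: pick a local or cloud endpoint per prompt.
# Keyword lists and the length threshold are illustrative assumptions.

LOCAL = {"base_url": "http://localhost:11434/v1", "model": "llama3.1:8b"}
CLOUD = {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"}

PRIVATE_HINTS = ("internal", "confidential", "patient", "salary")
COMPLEX_HINTS = ("prove", "refactor", "multi-step", "architecture")

def route(prompt: str) -> dict:
    """Private data always stays local; long or complex prompts go to the cloud."""
    text = prompt.lower()
    if any(h in text for h in PRIVATE_HINTS):
        return LOCAL
    if len(prompt) > 2000 or any(h in text for h in COMPLEX_HINTS):
        return CLOUD
    return LOCAL

if __name__ == "__main__":
    print(route("Summarize this confidential memo")["model"])
```

In practice the heuristic matters less than the guarantee: anything matching a privacy rule must never leave the box, so that check runs first.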
Hardware Prerequisites#
GPU — the critical variable#
| VRAM | What You Can Run | Notes |
|---|---|---|
| 8 GB (RTX 3070/4060) | 7B models (Q4/Q5 quant) | Good daily driver for Mistral 7B, Llama 3 8B |
| 12 GB (RTX 3060 12GB / 4070) | 13B models (Q4), 7B (full) | Sweet spot for most homelab use |
| 16–24 GB (RTX 3090/4090) | 30B quant, 13B full precision | Handles Mixtral 8x7B MoE offloaded |
| 2× GPU (NVLink or separate) | 70B models quantized | Requires model sharding (vLLM handles this) |
Minimum recommended: RTX 3060 12 GB or RTX 4060 Ti 16 GB (~$300–400 used/new)
Host System Requirements#
- CPU: Any modern x86_64 (Proxmox needs VT-d / AMD-Vi for passthrough)
- RAM: 32 GB+ system RAM (models may offload layers to CPU RAM)
- Storage: NVMe SSD, 100+ GB free (models are 4–20 GB each)
- Motherboard: Must support IOMMU with clean device groupings. AMD (Ryzen/EPYC) platforms generally have better IOMMU grouping than older Intel consumer boards.
Checking IOMMU on Existing Hardware#
# On the bare metal host (before Proxmox install):
dmesg | grep -e DMAR -e IOMMU
# Should show: IOMMU enabled / DMAR: IOMMU enabled
Proxmox Setup#
⚠️ Critical Reboot Points: This section requires 5 reboots in total:
- After dist-upgrade (repo changes)
- After the GRUB update (intel_iommu/amd_iommu=on, iommu=pt)
- After vfio-pci driver binding (update-initramfs)
- After GPU drivers in VM (nvidia-driver install)
- After docker/nvidia-container-toolkit config (systemctl restart)
Plan 30–45 minutes for the full GPU passthrough setup including reboots. Test each layer before moving to the next.
Install Proxmox VE (8.x)#
Download the ISO from proxmox.com, flash to USB, install to NVMe drive separate from data storage.
# Post-install: Remove enterprise repo, add free repo
sed -i 's|^deb|#deb|' /etc/apt/sources.list.d/pve-enterprise.list
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" \
> /etc/apt/sources.list.d/pve-no-sub.list
apt update && apt dist-upgrade -y
Enable IOMMU (Required for GPU Passthrough)#
Intel systems — edit GRUB:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
update-grub && reboot
AMD systems:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
update-grub && reboot
Verify IOMMU active after reboot:
dmesg | grep -e DMAR -e IOMMU | head -5
# Expected: "IOMMU enabled"
find /sys/kernel/iommu_groups/ -type l | head -20
# Should list groups — more groups = better isolation
LXC vs VM for Inference#
| Approach | Pros | Cons |
|---|---|---|
| LXC container | Near-native GPU perf, easy backups | GPU passthrough to LXC is trickier (device binding) |
| Full VM (KVM) | Clean separation, snapshot support | Slight overhead, more setup steps |
Recommended: a full VM with PCIe passthrough, for the best isolation plus Proxmox snapshot support.
GPU Passthrough in Proxmox#
Step 1: Isolate the GPU from Proxmox Host#
# Find your GPU's PCI address
lspci | grep -i nvidia # or AMD
# Example output: 01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [RTX 3090]
# Find the IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
n=${d#*/iommu_groups/*}; n=${n%%/*}
printf 'IOMMU Group %s ' "$n"
lspci -nns "${d##*/}"
done | grep -i nvidia
⚠️ ACS Override Patch: If the GPU shares an IOMMU group with unrelated devices (storage controllers, NICs) rather than just its own audio function, you cannot isolate the GPU alone without the ACS override kernel patch. This is a Proxmox-specific gotcha; the community maintains pre-built kernels — search "proxmox-pci-acs-override."
Step 2: Bind GPU to VFIO Driver#
# /etc/modprobe.d/vfio.conf
# Replace IDs with your GPU's vendor:device IDs (from lspci -nn)
options vfio-pci ids=10de:2204,10de:1aef
# /etc/modules — add these lines:
vfio
vfio_iommu_type1
vfio_pci
# vfio_virqfd was merged into the core vfio module on kernel 6.2+ (Proxmox 8); add it only on older kernels
vfio_virqfd
update-initramfs -u && reboot
⚠️ Critical Verification:
The vfio binding takes effect at boot, so after the reboot confirm it succeeded before assigning the GPU to a VM:
lspci -k | grep -A3 "NVIDIA"
# Expected: Kernel driver in use: vfio-pci
If the output shows "nouveau" or "nvidia" instead, the bind failed. Failure modes:
- NVIDIA driver still loaded on the host (blacklist it, or rmmod nvidia nvidia_uvm)
- Device not found (double-check the PCI IDs in vfio.conf)
- Kernel modules not loaded (verify all vfio modules are listed in /etc/modules)
Only proceed once both device lines (VGA + Audio function) show vfio-pci.
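If you prefer a scripted check, the bound driver is visible in sysfs as a `driver` symlink under each PCI device. A small Python sketch — the PCI addresses are examples for a GPU at 01:00:

```python
from pathlib import Path

def bound_driver(pci_addr: str, sysfs: str = "/sys/bus/pci/devices") -> "str | None":
    """Return the kernel driver bound to a PCI device (e.g. 'vfio-pci'), or None."""
    link = Path(sysfs) / pci_addr / "driver"
    return link.resolve().name if link.exists() else None

if __name__ == "__main__":
    # Check both GPU functions: VGA (.0) and HDMI audio (.1)
    for fn in ("0000:01:00.0", "0000:01:00.1"):
        print(fn, "->", bound_driver(fn))
```

Both functions should report vfio-pci; anything else means the host still owns the card.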
Step 3: Assign GPU to VM in Proxmox UI#
In the VM Hardware tab:
- Add → PCI Device
- Select your GPU (01:00.0)
- Check: All Functions, ROM-Bar, PCI-Express
- For NVIDIA: set x-vga=1 if it’s the only display
Then in VM Options: set Machine to q35, BIOS to OVMF (UEFI).
Step 4: Install GPU Drivers Inside VM#
Ubuntu 22.04 VM:
apt update && apt install -y ubuntu-drivers-common
ubuntu-drivers autoinstall
# OR for specific version:
apt install -y nvidia-driver-535 nvidia-cuda-toolkit
reboot
# Verify:
nvidia-smi
# Expected: Shows GPU model, driver version, CUDA version
✅ Post-Passthrough GPU Test: Inside the VM, after driver install and reboot, run:
nvidia-smi
# ✓ Expected: GPU detected (e.g., "NVIDIA RTX 4090"), driver version ≥ 535
# Test inference capability:
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
# ✓ Expected: True, then your GPU model name
If nvidia-smi shows "No devices were found", the passthrough failed. Check the Proxmox VM Hardware tab for the PCI device assignment and verify host-side vfio binding with lspci -k before retrying.
Containerized Inference with Ollama#
Ollama is the easiest path to local inference — a single binary that manages model downloads, quantization selection, and a REST API compatible with OpenAI’s /v1/chat/completions endpoint.
Install Ollama (inside VM or bare metal)#
curl -fsSL https://ollama.com/install.sh | sh
# Installs binary + systemd service
# Verify GPU is detected (the installer starts the systemd service automatically;
# run `ollama serve &` only if it isn't already running):
ollama run llama3.2
# Should show GPU utilization in nvidia-smi
Docker Compose Setup (recommended for homelab)#
ℹ️ Docker & docker-compose Prerequisites: This section assumes Docker and docker-compose are already installed. If not present, install them:
apt install -y docker.io docker-compose
systemctl enable docker && systemctl start docker
# docker-compose.yml
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
runtime: nvidia # requires nvidia-container-toolkit
environment:
- NVIDIA_VISIBLE_DEVICES=all
volumes:
- ollama_models:/root/.ollama
ports:
- "11434:11434"
restart: unless-stopped
volumes:
ollama_models:
# Install nvidia-container-toolkit first (key + repo, then the package):
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt update && apt install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
docker compose up -d
Running Models#
# Pull and run — Ollama selects appropriate quant for your VRAM:
ollama pull llama3.2:3b # 2.0 GB — fast, fits 8GB VRAM easily
ollama pull llama3.1:8b # 4.7 GB — great quality/speed balance
ollama pull mistral:7b # 4.1 GB — strong reasoning, fast
ollama pull phi3.5:3.8b # 2.2 GB — Microsoft, excellent for its size
ollama pull deepseek-r1:7b # 4.7 GB — reasoning model, chain-of-thought
# OpenAI-compatible API:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Explain IOMMU in one paragraph"}]
}'
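The same endpoint can be called from Python with only the standard library. A minimal sketch, assuming the model has already been pulled and Ollama is listening on the default port:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, model: str = "llama3.1:8b",
         base: str = "http://localhost:11434") -> str:
    """POST to Ollama's OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain IOMMU in one paragraph"))
```

Because the schema matches OpenAI's, existing OpenAI client libraries also work by pointing their base URL at port 11434.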
✅ Ollama Model Cache Verification: After pulling models, verify they are cached locally:
# Check model storage location and size:
ls -lh ~/.ollama/models/
# ✓ Expected: directories like "manifests/" and "blobs/" containing model files
# Verify GPU is being used during inference (run in a separate terminal):
nvidia-smi
# ✓ Expected: an "ollama" process under GPU Memory usage during a model query
If models are not cached, ollama pull did not complete — check disk space with df -h and retry.
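Ollama also exposes a native API endpoint listing cached models, which is handy for scripted health checks. A stdlib-only sketch — it assumes the documented `/api/tags` response shape of `{"models": [{"name": ...}]}`:

```python
import json
import urllib.request

def model_names(tags: dict) -> "list[str]":
    """Extract names from an /api/tags response: {"models": [{"name": ...}]}."""
    return [m["name"] for m in tags.get("models", [])]

def list_cached_models(base: str = "http://localhost:11434") -> "list[str]":
    """Query Ollama's /api/tags endpoint and return cached model names."""
    with urllib.request.urlopen(f"{base}/api/tags") as resp:
        return model_names(json.load(resp))

if __name__ == "__main__":
    print(list_cached_models())
```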
Model Quality Cheat Sheet#
| Model | Size | Best For | VRAM Needed |
|---|---|---|---|
| Llama 3.2 3B | 2 GB | Fast Q&A, simple tasks | 4 GB |
| Llama 3.1 8B | 4.7 GB | General purpose | 6 GB |
| Mistral 7B | 4.1 GB | Code, reasoning | 6 GB |
| Phi-3.5 3.8B | 2.2 GB | Document tasks, fast | 4 GB |
| Mixtral 8x7B | 26 GB | Near GPT-3.5 quality | 24 GB (or CPU offload) |
| Llama 3.1 70B Q4 | 40 GB | Near GPT-4 class | 2×24 GB or CPU |
| DeepSeek-R1 7B | 4.7 GB | Step-by-step reasoning | 6 GB |
ℹ️ Quantization Rationale: All models listed are available in Q4 (4-bit) or Q5 (5-bit) quantization, chosen to balance inference quality against VRAM usage. Q4 drops ~2–4% accuracy vs. FP16 but needs roughly a quarter of the memory; Q5 preserves 98%+ accuracy at about a third of full-precision size. For most homelab use cases, Q4/Q5 models are indistinguishable from full precision while fitting in affordable GPUs.
vLLM for Production-Grade Inference#
vLLM offers higher throughput than Ollama via PagedAttention — better for serving multiple concurrent users or batch workloads. More complex to configure, but significantly faster under load.
Why PagedAttention? vLLM’s PagedAttention algorithm reorganizes the KV cache (key-value pairs from earlier tokens) into memory pages, enabling request batching without memory waste. When 10 concurrent users query the model, Ollama must keep 10 separate full KV caches in VRAM; vLLM shares physical pages across requests, reducing memory pressure by 40–60% and enabling higher throughput per GPU.
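The paging idea can be illustrated with a toy allocator: tokens are grouped into fixed-size pages, and requests sharing a prompt prefix reference the same physical pages. A simplified sketch under stated assumptions — the page size is illustrative, prefixes are assumed page-aligned, and real vLLM manages GPU block tables rather than Python counters:

```python
import math

PAGE_TOKENS = 16  # tokens per KV-cache page (illustrative)

def pages_needed(tokens: int) -> int:
    """Pages required to hold a token span, rounded up to whole pages."""
    return math.ceil(tokens / PAGE_TOKENS)

def pages_without_sharing(prefix_tokens: int, suffixes: "list[int]") -> int:
    """Naive per-request caches duplicate the prefix for every request."""
    return sum(pages_needed(prefix_tokens + s) for s in suffixes)

def pages_with_sharing(prefix_tokens: int, suffixes: "list[int]") -> int:
    """Paged caches keep one copy of the prefix pages across all requests."""
    return pages_needed(prefix_tokens) + sum(pages_needed(s) for s in suffixes)

if __name__ == "__main__":
    # 10 requests sharing a 512-token system prompt, each adding a 64-token question
    print(pages_without_sharing(512, [64] * 10))
    print(pages_with_sharing(512, [64] * 10))
```

With a long shared system prompt, the shared-page count is a small fraction of the naive total, which is the memory headroom that lets vLLM batch more concurrent requests per GPU.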
When to Use vLLM vs Ollama#
| Factor | Ollama | vLLM |
|---|---|---|
| Setup complexity | Low | Medium-High |
| Single-user inference | Excellent | Good |
| Concurrent requests (5+) | Degrades | Excels |
| Model variety | Huge (GGUF) | HuggingFace models |
| OpenAI API compat | Yes | Yes |
| Recommended for | Personal/homelab | Team/production |
Docker Setup for vLLM#
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--dtype half \
--max-model-len 4096
# Test:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 100
}'
Multi-GPU Tensor Parallelism (70B+ Models)#
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--dtype half \
--tensor-parallel-size 2 # splits model across 2 GPUs
ℹ️ Tensor Parallelism GPU Count: The --tensor-parallel-size parameter must equal the number of identical GPUs in your system. If you have 2× RTX 4090, use --tensor-parallel-size 2; for 4× RTX 3090, use --tensor-parallel-size 4. Mismatching this value will cause OOM or hang errors. Verify the GPU count with nvidia-smi before starting the container.
HuggingFace Token (for gated models like Llama)#
export HUGGING_FACE_HUB_TOKEN=hf_yourtoken
# Add to docker run: -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN
Choosing the Right Model#
The Practical Hierarchy (as of early 2026)#
GPT-4o / Claude 3.5 Sonnet / Gemini 1.5 Pro
→ Complex reasoning, coding, nuanced writing
→ Use when quality is non-negotiable
Llama 3.1 70B / Mixtral 8x22B (local)
→ Near-frontier quality, requires big VRAM
→ Good for private data, batch processing
Mistral 7B / Llama 3.1 8B / Phi-3.5 (local)
→ 80% of use cases at near-zero marginal cost
→ Daily driver for personal automation
Llama 3.2 3B / Phi-3 mini (local)
→ Ultra-fast, classification, extraction, routing
Task-to-Model Mapping#
| Task | Recommended Local Model | Why |
|---|---|---|
| Code completion | Mistral 7B / Codestral | Strong code training |
| Document Q&A (RAG) | Llama 3.1 8B | Context handling |
| Summarization | Phi-3.5 3.8B | Fast, accurate enough |
| Chain-of-thought reasoning | DeepSeek-R1 7B | Explicit reasoning traces |
| Embeddings (RAG) | nomic-embed-text | via Ollama, great recall |
| Complex multi-step agents | Cloud API only | Local models lose track |
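The mapping above is easy to encode as a lookup with a cloud fallback for tasks local models handle poorly. A minimal sketch — the task keys and the fallback marker are illustrative assumptions:

```python
# Task-to-model lookup mirroring the table above; unknown or agentic
# tasks fall back to the cloud tier. Keys and names are illustrative.
TASK_MODELS = {
    "code": "mistral:7b",
    "rag_qa": "llama3.1:8b",
    "summarize": "phi3.5:3.8b",
    "reasoning": "deepseek-r1:7b",
    "embeddings": "nomic-embed-text",
}
CLOUD_FALLBACK = "cloud:frontier-model"

def pick_model(task: str) -> str:
    """Return a local model for known tasks, else fall back to the cloud tier."""
    return TASK_MODELS.get(task, CLOUD_FALLBACK)
```

Defaulting unknown tasks to the cloud tier is the safe direction: a local model quietly failing at a multi-step agent task costs more debugging time than the API call it avoided.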
Related Reading#
For cost analysis and decision frameworks, see the companion post: Local vs API Cost Analysis