Shadow Stack — Running Local AI Models on Consumer Hardware
What Is the Shadow Stack?#
The “shadow stack” is a local inference layer that runs alongside your cloud API usage. Instead of every prompt hitting OpenAI or Anthropic, lightweight or private workloads run on GPUs you already own. You choose the right tier per task.
Three deployment tiers:
- Cloud APIs — Claude, GPT-4o, Gemini. Highest quality, per-token cost, zero ops.
- Local inference — Llama 3, Mistral, Phi-3 on your hardware. Fixed cost after setup, full data sovereignty.
- Hybrid routing — Simple or private tasks go local; complex reasoning hits cloud.
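Hybrid routing can be as simple as a heuristic in front of two OpenAI-compatible endpoints. A minimal sketch, assuming an Ollama instance on the default port — the keyword lists, length threshold, and model names are illustrative assumptions, not recommendations:

```python
# Hybrid router sketch: pick a local or cloud endpoint per prompt.
# Keyword lists and the length threshold are illustrative assumptions.

LOCAL = {"base_url": "http://localhost:11434/v1", "model": "llama3.1:8b"}
CLOUD = {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"}

PRIVATE_HINTS = ("internal", "confidential", "patient", "salary")
COMPLEX_HINTS = ("prove", "refactor", "multi-step", "architecture")

def route(prompt: str) -> dict:
    """Private data always stays local; long or complex prompts go to the cloud."""
    text = prompt.lower()
    if any(h in text for h in PRIVATE_HINTS):
        return LOCAL
    if len(prompt) > 2000 or any(h in text for h in COMPLEX_HINTS):
        return CLOUD
    return LOCAL

if __name__ == "__main__":
    print(route("Summarize this confidential memo")["model"])
```

In practice the heuristic matters less than the guarantee: anything matching a privacy rule must never leave the box, so that check runs first.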
Hardware Prerequisites#
GPU — the critical variable#
| VRAM | What You Can Run | Notes |
|---|---|---|
| 8 GB (RTX 3070/4060) | 7B models (Q4/Q5 quant) | Good daily driver for Mistral 7B, Llama 3 8B |
| 12 GB (RTX 3060 12GB / 4070) | 13B models (Q4), 7B (full) | Sweet spot for most homelab use |
| 16–24 GB (RTX 3090/4090) | 30B quant, 13B full precision | Handles Mixtral 8x7B MoE offloaded |
| 2× GPU (NVLink or separate) | 70B models quantized | Requires model sharding (vLLM handles this) |
Minimum recommended: RTX 3060 12 GB or RTX 4060 Ti 16 GB (~$300–400 used/new)
Host System Requirements#
- CPU: Any modern x86_64 (Proxmox needs VT-d / AMD-Vi for passthrough)
- RAM: 32 GB+ system RAM (models may offload layers to CPU RAM)
- Storage: NVMe SSD, 100+ GB free (models are 4–20 GB each)
- Motherboard: Must support IOMMU with clean device groupings. AMD (Ryzen/EPYC) platforms generally have better IOMMU grouping than older Intel consumer boards.
Checking IOMMU on Existing Hardware#
# On the bare metal host (before Proxmox install):
dmesg | grep -e DMAR -e IOMMU
# Should show: IOMMU enabled / DMAR: IOMMU enabled
Proxmox Setup#
⚠️ Critical Reboot Points: This section requires 5 reboots in total:
- After dist-upgrade (repo changes)
- After the GRUB update (intel_iommu/amd_iommu=on, iommu=pt)
- After vfio-pci driver binding (update-initramfs)
- After GPU drivers in VM (nvidia-driver install)
- After docker/nvidia-container-toolkit config (systemctl restart)
Plan 30–45 minutes for the full GPU passthrough setup including reboots. Test each layer before moving to the next.
Install Proxmox VE (8.x)#
Download the ISO from proxmox.com, flash to USB, install to NVMe drive separate from data storage.
# Post-install: Remove enterprise repo, add free repo
sed -i 's|^deb|#deb|' /etc/apt/sources.list.d/pve-enterprise.list
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" \
> /etc/apt/sources.list.d/pve-no-sub.list
apt update && apt dist-upgrade -y
Enable IOMMU (Required for GPU Passthrough)#
Intel systems — edit GRUB:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
update-grub && reboot
AMD systems:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
update-grub && reboot
Verify IOMMU active after reboot:
dmesg | grep -e DMAR -e IOMMU | head -5
# Expected: "IOMMU enabled"
find /sys/kernel/iommu_groups/ -type l | head -20
# Should list groups — more groups = better isolation
LXC vs VM for Inference#
| Approach | Pros | Cons |
|---|---|---|
| LXC container | Near-native GPU perf, easy backups | GPU passthrough to LXC is trickier (device binding) |
| Full VM (KVM) | Clean separation, snapshot support | Slight overhead, more setup steps |
Recommended: a full VM with PCIe passthrough, for the best isolation plus Proxmox snapshot support.
GPU Passthrough in Proxmox#
Step 1: Isolate the GPU from Proxmox Host#
# Find your GPU's PCI address
lspci | grep -i nvidia # or AMD
# Example output: 01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [RTX 3090]
# Find the IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
n=${d#*/iommu_groups/*}; n=${n%%/*}
printf 'IOMMU Group %s ' "$n"
lspci -nns "${d##*/}"
done | grep -i nvidia
⚠️ ACS Override Patch: If the GPU shares an IOMMU group with unrelated devices (storage controllers, NICs) rather than just its own audio function, you cannot isolate the GPU alone without the ACS override kernel patch. This is a Proxmox-specific gotcha; the community maintains pre-built kernels — search "proxmox-pci-acs-override."
Step 2: Bind GPU to VFIO Driver#
# /etc/modprobe.d/vfio.conf
# Replace IDs with your GPU's vendor:device IDs (from lspci -nn)
options vfio-pci ids=10de:2204,10de:1aef
# /etc/modules — add these lines:
vfio
vfio_iommu_type1
vfio_pci
# vfio_virqfd was merged into the core vfio module on kernel 6.2+ (Proxmox 8); add it only on older kernels
vfio_virqfd
update-initramfs -u && reboot
⚠️ Critical Verification:
The vfio binding takes effect at boot, so after the reboot confirm it succeeded before assigning the GPU to a VM:
lspci -k | grep -A3 "NVIDIA"
# Expected: Kernel driver in use: vfio-pci
If the output shows "nouveau" or "nvidia" instead, the bind failed. Failure modes:
- NVIDIA driver still loaded on the host (blacklist it, or rmmod nvidia nvidia_uvm)
- Device not found (double-check the PCI IDs in vfio.conf)
- Kernel modules not loaded (verify all vfio modules are listed in /etc/modules)
Only proceed once both device lines (VGA + Audio function) show vfio-pci.
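If you prefer a scripted check, the bound driver is visible in sysfs as a `driver` symlink under each PCI device. A small Python sketch — the PCI addresses are examples for a GPU at 01:00:

```python
from pathlib import Path

def bound_driver(pci_addr: str, sysfs: str = "/sys/bus/pci/devices") -> "str | None":
    """Return the kernel driver bound to a PCI device (e.g. 'vfio-pci'), or None."""
    link = Path(sysfs) / pci_addr / "driver"
    return link.resolve().name if link.exists() else None

if __name__ == "__main__":
    # Check both GPU functions: VGA (.0) and HDMI audio (.1)
    for fn in ("0000:01:00.0", "0000:01:00.1"):
        print(fn, "->", bound_driver(fn))
```

Both functions should report vfio-pci; anything else means the host still owns the card.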
Step 3: Assign GPU to VM in Proxmox UI#
In the VM Hardware tab:
- Add → PCI Device
- Select your GPU (01:00.0)
- Check: All Functions, ROM-Bar, PCI-Express
- For NVIDIA: set x-vga=1 if it’s the only display
Then in VM Options: set Machine to q35, BIOS to OVMF (UEFI).
Step 4: Install GPU Drivers Inside VM#
Ubuntu 22.04 VM:
apt update && apt install -y ubuntu-drivers-common
ubuntu-drivers autoinstall
# OR for specific version:
apt install -y nvidia-driver-535 nvidia-cuda-toolkit
reboot
# Verify:
nvidia-smi
# Expected: Shows GPU model, driver version, CUDA version
✅ Post-Passthrough GPU Test: Inside the VM, after driver install and reboot, run:
nvidia-smi
# ✓ Expected: GPU detected (e.g., "NVIDIA RTX 4090"), driver version ≥ 535
# Test inference capability:
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
# ✓ Expected: True, then your GPU model name
If nvidia-smi shows "No devices were found", the passthrough failed. Check the Proxmox VM Hardware tab for the PCI device assignment and verify host-side vfio binding with lspci -k before retrying.
Containerized Inference with Ollama#
Ollama is the easiest path to local inference — a single binary that manages model downloads, quantization selection, and a REST API compatible with OpenAI’s /v1/chat/completions endpoint.
Install Ollama (inside VM or bare metal)#
curl -fsSL https://ollama.com/install.sh | sh
# Installs binary + systemd service
# Verify GPU is detected (the installer starts the systemd service automatically;
# run `ollama serve &` only if it isn't already running):
ollama run llama3.2
# Should show GPU utilization in nvidia-smi
Docker Compose Setup (recommended for homelab)#
ℹ️ Docker & docker-compose Prerequisites: This section assumes Docker and docker-compose are already installed. If not present, install them:
apt install -y docker.io docker-compose
systemctl enable docker && systemctl start docker
# docker-compose.yml
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
runtime: nvidia # requires nvidia-container-toolkit
environment:
- NVIDIA_VISIBLE_DEVICES=all
volumes:
- ollama_models:/root/.ollama
ports:
- "11434:11434"
restart: unless-stopped
volumes:
ollama_models:
# Install nvidia-container-toolkit first (key + repo, then the package):
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt update && apt install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
docker compose up -d
Running Models#
# Pull and run — Ollama selects appropriate quant for your VRAM:
ollama pull llama3.2:3b # 2.0 GB — fast, fits 8GB VRAM easily
ollama pull llama3.1:8b # 4.7 GB — great quality/speed balance
ollama pull mistral:7b # 4.1 GB — strong reasoning, fast
ollama pull phi3.5:3.8b # 2.2 GB — Microsoft, excellent for its size
ollama pull deepseek-r1:7b # 4.7 GB — reasoning model, chain-of-thought
# OpenAI-compatible API:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Explain IOMMU in one paragraph"}]
}'
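The same endpoint can be called from Python with only the standard library. A minimal sketch, assuming the model has already been pulled and Ollama is listening on the default port:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, model: str = "llama3.1:8b",
         base: str = "http://localhost:11434") -> str:
    """POST to Ollama's OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain IOMMU in one paragraph"))
```

Because the schema matches OpenAI's, existing OpenAI client libraries also work by pointing their base URL at port 11434.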
✅ Ollama Model Cache Verification: After pulling models, verify they are cached locally:
# Check model storage location and size:
ls -lh ~/.ollama/models/
# ✓ Expected: directories like "manifests/" and "blobs/" containing model files
# Verify GPU is being used during inference (run in a separate terminal):
nvidia-smi
# ✓ Expected: an "ollama" process under GPU Memory usage during a model query
If models are not cached, ollama pull did not complete — check disk space with df -h and retry.
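Ollama also exposes a native API endpoint listing cached models, which is handy for scripted health checks. A stdlib-only sketch — it assumes the documented `/api/tags` response shape of `{"models": [{"name": ...}]}`:

```python
import json
import urllib.request

def model_names(tags: dict) -> "list[str]":
    """Extract names from an /api/tags response: {"models": [{"name": ...}]}."""
    return [m["name"] for m in tags.get("models", [])]

def list_cached_models(base: str = "http://localhost:11434") -> "list[str]":
    """Query Ollama's /api/tags endpoint and return cached model names."""
    with urllib.request.urlopen(f"{base}/api/tags") as resp:
        return model_names(json.load(resp))

if __name__ == "__main__":
    print(list_cached_models())
```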
Model Quality Cheat Sheet#
| Model | Size | Best For | VRAM Needed |
|---|---|---|---|
| Llama 3.2 3B | 2 GB | Fast Q&A, simple tasks | 4 GB |
| Llama 3.1 8B | 4.7 GB | General purpose | 6 GB |
| Mistral 7B | 4.1 GB | Code, reasoning | 6 GB |
| Phi-3.5 3.8B | 2.2 GB | Document tasks, fast | 4 GB |
| Mixtral 8x7B | 26 GB | Near GPT-3.5 quality | 24 GB (or CPU offload) |
| Llama 3.1 70B Q4 | 40 GB | Near GPT-4 class | 2×24 GB or CPU |
| DeepSeek-R1 7B | 4.7 GB | Step-by-step reasoning | 6 GB |
ℹ️ Quantization Rationale: All models listed are available in Q4 (4-bit) or Q5 (5-bit) quantization, chosen to balance inference quality against VRAM usage. Q4 drops ~2–4% accuracy vs. FP16 but needs roughly a quarter of the memory; Q5 preserves 98%+ accuracy at about a third of full-precision size. For most homelab use cases, Q4/Q5 models are indistinguishable from full precision while fitting in affordable GPUs.
vLLM for Production-Grade Inference#
vLLM offers higher throughput than Ollama via PagedAttention — better for serving multiple concurrent users or batch workloads. More complex to configure, but significantly faster under load.
Why PagedAttention? vLLM’s PagedAttention algorithm reorganizes the KV cache (key-value pairs from earlier tokens) into memory pages, enabling request batching without memory waste. When 10 concurrent users query the model, Ollama must keep 10 separate full KV caches in VRAM; vLLM shares physical pages across requests, reducing memory pressure by 40–60% and enabling higher throughput per GPU.
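The paging idea can be illustrated with a toy allocator: tokens are grouped into fixed-size pages, and requests sharing a prompt prefix reference the same physical pages. A simplified sketch under stated assumptions — the page size is illustrative, prefixes are assumed page-aligned, and real vLLM manages GPU block tables rather than Python counters:

```python
import math

PAGE_TOKENS = 16  # tokens per KV-cache page (illustrative)

def pages_needed(tokens: int) -> int:
    """Pages required to hold a token span, rounded up to whole pages."""
    return math.ceil(tokens / PAGE_TOKENS)

def pages_without_sharing(prefix_tokens: int, suffixes: "list[int]") -> int:
    """Naive per-request caches duplicate the prefix for every request."""
    return sum(pages_needed(prefix_tokens + s) for s in suffixes)

def pages_with_sharing(prefix_tokens: int, suffixes: "list[int]") -> int:
    """Paged caches keep one copy of the prefix pages across all requests."""
    return pages_needed(prefix_tokens) + sum(pages_needed(s) for s in suffixes)

if __name__ == "__main__":
    # 10 requests sharing a 512-token system prompt, each adding a 64-token question
    print(pages_without_sharing(512, [64] * 10))
    print(pages_with_sharing(512, [64] * 10))
```

With a long shared system prompt, the shared-page count is a small fraction of the naive total, which is the memory headroom that lets vLLM batch more concurrent requests per GPU.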
When to Use vLLM vs Ollama#
| Factor | Ollama | vLLM |
|---|---|---|
| Setup complexity | Low | Medium-High |
| Single-user inference | Excellent | Good |
| Concurrent requests (5+) | Degrades | Excels |
| Model variety | Huge (GGUF) | HuggingFace models |
| OpenAI API compat | Yes | Yes |
| Recommended for | Personal/homelab | Team/production |
Docker Setup for vLLM#
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--dtype half \
--max-model-len 4096
# Test:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 100
}'
Multi-GPU Tensor Parallelism (70B+ Models)#
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--dtype half \
--tensor-parallel-size 2 # splits model across 2 GPUs
ℹ️ Tensor Parallelism GPU Count: The --tensor-parallel-size parameter must equal the number of identical GPUs in your system. If you have 2× RTX 4090, use --tensor-parallel-size 2; for 4× RTX 3090, use --tensor-parallel-size 4. Mismatching this value will cause OOM or hang errors. Verify the GPU count with nvidia-smi before starting the container.
HuggingFace Token (for gated models like Llama)#
export HUGGING_FACE_HUB_TOKEN=hf_yourtoken
# Add to docker run: -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN
Choosing the Right Model#
The Practical Hierarchy (as of early 2026)#
GPT-4o / Claude 3.5 Sonnet / Gemini 1.5 Pro
→ Complex reasoning, coding, nuanced writing
→ Use when quality is non-negotiable
Llama 3.1 70B / Mixtral 8x22B (local)
→ Near-frontier quality, requires big VRAM
→ Good for private data, batch processing
Mistral 7B / Llama 3.1 8B / Phi-3.5 (local)
→ 80% of use cases at near-zero marginal cost
→ Daily driver for personal automation
Llama 3.2 3B / Phi-3 mini (local)
→ Ultra-fast, classification, extraction, routing
Task-to-Model Mapping#
| Task | Recommended Local Model | Why |
|---|---|---|
| Code completion | Mistral 7B / Codestral | Strong code training |
| Document Q&A (RAG) | Llama 3.1 8B | Context handling |
| Summarization | Phi-3.5 3.8B | Fast, accurate enough |
| Chain-of-thought reasoning | DeepSeek-R1 7B | Explicit reasoning traces |
| Embeddings (RAG) | nomic-embed-text | via Ollama, great recall |
| Complex multi-step agents | Cloud API only | Local models lose track |
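The mapping above is easy to encode as a lookup with a cloud fallback for tasks local models handle poorly. A minimal sketch — the task keys and the fallback marker are illustrative assumptions:

```python
# Task-to-model lookup mirroring the table above; unknown or agentic
# tasks fall back to the cloud tier. Keys and names are illustrative.
TASK_MODELS = {
    "code": "mistral:7b",
    "rag_qa": "llama3.1:8b",
    "summarize": "phi3.5:3.8b",
    "reasoning": "deepseek-r1:7b",
    "embeddings": "nomic-embed-text",
}
CLOUD_FALLBACK = "cloud:frontier-model"

def pick_model(task: str) -> str:
    """Return a local model for known tasks, else fall back to the cloud tier."""
    return TASK_MODELS.get(task, CLOUD_FALLBACK)
```

Defaulting unknown tasks to the cloud tier is the safe direction: a local model quietly failing at a multi-step agent task costs more debugging time than the API call it avoided.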
Related Reading#
For cost analysis and decision frameworks, see the companion post: Local vs API Cost Analysis