Private Local LLM Stack — Keep Your Code On-Machine

Free - $30/mo (hardware amortized separately)
Tech leads and indie builders working on proprietary codebases who need AI coding assistance without cloud API exposure

Three supply chain attacks in two weeks exposed every cloud AI coding tool as a potential exfiltration vector. Here's the full stack to run local LLMs for 80% of your coding tasks.

The stack (5 tools)

01
Ollama Free (open source)
Local LLM runtime and REST API server

OpenAI-compatible API, automatic GPU offloading, Docker-like model management — drop-in replacement for cloud endpoints

02
Qwen3 8B Free (Apache 2.0)
Primary coding model for developer workstations

72% HumanEval, ~50% SWE-bench Verified, runs at 80-90 tok/s on RTX 4090 — sufficient for the vast majority of daily coding tasks

03
Continue.dev Free (open source)
VS Code / JetBrains IDE integration

Native editor integration with configurable base_url — point it at localhost and cloud API calls stop instantly

04
LiteLLM Local Proxy Free (open source, pin to 1.83.0+)
Routing layer for hybrid local/cloud workflows

Routes simple completions to local models, complex tasks to managed API when needed — one config file controls all traffic

05
Docker Free (Docker Desktop paid for large teams)
Agent execution sandbox

Restricts agent filesystem and network access to the project directory — necessary once you run agentic loops locally

total / month Free - $30/mo (hardware amortized separately)

TL;DR

  • Three supply chain incidents — Axios (March 31), LiteLLM (March 24), Claude Code CVEs (patched January 2026) — map exactly to where code leaves your machine in a cloud AI workflow
  • Qwen3 8B hits 72% HumanEval on 16GB RAM; Gemma 4 31B reaches ~85% on a 24GB GPU — both sufficient for most daily coding tasks
  • Stack: Ollama (runtime) → Qwen3 8B or Gemma 4 (model) → Continue.dev or Cline (IDE) → LiteLLM local proxy (optional routing) → Docker sandbox (agent isolation)
  • Hardware floor: 16GB unified RAM for Qwen3 8B; 24GB dedicated VRAM for 31B-class models — budget $1,500–$3,000
  • This stack covers 80% of coding tasks. Complex cross-repository reasoning and greenfield architecture decisions still require frontier models — plan for hybrid

Three supply chain attacks in two weeks should be enough to reconsider where your code goes. On March 31, North Korean actors hijacked the Axios npm package — 100 million weekly downloads, RAT installed via postinstall hook. On March 24, litellm versions 1.82.7 and 1.82.8 appeared on PyPI containing a credential stealer that auto-executed on Python startup, targeting AWS, GCP, and Azure tokens. And if you’re running Claude Code against untrusted repositories without patching to v2.0.65+, two published CVEs (CVE-2025-59536, CVE-2026-21852) allow arbitrary shell execution and API key exfiltration via malicious .claude/settings.json files — both were disclosed and patched in late 2025 and January 2026.

The narrative around local LLMs has been “cost savings and offline use.” Both true, both wrong as primary motivations. The forcing function is that every cloud AI coding tool is now a potential exfiltration vector: your code leaves your machine on every completion, every agent loop, every context window. I’d build this stack for any team working on proprietary code, and I’d have built it faster if I’d seen the LiteLLM supply chain incident coming.

The good news: the same tools the community dismissed as “not good enough” are now clearly good enough for 80% of routine coding tasks. Android Studio’s same-day Gemma 4 local agent integration on April 2 — the day the model dropped — is the clearest signal yet that a major platform vendor considers local model quality production-sufficient. The benchmarks back that up. This isn’t a compromise anymore. It’s a choice.

The Threat Model

Before stack details, it’s worth being precise about what you’re actually defending against. The attacks above aren’t theoretical.

The Axios compromise worked because build pipelines execute postinstall scripts without auditing them. The malicious packages (versions 1.14.1 and 0.30.4) installed a remote access trojan via a dependency (plain-crypto-js@4.2.1) that exfiltrated environment variables and filesystem contents. If your CI pipeline pulls npm packages and runs completions against a cloud API, you have two simultaneous exfiltration vectors: the compromised package and the API call itself.
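One partial mitigation on the package side, independent of where your completions run, is to install with lifecycle scripts disabled and then review what you skipped. A minimal sketch (standard npm flags; the grep is just one way to surface hooks for manual review):

# Install without running lifecycle scripts (postinstall included)
npm ci --ignore-scripts

# List packages that declare install-time hooks so they can be reviewed by hand
grep -rl '"postinstall"' node_modules --include=package.json | head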

The LiteLLM compromise was more targeted. Versions 1.82.7 and 1.82.8 were live on PyPI from around 10 UTC for approximately 40 minutes before PyPI quarantined them. The malicious .pth file (litellm_init.pth) auto-executes on Python startup, meaning any Python process started in an environment with the compromised package installed would silently execute the credential stealer. LiteLLM gets 3.4 million daily downloads. The blast radius was significant.

Claude Code’s vulnerabilities are different in character: they require an attacker to plant a malicious .claude/settings.json in a repository you clone. CVE-2025-59536 (published October 3, 2025) executes arbitrary shell commands on tool initialization — before any trust dialog appears. CVE-2026-21852 (published January 21, 2026) overrides ANTHROPIC_BASE_URL to redirect API traffic — and with it, your code — to an attacker-controlled endpoint. Both are patched in v2.0.65+, but they illustrate the pattern: the attack surface isn’t just the model provider. It’s every tool in the chain that touches your code.

Running local models doesn’t eliminate all of these risks, but it eliminates the most consequential one: your proprietary code never traverses a network you don’t control.

Stack Overview

The stack has five layers, each doing one job:

  • Ollama — local LLM runtime; wraps llama.cpp behind an OpenAI-compatible REST API on localhost:11434
  • Model — Qwen3 8B (efficient, 16GB RAM minimum) or Gemma 4 31B (better reasoning, 24GB VRAM)
  • Editor integration — Continue.dev (VS Code / JetBrains) or Cline (terminal agent); both accept a base_url override pointing at localhost
  • LiteLLM local proxy — optional routing layer; sends simple completions to Ollama, escalates complex tasks to a managed API endpoint if your hardware can’t handle them
  • Docker sandbox — execution isolation for agentic loops; restricts filesystem access to the project directory and cuts outbound network access

What the combination solves that individual tools don’t: without the routing layer, you’re making a binary choice between local (slower on complex tasks) and cloud (exfiltration risk). LiteLLM lets you define that boundary per request type. Without the sandbox, a local model running in an agentic loop has the same filesystem access you do — which is too much.

graph TD
    Editor["VS Code / JetBrains<br/>(Continue.dev / Cline)"]
    Proxy["LiteLLM Local Proxy<br/>localhost:4000"]
    Ollama["Ollama<br/>localhost:11434"]
    Model["Qwen3 8B / Gemma 4<br/>(local model weights)"]
    Cloud["Managed API<br/>(complex tasks only)"]
    Sandbox["Docker Sandbox<br/>(agent execution)"]

    Editor -->|"completion request"| Proxy
    Proxy -->|"simple / routine tasks"| Ollama
    Proxy -->|"complex cross-repo reasoning"| Cloud
    Ollama --> Model
    Editor -->|"agentic tool calls"| Sandbox
    Sandbox -->|"contained execution"| Ollama

Components

Ollama (current stable: 0.6.x)

Ollama wraps llama.cpp inference behind a REST API layer that handles model quantization, GPU memory allocation, and model file management — the things that made local LLMs painful to operate two years ago. It reached 95,000+ GitHub stars in early 2026, which is less a vanity metric and more a signal that the ecosystem of compatible tools is now large enough to depend on.

The key property for this stack: Ollama binds to 127.0.0.1:11434 by default. It does not expose itself to the network unless you explicitly set OLLAMA_HOST=0.0.0.0. Don’t do that without a reverse proxy and authentication in front of it — anyone on the network can run arbitrary prompts and exhaust your GPU capacity.
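A quick way to confirm that binding on a Linux host (a sketch; on macOS, lsof -iTCP:11434 -sTCP:LISTEN gives the equivalent view):

# Confirm Ollama is listening on loopback only; anything other than 127.0.0.1:11434
# means OLLAMA_HOST has been overridden and the API is reachable from the network
ss -tln | grep 11434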

Why Ollama in this stack:

  • OpenAI-compatible REST API — Continue.dev, Cline, LiteLLM, and most other tools drop in without code changes
  • Automatic GPU offloading — manages VRAM allocation across layers without manual configuration
  • Content-addressable model storage — same approach as Docker image layers; models are pulled once and deduplicated
| Tool | Difference | Switch if |
| --- | --- | --- |
| llama.cpp (direct) | No REST API wrapper; lower overhead | Running in CI pipelines where API latency matters and you control the binary directly |
| LM Studio | GUI-first; easier initial setup | You want a desktop app and don’t need programmatic control |
| vLLM | Production inference server; better batching | Serving multiple users from a shared GPU instance at scale |

Model Selection

Three models are worth evaluating for coding tasks in 2026, each with a distinct hardware and use-case profile.

Qwen3 8B (released April 28, 2025, Apache 2.0) is the practical default for developer workstations. It achieves 72% on HumanEval and approximately 50% on SWE-bench Verified — the latter is a meaningful number because SWE-bench uses real GitHub issues, not synthetic problems. On an RTX 4090 with Q4_K_M quantization it runs at 80-90 tok/s; on a 16GB unified RAM machine with no dedicated GPU it drops to 20-30 tok/s but stays usable for interactive work. Qwen3 8B has a native 32K context that extends to 131K+ via YaRN scaling. One configuration detail that will bite you: Ollama defaults to an 8K context window. Set num_ctx to at least 65536 in your Modelfile or you’re leaving most of that context capacity unused.

Gemma 4 31B (released April 2, 2026, Apache 2.0) raises the quality ceiling. HumanEval sits at approximately 85% on the dense 31B variant — 13 points above Qwen3 8B — and AIME 2026 math reasoning reaches ~89%, which matters for algorithmic problem solving. The hardware requirement is 24GB dedicated VRAM minimum for Q4_K_M. Google’s same-day Gemma 4 integration in Android Studio’s agent mode is a direct signal: a major platform vendor shipped this as a production coding model the same day it was released. For the 26B MoE variant, hardware requirements drop considerably — it activates only 3.8B parameters per forward pass and fits on a single 24GB GPU at Q4 — while still ranking #6 on the Arena AI leaderboard among open models.

Llama 4 Scout (April 5, 2026, Llama 4 Community Agreement) is the wildcard: MoE architecture with a 10M-token context window. The context is compelling for repo-scale reasoning, but SWE-bench sits at ~55-60% — lower than Qwen3 8B despite the larger parameter count. Check Meta’s Community Agreement (700M MAU cap) before commercial deployment.

Quantization guidance: Q4_K_M is the minimum for coding quality. Q5_K_M gains 3-5% quality at 10-15% more VRAM — worth it for multi-file reasoning. Q8_0 only makes sense above 32GB VRAM.

| Model | HumanEval | SWE-bench | VRAM (Q4_K_M) | License |
| --- | --- | --- | --- | --- |
| Qwen3 8B | 72% | ~50% | 8GB | Apache 2.0 |
| Gemma 4 31B | ~85% | – | 24GB | Apache 2.0 |
| Llama 4 Scout | ~84% | ~55-60% | 48GB+ | Community (700M MAU cap) |

For most teams: start with Qwen3 8B. Upgrade to Gemma 4 31B if your hardware supports it. Gemma 4 26B MoE is the practical middle ground if you have a single 24GB GPU and want meaningfully better reasoning than Qwen3 8B without the full 31B dense VRAM requirement.

Editor Integration: Continue.dev and Cline

Continue.dev is the path of least resistance for VS Code or JetBrains. The one configuration change that matters:

// ~/.continue/config.json
{
  "models": [
    {
      "title": "Qwen3 8B (local)",
      "provider": "ollama",
      "model": "qwen3:8b",
      "apiBase": "http://localhost:11434"
    }
  ]
}

Cloud completions stop at the config file. No code leaves the machine. After applying this config, verify by watching your network monitor — zero outbound HTTPS to any AI provider on completion requests is what you’re looking for.

Cline is the better choice for agentic workflows — multi-file edits, test-fix loops, scaffolding. It supports tool calling and function schemas against Ollama directly. One security note specific to Cline: if you’re using it with MCP integrations, a malicious .mcp.json in an untrusted repository can auto-initialize external tools. Require explicit trust confirmation before running Cline in any repository you didn’t author.
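Before opening an untrusted clone with any agent attached, it is worth checking for the config files that can trigger auto-initialization. A minimal check, assuming the repository root is the current directory:

# Surface agent config files that can execute or redirect tooling before any trust prompt:
# .mcp.json (MCP tool definitions) and .claude/settings.json (the Claude Code CVEs above)
find . -maxdepth 3 \( -name ".mcp.json" -o -path "*/.claude/settings.json" \) -print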

| Tool | Difference | Switch if |
| --- | --- | --- |
| Cline | Terminal-based agent; better for multi-file agentic tasks | You need structured tool calling and file editing loops |
| Cursor | Cloud-based; local model support limited | You prioritize editor UX over security and don’t handle proprietary code |
| Copilot | GitHub-hosted inference only | You’re working on open-source code and accept cloud API exposure |

LiteLLM Local Proxy

The routing layer is optional but valuable for teams that hit the capability ceiling of their local hardware on complex tasks. LiteLLM proxies requests and applies routing rules — send routine completions to Ollama, escalate complex multi-file reasoning to a managed API endpoint.

One non-negotiable requirement: do not run LiteLLM versions 1.82.7 or 1.82.8. These were the compromised PyPI releases from March 24. If you installed LiteLLM during that window (10–16 UTC), rotate all cloud provider credentials before continuing. Pin to 1.83.0 or later, which was released through an isolated CI/CD pipeline with stronger security gates.
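A quick check of what is already installed, assuming a pip-managed environment:

# 1.82.7 and 1.82.8 are the compromised releases; anything below 1.83.0 should be upgraded
pip show litellm | grep -i '^version'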

# proxy_config.yaml
model_list:
  - model_name: local-qwen3
    litellm_params:
      model: ollama/qwen3:8b
      api_base: http://localhost:11434

  - model_name: frontier-fallback
    litellm_params:
      model: anthropic/claude-3-5-sonnet
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: simple-shuffle
  model_group_alias:
    default: local-qwen3
    complex: frontier-fallback

The routing decision is manual per-request via the model parameter. You control what goes local and what goes cloud — there’s no automatic classification.
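For example, the same request stays local or leaves the machine depending only on the model field. The model names match the config above; add an Authorization header if you have configured a master_key:

# Stays on-machine: routed to Ollama via the local-qwen3 entry
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-qwen3", "messages": [{"role": "user", "content": "Add a docstring to this function."}]}'

# Leaves the machine: routed to the managed API via the frontier-fallback entry
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "frontier-fallback", "messages": [{"role": "user", "content": "Plan a refactor across these modules."}]}'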

| Tool | Difference | Switch if |
| --- | --- | --- |
| nginx reverse proxy | Simpler; no routing logic | You only need single-model load balancing, not routing |
| OpenRouter | Cloud-hosted routing; no local model support | You’re routing between cloud providers only |

Docker Sandbox

Running a coding agent locally without sandboxing it means the agent has your full filesystem access. That’s not a theoretical risk — agentic loops that write files, execute tests, and manage dependencies will follow instructions into paths you didn’t intend.

The minimal sandbox that works:

docker run \
  --rm \
  --security-opt=no-new-privileges \
  --mount type=bind,source=$(pwd),target=/work \
  --network none \
  --workdir /work \
  your-agent-image

This gives the agent read/write access to the current project directory, no network access, and no privilege escalation. For Ollama access from inside the container, swap --network none for --add-host=host.docker.internal:host-gateway and restrict outbound calls at the firewall level to localhost:11434 only.

The Docker overhead per tool invocation is 50-200ms. For interactive coding assistance that’s imperceptible. For tight test-fix loops with hundreds of iterations, it’s noticeable — use Bubblewrap on Linux if you need the isolation with lower latency.
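A rough Bubblewrap equivalent of the Docker invocation above (a sketch, not a hardened profile; bind paths assume a typical Linux filesystem layout, and your-agent-binary is a stand-in for whatever agent you run). Drop --unshare-net if the agent needs to reach Ollama on the host:

# Project directory writable at /work, read-only system dirs, no network, new session
bwrap \
  --ro-bind /usr /usr --ro-bind /etc /etc \
  --symlink usr/bin /bin --symlink usr/lib /lib --symlink usr/lib64 /lib64 \
  --proc /proc --dev /dev --tmpfs /tmp \
  --bind "$(pwd)" /work --chdir /work \
  --unshare-net --die-with-parent \
  your-agent-binary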

| Tool | Difference | Switch if |
| --- | --- | --- |
| Bubblewrap | Linux-only; ~10MB overhead vs Docker startup | Linux workstation, high-frequency tool calls, latency matters |
| macOS sandbox-exec | macOS-native; profile-based | macOS and you want native rather than Docker VM overhead |

Setup Walkthrough

Step 1 — Install Ollama

Ollama ships a single installer script for Linux and macOS. Windows has a native installer at ollama.ai.

# Linux / macOS
curl -fsSL https://ollama.ai/install.sh | sh

# Verify GPU detection
ollama info

After install, confirm the output shows your GPU (NVIDIA CUDA, AMD ROCm, or Apple Metal). CPU-only inference works but drops below 5 tok/s on 13B+ models — impractical for interactive use.

Step 2 — Pull Your Model

Start with Qwen3 8B if your machine has 16GB RAM. Pull Gemma 4 26B MoE (the quantized default available in the Ollama library) if you have 24GB VRAM.

# 16GB RAM / no dedicated GPU
ollama pull qwen3:8b

# 24GB VRAM (RTX 3090 / 4090)
ollama pull gemma4:26b-it-q4_K_M

Model files are content-addressed and stored once. Pulling the same model variant twice is a no-op.

Step 3 — Configure Context Window

Ollama defaults to 8K context. Qwen3 8B supports 131K+ via YaRN. Create a Modelfile that sets the context explicitly.

# ~/.ollama/Modelfile.qwen3-131k
FROM qwen3:8b
PARAMETER num_ctx 131072
PARAMETER num_predict 4096

# Register the variant with Ollama
ollama create qwen3-131k -f ~/.ollama/Modelfile.qwen3-131k

Use qwen3-131k as your model name in Continue.dev and LiteLLM configs from this point forward.
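To confirm the variant was registered with the larger window (output format varies by Ollama version):

# The new variant should appear alongside the base model, with num_ctx visible in its details
ollama list
ollama show qwen3-131k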

Step 4 — Test the REST API

Before configuring any editor integration, confirm the Ollama API responds correctly.

curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3-131k",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a string."}],
    "stream": false
  }'

A response with "done": true and a non-empty message.content confirms the runtime is working. If you see a connection error, check that the Ollama service is running (ollama serve).
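Ollama also serves an OpenAI-compatible route at /v1/chat/completions, which is the path tools expecting an OpenAI-style base URL will use; confirming it responds rules out one class of integration problems:

# Same check against the OpenAI-compatible endpoint
# A populated choices[0].message.content in the response confirms this layer works
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-131k",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a string."}]
  }'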

Step 5 — Configure Continue.dev

Install the Continue extension in VS Code or JetBrains. Replace the default config with a localhost endpoint.

// ~/.continue/config.json
{
  "models": [
    {
      "title": "Qwen3 131K (local)",
      "provider": "ollama",
      "model": "qwen3-131k",
      "apiBase": "http://localhost:11434",
      "contextLength": 131072
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen3 (autocomplete)",
    "provider": "ollama",
    "model": "qwen3:8b",
    "apiBase": "http://localhost:11434"
  }
}

Open the Continue panel and run a test completion. Network monitor (or lsof -i :443) should show zero outbound HTTPS calls to any AI provider.

Step 6 — Deploy LiteLLM Proxy (optional)

Skip this step if you’re running purely local with no cloud fallback. If you want hybrid routing:

# Install — pin to safe version
pip install litellm==1.83.0

# Create config
cat > proxy_config.yaml << 'EOF'
model_list:
  - model_name: local
    litellm_params:
      model: ollama/qwen3-131k
      api_base: http://localhost:11434

  - model_name: frontier
    litellm_params:
      model: anthropic/claude-3-5-sonnet
      api_key: os.environ/ANTHROPIC_API_KEY

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
EOF

# Start proxy
litellm --config proxy_config.yaml --port 4000

Point Continue.dev at http://localhost:4000 instead of localhost:11434. Use model: local for routine tasks, model: frontier for complex cross-repository work.
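When Continue.dev talks to the proxy instead of Ollama, the provider switches to an OpenAI-compatible entry. A minimal sketch, assuming the master_key from the config above (replace the placeholder with your actual key):

// ~/.continue/config.json (routed through LiteLLM instead of hitting Ollama directly)
{
  "models": [
    {
      "title": "Local via proxy",
      "provider": "openai",
      "model": "local",
      "apiBase": "http://localhost:4000",
      "apiKey": "<LITELLM_MASTER_KEY value>"
    },
    {
      "title": "Frontier via proxy",
      "provider": "openai",
      "model": "frontier",
      "apiBase": "http://localhost:4000",
      "apiKey": "<LITELLM_MASTER_KEY value>"
    }
  ]
}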

Step 7 — Set Up the Agent Sandbox

For any workflow where the model executes code (Cline agentic mode, Claude Code against local models), run the agent inside a container.

# Build a minimal agent image
cat > Dockerfile.agent << 'EOF'
FROM python:3.12-slim
RUN pip install cline
WORKDIR /work
ENTRYPOINT ["cline"]
EOF

docker build -t local-agent -f Dockerfile.agent .

# Run with project directory mounted, network restricted to Ollama host
docker run \
  --rm -it \
  --security-opt=no-new-privileges \
  --mount type=bind,source=$(pwd),target=/work \
  --add-host=host.docker.internal:host-gateway \
  local-agent \
  --api-base http://host.docker.internal:11434 \
  --model qwen3-131k

The agent reads and writes to your current directory and calls Ollama on the host; with the firewall restriction from the Docker Sandbox section in place, it cannot make any other outbound network calls.

Pricing

| Component | License | Free Tier | Paid from | Note |
| --- | --- | --- | --- | --- |
| Ollama | MIT | Unlimited | – | No usage limits, no telemetry on model calls |
| Qwen3 8B | Apache 2.0 | Unlimited | – | Commercial use unrestricted |
| Gemma 4 31B | Apache 2.0 | Unlimited | – | Commercial use unrestricted |
| Llama 4 Scout | Community | Unlimited | – | 700M MAU cap; verify Meta’s Community Agreement before commercial deployment |
| Continue.dev | Apache 2.0 | Unlimited | – | No cloud dependency when using local models |
| LiteLLM (proxy) | MIT | Unlimited | $0 (self-hosted) | Enterprise: LiteLLM Cloud pricing applies if not self-hosted |
| Docker | Apache 2.0 | Unlimited | $9/mo (Desktop, large teams) | Free for personal and small team use |

Hardware is the real cost. A used RTX 3090 (24GB VRAM) runs approximately $700 and handles Qwen3 32B at Q4 at 30-40 tok/s. An RTX 4090 ($1,600 new) can run 70B-class models, though only at aggressive quantization or with partial CPU offload. A Mac Studio M4 Max with 128GB unified memory (~$7,000) fits Llama 4 Scout at quantization levels no single consumer GPU can hold. At managed API rates of a few dollars per million tokens, the hardware pays for itself only at significant volume: amortizing that RTX 3090 over 12 months works out to roughly $58/month, on the order of 5-10 million tokens of frontier API usage displaced each month to break even.

LiteLLM versions 1.82.7 and 1.82.8 (published March 24, 2026, 10–16 UTC) contain a credential stealer. If you installed LiteLLM during that window, rotate all cloud provider credentials immediately and upgrade to 1.83.0+. Do not use any version below 1.83.0.

When This Stack Fits

  • You work on proprietary code that cannot leave your network. The stack eliminates the primary exfiltration vector. No completion request leaves the machine unless you configure the LiteLLM frontier fallback.
  • Your coding tasks are routine. Qwen3 8B at 72% HumanEval covers function-level generation, test writing, refactoring, documentation, and most debugging. The gap versus frontier models is not decisive here.
  • You have a developer workstation with ≥16GB RAM. The hardware requirement is real. Under-powered laptops make this impractical without a shared Ollama server behind a reverse proxy — which introduces its own attack surface.
  • You’re building a hybrid workflow. Routing 80% of completions locally and 20% to a frontier model is a practical compromise. LiteLLM makes that boundary explicit and auditable.

When This Stack Does Not Fit

  • You need complex cross-repository reasoning. Claude Sonnet scores 77% on SWE-bench Verified; Qwen3 8B ~50%. For changes that ripple across 50 files, that gap is decisive — use a frontier model (via LiteLLM hybrid routing to minimize exposure).
  • You’re doing greenfield architecture work. Novel architectural decisions, API design from scratch, complex schema design — these benefit from frontier reasoning local 8B-class models don’t match. Local is a containment strategy for routine coding, not a replacement on hard design problems.
  • Your team doesn’t have the hardware budget. The $1,500–$3,000 machine floor is real. If budget is tight, a managed API with strict DPAs and egress monitoring may be cheaper and safer than under-specced local hardware. A 5 tok/s model is worse than no model.
  • You want to share Ollama without hardening it. A shared instance without per-user auth, rate limiting, and TLS is a new attack surface. Implement nginx with OAuth2 proxy or LiteLLM per-key auth first.

Three supply chain incidents in two weeks suggest the threat model has changed faster than most teams have updated their tooling. The stack above is an afternoon of configuration — the question is whether you want that afternoon before or after something goes wrong.