Local LLM API — Ollama beats others for dev workflows
You have a model running in a terminal. That is not yet useful for building anything. The gap between “model works in a chat prompt” and “my Python script can call it” is exactly what this guide closes.
There are three credible ways to expose a local model as an OpenAI-compatible API: Ollama’s built-in server, LM Studio’s developer mode, and llama-server from llama.cpp. All three implement /v1/chat/completions. All three run llama.cpp under the hood — so raw inference speed is essentially identical. The differences are entirely operational, and they matter a lot once you move beyond a single test script.
My take: Ollama is the right answer for 90% of developers building applications. LM Studio is where you go to explore models before committing to one. llama-server is for production headless deployments or situations where you need a single binary with zero dependencies and surgical GPU control. The rest of this guide walks through all three using the same model — llama3.3:8b-q4_k_m, the Q4_K_M quantized 8B build pulled in Part 1 — so you can see the tradeoff directly.
TL;DR
- Building an app? Use Ollama — always-on daemon, model management, request batching, OpenAI-compatible out of the box.
- Evaluating models? Use LM Studio — HuggingFace browser, visual VRAM monitor, click-to-serve simplicity.
- Deploying headless or need LoRA? Use llama-server — ~90MB binary, no runtime dependencies, --n-gpu-layers precision.
This is Part 2 of a series. If you have not set up Ollama and pulled a model yet, start with How to Run Your First Local LLM.
Prerequisites
- Ollama installed with `llama3.3:8b-q4_k_m` already pulled — see Part 1
- Python 3.9 or higher (`python --version` to check)
- `openai` Python package (`pip install openai`)
- For LM Studio sections: a machine with a GUI (not headless)
- For llama-server sections: a downloaded `.gguf` model file
The Shared API Pattern
Before diving into each runtime, there is one design principle worth front-loading because it shapes everything else in this guide. All three runtimes expose the same /v1/chat/completions endpoint. Your Python code does not care which runtime is running behind the port — it only cares about the URL. That means you can swap runtimes without touching application code, as long as you externalize base_url as an environment variable:
# Set this once per runtime — your app code never needs to change
export LOCAL_LLM_BASE_URL="http://localhost:11434/v1" # Ollama
# export LOCAL_LLM_BASE_URL="http://localhost:1234/v1" # LM Studio
# export LOCAL_LLM_BASE_URL="http://localhost:8080/v1" # llama-server
export OPENAI_API_KEY="ollama" # value is ignored by all three runtimes, but the SDK requires it
With those two variables set, your client instantiation is always the same:
import os
from openai import OpenAI
client = OpenAI(
    base_url=os.environ["LOCAL_LLM_BASE_URL"]
    # api_key is read automatically from OPENAI_API_KEY env var
)
None of the three runtimes validate the API key — they accept any non-empty string. But the openai SDK requires the field to be set. Exporting OPENAI_API_KEY=ollama (or any placeholder) lets the client initialize without an explicit parameter, which matches how most production code is written. Use this pattern throughout your project and switching runtimes in development becomes a one-line environment change, not a code change.
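As a concrete illustration (assuming a hypothetical entry point named app.py), pointing an existing dev session at llama-server instead of Ollama is nothing more than a prefix on the command:

```bash
# Same script, different runtime; only the environment changes
LOCAL_LLM_BASE_URL="http://localhost:8080/v1" python app.py
```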
Part 1 — Ollama: The API Is Already Running
Most developers who install Ollama do not realize this: the moment you run ollama serve — or install Ollama on macOS or Windows where it starts automatically on login — a full HTTP server is live on port 11434. You do not need to do anything extra. The API endpoint has been waiting for you this whole time.
Verify it:
curl http://localhost:11434/api/tags
If you get back a JSON list of your pulled models, the server is up. If you get a connection refused error, start it manually:
ollama serve
Step 1: Test the Native Ollama API
Ollama exposes two API flavors. The first is its own native format at /api/generate and /api/chat. These work fine for simple scripts but are not compatible with the OpenAI SDK — you would need to handle the request and response shapes yourself.
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3:8b-q4_k_m",
"messages": [{"role": "user", "content": "What is 2 + 2?"}],
"stream": false
}'
The native API is useful to know about, but you will almost never reach for it in application code. The endpoint below is what actually matters.
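If you do want to script against the native endpoint anyway, a minimal sketch with the requests library (assumed installed via pip install requests) shows what handling those shapes yourself looks like; note that the reply comes back under message.content rather than the OpenAI-style choices[0].message.content:

```python
import requests

# Native Ollama chat endpoint (not OpenAI-compatible), so the payload is built by hand
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.3:8b-q4_k_m",
        "messages": [{"role": "user", "content": "What is 2 + 2?"}],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()

# Native response shape: the assistant reply lives under "message"
print(resp.json()["message"]["content"])
```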
Step 2: Use the OpenAI-Compatible Endpoint
The endpoint that matters for application development is /v1/chat/completions on the same port. Any code written against the OpenAI API works against Ollama with a single change to base_url. That design decision is what makes Ollama operationally practical for local dev rather than just technically interesting.
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3:8b-q4_k_m",
"messages": [{"role": "user", "content": "What is 2 + 2?"}]
}'
Step 3: Call It from Python
import os
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1"
    # OPENAI_API_KEY env var must be set to any non-empty string
)
response = client.chat.completions.create(
    model="llama3.3:8b-q4_k_m",
    messages=[
        {"role": "user", "content": "Explain tail call optimization in one paragraph."}
    ]
)
print(response.choices[0].message.content)
The model string llama3.3:8b-q4_k_m refers to the Q4_K_M quantized 8B build you pulled in Part 1. Ollama uses this tag to identify the model in its local registry — it maps directly to the .gguf file on disk.
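If you are not sure of the exact tag string on your machine, ollama list prints every model in the local registry together with its tag; whatever appears there is what goes in the model field:

```bash
ollama list
```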
Step 4: Keep the Model Warm
By default, Ollama unloads a model from VRAM after 5 minutes of inactivity. During active development this is a constant irritation — every request after a pause carries a cold-start delay while the model reloads into GPU memory. The fix is a single environment variable:
OLLAMA_KEEP_ALIVE=-1 ollama serve
Other valid values are 24h, 1h, or 30m. Setting -1 means the model stays in VRAM until you explicitly unload it or restart the server. For a development machine where you want instant responses on every request, this is the right default. Add it to your shell profile so it persists across sessions.
Concurrent request handling is where Ollama’s operational design really shows. Ollama supports request batching natively — multiple async calls against the API can be processed simultaneously rather than queuing serially. To set the concurrency level explicitly:
OLLAMA_NUM_PARALLEL=4 ollama serve
The combination of OLLAMA_KEEP_ALIVE=-1 and OLLAMA_NUM_PARALLEL=4 is what I run during active development. Cold starts disappear, and concurrent requests from test scripts do not pile up waiting for the previous call to complete. Once you have worked this way, going back to an on-demand loading server feels broken.
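To see the parallel handling in action, here is a minimal sketch that fires four requests concurrently with the SDK's async client; it assumes the LOCAL_LLM_BASE_URL and OPENAI_API_KEY variables from earlier are set, and the prompts are throwaway placeholders:

```python
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url=os.environ["LOCAL_LLM_BASE_URL"])

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="llama3.3:8b-q4_k_m",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    prompts = [f"Give one interesting fact about the number {n}." for n in range(4)]
    # With OLLAMA_NUM_PARALLEL=4 these decode together instead of queuing serially
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())
```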
Part 2 — LM Studio: GUI-First, API Second
LM Studio is a desktop application — Electron-based, point-and-click, with a built-in model browser that connects to HuggingFace. It is well-suited for finding and evaluating models before committing to one. It is not the right tool for serving an API to an application you are actively building, and understanding why requires a closer look at how it handles requests under load.
That said, it does have a developer server mode, and the workflow is straightforward. Many developers use both Ollama and LM Studio — just not for the same job.
Step 1: Load a Model
Open LM Studio and navigate to the “My Models” tab. If you have the same llama3.3-8b-q4_k_m.gguf file from Part 1, it will appear here. Select it and click “Load.”
The GPU layer slider appears during loading. Drag it toward “Max” to offload as many transformer layers as your VRAM allows — this is LM Studio’s equivalent of --n-gpu-layers in llama-server, presented visually. Watch the VRAM monitor to avoid running out. If the model fails to load, reduce the slider until it fits. The visual VRAM monitor is one of LM Studio’s practical strengths: it gives you immediate feedback during model selection that you do not get anywhere else without running manual benchmarks.
Step 2: Start the Server
Switch to the “Developer” tab and toggle “Start Server.” The default port is 1234. You will see a green indicator and a log confirming the server is active. That is the full setup.
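To confirm the server is listening before wiring up application code, the OpenAI-compatible model listing endpoint gives a quick check (assuming the default port):

```bash
curl http://localhost:1234/v1/models
```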
Step 3: Same Python Code, Different Port
import os
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:1234/v1"
    # OPENAI_API_KEY env var must be set to any non-empty string
)
response = client.chat.completions.create(
    model="llama3.3:8b-q4_k_m",  # if LM Studio does not recognize this tag, use the model identifier shown in its Developer tab
    messages=[
        {"role": "user", "content": "Explain tail call optimization in one paragraph."}
    ]
)
print(response.choices[0].message.content)
The only change from the Ollama code is the port number. That is the entire migration — which is exactly why the environment variable pattern from the intro matters.
What LM Studio Does Not Do Well
LM Studio's server handles concurrent requests serially: it is single-threaded as of LM Studio 0.3.x (tested 2026-03-28). If two processes hit the API simultaneously, one waits. The model also loads on demand rather than being pre-warmed — which means the first request after startup carries a cold-start penalty while the model initializes. For a server you are firing requests at continuously during development, both behaviors become recurring friction points.
There is also the question of source code: LM Studio is closed source. That is not a dealbreaker for most use cases, but it is worth knowing if you are building something where supply chain visibility matters or if you ever need to debug unexpected server behavior.
Where LM Studio earns its place is the model discovery phase. The HuggingFace browser is well-implemented, the ability to visually compare model sizes against your available VRAM makes the selection process faster than pulling models blindly, and the chat interface lets you run quick qualitative evaluations before committing to a model. For exploring a dozen quantization options in an afternoon, it is the best tool available. The point is not to avoid LM Studio — it is to not mistake “good for exploring models” for “good for serving an API.”
:::callout type=tip
A workflow many developers land on: spend an hour in LM Studio narrowing down model candidates, then ollama pull the winner for actual development. You get LM Studio’s model-browsing strengths without inheriting its concurrency limitations during the build phase.
:::
Part 3 — llama-server: One Binary, Full Control
llama-server is the HTTP server that ships directly with llama.cpp. It is a single statically-linked binary — no Electron runtime, no model registry, no background daemon. You point it at a .gguf file and it starts serving. The binary comes in around 90MB. That contrast with a full Electron desktop app or Ollama’s bundled tooling and runtime is not just aesthetic — it determines where you can actually deploy the thing. If you have ever tried to install a desktop Electron app inside a Docker container, you already know why having a 90MB statically-linked binary matters.
This is the tool for headless deployments, Docker containers, CI environments, and any situation where you need precise control over GPU layer allocation or want zero operational overhead beyond the process itself.
Step 1: Get the Binary
The fastest path on most systems:
# macOS (Homebrew)
brew install llama.cpp
# Or pull the pre-built Docker image
docker pull ghcr.io/ggml-org/llama.cpp:server
For GPU-specific builds — CUDA, Vulkan, Intel — the llama.cpp GitHub releases page provides pre-built binaries tagged server-cuda, server-vulkan, and so on; grab the current build matching your hardware. The naming convention is consistent across releases.
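If you are going the Docker route instead, the server image passes its arguments straight through to llama-server, so the run command is just the usual flags plus a volume mount for your models. A sketch, assuming your .gguf files live in ~/models (add --gpus all only if you pulled a CUDA-capable variant):

```bash
docker run -p 8080:8080 -v ~/models:/models \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/llama3.3-8b-q4_k_m.gguf --port 8080 --host 0.0.0.0
```

The --host 0.0.0.0 flag matters inside a container; without it the server binds to localhost and the published port never answers.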
Step 2: Start the Server
llama-server \
-m ~/models/llama3.3-8b-q4_k_m.gguf \
--port 8080
The server starts, loads the model, and begins accepting requests. That is the entire setup — no configuration files, no model registry, no pull commands.
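Before pointing application code at it, you can check readiness: llama-server exposes a /health endpoint that returns an error while the model is still loading and a 200 once it is ready to serve.

```bash
curl http://localhost:8080/health
```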
For GPU offloading, add --n-gpu-layers:
llama-server \
-m ~/models/llama3.3-8b-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 33
The -ngl shorthand does the same thing. Setting -ngl -1 offloads all layers automatically — this is what you want if your VRAM can hold the full model. For a Q4_K_M quantization of an 8B model you need roughly 5–6GB of VRAM to fit everything. Layers not offloaded to GPU run on CPU, which is slower but functional. If the server fails to start with out-of-memory errors, reduce --n-gpu-layers incrementally until it loads.
For multi-GPU setups, --tensor-split controls the allocation ratio across devices:
llama-server \
-m ~/models/llama3.3-8b-q4_k_m.gguf \
--port 8080 \
--tensor-split 3,1
The 3,1 value splits the model across two GPUs in a 3:1 ratio: the first device gets roughly three quarters of the layers, the second gets the rest. Adjust the ratio to match the relative VRAM of your cards.
Step 3: Same Python Code, Different Port
import os
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1"
    # OPENAI_API_KEY env var must be set to any non-empty string
)
response = client.chat.completions.create(
    model="llama3.3:8b-q4_k_m",  # with a single model loaded, llama-server serves it regardless of this value
    messages=[
        {"role": "user", "content": "Explain tail call optimization in one paragraph."}
    ]
)
print(response.choices[0].message.content)
Port changes. Code does not. This is the pattern across all three runtimes — and if you externalized LOCAL_LLM_BASE_URL as shown at the top of this guide, not even the port change touches your application code.
Router Mode for Multiple Models
Recent versions of llama-server add a router mode — start the server without -m and point it at a directory of models. Models load on demand and are evicted LRU-style when memory fills:
llama-server --models-dir ./models --port 8080
Models in ~/.cache/llama.cpp are also auto-discovered by default. This makes llama-server viable as a lightweight multi-model serving layer without any of Ollama’s bundled tooling — useful if you want model-switching without pulling in a full daemon or managing a model registry.
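To confirm which models the router actually discovered, list them through the standard OpenAI-compatible endpoint that llama-server also implements:

```bash
curl http://localhost:8080/v1/models
```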
LoRA Support
If you are working with fine-tuned LoRA adapters, llama-server is currently your only option among the three. Pass the adapter at startup:
llama-server -m ~/models/llama3.3-8b-q4_k_m.gguf --lora adapter.gguf --port 8080
LoRA support in llama-server is experimental as of early 2026. Ollama and LM Studio do not support LoRA adapters in their current versions. If you are fine-tuning models and need to serve them locally, this is where llama-server stops being an optional alternative and becomes the only path.
Head-to-Head Comparison
| | Ollama | LM Studio | llama-server |
|---|---|---|---|
| Installation | CLI tool | Desktop app (Electron) | Single binary (~90MB) |
| Time to API live | Seconds (after pull) | 2–3 minutes (GUI load) | Under 30 seconds |
| Concurrent requests | Batching, configurable parallel | Single-threaded (v0.3.x) | Stateless, efficient |
| Model management | ollama pull, Modelfile, registry | GUI browser, HuggingFace | Manual or Router Mode |
| Runs as daemon | Yes (ollama serve) | No (GUI or CLI process) | No (single process) |
| GPU layer control | Via Modelfile or API options | GUI slider | --n-gpu-layers, --tensor-split |
| LoRA support | No | No | Experimental |
| Docker support | Official image | No official image | Official image, GPU variants |
| OpenAI compat | /v1/chat/completions | /v1/chat/completions | /v1/chat/completions |
| Source | Open source | Closed source | Open source |
| Disk footprint | Bundled tooling + runtime | Full Electron app | ~90MB binary |
The concurrency row deserves more than a table cell. Ollama handles batching natively and can process multiple requests simultaneously — OLLAMA_NUM_PARALLEL lets you tune exactly how many. LM Studio queues concurrent requests serially as of v0.3.x: the second request waits for the first to complete. llama-server sits between them — it keeps its model resident for the life of the process and serves requests efficiently, but parallel decoding is something you opt into yourself (via its --parallel flag) rather than something a managed daemon handles for you. For a development server fielding requests from test scripts running in parallel, this difference shows up immediately and keeps showing up.
The disk footprint row also deserves a caveat: the installed size of Ollama depends on your OS and the models you pull. The directional contrast with llama-server’s ~90MB binary holds regardless of the exact numbers — you are comparing a single binary against a full runtime with bundled tooling.
The Verdict
The three tools serve three distinct use cases, and most developers end up touching all three at different points in a project. The shared API surface means your application code stays the same regardless of which runtime is behind the port — that is a feature worth taking seriously, and the environment variable pattern makes it trivially easy to act on.
Use Ollama if you are building an application. It runs as a background daemon, keeps models warm between requests with OLLAMA_KEEP_ALIVE=-1, handles concurrent traffic with batching, and its model registry means you can ollama pull any community model and have it available via API in under a minute. For a local development workflow, it is operationally superior to the alternatives. The always-on nature means you never wait for a cold start in the middle of a test run, and the overhead of running it is minimal once it is set up.
Use LM Studio if you are evaluating models. When you need to compare a dozen quantizations across different model families before committing to one, LM Studio’s HuggingFace browser and visual VRAM monitor make the selection process faster than pulling models blindly into Ollama and hoping they fit. The practical pattern I keep seeing: spend an hour in LM Studio narrowing down candidates, then pull the winner into Ollama for actual development. You get the best of both without getting stuck with LM Studio’s concurrency limitations during the build phase.
Use llama-server if you are deploying headless or need fine-grained control. A ~90MB binary with no runtime dependencies is the right answer for a Docker container, a CI environment, a remote server you are SSHing into, or any context where you want zero operational overhead. If you are working with LoRA adapters, it is currently your only option among the three. The --n-gpu-layers and --tensor-split flags give you layer-level GPU allocation that Ollama’s higher-level abstractions do not expose.
Key Configuration Reference
# Ollama — keep model permanently in VRAM
OLLAMA_KEEP_ALIVE=-1 ollama serve
# Ollama — parallel request handling
OLLAMA_NUM_PARALLEL=4 ollama serve
# Set your runtime via environment variable (swap without touching app code)
export LOCAL_LLM_BASE_URL="http://localhost:11434/v1" # Ollama
# export LOCAL_LLM_BASE_URL="http://localhost:1234/v1" # LM Studio
# export LOCAL_LLM_BASE_URL="http://localhost:8080/v1" # llama-server
export OPENAI_API_KEY="ollama" # required by SDK, not validated by any of the three
# llama-server — GPU offloading (all layers)
llama-server -m ~/models/llama3.3-8b-q4_k_m.gguf -ngl -1 --port 8080
# llama-server — multi-GPU split (3:1 ratio)
llama-server -m ~/models/llama3.3-8b-q4_k_m.gguf --tensor-split 3,1 --port 8080
# llama-server — router mode (multi-model auto-discovery)
llama-server --models-dir ./models --port 8080
# llama-server — with LoRA adapter
llama-server -m ~/models/llama3.3-8b-q4_k_m.gguf --lora adapter.gguf --port 8080
Troubleshooting
Connection refused on port 11434
Cause: Ollama server is not running.
Fix:
ollama serve
# or check if it's already running as a background process
ps aux | grep ollama
Model unloads between requests (slow cold starts)
Cause: Default OLLAMA_KEEP_ALIVE is 5 minutes.
Fix:
export OLLAMA_KEEP_ALIVE=-1
ollama serve
Add the export to your shell profile (~/.zshrc or ~/.bashrc) so it persists across terminal sessions.
llama-server: model is too large for available VRAM
Cause: Model does not fit fully in GPU memory with the current -ngl setting.
Fix:
# Reduce GPU layers until it fits — start low and increase
llama-server -m ~/models/llama3.3-8b-q4_k_m.gguf --n-gpu-layers 20 --port 8080
Layers not offloaded to GPU fall back to CPU — slower, but the model will run. Reduce --n-gpu-layers incrementally until you find a value that loads without errors. For a Q4_K_M 8B model on a GPU with 6GB VRAM, 33 layers is typically the ceiling before overflow.
LM Studio server returns 503 on first request
Cause: Model is still loading — LM Studio loads on demand, not pre-warmed.
Fix: Wait for the loading indicator in the Developer tab to complete before sending the first request. For application code that might hit the server immediately after startup, add a retry with exponential backoff on the initial connection attempt. This is a structural limitation of LM Studio’s on-demand loading, not a configuration issue you can tune away.
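A minimal sketch of that retry pattern, assuming the openai SDK from earlier (the attempt count and delays are arbitrary starting points, not LM Studio recommendations):

```python
import time

from openai import APIConnectionError, APIStatusError, OpenAI

client = OpenAI(base_url="http://localhost:1234/v1")

def chat_with_retry(prompt: str, max_attempts: int = 5) -> str:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            response = client.chat.completions.create(
                model="llama3.3:8b-q4_k_m",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except (APIStatusError, APIConnectionError):
            if attempt == max_attempts:
                raise
            # Server is up but the model is likely still loading; back off and retry
            time.sleep(delay)
            delay *= 2

print(chat_with_retry("Say hello in five words."))
```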
openai.AuthenticationError when connecting to local runtime
Cause: The openai SDK requires OPENAI_API_KEY to be set — it raises an error if the variable is missing or empty, even though the local runtimes never validate the value.
Fix:
export OPENAI_API_KEY="ollama"
Any non-empty string works. "ollama" is the conventional placeholder you will see across examples online. Set it in your shell profile so it is always available.
Next Steps
Once your local API is running, the natural next direction is understanding where local models fit in a multi-model architecture. Local APIs eliminate per-token costs entirely for high-volume workloads — the Agent Pipeline Cost Optimization guide walks through how to think about that tradeoff in practice. For broader context on why running models locally matters beyond cost, Open Source vs Proprietary AI Models covers the tradeoffs that go beyond what any API surface can show you.