Local LLM API — Ollama beats others for dev workflows
You have a model running in a terminal. That is not yet useful for building anything. The gap between “model works in a chat prompt” and “my Python script can call it” is exactly what this guide closes.
There are three credible ways to expose a local model as an OpenAI-compatible API: Ollama’s built-in server, LM Studio’s developer mode, and llama-server from llama.cpp. All three implement /v1/chat/completions. All three run llama.cpp under the hood — so raw inference speed is essentially identical. The differences are entirely operational, and they matter a lot once you move beyond a single test script.
My take: Ollama is the right answer for 90% of developers building applications. LM Studio is where you go to explore models before committing to one. llama-server is for production headless deployments or situations where you need a single binary with zero dependencies and surgical GPU control. The rest of this guide walks through all three using the same model — llama3.3:8b-q4_k_m, the Q4_K_M quantized 8B build pulled in Part 1 — so you can see the tradeoff directly.
TL;DR
- Building an app? Use Ollama — always-on daemon, model management, request batching, OpenAI-compatible out of the box.
- Evaluating models? Use LM Studio — HuggingFace browser, visual VRAM monitor, click-to-serve simplicity.
- Deploying headless or need LoRA? Use llama-server — ~90MB binary, no runtime dependencies, --n-gpu-layers precision.
This is Part 2 of a series. If you have not set up Ollama and pulled a model yet, start with How to Run Your First Local LLM.
Prerequisites
- Ollama installed with `llama3.3:8b-q4_k_m` already pulled — see Part 1
- Python 3.9 or higher (`python --version` to check)
- `openai` Python package (`pip install openai`)
- For LM Studio sections: a machine with a GUI (not headless)
- For llama-server sections: a downloaded `.gguf` model file
The Shared API Pattern
Before diving into each runtime, there is one design principle worth front-loading because it shapes everything else in this guide. All three runtimes expose the same /v1/chat/completions endpoint. Your Python code does not care which runtime is running behind the port — it only cares about the URL. That means you can swap runtimes without touching application code, as long as you externalize base_url as an environment variable:
# Set this once per runtime — your app code never needs to change
export LOCAL_LLM_BASE_URL="http://localhost:11434/v1" # Ollama
# export LOCAL_LLM_BASE_URL="http://localhost:1234/v1" # LM Studio
# export LOCAL_LLM_BASE_URL="http://localhost:8080/v1" # llama-server
export OPENAI_API_KEY="ollama" # value is ignored by all three runtimes, but the SDK requires it
With those two variables set, your client instantiation is always the same:
import os
from openai import OpenAI
client = OpenAI(
    base_url=os.environ["LOCAL_LLM_BASE_URL"]
    # api_key is read automatically from OPENAI_API_KEY env var
)
None of the three runtimes validate the API key — they accept any non-empty string. But the openai SDK requires the field to be set. Exporting OPENAI_API_KEY=ollama (or any placeholder) lets the client initialize without an explicit parameter, which matches how most production code is written. Use this pattern throughout your project and switching runtimes in development becomes a one-line environment change, not a code change.
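As a concrete illustration (assuming a hypothetical entry point named app.py), pointing an existing dev session at llama-server instead of Ollama is nothing more than a prefix on the command:

```bash
# Same script, different runtime; only the environment changes
LOCAL_LLM_BASE_URL="http://localhost:8080/v1" python app.py
```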
Part 1 — Ollama: The API Is Already Running
Most developers who install Ollama do not realize this: the moment you run ollama serve — or install Ollama on macOS or Windows where it starts automatically on login — a full HTTP server is live on port 11434. You do not need to do anything extra. The API endpoint has been waiting for you this whole time.
Verify it:
curl http://localhost:11434/api/tags
If you get back a JSON list of your pulled models, the server is up. If you get a connection refused error, start it manually:
ollama serve
Step 1: Test the Native Ollama API
Ollama exposes two API flavors. The first is its own native format at /api/generate and /api/chat. These work fine for simple scripts but are not compatible with the OpenAI SDK — you would need to handle the request and response shapes yourself.
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3:8b-q4_k_m",
"messages": [{"role": "user", "content": "What is 2 + 2?"}],
"stream": false
}'
The native API is useful to know about, but you will almost never reach for it in application code. The endpoint below is what actually matters.
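If you do want to script against the native endpoint anyway, a minimal sketch with the requests library (assumed installed via pip install requests) shows what handling those shapes yourself looks like; note that the reply comes back under message.content rather than the OpenAI-style choices[0].message.content:

```python
import requests

# Native Ollama chat endpoint (not OpenAI-compatible), so the payload is built by hand
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.3:8b-q4_k_m",
        "messages": [{"role": "user", "content": "What is 2 + 2?"}],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()

# Native response shape: the assistant reply lives under "message"
print(resp.json()["message"]["content"])
```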
Step 2: Use the OpenAI-Compatible Endpoint
The endpoint that matters for application development is /v1/chat/completions on the same port. Any code written against the OpenAI API works against Ollama with a single change to base_url. That design decision is what makes Ollama operationally practical for local dev rather than just technically interesting.
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3:8b-q4_k_m",
"messages": [{"role": "user", "content": "What is 2 + 2?"}]
}'
Step 3: Call It from Python
import os
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1"
    # OPENAI_API_KEY env var must be set to any non-empty string
)
response = client.chat.completions.create(
    model="llama3.3:8b-q4_k_m",
    messages=[
        {"role": "user", "content": "Explain tail call optimization in one paragraph."}
    ]
)
print(response.choices[0].message.content)
The model string llama3.3:8b-q4_k_m refers to the Q4_K_M quantized 8B build you pulled in Part 1. Ollama uses this tag to identify the model in its local registry — it maps directly to the .gguf file on disk.
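If you are not sure of the exact tag string on your machine, ollama list prints every model in the local registry together with its tag; whatever appears there is what goes in the model field:

```bash
ollama list
```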
Step 4: Keep the Model Warm
By default, Ollama unloads a model from VRAM after 5 minutes of inactivity. During active development this is a constant irritation — every request after a pause carries a cold-start delay while the model reloads into GPU memory. The fix is a single environment variable:
OLLAMA_KEEP_ALIVE=-1 ollama serve
Other valid values are 24h, 1h, or 30m. Setting -1 means the model stays in VRAM until you explicitly unload it or restart the server. For a development machine where you want instant responses on every request, this is the right default. Add it to your shell profile so it persists across sessions.
Concurrent request handling is where Ollama’s operational design really shows. Ollama supports request batching natively — multiple async calls against the API can be processed simultaneously rather than queuing serially. To set the concurrency level explicitly:
OLLAMA_NUM_PARALLEL=4 ollama serve
The combination of OLLAMA_KEEP_ALIVE=-1 and OLLAMA_NUM_PARALLEL=4 is what I run during active development. Cold starts disappear, and concurrent requests from test scripts do not pile up waiting for the previous call to complete. Once you have worked this way, going back to an on-demand loading server feels broken.
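To see the parallel handling in action, here is a minimal sketch that fires four requests concurrently with the SDK's async client; it assumes the LOCAL_LLM_BASE_URL and OPENAI_API_KEY variables from earlier are set, and the prompts are throwaway placeholders:

```python
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url=os.environ["LOCAL_LLM_BASE_URL"])

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="llama3.3:8b-q4_k_m",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    prompts = [f"Give one interesting fact about the number {n}." for n in range(4)]
    # With OLLAMA_NUM_PARALLEL=4 these decode together instead of queuing serially
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())
```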
Part 2 — LM Studio: GUI-First, API Second
LM Studio is a desktop application — Electron-based, point-and-click, with a built-in model browser that connects to HuggingFace. It is well-suited for finding and evaluating models before committing to one. It is not the right tool for serving an API to an application you are actively building, and understanding why requires a closer look at how it handles requests under load.
That said, it does have a developer server mode, and the workflow is straightforward. Many developers use both Ollama and LM Studio — just not for the same job.
Step 1: Load a Model
Open LM Studio and navigate to the “My Models” tab. If you have the same llama3.3-8b-q4_k_m.gguf file from Part 1, it will appear here. Select it and click “Load.”
The GPU layer slider appears during loading. Drag it toward “Max” to offload as many transformer layers as your VRAM allows — this is LM Studio’s equivalent of --n-gpu-layers in llama-server, presented visually. Watch the VRAM monitor to avoid running out. If the model fails to load, reduce the slider until it fits. The visual VRAM monitor is one of LM Studio’s practical strengths: it gives you immediate feedback during model selection that you do not get anywhere else without running manual benchmarks.
Step 2: Start the Server
Switch to the “Developer” tab and toggle “Start Server.” The default port is 1234. You will see a green indicator and a log confirming the server is active. That is the full setup.
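To confirm the server is listening before wiring up application code, the OpenAI-compatible model listing endpoint gives a quick check (assuming the default port):

```bash
curl http://localhost:1234/v1/models
```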
Step 3: Same Python Code, Different Port
import os
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:1234/v1"
    # OPENAI_API_KEY env var must be set to any non-empty string
)
response = client.chat.completions.create(
    model="llama3.3:8b-q4_k_m",  # if LM Studio does not recognize this tag, use the model identifier shown in its Developer tab
    messages=[
        {"role": "user", "content": "Explain tail call optimization in one paragraph."}
    ]
)
print(response.choices[0].message.content)
The only change from the Ollama code is the port number. That is the entire migration — which is exactly why the environment variable pattern from the intro matters.
What LM Studio Does Not Do Well
LM Studio's server handles concurrent requests serially: it is single-threaded as of LM Studio 0.3.x (tested 2026-03-28). If two processes hit the API simultaneously, one waits. The model also loads on demand rather than being pre-warmed — which means the first request after startup carries a cold-start penalty while the model initializes. For a server you are firing requests at continuously during development, both behaviors become recurring friction points.
There is also the question of source code: LM Studio is closed source. That is not a dealbreaker for most use cases, but it is worth knowing if you are building something where supply chain visibility matters or if you ever need to debug unexpected server behavior.
Where LM Studio earns its place is the model discovery phase. The HuggingFace browser is well-implemented, the ability to visually compare model sizes against your available VRAM makes the selection process faster than pulling models blindly, and the chat interface lets you run quick qualitative evaluations before committing to a model. For exploring a dozen quantization options in an afternoon, it is the best tool available. The point is not to avoid LM Studio — it is to not mistake “good for exploring models” for “good for serving an API.”
:::callout type=tip
A workflow many developers land on: spend an hour in LM Studio narrowing down model candidates, then ollama pull the winner for actual development. You get LM Studio’s model-browsing strengths without inheriting its concurrency limitations during the build phase.
:::
Part 3 — llama-server: One Binary, Full Control
llama-server is the HTTP server that ships directly with llama.cpp. It is a single statically-linked binary — no Electron runtime, no model registry, no background daemon. You point it at a .gguf file and it starts serving. The binary comes in around 90MB. That contrast with a full Electron desktop app or Ollama’s bundled tooling and runtime is not just aesthetic — it determines where you can actually deploy the thing. If you have ever tried to install a desktop Electron app inside a Docker container, you already know why having a 90MB statically-linked binary matters.
This is the tool for headless deployments, Docker containers, CI environments, and any situation where you need precise control over GPU layer allocation or want zero operational overhead beyond the process itself.
Step 1: Get the Binary
The fastest path on most systems:
# macOS (Homebrew)
brew install llama.cpp
# Or pull the pre-built Docker image
docker pull ghcr.io/ggml-org/llama.cpp:server
For GPU-specific builds — CUDA, Vulkan, Intel — the llama.cpp GitHub releases page provides pre-built binaries tagged server-cuda, server-vulkan, and so on; grab the current build matching your hardware. The naming convention is consistent across releases.
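If you are going the Docker route instead, the server image passes its arguments straight through to llama-server, so the run command is just the usual flags plus a volume mount for your models. A sketch, assuming your .gguf files live in ~/models (add --gpus all only if you pulled a CUDA-capable variant):

```bash
docker run -p 8080:8080 -v ~/models:/models \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/llama3.3-8b-q4_k_m.gguf --port 8080 --host 0.0.0.0
```

The --host 0.0.0.0 flag matters inside a container; without it the server binds to localhost and the published port never answers.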
Step 2: Start the Server
llama-server \
-m ~/models/llama3.3-8b-q4_k_m.gguf \
--port 8080
The server starts, loads the model, and begins accepting requests. That is the entire setup — no configuration files, no model registry, no pull commands.
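Before pointing application code at it, you can check readiness: llama-server exposes a /health endpoint that returns an error while the model is still loading and a 200 once it is ready to serve.

```bash
curl http://localhost:8080/health
```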
For GPU offloading, add --n-gpu-layers:
llama-server \
-m ~/models/llama3.3-8b-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 33
The -ngl shorthand does the same thing. Setting -ngl -1 offloads all layers automatically — this is what you want if your VRAM can hold the full model. For a Q4_K_M quantization of an 8B model you need roughly 5–6GB of VRAM to fit everything. Layers not offloaded to GPU run on CPU, which is slower but functional. If the server fails to start with out-of-memory errors, reduce --n-gpu-layers incrementally until it loads.
For multi-GPU setups, --tensor-split controls the allocation ratio across devices:
llama-server \
-m ~/models/llama3.3-8b-q4_k_m.gguf \
--port 8080 \
--tensor-split 3,1
The 3,1 value splits the model across two GPUs in a 3:1 ratio: the first device gets roughly three quarters of the layers, the second gets the rest. Adjust the ratio to match the relative VRAM of your cards.
Step 3: Same Python Code, Different Port
import os
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1"
    # OPENAI_API_KEY env var must be set to any non-empty string
)
response = client.chat.completions.create(
    model="llama3.3:8b-q4_k_m",  # with a single model loaded, llama-server serves it regardless of this value
    messages=[
        {"role": "user", "content": "Explain tail call optimization in one paragraph."}
    ]
)
print(response.choices[0].message.content)
Port changes. Code does not. This is the pattern across all three runtimes — and if you externalized LOCAL_LLM_BASE_URL as shown at the top of this guide, not even the port change touches your application code.
Router Mode for Multiple Models
Recent versions of llama-server add a router mode — start the server without -m and point it at a directory of models. Models load on demand and are evicted LRU-style when memory fills:
llama-server --models-dir ./models --port 8080
Models in ~/.cache/llama.cpp are also auto-discovered by default. This makes llama-server viable as a lightweight multi-model serving layer without any of Ollama’s bundled tooling — useful if you want model-switching without pulling in a full daemon or managing a model registry.
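To confirm which models the router actually discovered, list them through the standard OpenAI-compatible endpoint that llama-server also implements:

```bash
curl http://localhost:8080/v1/models
```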
LoRA Support
If you are working with fine-tuned LoRA adapters, llama-server is currently your only option among the three. Pass the adapter at startup:
llama-server -m ~/models/llama3.3-8b-q4_k_m.gguf --lora adapter.gguf --port 8080
LoRA support in llama-server is experimental as of early 2026. Ollama and LM Studio do not support LoRA adapters in their current versions. If you are fine-tuning models and need to serve them locally, this is where llama-server stops being an optional alternative and becomes the only path.
Head-to-Head Comparison
| | Ollama | LM Studio | llama-server |
|---|---|---|---|
| Installation | CLI tool | Desktop app (Electron) | Single binary (~90MB) |
| Time to API live | Seconds (after pull) | 2–3 minutes (GUI load) | Under 30 seconds |
| Concurrent requests | Batching, configurable parallel | Single-threaded (v0.3.x) | Stateless, efficient |
| Model management | ollama pull, Modelfile, registry | GUI browser, HuggingFace | Manual or Router Mode |
| Runs as daemon | Yes (ollama serve) | No (GUI or CLI process) | No (single process) |
| GPU layer control | Via Modelfile or API options | GUI slider | --n-gpu-layers, --tensor-split |
| LoRA support | No | No | Experimental |
| Docker support | Official image | No official image | Official image, GPU variants |
| OpenAI compat | /v1/chat/completions | /v1/chat/completions | /v1/chat/completions |
| Source | Open source | Closed source | Open source |
| Disk footprint | Bundled tooling + runtime | Full Electron app | ~90MB binary |
The concurrency row deserves more than a table cell. Ollama handles batching natively and can process multiple requests simultaneously — OLLAMA_NUM_PARALLEL lets you tune exactly how many. LM Studio queues concurrent requests serially as of v0.3.x: the second request waits for the first to complete. llama-server sits between them — it keeps its model resident for the life of the process and serves requests efficiently, but parallel decoding is something you opt into yourself (via its --parallel flag) rather than something a managed daemon handles for you. For a development server fielding requests from test scripts running in parallel, this difference shows up immediately and keeps showing up.
The disk footprint row also deserves a caveat: the installed size of Ollama depends on your OS and the models you pull. The directional contrast with llama-server’s ~90MB binary holds regardless of the exact numbers — you are comparing a single binary against a full runtime with bundled tooling.
The Verdict
The three tools serve three distinct use cases, and most developers end up touching all three at different points in a project. The shared API surface means your application code stays the same regardless of which runtime is behind the port — that is a feature worth taking seriously, and the environment variable pattern makes it trivially easy to act on.
Use Ollama if you are building an application. It runs as a background daemon, keeps models warm between requests with OLLAMA_KEEP_ALIVE=-1, handles concurrent traffic with batching, and its model registry means you can ollama pull any community model and have it available via API in under a minute. For a local development workflow, it is operationally superior to the alternatives. The always-on nature means you never wait for a cold start in the middle of a test run, and the overhead of running it is minimal once it is set up.
Use LM Studio if you are evaluating models. When you need to compare a dozen quantizations across different model families before committing to one, LM Studio’s HuggingFace browser and visual VRAM monitor make the selection process faster than pulling models blindly into Ollama and hoping they fit. The practical pattern I keep seeing: spend an hour in LM Studio narrowing down candidates, then pull the winner into Ollama for actual development. You get the best of both without getting stuck with LM Studio’s concurrency limitations during the build phase.
Use llama-server if you are deploying headless or need fine-grained control. A ~90MB binary with no runtime dependencies is the right answer for a Docker container, a CI environment, a remote server you are SSHing into, or any context where you want zero operational overhead. If you are working with LoRA adapters, it is currently your only option among the three. The --n-gpu-layers and --tensor-split flags give you layer-level GPU allocation that Ollama’s higher-level abstractions do not expose.
Key Configuration Reference
# Ollama — keep model permanently in VRAM
OLLAMA_KEEP_ALIVE=-1 ollama serve
# Ollama — parallel request handling
OLLAMA_NUM_PARALLEL=4 ollama serve
# Set your runtime via environment variable (swap without touching app code)
export LOCAL_LLM_BASE_URL="http://localhost:11434/v1" # Ollama
# export LOCAL_LLM_BASE_URL="http://localhost:1234/v1" # LM Studio
# export LOCAL_LLM_BASE_URL="http://localhost:8080/v1" # llama-server
export OPENAI_API_KEY="ollama" # required by SDK, not validated by any of the three
# llama-server — GPU offloading (all layers)
llama-server -m ~/models/llama3.3-8b-q4_k_m.gguf -ngl -1 --port 8080
# llama-server — multi-GPU split (3:1 ratio)
llama-server -m ~/models/llama3.3-8b-q4_k_m.gguf --tensor-split 3,1 --port 8080
# llama-server — router mode (multi-model auto-discovery)
llama-server --models-dir ./models --port 8080
# llama-server — with LoRA adapter
llama-server -m ~/models/llama3.3-8b-q4_k_m.gguf --lora adapter.gguf --port 8080
Troubleshooting
Connection refused on port 11434
Cause: Ollama server is not running.
Fix:
ollama serve
# or check if it's already running as a background process
ps aux | grep ollama
Model unloads between requests (slow cold starts)
Cause: Default OLLAMA_KEEP_ALIVE is 5 minutes.
Fix:
export OLLAMA_KEEP_ALIVE=-1
ollama serve
Add the export to your shell profile (~/.zshrc or ~/.bashrc) so it persists across terminal sessions.
llama-server: model is too large for available VRAM
Cause: Model does not fit fully in GPU memory with the current -ngl setting.
Fix:
# Reduce GPU layers until it fits — start low and increase
llama-server -m ~/models/llama3.3-8b-q4_k_m.gguf --n-gpu-layers 20 --port 8080
Layers not offloaded to GPU fall back to CPU — slower, but the model will run. Reduce --n-gpu-layers incrementally until you find a value that loads without errors. For a Q4_K_M 8B model on a GPU with 6GB VRAM, 33 layers is typically the ceiling before overflow.
LM Studio server returns 503 on first request
Cause: Model is still loading — LM Studio loads on demand, not pre-warmed.
Fix: Wait for the loading indicator in the Developer tab to complete before sending the first request. For application code that might hit the server immediately after startup, add a retry with exponential backoff on the initial connection attempt. This is a structural limitation of LM Studio’s on-demand loading, not a configuration issue you can tune away.
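A minimal sketch of that retry pattern, assuming the openai SDK from earlier (the attempt count and delays are arbitrary starting points, not LM Studio recommendations):

```python
import time

from openai import APIConnectionError, APIStatusError, OpenAI

client = OpenAI(base_url="http://localhost:1234/v1")

def chat_with_retry(prompt: str, max_attempts: int = 5) -> str:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            response = client.chat.completions.create(
                model="llama3.3:8b-q4_k_m",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except (APIStatusError, APIConnectionError):
            if attempt == max_attempts:
                raise
            # Server is up but the model is likely still loading; back off and retry
            time.sleep(delay)
            delay *= 2

print(chat_with_retry("Say hello in five words."))
```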
openai.AuthenticationError when connecting to local runtime
Cause: The openai SDK requires OPENAI_API_KEY to be set — it raises an error if the variable is missing or empty, even though the local runtimes never validate the value.
Fix:
export OPENAI_API_KEY="ollama"
Any non-empty string works. "ollama" is the conventional placeholder you will see across examples online. Set it in your shell profile so it is always available.
Next Steps
Once your local API is running, the natural next direction is understanding where local models fit in a multi-model architecture. Local APIs eliminate per-token costs entirely for high-volume workloads — the Agent Pipeline Cost Optimization guide walks through how to think about that tradeoff in practice. For broader context on why running models locally matters beyond cost, Open Source vs Proprietary AI Models covers the tradeoffs that go beyond what any API surface can show you.