Lessons from the trenches: why llama.cpp works best (today)

Why llama.cpp beats vLLM for running gpt-oss models locally

We’ve spent the past few months knee-deep in the messy reality of adapting our application to run on local LLMs. On paper it should have been simple: swap the API endpoint, keep everything else the same, and enjoy privacy and control on your own hardware. In practice it turned into weeks of trial, error and head-scratching. Production-grade engines like vLLM kept tripping over themselves, while the supposedly humble llama.cpp just worked with the new GPT-OSS models. For now – and this might change tomorrow – llama.cpp is ahead.

So much for “OpenAI-compatible”

Omniscope was built against OpenAI’s APIs. Our AI features include:

  • Report Ninja for dashboards and data questions
  • Workflow Ninja for explaining ETL workflows
  • Data Q&A for natural language queries across multiple tables

We relied heavily on OpenAI’s APIs and their features:

  • Chat Completions API and the newer Responses API
  • Structured outputs in two flavours: strict JSON schemas and the older loose “JSON object mode”
  • Streaming tokens for responsive UX
  • Multi-role conversations (system, user, assistant, sometimes developer)
  • Reasoning outputs and reasoning effort controls
  • Tool calls (function calling) with strict JSON schema arguments
  • Configurable temperature and verbosity

So in theory, all we had to do was point from https://api.openai.com to http://localhost:8000. That’s what the ecosystem promises.
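In code, the promise looks like a one-line change: the same request body goes to either endpoint, and only the URL (and model name) differs. A minimal sketch – the model names, system prompt and question are illustrative:

```python
import json

# The same Chat Completions payload is supposed to work against both
# endpoints; only the base URL and model name should change.
OPENAI_URL = "https://api.openai.com/v1/chat/completions"
LOCAL_URL = "http://localhost:8000/v1/chat/completions"

def build_request(model: str, question: str) -> str:
    """Build an OpenAI-style Chat Completions body as a JSON string."""
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a data analyst."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.2,
        "stream": True,
    }
    return json.dumps(body)

# In theory, nothing else in the application needs to know which is which:
cloud_request = build_request("gpt-4o", "Summarise sales by region.")
local_request = build_request("gpt-oss-20b", "Summarise sales by region.")
```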

Of course, it wasn’t that tidy. Different engines interpret the API differently. Some features were missing, others half-working. Some calls crashed outright. We ended up adding a whole layer of configuration hints in Omniscope just to cope.

And why? Because the stack isn’t just the model. There’s the model, with its templates and tokeniser quirks, and there’s the engine, with its parser and output rules. If those don’t line up, you get chaos.

Why things fall apart

A typical request has three steps:

  1. Format the prompt. The engine applies the model’s chat template (often a Jinja2 template). If it’s wrong, key messages vanish. We saw templates that dropped system messages unless they were the very first message.
  2. The model generates tokens, ideally respecting the format’s special markers.
  3. Engine parses the output. For reasoning or tool use, the model emits special markers (<|call|>, <|return|>, <|end|>). The engine must split these correctly and return proper JSON.
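Step 3 is where engines diverge most. A toy illustration of the kind of splitting an engine must do – the marker names come from the GPT-OSS format, but the logic is our simplification, not any engine’s actual code (real engines match on token IDs, not strings):

```python
def split_final_and_stop(raw: str) -> tuple[str, str]:
    """Split raw model output into (text, stop_reason) at the first
    recognised control marker.  Toy version: real engines operate on
    token IDs, handle streaming, and track channels."""
    for marker, reason in (("<|call|>", "tool_call"),
                           ("<|return|>", "end_turn"),
                           ("<|end|>", "end_message")):
        if marker in raw:
            text, _, _ = raw.partition(marker)
            return text, reason
    # No marker recognised: the engine thinks the model just ran out of
    # tokens, and the caller sees truncated or runaway output.
    return raw, "length"

text, reason = split_final_and_stop('{"query": "sales_by_region"}<|call|>')
```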

Each step can break in fun ways. One example: an EOS token like <|end|> might map to ID 20001 in the model’s tokeniser. If the engine doesn’t recognise that, you either get endless babble or premature cut-offs. Another: “banana” might split into two tokens (“ban” and “ana”), which means the model’s and engine’s tokenisation tables must match or you’ll misparse output.
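The EOS mismatch is easy to reproduce in miniature (the token IDs here are made up for illustration):

```python
# A generation loop only stops cleanly if the engine's stop set contains
# the ID the model actually emits for its EOS marker.
MODEL_EOS_ID = 20001  # what this model emits for <|end|> (illustrative)

def generate(token_stream, stop_ids, max_tokens=8):
    """Toy decoding loop: returns (tokens, finish_reason)."""
    out = []
    for tok in token_stream:
        if tok in stop_ids:
            return out, "stop"      # clean finish
        out.append(tok)
        if len(out) >= max_tokens:
            return out, "length"    # babble until the hard limit
    return out, "length"

stream = [101, 102, MODEL_EOS_ID, 103, 104]
_, good = generate(iter(stream), stop_ids={MODEL_EOS_ID})  # engine knows the ID
_, bad = generate(iter(stream), stop_ids={2})              # engine expects a different ID
```

With the right stop set the loop ends at the marker; with the wrong one, the EOS token is treated as ordinary text and generation runs past it.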

Different model families handle this differently. DeepSeek used literal <think> tags in its completions. Older LLaMA models stuck closer to ChatML. GPT-OSS introduced something entirely new.

Harmony: great idea, industry not ready

OpenAI’s GPT-OSS models (gpt-oss-20b, gpt-oss-120b) use Harmony, a formal token format for conversations. Every role, channel, and boundary is explicitly tagged. The model can write private thoughts in an analysis channel, prepare an action or tool call in commentary, and finish with a final user-facing message. Control tokens like <|call|> and <|return|> tell the engine when to hand over to a tool or end the turn.

It’s neat, and designed for the Responses API. Much better than ad-hoc <think> tags. But only if your engine supports it.
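To make the format concrete, here is roughly what one assistant turn looks like when rendered. This is a naive string-level sketch based on the published Harmony spec – in practice rendering goes through the openai-harmony library and the tokeniser, and tool calls also carry a recipient header we omit here:

```python
def render_assistant_turn(analysis, tool_call=None, final=None) -> str:
    """Simplified Harmony-style rendering of one assistant turn.
    Illustrative only; use the openai-harmony library for real work."""
    # Private chain-of-thought goes to the analysis channel.
    parts = [f"<|start|>assistant<|channel|>analysis<|message|>{analysis}<|end|>"]
    if tool_call is not None:
        # Tool calls go via commentary; <|call|> hands control to the tool.
        parts.append(
            f"<|start|>assistant<|channel|>commentary<|message|>{tool_call}<|call|>")
    if final is not None:
        # The user-facing answer; <|return|> ends the turn.
        parts.append(
            f"<|start|>assistant<|channel|>final<|message|>{final}<|return|>")
    return "".join(parts)

turn = render_assistant_turn(
    "User wants totals; query the table before answering.",
    tool_call='{"name": "run_query", "arguments": {"table": "sales"}}')
```

The engine’s job is to strip the analysis channel, dispatch the commentary payload to a tool, and surface only the final channel – which is exactly where things went wrong.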

And that’s the rub: when we tested GPT-OSS, the models were fine, but engines weren’t quite there yet.

We ran gpt-oss-20b both on vLLM and on llama.cpp. The results could not have been more different.

Our misadventures with vLLM

vLLM is designed for throughput on server hardware. It has clever scheduling and batching, and in theory should be the right tool for production. In practice with GPT-OSS it drove us up the wall.

  • Function calls: often ignored, or cut short. Sometimes replaced with a hallucinated answer. Multi-step tool use almost never worked. The vLLM docs themselves warn that function calling is still a work in progress.
  • Structured outputs: schema enforcement was patchy. Sometimes extra text slipped into what should have been pure JSON.
  • Harmony parsing: fragile. Complex prompts would freeze mid-way. The model would start reasoning, maybe begin a tool call, and then vLLM would stop cold. We saw logs where the stop reason was token 20012, which is actually <|end|>.

We asked questions like “How have seasonal trends in online sales shifted over the past five years, and what impact did promotions have on basket size?”

The assistant immediately returned a confident narrative answer, with trends and numbers. Looked great. Except no tool call had been made. It hadn’t touched the dataset. When challenged, it even invented an “execution plan” in JSON to justify itself. Only on the third attempt did it run the actual query.

That’s not just hallucination. That’s orchestration failing because the engine mis-handled Harmony.
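The only application-side defence we found against this pattern is to refuse answers that skip the tool. A sketch of such a guard – the response shape follows the OpenAI Chat Completions schema, but the reject-and-retry policy is our own:

```python
def require_tool_call(choice: dict) -> dict:
    """Reject a confident narrative that never queried the data.
    `choice` is one entry of a Chat Completions `choices` array."""
    msg = choice.get("message", {})
    if msg.get("tool_calls"):
        return msg  # fine: the model actually asked to run a query
    raise RuntimeError(
        "Model answered a data question without calling a tool; "
        "retry or surface an error instead of trusting the text.")

# A phantom answer like the one above gets caught instead of displayed:
phantom = {"message": {"content": "Basket size rose 12% each winter...",
                       "tool_calls": None}}
```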

Configuration didn’t help much. You might need flags like:

--max-model-len 27000
--gpu-memory-utilization 0.9
--reasoning-parser openai
--enable-auto-tool-choice
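Assembled into a launch command, that looked roughly like this (the model name is the Hugging Face repo; flag spellings as in the vLLM docs at the time – note that current vLLM may also demand a matching --tool-call-parser alongside auto tool choice):

```shell
vllm serve openai/gpt-oss-20b \
  --max-model-len 27000 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser openai \
  --enable-auto-tool-choice
```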

For DeepSeek you’d swap the parser to deepseek_r1. For Qwen, something else again. Documentation lags reality, so half the time we were trawling GitHub issues.

And we couldn’t debug properly. We wanted to see raw input text after the Jinja template, and raw output tokens before parsing markers. vLLM doesn’t expose this. Without that visibility, you’re guessing.

It’s no wonder someone on Reddit summed up vLLM as “error whack-a-mole – just run llama.cpp”.

Llama.cpp just worked

Then we tried llama.cpp: same model, same machine, using its llama-server CLI, which mimics the OpenAI API.

And it just worked. Multi-step function calls, structured outputs, reasoning traces. No premature stops, no phantom answers.
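The entire server setup on our side reduced to a single command (the GGUF filename and context size are illustrative; per the llama.cpp README, --jinja enables the model’s chat template, which tool use relies on):

```shell
# Serve an OpenAI-compatible API on localhost:8000
llama-server -m gpt-oss-20b.gguf --port 8000 -c 27000 --jinja
```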

It even felt faster, probably thanks to quantisation and the simplicity of the codepath.

We tested on a MacBook Pro M2 and on a Google Cloud g2-standard-4 VM with an L4 GPU. On both, llama.cpp handled GPT-OSS 20B without complaint.

Other models were mixed. DeepSeek and Qwen worked for basic chat but their special markers leaked through. LLaMA 3.1 was mostly fine, though its template occasionally ignored system messages. But GPT-OSS was the one we cared about most, and here llama.cpp beat vLLM hands down.

llama.cpp isn’t perfect. It doesn’t scale like vLLM. But for an interactive app like ours, reliability mattered more than maximum throughput. And reliability is what we got.

Things we learnt the hard way

  • Don’t assume “OpenAI-compatible” means fully working. Engines differ.
  • Without raw input/output logs you’re flying blind. It’s impossible to tell whether the fault lies in the prompt, the model’s own quirks, the tokeniser, the chat template, the engine’s parsing, or even our app.
  • Models can and will cheat if the engine lets them. We saw one skip tools entirely and fabricate both results and a fake plan.
  • Community threads often beat official docs. A GitHub issue often had the answer long before the README. You might even want to dive into the engine source code – sometimes that was the only way to confirm how a flag actually behaved.
  • Everything changes weekly. Engines are scrambling to catch up with models. What broke in July may be fixed by September. But then something else might be broken.
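The visibility point is the easiest one to act on yourself. A thin wrapper that records exactly what crosses the wire costs a few lines – here the `send` callable stands in for whatever HTTP client you actually use:

```python
import json
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("llm-io")

def logged_call(send, payload: dict) -> dict:
    """Log the raw request and raw response around an LLM call.
    `send` is any callable that takes and returns a dict."""
    log.debug("request: %s", json.dumps(payload))
    response = send(payload)
    log.debug("response: %s", json.dumps(response))
    return response

# Stub transport standing in for a real HTTP client:
echo = lambda p: {"echo": p["messages"][-1]["content"]}
result = logged_call(echo, {"messages": [{"role": "user", "content": "hi"}]})
```

It won’t show you what the engine does internally after the Jinja template or before marker parsing, but at least it pins down your side of the conversation.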

Cutting to the chase

For a real-world complex application, the simplest thing that works is usually the right choice. Right now that’s llama.cpp.

vLLM and others will improve, no doubt. But if you want to run GPT-OSS locally today without weeks of grief, start with llama.cpp.

Versions we tested

  • vLLM: via Docker on Linux using the v0.10.2 tag
  • llama.cpp: via Docker on Linux using the server-cuda-b6485 tag, and directly on macOS via Homebrew (same version)

What’s next

We’re continuing to expand Omniscope’s AI features – Report Ninja, Workflow Ninja, and Data Q&A – with local model support. If you’d like to try them, drop us a line at support@visokio.com. We’re keen to get feedback from people building and testing in the same trenches.
