Engineering for determinism: a tale of two local LLM inference engines
Local large language models are often presented as creative tools. They generate fluent prose, infer intent, and fill in gaps with impressive confidence. That framing works well for chatbots and copilots. It breaks down in a different class of application: one where an AI agent is tightly integrated into the execution path of the software itself.
In a sophisticated AI-enabled application, an LLM is not free to improvise. It emits structured artefacts that other parts of the system must consume and trust: configuration objects, execution plans, tool invocation payloads, or UI state definitions. A single malformed response can surface directly to a user as a broken interface, an inexplicable error, or a loss of confidence in the product.
This article is about what happens when you try to integrate local LLMs into that kind of environment. It is not a benchmark and not a performance comparison. Instead, it is a field report from integrating local inference engines into a real product, and discovering that determinism, correctness, and diagnosability depend less on model intelligence than on exactly where and when constraints are applied during generation.
The concrete case study here is Omniscope, which integrates LLMs into interactive BI workflows such as dashboard generation and data question-answering. BI makes the failure modes particularly visible, but the underlying problems apply far more broadly to any application where an AI agent manipulates structured state.

Omniscope’s Report Ninja showing an Instant Dashboard. A real-world example of a tightly integrated AI-enabled application. Outputs here are consumed directly by the UI and underlying data workflows, not treated as free-form text.
At the centre of the story are two widely used local inference engines – llama.cpp and vLLM – and an unexpectedly subtle topic: grammar-constrained decoding.
The setting: local models under real constraints
The experiments described here were driven by product requirements rather than academic curiosity. Omniscope’s AI features involve multi-turn conversations, long prompts with extensive instructions and examples, structured outputs with non-trivial schemas, and tool calls that query data and modify application state. All of this needs to work reliably and repeatedly under user interaction.
The testing environment was intentionally constrained. Everything ran on a single NVIDIA H100 80 GB GPU, pushing large open-weight models close to their practical limits. Models such as gpt-oss-120b and Llama 3.3 70B are large enough that memory headroom, fragmentation, and runtime behaviour matter. Several of the failure modes described below only became visible once these systems were exercised under sustained, production-like pressure.
The goal was not to find the fastest configuration or the largest possible context window. It was to understand whether local inference engines could provide deterministic, diagnosable guarantees under realistic workloads.
Why determinism matters
In tightly integrated AI-enabled applications, the output of an LLM is rarely the final product. It is usually an intermediate artefact consumed programmatically by other components.
If that artefact deviates from its expected structure, downstream systems cannot recover gracefully. Unlike a chat interface, there is no opportunity to interpret the output charitably. Either the structure is valid, or it is not.
One common mitigation is a validation-and-feedback loop: detect invalid output, feed an error back to the model, and ask it to try again. Under realistic workloads this approach breaks down. Each retry adds latency, compounds failure modes, and degrades the user experience, particularly with long prompts, large context windows, or multi-turn interaction where retries can cascade. In practice, this also increases operational complexity, as downstream systems must distinguish between transient model errors and genuine user intent.
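A minimal sketch of that loop makes the cost concrete. The `client.complete` and `validate` hooks below are hypothetical placeholders rather than any particular engine’s API; the point is the shape of the loop, not the details:

```python
import json

def generate_with_retries(client, messages, validate, max_retries=3):
    """Validation-and-feedback loop (the pattern discussed above).

    `client.complete` and `validate` are hypothetical placeholders; the point
    is the shape of the loop, not a specific API.
    """
    for _ in range(max_retries):
        reply = client.complete(messages)
        try:
            artefact = json.loads(reply)
            validate(artefact)  # e.g. JSON Schema validation
            return artefact
        except Exception as error:
            # Each retry replays the full (often long) prompt, adds a round of
            # latency, and gives the model another chance to fail differently.
            messages = messages + [
                {"role": "assistant", "content": reply},
                {"role": "user", "content": f"Invalid output ({error}); please try again."},
            ]
    raise RuntimeError("No valid artefact after retries")
```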
After careful prompting with clear and succinct examples, the most reliable single-pass approach is to prevent invalid output from being generated in the first place. Grammar-constrained decoding builds on that foundation: it can only enforce correctness once the model already understands the shape and intent of the output it is expected to produce.
A brief detour: what grammar enforcement actually means
At a high level, LLMs generate text token by token. At each step, the model assigns probabilities to all possible next tokens, and one token is selected.
Without constraints, the model is free to choose any token in the vocabulary at any step. Prompting and examples influence probabilities, but they do not remove invalid options. The model can always decide to emit something that breaks structure.
Grammar-constrained decoding changes this process fundamentally. Before sampling, the inference engine removes any tokens that would violate a specified grammar. Tokens that would make it impossible to complete a valid structure are never considered. The model simply does not see them as options.
Conceptually, this is a sharper guarantee than post-hoc validation can offer: invalid structures are never generated, rather than detected after the fact. In practice, real systems introduce failure modes of their own: tokenisation mismatches, grammar conversion limits, parser bugs, or engine-level fallbacks can cause grammars to be partially applied or dropped entirely. The distinction remains important, but it is not absolute, and failures at this layer can undermine guarantees in ways that are difficult to detect without careful instrumentation.
Abstracting away implementation details, grammar enforcement aims to ensure that:
- structural delimiters balance
- required fields appear
- impossible continuations are ruled out early
This is enforcement during generation, not validation after the fact. Its value lies in preventing invalid output paths from being explored at all, rather than attempting to recover once they have already occurred.
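As a toy illustration of that idea, a single decoding step with grammar masking might look like the sketch below. The `grammar_state` object and its `allowed_tokens` and `advance` methods are hypothetical stand-ins for whatever incremental parser the engine actually uses (GBNF in llama.cpp, xgrammar in vLLM):

```python
import math

def constrained_decode_step(logits, vocab, grammar_state):
    """One decoding step with grammar masking (toy sketch, not a real engine).

    `grammar_state` is a hypothetical incremental parser that knows which
    tokens can still lead to a valid structure from the current position.
    """
    allowed = grammar_state.allowed_tokens()          # hypothetical API
    masked = [
        logit if token in allowed else -math.inf      # invalid tokens can never be sampled
        for token, logit in zip(vocab, logits)
    ]
    # Greedy choice for brevity; real engines sample from the masked distribution.
    best = max(range(len(vocab)), key=lambda i: masked[i])
    grammar_state.advance(vocab[best])                # hypothetical API: move the parser forward
    return vocab[best]
```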
In practice, grammar enforcement does not operate over a clean stream of user-visible text. It must coexist with the model’s chat template and any special tokens used to delimit messages, tool calls, or internal structure. Models such as the GPT-OSS series emit a structured control stream using the Harmony format, with tokens such as <|start|> and <|message|> embedded directly in generation.

Example of a Harmony-formatted token stream. Grammar constraints are typically triggered at specific boundaries (such as the start of a tool call or final message).
A grammar can either model this entire stream, or be triggered part-way through it — for example at the start of a tool call or final message. Where and how that boundary is drawn turns out to matter a great deal.
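As a rough hand-written approximation (token names follow the published Harmony format, but the exact layout varies by model and chat template, and the dashboard payload is invented for illustration), the boundary might sit like this:

```python
# Hypothetical Harmony-style stream, annotated with where a grammar boundary could be drawn.
HARMONY_STREAM = (
    "<|start|>assistant<|channel|>analysis<|message|>"
    "User wants sales by region as a bar chart."      # free-form reasoning: left unconstrained
    "<|end|>"
    "<|start|>assistant<|channel|>final<|message|>"   # grammar can be switched on at this boundary...
    '{"chart": "bar", "x": "region", "y": "sales"}'   # ...so only schema-valid JSON is emitted here
    "<|return|>"
)
```

Triggering the grammar only at the final `<|message|>` keeps the reasoning channel free-form; modelling the entire stream would require the grammar itself to describe the control tokens.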
The first engine: llama.cpp
llama.cpp has an appealing simplicity. It supports GGUF models, starts easily, and handles multi-turn conversations without fuss. It is explicitly designed to run well on commodity hardware such as a MacBook Pro, whereas vLLM is clearly centred on datacentre-class GPUs; these different design centres help explain later behavioural differences.
In early experiments, llama.cpp simply appeared to work. The AI integration was smooth, the engine configuration was relatively straightforward, and initial tests produced outputs that matched the expected schema. At the time, this gave the impression that structured output constraints were being enforced.
Only later, after constructing targeted test cases and inspecting server logs, did it become clear that much of this apparent correctness came from careful prompting and examples alone, and that in several cases the model was effectively unconstrained rather than grammar-bound.
llama.cpp applies grammar constraints more reliably to tool call arguments than to the assistant’s final message. Tool definitions are converted into grammars, and argument generation is constrained accordingly, subject to known conversion and parsing limitations. By contrast, the assistant’s final message is often not grammar-constrained when tool calling is enabled, and in some configurations grammar enforcement can be bypassed or dropped.
More concerning is a specific failure mode in which grammar construction succeeds but grammar parsing fails — for example due to unsupported regex constructs in a schema. In this case, llama.cpp logs an internal error but proceeds without applying the grammar. The HTTP response still returns 200 OK. From the client’s perspective, nothing has gone wrong, except that guarantees have silently disappeared. Once this had been reproduced deterministically, the behaviour was reported upstream. It is conditional and configuration-dependent, but when it occurs it materially undermines diagnosability.
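One practical consequence is that a client cannot trust the status code alone. A minimal safety net, assuming the JSON Schema sent with the request is still available client-side (`jsonschema` here is the standard Python validator, not anything llama.cpp provides), is to re-validate every response and fail loudly rather than silently accept unconstrained output:

```python
import json
import jsonschema

def ensure_constrained(raw_content: str, schema: dict) -> dict:
    """Client-side safety net for fail-open engines (sketch).

    A 200 OK does not prove the grammar was applied, so the returned content is
    re-checked against the same JSON Schema that accompanied the request.
    """
    artefact = json.loads(raw_content)       # malformed JSON fails here
    jsonschema.validate(artefact, schema)    # structural violations fail here
    return artefact
```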
The second engine: vLLM
vLLM sits at the opposite end of the spectrum. It is powerful, configurable, and optimised for high-throughput serving, but can be difficult to get up and running. Extracting good behaviour from lower-memory devices is much harder, and the design centre of gravity clearly assumes access to server-class GPUs such as multiple H100s.
Structured outputs in vLLM are implemented via xgrammar. When configured correctly, this provides genuine grammar-constrained decoding for JSON schemas. Invalid schemas fail early, and valid schemas produce rigorously constrained output.
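For illustration, a structured-output request against a vLLM OpenAI-compatible server might look like the sketch below. The endpoint, served model name, and exact `response_format` shape are assumptions; details vary across vLLM versions, and older releases expose the same capability through `guided_json` in `extra_body` instead:

```python
from openai import OpenAI

# Assumed local vLLM server; the model name is whatever the server was launched with.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

dashboard_schema = {
    "type": "object",
    "properties": {
        "chart": {"type": "string", "enum": ["bar", "line", "pie"]},
        "x": {"type": "string"},
        "y": {"type": "string"},
    },
    "required": ["chart", "x", "y"],
    "additionalProperties": False,
}

completion = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Plot sales by region as a bar chart."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "dashboard", "schema": dashboard_schema},
    },
)
# When enforcement holds, this content is guaranteed to parse and match the schema.
print(completion.choices[0].message.content)
```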
Tool calling follows a different architectural path. Tool argument schemas are first converted lossily into a textual description and injected into the prompt via the model’s Jinja chat template. This is a known limitation of the current tool-calling path rather than an incidental bug: the schema information is advisory, not enforced by grammar, and complex constraints are intentionally elided. This behaviour is documented and understood within the project, and may evolve over time, but today it reflects a deliberate separation between structured outputs and tool invocation.
Under realistic workloads, a fairly serious vLLM bug also emerged. When using Harmony-based models with structured outputs enabled, multi-turn conversations would consistently fail with null or missing content in the API response. Tokens were generated, but the final message was dropped during response assembly. Once this had been reproduced reliably, it too was reported upstream, because it makes structured outputs incompatible with multi-turn interaction in certain configurations.
Inverted limitations
At this point, a clear pattern emerges.
When both structured outputs and tool calling are enabled, llama.cpp and vLLM enforce opposite halves of the contract.
- llama.cpp constrains tool calls more strongly, but structured output constraints can disappear
- vLLM constrains structured outputs strongly, but tool calls remain advisory
| Capability | llama.cpp | vLLM |
|---|---|---|
| Structured outputs | ⚠️ Conditionally enforced (may fail open) | 🧱 Strongly enforced |
| Tool call arguments | 🧱 Strongly enforced (subject to conversion and parse failures) | 📝 Advisory / prompt-level only |
| Multi-turn interaction | 👍 Generally robust | ⚠️ Fragile with structured outputs in some configurations |
For developers building complex AI integrations, this inversion is the central difficulty. Mixing tool calls and structured outputs feels natural, but in practice it often reduces determinism and diagnosability rather than increasing them.
A pragmatic engineering response is to collapse onto a single enforcement mechanism. In production, this compromise was acceptable because it restored a single, predictable failure surface: either everything was grammar-constrained, or nothing was. In practice, tool calls can be emulated using structured outputs, and structured outputs can be emulated using tool calls, allowing grammar constraints to be applied consistently end-to-end.
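One way to realise that emulation, sketched below rather than taken from Omniscope’s actual schemas, is a discriminated union enforced as a single structured output, so every grammar-constrained response is either a well-formed tool invocation or a final artefact. Engines differ in how much of JSON Schema they support, so the union may need flattening in practice:

```python
# Hypothetical schema: the model must either request a named tool or produce a
# final answer, and the grammar guarantees whichever branch appears is complete.
agent_step_schema = {
    "anyOf": [
        {   # emulated tool call
            "type": "object",
            "properties": {
                "action": {"const": "call_tool"},
                "tool": {"type": "string", "enum": ["query_data", "update_dashboard"]},
                "arguments": {"type": "object"},
            },
            "required": ["action", "tool", "arguments"],
            "additionalProperties": False,
        },
        {   # emulated final message
            "type": "object",
            "properties": {
                "action": {"const": "final_answer"},
                "content": {"type": "string"},
            },
            "required": ["action", "content"],
            "additionalProperties": False,
        },
    ]
}
```

The application then dispatches on the `action` field, and every turn goes through the same single enforcement path.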
This is not a recommendation so much as an observation. It reflects what was required to make real systems robust under pressure, rather than an ideal end state for local LLM infrastructure.
Note that neither engine is perfect. During this work, distinct failure modes were encountered in each engine, and separate bug reports were filed upstream for llama.cpp and vLLM. In both cases, the most damaging failures were those that degraded guarantees without signalling. Silent fail-open behaviour proved more costly in production than explicit hard failure, because it violated contracts while leaving systems unaware that correctness guarantees had been lost.
Closing thoughts
Local LLMs are powerful, and inference engines such as llama.cpp and vLLM are impressive pieces of engineering. When these systems are embedded into professional, AI-enabled applications, however, the bar is different.
One reason this gap can be surprising is historical. When Report Ninja and Data Q&A were first developed in Omniscope, they were built against OpenAI’s hosted models. Those models sit behind the Chat Completions and Responses APIs, which OpenAI originated, and their serving stack provides a complete, first-class implementation of the semantics those APIs imply. Strongly constrained structured outputs and tool calls compose cleanly there, even across multi-turn interactions. This comes with trade-offs — opacity, cost, and platform lock-in — but it also sets clear expectations around semantic guarantees.
The hardest problems are not about model quality or raw performance. They are about contracts, failure modes, diagnosability, and whether guarantees hold consistently across features and modes.
Correctness failures surface as user experience failures: pauses, retries, broken UI states, and loss of trust. That, more than any benchmark, is what determines whether an AI-enabled product feels robust or fragile.

The new Data Q&A experience in Omniscope.
This work exists in service of user experience. When structured outputs and tool calling are robustly constrained, AI-enabled systems stop feeling fragile and start feeling dependable.
A concrete example of what this enables in practice can be seen in Omniscope’s Data Q&A view, which shows how tightly integrated, strongly constrained AI can support rich, explainable, and trustworthy interaction with data.

Glen Moutrie
Posted at 11:37h, 28 January
Really nice piece, reminds me of some of the work done by Thinking Machines (https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/). Also worth checking out some of the open source libraries OpenAI cite on their structured data work, such as jsonformer: https://github.com/1rgs/jsonformer
superstevespencer
Posted at 12:33h, 28 January
Hi Glen, thanks for your feedback, and I’m glad you liked the article. Interesting article link, a little deeper than this one! I recall from their original blog that OpenAI’s structured outputs was inspired in part by jsonformer. We’ve also dabbled with generating our own GBNF grammars for our applications and using them with llama.cpp, quite successfully, before tool calling made it impractical without much more work. Maybe that’s a topic for another article.