15 Sep 🚀 Not for the Faint-Hearted: Diving Deep into GPT-OSS
OpenAI recently dropped something unusual: open-weight models – GPT-OSS-20B and GPT-OSS-120B.
Open weights matter because you can download the raw model files and run them yourself, whether that’s on your laptop or a GPU cluster in the cloud. No API calls, no third-party provider, no data leaving your environment.
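To make that concrete, here's a minimal sketch of a fully local run via HuggingFace Transformers. It assumes the openai/gpt-oss-20b checkpoint from the Hub and enough memory to hold it; exact kwargs and output handling may differ for your setup.

```python
# Minimal local-run sketch (assumes the openai/gpt-oss-20b checkpoint and
# enough GPU/CPU memory to hold it).
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # use whatever dtype the checkpoint ships with
    device_map="auto",    # spread weights across available devices
)

messages = [
    {"role": "user", "content": "Summarise this quarter's sales in one sentence."},
]

# The model's chat template handles prompt formatting for us.
out = pipe(messages, max_new_tokens=256)

# For chat-style inputs the pipeline returns the conversation with the new
# assistant turn appended at the end.
print(out[0]["generated_text"][-1]["content"])
```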
On paper, these models bring some useful features:
- Apache 2.0 licence
- 128k context window
- Native quantisation (MXFP4)
- Roles, tool use, code execution, variable-effort reasoning
- Capability roughly in line with OpenAI’s own mid-tier closed models
So far so good. But running them inside a production tool like Omniscope is a different beast.
What We Tried
We put GPT-OSS through its paces using llama.cpp, vLLM, HuggingFace Transformers, and LM Studio, across everything from a MacBook Air to an H100 in the cloud.
Some things we ran into:
1. Harmony tags break structured outputs
The models sometimes emit their new “harmony” chat tags mid-response. If your grammar isn’t expecting them, structured outputs come back as noise. The fix: extend your JSON/GBNF grammar to tolerate the tags (or strip them in post-processing), or accept the chaos.
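For reference, a hedged post-processing sketch: it assumes the raw completion still carries harmony channel markers (token names as in the published harmony format, though your engine may already strip or rename them) and pulls the JSON payload out of the final channel before parsing.

```python
import json
import re

# Capture the body of the "final" harmony channel, stopping at the next
# channel marker or end-of-message token if one is present.
FINAL_CHANNEL = re.compile(
    r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|<\|return\|>|$)",
    re.DOTALL,
)

def extract_structured_output(raw: str) -> dict:
    """Prefer the 'final' harmony channel; fall back to the raw text."""
    match = FINAL_CHANNEL.search(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload.strip())
```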
2. Local vs. cloud performance is very different
- On an M2 Max MacBook Pro, the 20B runs at around 74 tokens/sec. Surprisingly usable.
- On an H100, things are blisteringly fast – but only if your inference engine isn’t bottlenecked by the CPU.
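If you want to sanity-check your own numbers, here's a rough tokens/sec probe against any local OpenAI-compatible endpoint (llama.cpp's llama-server, vLLM and LM Studio all expose one). The URL and model name are placeholders for your setup, and it counts streamed deltas rather than tokenizer tokens, so treat the figure as approximate.

```python
import time
from openai import OpenAI

# Point this at your local server; the API key is usually ignored locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

start, chunks = time.perf_counter(), 0
stream = client.chat.completions.create(
    model="gpt-oss-20b",  # placeholder: whatever name your server registered
    messages=[{"role": "user", "content": "Explain MXFP4 quantisation briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # roughly one token per streamed content delta

elapsed = time.perf_counter() - start
print(f"~{chunks / elapsed:.1f} tokens/sec (approximate)")
```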
3. Reasoning output is inconsistent
Different engines do different things: strip it out, pass it through, or scramble it. Fixing it usually means digging into inference server code and adjusting your connectors.
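Our connector-side workaround boils down to something like the sketch below: prefer a separate reasoning field if the engine provides one (vLLM's OpenAI-compatible server can, when a reasoning parser is enabled), otherwise strip leaked analysis-channel text out of the main content. The field and tag names here are assumptions about your particular engine.

```python
import re

# Analysis-channel text that leaked into the main content, harmony-style markers assumed.
ANALYSIS = re.compile(
    r"<\|channel\|>analysis<\|message\|>(.*?)(?=<\|channel\|>|<\|end\|>|$)",
    re.DOTALL,
)

def split_reasoning(choice) -> tuple[str, str]:
    """Return (reasoning, answer) regardless of how the engine delivered them."""
    msg = choice.message
    # Case 1: the engine already separates reasoning into its own field.
    reasoning = getattr(msg, "reasoning_content", None) or ""
    content = msg.content or ""
    # Case 2: analysis-channel text is inline; pull it out of the answer.
    if not reasoning:
        reasoning = "\n".join(ANALYSIS.findall(content))
        content = ANALYSIS.sub("", content)
    return reasoning.strip(), content.strip()
```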
Takeaway
This is the messy side of working with open-weight models. Demos look smooth, but real workflows mean handling quirks in:
- APIs and event streams
- Prompt templates and structured outputs
- GPU utilisation and engine bottlenecks
That’s the work we do at Visokio every day: break it, patch it, and push until it runs inside Omniscope.
Your Turn
Tried GPT-OSS yet?
- Which engine tripped you up most?
- Found a reliable way to handle harmony tags?
- Any tips for squeezing more out of 120B without cooking the GPU?
We’d love to compare notes.
👉 Follow us if you want to keep up with how we’re bringing open-weight LLMs into real workflows.
