15 Sep 🚀 Not for the Faint-Hearted: Diving Deep into GPT-OSS
OpenAI recently dropped something unusual: open-weight models – GPT-OSS-20B and GPT-OSS-120B.
Open weights matter because you can download the raw model files and run them yourself, whether that’s on your laptop or a GPU cluster in the cloud. No API calls, no third-party provider, no data leaving your environment.
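To make that concrete, here's a minimal sketch of a fully local run via HuggingFace Transformers. It assumes the openai/gpt-oss-20b checkpoint from the Hub and enough memory to hold it; exact kwargs and output handling may differ for your setup.

```python
# Minimal local-run sketch (assumes the openai/gpt-oss-20b checkpoint and
# enough GPU/CPU memory to hold it).
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # use whatever dtype the checkpoint ships with
    device_map="auto",    # spread weights across available devices
)

messages = [
    {"role": "user", "content": "Summarise this quarter's sales in one sentence."},
]

# The model's chat template handles prompt formatting for us.
out = pipe(messages, max_new_tokens=256)

# For chat-style inputs the pipeline returns the conversation with the new
# assistant turn appended at the end.
print(out[0]["generated_text"][-1]["content"])
```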
On paper, these models bring some useful features:
- Apache 2.0 licence
- 128k context window
- Native quantisation (MXFP4)
- Roles, tool use, code execution, variable-effort reasoning
- Capability roughly in line with OpenAI’s own mid-tier closed models
So far so good. But running them inside a production tool like Omniscope is a different beast.
What We Tried
We put GPT-OSS through its paces using llama.cpp, vLLM, HuggingFace Transformers, and LM Studio, across everything from a MacBook Air to an H100 in the cloud.
Some things we ran into:
1. Harmony tags break structured outputs
The models sometimes emit their new “harmony” chat tags mid-response. If your grammar isn’t expecting them, structured outputs come back as noise. The fix: extend your JSON/GBNF grammar to tolerate the tags (or strip them in post-processing), or accept the chaos.
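For reference, a hedged post-processing sketch: it assumes the raw completion still carries harmony channel markers (token names as in the published harmony format, though your engine may already strip or rename them) and pulls the JSON payload out of the final channel before parsing.

```python
import json
import re

# Capture the body of the "final" harmony channel, stopping at the next
# channel marker or end-of-message token if one is present.
FINAL_CHANNEL = re.compile(
    r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|<\|return\|>|$)",
    re.DOTALL,
)

def extract_structured_output(raw: str) -> dict:
    """Prefer the 'final' harmony channel; fall back to the raw text."""
    match = FINAL_CHANNEL.search(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload.strip())
```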
2. Local vs. cloud performance is very different
- On an M2 Max MacBook Pro, the 20B runs at around 74 tokens/sec. Surprisingly usable.
- On an H100, things are blisteringly fast – but only if your inference engine isn’t bottlenecked by the CPU.
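If you want to sanity-check your own numbers, here's a rough tokens/sec probe against any local OpenAI-compatible endpoint (llama.cpp's llama-server, vLLM and LM Studio all expose one). The URL and model name are placeholders for your setup, and it counts streamed deltas rather than tokenizer tokens, so treat the figure as approximate.

```python
import time
from openai import OpenAI

# Point this at your local server; the API key is usually ignored locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

start, chunks = time.perf_counter(), 0
stream = client.chat.completions.create(
    model="gpt-oss-20b",  # placeholder: whatever name your server registered
    messages=[{"role": "user", "content": "Explain MXFP4 quantisation briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # roughly one token per streamed content delta

elapsed = time.perf_counter() - start
print(f"~{chunks / elapsed:.1f} tokens/sec (approximate)")
```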
3. Reasoning output is inconsistent
Different engines do different things: strip it out, pass it through, or scramble it. Fixing it usually means digging into inference server code and adjusting your connectors.
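Our connector-side workaround boils down to something like the sketch below: prefer a separate reasoning field if the engine provides one (vLLM's OpenAI-compatible server can, when a reasoning parser is enabled), otherwise strip leaked analysis-channel text out of the main content. The field and tag names here are assumptions about your particular engine.

```python
import re

# Analysis-channel text that leaked into the main content, harmony-style markers assumed.
ANALYSIS = re.compile(
    r"<\|channel\|>analysis<\|message\|>(.*?)(?=<\|channel\|>|<\|end\|>|$)",
    re.DOTALL,
)

def split_reasoning(choice) -> tuple[str, str]:
    """Return (reasoning, answer) regardless of how the engine delivered them."""
    msg = choice.message
    # Case 1: the engine already separates reasoning into its own field.
    reasoning = getattr(msg, "reasoning_content", None) or ""
    content = msg.content or ""
    # Case 2: analysis-channel text is inline; pull it out of the answer.
    if not reasoning:
        reasoning = "\n".join(ANALYSIS.findall(content))
        content = ANALYSIS.sub("", content)
    return reasoning.strip(), content.strip()
```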
Takeaway
This is the messy side of working with open-weight models. Demos look smooth, but real workflows mean handling quirks in:
- APIs and event streams
- Prompt templates and structured outputs
- GPU utilisation and engine bottlenecks
That’s the work we do at Visokio every day: break it, patch it, and push until it runs inside Omniscope.
Your Turn
Tried GPT-OSS yet?
- Which engine tripped you up most?
- Found a reliable way to handle harmony tags?
- Any tips for squeezing more out of 120B without cooking the GPU?
We’d love to compare notes.
👉 Follow us if you want to keep up with how we’re bringing open-weight LLMs into real workflows.
