
What's in a GGUF, besides the weights - and what's still missing?

GGUF files contain chat templates, sampler configs, and special tokens—but missing metadata for tool calling formats, think tokens, and multimodal projections still forces developers to write model-specific code.

Summary

• GGUF's single-file format beats scattered JSONs (HuggingFace) or OCI layers (Ollama), bundling chat templates (Jinja2 scripts), special tokens, and sampler chains in one place
• Tool calling formats vary wildly—Qwen3 uses JSON objects, Qwen3.5 uses XML tags, Gemma4 uses custom syntax—forcing every inference engine to hardcode parsers for each model family
• Missing metadata: think_token field (to separate reasoning from output), projection model weights (breaking the one-file promise for multimodal), and feature flags (no way to detect if a model supports tools/images/thinking)
• Author proposes adding grammars to GGUF spec so parsers can be auto-generated, plus bundling projection models as optional variants
• NobodyWho generates constraining grammars per tool call to guarantee type safety, preventing 1B models from passing floats when integers are required

GGUF consolidates everything needed to run a language model into a single file—chat templates (Jinja2 scripts that format conversations and handle tool calling), special tokens (like end-of-sequence markers), and sampler configurations (probability distribution transformations). This beats HuggingFace's scattered JSON files or Ollama's OCI layers. The GGUF standard even includes sampler chain sequence ordering, which most formats omit, letting you specify exactly how sampling steps should be applied.
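To illustrate why the single-file approach works, here is a minimal sketch of GGUF's key-value metadata layout: a magic number, version, tensor count, and KV count, followed by typed key-value pairs. This toy version handles only string-typed values; real GGUF files define many more value types and carry the tensor data after the metadata section.

```python
import struct

GGUF_MAGIC = b"GGUF"
GGUF_TYPE_STRING = 8  # string value type per the GGUF spec's enum

def write_minimal_gguf(kv: dict) -> bytes:
    # Header: magic, version 3, tensor count (0 here), metadata KV count.
    buf = GGUF_MAGIC + struct.pack("<IQQ", 3, 0, len(kv))
    for key, val in kv.items():
        k, v = key.encode(), val.encode()
        buf += struct.pack("<Q", len(k)) + k       # key: length-prefixed string
        buf += struct.pack("<I", GGUF_TYPE_STRING) # value type tag
        buf += struct.pack("<Q", len(v)) + v       # value: length-prefixed string
    return buf

def read_metadata(data: bytes) -> dict:
    assert data[:4] == GGUF_MAGIC, "not a GGUF file"
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    off = 4 + struct.calcsize("<IQQ")
    out = {}
    for _ in range(n_kv):
        (klen,) = struct.unpack_from("<Q", data, off); off += 8
        key = data[off:off + klen].decode(); off += klen
        (vtype,) = struct.unpack_from("<I", data, off); off += 4
        assert vtype == GGUF_TYPE_STRING  # this sketch handles strings only
        (vlen,) = struct.unpack_from("<Q", data, off); off += 8
        out[key] = data[off:off + vlen].decode(); off += vlen
    return out
```

Because the chat template, special-token names, and sampler settings all live in this one KV table, an engine can open a single file and get everything except the gaps discussed below.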

But critical metadata is still missing, forcing inference engines to maintain model-specific codepaths. Tool calling formats are the worst offender: Qwen3 outputs {"name": "get_weather", "arguments": {"location": "Copenhagen"}}, Qwen3.5 uses <tool_call>get_weather<location>Copenhagen</location></tool_call>, and Gemma4 uses <|tool_call>call:get_weather{city:<|"|>Copenhagen<|"|>}. Every engine rushes to implement parsers when new models drop. The author proposes adding grammars to GGUF so parsers can be auto-generated. NobodyWho goes further by generating constraining grammars per specific tool call, guaranteeing type safety—crucial for tiny models that might pass floats when integers are required.
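The article doesn't show NobodyWho's implementation, but the per-call grammar idea can be sketched as follows: given a tool's name and typed parameters, emit a GBNF-style grammar (the notation llama.cpp uses for constrained sampling) that accepts only a well-typed JSON call. The rule names and JSON shape here are illustrative, not NobodyWho's actual output.

```python
def tool_call_grammar(name: str, params: dict) -> str:
    """Emit a GBNF-style grammar accepting exactly one well-typed tool call.

    params maps parameter name -> type ("integer", "number", or "string").
    """
    rule_for = {"integer": "int", "number": "num", "string": "str"}
    # One quoted-key literal followed by the matching typed value rule.
    fields = ' "," '.join(
        f'"\\"{p}\\":" ws {rule_for[t]}' for p, t in params.items()
    )
    return "\n".join([
        f'root ::= "{{\\"name\\":\\"{name}\\",\\"arguments\\":{{" {fields} "}}}}"',
        'int  ::= "-"? [0-9]+',
        'num  ::= "-"? [0-9]+ ("." [0-9]+)?',
        'str  ::= "\\"" [^"]* "\\""',
        'ws   ::= " "?',
    ])
```

With such a grammar constraining the sampler, a float like 3.5 for an integer parameter is simply not in the language the model can emit, which is exactly the type-safety guarantee the article describes for small models.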

Other gaps: think_token fields exist in upstream HuggingFace repos but get dropped in GGUF conversions, making it impossible to separate reasoning streams from main output without model-specific code. Projection models for multimodal input (images/audio) require a second GGUF file, breaking the one-file ergonomics—bundling them as optional variants would fix this. And there's no feature flag system, so you can't detect if a model supports tool calling or images without hacky substring matching on chat templates. The author calls for the GGUF community to extend the standard to eliminate these remaining codepaths.
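For contrast, the substring-matching hack the author criticizes looks roughly like this; the marker strings are illustrative guesses, not any real engine's heuristics. Explicit boolean metadata keys in GGUF would make this kind of sniffing obsolete.

```python
def guess_features(chat_template: str) -> dict:
    """Fragile capability detection by sniffing the chat template text.

    This is the status quo the article argues against: there is no
    feature-flag metadata, so engines guess from template contents.
    """
    t = chat_template
    return {
        "tools":    "tool" in t,                        # e.g. a {% if tools %} block
        "thinking": "<think>" in t or "reasoning" in t, # reasoning-stream markers
        "images":   "image" in t or "<|vision" in t,    # multimodal markers
    }
```

A renamed marker token or a template comment mentioning "tools" breaks detection in either direction, which is why dedicated feature flags belong in the spec.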