AI Tools | June 5, 2026

xAI Ships Grok Build 0.1: A Purpose-Built Coding Model Enters the Agentic Race

By AgentRiot

xAI's new grok-build-0.1 model is now available via API, with a 256K context window, a claimed 100+ tok/s, and a SWE-bench score of roughly 71%. Here is what the benchmarks, reviews, and pricing actually say.

On May 14, 2026, xAI released Grok Build, a terminal-native coding agent built on a brand-new model called grok-build-0.1. Six days later, on May 20, the model itself dropped on the xAI API in public beta. That two-step launch matters: Grok Build is the CLI product, but grok-build-0.1 is the engine, and developers can now call that engine directly without paying for a SuperGrok subscription.

This is xAI's first serious move into the agentic coding space currently dominated by Anthropic's Claude Code and OpenAI's Codex CLI. The question is whether grok-build-0.1 is competitive on the metrics that matter, or just a fast follower with a lower price tag.

What grok-build-0.1 Actually Is

Per xAI's documentation, grok-build-0.1 is a coding model "specifically trained for agentic coding workflows." It accepts text and image inputs, outputs text, and carries a 256,000-token context window. That is large enough to hold a small or mid-sized codebase in a single pass, though smaller than Claude Opus 4.7's 1M-token long-context mode.

What xAI has not disclosed is just as telling. There is no published model card, no parameter count, no architecture diagram, and no training-data specification for grok-build-0.1. xAI's docs list capabilities and pricing but remain silent on whether the model is a dense transformer, a Mixture-of-Experts (MoE) design, or something else entirely. One third-party source (TypingMind/OpenRouter) claims a 309B-parameter MoE with 15B active parameters and "hybrid attention architecture," but that figure is unattributed and xAI has not confirmed it. Treat it as unverified speculation.

What we do know from multiple independent sources is that the model sits outside the Grok 4 lineage. DevOps.com and ByteIota both report that grok-code-fast-1, the predecessor that grok-build-0.1 replaced, was "built from scratch, separate from the Grok 4 lineage, with a training corpus heavy on programming content and post-training focused on real-world pull requests and coding tasks." xAI has not stated whether grok-build-0.1 continues that same training recipe or starts from a different base, but the separation from the main Grok chat model line is well established.

Key specs from primary sources:

Context window: 256,000 tokens
Input/output pricing: $1.00 / $2.00 per million tokens (API)
Speed: 100+ tokens per second (xAI claim)
Modalities: Text + image in, text out
Features: Function calling, structured outputs, built-in reasoning, MCP support

The model also replaces grok-code-fast-1 in xAI's routing table. API requests previously hitting the older model now route to grok-build-0.1, signaling xAI is consolidating its coding stack around this release rather than maintaining parallel variants.

What "Agentic Coding" Means Here

The phrase "agentic coding" gets used loosely, so it is worth clarifying what grok-build-0.1 is actually optimized for. Standard code-completion models predict the next token given a prompt. Agentic coding models are trained to operate across multi-step workflows: plan a change, read files, edit code, run commands, observe output, and iterate. That requires different capabilities than autocomplete: tool use (function calling), long-horizon reasoning across many context windows, and error recovery when a command fails or a test breaks.
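To make that loop concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `scripted_model` is a hard-coded stand-in for a real model call, the three tools operate on an in-memory dictionary rather than a real filesystem, and none of the names come from xAI's API. It only illustrates the read, run, observe, edit, and retry cycle described above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    tool: str  # name of the tool the "model" wants to call, or "done"
    arg: str   # single string argument, kept simple for illustration


def scripted_model(history: list[str]) -> Step:
    """Hypothetical stand-in for a model: picks the next step from past observations."""
    if not history:
        return Step("read_file", "app.py")
    last = history[-1]
    if last.startswith("read_file"):
        return Step("run_tests", "")
    if last.startswith("run_tests") and "FAIL" in last:
        return Step("edit_file", "app.py")  # error recovery after a failing test
    if last.startswith("edit_file"):
        return Step("run_tests", "")        # verify the edit
    return Step("done", "")


def make_tools(files: dict[str, str]) -> dict[str, Callable[[str], str]]:
    """Toy tools over an in-memory workspace (an assumption, not a real sandbox)."""
    def read_file(path: str) -> str:
        return files[path]

    def edit_file(path: str) -> str:
        files[path] = files[path].replace("a - b", "a + b")
        return "patched"

    def run_tests(_: str) -> str:
        return "PASS" if "a + b" in files["app.py"] else "FAIL"

    return {"read_file": read_file, "edit_file": edit_file, "run_tests": run_tests}


def run_agent(model, tools, max_steps: int = 10) -> list[str]:
    """Ask the model for a step, execute the tool, record the observation, repeat."""
    history: list[str] = []
    for _ in range(max_steps):
        step = model(history)
        if step.tool == "done":
            break
        result = tools[step.tool](step.arg)
        history.append(f"{step.tool}: {result}")
    return history


if __name__ == "__main__":
    workspace = {"app.py": "def add(a, b): return a - b"}
    for line in run_agent(scripted_model, make_tools(workspace)):
        print(line)
```

The `max_steps` cap is a design choice worth keeping in any real loop: it bounds cost when the model fails to converge.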

grok-build-0.1's training appears targeted at this loop. xAI describes it as "specifically trained for agentic coding tasks, including web development, debugging, and MCP support." The predecessor model (grok-code-fast-1) was reportedly post-trained on "real-world pull requests and coding tasks," suggesting a focus on practical software engineering rather than competitive-programming puzzles. Whether grok-build-0.1 uses the same post-training recipe, or adds reinforcement learning from code execution feedback, is undisclosed.

The model's feature set supports the agentic story: native function calling for tool use, structured outputs for parsing command results, a 256K context window for keeping large codebases in scope, and built-in reasoning that is always active. But features are not training methodology. Until xAI publishes a model card or technical report, the exact recipe that produces grok-build-0.1's agentic behavior remains opaque.

Benchmarks: The Numbers We Have

The most-cited score is 70.8% on SWE-bench Verified, a benchmark that measures real-world software engineering task completion. That figure appears across multiple sources including DevOps.com and BuildFastWithAI, though it originally belongs to grok-code-fast-1, the predecessor model that grok-build-0.1 replaced. Whether grok-build-0.1 improves on that number is unverified; xAI has not published an updated SWE-bench score for the new model specifically.

Independent evaluation from Vals AI puts grok-build-0.1 at 71.40% on SWE-bench (±2.02), slightly above the predecessor figure. That is one of the few third-party data points available, but it comes with caveats. BenchLM, another independent tracker, excludes grok-build-0.1 from its public leaderboard because the model "still lacks enough non-generated benchmark coverage to rank safely." In plain terms: most benchmark data circulating for this model is either vendor-reported or generated by small-sample evaluations, not the large-scale independent testing that produces reliable rankings.
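One way to read the Vals AI figure is to check whether the predecessor's 70.8% falls inside the reported margin. The short Python sketch below does that arithmetic. It assumes the ±2.02 is a symmetric margin in percentage points, which the source does not spell out.

```python
# Figures quoted in the article; the symmetric-margin reading is an assumption.
VALS_SCORE = 71.40
VALS_MARGIN = 2.02
PREDECESSOR_SCORE = 70.8


def interval(score: float, margin: float) -> tuple[float, float]:
    """Return the (low, high) range implied by score ± margin."""
    return (round(score - margin, 2), round(score + margin, 2))


low, high = interval(VALS_SCORE, VALS_MARGIN)
predecessor_within_margin = low <= PREDECESSOR_SCORE <= high

if __name__ == "__main__":
    print(f"Vals AI range: {low} to {high}")
    print(f"Predecessor score inside range: {predecessor_within_margin}")
```

Under that assumption the range runs from 69.38 to 73.42, so the 70.8% predecessor figure sits inside it and the two scores cannot be cleanly separated.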

Third-party benchmark data from Kilo Code's PinchBench (OpenClaw-style agent tasks) paints a more detailed picture:

PinchBench average: 88.9% (#7 of 50 official models)
Terminal Bench 2.0: 50.6% completion
Top category scores: Log Analysis 97.0%, CSV Analysis 96.1%, Writing 95.8%, Analysis 95.1%
Perfect-task hits: Access Control Log Anomaly Detection, Calendar Event Creation, Commit Message Writer, Create Project Structure, Dockerfile Optimization, and Earnings Analysis all scored 100.0%

Benchable.ai's independent evaluation flags a notable split: grok-build-0.1 scores 95.0% on coding accuracy (90th percentile) and achieves 100% on hallucination and ethics baselines, a strong result for factual reliability at its price point. Its instruction-following score is weaker at 60.0% (53rd percentile), suggesting the model excels at well-scoped coding tasks but may struggle with ambiguous or multi-layered prompts.

How It Compares to Claude Code and Codex CLI

The 2026 terminal-agent race now has three serious entrants. Here's the state of play on vendor-reported benchmarks:

Codex CLI (GPT-5.5): 88.7% SWE-bench Verified
Claude Code (Opus 4.7): 87.6% SWE-bench Verified
Grok Build (grok-build-0.1): 70.8% SWE-bench Verified (predecessor model score); 71.40% per Vals AI independent test

That gap of roughly 17 to 18 points is real. As one analyst noted, Grok Build "trails by a real margin" on the industry-standard benchmark. Kilo.ai's live leaderboard currently ranks Claude Opus 4.7 at #1 and grok-build-0.1 at #2, a ranking that appears to weight factors beyond raw SWE-bench score. Benchmark scores alone do not tell the whole workflow story.
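The size of the gap depends on which Grok figure is used as the baseline. The Python sketch below recomputes it from the scores listed above. The dictionary names are illustrative only.

```python
# SWE-bench Verified scores as quoted in the article (percent).
LEADERS = {
    "Codex CLI (GPT-5.5)": 88.7,
    "Claude Code (Opus 4.7)": 87.6,
}
GROK_BASELINES = {
    "predecessor figure": 70.8,
    "Vals AI independent test": 71.40,
}


def gaps(leaders: dict[str, float], baseline: float) -> dict[str, float]:
    """Percentage-point lead of each leader over the given baseline score."""
    return {name: round(score - baseline, 1) for name, score in leaders.items()}


if __name__ == "__main__":
    for label, baseline in GROK_BASELINES.items():
        print(label, gaps(LEADERS, baseline))
```

Across both baselines the lead ranges from 16.2 to 17.9 points.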

Pricing: At $1/million input and $2/million output tokens, grok-build-0.1 undercuts both Claude Code and Codex CLI on raw API costs. One recent comparison puts the cost differential in stark terms: GPT-5.5 costs 13.2× more than grok-build-0.1 at the same token volume. However, the Grok Build CLI itself requires a SuperGrok or X Premium+ subscription ($99–$299/month), which changes the economics for individual developers. One independent analysis estimates the full CLI experience is "roughly 15x the price of Claude Code or Codex CLI" when subscription costs are factored in.
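For a rough sense of scale, the Python sketch below applies the quoted API rates to a token volume. The monthly volume of 40M input and 5M output tokens is a made-up workload for illustration, not a measured figure. The 13.2× multiplier and the $99 subscription floor are taken from the paragraph above.

```python
# Rates quoted in the article for grok-build-0.1 (USD per million tokens).
INPUT_USD_PER_M = 1.00
OUTPUT_USD_PER_M = 2.00
GPT55_MULTIPLIER = 13.2      # cost ratio cited in the article
SUBSCRIPTION_MIN_USD = 99.0  # lowest subscription tier cited in the article


def api_cost(input_tokens: int, output_tokens: int) -> float:
    """API cost in USD for the given token counts at the quoted rates."""
    return (input_tokens / 1_000_000) * INPUT_USD_PER_M + (
        output_tokens / 1_000_000
    ) * OUTPUT_USD_PER_M


if __name__ == "__main__":
    # Hypothetical monthly workload: 40M input tokens, 5M output tokens.
    monthly = api_cost(40_000_000, 5_000_000)
    print(f"grok-build-0.1 API: ${monthly:.2f}")
    print(f"Same volume at 13.2x: ${monthly * GPT55_MULTIPLIER:.2f}")
    print(f"Below the ${SUBSCRIPTION_MIN_USD:.0f} subscription floor: {monthly < SUBSCRIPTION_MIN_USD}")
```

At that assumed volume the API bill comes to $50, about half the cheapest subscription tier. That is why the API and CLI economics diverge for individual developers.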

Architecture: Grok Build's headline differentiator is parallelism. Up to 8 sub-agents run concurrently in isolated Git worktrees, each following a plan → search → build loop. The CLI also defaults to "Plan Mode," where the agent writes a structured execution plan before touching files, letting developers review and edit steps before execution. MCP support is native, meaning existing MCP servers and AGENTS.md conventions work out of the box.
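The parallel pattern described here can be sketched with Python's standard library. This is not Grok Build's implementation, which xAI has not published. The function names are hypothetical, the plan, search, and build stages are stubs, and plain temporary directories stand in for Git worktrees (a real setup would create each one with `git worktree add`).

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MAX_SUB_AGENTS = 8  # concurrency cap cited in the article


def run_sub_agent(task: str, workdir: Path) -> str:
    """Stubbed plan -> search -> build loop, confined to its own directory."""
    (workdir / "PLAN.txt").write_text(f"plan for: {task}")       # plan
    relevant = [p.name for p in workdir.iterdir()]               # search (stub)
    (workdir / "RESULT.txt").write_text(f"built: {task}")        # build (stub)
    return f"{task}: done in {workdir.name} ({len(relevant)} file(s) scanned)"


def run_parallel(tasks: list[str]) -> list[str]:
    """Run each task in an isolated directory, at most MAX_SUB_AGENTS at once."""
    with tempfile.TemporaryDirectory() as root:
        with ThreadPoolExecutor(max_workers=MAX_SUB_AGENTS) as pool:
            futures = []
            for i, task in enumerate(tasks):
                workdir = Path(root) / f"worktree-{i}"
                workdir.mkdir()
                futures.append(pool.submit(run_sub_agent, task, workdir))
            # Collect results in submission order before the directories are removed.
            return [f.result() for f in futures]


if __name__ == "__main__":
    for line in run_parallel(["fix login bug", "add tests", "update docs"]):
        print(line)
```

Isolation is the point of the design: because each sub-agent writes only inside its own directory, concurrent edits cannot clobber one another, and merging becomes a separate, reviewable step.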

Speed: xAI claims 100+ tokens per second. Benchable.ai, however, ranks grok-build-0.1 in the 33rd percentile for speed across tested models: moderate, not class-leading. On that evidence, the model's strength is reliability rather than raw velocity.

What Reviews Are Saying

Early coverage is split between enthusiasm for the API accessibility and skepticism about the benchmark gap.

DevOps.com's Tom Smith calls the API launch "a meaningful step" that puts the model "in front of a much wider developer audience." The Futurum Group's Mitch Ashley is more cautious: "Grok Build enters as a callable model without published benchmarks and without the engineering depth the category leaders hold. For depth-sensitive work, teams will keep defaulting to established agents."

The consensus from independent reviewers (Codersera, BuildFastWithAI, ByteIota) is that Grok Build is the "one to watch" rather than the pick today. Codex CLI leads on speed and benchmark scores; Claude Code leads on maturity, ecosystem depth, and self-verification for long-horizon tasks. Grok Build's bet is on parallelism and price — a viable niche, but not yet a category leader.

Bottom Line

grok-build-0.1 is a credible coding model with strong reliability, competitive API pricing, and native agentic features. The roughly 71% SWE-bench results (70.8% inherited from its predecessor, 71.40% from Vals AI) and the 88.9% PinchBench average suggest it can handle real engineering tasks. But the ~17-point gap behind Claude Code and Codex CLI on the most-watched benchmark, combined with the subscription wall around the full CLI experience, means xAI still has ground to cover before it challenges the incumbents on depth.

For developers already running agentic workflows via API, grok-build-0.1 is now a viable option worth benchmarking in your own stack. For teams choosing their first terminal agent, the safer bets remain Claude Code for reliability and Codex CLI for raw performance — with Grok Build as a fast-moving alternative to re-evaluate in three months.

Sources

xAI Docs: grok-build-0.1 specifications and capabilities
xAI API pricing (OpenRouter, xAI docs)
DevOps.com: "xAI Enters the Coding Agent Race With Grok Build" (May 15, 2026)
DevOps.com: "xAI Opens Grok Build 0.1 to Developers via API" (June 1, 2026)
BuildFastWithAI: "Grok Build: xAI's Agent CLI Reviewed" (May 26, 2026)
Kilo Code / PinchBench: OpenClaw benchmark data
Benchable.ai: Independent model evaluation (May 20, 2026)
Codersera: Three-way comparison (May 18, 2026)
Vals AI: Independent SWE-bench evaluation (71.40%, ±2.02)
BenchLM: Model tracking and leaderboard exclusion note
Kilo.ai: Live AI coding model leaderboard