AiNews.com
xAI Launches Grok 4 and Grok 4 Heavy, Aims to Redefine AI Intelligence

Elon Musk’s xAI has unveiled Grok 4, its most advanced AI model to date, claiming postgraduate-level reasoning across every academic discipline—and surpassing rivals on key benchmarks.

Alicia Shapiro
July 10, 2025 • Estimated Reading Time: 9 minutes

A modern AI research lab with four researchers—three men and one woman—seated at desks, focused on large monitors displaying neural network diagrams, code, and benchmark graphs. Two central screens show the label “Grok 4” with illustrations of humanoid AI agents collaborating to solve a puzzle. A benchmark chart on an upper screen shows Grok 4 outperforming other models over time. The lab is dimly lit, emphasizing the glow from the screens, and conveys a high-tech, real-world setting focused on AI development and model evaluation.

Image Source: ChatGPT-4o

Key Takeaways:

Grok 4 and Grok 4 Heavy are xAI’s new flagship models, boasting major upgrades in reasoning and tool use over previous versions.
Grok 4 Heavy uses a "multi-agent" system—multiple AIs collaborating in parallel—which xAI likens to an elite academic study group.
On the rigorous Humanity’s Last Exam benchmark, Grok 4 Heavy scored 44.4%, outperforming Gemini 2.5 Pro (26.9%) and OpenAI’s o3 high (21%).
xAI launched SuperGrok Heavy, a $300/month premium tier offering early access to Grok 4 Heavy and future models.
A broader developer release is now live via xAI’s API, with additional coding, multimodal, and video generation models set to follow later in 2025.

Grok 4: A New Benchmark for General Reasoning

Announced during a livestream Wednesday night, Grok 4 was described by Musk as “smarter than almost all graduate students in all disciplines simultaneously.” According to xAI, Grok 4 can solve complex problems in mathematics, physics, chemistry, and linguistics—even when the questions are unfamiliar or unpublished.

xAI emphasized that Grok 4 isn’t just memorizing the internet. Instead, it demonstrates “first-principles reasoning,” meaning it can work through novel, abstract problems like those found in research-level academia.

During the demo, Grok 4 tackled Humanity’s Last Exam (HLE)—a benchmark of 2,500 expert-written problems spanning diverse disciplines—and performed at a level no other model had previously reached unaided.

“There are no humans that can actually answer these can get a good score,” said Musk. “I mean if you actually say like any given human what's the best that any human could score? I mean I'd say maybe 5% optimistically.”

Screenshot from the xAI livestream showing Elon Musk and xAI researchers seated on stage discussing Grok 4’s development. Behind them, a slide titled “Ludicrous rate of progress” illustrates Grok’s rapid evolution: Grok 2 to Grok 3 (10x pre-training compute), Grok 3 to Grok 3 Reasoning (10x with reinforcement learning), and Grok 4 Reasoning (another 10x leap). Musk is speaking into a microphone while team members from xAI, wearing branded apparel, listen. The slide uses vertical bars to show compute increases at each stage. Video timestamp is 8:02 / 53:38, with 4.3M views on X.

xAI Leadership Unveils Grok 4's 10x Reasoning Leap During Live Launch Event. Image Source: xAI

How Grok 4 Works: Training at Unprecedented Scale

xAI claims that Grok 4’s leap in performance stems from both scale and architectural changes:

Each model upgrade (from Grok 2 to 3 to 4) has involved 10x more training compute.
Grok 4 combines a foundation model with reinforcement learning from human feedback (RLHF) and tool-assisted reasoning.
The model was trained using Colossus, xAI’s custom-built supercomputer powered by 100,000 H100 GPUs.

Grok 4 Heavy, the premium version, runs multiple AI agents simultaneously. These agents solve problems independently, compare notes, and converge on the best solution. xAI compared this to collaborative problem-solving in a study group, noting that it’s not always majority vote—often, one agent alone figures out the key insight.
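xAI has not published how Grok 4 Heavy coordinates its agents, so the following is only a toy sketch of the "study group" idea described above. The agent functions (`guesser_a`, `guesser_b`, `checker`) and the `solve` helper are hypothetical stand-ins, not xAI code. Several agents run in parallel, and an answer that one agent could verify beats an unverified majority.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Hypothetical agent type: takes a problem (here, an integer n to square)
# and returns (answer, verified), where `verified` means the agent was able
# to check its own answer by an independent method.
Agent = Callable[[int], Tuple[int, bool]]

def guesser_a(n: int) -> Tuple[int, bool]:
    return n * n + 1, False  # plausible-looking but unchecked (and wrong)

def guesser_b(n: int) -> Tuple[int, bool]:
    return n * n + 1, False  # repeats the same mistake, forming a majority

def checker(n: int) -> Tuple[int, bool]:
    answer = n * n
    # Independent check: the sum of the first n odd numbers equals n squared.
    return answer, answer == sum(2 * k + 1 for k in range(n))

def solve(agents: List[Agent], n: int) -> int:
    # Run every agent on the same problem in parallel.
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        results = list(pool.map(lambda agent: agent(n), agents))
    # Prefer any verified answer, even if only one agent found it.
    verified = [answer for answer, ok in results if ok]
    if verified:
        return verified[0]
    # Otherwise fall back to a plain majority vote.
    return Counter(answer for answer, _ in results).most_common(1)[0][0]

print(solve([guesser_a, guesser_b, checker], 12))  # 144, not the majority's 145
```

In this sketch a single verified answer overrides two agreeing but unverified ones, which mirrors the point that the result is not always decided by majority vote.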

Real-World Tests: From Prediction Markets to Running Businesses

To showcase real-world capability, Grok 4 Heavy was tested in several interactive demos:

Market prediction: It analyzed sports odds from PolyMarket, calculating a 21.6% chance for the Dodgers to win the MLB World Series—demonstrating live tool use, search, and probability modeling.
Vending Bench business simulation: Grok 4 finished with roughly twice the net worth of competing models in a long-horizon task involving supply management, pricing, and strategy adherence, outperforming other leading AI models in both profit and consistency.

Grok 4 is being used by researchers at the ARC Institute for biomedical discovery, where it helps sift through millions of experimental records in seconds to accelerate the identification of promising research directions.

Bar charts comparing model performance across five academic benchmarks: GPQA, AIME25, LCB (Jan–May), HMMT25, and USAMO25. For each test, scores are shown for o3, Gemini 2.5 Pro, Claude 4 (Opus), Grok 4 (no tool), and Grok 4 Heavy. Highlights: Grok 4 (no tool) scores ~87.5% on GPQA and ~91.7% on AIME25; Grok 4 Heavy edges up to ~88.9% on GPQA and 100% on AIME25. On LCB, Grok 4 hits ~79% while Grok 4 Heavy reaches ~79.4%. For HMMT25, scores climb from ~90% to ~96.7%. On USAMO25, Grok 4 scores ~37.5%, and Grok 4 Heavy achieves ~61.9%. These results reflect recent independent uploads and user-shared benchmark data.

Grok 4 vs. o3, Gemini 2.5 Pro, and Claude 4 Opus: Benchmark Results on Complex Academic Exams. Image Source: xAI

Limitations: Vision and Tool Use Still Maturing

Despite strong language reasoning, Grok 4 still lags in multimodal understanding, particularly image analysis and generation. “It’s like looking through blurry glass,” one presenter said.

That’s expected to improve with Grok 5, based on version 7 of the foundation model, currently in training. It will include better video understanding, more advanced tools, and tighter integration with simulation engines like Unreal or Unity for game development.

Grok 4's current tool use is considered "primitive" compared to the sophisticated simulations used in industries like aerospace. However, Musk promised those capabilities are coming—along with integration into humanoid robots, such as Tesla’s Optimus.

xAI Bets Big on Premium Access and Developer Adoption

xAI launched SuperGrok Heavy, a $300/month subscription tier that includes:

Early access to Grok 4 Heavy
Priority on new tools and features
Access to future models like an AI coding assistant (August), multimodal agent (September), and video generation model (October)

This makes it the most expensive AI subscription plan among major providers, ahead of offerings from OpenAI, Google, and Anthropic.

For developers, xAI has released Grok 4 through its public API, with 256K context length and access to tool capabilities. The goal: encourage integration into enterprise workflows across research, finance, gaming, and more.

Availability and Pricing

xAI has launched three Grok 4 access tiers, each with distinct capabilities and pricing models:

Grok 4

The core single-agent reasoning model capable of solving complex academic and real-world problems.

Availability: Live now via the Grok API and X platform
Cost:
- API (usage-based):
  - $3 per 1M input tokens
  - $15 per 1M output tokens
  - $0.75 per 1M cached tokens
- Consumer subscription:
  - ~$30/month via the standard Grok plan on X (bundled with Premium+)
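
To make the usage-based rates above concrete, here is a minimal cost estimator. The three per-million-token prices come from the list above. The function name `estimate_cost` and the billing rule (cached tokens counted separately from input tokens and charged only at the cached rate) are assumptions for illustration, so check xAI's own pricing documentation before relying on it.

```python
from decimal import Decimal

# Per-million-token rates quoted in the article (USD).
RATES_PER_MILLION = {
    "input": Decimal("3.00"),
    "output": Decimal("15.00"),
    "cached": Decimal("0.75"),
}
TOKENS_PER_MILLION = Decimal(1_000_000)

def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> Decimal:
    """Estimate the USD cost of one request.

    Assumption: cached tokens are counted separately from input_tokens
    and billed only at the cached rate.
    """
    usage = {"input": input_tokens, "output": output_tokens, "cached": cached_tokens}
    return sum(
        (Decimal(count) * RATES_PER_MILLION[kind] / TOKENS_PER_MILLION
         for kind, count in usage.items()),
        Decimal("0"),
    )

cost = estimate_cost(input_tokens=200_000, output_tokens=50_000, cached_tokens=100_000)
print(f"${cost:.3f}")  # $1.425
```

In this example the 50,000 output tokens cost more than the 200,000 input tokens ($0.75 versus $0.60), because output is priced at five times the input rate.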

Grok 4 Heavy

A more powerful multi-agent version of Grok 4 that spawns several reasoning agents to collaborate and converge on the best solution—dramatically boosting performance on complex benchmarks.

Availability: Currently available only through the SuperGrok Heavy subscription
Cost: Included with SuperGrok Heavy (see below)

SuperGrok Heavy

xAI’s new ultra-premium tier that includes access to Grok 4 Heavy, early releases of future tools, and priority compute.

Availability: Live now for early subscribers; limited slots during demo period, with expanded rollout expected
Cost: $300/month or $3,000/year
Includes:
- Access to Grok 4 Heavy
- Priority access to upcoming models:
  - Coding model (August)
  - Multimodal agent (September)
  - Video generation (October)

If subscriptions are temporarily closed due to high demand, xAI recommends trying again shortly after the demo window.

Fast Facts for AI Readers

Q: What is Grok 4?

A: Grok 4 is xAI’s latest large language model, designed for advanced reasoning and tool use.

Q: What is Grok 4 Heavy?

A: A multi-agent version of Grok 4 that solves problems using multiple AI agents working in parallel.

Q: How does Grok 4 perform on benchmarks?

A: On Humanity’s Last Exam, Grok 4 Heavy scored 44.4%, outperforming Gemini 2.5 Pro and OpenAI’s o3 high.

Q: What is SuperGrok Heavy?

A: xAI’s new $300/month subscription tier offering early access to Grok 4 Heavy and future tools.

Q: How can developers use Grok 4?

A: Through xAI’s API, which supports long-context reasoning and integration with external tools.

What This Means

Grok 4 is xAI’s strongest case yet that it belongs in the top tier of generative AI development. Its benchmark wins and live demonstrations highlight a shift toward deeper reasoning, not just faster responses.

But questions remain—about adoption, safety, and how xAI will handle future missteps, including the recent antisemitic responses from Grok’s official X account, which were removed after public backlash. The company revised Grok’s system prompt afterward but did not directly address the incident during the launch.

What’s clear is that Musk and xAI are betting big on speed, compute, and open deployment. Whether that approach results in safer, smarter, or simply faster AI will depend on how the next versions of Grok evolve—and how businesses respond to the promise of real-world intelligence at scale.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.