
Moonshot AI Releases Kimi K2, a 1T Open-Source Model Built for Agentic Reasoning

Kimi K2 delivers top-tier coding and tool use performance with a 1 trillion-parameter MoE design and a 128K context window.

Illustration: a developer workspace centered on Kimi K2, Moonshot AI’s open-source 1-trillion-parameter model, showing a coding environment and an AI assistant completing multi-step agentic tasks.

Image Source: ChatGPT-4o

Key Takeaways:

  • Kimi K2 is Moonshot AI’s newest open-source model, featuring a 1T-parameter Mixture-of-Experts architecture with 32B active parameters per token.

  • The model excels in coding and tool-use tasks, outperforming many proprietary models on benchmarks like SWE-bench, LiveCodeBench, and AceBench.

  • Kimi K2 is fully open source under a Modified MIT License, with support for commercial use and multiple deployment engines including vLLM and TensorRT-LLM.

  • Agentic reasoning is a core design goal, with native tool-calling, planning, and autonomy built into both the training pipeline and model structure.

  • Available now via API or Hugging Face, Kimi K2 can be self-hosted or integrated into products with OpenAI-compatible tooling.

A Scaled Open Model Focused on Action

Moonshot AI has released Kimi K2, a high-performance open model designed to compete directly with proprietary leaders in coding, reasoning, and agentic workflows. The model is available in two variants—Kimi-K2-Base and Kimi-K2-Instruct—with full weights, documentation, and tooling accessible through Hugging Face and GitHub.

Kimi K2 uses a 1 trillion-parameter Mixture-of-Experts (MoE) architecture with 384 experts and 8 active per token, yielding 32 billion active parameters per inference. This design allows for large-scale capacity with efficient compute, and contributes to the model’s strong performance across a wide range of evaluations.
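
To make the sparsity concrete, here is a minimal sketch of top-k expert routing in an MoE layer. This is illustrative only, not Moonshot’s implementation: with 384 experts and 8 routed per token, each token activates only a small fraction of the model’s total parameters.

```python
import numpy as np

def route_top_k(router_logits, k=8):
    """Pick the k highest-scoring experts and renormalize their weights."""
    top = np.argsort(router_logits)[-k:]   # indices of the k best experts
    weights = np.exp(router_logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    return top, weights

rng = np.random.default_rng(0)
logits = rng.normal(size=384)              # one router score per expert
experts, weights = route_top_k(logits, k=8)

# 8 of 384 experts fire per token; the token's output is the weighted sum
# of those experts' outputs, which is why only ~32B of 1T parameters are
# active per inference step.
```

The same ratio explains the efficiency claim: roughly 32B active out of 1T total means each forward pass touches about 3% of the weights.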

Moonshot AI is positioning Kimi K2 as more than just a general-purpose language model. It’s described as “agent-first”—built to operate tools, execute commands, and handle complex workflows with minimal prompting. Real-world demonstrations include:

  • Editing and running shell commands in a live terminal

  • Refactoring full software projects across languages

  • Automating analytics workflows with libraries like Weights & Biases

  • Coordinating multi-step travel planning and web browsing tasks

Strong Benchmark Performance vs. Open and Closed Models

Despite being open source, Kimi K2 holds its own against—and sometimes outperforms—closed models from Anthropic, OpenAI, and Google on targeted benchmarks.

🔹 Agentic and Coding Tasks

  • LiveCodeBench v6: 53.7% (vs. GPT-4.1 at 44.7%)

  • SWE-bench Verified: 65.8% single attempt; 71.6% multiple attempts

  • MultiPL-E: 85.7% (vs. Claude Opus 4 at 89.6%)

  • OJBench: 27.1% (best among open models; ahead of GPT-4.1 and Claude 4)

🔹 Tool Use and Planning

  • Tau2 Bench (tool use): 66.1% weighted avg. (vs. Claude Opus 4 at 67.6%)

  • AceBench: 76.5% (on par with GPT-4-tier models)

🔹 Math and Reasoning

  • MATH-500: 97.4% (best overall in class)

  • AIME 2025: 49.5% (outperforms many open models)

Kimi K2’s strengths appear to lie in structured problem solving, tool use, and low-latency reasoning, rather than extended thinking or multimodal tasks (which it does not yet support). Its SWE-bench and MATH-500 scores reflect strong agentic performance in competitive coding and STEM reasoning.

Built for Open Deployment at Scale

Kimi K2 is released under a Modified MIT License, allowing full commercial use, modification, and redistribution. Users can choose from four supported inference engines:

  • vLLM

  • SGLang

  • KTransformers

  • TensorRT-LLM

The model supports both chat completion APIs and native tool calling, with OpenAI-compatible endpoints for easy integration. Kimi K2 also offers a 128K token context window, enabling long-document processing and sustained multi-turn conversations—an advantage for researchers, agents, and enterprises working with complex workflows or extensive prompts.
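
Because the endpoints follow the OpenAI chat-completions convention, integration looks like any OpenAI-style request. The sketch below assembles such a request with a tool definition; the model identifier `kimi-k2-instruct` and the tool are illustrative assumptions, not values confirmed by the source—check Moonshot’s platform docs for the exact names.

```python
import json

def build_chat_request(user_prompt, tools=None):
    """Assemble an OpenAI-style chat.completions payload for Kimi K2."""
    payload = {
        "model": "kimi-k2-instruct",  # hypothetical model identifier
        "messages": [
            {"role": "system", "content": "You are Kimi, an agentic assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.6,
    }
    if tools:
        payload["tools"] = tools  # native tool calling uses OpenAI's tool schema
    return payload

# One tool definition in the OpenAI function-calling schema (hypothetical tool):
search_flights = {
    "type": "function",
    "function": {
        "name": "search_flights",
        "description": "Search for flights between two cities.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
            },
            "required": ["origin", "destination"],
        },
    },
}

req = build_chat_request("Plan a trip from SFO to Tokyo.", tools=[search_flights])
body = json.dumps(req)  # this JSON body would be POSTed to the chat endpoint
```

Any OpenAI-compatible client library can send this payload unchanged, which is what makes swapping Kimi K2 into existing agent stacks straightforward.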

Moonshot AI’s deployment guides provide examples for agent use, chat applications, and custom tool integrations. While GPU requirements are significant, the model is designed to be scalable for production-grade deployments.

Who's Behind Kimi K2?

Kimi K2 was developed by Moonshot AI, a Chinese AI research lab backed by Alibaba. While Moonshot operates independently, Alibaba is one of its key investors and has helped position the lab as a major contender in China’s AI race. The Kimi model family also powers Moonshot’s consumer AI assistant, Kimi Chat, available via web and mobile.

Native Agentic Design and Reinforcement Learning

Kimi K2’s standout feature is its deep focus on agentic behavior. The model was trained with a custom tool-use simulator inspired by ACEBench, allowing it to learn from thousands of virtual environments where agents interact with tools under human-like task rubrics.

Moonshot also introduced a new optimizer called MuonClip, designed to stabilize training at trillion-parameter scale. This addresses training instability from exploding attention logits and is part of what enabled Kimi K2’s smooth scale-up on 15.5 trillion tokens.
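
The source does not detail MuonClip’s mechanics, but the general idea behind taming exploding attention logits can be sketched as a rescaling step: when the largest logit exceeds a cap, scale the logits back down. This is an illustrative simplification, not MuonClip itself.

```python
import numpy as np

def capped_attention_logits(q, k, cap=50.0):
    """Scaled dot-product attention logits, rescaled if they exceed a cap."""
    logits = q @ k.T / np.sqrt(q.shape[-1])  # standard scaled dot-product scores
    peak = np.abs(logits).max()
    if peak > cap:
        logits = logits * (cap / peak)  # pull the largest logit back to the cap
    return logits

rng = np.random.default_rng(1)
q = rng.normal(scale=10.0, size=(4, 64))  # deliberately large activations
k = rng.normal(scale=10.0, size=(4, 64))
logits = capped_attention_logits(q, k)
```

Keeping logits bounded prevents the softmax from saturating, which is the kind of instability that becomes acute at trillion-parameter scale.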

In post-training, the model was further refined using reinforcement learning across both verifiable and non-verifiable tasks. For creative tasks like writing or planning, Kimi K2 uses a self-judging critic to generate structured feedback—a strategy that approximates human preference labels without requiring human annotators.

How Kimi K2 Compares to DeepSeek R1

DeepSeek R1 is another high-performance open model from China, built on a large-scale Mixture-of-Experts architecture (671B total parameters). While both models aim to push the boundaries of open AI development, Kimi K2 distinguishes itself through its deep integration of agentic capabilities—particularly tool use, planning, and command execution. Benchmark results show Kimi K2 leading in several key areas, including SWE-bench Verified and LiveCodeBench. As of now, Kimi K2 is the only open-weights model of its scale available for commercial and research use.

Fast Facts for AI Readers

Q: What is Kimi K2?

A: Kimi K2 is Moonshot AI’s 1 trillion-parameter open-source model optimized for coding, reasoning, and agentic tool use.

Q: What architecture does it use?

A: It’s a Mixture-of-Experts model with 384 experts and 8 selected per token (32B active parameters per inference).

Q: How does it perform?

A: It achieves 53.7% on LiveCodeBench and 65.8% on SWE-bench Verified—better than many closed models.

Q: Is it free and open source?

A: Yes. It’s released under a Modified MIT License that permits commercial use, modification, and redistribution; the modification adds an attribution requirement for very large-scale commercial deployments.

Q: Where can I try it?

A: You can access Kimi K2 via API at platform.moonshot.ai, or download it on Hugging Face.

What This Means

Kimi K2 shows that open-source models can now match—and in some domains, outperform—their closed counterparts. With agentic intelligence emerging as a core capability in AI development, Kimi K2’s native support for tool use and command execution gives it a distinct advantage in real-world deployment scenarios.

For startups, researchers, and enterprises building intelligent agents, the release offers a rare blend of scale, openness, and usability. For the broader AI ecosystem, it’s a reminder that powerful models don’t need to come with usage restrictions—or a price tag.

As with other advanced open models developed in China, including those from DeepSeek, users should weigh the benefits of access against the potential risks of data exposure. While Kimi K2 is open-source and commercially licensed, deploying it in sensitive environments may raise concerns around data flow, security, and long-term dependencies—particularly given Moonshot AI’s backing by Alibaba. This doesn’t diminish the model’s technical strength, but it does highlight the need for transparency not just in code, but in ownership and jurisdiction.

As AI models grow more powerful and more global, evaluating who builds them—and who benefits—matters as much as how well they perform.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.