
OpenAI Launches GPT‑4.1: Faster, Smarter, and Built for Scale

[Image: three holographic AI models labeled GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano, surrounded by floating benchmark charts and data streams. Image Source: ChatGPT-4o]


OpenAI has introduced a new generation of AI models (GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano) through its API, promising faster, smarter, and more cost-effective performance for developers. With improvements across key capabilities like coding, instruction following, and long-context comprehension, the GPT‑4.1 family sets a new bar for real-world AI deployment.

Key Improvements

  • Superior Coding Skills: GPT‑4.1 achieves 54.6% on the SWE-bench Verified benchmark, an absolute gain of 21.4 percentage points over GPT‑4o and 26.6 points over GPT‑4.5.

  • Improved Instruction Following: It scores 38.3% on Scale's MultiChallenge benchmark, which measures instruction following, up 10.5 percentage points from GPT‑4o.

  • Long-Context Mastery: Supports up to 1 million tokens of context and sets state-of-the-art results on Video-MME (72.0% accuracy), up 6.7 percentage points over GPT‑4o.

  • New Model Sizes: GPT‑4.1 mini matches or exceeds GPT‑4o in intelligence evaluations while cutting latency nearly in half and reducing cost by 83%. GPT‑4.1 nano, OpenAI's fastest and most affordable model, scores 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider polyglot coding, making it a standout choice for low-latency, speed-critical tasks like classification and autocompletion (a minimal sketch follows this list).
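
To illustrate that low-latency use case, here is a minimal sketch of support-ticket classification with GPT‑4.1 nano via the OpenAI Python SDK. The labels and prompt are invented for the example; the model name (gpt-4.1-nano) and the chat-completions call follow OpenAI's published API.

```python
# Minimal sketch: support-ticket classification with GPT-4.1 nano.
# The labels and prompt are illustrative; the model name ("gpt-4.1-nano")
# and the chat-completions call follow OpenAI's published Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["billing", "bug_report", "feature_request", "other"]

def classify(ticket: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system",
             "content": "Classify the support ticket into one of: "
                        + ", ".join(LABELS) + ". Reply with the label only."},
            {"role": "user", "content": ticket},
        ],
        temperature=0,  # deterministic labels
        max_tokens=5,   # the reply is a single short label
    )
    return response.choices[0].message.content.strip()

print(classify("I was charged twice for my subscription this month."))
# expected output: billing
```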

These models reflect OpenAI’s ongoing collaboration with the developer community, with training grounded in real-world utility.

Smarter Agents and Better Apps

GPT‑4.1 is optimized for building AI agents capable of handling tasks such as:

  • Code generation and diff editing

  • Customer service resolutions

  • Insight extraction from large documents

  • Long-context legal and financial reasoning

When combined with tools like the Responses API, developers can now build agents that carry out multi-step tasks with greater reliability and independence, as in the sketch below.
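
Here is a minimal sketch of one agent step with GPT‑4.1 through the Responses API. The get_weather tool and its schema are hypothetical; the client.responses.create call and the gpt-4.1 model name follow OpenAI's published Python SDK.

```python
# Minimal sketch: one agent step via the Responses API.
# The get_weather tool and its schema are hypothetical examples.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = client.responses.create(
    model="gpt-4.1",
    tools=tools,
    input="Should I pack an umbrella for Seattle tomorrow?",
)

# The model either answers in text or emits a function_call item for the
# application to execute and feed back in a follow-up request.
for item in response.output:
    if item.type == "function_call":
        print("tool requested:", item.name, item.arguments)
print(response.output_text)  # any assistant text produced in this turn
```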

Real-World Adoption

Companies across industries are already seeing significant, measurable gains from adopting GPT‑4.1:

  • Windsurf reported a 60% improvement in coding accuracy compared to GPT‑4o on its internal benchmarks. This translated into faster iteration cycles—thanks to a 30% boost in tool use efficiency and a 50% reduction in unnecessary edits and overly incremental code analysis, especially in pull request workflows.

  • Qodo found that GPT‑4.1 delivered 55% more useful code review suggestions across 200 real GitHub pull requests. The model showed improved precision—knowing when not to comment—and greater depth when it did, leading to higher-quality feedback.

  • Blue J saw a 53% increase in accuracy when handling complex tax law scenarios, thanks to GPT‑4.1’s improved instruction following and long-context comprehension. This led to faster tax research and greater reliability in regulatory interpretation.

  • Thomson Reuters, through its CoCounsel AI legal assistant, achieved a 17% boost in multi-document legal review accuracy. GPT‑4.1 was more reliable at keeping track of context across lengthy documents and identifying subtle connections—such as conflicting clauses, repeated terms, or missing references—that are critical for legal analysis.

  • Carlyle used GPT‑4.1 to extract detailed financial data from large, complex documents—including PDFs, Excel files, and other dense formats. It performed 50% better at retrieving information from these sources compared to previous models, and was the first to reliably overcome key limitations like needle-in-the-haystack retrieval, lost-in-the-middle errors, and multi-hop reasoning—enabling more accurate insights from sprawling, high-stakes datasets.

Long Context & Reasoning Breakthroughs

While GPT‑4.1 delivers strong gains across instruction following, vision, and general intelligence, the most transformative improvements are in its long-context comprehension and multi-step reasoning. This is where GPT‑4.1 pulls ahead of prior models in both scale and reliability, making it uniquely suited for complex, real-world tasks involving large volumes of information.

To better evaluate this leap, OpenAI introduced new benchmarks that go beyond traditional "needle in a haystack" tests. These benchmarks were specifically designed to reflect the real challenges developers face—like tracking nuanced requests scattered across lengthy documents, or reasoning across multiple files and contexts that aren't linearly connected.

New Benchmarks for Long Context Reasoning

  • OpenAI-MRCR: Tests the model’s ability to retrieve and distinguish between multiple, similar user requests (e.g., “give me the third poem about tapirs”) embedded in long, mixed contexts. It challenges the model to stay precise even when prompts are repeated or subtly varied throughout the input (an illustrative construction appears below, after the Graphwalks sketch).

  • Graphwalks: Designed to test a model’s ability to handle multi-step reasoning across long inputs. The benchmark fills the context with a simulated graph of connected nodes, each represented by a unique hash, and then asks the model to perform a breadth-first search: identify and return all nodes a specific number of steps away from a starting point (see the sketch after this list).

    On the Graphwalks benchmark, GPT‑4.1 achieves 61.7% accuracy, matching the performance of OpenAI’s top-tier o1 model and significantly outperforming GPT‑4o. This strong showing highlights GPT‑4.1’s ability to reason across large, complex inputs where sequential reading alone isn’t enough.

    Unlike simpler tasks that can be solved by reading in order, this challenge demands global reasoning. The model has to understand the structure of the graph and retrieve relevant information from multiple, scattered locations in the input—mimicking how real-world applications often require jumping between related data points across large files.
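
To make the Graphwalks setup concrete, here is a minimal sketch of how such an instance can be generated and scored; the graph size, hash scheme, and prompt wording are illustrative, not OpenAI's exact construction. It builds a random graph of hash-named nodes, serializes the edge list into a prompt, and computes the gold answer with a reference breadth-first search.

```python
# Illustrative Graphwalks-style instance (not OpenAI's exact construction):
# nodes are short hex hashes, the prompt is the raw edge list, and the gold
# answer is the BFS frontier exactly `depth` hops from a start node.
import random
from hashlib import sha256

def make_graph(n_nodes=50, n_edges=150, seed=0):
    rng = random.Random(seed)
    nodes = [sha256(str(i).encode()).hexdigest()[:8] for i in range(n_nodes)]
    edges = {(rng.choice(nodes), rng.choice(nodes)) for _ in range(n_edges)}
    return nodes, sorted(edges)

def bfs_frontier(edges, start, depth):
    """Return the set of nodes exactly `depth` hops from `start`."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
    seen, frontier = {start}, {start}
    for _ in range(depth):
        frontier = {nxt for node in frontier
                    for nxt in neighbors.get(node, set())} - seen
        seen |= frontier
    return frontier

nodes, edges = make_graph()
start, depth = nodes[0], 2
gold = bfs_frontier(edges, start, depth)

prompt = ("Edges:\n" + "\n".join(f"{a} -> {b}" for a, b in edges)
          + f"\n\nList every node exactly {depth} hops from {start}.")
print(len(edges), "edges; gold frontier:", sorted(gold))
```

A model's reply would then be parsed into a set of node hashes and compared against the gold frontier, for example with set-overlap metrics, so that answers reachable only by jumping between scattered parts of the edge list are rewarded.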

In both evaluations, GPT‑4.1 leads the performance charts, maintaining high accuracy even at the full 1 million-token range—a feat unmatched by earlier models. These gains represent a significant milestone in AI’s ability to navigate and understand long, complex, and interconnected information structures.
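
Returning to OpenAI-MRCR, a minimal illustrative construction of such a probe (not OpenAI's actual dataset) might look like the following: scatter near-identical request/response pairs through a long mixed dialogue, then ask for one instance by ordinal position.

```python
# Illustrative OpenAI-MRCR-style probe (not OpenAI's actual dataset):
# several near-identical request/response pairs are buried in filler
# dialogue, and the model must return one specific instance verbatim.
places = ["mist", "river", "dusk", "ferns"]
poems = [f"Tapir poem {i}: a shy tapir wanders through the {p}."
         for i, p in enumerate(places, start=1)]
filler = "User: unrelated small talk\nAssistant: unrelated reply\n" * 50
context = "".join(
    f"User: write a poem about tapirs\nAssistant: {poem}\n{filler}"
    for poem in poems
)
prompt = context + "User: give me the third poem about tapirs, verbatim."
# Scoring: a correct reply matches poems[2] exactly; off-by-one retrieval
# or a blend of several poems counts as a failure.
print(len(prompt), "characters of context; target answer:", poems[2])
```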

API-Only Access and Pricing

GPT‑4.1 is available exclusively via the API and is 26% less expensive on average than GPT‑4o for typical use cases. Developers benefit from enhanced efficiency through prompt caching discounts (now up to 75%, up from 50%) and batch processing support, which further reduce costs and latency for high-volume and repeated tasks. Finally, long-context requests are supported at no extra charge—developers only pay the standard per-token rates, regardless of input length.
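
As a back-of-the-envelope illustration of how the caching discount changes the economics of repeated long prompts, here is a sketch using a hypothetical per-token input rate; actual GPT‑4.1 prices are published on OpenAI's pricing page and are not quoted in this article.

```python
# Back-of-the-envelope input-cost model for prompt caching.
# INPUT_RATE is a hypothetical placeholder, not OpenAI's actual price;
# the 75% cached-token discount is the figure cited in this article.
INPUT_RATE = 2.00                      # hypothetical $ per 1M uncached input tokens
CACHED_RATE = INPUT_RATE * (1 - 0.75)  # cached tokens billed at a 75% discount

def input_cost(total_tokens: int, cached_tokens: int) -> float:
    """Dollar cost of one request's input tokens."""
    uncached = total_tokens - cached_tokens
    return (uncached * INPUT_RATE + cached_tokens * CACHED_RATE) / 1_000_000

first_call = input_cost(100_000, 0)        # cold cache: all tokens full price
repeat_call = input_cost(100_000, 90_000)  # 90k tokens hit the prompt cache
print(f"first: ${first_call:.3f}  repeat: ${repeat_call:.3f}")
# first: $0.200  repeat: $0.065 -> repeated long prompts cost about a third
```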

In ChatGPT, many of GPT‑4.1’s improvements—particularly in coding, instruction following, and general intelligence—have been gradually integrated into the latest GPT‑4o model. More enhancements will continue to roll out over time through product updates.

Deprecation of GPT‑4.5 Preview

OpenAI will deprecate GPT‑4.5 Preview in the API on July 14, 2025, giving developers three months to transition to GPT‑4.1. Designed as a research preview, GPT‑4.5 helped OpenAI explore a larger, more compute-intensive architecture. With GPT‑4.1 offering similar or better performance at significantly lower cost and latency, GPT‑4.5 will be retired.

OpenAI has emphasized that the creativity, writing quality, humor, and nuance developers appreciated in GPT‑4.5 will continue to shape the design of future API models.

Appendix: Model Benchmark Results

Results from a wide range of academic, coding, instruction-following, and long-context benchmarks were used to evaluate the GPT‑4.1 model family; the figures cited throughout this article reflect both public and internal evaluations conducted by OpenAI as of April 2025. The full benchmark tables are available in OpenAI's announcement.

What This Means

The GPT‑4.1 family represents a strategic shift toward models that are not just powerful, but practically deployable—faster, cheaper, and better aligned with developer needs. With major leaps in long-context understanding, instruction following, and real-world coding performance, these models open the door to building more capable AI agents, smarter apps, and scalable enterprise tools.

They also reflect a broader evolution in how AI is evaluated—not just by benchmark scores, but by how reliably it performs under real-world constraints. The introduction of new long-context benchmarks like OpenAI-MRCR and Graphwalks shows a growing focus on meaningful, applied intelligence.

The future of AI isn’t just smarter—it’s leaner, sharper, and ready to work.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.