
A photorealistic concept image showing GPT-5.5 positioned as an enterprise AI system designed to complete multi-step professional work across analysis, coding, and research with less human supervision. AI-generated image via ChatGPT (OpenAI)

OpenAI Launches GPT-5.5 for Autonomous Enterprise AI Work

OpenAI has launched GPT-5.5, a new model built to complete complex professional work with less human supervision across enterprise knowledge work, agentic coding, and scientific research.

The release matters because many enterprise teams are deciding right now which AI models to trust inside production workflows. GPT-5.5 is designed to plan, use tools, check its work, handle ambiguity, and continue across multi-step tasks with less back-and-forth from users. The model is rolling out now to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, with GPT-5.5 Pro available to Pro, Business, and Enterprise users. API access for both versions is coming soon.

For enterprise teams, the decision point is whether GPT-5.5 can reliably take on more of the multi-step work they currently supervise by hand.

In short, GPT-5.5 moves the enterprise AI question from which model answers best to which model can complete real professional work with less supervision.

GPT-5.5 is OpenAI's latest large language model for autonomous multi-step execution across professional knowledge work, coding, and scientific research with reduced human oversight.

Key Takeaways: GPT-5.5 Enterprise AI Deployment

GPT-5.5 is an enterprise AI model designed to complete complex professional work end-to-end with less human supervision across knowledge work, coding, and research workflows.

  • GPT-5.5 is available now for Plus, Pro, Business, and Enterprise users in ChatGPT and Codex; API access is coming soon at $5 per 1M input tokens and $30 per 1M output tokens with a 1M context window.

  • GPT-5.5 scores 84.9% on GDPval, 78.7% on OSWorld-Verified, and 91.7% on BigLaw Bench, showing gains across enterprise knowledge work, autonomous computer use, and legal workflows.

  • Box reported a 10-point accuracy gain with GPT-5.5 over GPT-5.4 across financial services, healthcare, and due diligence evaluations.

  • In agentic coding, GPT-5.5 reaches 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro, while Claude Opus 4.7 leads SWE-Bench Pro at 64.3%, according to OpenAI's published benchmark data.

  • GPT-5.5 Pro is priced at $30 per 1M input tokens and $180 per 1M output tokens, targeting higher-accuracy professional workflows in legal, finance, education, and data science.

  • Artificial Analysis data shows GPT-5.5 leading AA-Omniscience accuracy at 57%, but it abstains on uncertain questions only about 14% of the time, creating a calibration tradeoff for high-stakes enterprise workflows.

GPT-5.5 Expands Autonomous Enterprise Knowledge Work

GPT-5.5 delivers its strongest enterprise gains not in coding but across the professional work most organizations run on every day — documents, research, financial analysis, customer workflows, and legal review.

OpenAI says GPT-5.5 is better at the full knowledge work loop: finding information, understanding what matters, using tools, checking output, and turning raw material into a finished deliverable. That full-loop capability — not any single feature — is what makes autonomous execution possible in real professional environments.

Enterprise knowledge work benchmark results:

  • GDPval (knowledge work across 44 occupations): 84.9%

  • OSWorld-Verified (autonomous computer operation — clicking, typing, navigating real interfaces): 78.7%

  • Tau2-bench Telecom (complex customer-service workflows, no prompt tuning): 98.0%

  • FinanceAgent: 60.0%

  • Internal investment-banking modeling tasks: 88.5%

  • OfficeQA Pro: 54.1%

  • BigLaw Bench: 91.7%

The BigLaw Bench result is worth pausing on. Legal workflows — especially multi-issue regulatory analysis — have been among the hardest categories for AI models to handle reliably because they require sustained reasoning across dense, high-stakes material.

Niko Grupen, Head of Applied Research at Harvey, said: "GPT-5.5 delivered one of the strongest BigLaw Bench performances we've seen to date. The model showed significant improvements on complex legal workflows like multi-issue SEC regulatory analysis, which have historically challenged frontier models. It also stands out on open-ended research queries, leading with the takeaway and scaling detail to the complexity of the question."

GPT-5.5 Pro targets harder questions and higher-accuracy work — early testers found its responses significantly more comprehensive, well-structured, accurate, and relevant than GPT-5.4 Pro's, with especially strong results in business, legal, education, and data science.

What autonomous execution looks like inside OpenAI's own teams

The most concrete signal of what GPT-5.5 can do in enterprise workflows comes from OpenAI's own operations. More than 85% of the company uses Codex every week — not just in engineering, but across finance, communications, marketing, data science, and product management.

Three internal use cases show the range:

  • The Comms team used GPT-5.5 in Codex to analyze six months of speaking request data, build a scoring and risk framework, and validate an automated Slack agent — routing low-risk requests automatically while keeping higher-risk ones in human review.

  • The Finance team used Codex to review 24,771 K-1 tax forms totaling 71,637 pages, with a privacy-protective workflow that excluded personal information. The task completed two weeks faster than the prior year.

  • On the Go-to-Market team, an employee automated weekly business report generation, saving 5-10 hours per week.

These are not demos. They are operational workflows inside a large organization — which is the context enterprise buyers need when evaluating whether AI can take on more of the work they currently supervise by hand.

What enterprise partners reported

Ben Kus, CTO at Box: "GPT-5.5 delivered one of the best results we've seen on Box's enterprise work evaluation set. It showed broad gains across every category we tested, including Financial Services, Healthcare, Public Sector, Data Analysis, and Due Diligence, where stronger reasoning, better follow-through, and more reliable execution matter most. Overall accuracy rose 10 points over GPT-5.4, from 67% to 77%."

Rajesh Tella, Director of AI Engineering at Lowe's:

"Across three distinct Lowe's use cases — plant image recognition, agentic tool routing, and conversational AI quality — GPT-5.5 delivered meaningful gains over the prior model, including a 25% lift in plant recognition accuracy, stronger product matching, and more reliable 20-tool agentic workflows. These improvements translate into better experiences for associates, leaders, and customers."

Nilesh Dalvi, Engineering Lead at Glean:

"GPT-5.5 delivers on both fronts: better instruction-following throughout complex tasks, paired with low latency and token efficiency that come from its improved adaptive reasoning. It's well-suited for our customers' most demanding agentic workloads."

Vinod Peris, SVP of Network Product Engineering at Cisco:

"GPT-5.5 is more persistent, needs less guidance, and cuts through complexity with high-signal outputs. On large-scale codebases, it surfaces insights we can act on immediately — from deep memory optimization to proactive security hardening — helping engineers drive outcomes in days, not months."

Justin Boitano, VP of Enterprise AI at NVIDIA:

"GPT-5.5 enables our teams to ship end-to-end features from natural language prompts, cut debug time from days to hours, and turn weeks of experimentation into overnight progress in complex codebases. It's more than faster coding — it's a new way of working that helps people operate at a fundamentally different speed."

Across these results, the pattern is consistent: GPT-5.5 is completing work that previously required more human coordination to finish — not just completing it faster.

GPT-5.5 Improves Agentic Coding and Long-Horizon Engineering

GPT-5.5 is designed to complete long-horizon coding tasks — holding context across large systems, reasoning through ambiguous failures, and carrying changes through an entire codebase — not just produce clean output on isolated problems.

That distinction is what separates a model engineering teams can trust with real work from one that performs well in demos.

Coding benchmark results:

  • Terminal-Bench 2.0 (complex command-line workflows requiring planning, iteration, and tool coordination): GPT-5.5 82.7% vs. GPT-5.4 75.1% — above Claude Opus 4.7 at 69.4% and Gemini 3.1 Pro at 68.5%

  • SWE-Bench Pro (Public) (real-world GitHub issue resolution): GPT-5.5 58.6% vs. GPT-5.4 57.7% — Claude Opus 4.7 leads at 64.3%, Gemini 3.1 Pro at 54.2%, per OpenAI's own published data

  • Expert-SWE (OpenAI internal benchmark — long-horizon coding tasks with a median estimated human completion time of 20 hours): GPT-5.5 73.1% vs. GPT-5.4 68.5% — no external model scores published

Across all three benchmarks, OpenAI says GPT-5.5 improves on GPT-5.4 while using fewer tokens. The Claude Opus 4.7 lead on SWE-Bench Pro is worth noting — GPT-5.5 does not hold every coding benchmark, and teams evaluating it for GitHub-style issue resolution should test both models against their actual workflows.

What early testers actually experienced

Benchmarks capture scores. They don't capture the qualitative shift several early testers described — the feeling that GPT-5.5 understands the shape of a problem, not just the surface of it.

Dan Shipper, Founder and CEO of Every, called it "the first coding model I've used that has serious conceptual clarity." His test was direct: could GPT-5.5 look at a broken system and independently produce the same architectural rewrite his best engineer had eventually decided on? GPT-5.4 could not. GPT-5.5 could.

Pietro Schirano, CEO of MagicPath, described merging a branch with hundreds of frontend and refactor changes into a substantially changed main branch — resolved in one shot in about 20 minutes. His reaction: "It genuinely feels like I'm working with a higher intelligence, and there's almost a sense of respect."

Senior engineers who tested the model said GPT-5.5 catches issues in advance and predicts testing and review needs without being asked — a behavior that defines engineering trust, not just output quality. One engineer at NVIDIA described losing access to the model as feeling "like I've had a limb amputated."

What coding platform partners reported

Michael Truell, Co-founder and CEO at Cursor: "GPT-5.5 is noticeably smarter and more persistent than GPT-5.4, with stronger coding performance and more reliable tool use. It stays on task for significantly longer without stopping early, which matters most for the complex, long-running work our users delegate to Cursor."

Scott Wu, Co-founder and CEO at Cognition: "GPT-5.5 has set a new bar for what's possible with Devin. It runs longer and more autonomously than any GPT model we've tested. It surfaces bugs that no other model can catch, and also investigates and fixes production issues end-to-end."

Joe Binder, VP of Product at GitHub: "In our evaluations, we're seeing meaningful gains in capability on complex, multi-step coding tasks. On SWE-Bench, the model resolves 6+ percentage points more tasks and, on more complex workflows, reaches solutions in substantially fewer steps — often 50-60% less — with incremental improvements in end-to-end completion time. For developers, that translates to less waiting, less manual intervention, and more confidence that the system can carry difficult problems through to resolution."

Fabian Hedin, CTO and Co-Founder at Lovable: "Builders want continuous progress, not endless iteration. GPT-5.5 breaks through the walls people usually hit on more complex tasks, like authentication flows and real-time syncing, in far fewer turns. The model really shines when the work gets hard, handling tough tasks with far less back-and-forth."

Jeff Wang, CEO at Windsurf: "GPT-5.5 is a giant leap forward for handling ambiguity compared to previous GPT models. As Windsurf 2.0 focuses more on parallel agents, this model is key for long-horizon tasks. It excels at understanding intent, reasoning through complexity, and executing with minimal back-and-forth."

For engineering teams evaluating how much work AI can own end-to-end, GPT-5.5 moves that line further than any previous model — but the SWE-Bench Pro gap with Claude Opus 4.7 is a reminder that the right model still depends on the specific task.

GPT-5.5 Advances Scientific Research in Bioinformatics and Mathematics

GPT-5.5 shows meaningful gains in scientific research — tasks that require exploring an idea, gathering evidence, testing assumptions, and deciding what to try next across multiple passes.

The results are most pronounced in bioinformatics, genetics, and mathematics, where the work has no clean finish line and sustained reasoning across ambiguous problems is the core requirement.

Bioinformatics and genetics

On GeneBench, a benchmark focused on multi-stage scientific data analysis in genetics and quantitative biology, GPT-5.5 shows a clear improvement over GPT-5.4.

These aren't clean, well-structured problems. GeneBench tasks require models to reason about potentially ambiguous or error-prone data with minimal guidance, navigate hidden confounders and quality-control failures, and correctly implement modern statistical methods — work that often corresponds to multi-day projects for scientific experts.

On BixBench, built around real-world bioinformatics and data analysis, GPT-5.5 achieved leading performance among models with published scores.

Brandon White, Co-Founder and CEO at Axiom Bio, said: "It's incredibly energizing to use OpenAI's new GPT-5.5 model in our harness, have it reason over massive biochemical datasets to predict human drug outcomes, and then see it deliver significant accuracy gains on our hardest drug discovery evals. If OpenAI keeps cooking like this, the foundations of drug discovery will change by the end of the year."

Mathematics

An internal version of GPT-5.5 with a custom harness helped discover a new proof about Ramsey numbers — a result in combinatorics, the branch of mathematics studying how discrete objects such as graphs, networks, and patterns fit together.

The proof concerned a longstanding asymptotic fact about off-diagonal Ramsey numbers and was later verified in Lean. It represents GPT-5.5 contributing not just code or explanation but a novel mathematical argument in an active research area — the kind of output that has historically required human mathematicians working over extended periods.

What research collaboration looks like in practice

Early testers used GPT-5.5 Pro less like a query engine and more like a research collaborator — critiquing manuscripts over multiple passes, stress-testing technical arguments, and working across code, notes, and PDF context simultaneously.

Derya Unutmaz, an immunology professor at the Jackson Laboratory for Genomic Medicine, used GPT-5.5 Pro to analyze a gene-expression dataset with 62 samples and nearly 28,000 genes. The result was a detailed research report that surfaced key questions and insights — work he said would have taken his team months.

Bartosz Naskręcki, assistant professor of mathematics at Adam Mickiewicz University in Poznań, Poland, built an algebraic-geometry app from a single prompt in 11 minutes using GPT-5.5 in Codex, visualizing the intersection of quadratic surfaces and converting the resulting curve into a Weierstrass model.

For research teams, the shift GPT-5.5 represents isn't about getting answers faster — it's about having a system that can stay inside the problem long enough to be genuinely useful.

GPT-5.5 Improves Token Efficiency With NVIDIA Infrastructure

GPT-5.5 matches GPT-5.4 per-token latency in real-world serving despite being a significantly more capable model — a result that required rethinking inference as an integrated system, not a set of isolated optimizations.

GPT-5.5 was co-designed for, trained with, and served on NVIDIA GB200 and GB300 NVL72 systems, built into the model's development from the start rather than added after the fact.

One key engineering improvement involved load balancing and partitioning heuristics. Previously, requests were split into a fixed number of chunks to balance work across computing cores, a static approach that doesn't account for varying traffic patterns. OpenAI used Codex to analyze weeks of production traffic and write custom heuristic algorithms to optimally partition and balance work. That single effort increased token generation speeds by more than 20%.
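The static-versus-adaptive partitioning idea described above can be sketched in a few lines. This is a hypothetical illustration, not OpenAI's serving code: the function names, the spare-capacity signal, and all numbers are invented to show why traffic-aware splitting beats a fixed chunk count.

```python
def static_partition(request_tokens, num_chunks):
    """Split a request into a fixed number of equal chunks,
    ignoring current traffic (the static approach)."""
    chunk = -(-request_tokens // num_chunks)  # ceiling division
    sizes, remaining = [], request_tokens
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= sizes[-1]
    return sizes

def traffic_aware_partition(request_tokens, spare_capacity):
    """Split a request in proportion to each core's spare capacity,
    so lightly loaded cores receive larger chunks."""
    spare = [max(s, 0) for s in spare_capacity]
    total = sum(spare) or len(spare)  # avoid division by zero
    sizes = [round(request_tokens * s / total) for s in spare]
    sizes[-1] += request_tokens - sum(sizes)  # absorb rounding drift
    return [s for s in sizes if s > 0]

# Static splitting gives equal chunks; the traffic-aware version
# skews work toward cores with headroom.
print(static_partition(1000, 4))
print(traffic_aware_partition(1000, [3, 1, 4, 2]))
```

Under skewed load, equal chunks leave busy cores as the bottleneck; proportional splitting keeps all cores finishing at roughly the same time, which is the kind of heuristic the 20%-plus throughput gain suggests.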

For enterprise buyers, the efficiency story matters as much as the capability story. OpenAI says GPT-5.5 completes comparable Codex tasks with fewer tokens than GPT-5.4 — meaning the higher per-token price doesn't automatically translate to higher total cost for most workflows.

Justin Boitano, VP of Enterprise AI at NVIDIA, said GPT-5.5 — built and served on NVIDIA GB200 NVL72 systems — enables teams to "cut debug time from days to hours, and turn weeks of experimentation into overnight progress."

OpenAI Rates GPT-5.5 Cybersecurity and Biology Capabilities as High Risk

OpenAI has classified GPT-5.5's biological and cybersecurity capabilities as "High" under its Preparedness Framework — the risk evaluation system the company has applied across multiple model generations.

Why API Access Is Delayed

That classification carries a direct practical consequence: it is the primary reason API access is delayed. OpenAI has stated that serving a High-rated model at API scale requires additional safeguards the company is still working through, and that it is moving deliberately before opening the model to developers at scale.

What "High" Means in Practice

GPT-5.5 did not reach the "Critical" cybersecurity capability level in OpenAI's evaluations — but it represents a meaningful step up from GPT-5.4. OpenAI first introduced cyber-specific safeguards with GPT-5.2 and has refined them through each subsequent release.

For GPT-5.5, the company designed tighter controls around higher-risk activity, sensitive cyber requests, and repeated misuse patterns. Some users may initially find the new classifiers more restrictive than expected while OpenAI continues tuning them. The company frames this directly: the goal is stronger controls around the cyber workflows most likely to cause harm while keeping access open for legitimate defensive work.

OpenAI's broader position is that frontier model cybersecurity capabilities will become widely distributed regardless — and that the best path forward is ensuring those capabilities are put to work accelerating cyber defense, not waiting until they're democratized by other means.

How OpenAI Is Expanding Access Responsibly

Rather than restricting access broadly, OpenAI is expanding it selectively. The company is making cyber-permissive models available through Trusted Access for Cyber, starting with Codex. Verified users meeting defined trust signals get expanded access to GPT-5.5's advanced cybersecurity capabilities with fewer restrictions — reducing unnecessary friction for the security professionals who need them most.

Organizations responsible for defending critical infrastructure can apply to access GPT-5.4-Cyber, a separate cyber-permissive model, subject to strict security requirements, at chatgpt.com/cyber. This two-track approach — GPT-5.5 with trusted access, GPT-5.4-Cyber for infrastructure defenders — is designed to get capable tools into the hands of verified defenders at every level.

OpenAI is also working with government partners on deploying advanced AI to support officials responsible for defending systems including those securing taxpayer data, power grids, and water supplies.

What Went Into the Release Evaluation

The full GPT-5.5 release included:

  • Preparedness evaluations across OpenAI's full safety framework

  • Domain-specific testing for advanced biology and cybersecurity capabilities

  • New targeted evaluations developed specifically for this model

  • External red-teaming with nearly 200 trusted early-access partners before release

For enterprise teams deploying GPT-5.5 in security-adjacent workflows, the Preparedness Framework classification is worth understanding — not as a reason to avoid the model, but as context for where OpenAI sees the capability boundaries, what safeguards are in place, and why API timing is being handled differently than prior releases.

GPT-5.5 Leads AA-Omniscience Accuracy but Raises Calibration Questions

Artificial Analysis is an independent AI benchmarking organization that evaluates frontier models across accuracy, speed, and reliability metrics. Its AA-Omniscience benchmark tests how well models answer factual questions across a broad range of knowledge domains — making it one of the more comprehensive third-party measures of real-world model accuracy.

According to Artificial Analysis data, GPT-5.5 leads all models on AA-Omniscience accuracy at 57% — meaning it produces the highest rate of correct answers across the evaluation set.

But when GPT-5.5 doesn't know an answer, it is far less likely to say so. Artificial Analysis data shows it abstains from uncertain questions only about 14% of the time, compared to significantly higher abstention rates from models like Grok, Claude, and Gemini. That matters because a model that answers confidently when uncertain gives professionals no signal to stop and verify. A wrong answer that looks like a right answer is harder to catch than no answer at all.

This is a calibration tradeoff, not a straightforward hallucination problem. GPT-5.5 attempts answers more aggressively, which drives its accuracy score up but also means it produces confident responses in situations where a more cautious model would express uncertainty.

For general knowledge retrieval and most professional tasks, this tradeoff may be acceptable. But in legal analysis, financial modeling, healthcare decision support, compliance review, and scientific research, a confident wrong answer can be more damaging than an admitted unknown. Enterprise teams deploying GPT-5.5 in these environments should build human review steps or verification workflows into their processes accordingly.
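A verification workflow of the kind recommended above can be as simple as a routing gate. This is a minimal sketch under stated assumptions: the confidence score, threshold, and domain labels are hypothetical placeholders, not part of any OpenAI API.

```python
# Hypothetical gate: low-confidence answers in high-stakes domains
# go to human review instead of being returned directly.
HIGH_STAKES_DOMAINS = {"legal", "finance", "healthcare", "compliance"}

def route_answer(answer: str, confidence: float, domain: str,
                 threshold: float = 0.9) -> str:
    """Return a routing decision for a model answer.

    Answers in high-stakes domains must clear the confidence
    threshold; everything else is auto-approved.
    """
    if domain in HIGH_STAKES_DOMAINS and confidence < threshold:
        return "HUMAN_REVIEW"
    return "AUTO_APPROVE"

print(route_answer("The filing deadline is Q3.", 0.72, "legal"))
print(route_answer("Python lists are mutable.", 0.99, "general"))
```

The design point is that the gate compensates for the model's low abstention rate: when the model won't say "I don't know," the workflow has to supply that signal instead.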

GPT-5.5 Pricing, API Access and Subscription Availability

GPT-5.5 is available now in ChatGPT and Codex for Plus, Pro, Business, and Enterprise users. GPT-5.5 Thinking is available to Plus, Pro, Business, and Enterprise users. GPT-5.5 Pro is available to Pro, Business, and Enterprise users.

In Codex specifically, GPT-5.5 extends to Edu and Go plans as well, with a 400K context window. A Fast mode generates tokens 1.5x faster at 2.5x the standard cost.

API pricing:

  • gpt-5.5: $5 per 1M input tokens / $30 per 1M output tokens — 1M context window

  • gpt-5.5-pro: $30 per 1M input tokens / $180 per 1M output tokens

  • Batch and Flex pricing: half the standard rate

  • Priority processing: 2.5x the standard rate

OpenAI has published API pricing ahead of availability. API access for both versions is coming soon.

While GPT-5.5 is priced higher than GPT-5.4, OpenAI says the model is more token-efficient, with net costs for most Codex users expected to remain comparable.
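The "higher per-token price, comparable net cost" claim is a simple arithmetic check. In the sketch below, the gpt-5.5 rates are OpenAI's published figures, but the GPT-5.4 rates and all token counts are hypothetical, chosen only to show how token efficiency can offset a higher rate.

```python
def task_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in dollars for one task, given per-1M-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical older model: cheaper per token, but uses more tokens.
old = task_cost(200_000, 80_000, in_price=3, out_price=18)

# gpt-5.5 at published rates ($5/1M input, $30/1M output), assuming
# it finishes the same task with fewer tokens.
new = task_cost(120_000, 40_000, in_price=5, out_price=30)

print(f"old: ${old:.2f}  new: ${new:.2f}")
```

Whether the net cost actually comes out comparable depends entirely on the efficiency gain for a team's own workloads, which is why running this arithmetic against real token logs matters more than the list price.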

Q&A: GPT-5.5 Enterprise Deployment, Pricing and Risk

Q: What is GPT-5.5?
A: GPT-5.5 is OpenAI's latest large language model, built to complete complex, multi-step professional tasks with less human supervision. It is designed for enterprise knowledge work, agentic coding, scientific research, and other workflows that require planning, tool use, context management, and follow-through across long tasks.

Q: Why does GPT-5.5 matter for enterprise teams?
A: GPT-5.5 matters because enterprise technology leaders are deciding which AI models to embed in document workflows, business operations, research pipelines, and software development environments. Its benchmark results and early enterprise use cases give buyers new evidence for evaluating whether autonomous AI can complete more production work with less oversight.

Q: How is GPT-5.5 different from GPT-5.4?
A: GPT-5.5 improves on GPT-5.4 by requiring less step-by-step guidance, maintaining context across longer tasks, and continuing through ambiguity instead of stopping early. OpenAI says the model is also more token-efficient, completing comparable Codex tasks with fewer tokens despite higher per-token pricing.

Q: Is GPT-5.5 safe to use for legal, finance, healthcare, or compliance work?
A: Enterprise teams should evaluate GPT-5.5 carefully before using it in high-stakes workflows such as legal analysis, finance, healthcare, compliance, or scientific research. Artificial Analysis data shows GPT-5.5 leading AA-Omniscience accuracy at 57%, but with an approximately 14% abstention rate, meaning it may answer more aggressively instead of abstaining when uncertain.

Q: How much does GPT-5.5 cost, and who can access it?
A: GPT-5.5 is available now in ChatGPT and Codex for eligible paid plans, while API access is coming soon. In the API, gpt-5.5 will cost $5 per 1M input tokens and $30 per 1M output tokens with a 1M context window. gpt-5.5-pro will cost $30 per 1M input tokens and $180 per 1M output tokens.

What This Means: GPT-5.5 and the Enterprise AI Deployment Decision

GPT-5.5 is OpenAI's clearest attempt yet to move enterprise AI from assisted work to completed work, with less human supervision across real professional workflows.

Key point: GPT-5.5 reduces the human oversight required to complete professional work end-to-end. The gains are not limited to coding; they extend across knowledge work, document analysis, research, and scientific workflows. That makes the model a business infrastructure decision, not just a technology upgrade.

Who should care: Enterprise technology leaders, operations teams, legal and finance departments, researchers, and knowledge workers in document-heavy environments should evaluate GPT-5.5 now. The model is most relevant for teams deciding which tasks can move from human-assisted AI workflows to more autonomous execution.

Why this matters now: Organizations are choosing which AI models to build into core workflows, and those decisions become harder to reverse once infrastructure is in place. GPT-5.5's performance gains make it an important model to test against real work, especially for teams evaluating how much of a workflow AI can complete without constant handholding.

What decision this affects: The practical decision is where GPT-5.5's autonomous execution belongs inside the organization. Some workflows may benefit from more delegation to AI, while higher-risk tasks may still require stronger human review, validation, or model comparison.

In short, GPT-5.5 gives enterprise teams a more practical way to test AI delegation: not by asking whether the model is impressive, but by asking which parts of real work it can finish reliably.

The real test of GPT-5.5 will not be whether it looks impressive in a benchmark, but whether it can carry meaningful work far enough that teams trust it with more of the job.


Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing support from Claude, and AEO/GEO/SEO optimization, image concept development, and editorial structuring support from ChatGPT, AI assistants. All final editorial decisions, perspectives, and publishing choices were made by Alicia Shapiro.
