
OpenAI’s GPT-Realtime shown in action, combining natural speech, image input, and SIP calling for production-ready voice agents. Image Source: ChatGPT-5
OpenAI Launches GPT-Realtime and Realtime API for Production Voice Agents
Key Takeaways: OpenAI GPT-Realtime Model and Realtime API Updates
OpenAI made the Realtime API generally available, optimized for production-ready voice agents.
The launch includes GPT-Realtime, a new speech-to-speech model with more natural, expressive audio.
The API now supports image input, remote MCP servers, and phone calling via Session Initiation Protocol (SIP).
GPT-Realtime improves on instruction following, function calling, and multilingual comprehension, scoring significantly higher than earlier models.
Pricing for GPT-Realtime is 20% lower than the previous GPT-4o-realtime-preview model.
OpenAI Expands Realtime API for Production-Ready Voice Agents
OpenAI announced the general availability of its Realtime API, making it easier for developers and enterprises to build reliable, production-ready voice agents. Since its public beta in October 2024, thousands of developers have tested the API, contributing to improvements in reliability, latency, and quality.
The updated API now includes support for remote MCP servers, image input, and Session Initiation Protocol (SIP) phone calling, enabling agents to connect to the public phone network and PBX systems. The new features broaden how developers can integrate external tools and real-world context into conversational agents.
Unlike traditional pipelines that combine separate speech-to-text and text-to-speech models, the Realtime API generates audio directly within a single system. This reduces latency, preserves conversational nuance, and produces more natural, expressive interactions.
Introducing GPT-Realtime: Advanced Speech-to-Speech Model for Real-World Use
At the center of the update is GPT-Realtime, a new speech-to-speech model that OpenAI calls its most advanced to date. The system was built with input from early customers to ensure it performs well in practical scenarios such as customer service, personal assistants, and educational tools.
The model offers stronger instruction following, more accurate tool calling, and more expressive, human-like audio delivery. It can reliably handle tasks like reading scripted disclaimers word-for-word, repeating back phone numbers or other alphanumeric details, and seamlessly shifting between languages in the middle of a conversation.
Audio Quality: Natural, Expressive Speech for Voice Agent Conversations
For voice agents to succeed in real-world settings, they need to sound natural and human, with the right intonation, emotion, and pacing to keep conversations engaging.
GPT-Realtime was trained to capture these qualities, producing audio that feels expressive and lifelike while still responding accurately to developer instructions. It can follow fine-grained prompts such as “speak empathetically in a French accent” or “speak quickly and professionally.”
Two new voices, Cedar and Marin, showcase the most significant improvements in natural-sounding speech, and OpenAI's eight existing voices have also been upgraded with the same enhancements.
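As a rough illustration, configuring a voice and a speaking style might look like the sketch below. It builds a `session.update` event in the shape the Realtime API documented during its beta; the field names (`voice`, `instructions`) and the lowercase voice identifier are assumptions here, not confirmed GA details.

```python
import json

def build_session_update(voice: str, style: str) -> dict:
    """Sketch of a session.update event setting voice and speaking style.

    Field names follow the beta Realtime API event shape; treat them as
    assumptions and check the current API reference before relying on them.
    """
    return {
        "type": "session.update",
        "session": {
            "voice": voice,  # e.g. one of the new voices, "cedar" or "marin"
            "instructions": f"Speak {style}. Keep responses concise.",
        },
    }

# Example: ask for the style described in the article.
event = build_session_update("marin", "quickly and professionally")
print(json.dumps(event, indent=2))
```

Sending an event like this over the session's WebSocket is how developers steer tone at runtime without restarting the conversation.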
Intelligence and Comprehension: Multilingual, Context-Aware Audio Understanding
A key strength of GPT-Realtime is its ability to understand and respond to spoken audio with greater accuracy. The model not only transcribes words correctly but also interprets the context, tone, and non-verbal signals that shape natural conversation. For example, it can recognize a laugh and adjust its response accordingly, making interactions feel less mechanical.
The model can also switch between languages mid-sentence, maintaining coherence across multilingual conversations. In addition, it can adapt tone dynamically, sounding “snappy and professional” in one exchange and “kind and empathetic” in the next.
According to internal evaluations, GPT-Realtime also shows improved accuracy in detecting and repeating alphanumeric sequences—such as phone numbers, VINs, or serial codes—in multiple languages, including Spanish, Chinese, Japanese, and French. This precision is particularly valuable in contexts like customer support or identity verification, where detail and accuracy are critical.
On the Big Bench Audio reasoning benchmark, GPT-Realtime achieved 82.8% accuracy, compared to 65.6% for the December 2024 model.
Instruction Following: Improved Accuracy for Developer Prompts and Scripts
When developers build a speech-to-speech application, they provide the model with a set of rules about how it should behave — including what to say in specific scenarios, what language to avoid, and how to respond to sensitive prompts. In earlier models, these directions sometimes carried little weight, leading to inconsistent results.
GPT-Realtime is much better at respecting those developer instructions. For example, if the prompt requires a support agent to read a disclaimer exactly as written, the model will now deliver it word-for-word instead of paraphrasing. If the setup specifies that the agent should never provide certain information, the model is more reliable at following that rule. Even small adjustments, like asking the agent to pause briefly before answering or keep responses concise, are handled with greater consistency.
On the MultiChallenge audio benchmark, which measures performance on this kind of instruction adherence, GPT-Realtime scored 30.5%, compared with 20.6% for the December 2024 model. This marks a step forward in giving developers tighter control over how their voice agents perform in production.
Function Calling: Smarter, More Precise Tool Use in Voice Applications
For a voice agent to be truly useful, it needs more than natural conversation — it also has to interact with external tools and systems at the right time. That could mean looking up a customer’s account details, booking an appointment, or checking inventory during a live call.
GPT-Realtime improves on earlier models by being more accurate in three key areas:
Choosing the right function for the task
Triggering it at the appropriate moment in the conversation
Supplying the correct arguments or inputs so the request runs smoothly
In testing, the model scored 66.5% on the ComplexFuncBench benchmark, compared with 49.7% for the December 2024 model — showing meaningful progress in real-world reliability.
The system also supports asynchronous function calling, which means a long-running action, like retrieving data from a remote system, doesn’t stall the entire conversation. Instead, the voice agent can keep the dialogue flowing naturally while waiting for results, creating a smoother experience for end users.
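A tool for the inventory-lookup scenario above might be declared as follows. This sketch uses the flat function-tool schema the Realtime API documented in beta (top-level `name` and `parameters` under `"type": "function"`); the `check_inventory` function and its fields are illustrative, not part of any real API.

```python
# Sketch of a function tool a voice agent could call mid-conversation.
# The schema shape mirrors the beta Realtime API tool format; the function
# itself (check_inventory) is a hypothetical example.
check_inventory_tool = {
    "type": "function",
    "name": "check_inventory",
    "description": "Look up current stock for a product during a live call.",
    "parameters": {
        "type": "object",
        "properties": {
            "sku": {"type": "string", "description": "Product SKU to check."},
            "store_id": {"type": "string", "description": "Optional store identifier."},
        },
        "required": ["sku"],
    },
}

# The tool would be attached to the session configuration, e.g.:
session_update = {
    "type": "session.update",
    "session": {"tools": [check_inventory_tool]},
}
```

When the model decides to call the tool, the application runs the lookup itself and streams the result back; with asynchronous calling, the agent can keep talking while that lookup is still in flight.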
New Capabilities in the Realtime API: Image Input, SIP Calling, MCP Support, and Reusable Prompts
The latest release also brings several new features that make the Realtime API more flexible and production-ready:
Remote MCP server support: Developers can now connect a Realtime session to an external MCP server simply by passing in its URL. The API then manages tool calls automatically, removing the need for manual integrations. This means new tools can be added on the fly — point the session to a different server, and those capabilities are available immediately.
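Pointing a session at a remote MCP server might look like the sketch below. The tool shape mirrors the `mcp` tool type in OpenAI's Responses API (`server_label`, `server_url`); its exact use inside a Realtime session configuration is an assumption here, and the server URL is hypothetical.

```python
# Sketch: connecting a Realtime session to a remote MCP server by URL.
# The "mcp" tool shape is borrowed from OpenAI's Responses API; the label
# and URL below are placeholders, not real endpoints.
session_config = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "mcp",
                "server_label": "acme_tools",             # illustrative label
                "server_url": "https://mcp.example.com",  # hypothetical server
            }
        ],
    },
}
```

The appeal of this design is that swapping the URL swaps the whole toolset; the API handles tool discovery and invocation against the server, so no per-tool glue code is needed.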
Image input: Sessions can now include images alongside audio and text. Rather than treating images like a video stream, the API integrates them into the conversation. For example, a user could upload a screenshot and ask the agent to read text from it or describe what’s shown. Developers stay in control of when and how images are shared, ensuring agents only see what’s relevant.
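Attaching a screenshot to the conversation might look like this sketch. The `input_image` content type is borrowed from OpenAI's Responses API, and the `conversation.item.create` event shape follows the beta Realtime API; combining the two as shown is an assumption about the GA behavior.

```python
import base64

def build_image_item(image_bytes: bytes, question: str) -> dict:
    """Sketch: add a user message carrying an image plus a text question.

    Event and content-type names are assumptions drawn from the beta
    Realtime API and the Responses API; verify against current docs.
    """
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {"type": "input_text", "text": question},
                {"type": "input_image", "image_url": data_url},
            ],
        },
    }
```

Because the image is added as an explicit conversation item rather than a continuous video feed, the application decides exactly which frames the agent ever sees.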
Session Initiation Protocol (SIP) support: With SIP integration, applications can connect to the public phone network, PBX systems, desk phones, and other SIP endpoints. This allows AI-powered voice agents to operate within traditional call center infrastructure, making deployment smoother for enterprises already invested in existing telecom systems.
Reusable prompts: Developers can now save and reuse structured prompts that include developer messages, tools, variables, and example user/assistant interactions. These prompts can be applied across multiple Realtime API sessions, similar to how they work in the Responses API, making it easier to maintain consistency across applications.
Safety, Privacy, and Compliance in the Realtime API
OpenAI has built multiple layers of safeguards into the Realtime API to make it safer and easier for developers to deploy responsibly:
Built-in content moderation: Active classifiers monitor conversations in real time. If a session is flagged for harmful or prohibited content, the interaction can be halted immediately to prevent misuse.
Customizable guardrails: Developers can add their own safety measures through the Agents SDK, giving them control over how agents behave in sensitive or high-stakes environments.
Responsible use requirements: The API’s usage policies prohibit developers from repurposing outputs for spam, deception, impersonation, or other harmful purposes. Applications must also make it clear when a user is interacting with AI, unless it is already obvious from context.
Preset voices for security: To reduce the risk of malicious impersonation, the Realtime API only uses preset synthetic voices rather than allowing custom clones of real individuals.
Enterprise privacy and compliance: The system supports EU Data Residency for European applications and falls under OpenAI’s enterprise privacy commitments, giving organizations more confidence about regulatory compliance.
Pricing and Availability for GPT-Realtime and the Realtime API
The Realtime API and GPT-Realtime model are now generally available to all developers, with new pricing designed to lower costs and make large-scale deployment more affordable:
20% price reduction: GPT-Realtime is priced 20% lower than the previous GPT-4o-realtime-preview model, making advanced voice capabilities more accessible.
Audio input pricing: $32 per 1 million audio input tokens, with cached input tokens available at a reduced rate of $0.40 per 1 million tokens.
Audio output pricing: $64 per 1 million audio output tokens, giving developers predictable costs for scaling production use.
Fine-grained conversation controls: Developers can set intelligent token limits and truncate multiple conversation turns at once. This helps manage long-running sessions more efficiently while keeping costs under control.
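The published rates make per-session costs easy to estimate. The sketch below uses only the numbers stated above ($32 per 1M audio input tokens, $0.40 per 1M cached input tokens, $64 per 1M audio output tokens); the example token counts are invented for illustration.

```python
# Cost estimate from the published GPT-Realtime per-million-token rates (USD).
RATES = {"input": 32.00, "cached_input": 0.40, "output": 64.00}

def session_cost(input_toks: int, cached_toks: int, output_toks: int) -> float:
    """Return the dollar cost of a session from its audio token counts."""
    return (
        input_toks * RATES["input"]
        + cached_toks * RATES["cached_input"]
        + output_toks * RATES["output"]
    ) / 1_000_000

# A session with 50k fresh input, 200k cached input, and 30k output tokens:
print(round(session_cost(50_000, 200_000, 30_000), 2))  # prints 3.6
```

The 80x gap between fresh and cached input rates is why prompt caching, together with token limits and turn truncation, dominates the economics of long-running sessions.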
Immediate availability: The updated model and API are available starting today, with documentation, a Playground for testing, and a Realtime API prompting guide to help developers get started quickly.
Q&A: OpenAI GPT-Realtime Model and Realtime API Features
Q: What is GPT-Realtime?
A: GPT-Realtime is OpenAI’s most advanced speech-to-speech model, designed for production-ready voice agents with improvements in audio quality, intelligence, instruction following, and function calling.
Q: What’s new in the Realtime API?
A: The API now supports remote MCP servers, image input, and SIP phone calling, alongside reusable prompts and enhanced multimodal functionality.
Q: How does GPT-Realtime compare to earlier models?
A: On benchmarks, GPT-Realtime scored 82.8% on Big Bench Audio, 30.5% on MultiChallenge, and 66.5% on ComplexFuncBench — all significantly higher than December 2024 results.
Q: How does OpenAI ensure safety and privacy?
A: The Realtime API uses multiple safeguards, prohibits spam or impersonation, requires disclosure of AI interactions, and supports EU Data Residency.
Q: How much does GPT-Realtime cost?
A: Pricing is 20% lower than before: $32 per 1M input tokens and $64 per 1M output tokens, with caching and truncation features to reduce costs further.
What This Means: The Future of AI Voice Agents in Customer Support and Beyond
The release of GPT-Realtime and the general availability of the Realtime API represent a major step in bringing AI-powered voice agents into real-world production. By combining speech-to-speech processing, multimodal input, and enterprise-grade integrations, OpenAI is positioning voice interfaces as a core layer of the next-generation AI ecosystem.
For developers, the updates provide more reliable tools, lower costs, and broader functionality to support customer-facing applications at scale. For users, the improvements in natural, expressive audio mean AI voice agents will feel increasingly seamless and human-like.
As voice agents expand from customer support into education, personal assistance, and beyond, the technology is moving closer to becoming an everyday interface for human-computer interaction.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.