GPT-Realtime-2 Changes Everything: When Voice Agents Can Reason, What Happens to Brand Discovery?
On May 7, 2026, OpenAI released three voice models in its Realtime API. One of them matters more than the other two combined. GPT-Realtime-2 is the first voice model built on GPT-5-class reasoning, and it does something no voice AI has done before: it can think while it talks.
That distinction sounds small. It is not. Every voice assistant you have ever used, from Siri to Alexa to the previous generation of ChatGPT Voice, operated on the same basic contract: you speak, it processes, it responds. The model could not reason through a multi-step problem while maintaining a natural conversation. It could not call a tool, wait for the result, and then explain what it found without breaking the flow. It could not recover when you changed your mind mid-sentence. Voice AI was fast, but it was shallow.
GPT-Realtime-2 changes the contract. A voice agent built on this model can listen, reason through your request, call multiple tools simultaneously, narrate what it is doing, recover gracefully from interruptions, and adjust its tone depending on whether you are frustrated, curious, or ready to buy. The context window jumped from 32K to 128K tokens, meaning the agent can maintain coherence across conversations that would have overwhelmed its predecessor. On the Big Bench Audio benchmark for audio intelligence, it scores 96.6%, compared to 81.4% for GPT-Realtime-1.5.
This is a technical milestone. But the real story is what it means for how brands get discovered, recommended, and chosen, because the interface through which consumers find and evaluate products is about to undergo its most significant shift since the smartphone.
What GPT-Realtime-2 Actually Does
The model was announced alongside two companions: GPT-Realtime-Translate, which translates live speech across 70 input languages into 13 output languages, and GPT-Realtime-Whisper, a streaming transcription model. Both are useful. Neither is the point.
GPT-Realtime-2 is the one that rewrites the rules. OpenAI describes it as "built for live voice interactions where the model keeps the conversation moving while it reasons through a request, calls tools, handles corrections or interruptions, and responds in a way that fits the moment." That description undersells it.
Consider what this enables in practice. Zillow is already building an assistant that lets you say: "Find me homes within my BuyAbility, avoid busy streets, and schedule a tour for Saturday." The agent reasons through the budget constraint, filters for street traffic data, checks calendar availability, and confirms the booking, all in one spoken exchange. Josh Weisberg, SVP and Head of AI at Zillow, reported a 26-point lift in call success rate on their hardest adversarial benchmark, from 69% to 95%, after optimizing for GPT-Realtime-2.
Priceline is building toward a future where a traveler can manage an entire trip by voice: searching for flights conversationally, adjusting hotel reservations after a flight delay, getting real-time TSA wait times, and even translating conversations at the destination. Deutsche Telekom is testing multilingual voice support where customers speak in their preferred language and the conversation happens in real time.
The model introduces several capabilities that, taken together, represent a qualitative shift in what voice AI can do:
Preamble phrases. The agent can say "let me check that" or "one moment" while it works, eliminating the awkward silence that made previous voice agents feel broken.
Parallel tool calls. It can call multiple tools at once and narrate what it is doing. "Checking your calendar and looking up available flights" happens simultaneously, not sequentially.
Recovery behavior. When something goes wrong, the model says "I'm having trouble with that right now" instead of failing silently or halting the conversation entirely.
Adjustable reasoning effort. Developers can dial reasoning from minimal to extra-high, balancing latency against complexity. A simple lookup uses minimal reasoning; a multi-step purchase decision uses extra-high (see the configuration sketch after this list).
Tone control. The model adjusts its speaking style based on context: calm during problem-solving, empathetic during complaints, upbeat after successful outcomes.
128K context window. The previous limit was 32K. This fourfold increase means the agent can maintain coherent, context-rich conversations across extended interactions, including complex purchase journeys that involve comparisons, deliberation, and multiple decision points.
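To make the effort dial and tool setup concrete, here is a minimal illustrative sketch of what a session configuration for a reasoning voice agent could look like. The field names and values below (the model identifier, reasoning_effort, the tool entries) are assumptions for illustration only, not confirmed GPT-Realtime-2 API parameters; OpenAI's Realtime API documentation defines the actual interface.

```python
# Illustrative only: field names and values are assumptions, not confirmed API parameters.
# The idea is that reasoning effort, tone and preamble guidance, and tool access
# are declared up front, and the agent manages them mid-conversation.
session_config = {
    "model": "gpt-realtime-2",          # hypothetical model identifier
    "reasoning_effort": "extra_high",   # assumed values: minimal | low | medium | high | extra_high
    "instructions": (
        "Keep the conversation moving: say a short preamble like 'let me check that' "
        "before long tool calls, stay calm during problem-solving, and acknowledge "
        "interruptions instead of restarting."
    ),
    "tools": [
        {"name": "search_listings", "description": "Search homes by budget, location, and street traffic."},
        {"name": "check_calendar", "description": "Check the user's availability for a tour."},
        {"name": "book_tour", "description": "Book a home tour at a given time."},
    ],
}
```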
On benchmarks, the gains are measurable. GPT-Realtime-2 with high reasoning scored 96.6% on Big Bench Audio compared to 81.4% for the previous model. On Audio MultiChallenge, which evaluates multi-turn conversational intelligence including instruction following and context integration, the extra-high reasoning variant scored 48.5% versus 34.7% for GPT-Realtime-1.5.
Why This Is Different From "Better Voice Search"
It is tempting to categorize GPT-Realtime-2 as an incremental improvement to voice search, the same way upgrading from GPT-3 to GPT-4 made chatbots better at answering questions. That framing misses what is structurally new.
Voice search, as it existed before this model, was a transcription problem wrapped in a search query. You spoke, the system transcribed your words, ran what amounted to a text search, and read back the top result. The voice layer was an interface, not an intelligence layer. The model did not understand your intent; it matched your keywords.
GPT-Realtime-2 is not voice search. It is a voice agent. The difference is not academic. A search engine returns results. An agent makes decisions.
When a consumer says to a GPT-Realtime-2 agent, "I need a good hotel in Barcelona for a family of four, near the beach, under 200 a night, and we need a place that can handle a gluten allergy," the agent does not run a keyword search and return a list. It reasons through the constraints, evaluates options against multiple dimensions simultaneously, calls booking APIs, checks restaurant options near candidate hotels for gluten-free menus, and presents a curated recommendation with rationale. It can even adjust in real time: "Actually, my budget is flexible if it means being closer to the beach" triggers a re-evaluation without restarting the conversation.
This is why the reasoning capability matters more than the voice capability. Voice is the interface. Reasoning is the intelligence. The combination means that for the first time, a spoken conversation with an AI can be a genuine substitute for the multi-tab browser session that consumers currently use to research and compare products.
The Brand Discovery Implications
Here is the core problem for brands: the entire discipline of digital marketing, from SEO to paid search to social media, is built around the assumption that consumers discover brands through text-based interfaces where brands can control their presentation. A search result has a title, a snippet, and a URL. A social media profile has a bio, images, and a content feed. An ad has copy, creative, and a landing page.
A voice conversation has none of that.
When a consumer asks a voice agent for a recommendation, there are no blue links. There are no featured snippets. There are no ad slots. There is a spoken answer, and in that answer, the agent either names your brand or it does not. There is no second page. There is no "showing results 1-10 of 4,237,891." There is the agent's answer, and that is the entire competitive landscape.
This compression from a page of results to a single spoken recommendation is not theoretical. It is already happening in text-based AI answers. ChatGPT, Gemini, and Perplexity regularly recommend one to three brands per answer. But in text, the consumer can still scroll, click, verify, compare. In a voice conversation, especially one happening while driving, cooking, or walking through an airport, the spoken recommendation carries even more weight because the friction of switching to a screen is high. The voice answer becomes the answer.
GPT-Realtime-2 accelerates this dynamic in three specific ways:
First, the reasoning depth enables product-level recommendations, not just brand-level mentions. Previous voice agents could answer "what is the best CRM software?" with a list of brand names. A reasoning voice agent can evaluate your specific requirements against multiple products, compare pricing tiers, check integration compatibility with your existing stack, and recommend a specific product at a specific tier. Brands that optimize only for brand-name recognition lose to brands that provide the structured, comparable, agent-readable product data that enables this kind of evaluation.
Second, the tool-calling capability creates a transactional layer inside the conversation. The agent does not just recommend; it can act. "Book it" or "add to cart" becomes a spoken command that the agent executes using live APIs. This turns the discovery-to-purchase pipeline into a single spoken exchange. Brands that are not connected to the relevant APIs, or whose product data is not structured in a way that agents can access and evaluate, get cut out of the transaction entirely.
Third, the translation model creates multilingual commerce at scale. GPT-Realtime-Translate handles 70 input languages into 13 output languages in real time. A traveler in Tokyo can speak English to a voice agent that then searches, evaluates, and books services in Japanese. This is not just a convenience feature. It removes the language barrier that has historically favored local brands in non-English markets. Global brands with strong agent-readable product data suddenly become competitive in markets where they previously lacked local language presence.
What "Voice Visibility" Actually Means
The term "AI visibility" has entered the marketing mainstream, but most of the conversation focuses on text-based AI answers: how often ChatGPT mentions your brand, whether Google AI Overviews cites your content, whether Perplexity links to your site. These are important questions. They are also incomplete.
Voice visibility is a distinct problem from text-based AI visibility, and GPT-Realtime-2 makes it urgent. Here is why:
In text-based AI answers, there is a citation layer. Perplexity shows inline sources. Google AI Overviews provides links. ChatGPT sometimes footnotes its claims. Even when the answer compresses a brand's presence to a single mention, there is a trail. The consumer can verify, compare, or dig deeper.
In a voice conversation, the citation layer is weak or absent. The agent speaks an answer. If the consumer is driving, cooking, or multitasking, they hear the recommendation and move on. There is no easy way to audit which sources the agent consulted, which products it considered and rejected, or why it chose one brand over another. The recommendation is the endpoint, not the starting point.
This creates a visibility challenge that is qualitatively different from anything the SEO industry has dealt with:
No SERP to optimize. There is no ranking position to track because there is no search engine results page. The agent evaluates brands algorithmically, but the evaluation happens inside a reasoning process, not a ranked list.
No click-through to measure. When a voice agent recommends a brand and the consumer accepts the recommendation without ever visiting the brand's website, traditional analytics register nothing. The brand gained a customer through a channel that leaves no trace in standard web analytics.
No creative to control. In paid search, brands control the ad copy, the landing page, the call-to-action. In voice, the agent paraphrases, summarizes, and interprets. The brand's carefully crafted messaging gets reassembled by a reasoning model into whatever phrasing best fits the conversation.
No A/B testing framework. You cannot run two versions of a voice answer and measure which one converts better because you do not control the answer. The agent does.
These constraints do not mean brands are helpless. They mean the optimization playbook needs to change, from optimizing for how a page looks to optimizing for how a brand's data is structured, connected, and agent-readable.
The New Optimization Playbook
Brands that want to be visible in voice agent conversations need to think about three layers:
Layer 1: Structured, Agent-Readable Product Data
Voice agents evaluate products by comparing structured attributes: price, features, availability, compatibility, ratings, certifications, and inventory. If your product data is trapped in PDFs, rendered only in JavaScript-heavy pages, or inconsistently formatted across your site, agents cannot evaluate it properly.
This is where standards like schema.org markup, product feeds, and machine-readable APIs become critical. Not because Google might use them, but because reasoning voice agents will use them to compare your product against competitors in real time.
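As a concrete illustration of what "agent-readable" means, here is a minimal sketch that emits schema.org Product markup as JSON-LD. The property names (name, offers, aggregateRating, and so on) come from the public schema.org vocabulary; the product values are invented placeholders.

```python
import json

# Minimal schema.org Product record rendered as JSON-LD.
# Property names follow the public schema.org vocabulary; the values are placeholders.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Espresso Machine X100",
    "sku": "X100-BLK",
    "brand": {"@type": "Brand", "name": "ExampleBrand"},
    "offers": {
        "@type": "Offer",
        "price": "349.00",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
    "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.6", "reviewCount": "212"},
}

# Embed the output inside a <script type="application/ld+json"> tag on the product page.
print(json.dumps(product, indent=2))
```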
The brands that win in voice commerce will be the ones that make their product data as easy for agents to parse as it is for humans to read.
Layer 2: API Connectivity and Transactional Presence
If a voice agent can reason through a purchase decision but cannot complete the transaction because the brand has no accessible API, the agent will recommend a competitor that does. Priceline's integration shows the pattern: flights, hotels, car rentals, and restaurant reservations all connected through APIs that the agent can call mid-conversation.
Brands that rely exclusively on website-based transactions, with no API, no agent integration, and no programmatic access to their inventory, will find themselves recommended but not chosen. The agent will say, "Brand X looks great, but I can only book Brand Y directly. Want me to go with Y?"
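To show what transactional presence looks like from the agent's side, here is a minimal sketch of a booking tool definition in the JSON-schema style commonly used for function calling. The tool name, parameters, and the backend it would call are hypothetical; the point is that an agent can only act on what a brand exposes in machine-readable form.

```python
# A hypothetical booking tool a voice agent could call mid-conversation.
# The name, parameters, and backing endpoint are illustrative assumptions;
# the shape mirrors the common JSON-schema function-calling convention.
book_room_tool = {
    "type": "function",
    "name": "book_hotel_room",
    "description": "Book a room at a specific hotel for given dates and party size.",
    "parameters": {
        "type": "object",
        "properties": {
            "hotel_id": {"type": "string", "description": "Internal hotel identifier."},
            "check_in": {"type": "string", "description": "Check-in date, YYYY-MM-DD."},
            "check_out": {"type": "string", "description": "Check-out date, YYYY-MM-DD."},
            "guests": {"type": "integer", "description": "Number of guests."},
        },
        "required": ["hotel_id", "check_in", "check_out", "guests"],
    },
}
```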
Layer 3: Conversational Relevance and Context Awareness
GPT-Realtime-2's tone control, reasoning depth, and 128K context window mean that agents can maintain nuanced, context-aware conversations. Brands that provide rich, conversational content, not just product specs but use cases, comparisons, FAQs, and real customer scenarios, give agents more material to work with when evaluating options.
A brand that publishes detailed comparison guides, honest pros-and-cons, and scenario-based content is more likely to be cited accurately by a reasoning agent than a brand that relies on marketing-speak and vague value propositions. The agent is not fooled by superlatives. It reasons through claims.
The Competitive Clock Is Running
The early adopters are already building. Zillow, Priceline, and Deutsche Telekom are not testing GPT-Realtime-2 in a lab. They are integrating it into production products that will reach millions of consumers. Zillow's 26-point improvement in call success rate is not an abstract benchmark score. It is a measure of how much better voice agents are at completing real estate tasks compared to six months ago.
Meanwhile, consumer adoption is accelerating. NIQ reported that 42% of consumers now use AI to assist with shopping. Shopify reported 13x year-over-year growth in AI-driven orders. McKinsey projects agentic commerce could reach $1 trillion in the US by 2030. These numbers are not about voice specifically, but voice is the interface that makes agentic commerce accessible to the widest possible audience. Not everyone types queries into ChatGPT. Everyone speaks.

The agentic commerce readiness gap is real and growing. Brands that invested early in SEO had a multi-year head start when Google became the primary discovery channel. The same dynamic is about to play out with voice agents, but the window is shorter. The technology is moving from "impressive demo" to "production deployment" in months, not years.
Three Predictions for the Next 12 Months
First, voice-specific AI visibility tracking will emerge as a distinct discipline. Current tools measure text-based citations and recommendations. Within a year, brands will need dashboards that track how often they are mentioned, recommended, and chosen by voice agents across ChatGPT, Gemini, Siri, Alexa, and whatever Google launches next.
Second, the first major brand to build a direct GPT-Realtime-2 integration, where consumers can transact with the brand entirely through voice, will generate significant media coverage and consumer interest. The novelty effect is powerful, and the brands that move first will capture outsized attention.
Third, the gap between brands that optimize for voice agent visibility and those that do not will widen faster than the SEO gap did. The reason is structural: voice compresses the competitive landscape from a page of results to a single recommendation. In text-based search, being number five still gets some traffic. In voice, being number two might as well be invisible.
What Brands Should Do This Week
The technology is moving faster than most marketing strategies can absorb, but there are concrete steps that do not require a budget or a developer:
Audit your product data for agent readability. Can a machine parse your product pages without executing JavaScript? Are your prices, availability, features, and specifications in structured formats? If not, that is the highest-impact fix you can make right now.
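If you do have a few minutes of developer time, a rough version of that check takes one small script. The sketch below uses only the Python standard library and a placeholder URL; it is a coarse heuristic, not a full validator, and it only tells you whether a JSON-LD block is present in the raw HTML before any JavaScript runs.

```python
# Quick, rough check: does a product page expose JSON-LD without running JavaScript?
# Replace the URL with one of your own product pages. This only looks for an
# application/ld+json block in the raw HTML; it does not validate the markup.
import urllib.request

url = "https://www.example.com/products/x100"  # placeholder URL
html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")

if "application/ld+json" in html:
    print("Found JSON-LD in the raw HTML: agents can likely parse this page.")
else:
    print("No JSON-LD in the raw HTML: structured data may be missing or rendered by JavaScript.")
```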
Map your voice customer journey. Walk through the scenario where a consumer asks a voice agent for a recommendation in your category. What does the agent say? Which brands does it recommend? What data does it use to make that decision? This exercise reveals your actual competitive position in voice, not your imagined one.
Start tracking voice mentions. Even without specialized tools, you can query voice-enabled AI assistants with category-relevant questions and record which brands get recommended. Do this weekly. The data will accumulate quickly.
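A rough text-based proxy is better than no tracking at all. The sketch below assumes the openai Python SDK and an API key in your environment; the model name, questions, and brand list are placeholders to adapt to your category, and the results should be read as directional rather than definitive.

```python
# A rough, text-based proxy for weekly brand-mention tracking.
# Assumes the `openai` Python SDK and an API key in the environment;
# the model name, questions, and brand list are placeholders.
from datetime import date
from openai import OpenAI

client = OpenAI()

questions = [
    "What's the best espresso machine under $400?",   # replace with your category questions
    "Which espresso machine brand is most reliable?",
]
brands_to_watch = ["ExampleBrand", "CompetitorA", "CompetitorB"]

for q in questions:
    answer = client.chat.completions.create(
        model="gpt-4o",  # placeholder; swap for the assistant you want to test
        messages=[{"role": "user", "content": q}],
    ).choices[0].message.content
    mentioned = [b for b in brands_to_watch if b.lower() in answer.lower()]
    print(f"{date.today()} | {q} | mentioned: {mentioned or 'none'}")
```

Run it on the same day each week and keep the output; a few months of logs is enough to see which brands the models default to in your category.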
Prepare for API-based transactions. If you sell anything that could be purchased through a voice agent, start thinking about how an agent would access your inventory and complete a transaction. This does not require building a full API this week, but it does require understanding the gap between your current transaction infrastructure and what voice agents will need.
The shift from search to ask was the first wave. The shift from text-based asking to voice-based reasoning is the second. GPT-Realtime-2 is not the end point of this evolution. It is the moment the trajectory became irreversible.
Find out where your brand stands. Get a comprehensive AI visibility audit at audit.searchless.ai and see how you appear across ChatGPT, Gemini, Perplexity, and Claude, before voice agents start answering the same questions your customers are typing today.
Sources
- OpenAI. "Advancing voice intelligence with new models in the API." OpenAI Blog, May 7, 2026. https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/
- MarkTechPost. "OpenAI Releases Three Realtime Audio Models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper in the Realtime API." May 8, 2026. https://www.marktechpost.com/2026/05/08/openai-releases-three-realtime-audio-models-gpt-realtime-2-gpt-realtime-translate-and-gpt-realtime-whisper-in-the-realtime-api/
- 9to5Mac. "OpenAI has new voice models that reason, translate, and transcribe as you speak." May 7, 2026. https://9to5mac.com/2026/05/07/openai-has-new-voice-models-that-reason-translate-and-transcribe-as-you-speak/
- McKinsey & Company. "The Automation Curve in Agentic Commerce." 2026.
- BuildFastWithAI. "GPT-Realtime-2: OpenAI Voice AI Models 2026." May 2026. https://www.buildfastwithai.com/blogs/openai-gpt-realtime-2-voice-ai-models
- Artificial Analysis. "Big Bench Audio Benchmark." https://artificialanalysis.ai/methodology/speech-to-speech-benchmarking
- Scale AI. "Audio MultiChallenge Leaderboard." https://labs.scale.com/leaderboard/audiomc-audio