Learn how to build voice agents that can understand audio and respond in natural language.
There are two architectures for building voice agents:

- Speech-to-speech (multimodal)
- Chained (speech-to-text → LLM → text-to-speech)
The speech-to-speech architecture handles audio end to end with a single multimodal model, `gpt-4o-realtime-preview`. The model thinks and responds in speech. It doesn't rely on a transcript of the user's input: it hears emotion and intent, filters out noise, and responds directly in speech. Use this approach for highly interactive, low-latency, conversational use cases.
| Strengths | Best for |
|---|---|
| Low latency interactions | Interactive and unstructured conversations |
| Rich multimodal understanding (audio and text simultaneously) | Language tutoring and interactive learning experiences |
| Natural, fluid conversational flow | Conversational search and discovery |
| Enhanced user experience through vocal context understanding | Interactive customer service scenarios |
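To make this concrete, here is a minimal speech-to-speech sketch using the beta Realtime client in a recent `openai` Python release. The model name comes from this guide; the text-only modality (which keeps the example terminal-friendly) and the prompt are illustrative, and a real voice agent would keep audio enabled.

```python
import asyncio

from openai import AsyncOpenAI


async def main() -> None:
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    # One persistent realtime session; the model consumes and produces
    # speech natively, with no intermediate transcription step.
    async with client.beta.realtime.connect(
        model="gpt-4o-realtime-preview"
    ) as connection:
        # Restrict output to text so this sketch can print to a terminal.
        await connection.session.update(session={"modalities": ["text"]})

        await connection.conversation.item.create(
            item={
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Say hello!"}],
            }
        )
        await connection.response.create()

        async for event in connection:
            if event.type == "response.text.delta":
                print(event.delta, end="", flush=True)
            elif event.type == "response.done":
                break


asyncio.run(main())
```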
The chained architecture runs audio through a pipeline of `gpt-4o-transcribe` → `gpt-4o` → `gpt-4o-mini-tts`: speech is transcribed to text, a text model generates the response, and the response is synthesized back into speech, as the sketch after the table shows. Because every stage produces text, this approach gives you transcripts and tight control over each step.
| Strengths | Best for |
|---|---|
| High control and transparency | Structured workflows focused on specific user objectives |
| Robust function calling and structured interactions | Customer support |
| Reliable, predictable responses | Sales and inbound triage |
| Support for extended conversational context | Scenarios that involve transcripts and scripted responses |
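As a sketch of the chained flow, the snippet below wires the three models together with the OpenAI Python SDK. The file names, system prompt, and voice are illustrative, and error handling is omitted.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Speech-to-text: transcribe the caller's audio (hypothetical file).
with open("caller.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

# 2) LLM: generate the agent's reply from the transcript text.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise support agent."},
        {"role": "user", "content": transcript.text},
    ],
)

# 3) Text-to-speech: synthesize the reply as audio.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",  # illustrative voice choice
    input=reply.choices[0].message.content,
)
speech.write_to_file("reply.mp3")
```

Because each stage hands off plain text, you can log the transcript, run moderation over it, or swap out any one model without touching the other two.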
Building a speech-to-speech agent with the Realtime API involves:

- Establishing a connection for realtime data transfer
- Creating a realtime session with the Realtime API
- Using an OpenAI model with realtime audio input and output capabilities: `gpt-4o-realtime-preview` or `gpt-4o-mini-realtime-preview`
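The sketch below walks through those steps over a raw WebSocket. The wire protocol (the URL, the `OpenAI-Beta: realtime=v1` header, and event types such as `session.update` and `response.create`) is the Realtime API's WebSocket interface; the `websockets` package, voice, and instructions are illustrative choices.

```python
import asyncio
import json
import os

import websockets  # pip install websockets


async def main() -> None:
    # Step 1: establish a connection for realtime data transfer.
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: websockets releases before v14 spell this kwarg extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Step 2: configure the session created for this connection.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"modalities": ["audio", "text"], "voice": "alloy"},
        }))

        # Step 3: ask the model to respond with realtime audio and text.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"instructions": "Greet the user and offer help."},
        }))

        async for message in ws:
            event = json.loads(message)
            print(event["type"])  # e.g. session.updated, response.audio.delta
            if event["type"] == "response.done":
                break


asyncio.run(main())
```

A production agent would also stream microphone audio into the session with `input_audio_buffer.append` events and play back the `response.audio.delta` chunks it receives.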