Learn about automatic voice activity detection in the Realtime API.
When VAD is enabled, the Realtime API sends events to indicate when the user has started or stopped speaking:

- `input_audio_buffer.speech_started`: The start of a speech turn
- `input_audio_buffer.speech_stopped`: The end of a speech turn
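You can use these events to handle speech turns in your application. As a minimal sketch, assuming `ws` is an already-open WebSocket connection to a Realtime session (declared here only so the snippet type-checks):

```typescript
// Sketch: reacting to server-side VAD events.
// Assumes `ws` is an open WebSocket to a Realtime session.
declare const ws: WebSocket;

ws.addEventListener("message", (msg) => {
  const event = JSON.parse(msg.data as string);
  switch (event.type) {
    case "input_audio_buffer.speech_started":
      console.log("User started speaking");
      break;
    case "input_audio_buffer.speech_stopped":
      console.log("User stopped speaking");
      break;
  }
});
```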
You can use the `turn_detection` property of the `session.update` event to configure how audio is chunked within each speech-to-text sample.
There are two modes for VAD:
- `server_vad`: Automatically chunks the audio based on periods of silence.
- `semantic_vad`: Chunks the audio when the model believes, based on the words said by the user, that they have completed their utterance.

The default mode is `server_vad`.
Read below to learn more about the different modes.
For `server_vad`, you can adjust the following properties to fine-tune turn detection:

- `threshold`: Activation threshold (0 to 1). A higher threshold requires louder audio to activate the model, and thus might perform better in noisy environments.
- `prefix_padding_ms`: Amount of audio (in milliseconds) to include before the VAD-detected speech.
- `silence_duration_ms`: Duration of silence (in milliseconds) used to detect the end of speech. With shorter values, turns are detected more quickly.
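For example, a sketch of a `session.update` event tuning these settings, again assuming an open WebSocket connection `ws` (the numeric values are illustrative starting points, not recommendations):

```typescript
// Sketch: fine-tuning server_vad via a session.update event.
// Assumes `ws` is an open WebSocket to a Realtime session.
declare const ws: WebSocket;

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: {
      type: "server_vad",
      threshold: 0.6,           // higher = louder audio needed (noisy rooms)
      prefix_padding_ms: 300,   // audio kept from before speech was detected
      silence_duration_ms: 500, // shorter = end of turn detected sooner
    },
  },
}));
```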
You can activate semantic VAD by setting `turn_detection.type` to `semantic_vad` in a `session.update` event.
It can be configured like this:
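As a sketch, assuming the same kind of open WebSocket connection `ws`:

```typescript
// Sketch: enabling semantic_vad. `eagerness` is optional and takes
// "low" | "medium" | "high" | "auto" (explained below).
declare const ws: WebSocket;

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: {
      type: "semantic_vad",
      eagerness: "auto",
    },
  },
}));
```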
The `eagerness` property controls how eager the model is to interrupt the user, tuning the maximum wait timeout. In transcription mode, even if the model doesn't reply, it affects how the audio is chunked. It accepts the following values:
- `auto` is the default value, and is equivalent to `medium`.
- `low` will let the user take their time to speak.
- `high` will chunk the audio as soon as possible.

If you want the model to respond more often in conversation mode, or to return transcription events faster in transcription mode, you can set `eagerness` to `high`.
On the other hand, if you want to let the user speak uninterrupted in conversation mode, or if you would like larger transcript chunks in transcription mode, you can set `eagerness` to `low`.
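Switching between these behaviors at runtime is just another `session.update`. A sketch, where the `setEagerness` helper is hypothetical and `ws` is again an assumed open connection:

```typescript
// Sketch: a hypothetical helper for adjusting eagerness at runtime.
declare const ws: WebSocket;

function setEagerness(eagerness: "low" | "medium" | "high" | "auto") {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      turn_detection: { type: "semantic_vad", eagerness },
    },
  }));
}

setEagerness("high"); // respond / emit transcripts as soon as possible
setEagerness("low");  // let the user speak uninterrupted; larger chunks
```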