POST /realtime/transcription_sessions
Create transcription session
curl --request POST \
  --url https://api.openai.com/v1/realtime/transcription_sessions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "modalities": [
    [
      "text",
      "audio"
    ]
  ],
  "input_audio_format": "pcm16",
  "input_audio_transcription": {
    "model": "gpt-4o-transcribe",
    "language": "<string>",
    "prompt": "<string>"
  },
  "turn_detection": {
    "type": "server_vad",
    "eagerness": "auto",
    "threshold": 123,
    "prefix_padding_ms": 123,
    "silence_duration_ms": 123,
    "create_response": true,
    "interrupt_response": true
  },
  "input_audio_noise_reduction": null,
  "include": [
    "<string>"
  ]
}'
{
  "client_secret": {
    "value": "<string>",
    "expires_at": 123
  },
  "modalities": [
    "text"
  ],
  "input_audio_format": "<string>",
  "input_audio_transcription": {
    "model": "gpt-4o-transcribe",
    "language": "<string>",
    "prompt": "<string>"
  },
  "turn_detection": {
    "type": "<string>",
    "threshold": 123,
    "prefix_padding_ms": 123,
    "silence_duration_ms": 123
  }
}

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json

Create an ephemeral API key with the given session configuration.

Realtime transcription session object configuration.

modalities
enum<string>[]

The set of modalities the model can respond with. To disable audio, set this to ["text"].

input_audio_format
enum<string>
default:pcm16

The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw. For pcm16, input audio must be 16-bit PCM at a 24kHz sample rate, single channel (mono), and little-endian byte order.

Available options:
pcm16,
g711_ulaw,
g711_alaw
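If you are starting from an existing audio file, one way to produce audio in the pcm16 layout is ffmpeg. The sketch below is illustrative only: it assumes a source file named input.wav and writes raw 16-bit little-endian PCM at 24 kHz, mono, matching the format described above.

# Assumed source file name; output is raw 16-bit little-endian PCM, 24 kHz, mono
ffmpeg -i input.wav -f s16le -ar 24000 -ac 1 input_24k_mono.pcm
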
input_audio_transcription
object

Configuration for input audio transcription. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
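
For example, a session transcribing English speech with gpt-4o-transcribe might configure this object as follows; the language code and prompt text are illustrative, not required values.

"input_audio_transcription": {
  "model": "gpt-4o-transcribe",
  "language": "en",
  "prompt": "Expect technical vocabulary about audio streaming."
}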

turn_detection
object

Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn detection off, in which case the client must manually trigger a model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech. Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if the user's audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may come with higher latency.
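
As an illustration, the two modes could be configured as follows. The numeric Server VAD values are common starting points rather than required settings, and the Semantic VAD example only sets its type and eagerness.

Server VAD (volume-based detection):

"turn_detection": {
  "type": "server_vad",
  "threshold": 0.5,
  "prefix_padding_ms": 300,
  "silence_duration_ms": 500
}

Semantic VAD (model-estimated end of turn):

"turn_detection": {
  "type": "semantic_vad",
  "eagerness": "auto"
}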

input_audio_noise_reduction
object

Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
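
For example, to enable noise reduction tuned for a close-talking microphone, the object can be set to a single type field. The near_field value here reflects the common near-field/far-field distinction and should be checked against the accepted values for this API.

"input_audio_noise_reduction": {
  "type": "near_field"
}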

include
string[]

The set of items to include in the transcription. Currently available items are:

  • item.input_audio_transcription.logprobs

Response

200 - application/json

Session created successfully.

A new Realtime transcription session configuration.

When a session is created on the server via REST API, the session object also contains an ephemeral key. Default TTL for keys is one minute. This property is not present when a session is updated via the WebSocket API.

client_secret
object
required

Ephemeral key returned by the API. Only present when the session is created on the server via REST API.
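
A typical flow is to mint the session from a backend that holds the standard API key, extract client_secret.value, and hand only that ephemeral key to the browser or mobile client as its Bearer credential. The sketch below assumes the standard key is available in the OPENAI_API_KEY environment variable and uses jq purely for illustration.

# Mint a transcription session server-side and pull out the ephemeral key
EPHEMERAL_KEY=$(curl -s --request POST \
  --url https://api.openai.com/v1/realtime/transcription_sessions \
  --header "Authorization: Bearer $OPENAI_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{"input_audio_format": "pcm16"}' \
  | jq -r '.client_secret.value')

# Pass this value to the client; it expires quickly (see expires_at, default TTL is one minute)
echo "$EPHEMERAL_KEY"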

modalities
enum<string>[]

The set of modalities the model can respond with. To disable audio, set this to ["text"].

input_audio_format
string

The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw.

input_audio_transcription
object

Configuration of the transcription model.

turn_detection
object

Configuration for turn detection. Can be set to null to turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.