# Create session

> Create an ephemeral API token for use in client-side applications with the
Realtime API. Can be configured with the same session parameters as the
`session.update` client event.

It responds with a session object, plus a `client_secret` key which contains
a usable ephemeral API token that can be used to authenticate browser clients
for the Realtime API.
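
As a rough illustration, a server could create a session and pass the ephemeral token to a browser along the following lines. This is a minimal sketch using only the Python standard library: the URL, bearer authentication, and field names are taken from the OpenAPI definition on this page, while the helper names `build_session_request`, `create_session`, and `ephemeral_token` are illustrative assumptions and not part of any official SDK.

```python
import json
import urllib.request

# Endpoint from the spec: the server URL plus the /realtime/sessions path.
SESSIONS_URL = "https://api.openai.com/v1/realtime/sessions"


def build_session_request(api_key: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a session.

    `api_key` is a standard API key and must stay on the server.
    `config` holds session parameters such as `model` or `voice`.
    """
    return urllib.request.Request(
        SESSIONS_URL,
        data=json.dumps(config).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_session(api_key: str, config: dict) -> dict:
    """Send the request and return the parsed session object."""
    with urllib.request.urlopen(build_session_request(api_key, config)) as resp:
        return json.load(resp)


def ephemeral_token(session: dict) -> str:
    """Extract the ephemeral key intended for client-side use."""
    return session["client_secret"]["value"]
```

A server route would typically call `create_session(...)` with its own API key and return only `ephemeral_token(session)` to the browser, so the standard key never leaves the server.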




## OpenAPI

````yaml api-definition.yaml post /realtime/sessions
openapi: 3.0.0
info:
  title: OpenAI API
  description: >-
    The OpenAI REST API. Please see
    https://platform.openai.com/docs/api-reference for more details.
  version: 2.3.0
  termsOfService: https://openai.com/policies/terms-of-use
  contact:
    name: OpenAI Support
    url: https://help.openai.com/
  license:
    name: MIT
    url: https://github.com/openai/openai-openapi/blob/master/LICENSE
servers:
  - url: https://api.openai.com/v1
security:
  - ApiKeyAuth: []
tags:
  - name: Assistants
    description: Build Assistants that can call models and use tools.
  - name: Audio
    description: Turn audio into text or text into audio.
  - name: Chat
    description: >-
      Given a list of messages comprising a conversation, the model will return
      a response.
  - name: Completions
    description: >-
      Given a prompt, the model will return one or more predicted completions,
      and can also return the probabilities of alternative tokens at each
      position.
  - name: Embeddings
    description: >-
      Get a vector representation of a given input that can be easily consumed
      by machine learning models and algorithms.
  - name: Evals
    description: Manage and run evals in the OpenAI platform.
  - name: Fine-tuning
    description: Manage fine-tuning jobs to tailor a model to your specific training data.
  - name: Batch
    description: Create large batches of API requests to run asynchronously.
  - name: Files
    description: >-
      Files are used to upload documents that can be used with features like
      Assistants and Fine-tuning.
  - name: Uploads
    description: Use Uploads to upload large files in multiple parts.
  - name: Images
    description: Given a prompt and/or an input image, the model will generate a new image.
  - name: Models
    description: List and describe the various models available in the API.
  - name: Moderations
    description: >-
      Given text and/or image inputs, classifies if those inputs are potentially
      harmful.
  - name: Audit Logs
    description: List user actions and configuration changes within this organization.
paths:
  /realtime/sessions:
    post:
      tags:
        - Realtime
      summary: Create session
      description: >
        Create an ephemeral API token for use in client-side applications with
        the Realtime API. Can be configured with the same session parameters as
        the `session.update` client event.

        It responds with a session object, plus a `client_secret` key which
        contains a usable ephemeral API token that can be used to authenticate
        browser clients for the Realtime API.
      operationId: create-realtime-session
      requestBody:
        description: Create an ephemeral API key with the given session configuration.
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/RealtimeSessionCreateRequest'
      responses:
        '200':
          description: Session created successfully.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/RealtimeSessionCreateResponse'
components:
  schemas:
    RealtimeSessionCreateRequest:
      type: object
      description: Realtime session object configuration.
      properties:
        modalities:
          description: |
            The set of modalities the model can respond with. To disable audio,
            set this to ["text"].
          items:
            type: string
            default:
              - text
              - audio
            enum:
              - text
              - audio
        model:
          type: string
          description: |
            The Realtime model used for this session.
          enum:
            - gpt-4o-realtime-preview
            - gpt-4o-realtime-preview-2024-10-01
            - gpt-4o-realtime-preview-2024-12-17
            - gpt-4o-mini-realtime-preview
            - gpt-4o-mini-realtime-preview-2024-12-17
        instructions:
          type: string
          description: >
            The default system instructions (i.e. system message) prepended to
            model calls. This field allows the client to guide the model on
            desired responses. The model can be instructed on response content
            and format (e.g. "be extremely succinct", "act friendly", "here
            are examples of good responses") and on audio behavior (e.g. "talk
            quickly", "inject emotion into your voice", "laugh frequently").
            The instructions are not guaranteed to be followed by the model,
            but they provide guidance to the model on the desired behavior.

            Note that the server sets default instructions which will be used if
            this field is not set and are visible in the `session.created`
            event at the start of the session.
        voice:
          $ref: '#/components/schemas/VoiceIdsShared'
          description: >
            The voice the model uses to respond. Voice cannot be changed during
            the session once the model has responded with audio at least once.
            Current voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`,
            `fable`, `onyx`, `nova`, `sage`, `shimmer`, and `verse`.
        input_audio_format:
          type: string
          default: pcm16
          enum:
            - pcm16
            - g711_ulaw
            - g711_alaw
          description: >
            The format of input audio. Options are `pcm16`, `g711_ulaw`, or
            `g711_alaw`.

            For `pcm16`, input audio must be 16-bit PCM at a 24kHz sample rate,
            single channel (mono), and little-endian byte order.
        output_audio_format:
          type: string
          default: pcm16
          enum:
            - pcm16
            - g711_ulaw
            - g711_alaw
          description: >
            The format of output audio. Options are `pcm16`, `g711_ulaw`, or
            `g711_alaw`.

            For `pcm16`, output audio is sampled at a rate of 24kHz.
        input_audio_transcription:
          type: object
          description: >
            Configuration for input audio transcription. Defaults to off, and
            can be set to `null` to turn it off once enabled. Input audio
            transcription is not native to the model, since the model consumes
            audio directly. Transcription runs asynchronously through [the
            /audio/transcriptions
            endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription)
            and should be treated as guidance on the input audio content rather
            than precisely what the model heard. The client can optionally set
            the language and prompt for transcription; these offer additional
            guidance to the transcription service.
          properties:
            model:
              type: string
              description: >
                The model to use for transcription. Current options are
                `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `whisper-1`.
            language:
              type: string
              description: >
                The language of the input audio. Supplying the input language in
                [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
                (e.g. `en`) format will improve accuracy and latency.
            prompt:
              type: string
              description: >
                An optional text to guide the model's style or continue a
                previous audio segment.

                For `whisper-1`, the [prompt is a list of
                keywords](/docs/guides/speech-to-text#prompting).

                For `gpt-4o-transcribe` models, the prompt is a free text
                string, for example "expect words related to technology".
        turn_detection:
          type: object
          description: >
            Configuration for turn detection, either Server VAD or Semantic VAD.
            This can be set to `null` to turn off, in which case the client must
            manually trigger model response.

            Server VAD means that the model will detect the start and end of
            speech based on audio volume and respond at the end of user speech.

            Semantic VAD is more advanced and uses a turn detection model (in
            conjunction with VAD) to semantically estimate whether the user has
            finished speaking, then dynamically sets a timeout based on this
            probability. For example, if user audio trails off with "uhhm", the
            model will score a low probability of turn end and wait longer for
            the user to continue speaking. This can be useful for more natural
            conversations, but may have a higher latency.
          properties:
            type:
              type: string
              default: server_vad
              enum:
                - server_vad
                - semantic_vad
              description: |
                Type of turn detection.
            eagerness:
              type: string
              default: auto
              enum:
                - low
                - medium
                - high
                - auto
              description: >
                Used only for `semantic_vad` mode. The eagerness of the model to
                respond. `low` will wait longer for the user to continue
                speaking, `high` will respond more quickly. `auto` is the
                default and is equivalent to `medium`.
            threshold:
              type: number
              description: >
                Used only for `server_vad` mode. Activation threshold for VAD
                (0.0 to 1.0); this defaults to 0.5. A higher threshold will
                require louder audio to activate the model, and thus might
                perform better in noisy environments.
            prefix_padding_ms:
              type: integer
              description: >
                Used only for `server_vad` mode. Amount of audio to include
                before the VAD detected speech (in milliseconds). Defaults to
                300ms.
            silence_duration_ms:
              type: integer
              description: >
                Used only for `server_vad` mode. Duration of silence to detect
                speech stop (in milliseconds). Defaults to 500ms. With shorter
                values the model will respond more quickly, but may jump in on
                short pauses from the user.
            create_response:
              type: boolean
              default: true
              description: >
                Whether or not to automatically generate a response when a VAD
                stop event occurs.
            interrupt_response:
              type: boolean
              default: true
              description: >
                Whether or not to automatically interrupt any ongoing response
                with output to the default conversation (i.e. `conversation` of
                `auto`) when a VAD start event occurs.
        input_audio_noise_reduction:
          type: object
          default: null
          description: >
            Configuration for input audio noise reduction. This can be set to
            `null` to turn off.

            Noise reduction filters audio added to the input audio buffer before
            it is sent to VAD and the model.

            Filtering the audio can improve VAD and turn detection accuracy
            (reducing false positives) and model performance by improving
            perception of the input audio.
          properties:
            type:
              type: string
              enum:
                - near_field
                - far_field
              description: >
                Type of noise reduction. `near_field` is for close-talking
                microphones such as headphones, `far_field` is for far-field
                microphones such as laptop or conference room microphones.
        tools:
          type: array
          description: Tools (functions) available to the model.
          items:
            type: object
            properties:
              type:
                type: string
                enum:
                  - function
                description: The type of the tool, i.e. `function`.
                x-stainless-const: true
              name:
                type: string
                description: The name of the function.
              description:
                type: string
                description: >
                  The description of the function, including guidance on when
                  and how to call it, and guidance about what to tell the user
                  when calling (if anything).
              parameters:
                type: object
                description: Parameters of the function in JSON Schema.
        tool_choice:
          type: string
          default: auto
          description: >
            How the model chooses tools. Options are `auto`, `none`, `required`,
            or specify a function.
        temperature:
          type: number
          default: 0.8
          description: >
            Sampling temperature for the model, limited to [0.6, 1.2]. For audio
            models a temperature of 0.8 is highly recommended for best
            performance.
        max_response_output_tokens:
          oneOf:
            - type: integer
            - type: string
              enum:
                - inf
              x-stainless-const: true
          description: |
            Maximum number of output tokens for a single assistant response,
            inclusive of tool calls. Provide an integer between 1 and 4096 to
            limit output tokens, or `inf` for the maximum available tokens for a
            given model. Defaults to `inf`.
    RealtimeSessionCreateResponse:
      type: object
      description: >
        A new Realtime session configuration, with an ephemeral key. Default
        TTL for keys is one minute.
      properties:
        client_secret:
          type: object
          description: Ephemeral key returned by the API.
          properties:
            value:
              type: string
              description: >
                Ephemeral key usable in client environments to authenticate
                connections to the Realtime API. Use this in client-side
                environments rather than a standard API token, which should
                only be used server-side.
            expires_at:
              type: integer
              description: >
                Timestamp for when the token expires. Currently, all tokens
                expire after one minute.
          required:
            - value
            - expires_at
        modalities:
          description: |
            The set of modalities the model can respond with. To disable audio,
            set this to ["text"].
          items:
            type: string
            enum:
              - text
              - audio
        instructions:
          type: string
          description: >
            The default system instructions (i.e. system message) prepended to
            model calls. This field allows the client to guide the model on
            desired responses. The model can be instructed on response content
            and format (e.g. "be extremely succinct", "act friendly", "here
            are examples of good responses") and on audio behavior (e.g. "talk
            quickly", "inject emotion into your voice", "laugh frequently").
            The instructions are not guaranteed to be followed by the model,
            but they provide guidance to the model on the desired behavior.

            Note that the server sets default instructions which will be used if
            this field is not set and are visible in the `session.created`
            event at the start of the session.
        voice:
          $ref: '#/components/schemas/VoiceIdsShared'
          description: >
            The voice the model uses to respond. Voice cannot be changed during
            the session once the model has responded with audio at least once.
            Current voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`,
            `sage`, `shimmer`, and `verse`.
        input_audio_format:
          type: string
          description: >
            The format of input audio. Options are `pcm16`, `g711_ulaw`, or
            `g711_alaw`.
        output_audio_format:
          type: string
          description: >
            The format of output audio. Options are `pcm16`, `g711_ulaw`, or
            `g711_alaw`.
        input_audio_transcription:
          type: object
          description: >
            Configuration for input audio transcription. Defaults to off, and
            can be set to `null` to turn it off once enabled. Input audio
            transcription is not native to the model, since the model consumes
            audio directly. Transcription runs asynchronously through Whisper
            and should be treated as rough guidance rather than the
            representation understood by the model.
          properties:
            model:
              type: string
              description: >
                The model to use for transcription; `whisper-1` is the only
                currently supported model.
        turn_detection:
          type: object
          description: >
            Configuration for turn detection. Can be set to `null` to turn off.
            Server VAD means that the model will detect the start and end of
            speech based on audio volume and respond at the end of user speech.
          properties:
            type:
              type: string
              description: >
                Type of turn detection, only `server_vad` is currently
                supported.
            threshold:
              type: number
              description: >
                Activation threshold for VAD (0.0 to 1.0); this defaults to 0.5.
                A higher threshold will require louder audio to activate the
                model, and thus might perform better in noisy environments.
            prefix_padding_ms:
              type: integer
              description: |
                Amount of audio to include before the VAD detected speech (in 
                milliseconds). Defaults to 300ms.
            silence_duration_ms:
              type: integer
              description: >
                Duration of silence to detect speech stop (in milliseconds).
                Defaults to 500ms. With shorter values the model will respond
                more quickly, but may jump in on short pauses from the user.
        tools:
          type: array
          description: Tools (functions) available to the model.
          items:
            type: object
            properties:
              type:
                type: string
                enum:
                  - function
                description: The type of the tool, i.e. `function`.
                x-stainless-const: true
              name:
                type: string
                description: The name of the function.
              description:
                type: string
                description: >
                  The description of the function, including guidance on when
                  and how to call it, and guidance about what to tell the user
                  when calling (if anything).
              parameters:
                type: object
                description: Parameters of the function in JSON Schema.
        tool_choice:
          type: string
          description: >
            How the model chooses tools. Options are `auto`, `none`, `required`,
            or specify a function.
        temperature:
          type: number
          description: >
            Sampling temperature for the model, limited to [0.6, 1.2]. Defaults
            to 0.8.
        max_response_output_tokens:
          oneOf:
            - type: integer
            - type: string
              enum:
                - inf
              x-stainless-const: true
          description: |
            Maximum number of output tokens for a single assistant response,
            inclusive of tool calls. Provide an integer between 1 and 4096 to
            limit output tokens, or `inf` for the maximum available tokens for a
            given model. Defaults to `inf`.
      required:
        - client_secret
      x-oaiMeta:
        name: The session object
        group: realtime
        example: |
          {
            "id": "sess_001",
            "object": "realtime.session",
            "model": "gpt-4o-realtime-preview",
            "modalities": ["audio", "text"],
            "instructions": "You are a friendly assistant.",
            "voice": "alloy",
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "input_audio_transcription": {
              "model": "whisper-1"
            },
            "turn_detection": null,
            "tools": [],
            "tool_choice": "none",
            "temperature": 0.7,
            "max_response_output_tokens": 200,
            "client_secret": {
              "value": "ek_abc123", 
              "expires_at": 1234567890
            }
          }
    VoiceIdsShared:
      example: ash
      anyOf:
        - type: string
        - type: string
          enum:
            - alloy
            - ash
            - ballad
            - coral
            - echo
            - fable
            - onyx
            - nova
            - sage
            - shimmer
            - verse
  securitySchemes:
    ApiKeyAuth:
      type: http
      scheme: bearer

````
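
Several constraints stated in the schema above (temperature range, output-token limit, audio formats, modalities) can be checked locally before a request is sent. The following is only a sketch: `validate_session_config` and `seconds_until_expiry` are hypothetical helper names, and treating `expires_at` as a Unix timestamp in seconds is an assumption that the schema does not state explicitly.

```python
import time
from typing import List, Optional

# Allowed values copied from the enums in the request schema.
VALID_AUDIO_FORMATS = {"pcm16", "g711_ulaw", "g711_alaw"}
VALID_MODALITIES = {"text", "audio"}


def validate_session_config(config: dict) -> List[str]:
    """Return a list of problems; an empty list means none were detected."""
    problems = []

    # Temperature is documented as limited to [0.6, 1.2].
    temperature = config.get("temperature")
    if temperature is not None and not 0.6 <= temperature <= 1.2:
        problems.append("temperature must be within [0.6, 1.2]")

    # Either the string "inf" or an integer between 1 and 4096.
    max_tokens = config.get("max_response_output_tokens")
    if max_tokens is not None and max_tokens != "inf":
        is_int = isinstance(max_tokens, int) and not isinstance(max_tokens, bool)
        if not is_int or not 1 <= max_tokens <= 4096:
            problems.append(
                "max_response_output_tokens must be 'inf' or an integer in 1..4096"
            )

    for key in ("input_audio_format", "output_audio_format"):
        fmt = config.get(key)
        if fmt is not None and fmt not in VALID_AUDIO_FORMATS:
            problems.append(f"{key} must be one of {sorted(VALID_AUDIO_FORMATS)}")

    unknown = set(config.get("modalities", [])) - VALID_MODALITIES
    if unknown:
        problems.append(f"unknown modalities: {sorted(unknown)}")

    return problems


def seconds_until_expiry(session: dict, now: Optional[float] = None) -> float:
    """Seconds left before the ephemeral key expires (negative if expired).

    Assumes `expires_at` is a Unix timestamp in seconds.
    """
    current = time.time() if now is None else now
    return session["client_secret"]["expires_at"] - current
```

Because the documented key lifetime is one minute, a client would likely want to open its Realtime connection soon after the session is created rather than caching the token.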