How to Use the OpenAI Realtime API in Python: WebSocket Tutorial for Voice and Text Apps

Learn how to use the OpenAI Realtime API in Python with WebSocket. Send text and audio events, stream responses, mint ephemeral browser tokens, and choose WebSocket or WebRTC.

If your Python app only needs one prompt in and one answer out, the Responses API is often the simpler place to start. The Realtime API is for a different kind of product. It keeps a live session open so your app can stream text or audio, react to partial output, and update the session while the conversation is still moving.

That makes it useful for voice assistants, live support tools, phone agents, and backends that should stay attached to a conversation instead of rebuilding context on every turn. OpenAI’s current gpt-realtime model is a general-availability realtime model with text and audio output, plus text, audio, and image input.

This tutorial shows how to connect from Python over WebSocket, send text and audio events, add server-side tools, mint ephemeral keys for browser clients, and decide when WebRTC is the better transport.

What you’ll learn:

  • When a Python backend should use WebSocket instead of WebRTC
  • How to open a Realtime session and stream text output
  • How to send audio chunks over the socket
  • How to keep tools and private business logic on the server
  • How to mint short-lived client secrets for browser apps

Time required: 40-55 minutes
Difficulty level: Intermediate

Prerequisites

Before you start, make sure you have:

  • Python 3.11 or newer
  • An OpenAI API key stored on your backend
  • Basic comfort with JSON events and HTTP APIs
  • A .wav sample file if you want to test audio input

Tools used in this guide:

  • websocket-client for the live socket connection
  • python-dotenv for local environment variables
  • numpy and soundfile for audio encoding
  • httpx and FastAPI for the token-minting example

If you want a cleaner Python setup, pair this tutorial with our uv guide.

[Illustration: a Python app choosing between WebRTC in the browser and WebSocket on the server for the OpenAI Realtime API]

Step 1: Choose the Right Transport First

The OpenAI Realtime API supports three connection styles: WebRTC, WebSocket, and SIP. The official docs make the split pretty clear.

Use WebSocket when your Python service is connecting from a secure backend. This is the recommended path for server-to-server applications, because your standard API key stays on infrastructure you control. It is also the lowest-level interface, which means you get full control over JSON events, audio buffering, logging, and tool execution.

Use WebRTC when the user-facing client is a browser or mobile app. OpenAI recommends WebRTC over WebSockets for those clients because media handling is more reliable under real network conditions. In plain terms, if the microphone and speaker live in the browser, let WebRTC carry that media.

Here is the practical version:

  • Python backend, worker, or internal service → WebSocket. Standard API key stays server-side and event handling is simple.
  • Browser voice UI → WebRTC. Better media transport and client-side device handling.
  • Browser voice UI with private tools → WebRTC + sideband WebSocket. The browser handles media, your server handles tools and control.
  • Telephony → SIP or SIP + sideband WebSocket. Designed for call workflows.

Two more limits are worth knowing before you write code:

  • A Realtime session can last up to 60 minutes.
  • gpt-realtime currently offers a 32,000-token context window and up to 4,096 output tokens.

That is enough for a long voice interaction, but it is not infinite. You still need to think about how long you keep a session alive.

Step 2: Install the Python Packages

For the text, audio, and browser-token examples in this article, start with these dependencies:

uv init realtime-python-demo
cd realtime-python-demo

uv add websocket-client python-dotenv numpy soundfile httpx fastapi uvicorn

If you prefer pip, this is the equivalent:

python -m venv .venv
source .venv/bin/activate
pip install websocket-client python-dotenv numpy soundfile httpx fastapi uvicorn

Create a local .env file:

OPENAI_API_KEY=your_server_side_key_here

A quick note on package choices:

  • websocket-client keeps the WebSocket example short and readable.
  • numpy plus soundfile make it easy to convert audio into PCM16 bytes.
  • httpx is a clean way to call the client secret endpoint from Python.
  • FastAPI is only needed if you want the browser-token route from Step 6.

Step 3: Open a WebSocket Session and Stream Text Output

The WebSocket guide shows the basic connection URL: wss://api.openai.com/v1/realtime?model=gpt-realtime. After the socket opens, you send JSON events such as session.update, conversation.item.create, and response.create. The server then returns lifecycle events like response.output_text.delta and response.done.

A minimal text-first script looks like this:

import json
import os

from dotenv import load_dotenv
from websocket import WebSocketApp

load_dotenv()

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"


def on_open(ws: WebSocketApp) -> None:
    print("Connected to Realtime API.")

    ws.send(
        json.dumps(
            {
                "type": "session.update",
                "session": {
                    "type": "realtime",
                    "model": "gpt-realtime",
                    "output_modalities": ["text"],
                    "instructions": (
                        "You are a concise Python assistant. "
                        "Answer in short paragraphs and end with one practical next step."
                    ),
                },
            }
        )
    )

    ws.send(
        json.dumps(
            {
                "type": "conversation.item.create",
                "item": {
                    "type": "message",
                    "role": "user",
                    "content": [
                        {
                            "type": "input_text",
                            "text": (
                                "Explain when a Python backend should use WebSocket "
                                "instead of WebRTC for the OpenAI Realtime API."
                            ),
                        }
                    ],
                },
            }
        )
    )

    ws.send(
        json.dumps(
            {
                "type": "response.create",
                "response": {
                    "output_modalities": ["text"],
                },
            }
        )
    )


def on_message(ws: WebSocketApp, message: str) -> None:
    event = json.loads(message)
    event_type = event.get("type")

    if event_type == "response.output_text.delta":
        print(event["delta"], end="", flush=True)
    elif event_type == "response.done":
        print("\n\nResponse complete.")
        ws.close()
    elif event_type == "error":
        print("\nRealtime error:", event)


def on_error(ws: WebSocketApp, error: Exception) -> None:
    print("WebSocket error:", error)


def on_close(ws: WebSocketApp, close_status_code, close_msg) -> None:
    print(f"Socket closed: {close_status_code} {close_msg}")


ws = WebSocketApp(
    URL,
    header=[f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}"],
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
)

ws.run_forever()

When you run it, the sequence is straightforward:

  1. session.update defines how this session should behave.
  2. conversation.item.create adds a user message to the current conversation.
  3. response.create tells the model to answer.
  4. response.output_text.delta streams the answer as it is generated.
  5. response.done marks the final server event for that response.

This split is useful in practice. If you want a typing effect or a live terminal display, consume the delta events. If you only care about the finished answer, ignore the deltas and read response.done.
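If you want both behaviors at once, a small accumulator covers it. This is a sketch of the pattern on its own, separate from the script above, with the event payloads faked for illustration:

```python
import json


class TextAccumulator:
    """Collects streamed text deltas and exposes the finished answer."""

    def __init__(self) -> None:
        self.parts: list[str] = []
        self.final: str | None = None

    def handle_event(self, event: dict) -> None:
        if event.get("type") == "response.output_text.delta":
            # Stream each fragment as it arrives, e.g. print() for a typing effect.
            self.parts.append(event["delta"])
        elif event.get("type") == "response.done":
            self.final = "".join(self.parts)


# Simulated server events, in the order a real session would emit them:
acc = TextAccumulator()
for raw in (
    '{"type": "response.output_text.delta", "delta": "Hello, "}',
    '{"type": "response.output_text.delta", "delta": "world."}',
    '{"type": "response.done"}',
):
    acc.handle_event(json.loads(raw))

print(acc.final)  # Hello, world.
```

Wire `handle_event` into `on_message` and you get the streaming display for free while still having the full answer at the end.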

A common first mistake is forgetting that conversation.item.create only adds input. The model does not speak until you send response.create, unless your session mode is set up to auto-generate responses from voice input.
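If you do want automatic responses from voice input, you opt in through the session's turn detection settings. The sketch below follows the GA `session.update` shape as we understand it; treat the nested field names as something to verify against the current session reference before relying on them:

```python
import json

# Sketch: enable server-side voice activity detection so the model responds
# on its own when the user stops speaking. Field layout is an assumption
# based on the GA "realtime" session shape; verify against the docs.
vad_session_update = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "audio": {
            "input": {
                "turn_detection": {"type": "server_vad"},
            }
        },
    },
}

payload = json.dumps(vad_session_update)
# ws.send(payload) on an open socket turns this on for the rest of the session.
```

With this in place, the explicit `response.create` after each audio commit becomes optional, because the server decides when a turn has ended.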

Step 4: Stream Audio When You Need Voice Input

This is where Realtime starts to feel different from normal request-response APIs. The conversations guide points out that WebSocket audio handling is manual. You send Base64-encoded audio bytes into the input buffer yourself. Each chunk must stay under 15 MB.

If you are doing voice work from Python, start with a file-based test before you reach for a microphone stream. It is easier to debug and easier to replay.

import base64
import json

import numpy as np
import soundfile as sf
from websocket import WebSocketApp


def float32_to_pcm16(audio: np.ndarray) -> bytes:
    clipped = np.clip(audio, -1, 1)
    return (clipped * 32767).astype("<i2").tobytes()


def encode_audio_file(path: str) -> str:
    data, sample_rate = sf.read(path, dtype="float32")
    channel_data = data[:, 0] if data.ndim > 1 else data
    pcm16_bytes = float32_to_pcm16(channel_data)
    return base64.b64encode(pcm16_bytes).decode("utf-8")


def send_audio_file(ws: WebSocketApp, path: str) -> None:
    payload = encode_audio_file(path)

    ws.send(
        json.dumps(
            {
                "type": "input_audio_buffer.append",
                "audio": payload,
            }
        )
    )

    ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
    ws.send(json.dumps({"type": "response.create"}))

A few details matter here:

  • input_audio_buffer.append sends raw audio into the current input buffer.
  • input_audio_buffer.commit turns that buffer into a user input item.
  • response.create asks the model to respond, which you need when VAD is disabled.
  • If VAD is enabled, the server can decide when speech has started and stopped, and may create responses automatically.

The docs also mention another option: instead of chunking audio into the buffer, you can create a full conversation item with input_audio content. That is useful when you already have a fully recorded clip and want the message to arrive as one unit.
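A sketch of that second pattern looks like this. It reuses the base64 string produced by the `encode_audio_file` helper above; the `input_audio` content type follows the docs, but treat the exact item shape as something to double-check:

```python
import json


def send_audio_item(ws, encoded_audio: str) -> None:
    """Send a fully recorded clip as one conversation item instead of
    streaming it through the input buffer chunk by chunk."""
    ws.send(
        json.dumps(
            {
                "type": "conversation.item.create",
                "item": {
                    "type": "message",
                    "role": "user",
                    "content": [
                        {
                            "type": "input_audio",
                            "audio": encoded_audio,
                        }
                    ],
                },
            }
        )
    )
    # Still need to ask for a response when VAD is disabled.
    ws.send(json.dumps({"type": "response.create"}))
```

No commit event is needed here, because the clip arrives as a complete user message rather than buffered audio.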

[Illustration: the OpenAI Realtime event flow in a Python application, from session setup to streamed deltas and final response]

One more production detail is easy to miss. If you need real audio bytes from the model over WebSocket, listen for response.output_audio.delta. The final response.done event contains transcriptions and metadata, not the raw audio chunks themselves.

Step 5: Keep Tools and Private Logic on the Server

The server-side controls guide makes an important recommendation: keep tool use and business logic on your application server. That rule matters even more for voice applications, because they often need CRM lookups, billing checks, policy gates, or moderation steps that should never live in browser code.

If your whole product is backend-driven, this is easy. Your Python service owns the WebSocket connection, registers tools, handles function calls, and returns the model output to whatever frontend you already have.

If your frontend uses WebRTC, the best pattern is usually a sideband connection:

  • The browser owns the live audio stream over WebRTC.
  • Your server opens a second connection to the same Realtime session.
  • The server updates instructions, answers tool calls, and keeps private logic off the client.

At the session level, tool configuration looks like this:

tools_event = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "function",
                "name": "lookup_order_status",
                "description": "Return shipping status for an order id.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "order_id": {
                            "type": "string",
                            "description": "The customer's order number.",
                        }
                    },
                    "required": ["order_id"],
                },
            }
        ],
        "tool_choice": "auto",
    },
}

That does not execute the function by itself. It only tells the model which tools exist. Your Python application still needs to watch the conversation or response events, detect when the model wants a function call, run the actual code, and post the tool result back into the session.

This is also where sideband control shines. The browser can stay focused on media. The Python server can stay focused on business logic.
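The tool loop itself can be sketched like this. It assumes your handler watches `response.done` for output items of type `function_call` and replies with a `function_call_output` item, which matches the documented flow; the event shapes here are simplified, so verify field names against the API reference:

```python
import json


def lookup_order_status(order_id: str) -> dict:
    # Placeholder for the real CRM or database lookup.
    return {"order_id": order_id, "status": "shipped"}


TOOL_HANDLERS = {"lookup_order_status": lookup_order_status}


def handle_response_done(ws, event: dict) -> None:
    """Run any function calls the model requested, then return the results."""
    for item in event.get("response", {}).get("output", []):
        if item.get("type") != "function_call":
            continue
        handler = TOOL_HANDLERS.get(item["name"])
        if handler is None:
            continue
        args = json.loads(item.get("arguments") or "{}")
        result = handler(**args)
        ws.send(
            json.dumps(
                {
                    "type": "conversation.item.create",
                    "item": {
                        "type": "function_call_output",
                        "call_id": item["call_id"],
                        "output": json.dumps(result),
                    },
                }
            )
        )
        # Ask the model to continue now that it has the tool result.
        ws.send(json.dumps({"type": "response.create"}))
```

Because the handler dispatch lives in `TOOL_HANDLERS`, adding a new tool means one new function plus one new entry in the `session.update` tools list.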

Step 6: Mint Ephemeral Tokens for Browser Clients From Python

The WebRTC guide recommends ephemeral client secrets when a browser or mobile client connects directly to Realtime. The API reference says these tokens expire after one minute, and they are meant for client environments. Your standard API key should remain on the server.

Here is the Python equivalent of the token-minting flow shown in the docs:

import os

import httpx
from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException

load_dotenv()

app = FastAPI()
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]


@app.get("/token")
async def create_realtime_token():
    payload = {
        "session": {
            "type": "realtime",
            "model": "gpt-realtime",
            "audio": {
                "output": {
                    "voice": "marin",
                }
            },
        }
    }

    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "https://api.openai.com/v1/realtime/client_secrets",
            headers={
                "Authorization": f"Bearer {OPENAI_API_KEY}",
                "Content-Type": "application/json",
            },
            json=payload,
        )

    try:
        response.raise_for_status()
    except httpx.HTTPStatusError as exc:
        raise HTTPException(
            status_code=exc.response.status_code,
            detail=exc.response.text,
        ) from exc

    return response.json()

Start it like this:

uvicorn token_server:app --reload

Your browser can then call /token, receive a short-lived client_secret, and use that to establish a WebRTC session directly with OpenAI. If you want the unified WebRTC flow instead, your backend can also create the session by POSTing SDP to /v1/realtime/calls.

The good design rule is simple:

  • Standard API key: backend only
  • Ephemeral client secret: browser or mobile client
  • Private tools and policy logic: backend only

Advanced Tips

Now that the basic flow works, these three habits will save you time later.

Tip 1: Start with text-only responses while debugging

The docs note that sessions can mix text and audio. That is useful, but it can make debugging noisy. Start with output_modalities: ["text"] until your event flow is solid. Once the socket lifecycle feels predictable, switch on audio.

Tip 2: Treat session state as part of your architecture

Realtime is stateful. That is the whole point. The session holds configuration, the conversation keeps prior items, and each new response can build on earlier turns. This is great for voice UX, but it also means reconnect behavior, session renewal, and context trimming are part of the design, not cleanup work.
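One concrete version of that trimming work looks like the sketch below. It assumes you record item ids as the server confirms them; `conversation.item.delete` is the documented client event for dropping items, but the bookkeeping strategy here is our own:

```python
import json
from collections import deque


class ConversationTrimmer:
    """Keeps at most max_items conversation items by deleting the oldest."""

    def __init__(self, ws, max_items: int = 50) -> None:
        self.ws = ws
        self.max_items = max_items
        self.item_ids: deque[str] = deque()

    def track(self, item_id: str) -> None:
        """Call this with each new item id the server confirms."""
        self.item_ids.append(item_id)
        while len(self.item_ids) > self.max_items:
            oldest = self.item_ids.popleft()
            self.ws.send(
                json.dumps(
                    {"type": "conversation.item.delete", "item_id": oldest}
                )
            )
```

A fixed item cap is crude; in practice you might keep the first system-style items and trim only the middle of the conversation, but the mechanism is the same.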

Tip 3: Watch costs on long sessions

OpenAI’s cost guide says the whole conversation is considered for later responses, which means turns get more expensive as the session grows. The same guide also gives concrete audio token math: user audio is billed at 1 token per 100 ms, and assistant audio at 1 token per 50 ms. If you keep a session open for a long support call, track that growth early instead of discovering it in billing later.
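That math is easy to sanity-check. Using the rates above (1 user-audio token per 100 ms, 1 assistant-audio token per 50 ms), a rough per-call token estimate looks like this. The call durations are made-up inputs, and this counts audio tokens only, not the text tokens that ride along with them:

```python
def audio_tokens(user_seconds: float, assistant_seconds: float) -> dict:
    """Rough audio token count from the documented rates:
    user audio at 1 token / 100 ms, assistant audio at 1 token / 50 ms."""
    return {
        "user_tokens": int(user_seconds * 1000 / 100),
        "assistant_tokens": int(assistant_seconds * 1000 / 50),
    }


# A 10-minute support call, split 60/40 between caller and assistant:
estimate = audio_tokens(user_seconds=360, assistant_seconds=240)
print(estimate)  # {'user_tokens': 3600, 'assistant_tokens': 4800}
```

Note that assistant speech costs tokens twice as fast as user speech, so chatty system prompts that produce long spoken answers are the first thing to tighten.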

Common Problems and Solutions

Problem 1: The browser gets a 401 or a failed handshake

Solution: Do not send your standard API key to the browser. Use your backend to mint a client secret, then connect the browser over WebRTC or another approved client flow.

Problem 2: Audio uploads but the model never answers

Solution: Check your VAD setting. If VAD is off, you must send both input_audio_buffer.commit and response.create. Appending audio alone is not enough.

Problem 3: You cannot switch voices midway through a call

Solution: Set the voice early. The conversations guide notes that once the model has already produced audio in a session, the voice cannot be changed for that session.

Problem 4: The final event has text, but no playable audio bytes

Solution: Listen for response.output_audio.delta. The docs are explicit here: response.done and response.output_audio.done do not carry the full audio bytes you need for playback or file output.

Conclusion

The Python story for OpenAI Realtime is better than it first looks. Once you understand the event model, a backend WebSocket client is not complicated. It is just explicit. You open a live connection, configure a session, add items to the conversation, and tell the model when to respond.

That makes WebSocket a strong choice for secure Python backends, worker services, and any system that needs tight control over tools or session state. When the user interface lives in the browser, WebRTC is still the better transport for media, with your Python server acting as the place where secrets, tools, and policy checks stay private.

If you build this in stages (text first, audio second, browser tokens third), you will avoid most of the pain people run into with realtime systems.

For related reading, see our guides on FastAPI WebSockets, FastAPI async patterns, and Python automation libraries for AI workflows.
