This API provides streaming speech-to-text transcriptions using WebSockets.
Endpoint: wss://api.sully.ai/v1/audio/transcriptions/stream?account_id=1234567890&api_token=1234567890&sample_rate=16000&language=en
Headers:
X-API-KEY: The API key to use for authentication. Required if X-API-TOKEN is not provided.
X-API-TOKEN: The API token to use for authentication. Required if X-API-KEY is not provided.
X-ACCOUNT-ID: The account ID to use for authentication.
Query parameters:
language: The language of your submitted audio. See our Supported Languages documentation for a complete list of language options.
sample_rate: The sample rate of your submitted audio.
Specifies the encoding format of the audio being sent.
Important: This parameter is required when transmitting raw, headerless audio packets; omit it when the audio is wrapped in a container format.
Supported formats:
- linear16: 16-bit, little-endian PCM audio
- flac: Free Lossless Audio Codec (FLAC)
- mulaw: Mu-law encoded WAV
- amr-nb: Adaptive Multi-Rate, narrowband
- amr-wb: Adaptive Multi-Rate, wideband
- opus: Ogg Opus codec
- speex: Speex codec
- g729: G729 codec (usable with raw or containerized audio)
account_id: The account ID to use for authentication. Required if X-ACCOUNT-ID is not provided.
api_token: A temporary authentication token. Required if X-API-KEY is not provided.
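A minimal sketch of opening the streaming connection from a browser, using the documented query parameters (the credential values below are placeholders):

```ts
// Sketch: open the streaming WebSocket with the documented query parameters.
const params = new URLSearchParams({
  account_id: "YOUR_ACCOUNT_ID",   // placeholder
  api_token: "YOUR_API_TOKEN",     // placeholder
  sample_rate: "16000",
  language: "en",
});

const ws = new WebSocket(
  `wss://api.sully.ai/v1/audio/transcriptions/stream?${params.toString()}`
);
```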
The Speech-to-Text WebSockets API is designed to generate text from partial audio input. It's well-suited for scenarios where the input audio is streamed or generated in chunks.
The WebSocket API uses a bidirectional protocol that encodes all messages as JSON objects.
Upon successful connection, the server sends a status message:
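The payload is not reproduced here; based on the note below, it includes a status field. A minimal assumed sketch:

```ts
// Assumed shape of the initial status message; only "status": "connected"
// is documented below, so any additional fields are not shown.
const connectedMessage = { status: "connected" };
```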
When the connection closes:
Important: Wait for the "status": "connected" message before sending audio data. This ensures the server is ready to process your stream.
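A sketch of gating audio upload on that message, using the ws connection from the earlier sketch (everything beyond the documented status field is illustrative):

```ts
// Sketch: resolve once the server reports "status": "connected".
function waitForConnected(ws: WebSocket): Promise<void> {
  return new Promise((resolve) => {
    ws.addEventListener("message", function onMessage(event) {
      const message = JSON.parse(event.data as string);
      if (message.status === "connected") {
        ws.removeEventListener("message", onMessage);
        resolve();
      }
    });
  });
}
```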
The client can send messages with audio input to the server. The messages can contain the following fields:
A generated partial audio chunk encoded as a base64 string.
Browser MediaRecorder Notice: When using Chrome’s MediaRecorder API, the first audio chunk contains critical header information. Always send this first chunk for proper audio processing. Failing to include header information may result in transcription errors or complete failure.
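A browser-side sketch of capturing microphone audio with MediaRecorder, base64-encoding each chunk (including the first, header-bearing chunk), and sending it over the connection. The field name used for the audio payload ("audio") is an assumption, since the exact client message schema is not reproduced above:

```ts
// Sketch: stream MediaRecorder output as base64-encoded chunks.
// The "audio" field name is an assumption, not a documented name.
async function streamMicrophone(ws: WebSocket): Promise<void> {
  const media = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(media);

  recorder.ondataavailable = async (event: BlobEvent) => {
    // The first chunk carries the container headers Chrome emits; never skip it.
    const bytes = new Uint8Array(await event.data.arrayBuffer());
    let binary = "";
    bytes.forEach((b) => { binary += String.fromCharCode(b); });
    ws.send(JSON.stringify({ audio: btoa(binary) }));
  };

  recorder.start(250); // emit a chunk roughly every 250 ms
}
```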
The server will always respond with a message containing the following fields:
The type of response; this will be "transcript" for transcription results.
Start time of the audio segment in seconds.
End time of the audio segment in seconds.
Duration of the audio segment in seconds.
The processed text sequence.
Indicates if the generation is complete. Deprecated: use is_final instead.
Indicates if the generation is complete.
Array of word objects with text content and timing information:
- word: The raw word as recognized
- start: Start time of the word in seconds
- end: End time of the word in seconds
- confidence: Confidence score between 0-1 for the word recognition
- punctuated_word: The word with proper capitalization and punctuation
ISO-formatted timestamp when the response was generated.
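A sketch of typing and handling these responses on the client. The word-level field names are documented above; the top-level property names other than is_final are assumptions based on the field descriptions:

```ts
// Word-level object, using the field names documented above.
interface Word {
  word: string;             // raw word as recognized
  start: number;            // start time in seconds
  end: number;              // end time in seconds
  confidence: number;       // confidence score between 0 and 1
  punctuated_word: string;  // word with capitalization and punctuation
}

// Sketch of a transcript handler; top-level names other than "is_final"
// are assumptions inferred from the descriptions above.
ws.onmessage = (event) => {
  const message = JSON.parse(event.data as string);
  if (message.type === "transcript" && message.is_final) {
    const words: Word[] = message.words ?? [];
    console.log(message.text, words.map((w) => w.punctuated_word));
  }
};
```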