Speech to Text

Transcribe audio files to text using OpenAI Whisper and Deepgram models for automated audio processing in your workflows.

The Speech to Text node transcribes audio files into text. Upload an audio file through a form, pass its URL to this node, and receive a text transcription that you can route to downstream nodes for classification, summarization, translation, or storage. This is essential for workflows that process voicemails, recorded interviews, podcast submissions, call recordings, or voice notes.

Supported Providers and Models

| Provider | Models | Strengths |
| --- | --- | --- |
| OpenAI | Whisper v3 | Broad language support, industry standard |
| Deepgram | Nova-3, Nova-3 Medical, Nova-2, Nova-2 General | Fast inference, specialized models |

Model Comparison

| Model | Provider | Accuracy | Speed | Specialization |
| --- | --- | --- | --- | --- |
| Whisper v3 (whisper-1) | OpenAI | Excellent | Moderate | General-purpose, 50+ languages |
| Nova-3 | Deepgram | Excellent | Fast | Latest generation, highest accuracy |
| Nova-3 Medical | Deepgram | Excellent | Fast | Medical terminology, clinical notes |
| Nova-2 | Deepgram | Very good | Fast | Previous generation, reliable |
| Nova-2 General | Deepgram | Very good | Fast | General-purpose |

Choose Whisper v3 for broad language coverage and when you need a single model that handles most languages and accents well.

Choose Nova-3 for English-primary transcription where speed and accuracy are critical.

Choose Nova-3 Medical for healthcare workflows involving medical dictation, clinical notes, or patient recordings where medical terminology must be transcribed accurately.

Configuration

Provider

Select OpenAI (Whisper) or Deepgram. The model list updates accordingly.

Model

Choose the transcription model. See the comparison table above for guidance on model selection.

Credential

Select a saved API key for the chosen provider. See Credential Management for setup instructions.

Audio URL

The URL of the audio file to transcribe. This field supports template variables, and templating is the primary way to use it: reference the URL produced by a file upload form field.

Example workflow: a form with a file upload field labeled "Voice Note" produces a URL. Insert that variable here so the Speech to Text node transcribes whatever audio file the user uploads.

Supported audio formats:

| Format | Extension | Notes |
| --- | --- | --- |
| MP3 | .mp3 | Most common, widely supported |
| WAV | .wav | Uncompressed, highest quality input |
| M4A | .m4a | Common from mobile recordings |
| FLAC | .flac | Lossless compression |
| OGG | .ogg | Open format |
| WebM | .webm | Browser recording format |
| MP4 | .mp4 | Video files (audio track extracted) |

The audio file must be accessible via a public or signed URL. Buildorado file uploads automatically generate accessible URLs. Maximum file size is 25MB. There is a 60-second fetch timeout on downloading the audio file, so very large files on slow hosts may fail.
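If you control the upload pipeline, you can catch oversized or non-audio files before the node ever runs. The sketch below (plain Python, outside Buildorado) checks a URL's HEAD-request metadata against the limits documented above; the `head` and `preflight` helpers are illustrative names, not part of the product.

```python
import urllib.request

MAX_BYTES = 25 * 1024 * 1024   # documented 25MB file-size limit
FETCH_TIMEOUT_S = 60           # documented 60-second fetch timeout

# Content types that plausibly carry a supported audio track
AUDIO_CONTENT_TYPES = ("audio/", "video/mp4", "application/ogg")

def preflight(content_type: str, content_length: int) -> tuple[bool, str]:
    """Check HEAD-request metadata against the node's documented limits."""
    if not content_type.startswith(AUDIO_CONTENT_TYPES):
        return False, f"unexpected content type: {content_type}"
    if content_length > MAX_BYTES:
        return False, f"file is {content_length} bytes; limit is {MAX_BYTES}"
    return True, "ok"

def head(url: str) -> tuple[str, int]:
    """Fetch Content-Type and Content-Length without downloading the body."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=FETCH_TIMEOUT_S) as resp:
        return (resp.headers.get("Content-Type", ""),
                int(resp.headers.get("Content-Length", 0)))
```

A check like this can run in whatever system produces the URL, so that the transcription step only ever receives files it can actually process.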

Language

Specify the spoken language in the audio, or leave it set to Auto-detect.

Available language options:

  • Auto-detect
  • English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, Chinese, Arabic, Hindi

Whisper supports 50+ languages beyond the ones listed in the dropdown. Deepgram language support varies by model. Nova-3 supports the broadest range.

Timestamps

A toggle switch. When enabled, the transcription output includes segment-level timestamps: each segment carries its text along with start and end times.

Timestamps are useful for:

  • Syncing transcriptions with video playback
  • Identifying specific moments in long recordings
  • Creating subtitles or captions
  • Analyzing speaking pace and pauses

When disabled, the output contains only the transcribed text without timing information.
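As an illustration of the subtitle use case, the sketch below renders segment objects as an SRT caption file. It assumes `start` and `end` are expressed in seconds (the docs do not state the unit); the helper names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render timestamped segments as an SRT subtitle file."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"
```

The same segment data could equally drive WebVTT captions or time-stamped meeting notes; only the output formatting changes.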

Output

The Speech to Text node produces:

| Field | Type | Description |
| --- | --- | --- |
| text | string | The full transcribed text |
| language | string | Detected or specified language code |
| durationMs | number | Duration of the audio in milliseconds |
| segments | array | Array of segment objects with text, start, and end (when timestamps enabled) |
| confidence | number | Confidence score for the transcription (when available) |
| model | string | The model that was used |
| provider | string | The provider that was used |
The transcribed text is available to downstream nodes via template variables. Common downstream uses include feeding the text into an Agent node for analysis, storing it in a spreadsheet, or sending it in an email notification.
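A downstream consumer of this output might look like the following sketch, which derives a speaking rate from the `text` and `durationMs` fields. The sample payload and function name are illustrative, not a product API.

```python
# Illustrative payload shaped like the documented output fields
sample = {
    "text": "Thanks for calling, please leave a message after the tone.",
    "language": "en",
    "durationMs": 4200,
    "segments": [],
    "confidence": 0.97,
    "model": "nova-3",
    "provider": "deepgram",
}

def speaking_rate_wpm(output: dict) -> float:
    """Words per minute, derived from `text` and `durationMs`."""
    words = len(output["text"].split())
    minutes = output["durationMs"] / 60_000
    return round(words / minutes, 1) if minutes else 0.0
```

Metrics like this are cheap to compute before handing the text to an Agent node, and can feed routing decisions (for example, flagging unusually fast or slow recordings for review).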

Use Cases

Voicemail Processing

Automatically transcribe and route voicemails:

  • A phone system sends voicemail recordings to a webhook.
  • The Speech to Text node transcribes the audio.
  • An Agent node classifies the voicemail (sales inquiry, support request, spam).
  • A Branch node routes the transcription to the appropriate team's Slack channel or email.

Interview Transcription

Process recorded interviews submitted through a form:

  • An HR form includes a file upload field for interview recordings.
  • Speech to Text transcribes the recording.
  • An Agent node summarizes key points, candidate strengths, and red flags.
  • The summary and full transcription are stored in Google Sheets.

Medical Documentation

Transcribe clinical dictation:

  • A medical professional submits a voice recording via a form.
  • The Nova-3 Medical model transcribes with accurate medical terminology.
  • The transcription is stored as a patient note in the connected system.
  • Template variables insert patient ID and date for record-keeping.

Podcast and Content Processing

Process audio content submissions:

  • A content creator submits a podcast episode via file upload.
  • Speech to Text transcribes the full episode.
  • An Agent node creates a summary, show notes, and key timestamps.
  • The content is pushed to a CMS or Google Doc.

Meeting Notes

Automate meeting documentation:

  • After a meeting, the recording is uploaded through an internal form.
  • Speech to Text transcribes the conversation.
  • An Agent node extracts action items, decisions, and owners.
  • Results are posted to Slack and stored in Notion.

Multilingual Support

Process submissions in multiple languages:

  • A global customer form accepts voice messages in any language.
  • Speech to Text transcribes with auto-detect enabled.
  • The detected language is used to route to the appropriate support team.
  • Optionally, an Agent node translates the transcription to English for a unified support queue.

Provider Comparison

| Feature | OpenAI (Whisper) | Deepgram |
| --- | --- | --- |
| Language support | 50+ languages | Varies by model |
| Medical specialization | No | Yes (Nova-3 Medical) |
| Timestamps | Yes | Yes |
| Speed | Moderate | Fast |
| Audio format support | Broad | Broad |
| Auto-language detection | Yes | Yes |
| Pricing model | Per minute of audio | Per minute of audio |

Best Practices

  • Specify the language when you know it in advance. Auto-detection is good but not perfect, especially for short audio clips or noisy recordings.
  • Use Nova-3 Medical for any healthcare-related transcription. General-purpose models frequently misspell medical terms, drug names, and procedures.
  • Enable timestamps when the downstream workflow needs to reference specific moments in the audio (e.g., generating subtitles, creating time-stamped notes).
  • Check audio quality. Transcription accuracy degrades significantly with background noise, multiple overlapping speakers, or very low-quality recordings. Clear audio produces dramatically better results.
  • Handle empty uploads. Add conditional logic before the Speech to Text node to verify that the file upload field is not empty, preventing errors from missing audio files.
  • Consider file size. Audio files are limited to 25MB. For long recordings, consider splitting the audio before upload.
  • Combine with AI analysis. Transcription alone is rarely the final step. Feed the text into an Agent node to extract insights, classify, summarize, or translate.
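For uncompressed WAV recordings, the pre-upload splitting suggested above can be done with Python's standard library alone. This is a sketch, not a Buildorado feature; for compressed formats (MP3, M4A, etc.) you would re-encode chunks with an audio library instead, since slicing their bytes directly produces invalid files.

```python
import io
import wave

MAX_BYTES = 25 * 1024 * 1024  # the node's 25MB limit

def split_wav(data: bytes, max_bytes: int = MAX_BYTES) -> list[bytes]:
    """Split an uncompressed WAV into standalone files under max_bytes each."""
    with wave.open(io.BytesIO(data)) as src:
        params = src.getparams()
        bytes_per_frame = params.nchannels * params.sampwidth
        # Leave headroom for the WAV header written into each chunk
        frames_per_chunk = (max_bytes - 1024) // bytes_per_frame
        chunks = []
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                dst.setnchannels(params.nchannels)
                dst.setsampwidth(params.sampwidth)
                dst.setframerate(params.framerate)
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

Each chunk is a complete, playable WAV file, so each can be uploaded and transcribed independently (for example, inside a Loop node), with the transcriptions concatenated afterward.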

Limitations

  • Audio files must be accessible via URL. Direct file path references are not supported.
  • Maximum file size is 25MB.
  • There is a 60-second timeout on fetching the audio file from its URL.
  • Transcription accuracy depends on audio quality. Heavy background noise, multiple speakers, and strong accents reduce accuracy.
  • The node processes one audio file per execution. For batch processing, use a Loop node.
  • Real-time streaming transcription is not supported. The node processes complete audio files only.
  • The overall node execution is subject to a 5-minute (300-second) timeout, separate from the 60-second audio fetch timeout, to accommodate large audio files.
