Speech to Text

Transcribe audio files to text using OpenAI Whisper and Deepgram models for automated audio processing in your workflows.

The Speech to Text node transcribes audio files into text. Upload an audio file through a form, pass its URL to this node, and receive a text transcription that you can route to downstream nodes for classification, summarization, translation, or storage. This is essential for workflows that process voicemails, recorded interviews, podcast submissions, call recordings, or voice notes.

Supported Providers and Models

| Provider | Models | Strengths |
| --- | --- | --- |
| OpenAI | Whisper v3 | Broad language support, industry standard |
| Deepgram | Nova-3, Nova-3 Medical, Nova-2, Nova-2 General | Fast inference, specialized models |

Model Comparison

| Model | Provider | Accuracy | Speed | Specialization |
| --- | --- | --- | --- | --- |
| Whisper v3 (whisper-1) | OpenAI | Excellent | Moderate | General-purpose, 50+ languages |
| Nova-3 | Deepgram | Excellent | Fast | Latest generation, highest accuracy |
| Nova-3 Medical | Deepgram | Excellent | Fast | Medical terminology, clinical notes |
| Nova-2 | Deepgram | Very good | Fast | Previous generation, reliable |
| Nova-2 General | Deepgram | Very good | Fast | General-purpose |

Choose Whisper v3 for broad language coverage and when you need a single model that handles most languages and accents well.

Choose Nova-3 for English-primary transcription where speed and accuracy are critical.

Choose Nova-3 Medical for healthcare workflows involving medical dictation, clinical notes, or patient recordings where medical terminology must be transcribed accurately.

Configuration

Provider

Select OpenAI (Whisper) or Deepgram. The model list updates accordingly.

Model

Choose the transcription model. See the comparison table above for guidance on model selection.

Credential

Select a saved API key for the chosen provider. See Credential Management for setup instructions.

Audio URL

The URL of the audio file to transcribe. This field supports template variables, and templating is the primary way to use it: reference the URL produced by a file upload form field.

Example workflow: a form with a file upload field labeled "Voice Note" produces a URL. Insert that variable here so the Speech to Text node transcribes whatever audio file the user uploads.

Supported audio formats:

| Format | Extension | Notes |
| --- | --- | --- |
| MP3 | .mp3 | Most common, widely supported |
| WAV | .wav | Uncompressed, highest quality input |
| M4A | .m4a | Common from mobile recordings |
| FLAC | .flac | Lossless compression |
| OGG | .ogg | Open format |
| WebM | .webm | Browser recording format |
| MP4 | .mp4 | Video files (audio track extracted) |

The audio file must be accessible via a public or signed URL. Buildorado file uploads automatically generate accessible URLs. Maximum file size is 25MB. There is a 60-second fetch timeout on downloading the audio file, so very large files on slow hosts may fail.
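If you control the upload pipeline, you can catch oversized or non-audio files before the node ever runs. The sketch below (plain Python, outside Buildorado) checks a URL's HEAD-request metadata against the limits documented above; the `head` and `preflight` helpers are illustrative names, not part of the product.

```python
import urllib.request

MAX_BYTES = 25 * 1024 * 1024   # documented 25MB file-size limit
FETCH_TIMEOUT_S = 60           # documented 60-second fetch timeout

# Content types that plausibly carry a supported audio track
AUDIO_CONTENT_TYPES = ("audio/", "video/mp4", "application/ogg")

def preflight(content_type: str, content_length: int) -> tuple[bool, str]:
    """Check HEAD-request metadata against the node's documented limits."""
    if not content_type.startswith(AUDIO_CONTENT_TYPES):
        return False, f"unexpected content type: {content_type}"
    if content_length > MAX_BYTES:
        return False, f"file is {content_length} bytes; limit is {MAX_BYTES}"
    return True, "ok"

def head(url: str) -> tuple[str, int]:
    """Fetch Content-Type and Content-Length without downloading the body."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=FETCH_TIMEOUT_S) as resp:
        return (resp.headers.get("Content-Type", ""),
                int(resp.headers.get("Content-Length", 0)))
```

A check like this can run in whatever system produces the URL, so that the transcription step only ever receives files it can actually process.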

Language

Specify the spoken language in the audio, or leave it set to Auto-detect.

Available language options:

  • Auto-detect
  • English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, Chinese, Arabic, Hindi

Whisper supports 50+ languages beyond the ones listed in the dropdown. Deepgram language support varies by model. Nova-3 supports the broadest range.

Timestamps

A toggle switch. When enabled, the transcription output includes segment-level timestamps: each segment carries its text along with start and end times.

Timestamps are useful for:

  • Syncing transcriptions with video playback
  • Identifying specific moments in long recordings
  • Creating subtitles or captions
  • Analyzing speaking pace and pauses

When disabled, the output contains only the transcribed text without timing information.
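As an illustration of the subtitle use case, the sketch below renders segment objects as an SRT caption file. It assumes `start` and `end` are expressed in seconds (the docs do not state the unit); the helper names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render timestamped segments as an SRT subtitle file."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"
```

The same segment data could equally drive WebVTT captions or time-stamped meeting notes; only the output formatting changes.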

Output

The Speech to Text node produces:

| Field | Type | Description |
| --- | --- | --- |
| text | string | The full transcribed text |
| language | string | Detected or specified language code |
| durationMs | number | Duration of the audio in milliseconds |
| segments | array | Array of segment objects with text, start, and end (when timestamps enabled) |
| confidence | number | Confidence score for the transcription (when available) |
| model | string | The model that was used |
| provider | string | The provider that was used |
The transcribed text is available to downstream nodes via template variables. Common downstream uses include feeding the text into an Agent node for analysis, storing it in a spreadsheet, or sending it in an email notification.
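A downstream consumer of this output might look like the following sketch, which derives a speaking rate from the `text` and `durationMs` fields. The sample payload and function name are illustrative, not a product API.

```python
# Illustrative payload shaped like the documented output fields
sample = {
    "text": "Thanks for calling, please leave a message after the tone.",
    "language": "en",
    "durationMs": 4200,
    "segments": [],
    "confidence": 0.97,
    "model": "nova-3",
    "provider": "deepgram",
}

def speaking_rate_wpm(output: dict) -> float:
    """Words per minute, derived from `text` and `durationMs`."""
    words = len(output["text"].split())
    minutes = output["durationMs"] / 60_000
    return round(words / minutes, 1) if minutes else 0.0
```

Metrics like this are cheap to compute before handing the text to an Agent node, and can feed routing decisions (for example, flagging unusually fast or slow recordings for review).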

Use Cases

Voicemail Processing

Automatically transcribe and route voicemails:

  • A phone system sends voicemail recordings to a webhook.
  • The Speech to Text node transcribes the audio.
  • An Agent node classifies the voicemail (sales inquiry, support request, spam).
  • A Branch node routes the transcription to the appropriate team's Slack channel or email.

Interview Transcription

Process recorded interviews submitted through a form:

  • An HR form includes a file upload field for interview recordings.
  • Speech to Text transcribes the recording.
  • An Agent node summarizes key points, candidate strengths, and red flags.
  • The summary and full transcription are stored in Google Sheets.

Medical Documentation

Transcribe clinical dictation:

  • A medical professional submits a voice recording via a form.
  • The Nova-3 Medical model transcribes with accurate medical terminology.
  • The transcription is stored as a patient note in the connected system.
  • Template variables insert patient ID and date for record-keeping.

Podcast and Content Processing

Process audio content submissions:

  • A content creator submits a podcast episode via file upload.
  • Speech to Text transcribes the full episode.
  • An Agent node creates a summary, show notes, and key timestamps.
  • The content is pushed to a CMS or Google Doc.

Meeting Notes

Automate meeting documentation:

  • After a meeting, the recording is uploaded through an internal form.
  • Speech to Text transcribes the conversation.
  • An Agent node extracts action items, decisions, and owners.
  • Results are posted to Slack and stored in Notion.

Multilingual Support

Process submissions in multiple languages:

  • A global customer form accepts voice messages in any language.
  • Speech to Text transcribes with auto-detect enabled.
  • The detected language is used to route to the appropriate support team.
  • Optionally, an Agent node translates the transcription to English for a unified support queue.

Provider Comparison

| Feature | OpenAI (Whisper) | Deepgram |
| --- | --- | --- |
| Language support | 50+ languages | Varies by model |
| Medical specialization | No | Yes (Nova-3 Medical) |
| Timestamps | Yes | Yes |
| Speed | Moderate | Fast |
| Audio format support | Broad | Broad |
| Auto-language detection | Yes | Yes |
| Pricing model | Per minute of audio | Per minute of audio |

Best Practices

  • Specify the language when you know it in advance. Auto-detection is good but not perfect, especially for short audio clips or noisy recordings.
  • Use Nova-3 Medical for any healthcare-related transcription. General-purpose models frequently misspell medical terms, drug names, and procedures.
  • Enable timestamps when the downstream workflow needs to reference specific moments in the audio (e.g., generating subtitles, creating time-stamped notes).
  • Check audio quality. Transcription accuracy degrades significantly with background noise, multiple overlapping speakers, or very low-quality recordings. Clear audio produces dramatically better results.
  • Handle empty uploads. Add conditional logic before the Speech to Text node to verify that the file upload field is not empty, preventing errors from missing audio files.
  • Consider file size. Audio files are limited to 25MB. For long recordings, consider splitting the audio before upload.
  • Combine with AI analysis. Transcription alone is rarely the final step. Feed the text into an Agent node to extract insights, classify, summarize, or translate.
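For uncompressed WAV recordings, the pre-upload splitting suggested above can be done with Python's standard library alone. This is a sketch, not a Buildorado feature; for compressed formats (MP3, M4A, etc.) you would re-encode chunks with an audio library instead, since slicing their bytes directly produces invalid files.

```python
import io
import wave

MAX_BYTES = 25 * 1024 * 1024  # the node's 25MB limit

def split_wav(data: bytes, max_bytes: int = MAX_BYTES) -> list[bytes]:
    """Split an uncompressed WAV into standalone files under max_bytes each."""
    with wave.open(io.BytesIO(data)) as src:
        params = src.getparams()
        bytes_per_frame = params.nchannels * params.sampwidth
        # Leave headroom for the WAV header written into each chunk
        frames_per_chunk = (max_bytes - 1024) // bytes_per_frame
        chunks = []
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                dst.setnchannels(params.nchannels)
                dst.setsampwidth(params.sampwidth)
                dst.setframerate(params.framerate)
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

Each chunk is a complete, playable WAV file, so each can be uploaded and transcribed independently (for example, inside a Loop node), with the transcriptions concatenated afterward.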

Limitations

  • Audio files must be accessible via URL. Direct file path references are not supported.
  • Maximum file size is 25MB.
  • There is a 60-second timeout on fetching the audio file from its URL.
  • Transcription accuracy depends on audio quality. Heavy background noise, multiple speakers, and strong accents reduce accuracy.
  • The node processes one audio file per execution. For batch processing, use a Loop node.
  • Real-time streaming transcription is not supported. The node processes complete audio files only.
  • The overall node execution is subject to a 5-minute (300-second) timeout, separate from the 60-second audio fetch timeout, to accommodate large audio files.
