Text to Speech

Convert text to natural-sounding audio using OpenAI and ElevenLabs voice synthesis models in your workflows.

The Text to Speech node converts written text into spoken audio. Pass any text -- including form submission data via template variables -- and the node returns an audio file. This is useful for creating voice responses, accessibility features, audio notifications, podcast generation, and any workflow that needs to produce spoken content from text data.

Supported Providers and Models

| Provider | Models | Strengths |
|---|---|---|
| OpenAI | TTS-1, TTS-1 HD | Fast, consistent, 6 built-in voices, speed control |
| ElevenLabs | Eleven v3, Multilingual v2, Turbo v2.5 | Highly natural voices, multiple model options |

Model Comparison

| Model | Provider | Quality | Speed | Best For |
|---|---|---|---|---|
| TTS-1 | OpenAI | Good | Fast | Fast, cost-effective voice generation |
| TTS-1 HD | OpenAI | Excellent | Moderate | Customer-facing audio, polished content |
| Eleven v3 | ElevenLabs | Excellent | Moderate | Highest quality ElevenLabs output |
| Multilingual v2 | ElevenLabs | Excellent | Moderate | Multi-language content, natural speech |
| Turbo v2.5 | ElevenLabs | Good | Fast | Low-latency generation |

Configuration

Provider

Select OpenAI or ElevenLabs. The available models, voices, and options change based on the provider.

Model

Choose the text-to-speech model:

  • TTS-1 (OpenAI) -- Standard quality, optimized for speed and low latency.
  • TTS-1 HD (OpenAI) -- High-definition audio. Slower to generate but noticeably smoother and more natural.
  • Eleven v3 (ElevenLabs) -- The latest and highest-quality ElevenLabs model.
  • Multilingual v2 (ElevenLabs) -- Flagship multilingual model for natural-sounding speech across many languages.
  • Turbo v2.5 (ElevenLabs) -- Optimized for low latency and fast generation.
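Because the model list depends on the provider, a configuration check can catch mismatched pairs before a workflow runs. The following is a minimal sketch that uses the display names from this page; the dictionary layout and the `check_model` helper are illustrative assumptions, not part of the product.

```python
# Model display names as listed on this page; the structure is an assumption.
MODELS_BY_PROVIDER = {
    "OpenAI": ["TTS-1", "TTS-1 HD"],
    "ElevenLabs": ["Eleven v3", "Multilingual v2", "Turbo v2.5"],
}


def check_model(provider: str, model: str) -> None:
    """Raise ValueError if the model does not belong to the provider."""
    if provider not in MODELS_BY_PROVIDER:
        raise ValueError(f"Unknown provider: {provider!r}")
    allowed = MODELS_BY_PROVIDER[provider]
    if model not in allowed:
        raise ValueError(
            f"{model!r} is not a {provider} model; expected one of: "
            + ", ".join(allowed)
        )


check_model("ElevenLabs", "Turbo v2.5")  # valid pair, no exception
```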

Credential

Select a saved API key for the chosen provider. See Credential Management for setup.

Text

The text to convert to speech. This field supports template variables, so you can dynamically generate audio from form submissions or upstream node outputs.

Example uses:

Thank you for your submission, [Name variable].
Your request number is [Request ID variable].
We will get back to you within 24 hours.
[Full transcript variable from an Agent node]

Text length limits vary by provider. OpenAI supports up to 4096 characters per request. ElevenLabs limits depend on your subscription tier. For longer text, consider splitting into multiple segments.
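One way to split longer text is to pack whole sentences into segments that stay under the provider limit. The sketch below assumes the 4096-character OpenAI limit stated above; the `split_text` helper and its sentence-boundary rule are hypothetical and would run outside the node, for example in a preprocessing step.

```python
import re

OPENAI_MAX_CHARS = 4096  # per-request limit stated on this page


def split_text(text: str, limit: int = OPENAI_MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Sentences are kept whole where possible; a single sentence longer
    than the limit is cut at the limit.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any sentence that cannot fit in one chunk.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned segment can then be sent through its own Text to Speech step, and the resulting audio files played or joined in order.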

Voice

Select the voice used for speech synthesis. Available voices depend on the provider:

OpenAI Voices:

| Voice | Description | Best For |
|---|---|---|
| Alloy | Neutral, balanced | General-purpose, professional |
| Echo | Warm, conversational | Customer interactions, friendly tone |
| Fable | Expressive, storytelling | Narratives, creative content |
| Onyx | Deep, authoritative | Announcements, formal content |
| Nova | Bright, energetic | Marketing, upbeat content |
| Shimmer | Soft, clear | Gentle notifications, instructions |

All OpenAI voices support the same languages and deliver the same audio quality, so the choice of voice affects tone only.

ElevenLabs Voices:

The editor provides 4 built-in voice options:

| Voice | Voice ID |
|---|---|
| Rachel | 21m00Tcm4TlvDq8ikWAM |
| Antoni | ErXwobaYiN019PkySvjV |
| Elli | MF3mGyEYCl7XYWbV9V6O |
| Josh | TxGEqnHWrfWFTfGW9XjX |

These are hardcoded voice presets. To use other ElevenLabs voices (custom clones, community voices), you would need to configure them via the API directly.
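If you reference these presets from your own scripts, a simple name-to-ID lookup keeps the IDs in one place. The IDs below are the four listed above; the `voice_id_for` helper is a hypothetical convenience and not something the editor exposes.

```python
# The four preset voices and IDs listed on this page.
ELEVENLABS_PRESET_VOICES = {
    "Rachel": "21m00Tcm4TlvDq8ikWAM",
    "Antoni": "ErXwobaYiN019PkySvjV",
    "Elli": "MF3mGyEYCl7XYWbV9V6O",
    "Josh": "TxGEqnHWrfWFTfGW9XjX",
}


def voice_id_for(name: str) -> str:
    """Return the preset voice ID for a name, or raise ValueError."""
    try:
        return ELEVENLABS_PRESET_VOICES[name]
    except KeyError:
        known = ", ".join(ELEVENLABS_PRESET_VOICES)
        raise ValueError(f"Unknown preset {name!r}; known presets: {known}") from None
```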

Speed (OpenAI Only)

Adjust the speaking speed on a scale from 0.25x to 4.0x (step: 0.25x):

| Speed | Multiplier | Use Case |
|---|---|---|
| Slow | 0.25x - 0.75x | Dictation, accessibility, language learning |
| Normal | 1.0x | Default speaking rate |
| Fast | 1.25x - 2.0x | Summary playback, time-sensitive notifications |
| Very fast | 2.25x - 4.0x | Rapid review, speed listening |

Speed adjustment is not available with ElevenLabs. ElevenLabs determines pacing from the text content and voice characteristics.
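When the speed value comes from a form field or an upstream node, it may fall outside the allowed range or between steps. The sketch below clamps a value to 0.25x - 4.0x and snaps it to the nearest 0.25x step, returning `None` for ElevenLabs because speed does not apply there. The `normalize_speed` helper is an assumption for illustration; how the node itself treats out-of-range values is not documented here.

```python
from typing import Optional

MIN_SPEED = 0.25
MAX_SPEED = 4.0
STEP = 0.25


def normalize_speed(value: float, provider: str) -> Optional[float]:
    """Clamp to the supported range and snap to the nearest step.

    Returns None for providers without speed control.
    """
    if provider != "OpenAI":
        return None
    clamped = min(max(value, MIN_SPEED), MAX_SPEED)
    return round(clamped / STEP) * STEP
```

For example, `normalize_speed(1.3, "OpenAI")` yields `1.25`, and any value above 4.0 is reduced to `4.0`.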

Output Format

The output audio format. Four formats are available in the dropdown:

| Format | Notes |
|---|---|
| MP3 | Default. Most widely compatible. |
| WAV | Uncompressed, highest quality. |
| OGG | Open format. Mapped to Opus codec for OpenAI. |
| FLAC | Lossless compression. |

OpenAI supports all four formats. When OGG is selected, the actual API request uses the Opus codec (OpenAI's Ogg Opus format), and the output file has an .ogg extension.

With ElevenLabs, the node always returns MP3 regardless of the format selected, because MP3 is the only ElevenLabs output format supported through this integration.
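The format rules above can be summarized as a small mapping from the selected format to what is actually requested and which file extension results. This is a sketch of the described behavior, not the node's implementation; the function name and the lowercase format strings are assumptions.

```python
SUPPORTED_FORMATS = {"mp3", "wav", "ogg", "flac"}


def resolve_output_format(provider: str, selected: str) -> tuple[str, str]:
    """Return (format requested from the provider, output file extension)."""
    selected = selected.lower()
    if selected not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {selected!r}")
    if provider == "ElevenLabs":
        # ElevenLabs output is always MP3 through this integration.
        return ("mp3", ".mp3")
    if selected == "ogg":
        # For OpenAI, OGG is requested as Opus and saved with an .ogg extension.
        return ("opus", ".ogg")
    return (selected, f".{selected}")
```

Downstream steps that depend on the file type should therefore check the returned file's MIME type or extension instead of the dropdown value when the provider is ElevenLabs.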

Output

The Text to Speech node produces:

| Field | Type | Description |
|---|---|---|
| file | object | File reference with url, key, mimeType, sizeBytes, and filename |
| durationMs | number | Duration of the generated audio in milliseconds (when available) |
| characterCount | number | Number of characters in the input text |
| model | string | The model that was used |
| provider | string | The provider that was used |

Access the audio URL via file.url in downstream template variables.
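If the node output is consumed by custom code, for example in a webhook receiver, it can be read as a nested mapping. The sketch below uses the field names from the table above; every value in `sample_output` is invented for illustration, and the helper functions are hypothetical.

```python
from typing import Any, Mapping, Optional

# Example payload shaped like the table above; all values are made up.
sample_output = {
    "file": {
        "url": "https://example.com/audio/confirmation.mp3",
        "key": "audio/confirmation.mp3",
        "mimeType": "audio/mpeg",
        "sizeBytes": 48213,
        "filename": "confirmation.mp3",
    },
    "durationMs": 2500,
    "characterCount": 87,
    "model": "TTS-1",
    "provider": "OpenAI",
}


def audio_url(output: Mapping[str, Any]) -> str:
    """Return the URL of the generated audio (file.url)."""
    return output["file"]["url"]


def duration_seconds(output: Mapping[str, Any]) -> Optional[float]:
    """Return the duration in seconds, or None when durationMs is absent."""
    duration_ms = output.get("durationMs")
    return None if duration_ms is None else duration_ms / 1000
```

Because durationMs is only present when available, code that reads it should handle the missing case, as `duration_seconds` does.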

Use Cases

Automated Voice Responses

Generate audio confirmations for form submissions:

  • A customer submits a service request form.
  • An Agent node creates a personalized confirmation message.
  • The Text to Speech node converts the message to audio.
  • The audio URL is included in a confirmation email or SMS.

Accessibility Enhancement

Create audio versions of text content:

  • A content submission form collects article text.
  • Text to Speech converts the article to an audio file.
  • The audio file is attached to the content record for accessibility compliance.

Podcast Generation

Automate podcast episode creation:

  • A content creator submits a script through a form.
  • Text to Speech generates the audio narration.
  • The audio file is stored in Google Drive and linked in a Notion database.
  • A notification is sent to the production team for final review.

Interactive Voice Notifications

Create voice messages for notification workflows:

  • A monitoring form triggers when certain conditions are met.
  • An Agent node drafts a voice notification message.
  • Text to Speech converts it to audio.
  • The audio is delivered via a webhook to a phone system or notification service.

Language Learning Content

Generate pronunciation guides:

  • A vocabulary form collects words and phrases in a target language.
  • Text to Speech generates audio for each word using an appropriate voice.
  • Audio files are stored and linked to the vocabulary list.
  • Learners access audio alongside written content.

Multi-Language Support

Generate audio in multiple languages:

  • A form collects content and a target language selection.
  • An Agent node translates the content if needed.
  • Text to Speech generates audio in the target language using an appropriate voice.
  • The audio URL is included in the localized response.

Provider Comparison

| Feature | OpenAI | ElevenLabs |
|---|---|---|
| Voice count | 6 built-in | 4 built-in presets |
| Speed control | Yes (0.25x - 4.0x) | No |
| Models | 2 (TTS-1, TTS-1 HD) | 3 (Eleven v3, Multilingual v2, Turbo v2.5) |
| Output formats | MP3, WAV, OGG (Opus), FLAC | MP3 only |
| Audio quality | Good (TTS-1), Excellent (TTS-1 HD) | Excellent |
| Pricing | Per character | Per character (tier-dependent) |
| Natural sound | Good | Excellent |

Choose OpenAI when you need speed control, multiple output formats, predictable pricing, and reliable quality with simple voice selection.

Choose ElevenLabs when voice quality and naturalness are the top priority, or you need one of their specialized models.

Best Practices

  • Choose the right voice early. Preview different voices with sample text before committing to a workflow. Voice characteristics significantly impact the end-user experience.
  • Keep text concise. Long text inputs increase generation time, cost, and file size. Split long content into shorter segments when possible.
  • Use TTS-1 for drafts and testing. Switch to TTS-1 HD for production when audio quality matters to end users.
  • Test with actual content. Synthesized speech can sound different depending on content type. Test with representative text from real form submissions, not just placeholder text.
  • Match voice to context. Use a neutral or authoritative voice (Alloy, Onyx) for formal business communications and a warmer or brighter voice (Echo, Nova) for customer-facing interactions.
  • Store audio files. Generated audio URLs point to S3-hosted files. Use a Google Drive or file storage action node to persist important audio files if needed.
  • Handle empty text. If the text field could be empty (e.g., an optional form field), add conditional logic to skip the Text to Speech node when there is nothing to convert.
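The empty-text check in the last item amounts to treating missing or whitespace-only input as "nothing to convert". A minimal sketch of that condition follows; `should_run_tts` is a hypothetical helper, and in the editor the same test would be expressed with a conditional node.

```python
from typing import Optional


def should_run_tts(text: Optional[str]) -> bool:
    """Return True only when there is non-whitespace text to convert."""
    return bool(text and text.strip())
```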

Limitations

  • Text length is limited by the provider. OpenAI supports up to 4096 characters per request. Longer text must be split across multiple requests.
  • ElevenLabs character limits depend on your subscription tier.
  • Speed control is available only with OpenAI models. ElevenLabs does not support programmatic speed adjustment.
  • ElevenLabs always outputs MP3, regardless of the format selected in the configuration.
  • The node generates audio from text only. It does not support audio-to-audio conversion, voice modification, or audio mixing.
  • Execution is subject to a 120-second timeout.

Text to Speech | Buildorado