Text to Speech

Convert text to natural-sounding audio using OpenAI and ElevenLabs voice synthesis models in your workflows.

The Text to Speech node converts written text into spoken audio. Pass any text -- including form submission data via template variables -- and the node returns an audio file. This is useful for creating voice responses, accessibility features, audio notifications, podcast generation, and any workflow that needs to produce spoken content from text data.

Supported Providers and Models

Provider	Models	Strengths
OpenAI	TTS-1, TTS-1 HD	Fast, consistent, 6 built-in voices, speed control
ElevenLabs	Eleven v3, Multilingual v2, Turbo v2.5	Highly natural voices, multiple model options

Model Comparison

Model	Provider	Quality	Speed	Best For
TTS-1	OpenAI	Good	Fast	Drafts, testing, cost-effective voice generation
TTS-1 HD	OpenAI	Excellent	Moderate	Customer-facing audio, polished content
Eleven v3	ElevenLabs	Excellent	Moderate	Highest quality ElevenLabs output
Multilingual v2	ElevenLabs	Excellent	Moderate	Multi-language content, natural speech
Turbo v2.5	ElevenLabs	Good	Fast	Low-latency generation

Configuration

Provider

Select OpenAI or ElevenLabs. The available models, voices, and options change based on the provider.

Model

Choose the text-to-speech model:

TTS-1 (OpenAI) -- Standard quality, optimized for speed and low latency.
TTS-1 HD (OpenAI) -- High-definition audio. Slower to generate but noticeably smoother and more natural.
Eleven v3 (ElevenLabs) -- The latest and highest-quality ElevenLabs model.
Multilingual v2 (ElevenLabs) -- Flagship multilingual model for natural-sounding speech across many languages.
Turbo v2.5 (ElevenLabs) -- Optimized for low latency and fast generation.

Credential

Select a saved API key for the chosen provider. See Credential Management for setup.

Text

The text to convert to speech. This field supports template variables, so you can dynamically generate audio from form submissions or upstream node outputs.

Example uses:

Thank you for your submission, [Name variable].
Your request number is [Request ID variable].
We will get back to you within 24 hours.

[Full transcript variable from an Agent node]

Text length limits vary by provider. OpenAI supports up to 4096 characters per request. ElevenLabs limits depend on your subscription tier. For longer text, consider splitting into multiple segments.
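One way to split longer text before it reaches the node is sketched below. It is a minimal illustration, not part of the node itself: the function name `split_text` and the sentence-boundary heuristic are assumptions, and only the 4096-character figure comes from the limit stated above.

```python
import re

OPENAI_MAX_CHARS = 4096  # per-request limit stated above


def split_text(text: str, max_chars: int = OPENAI_MAX_CHARS) -> list[str]:
    """Split text into segments of at most max_chars, preferring sentence ends.

    Hypothetical helper for illustration; the node does not expose it.
    """
    # Break after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is itself longer than the limit.
        while len(sentence) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The segment is full: close it and start a new one.
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments
```

Each returned segment can then be passed to its own Text to Speech request. For ElevenLabs, pass a `max_chars` value that matches your subscription tier.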

Voice

Select the voice used for speech synthesis. Available voices depend on the provider:

OpenAI Voices:

Voice	Description	Best For
Alloy	Neutral, balanced	General-purpose, professional
Echo	Warm, conversational	Customer interactions, friendly tone
Fable	Expressive, storytelling	Narratives, creative content
Onyx	Deep, authoritative	Announcements, formal content
Nova	Bright, energetic	Marketing, upbeat content
Shimmer	Soft, clear	Gentle notifications, instructions

All OpenAI voices support the same languages and produce the same audio quality.

ElevenLabs Voices:

The editor provides 4 built-in voice options:

Voice	Voice ID
Rachel	21m00Tcm4TlvDq8ikWAM
Antoni	ErXwobaYiN019PkySvjV
Elli	MF3mGyEYCl7XYWbV9V6O
Josh	TxGEqnHWrfWFTfGW9XjX

These are hardcoded voice presets. To use other ElevenLabs voices (custom clones, community voices), you would need to configure them via the API directly.
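If "the API" here is read as the ElevenLabs REST API, a direct request for an arbitrary voice ID might look like the sketch below. This is an assumption about how you would work outside the editor, not a description of how the node is implemented. The function name `build_elevenlabs_request` is hypothetical, and the `eleven_multilingual_v2` model ID is assumed to correspond to Multilingual v2. The code only builds the request; it does not send it.

```python
import json
import urllib.request

# Built-in presets from the table above. Any other voice ID from your
# ElevenLabs account could be passed instead (assumption).
PRESET_VOICES = {
    "Rachel": "21m00Tcm4TlvDq8ikWAM",
    "Antoni": "ErXwobaYiN019PkySvjV",
    "Elli": "MF3mGyEYCl7XYWbV9V6O",
    "Josh": "TxGEqnHWrfWFTfGW9XjX",
}


def build_elevenlabs_request(
    text: str,
    voice_id: str,
    api_key: str,
    model_id: str = "eleven_multilingual_v2",  # assumed ID for Multilingual v2
) -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request for one voice."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send the request and return the audio bytes in the response body. Check the ElevenLabs API reference for current model IDs and limits before relying on this.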

Speed (OpenAI Only)

Adjust the speaking speed on a scale from 0.25x to 4.0x (step: 0.25x):

Speed	Multiplier	Use Case
Slow	0.25x - 0.75x	Dictation, accessibility, language learning
Normal	1.0x	Default speaking rate
Fast	1.25x - 2.0x	Summary playback, time-sensitive notifications
Very fast	2.25x - 4.0x	Rapid review, speed listening

Speed adjustment is not available with ElevenLabs. ElevenLabs determines pacing from the text content and voice characteristics.
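When the speed comes from a form field or an upstream node, it can help to bring it into the allowed range before it reaches this setting. The sketch below clamps a value to 0.25x through 4.0x and snaps it to the nearest 0.25x step. The function names and the `"openai"` provider string are hypothetical; only the range and step come from the description above.

```python
from typing import Optional

MIN_SPEED = 0.25  # lowest multiplier listed above
MAX_SPEED = 4.0   # highest multiplier listed above
STEP = 0.25       # step size listed above


def normalize_speed(speed: float) -> float:
    """Clamp speed to the allowed range and snap it to the nearest step."""
    clamped = min(max(speed, MIN_SPEED), MAX_SPEED)
    return round(clamped / STEP) * STEP


def speed_for_provider(provider: str, speed: float) -> Optional[float]:
    """Return a usable speed for OpenAI, or None where speed does not apply."""
    if provider != "openai":  # hypothetical provider identifier
        return None
    return normalize_speed(speed)
```

For example, `normalize_speed(1.2)` gives `1.25`, and `speed_for_provider("elevenlabs", 1.5)` gives `None` because ElevenLabs has no speed setting.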

Output Format

The output audio format. Four formats are available in the dropdown:

Format	Notes
MP3	Default. Most widely compatible.
WAV	Uncompressed, highest quality.
OGG	Open format. Mapped to Opus codec for OpenAI.
FLAC	Lossless compression.

OpenAI supports all four formats. When OGG is selected, the actual API request uses the Opus codec (OpenAI's Ogg Opus format), and the output file has an .ogg extension.

ElevenLabs supports only MP3 output through this integration, so the node returns MP3 regardless of the format selected.
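The format rules above can be summarized as a small lookup. This is an illustrative sketch of the described behavior, not the node's actual code: the function name `resolve_output_format` and the lowercase provider strings are assumptions.

```python
SUPPORTED_FORMATS = {"mp3", "wav", "ogg", "flac"}  # the four dropdown options


def resolve_output_format(provider: str, selected: str) -> tuple[str, str]:
    """Return (format sent to the provider, file extension of the output)."""
    selected = selected.lower()
    if selected not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {selected}")
    if provider == "elevenlabs":
        # MP3 is forced regardless of the selection.
        return ("mp3", "mp3")
    if provider == "openai":
        # OGG is requested as Opus, but the file keeps the .ogg extension.
        api_format = "opus" if selected == "ogg" else selected
        return (api_format, selected)
    raise ValueError(f"Unknown provider: {provider}")
```

For instance, `resolve_output_format("openai", "OGG")` returns `("opus", "ogg")`, while `resolve_output_format("elevenlabs", "flac")` returns `("mp3", "mp3")`. Downstream steps that depend on a specific format should account for the ElevenLabs case.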

Output

The Text to Speech node produces:

Field	Type	Description
`file`	object	File reference with `url`, `key`, `mimeType`, `sizeBytes`, and `filename`
`durationMs`	number	Duration of the generated audio in milliseconds (when available)
`characterCount`	number	Number of characters in the input text
`model`	string	The model that was used
`provider`	string	The provider that was used

Access the audio URL via `file.url` in downstream template variables.
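To make the shape of this output concrete, the sketch below models the fields from the table as typed dictionaries and reads from a sample payload. Every value in `sample` is made up for illustration, including the URL, key, and model string, and the `describe` helper is hypothetical.

```python
from typing import Optional, TypedDict


class FileRef(TypedDict):
    url: str
    key: str
    mimeType: str
    sizeBytes: int
    filename: str


class TextToSpeechOutput(TypedDict):
    file: FileRef
    durationMs: Optional[int]  # may be absent or None ("when available")
    characterCount: int
    model: str
    provider: str


# Hypothetical example values, not real node output.
sample: TextToSpeechOutput = {
    "file": {
        "url": "https://example.com/audio/confirmation.mp3",
        "key": "audio/confirmation.mp3",
        "mimeType": "audio/mpeg",
        "sizeBytes": 51200,
        "filename": "confirmation.mp3",
    },
    "durationMs": 3200,
    "characterCount": 58,
    "model": "tts-1",
    "provider": "openai",
}


def describe(output: TextToSpeechOutput) -> str:
    """Summarize the output, tolerating a missing duration."""
    duration = output.get("durationMs")
    name = output["file"]["filename"]
    if duration is None:
        return f"{name} (duration unknown)"
    return f"{name} ({duration / 1000:.1f} s)"
```

Because `durationMs` is only present when available, downstream logic should treat it as optional, as `describe` does.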

Use Cases

Automated Voice Responses

Generate audio confirmations for form submissions:

A customer submits a service request form.
An Agent node creates a personalized confirmation message.
The Text to Speech node converts the message to audio.
The audio URL is included in a confirmation email or SMS.

Accessibility Enhancement

Create audio versions of text content:

A content submission form collects article text.
Text to Speech converts the article to an audio file.
The audio file is attached to the content record for accessibility compliance.

Podcast Generation

Automate podcast episode creation:

A content creator submits a script through a form.
Text to Speech generates the audio narration.
The audio file is stored in Google Drive and linked in a Notion database.
A notification is sent to the production team for final review.

Interactive Voice Notifications

Create voice messages for notification workflows:

A monitoring form triggers when certain conditions are met.
An Agent node drafts a voice notification message.
Text to Speech converts it to audio.
The audio is delivered via a webhook to a phone system or notification service.

Language Learning Content

Generate pronunciation guides:

A vocabulary form collects words and phrases in a target language.
Text to Speech generates audio for each word using an appropriate voice.
Audio files are stored and linked to the vocabulary list.
Learners access audio alongside written content.

Multi-Language Support

Generate audio in multiple languages:

A form collects content and a target language selection.
An Agent node translates the content if needed.
Text to Speech generates audio in the target language using an appropriate voice.
The audio URL is included in the localized response.

Provider Comparison

Feature	OpenAI	ElevenLabs
Voice count	6 built-in	4 built-in presets
Speed control	Yes (0.25x - 4.0x)	No
Models	2 (TTS-1, TTS-1 HD)	3 (Eleven v3, Multilingual v2, Turbo v2.5)
Output formats	MP3, WAV, OGG (Opus), FLAC	MP3 only
Audio quality	Good (TTS-1), Excellent (TTS-1 HD)	Excellent
Pricing	Per character	Per character (tier-dependent)
Natural sound	Good	Excellent

Choose OpenAI when you need speed control, multiple output formats, predictable pricing, or reliable quality with simple voice selection.

Choose ElevenLabs when voice quality and naturalness are the top priority, or you need one of their specialized models.

Best Practices

Choose the right voice early. Preview different voices with sample text before committing to a workflow. Voice characteristics significantly impact the end-user experience.
Keep text concise. Long text inputs increase generation time, cost, and file size. Split long content into shorter segments when possible.
Use TTS-1 for drafts and testing. Switch to TTS-1 HD for production when audio quality matters to end users.
Test with actual content. Synthesized speech can sound different depending on content type. Test with representative text from real form submissions, not just placeholder text.
Match voice to context. Use a professional, neutral voice (Alloy, Onyx) for business communications and a warmer or more energetic voice (Echo, Nova) for customer-facing interactions.
Store audio files. Generated audio URLs point to S3-hosted files. Use a Google Drive or file storage action node to persist important audio files if needed.
Handle empty text. If the text field could be empty (e.g., an optional form field), add conditional logic to skip the Text to Speech node when there is nothing to convert.
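The empty-text check in the last practice amounts to a simple condition. The sketch below shows the logic in Python for clarity. In a workflow it would be expressed with whatever conditional node or expression syntax your editor provides, and the function name `should_run_tts` is hypothetical.

```python
from typing import Optional


def should_run_tts(text: Optional[str]) -> bool:
    """Return True only when there is non-whitespace text to convert."""
    return text is not None and text.strip() != ""
```

A value that is missing, empty, or only whitespace returns `False`, so the Text to Speech step would be skipped.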

Limitations

Text length is limited by the provider. OpenAI supports up to 4096 characters per request. Longer text must be split across multiple requests.
ElevenLabs character limits depend on your subscription tier.
Speed control is available only with OpenAI models. ElevenLabs does not support programmatic speed adjustment.
ElevenLabs always outputs MP3, regardless of the format selected in the configuration.
The node generates audio from text only. It does not support audio-to-audio conversion, voice modification, or audio mixing.
Execution is subject to a 120-second timeout.
