Text to Speech
Convert text to natural-sounding audio using OpenAI and ElevenLabs voice synthesis models in your workflows.
The Text to Speech node converts written text into spoken audio. Pass any text -- including form submission data via template variables -- and the node returns an audio file. This is useful for creating voice responses, accessibility features, audio notifications, podcast generation, and any workflow that needs to produce spoken content from text data.
Supported Providers and Models
| Provider | Models | Strengths |
|---|---|---|
| OpenAI | TTS-1, TTS-1 HD | Fast, consistent, 6 built-in voices, speed control |
| ElevenLabs | Eleven v3, Multilingual v2, Turbo v2.5 | Highly natural voices, multiple model options |
Model Comparison
| Model | Provider | Quality | Speed | Best For |
|---|---|---|---|---|
| TTS-1 | OpenAI | Good | Fast | Fast, cost-effective voice generation |
| TTS-1 HD | OpenAI | Excellent | Moderate | Customer-facing audio, polished content |
| Eleven v3 | ElevenLabs | Excellent | Moderate | Highest quality ElevenLabs output |
| Multilingual v2 | ElevenLabs | Excellent | Moderate | Multi-language content, natural speech |
| Turbo v2.5 | ElevenLabs | Good | Fast | Low-latency generation |
Configuration
Provider
Select OpenAI or ElevenLabs. The available models, voices, and options change based on the provider.
Model
Choose the text-to-speech model:
- TTS-1 (OpenAI) -- Standard quality, optimized for speed and low latency.
- TTS-1 HD (OpenAI) -- High-definition audio. Slower to generate but noticeably smoother and more natural.
- Eleven v3 (ElevenLabs) -- The latest and highest-quality ElevenLabs model.
- Multilingual v2 (ElevenLabs) -- Flagship multilingual model for natural-sounding speech across many languages.
- Turbo v2.5 (ElevenLabs) -- Optimized for low latency and fast generation.
Credential
Select a saved API key for the chosen provider. See Credential Management for setup.
Text
The text to convert to speech. This field supports template variables, so you can dynamically generate audio from form submissions or upstream node outputs.
Example uses:
Thank you for your submission, [Name variable].
Your request number is [Request ID variable].
We will get back to you within 24 hours.
[Full transcript variable from an Agent node]
Text length limits vary by provider. OpenAI supports up to 4096 characters per request. ElevenLabs limits depend on your subscription tier. For longer text, consider splitting into multiple segments.
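One way to stay under the 4096-character OpenAI limit is to split the text before it reaches the node. The following is a minimal sketch, not part of the product: `split_for_tts` and `OPENAI_MAX_CHARS` are hypothetical names, and the sentence-boundary rule is a simple assumption that will not suit every language.

```python
import re

# Per-request limit for OpenAI stated above; the constant name is illustrative.
OPENAI_MAX_CHARS = 4096


def split_for_tts(text: str, max_chars: int = OPENAI_MAX_CHARS) -> list:
    """Split text into segments of at most max_chars, preferring sentence ends."""
    # Break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped,
        # which may cut a word in half.
        while len(sentence) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments
```

Each returned segment can then be sent through its own Text to Speech step. For ElevenLabs, pass a `max_chars` value that matches your subscription tier.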
Voice
Select the voice used for speech synthesis. Available voices depend on the provider:
OpenAI Voices:
| Voice | Description | Best For |
|---|---|---|
| Alloy | Neutral, balanced | General-purpose, professional |
| Echo | Warm, conversational | Customer interactions, friendly tone |
| Fable | Expressive, storytelling | Narratives, creative content |
| Onyx | Deep, authoritative | Announcements, formal content |
| Nova | Bright, energetic | Marketing, upbeat content |
| Shimmer | Soft, clear | Gentle notifications, instructions |
All OpenAI voices support the same languages and produce consistent quality regardless of the voice selected.
ElevenLabs Voices:
The editor provides 4 built-in voice options:
| Voice | Voice ID |
|---|---|
| Rachel | 21m00Tcm4TlvDq8ikWAM |
| Antoni | ErXwobaYiN019PkySvjV |
| Elli | MF3mGyEYCl7XYWbV9V6O |
| Josh | TxGEqnHWrfWFTfGW9XjX |
These are hardcoded voice presets. To use other ElevenLabs voices, such as custom clones or community voices, you need to configure them via the API directly.
Speed (OpenAI Only)
Adjust the speaking speed on a scale from 0.25x to 4.0x (step: 0.25x):
| Speed | Multiplier | Use Case |
|---|---|---|
| Slow | 0.25x - 0.75x | Dictation, accessibility, language learning |
| Normal | 1.0x | Default speaking rate |
| Fast | 1.25x - 2.0x | Summary playback, time-sensitive notifications |
| Very fast | 2.25x - 4.0x | Rapid review, speed listening |
Speed adjustment is not available with ElevenLabs. ElevenLabs determines pacing from the text content and voice characteristics.
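If the speed value comes from an upstream node or a form field, it may help to normalize it to the documented range and step first. This is a sketch under assumptions: the function names and the `"openai"` provider identifier are hypothetical, and the node may already perform its own validation.

```python
from typing import Optional

MIN_SPEED = 0.25  # lower bound of the documented range
MAX_SPEED = 4.0   # upper bound of the documented range
STEP = 0.25       # documented step size


def normalize_speed(requested: float) -> float:
    """Clamp to [0.25, 4.0], then snap to the nearest 0.25 step."""
    clamped = min(max(requested, MIN_SPEED), MAX_SPEED)
    return round(clamped / STEP) * STEP


def speed_for_provider(provider: str, requested: float) -> Optional[float]:
    """Return a usable speed for OpenAI, or None where speed is unsupported."""
    # "openai" is an assumed identifier, not a confirmed value.
    if provider != "openai":
        return None
    return normalize_speed(requested)
```

Returning `None` for ElevenLabs makes it explicit that the setting is ignored rather than silently applied.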
Output Format
The output audio format. Four formats are available in the dropdown:
| Format | Notes |
|---|---|
| MP3 | Default. Most widely compatible. |
| WAV | Uncompressed, highest quality. |
| OGG | Open format. Mapped to Opus codec for OpenAI. |
| FLAC | Lossless compression. |
OpenAI supports all four formats. When OGG is selected, the actual API request uses the Opus codec (OpenAI's Ogg Opus format), and the output file has an .ogg extension.
ElevenLabs supports only MP3 output through this integration. If you select another format with ElevenLabs, the node still returns MP3.
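The format rules above can be summarized as a small lookup. This sketch only restates the documented behavior. The function name, the provider identifiers, and the returned dictionary keys are assumptions made for illustration.

```python
SUPPORTED_FORMATS = {"mp3", "wav", "ogg", "flac"}


def resolve_output_format(provider: str, selected: str) -> dict:
    """Map the selected dropdown format to what is actually produced."""
    selected = selected.lower()
    if selected not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {selected}")
    # Provider identifiers are assumed, not confirmed values.
    if provider == "elevenlabs":
        # ElevenLabs returns MP3 whatever was selected.
        return {"api_format": "mp3", "extension": ".mp3"}
    if provider == "openai":
        # OGG is requested as Opus but keeps the .ogg extension.
        api_format = "opus" if selected == "ogg" else selected
        return {"api_format": api_format, "extension": f".{selected}"}
    raise ValueError(f"Unknown provider: {provider}")
```

A downstream step that depends on the file extension can use the same check to avoid expecting a WAV or FLAC file from an ElevenLabs run.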
Output
The Text to Speech node produces:
| Field | Type | Description |
|---|---|---|
| file | object | File reference with url, key, mimeType, sizeBytes, and filename |
| durationMs | number | Duration of the generated audio in milliseconds (when available) |
| characterCount | number | Number of characters in the input text |
| model | string | The model that was used |
| provider | string | The provider that was used |
Access the audio URL via file.url in downstream template variables.
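For code that consumes this output, for example behind a webhook, the fields in the table can be modeled as below. This is an illustrative sketch: the type names and `describe_audio` are hypothetical, and every value in `example` is made up, including the URL and the model and provider strings.

```python
from typing import Optional, TypedDict


class FileRef(TypedDict):
    url: str
    key: str
    mimeType: str
    sizeBytes: int
    filename: str


class TtsOutput(TypedDict):
    file: FileRef
    durationMs: Optional[int]  # documented as "when available"
    characterCount: int
    model: str
    provider: str


def describe_audio(output: TtsOutput) -> str:
    """Build a one-line summary, tolerating a missing duration."""
    duration_ms = output.get("durationMs")
    if duration_ms is None:
        duration = "unknown duration"
    else:
        duration = f"{duration_ms / 1000:.1f}s"
    return f"{output['file']['url']} ({duration}, {output['characterCount']} chars)"


# Hypothetical sample payload; none of these values come from a real run.
example: TtsOutput = {
    "file": {
        "url": "https://example.com/audio/confirmation.mp3",
        "key": "audio/confirmation.mp3",
        "mimeType": "audio/mpeg",
        "sizeBytes": 52000,
        "filename": "confirmation.mp3",
    },
    "durationMs": 4200,
    "characterCount": 68,
    "model": "tts-1",
    "provider": "openai",
}
```

Because `durationMs` is only present when available, the summary falls back to a placeholder instead of failing.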
Use Cases
Automated Voice Responses
Generate audio confirmations for form submissions:
- A customer submits a service request form.
- An Agent node creates a personalized confirmation message.
- The Text to Speech node converts the message to audio.
- The audio URL is included in a confirmation email or SMS.
Accessibility Enhancement
Create audio versions of text content:
- A content submission form collects article text.
- Text to Speech converts the article to an audio file.
- The audio file is attached to the content record for accessibility compliance.
Podcast Generation
Automate podcast episode creation:
- A content creator submits a script through a form.
- Text to Speech generates the audio narration.
- The audio file is stored in Google Drive and linked in a Notion database.
- A notification is sent to the production team for final review.
Interactive Voice Notifications
Create voice messages for notification workflows:
- A monitoring form triggers when certain conditions are met.
- An Agent node drafts a voice notification message.
- Text to Speech converts it to audio.
- The audio is delivered via a webhook to a phone system or notification service.
Language Learning Content
Generate pronunciation guides:
- A vocabulary form collects words and phrases in a target language.
- Text to Speech generates audio for each word using an appropriate voice.
- Audio files are stored and linked to the vocabulary list.
- Learners access audio alongside written content.
Multi-Language Support
Generate audio in multiple languages:
- A form collects content and a target language selection.
- An Agent node translates the content if needed.
- Text to Speech generates audio in the target language using an appropriate voice.
- The audio URL is included in the localized response.
Provider Comparison
| Feature | OpenAI | ElevenLabs |
|---|---|---|
| Voice count | 6 built-in | 4 built-in presets |
| Speed control | Yes (0.25x - 4.0x) | No |
| Models | 2 (TTS-1, TTS-1 HD) | 3 (Eleven v3, Multilingual v2, Turbo v2.5) |
| Output formats | MP3, WAV, OGG (Opus), FLAC | MP3 only |
| Audio quality | Good (TTS-1), Excellent (TTS-1 HD) | Excellent |
| Pricing | Per character | Per character (tier-dependent) |
| Natural sound | Good | Excellent |
Choose OpenAI when you need speed control, multiple output formats, predictable pricing, and reliable quality with simple voice selection.
Choose ElevenLabs when voice quality and naturalness are the top priority, or you need one of their specialized models.
Best Practices
- Choose the right voice early. Preview different voices with sample text before committing to a workflow. Voice characteristics significantly impact the end-user experience.
- Keep text concise. Long text inputs increase generation time, cost, and file size. Split long content into shorter segments when possible.
- Use TTS-1 for drafts and testing. Switch to TTS-1 HD for production when audio quality matters to end users.
- Test with actual content. Synthesized speech can sound different depending on content type. Test with representative text from real form submissions, not just placeholder text.
- Match voice to context. Use a neutral or authoritative voice (Alloy, Onyx) for business communications and a warmer or brighter voice (Echo, Nova) for customer-facing interactions.
- Store audio files. Generated audio URLs point to S3-hosted files. Use a Google Drive or file storage action node to persist important audio files if needed.
- Handle empty text. If the text field could be empty (e.g., an optional form field), add conditional logic to skip the Text to Speech node when there is nothing to convert.
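The empty-text check from the last item amounts to a simple condition. This sketch assumes the upstream value arrives as a string or as nothing at all. `has_speakable_text` is a hypothetical helper, not a built-in feature.

```python
from typing import Optional


def has_speakable_text(value: Optional[str]) -> bool:
    """Return True only when there is non-whitespace text to convert."""
    return value is not None and value.strip() != ""
```

The same condition can be expressed in whatever conditional logic your workflow uses, so the Text to Speech step runs only when the check passes.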
Limitations
- Text length is limited by the provider. OpenAI supports up to 4096 characters per request. Longer text must be split across multiple requests.
- ElevenLabs character limits depend on your subscription tier.
- Speed control is available only with OpenAI models. ElevenLabs does not support programmatic speed adjustment.
- ElevenLabs always outputs MP3, regardless of the format selected in the configuration.
- The node generates audio from text only. It does not support audio-to-audio conversion, voice modification, or audio mixing.
- Execution is subject to a 120-second timeout.