Bad TTS output is almost always a text problem, not a model problem. The same model that sounds robotic with one script sounds natural with a few small edits to punctuation and sentence length.
The fastest way to diagnose the issue
Take your worst-sounding paragraph and make only these changes:
- Break every sentence over 20 words into two sentences
- Replace all abbreviations with their spoken form
- Replace all numbers with words
- Add a comma wherever you would naturally pause when speaking
Then generate again with the same voice. If quality improves significantly, the model was not the problem.
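Before editing by hand, it can help to see where the trouble spots are. The following is a minimal sketch, not a full linter: the function name `find_script_issues` and the regular expressions are assumptions made for illustration, and the sentence splitter is deliberately naive (it will split after "Dr." for example).

```python
import re

def find_script_issues(text, max_words=20):
    """Return a list of (issue_kind, snippet) pairs for a TTS script."""
    issues = []
    # Naive sentence split: break after ., ! or ? when whitespace follows.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for sentence in sentences:
        if len(sentence.split()) > max_words:
            issues.append(("long_sentence", sentence))
    # Runs of two or more capital letters are probably acronyms.
    for match in re.finditer(r"\b[A-Z]{2,}\b", text):
        issues.append(("acronym", match.group()))
    # Any token containing a digit should be written out in words.
    for match in re.finditer(r"\S*\d\S*", text):
        issues.append(("number", match.group()))
    return issues

if __name__ == "__main__":
    for kind, snippet in find_script_issues("The API costs $4,500 per year."):
        print(kind, "->", snippet)
```

Fix what it reports, regenerate with the same voice, and compare.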
The 5 patterns that make TTS sound robotic every time
1. Sentences that are too long
TTS models read sentences as single units of prosody. Long sentences have no natural breath points, so the model either rushes through or applies awkward pauses at arbitrary points.
Maximum practical length: 25 words. Break anything longer.
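One rough way to apply the 25-word limit automatically is to cut at the last comma that falls inside the limit. This is a sketch under that assumption; the helper name `split_long_sentence` is hypothetical, and a mechanical cut can produce clumsy grammar, so read the result before generating.

```python
def split_long_sentence(sentence, max_words=25):
    """Split a sentence into chunks of at most max_words, preferring comma boundaries."""
    words = sentence.split()
    if len(words) <= max_words:
        return [sentence]
    chunks = []
    while len(words) > max_words:
        cut = max_words  # fall back to a hard cut at the limit
        # Search backwards for the last word inside the limit that ends with a comma.
        for i in range(max_words - 1, 0, -1):
            if words[i].endswith(","):
                cut = i + 1
                break
        head = words[:cut]
        # Replace any trailing comma, semicolon, or colon with a period.
        head[-1] = head[-1].rstrip(",;:") + "."
        chunks.append(" ".join(head))
        words = words[cut:]
        # Capitalize the first letter of the next chunk without touching the rest.
        words[0] = words[0][:1].upper() + words[0][1:]
    chunks.append(" ".join(words))
    return chunks
```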
2. Abbreviations and acronyms
API, LLM, TTS, CEO, FAQ - models handle these inconsistently. Some spell them out, some pronounce them as words, some do something in between. If the pronunciation matters, write what you want to hear: A.P.I., L.L.M., or ell em studio for LM Studio.
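Rewriting acronyms into a dotted form can be scripted. The sketch below assumes that any run of two or more capital letters should be spelled letter by letter unless you pass an override; the function name and the `overrides` parameter are illustrative, not part of any TTS engine.

```python
import re

def spell_out_acronyms(text, overrides=None):
    """Rewrite acronyms like API as A.P.I.; overrides maps an acronym to a custom spoken form."""
    overrides = overrides or {}

    def replace(match):
        word, trailing_period = match.group(1), match.group(2)
        if word in overrides:
            # Keep the sentence-ending period when a custom form is used.
            return overrides[word] + trailing_period
        # The final dot of the dotted form doubles as the sentence period.
        return ".".join(word) + "."

    return re.sub(r"\b([A-Z]{2,})\b(\.?)", replace, text)
```

For example, `spell_out_acronyms("Call the API.")` gives `Call the A.P.I.`, and an override such as `{"NASA": "nasa"}` (a made-up example) leaves an acronym to be read as a word.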
3. Numbers in any format
2026, $4,500, 10:30am, 3.14 - all of these produce variable results. Write out what you want spoken: twenty twenty-six, forty-five hundred dollars, ten thirty AM, three point one four.
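If a script has many numbers, small helpers can produce the spoken forms. This is a minimal sketch covering only the formats mentioned above; every function name is hypothetical, the ranges are deliberately narrow, and anything outside them should be written by hand or handled by a dedicated library.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

def two_digits(n):
    """Spell an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def year_to_words(year):
    """Spell a four-digit year in pairs, for example 2026 as twenty twenty-six."""
    if year % 1000 == 0:
        return ONES[year // 1000] + " thousand"
    high, low = divmod(year, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)

def hundreds_to_words(n):
    """Spell multiples of 100 between 1100 and 9900 in the 'forty-five hundred' style."""
    if n % 100 or n % 1000 == 0 or not 1100 <= n <= 9900:
        raise ValueError(f"unsupported amount: {n}")
    return two_digits(n // 100) + " hundred"

def time_to_words(clock):
    """Spell a 12-hour time such as '10:30am' as 'ten thirty AM'."""
    match = re.fullmatch(r"(\d{1,2}):(\d{2})\s*([ap]m)", clock.strip(), re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognized time: {clock!r}")
    hour, minute = int(match.group(1)), int(match.group(2))
    parts = [two_digits(hour)]
    if 0 < minute < 10:
        parts.append("oh " + ONES[minute])
    elif minute >= 10:
        parts.append(two_digits(minute))
    parts.append(match.group(3).upper())
    return " ".join(parts)

def decimal_to_words(number):
    """Spell a decimal string digit by digit after the point, e.g. '3.14'."""
    whole, fraction = number.split(".")
    return two_digits(int(whole)) + " point " + " ".join(ONES[int(d)] for d in fraction)
```

With these, `hundreds_to_words(4500) + " dollars"` yields the "forty-five hundred dollars" form used above.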
4. Missing punctuation
Punctuation is pacing instruction for TTS. A sentence without terminal punctuation runs directly into the next one. Add periods, commas, and em-dashes thoughtfully - they are not decoration in TTS scripts.
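Missing terminal punctuation is easy to catch mechanically. A minimal sketch, assuming the script has one sentence or phrase per line; the function name is illustrative.

```python
def ensure_terminal_punctuation(script):
    """Append a period to every non-empty line that does not already end in punctuation."""
    fixed = []
    for line in script.splitlines():
        stripped = line.rstrip()
        # Leave blank lines alone, and lines that already end with a pause mark.
        if stripped and stripped[-1] not in ".!?:;,":
            stripped += "."
        fixed.append(stripped)
    return "\n".join(fixed)
```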
5. Ambiguous abbreviations
Dr. (doctor or drive?), St. (saint or street?), Jan. (January or the name Jan?). Spell them out whenever the context does not make the intended reading clear.
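Guessing the right reading automatically is unreliable, so a safer approach is to flag these short forms and decide by hand. The sketch below assumes a small lookup table you maintain yourself; `AMBIGUOUS` and `flag_ambiguous` are hypothetical names, and the table holds only the three cases named above.

```python
import re

# Each short form maps to its possible spoken readings.
AMBIGUOUS = {
    "Dr.": ("Doctor", "Drive"),
    "St.": ("Saint", "Street"),
    "Jan.": ("January", "Jan"),
}

def flag_ambiguous(text, window=15):
    """Return (short_form, possible_readings, surrounding_text) for each hit."""
    pattern = "|".join(re.escape(form) for form in AMBIGUOUS)
    findings = []
    for match in re.finditer(rf"\b(?:{pattern})", text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        findings.append((match.group(), AMBIGUOUS[match.group()], text[start:end]))
    return findings
```

Each finding shows the surrounding text so you can pick the reading and type it out in full.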
Punctuation as a pacing tool
Most TTS engines treat punctuation marks as pause signals with different weights:
- Comma (,) - short pause
- Period (.) - medium pause with intonation drop
- Ellipsis (...) - longer pause, can imply trailing off
- Em dash (in some engines) - abrupt pause with tension
Use these intentionally. If a line feels rushed, add a comma. If two lines run together, check whether the period is being read.
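To find lines that are likely to feel rushed, you can measure the longest stretch of words with no pause mark. This is a rough heuristic of my own, not something any engine reports; the function name and the set of marks treated as pauses are assumptions.

```python
import re

def longest_unpaused_run(line):
    """Return the largest number of consecutive words with no pause mark between them."""
    # Split on commas, periods, semicolons, colons, ! and ?, the ellipsis
    # character, a spaced hyphen, or an em dash.
    segments = re.split(r"[,.;:!?\u2026]+|\s-\s|\u2014", line)
    return max(len(segment.split()) for segment in segments)
```

A line that scores much higher than its neighbours is a good candidate for an extra comma.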
Pronunciation hints: when to add them and when not to
ElevenLabs has its own pronunciation dictionary feature. Play.ht and others support SSML phoneme tags. Use hints when:
- A proper noun is consistently mispronounced
- A technical term has an unusual pronunciation
- A brand name is read letter by letter when it should be read as a word
Do not add hints when the model already reads it correctly. Hints can cause the model to overcompensate and sound more robotic.
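For engines that accept SSML, a hint is usually a `phoneme` element wrapped around the word. The sketch below builds that markup with the standard library only. The helper names are hypothetical, the IPA string in the usage note is illustrative, and whether a given engine or voice honors the tag is something to check in its documentation.

```python
import re
from xml.sax.saxutils import escape, quoteattr

def phoneme_tag(word, ipa):
    """Build an SSML phoneme element for one word using the IPA alphabet."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'

def apply_hints(text, hints):
    """Wrap each hinted word in a phoneme tag and the whole text in <speak>."""
    if not hints:
        return "<speak>" + escape(text) + "</speak>"
    # Longest words first so a short hint never shadows a longer one.
    alternatives = "|".join(re.escape(w) for w in sorted(hints, key=len, reverse=True))
    pattern = re.compile(rf"\b({alternatives})\b")
    parts, last = [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()]))  # plain text before the hit
        parts.append(phoneme_tag(match.group(), hints[match.group()]))
        last = match.end()
    parts.append(escape(text[last:]))
    return "<speak>" + "".join(parts) + "</speak>"
```

Usage: `apply_hints("Restart Nginx now", {"Nginx": "ˈɛndʒɪnɛks"})`. Keep the hint table short and add an entry only after hearing a mispronunciation.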
Quick A/B pacing test
Generate the same paragraph at three stability settings (ElevenLabs) or three speed settings (other engines). Listen to all three in sequence. The differences show how sensitive your script's pacing is to the setting. Choose the setting that sounds most like a knowledgeable human speaking naturally, rather than the fastest or the clearest.
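The A/B run is easy to script once you have a function that calls your engine. In this sketch `synthesize` is a hypothetical callable you supply yourself (text plus keyword settings in, audio bytes out); it is not a real SDK function, and the setting name and values are only examples.

```python
def build_ab_variants(text, setting_name, values):
    """Describe one generation request per setting value."""
    return [
        {"label": f"{setting_name}={value}", "text": text, "settings": {setting_name: value}}
        for value in values
    ]

def run_ab_test(text, setting_name, values, synthesize):
    """Call synthesize(text, **settings) once per variant and return audio keyed by label."""
    results = {}
    for variant in build_ab_variants(text, setting_name, values):
        results[variant["label"]] = synthesize(variant["text"], **variant["settings"])
    return results
```

For example, `run_ab_test(paragraph, "stability", [0.3, 0.5, 0.7], synthesize)` returns three clips in order, ready to be saved and played back to back.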
What text patterns have you found most problematic with TTS? Sharing the specific cases helps everyone calibrate.
Curated by Selendia AI 🎤