I wanted to self-host a TTS model for my Winterfell cluster. Compared to reading (say, a blog post or an article), I prefer listening while walking. I have a pipeline in the works that takes in a URL and runs it through a series of steps - scrapes it, summarizes it, and so on. More on that in a separate post, but I thought adding a TTS step would be helpful.
A cursory search revealed five main contenders in this space:
- Kokoro TTS
- F5 TTS
- StyleTTS 2
- Fish Audio S1
- Index TTS 2
These are my notes on reading about and trying out each of them.
The Sample Text
I used the following text sample for running the test:
The wind was strong on November 4th, 2026, so Dr. Smith decided to wind up his presentation early. He paid $12.50 for a ticket—which seemed expensive!—and whispered to his neighbor, ‘I read that book yesterday.’ Wait, did he say 12:30 PM or 12:30 AM? Regardless, the server IP is 192.168.1.1.
Some interesting points about this particular test prompt:
Heteronym Handling
- “The wind (noun) was strong…” vs “…decided to wind (verb) up…”
- The model should change pronunciation based on context
Text Normalization
- “…Dr. Smith…” vs “…November 4th, 2026…” vs “…$12.50…” vs “…192.168.1.1”
- The model should expand each of these correctly: a title (“Doctor”), an ordinal date, a currency amount, and an IP address read digit by digit
Pacing and Breath
- “…for a ticket—which seemed expensive!—and whispered…”
- The model should take a micro-pause at the em-dashes (—) or the exclamation mark
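To see why these cases are tricky, you can run the clauses through a plain G2P frontend yourself. Here is a minimal sketch using the phonemizer library with the espeak backend (the same kind of frontend Kokoro builds on); whether the two “wind”s come out differently depends entirely on the backend’s context rules:

```python
# pip install phonemizer  (also needs the espeak-ng system package)
from phonemizer import phonemize

# Both clauses contain "wind"; compare the IPA to see whether the
# backend distinguishes /wɪnd/ (noun) from /waɪnd/ (verb).
for clause in ["The wind was strong.", "He decided to wind up his presentation."]:
    ipa = phonemize(clause, language="en-us", backend="espeak", strip=True)
    print(f"{clause!r} -> {ipa}")
```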
Kokoro TTS
Kokoro has been called the “LLaMA 3 moment” for TTS - it’s small, open, efficient, and shockingly good.
- Architecture: GAN / Discriminator (Modified StyleTTS 2)
- Size: ~82 Million Parameters (Tiny!)
Technically, it’s a “remix” of StyleTTS 2: the creators stripped out the heavy, complex components that weren’t strictly necessary for inference and trained it on a highly curated, ultra-clean dataset of permissively licensed audio.
How it works
Unlike diffusion models that iteratively “denoise” static into sound (which takes time), Kokoro predicts the audio almost instantly in one forward pass. It uses a “Style Aligner” to ensure the prosody matches the text.
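For reference, inference with the kokoro pip package looks roughly like this (a sketch based on the project’s README; voice names and the 24 kHz output rate may change between releases):

```python
# pip install kokoro soundfile
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # "a" = American English

text = "The wind was strong, so Dr. Smith decided to wind up his presentation early."

# Each yielded chunk is (graphemes, phonemes, audio); the audio comes out of
# a single forward pass, not an iterative denoising loop.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"kokoro_{i}.wav", audio, 24000)
```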
Strengths
It is shockingly fast. Even on a consumer CPU, it runs faster than real-time. It handles pronunciation extremely well because it uses a specialized phonemizer (G2P) that is hard-coded to handle American and British English quirks.
Weaknesses
It has a fixed set of voices. While you can mix them to create new ones, you cannot easily “clone” a random voice file you found on the internet without retraining.
Sample
Pronounced everything correctly, with good-enough emotion. This is insane for an 82M model.
F5-TTS
- Architecture: Flow Matching (Diffusion Transformer)
- Size: ~300-500 Million Parameters
How it works
F5-TTS abandons the old way of forcing text to align with audio frames (which often caused robotic skipping). Instead, it uses “Flow Matching”, a technique similar to how Stable Diffusion generates images. It treats speech generation as filling in the blanks.
You give it text and a reference audio file. It “pads” the text to match the length of the audio and effectively “inpaints” the speech. This allows for incredibly smooth breathing, pauses, and natural intonation.
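The repo ships a small Python API (f5_tts.api.F5TTS at the time of writing). A minimal sketch, assuming you have a short reference clip and its transcript; treat the argument names as approximate, since they can drift between releases:

```python
# pip install f5-tts
from f5_tts.api import F5TTS

tts = F5TTS()  # downloads the default checkpoint on first use

# The reference clip and its transcript anchor the voice; the model then
# "inpaints" the new text in that voice.
wav, sr, _ = tts.infer(
    ref_file="reference.wav",
    ref_text="This is what the reference clip says.",
    gen_text="He paid $12.50 for a ticket and whispered to his neighbor.",
    file_wave="f5_output.wav",
)
```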
Strengths
Unbeatable realism for short bursts.
Weaknesses
It is non-deterministic. If you run the same sentence twice, you get two different performances. Occasionally, it will “hallucinate” (add a word) or skip a number because diffusion models sometimes struggle with precise alignment of dense data (like IP addresses).
Sample
Mispronounced the heteronym (the second “wind”), the “$”, and “12:30”. Also mispronounced the IP address. No emotion.
StyleTTS 2
- Architecture: Adversarial Diffusion + Large Speech Language Models (SLM)
- Size: Varied (usually ~300M range)
How it works
It combines the best of both worlds: it uses diffusion to pick a “style” (how the sentence should be said) but uses a GAN (Generative Adversarial Network) to actually generate the audio waves.
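A toy sketch of that two-stage split (purely illustrative; every name here is made up and the arrays are placeholders, not the real model):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_style(text: str, steps: int = 5) -> np.ndarray:
    """Stage 1 (diffusion): refine noise into a 'style' vector encoding
    pacing, pitch contour, and emphasis for this sentence."""
    style = rng.standard_normal(128)
    for _ in range(steps):
        # Toy denoising update; the real model runs a learned score network.
        style = 0.9 * style + 0.05 * rng.standard_normal(128)
    return style

def gan_decode(text: str, style: np.ndarray) -> np.ndarray:
    """Stage 2 (GAN): one forward pass from (text, style) to a waveform.
    Adversarial training is what makes this single pass sound natural."""
    return rng.standard_normal(24000)  # placeholder: 1 s of audio at 24 kHz

style = sample_style("The wind was strong.")
audio = gan_decode("The wind was strong.", style)
print(audio.shape)
```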
Strengths
Consistency. It is much more stable than F5-TTS. It rarely hallucinates or skips words, making it the preferred choice for long-form content like reading entire book chapters.
Weaknesses
It can sound slightly “smoother” or more processed than F5-TTS. It lacks that raw, gritty, “microphone noise” reality that F5 has, sounding more like a perfect studio recording.
Sample
Fish Audio S1
- Architecture: Dual-Autoregressive Transformer (LLM-based)
- Size: 500M (S1-Mini) to 4B parameters
How it works
Fish Speech is fundamentally different; it treats audio tokens exactly like text tokens. It is literally an LLM that outputs sound instead of text. This allows it to understand semantic context better than any other model.
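Conceptually, generation is ordinary LLM decoding, just over an audio-codec vocabulary instead of a text one. A toy sketch (all names and numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE = 1024   # toy audio-codec vocabulary
EOS = CODEBOOK_SIZE    # end-of-audio token

def next_audio_token(context: list[int]) -> int:
    """Stand-in for the LLM step: in the real model this is a transformer
    forward pass over the text prompt plus the audio tokens emitted so far."""
    return EOS if len(context) >= 80 else int(rng.integers(0, CODEBOOK_SIZE))

def generate_speech(text: str) -> list[int]:
    # Control tags like [laughter] are just tokens in the prompt, which is
    # why an LLM-based TTS can actually obey them.
    context = [hash(w) % CODEBOOK_SIZE for w in f"[laughter] {text}".split()]
    audio_tokens: list[int] = []
    while (tok := next_audio_token(context + audio_tokens)) != EOS:
        audio_tokens.append(tok)
    return audio_tokens  # a neural codec decoder would turn these into audio

print(len(generate_speech("I read that book yesterday.")))
```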
Strengths
Control. You can prompt it with [laughter], [sigh], or [angry], and it actually obeys.
Weaknesses
It is heavy. The full S1 model is 4B parameters, and because it decodes autoregressively like a text LLM, it is slower and hungrier than one-shot models like Kokoro.
Sample
Mispronounced the dollar amount. Emotions aren’t that good.
Index TTS 2
- Architecture: Autoregressive Transformer (Text-to-Semantic) + GAN
- Size: ~1B - 2B parameters
How it works
Unlike standard TTS which just “reads” text, Index TTS 2 separates Duration from Content. It uses a “Duration Diffusion” or token-control mechanism that allows you to specify exactly how many seconds a sentence should take.
It disentangles “Timbre” (what the voice sounds like) from “Emotion” (how they feel), meaning you can have a deep male voice sound “terrified” without it morphing into a different person.
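A toy sketch of what that disentanglement buys you at the interface level (all names invented; the real IndexTTS 2 API is different, but the separation of inputs is the point):

```python
from dataclasses import dataclass

@dataclass
class SpeechRequest:
    text: str
    timbre_ref: str           # reference clip fixing WHO is speaking
    emotion_ref: str | None   # separate clip (or tag) fixing HOW they feel
    duration_s: float | None  # hard length target, for dubbing / lip sync

# Same speaker identity, a different emotion, pinned to 3.5 s of video:
req = SpeechRequest(
    text="I read that book yesterday.",
    timbre_ref="deep_male_voice.wav",
    emotion_ref="terrified_sample.wav",
    duration_s=3.5,
)
print(req)
```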
Strengths
Dubbing / Lip Sync: This is its killer feature. You can force the audio to match a specific length (e.g., “Say this sentence in exactly 3.5 seconds”) to match video lip movements.
Emotion separation: Excellent at applying a specific emotion (like “Cry” or “Whisper”) to a custom voice clone without breaking the voice’s identity.
Weaknesses
Speed: Because it is Autoregressive (like a text LLM), it is slower than “one-shot” models like Kokoro.
Sample
Pronounced the time correctly. Extremely expressive - a little bit too expressive. Carries the emotions well. IP address pronounced correctly.
What I Went With
In the end, I chose Kokoro TTS. There’s a ready-to-deploy FastAPI wrapper around Kokoro, Kokoro-FastAPI, which also provides a Docker image.
I didn’t even bother running it on a GPU; it’s faster than real-time on a CPU alone!
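Kokoro-FastAPI exposes an OpenAI-compatible /v1/audio/speech endpoint (the README’s CPU image listens on port 8880 by default; check the version you deploy). A minimal client sketch using the standard openai package:

```python
# pip install openai
from openai import OpenAI

# Point the stock OpenAI client at the local Kokoro-FastAPI container.
client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_heart",
    input="The wind was strong, so Dr. Smith decided to wind up his presentation early.",
) as response:
    response.stream_to_file("speech.mp3")
```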
That’s it for this post. If you’re looking to self-host TTS, I hope these notes were helpful. Thanks for reading!