Choosing a TTS Model for Self Hosting


I wanted to self-host a TTS model for my Winterfell cluster. Compared to reading (say, a blog post or an article), I prefer listening while walking. I have a pipeline in the works that takes in a URL and goes through a series of steps - scrapes it, summarizes it, and so on. More on that in a separate post, but I thought adding a TTS step would be helpful.

A cursory search revealed there are five main contenders in this space:

  • Kokoro TTS
  • F5 TTS
  • StyleTTS 2
  • Fish Audio S1
  • Index TTS 2

These are my notes on reading about and trying out each of them.

[Image: abstract visualization of text transforming into audio waves. Text in, speech out.]

# The Sample Text

I used the following text sample for running the test:

The wind was strong on November 4th, 2026, so Dr. Smith decided to wind up his presentation early. He paid $12.50 for a ticket—which seemed expensive!—and whispered to his neighbor, ‘I read that book yesterday.’ Wait, did he say 12:30 PM or 12:30 AM? Regardless, the server IP is 192.168.1.1.

Some interesting points about this particular test prompt:

Heteronym Handling

  • “The wind (noun) was strong…” vs “…decided to wind (verb) up…”
  • The model should change pronunciation based on context

Text Normalization

  • “…Dr. Smith…” vs “…November 4th, 2026…” vs “…$12.50…” vs “…192.168.1.1…”
  • The model should expand each of these correctly: a title (“Doctor”), a spoken date, a currency amount, and an IP address read digit by digit

Pacing and Breath

  • “…for a ticket—which seemed expensive!—and whispered…”
  • The model should take a micro-pause at the em-dashes (—) or the exclamation mark

# Kokoro TTS

Kokoro has been called the “LLaMA 3 moment” for TTS - it’s small, open, efficient, and shockingly good.

  • Architecture: GAN / Discriminator (Modified StyleTTS 2)
  • Size: ~82 Million Parameters (Tiny!)

Technically a “remix” of StyleTTS 2: the creators stripped out the heavy, complex components that weren’t strictly necessary for inference and trained it on a highly curated, ultra-clean dataset of permissively licensed audio.

How it works

Unlike diffusion models that iteratively “denoise” static into sound (which takes time), Kokoro predicts the audio almost instantly in one forward pass. It uses a “Style Aligner” to ensure the prosody matches the text.
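Running it is a few lines with the kokoro Python package (the snippet below follows the usage documented on the model card; af_heart is one of the stock American English voices):

```python
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English

text = "The wind was strong, so Dr. Smith decided to wind up early."
# The pipeline yields (graphemes, phonemes, audio) chunks in a single
# forward pass - there is no iterative denoising loop.
for i, (gs, ps, audio) in enumerate(pipeline(text, voice='af_heart')):
    sf.write(f'kokoro_{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio
```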

Strengths

It is shockingly fast. Even on a consumer CPU, it runs faster than real-time. It handles pronunciation extremely well because it uses a specialized grapheme-to-phoneme (G2P) phonemizer that is hard-coded to handle American and British English quirks.

Weaknesses

It has a fixed set of voices. While you can mix them to create new ones, you cannot easily “clone” a random voice file you found on the internet without retraining.
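Mixing, at least, is cheap: the stock voices are just style-embedding tensors, so a blend is a weighted average. A sketch below - the .pt paths are hypothetical, and whether the pipeline accepts a raw tensor as the voice argument may depend on the package version:

```python
import torch
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')

# Hypothetical paths - the voice packs ship as .pt style tensors.
heart = torch.load('voices/af_heart.pt', weights_only=True)
adam = torch.load('voices/am_adam.pt', weights_only=True)

blended = 0.6 * heart + 0.4 * adam  # a "new" voice nobody recorded
audio_chunks = pipeline("A voice that exists nowhere.", voice=blended)
```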

Sample

[Audio sample: Kokoro TTS]

Pronounced everything correctly, good enough emotions. This is insane for an 82M model.

# F5-TTS

  • Architecture: Flow Matching (Diffusion Transformer)
  • Size: ~300-500 Million Parameters

How it works

F5-TTS abandons the old way of forcing text to align with audio frames (which often caused robotic skipping). Instead, it uses “Flow Matching”, a technique similar to how Stable Diffusion generates images. It treats speech generation as filling in the blanks.

You give it text and a reference audio file. It “pads” the text to match the length of the audio and effectively “inpaints” the speech. This allows for incredibly smooth breathing, pauses, and natural intonation.
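Conceptually, generation is an ODE solve: start from noise and integrate a learned velocity field toward the data distribution. A toy sketch of that loop (velocity_model stands in for F5’s diffusion transformer; this is an illustration, not the project’s API):

```python
import torch

def flow_matching_sample(velocity_model, text_cond, ref_feat, steps=32):
    """Euler-integrate a learned velocity field from noise (t=0) to data (t=1)."""
    x = torch.randn_like(ref_feat)  # pure noise, shaped like the padded mel frames
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        # The velocity is conditioned on the text and on the reference audio
        # being "inpainted" around - this is where the smooth prosody comes from.
        x = x + velocity_model(x, t, text_cond, ref_feat) * dt
    return x  # a mel spectrogram; a vocoder turns it into a waveform
```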

Strengths

Unbeatable realism for short bursts.

Weaknesses

It is non-deterministic. If you run the same sentence twice, you get two different performances. Occasionally, it will “hallucinate” (add a word) or skip a number because diffusion models sometimes struggle with precise alignment of dense data (like IP addresses).

Sample

[Audio sample: F5 TTS]

Mispronounced the second “wind” (the heteronym), as well as “$12.50” and “12:30”. Also mispronounced the IP. No emotions.

# StyleTTS 2

  • Architecture: Adversarial Diffusion + Large Speech Language Models (SLM)
  • Size: Varied (usually ~300M range)

How it works

It combines the best of both worlds: it uses diffusion to pick a “style” (how the sentence should be said) but uses a GAN (Generative Adversarial Network) to actually generate the audio waves.
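In pseudo-Python, the split looks something like this (placeholder names, not the real API):

```python
def styletts2_generate(style_sampler, gan_decoder, phonemes, text_emb):
    # Diffusion decides HOW the sentence should be said...
    style = style_sampler(text_emb)      # stochastic: a sampled style vector
    # ...and a GAN renders the waveform in a single fast pass.
    return gan_decoder(phonemes, style)  # deterministic given the style
```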

Strengths

Consistency. It is much more stable than F5-TTS. It rarely hallucinates or skips words, making it the preferred choice for long-form content like reading entire book chapters.

Weaknesses

It can sound slightly “smoother” or more processed than F5-TTS. It lacks that raw, gritty, “microphone noise” reality that F5 has, sounding more like a perfect studio recording.

Sample

[Audio sample: StyleTTS 2]

# Fish Audio S1

  • Architecture: Dual-Autoregressive Transformer (LLM-based)
  • Size: 500M (S1-Mini) to 4B parameters

How it works

Fish Speech is fundamentally different; it treats audio tokens exactly like text tokens. It is literally an LLM that outputs sound instead of text. This allows it to understand semantic context better than any other model.
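A sketch of the idea (placeholder objects; the real project wires this up through its own inference stack):

```python
def fish_style_tts(llm, codec, text, emotion=None):
    # The "TTS model" is an LLM whose vocabulary includes discrete
    # audio-codec tokens; emotion tags are just more text in the prompt.
    prompt = f"[{emotion}] {text}" if emotion else text
    audio_tokens = llm.generate(prompt)  # plain next-token prediction
    return codec.decode(audio_tokens)    # a neural codec maps tokens to waveform
```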

Strengths

Control. You can prompt it with [laughter], [sigh], or [angry], and it actually obeys.

Weaknesses

It is heavy. The full S1 model is 4B parameters, so it needs far more compute than the other models here to run at a usable speed.

Sample

[Audio sample: Fish Audio S1]

Mispronounced the dollar amount. Emotions aren’t that good.

# Index TTS 2

  • Architecture: Autoregressive Transformer (Text-to-Semantic) + GAN
  • Size: ~1B - 2B parameters

How it works

Unlike standard TTS which just “reads” text, Index TTS 2 separates Duration from Content. It uses a “Duration Diffusion” or token-control mechanism that allows you to specify exactly how many seconds a sentence should take.

It disentangles “Timbre” (what the voice sounds like) from “Emotion” (how they feel), meaning you can have a deep male voice sound “terrified” without it morphing into a different person.
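In sketch form (a placeholder API, just to show the controls the architecture exposes):

```python
def index_tts2_synthesize(model, text, timbre_ref, emotion_ref, seconds=None):
    # Duration is an explicit control, not an emergent property: fixing the
    # audio-token count fixes the output length.
    n = None if seconds is None else int(seconds * model.tokens_per_second)
    # Timbre and emotion come from *separate* references, so a calm voice
    # clone can be rendered "terrified" without becoming a different speaker.
    return model.synthesize(text, timbre=timbre_ref, emotion=emotion_ref,
                            num_audio_tokens=n)
```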

Strengths

Dubbing / Lip Sync: This is its killer feature. You can force the audio to match a specific length (e.g., “Say this sentence in exactly 3.5 seconds”) to match video lip movements.

Emotion separation: Excellent at applying a specific emotion (like “Cry” or “Whisper”) to a custom voice clone without breaking the voice’s identity.

Weaknesses

Speed: Because it is Autoregressive (like a text LLM), it is slower than “one-shot” models like Kokoro.

Sample

[Audio sample: Index TTS 2]

Pronounced the time correctly. Extremely expressive - a little bit too expressive. Carries the emotions well. IP address pronounced correctly.

# What I Went With

In the end, I chose Kokoro TTS. There’s a ready-to-deploy FastAPI wrapper around Kokoro - Kokoro-FastAPI - which also provides a Docker image.

I didn’t even bother running it on a GPU - it’s fast enough on a CPU alone!
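Once the container is up, calling it is a single HTTP request. A sketch, assuming the OpenAI-compatible endpoint and the default port from the project’s README (adjust both for your setup):

```python
import requests

resp = requests.post(
    "http://localhost:8880/v1/audio/speech",  # Kokoro-FastAPI default port
    json={
        "model": "kokoro",
        "voice": "af_heart",
        "input": "The wind was strong, so Dr. Smith decided to wind up early.",
    },
)
resp.raise_for_status()

with open("speech.mp3", "wb") as f:
    f.write(resp.content)  # the endpoint returns encoded audio bytes
```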

That’s it for this post. If you’re looking to self-host TTS, I hope these notes were helpful. Thanks for reading!
