Kokoro-82M

Kokoro-82M is a compact, 82-million-parameter text-to-speech (TTS) model fine-tuned from StyleTTS2. It targets natural-sounding English speech synthesis at a size that can run on a CPU or a modest GPU. Released under the Apache 2.0 license with a Hugging Face DOI, it gained attention as a high-quality open TTS model that is considerably smaller than most alternatives. It supports multiple English voice styles.

Use cases

  • Local TTS for accessibility tools and screen readers without API cost
  • Podcast and audiobook content creation from text
  • Voice assistant response generation on-device or in lightweight servers
  • Narration generation for video content at low compute cost
  • Research into efficient TTS at sub-100M parameter scale

Pros

  • Apache 2.0 license permits commercial use, subject to its attribution and notice terms
  • 82M-parameter size enables CPU and low-end GPU inference
  • Natural prosody quality for its parameter count, based on StyleTTS2
  • Multiple English voice styles available from a single checkpoint
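As a rough check on the CPU and low-end GPU claim, the memory needed just to hold the weights can be estimated from the parameter count. The sketch below assumes roughly 82 million parameters (the figure quoted above) and standard byte widths per numeric type. It ignores activations, runtime overhead, and any auxiliary components, so actual usage will be higher.

```python
# Back-of-the-envelope weight footprint for an ~82M-parameter model.
# Assumption: weights only; activations and framework overhead are excluded.
PARAMS = 82_000_000

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(num_params: int, dtype: str) -> float:
    """Return the weight storage size in decimal megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_megabytes(PARAMS, dtype):.0f} MB")
# float32: 328 MB
# float16: 164 MB
# int8: 82 MB
```

Even at full 32-bit precision the weights occupy about a third of a gigabyte, which is consistent with running on commodity hardware.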

Cons

  • English-only; no multilingual TTS capability
  • Prosody and naturalness below larger TTS models for demanding audiobook production
  • Limited control over speaking rate and emphasis compared to larger commercial TTS APIs
  • Community model without a major lab's production testing or SLA
  • Fine-tuning requires StyleTTS2 training expertise

FAQ

What is Kokoro-82M used for?

Kokoro-82M is used for local text-to-speech where API cost or connectivity is a concern: accessibility tools and screen readers, podcast and audiobook creation from text, voice assistant responses generated on-device or on lightweight servers, and low-cost narration for video. It is also a reference point for research into efficient TTS at sub-100M parameter scale.

Is Kokoro-82M free to use?

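Yes. Kokoro-82M is an open model published on Hugging Face under the Apache 2.0 license, which permits free use, including commercial use, subject to the license's attribution and notice terms. Check the model card for the current license text.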

How do I run Kokoro-82M locally?

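Download the checkpoint from Hugging Face and follow the model card for the recommended inference package, installation steps, and hardware requirements. Because the model is derived from StyleTTS2 rather than a standard transformers architecture, a generic transformers loading call may not work.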
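A minimal sketch of local synthesis follows. It assumes the `kokoro` Python package (installed with `pip install kokoro`) exposes a `KPipeline` class that yields `(graphemes, phonemes, audio)` tuples, that `lang_code="a"` selects American English, that `af_heart` is an available voice, and that output audio is sampled at 24 kHz. None of these details come from this page, so verify them against the model card. The `write_wav` helper and the `synthesize` wrapper are hypothetical names introduced for illustration.

```python
import struct
import wave

# Assumption: Kokoro emits mono audio at 24 kHz; confirm on the model card.
SAMPLE_RATE = 24_000


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip out-of-range values so they cannot overflow int16.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


def synthesize(text, voice="af_heart", prefix="kokoro"):
    """Synthesize text and return the list of WAV paths written."""
    # Imported lazily so write_wav stays usable without the package.
    # Assumption: the `kokoro` package provides KPipeline with this interface.
    from kokoro import KPipeline

    pipeline = KPipeline(lang_code="a")  # assumed code for American English
    paths = []
    # The pipeline is assumed to split long text and yield one clip per chunk.
    for index, (_graphemes, _phonemes, audio) in enumerate(pipeline(text, voice=voice)):
        path = f"{prefix}_{index}.wav"
        write_wav(path, audio.tolist())  # tensor or array converted to a list
        paths.append(path)
    return paths
```

Calling `synthesize("Kokoro is a compact text-to-speech model.")` would, under these assumptions, write one or more files named `kokoro_0.wav`, `kokoro_1.wav`, and so on in the working directory. The first call typically downloads the model weights.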

Tags

  • text-to-speech
  • en
  • arxiv:2306.07691
  • arxiv:2203.02395
  • base_model:yl4579/StyleTTS2-LJSpeech
  • base_model:finetune:yl4579/StyleTTS2-LJSpeech
  • doi:10.57967/hf/4329
  • license:apache-2.0
  • region:us