Text-to-speech models

17 models · ranked by HuggingFace downloads

Kokoro-82M

Kokoro-82M is a compact 82-million-parameter text-to-speech model fine-tuned from StyleTTS2, targeting natural-sounding English speech synthesis at a size runnable on CPU or modest GPU. Released under Apache 2.0 with a HuggingFace DOI, it gained attention as a high-quality open TTS model at significantly smaller scale than most alternatives. It supports multiple English voice styles.

13,949,161 ↓ · 6,437 ♡

XTTS-v2

XTTS-v2 is Coqui's multilingual text-to-speech model supporting 17 languages with voice cloning from a short audio sample. It uses a GPT-style decoder for speech token generation, enabling zero-shot speaker cloning without fine-tuning. The model was released before Coqui's closure and remains available under the Coqui Public Model License, which restricts commercial use.

9,303,058 ↓ · 3,635 ♡

chatterbox

Chatterbox is Resemble AI's open-source text-to-speech model offering voice cloning and expressive speech synthesis. It is designed as a production-grade TTS system with controllable prosody and emotion.

2,320,748 ↓ · 1,671 ♡

Qwen3-TTS-12Hz-1.7B-CustomVoice

Qwen3-TTS-12Hz-1.7B-CustomVoice is the 1.7B variant of Qwen's TTS family focused on voice customization, operating at a 12Hz token rate. It is the larger counterpart to the 0.6B CustomVoice model.

2,001,636 ↓ · 1,674 ♡

Qwen3-TTS-12Hz-0.6B-CustomVoice

Qwen3-TTS CustomVoice is the 0.6B variant of Qwen's TTS family focused on voice customization from reference audio. At a 12Hz token rate and 0.6B parameters, it is designed for constrained environments where the full 1.7B TTS model is too heavy. It supports 9 languages, including CJK and major European languages.

1,211,933 ↓ · 164 ♡
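
To give a sense of what a 12Hz token rate means for sequence length, the arithmetic below counts token frames for a clip of a given duration. It is a rough sketch that takes the 12Hz figure in the Qwen3-TTS model names at face value and assumes it means 12 token frames per second of audio. How many codebook tokens each frame holds is not stated here, so `codebooks_per_frame` is a hypothetical parameter, as are the function names.

```python
import math


def token_frames(duration_s: float, frame_rate_hz: float = 12.0) -> int:
    """Number of token frames needed to cover `duration_s` seconds of audio."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return math.ceil(duration_s * frame_rate_hz)


def total_tokens(duration_s: float,
                 frame_rate_hz: float = 12.0,
                 codebooks_per_frame: int = 1) -> int:
    """Frames times tokens per frame (codebooks_per_frame is an assumption)."""
    return token_frames(duration_s, frame_rate_hz) * codebooks_per_frame


if __name__ == "__main__":
    for seconds in (1, 10, 60):
        print(f"{seconds:>3} s of audio -> {token_frames(seconds)} frames")
```

A low frame rate keeps sequences short: under these assumptions, a full minute of speech is only a few hundred autoregressive steps.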

MOSS-TTS

MOSS-TTS is OpenMOSS's multilingual text-to-speech model supporting 20 languages including Chinese, English, German, Japanese, Korean, Russian, and Hebrew. It uses a delay-based autoregressive architecture (moss_tts_delay) for high-quality speech synthesis with natural prosody. Apache-2.0 licensing makes it a viable open alternative to commercial TTS APIs for multilingual applications.

945,910 ↓ · 405 ♡
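
The architecture name moss_tts_delay suggests a codebook delay pattern, a technique in which each codebook stream of a multi-codebook audio tokenizer is shifted right by its index so that one autoregressive step emits one token per codebook. The sketch below illustrates that general technique only. Whether MOSS-TTS arranges its tokens this way is an assumption, and `PAD`, `apply_delay`, and `undo_delay` are hypothetical names.

```python
PAD = -1  # hypothetical padding id, not taken from any real tokenizer


def apply_delay(codes):
    """Shift codebook k right by k steps.

    codes: list of K equal-length token lists (one per codebook, length T).
    Returns K lists of length T + K - 1, padded with PAD.
    """
    k_total = len(codes)
    if k_total == 0:
        return []
    t_total = len(codes[0])
    if any(len(row) != t_total for row in codes):
        raise ValueError("all codebooks must have the same length")
    return [
        [PAD] * k + list(row) + [PAD] * (k_total - 1 - k)
        for k, row in enumerate(codes)
    ]


def undo_delay(delayed):
    """Invert apply_delay, recovering the aligned K x T layout."""
    k_total = len(delayed)
    if k_total == 0:
        return []
    t_total = len(delayed[0]) - (k_total - 1)
    return [row[k:k + t_total] for k, row in enumerate(delayed)]


if __name__ == "__main__":
    codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for row in apply_delay(codes):
        print(row)
```

With the delay applied, the token for codebook k at time t sits in column t + k, so a model reading columns left to right sees coarser codebooks for a frame before finer ones.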

OmniVoice

OmniVoice from k2-fsa is a multilingual speech model targeting end-to-end ASR and voice processing tasks. It is published as part of the k2/Lhotse/sherpa-onnx ecosystem for server and edge speech applications.

877,952 ↓ · 1,117 ♡

F5-TTS

F5-TTS is a non-autoregressive text-to-speech model that generates speech with flow matching over a Diffusion Transformer backbone. It supports zero-shot voice cloning from a short reference clip.

785,125 ↓ · 1,185 ♡

indic-parler-tts

indic-parler-tts is a Parler-TTS variant from AI4Bharat for Indic languages. Following the Parler-TTS approach, the voice is steered by a natural-language description (speaker, pace, tone) supplied alongside the text to be spoken.

762,280 ↓ · 252 ♡

Qwen3-TTS-12Hz-1.7B-VoiceDesign

Qwen3-TTS VoiceDesign is a 1.7B text-to-speech model operating at a 12Hz token rate, designed to support custom voice creation alongside standard TTS. It covers multiple languages and generates expressive speech from text input. It is Apache-2.0 licensed and part of Qwen's audio model family.

663,776 ↓ · 367 ♡

VoxCPM2

VoxCPM2 is a multilingual text-to-speech model from OpenBMB supporting over 35 languages, with explicit voice-cloning and voice-design capabilities built on a diffusion-based audio synthesis approach. It covers a wide geographic range including East Asian, Southeast Asian, European, and Middle Eastern languages. The model is released under Apache-2.0.

654,999 ↓ · 1,461 ♡

VibeVoice-Realtime-0.5B

VibeVoice-Realtime-0.5B is a 0.5B-parameter text-to-speech model in Microsoft's VibeVoice family, aimed at low-latency, real-time speech synthesis.

630,635 ↓ · 1,236 ♡

Kokoro-82M-v1.0-ONNX

Kokoro-82M is a lightweight 82M-parameter text-to-speech model converted to ONNX by the HuggingFace ONNX community, enabling browser-based and edge TTS via Transformers.js. It uses the StyleTTS2 architecture, which separates style and content representations to produce expressive speech without large acoustic models. The ONNX conversion allows direct client-side inference without a server.

580,506 ↓ · 233 ♡

Qwen3-TTS-12Hz-0.6B-Base

Qwen3-TTS-12Hz-0.6B-Base is the 0.6B base model of Qwen's TTS family, operating at a 12Hz token rate. It belongs to the same family as the CustomVoice and VoiceDesign variants and, like them, targets low-latency speech synthesis.

558,415 ↓ · 259 ♡

mms-tts-hat

MMS-TTS-HAT is Meta's Massively Multilingual Speech TTS model for Haitian Creole (hat), part of the MMS project targeting 1000+ languages. It uses the VITS architecture for end-to-end speech synthesis. It is licensed under CC-BY-NC-4.0, which permits non-commercial use only.

444,952 ↓ · 4 ♡

s2-pro

s2-pro is Fish Audio's multilingual text-to-speech model supporting over 80 languages with instruction-following capabilities, described in arXiv:2603.08823. It is designed for zero-shot voice cloning and cross-lingual synthesis by conditioning on speaker reference audio and natural language prompts. The license is marked 'other', meaning specific usage restrictions apply beyond standard open-source terms.

434,111 ↓ · 1,071 ♡

Kokoro-82M-bf16

Kokoro-82M-bf16 is a bfloat16-precision build of the 82M-parameter Kokoro text-to-speech model. Storing weights in 16 bits roughly halves their size relative to 32-bit floats.

432,388 ↓ · 52 ♡
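
As a rough illustration of what a bf16 build of an 82M-parameter model such as Kokoro-82M-bf16 saves, the arithmetic below estimates weight storage at several precisions. It counts weights only (no activations, voice embeddings, or runtime overhead) and assumes exactly 82 million parameters, which is a rounded figure. The names are hypothetical.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_megabytes(num_params: int, dtype: str) -> float:
    """Approximate weight size in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


if __name__ == "__main__":
    params = 82_000_000  # assumed round figure taken from the model name
    for dtype in ("float32", "bfloat16"):
        print(f"{dtype:>8}: {weight_megabytes(params, dtype):.0f} MB")
```

Under these assumptions the weights fit comfortably in memory on CPU-only machines at either precision, with bfloat16 taking half the space of float32.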