TADA: Speech generation through text-acoustic synchronization

2026-03-11 5:42 · www.hume.ai

TADA (Text-Acoustic Dual Alignment) is Hume AI's open-source speech-language model that synchronizes text and audio one-to-one.

The future of voice AI hinges on sounding natural, fast, expressive, and free of quirks like hallucinated words or skipped content. Today's LLM-based TTS systems are forced to choose between speed, quality, and reliability because of a fundamental mismatch between how text and audio are represented inside language models.

TADA (Text-Acoustic Dual Alignment) resolves that mismatch with a novel tokenization schema that synchronizes text and speech one-to-one. The result: the fastest LLM-based TTS system available, with competitive voice quality, virtually zero content hallucinations, and a footprint light enough for on-device deployment.

Hume AI is open-sourcing TADA to accelerate progress toward efficient, reliable voice generation. Code and pre-trained models are available now.

live demo

Approach

For every second of spoken audio, the acoustic signal carries far more information than the corresponding text. A second of audio might be 2–3 text tokens but 12.5–25 acoustic frames. This mismatch means LLM-based TTS systems must manage sequences where audio tokens vastly outnumber text tokens — leading to longer context windows, higher memory consumption, slower inference, and more opportunities for the model to lose track of what it's supposed to say.
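The mismatch is easy to quantify. A minimal sketch using the rates quoted above (roughly 25 acoustic frames/s for a conventional audio-token system versus ~2.5 text tokens/s for a synchronized stream); the specific rates are midpoints of the ranges in the text, chosen for illustration:

```python
# Illustrative arithmetic: sequence length each approach needs for a
# 60-second utterance, at the per-second rates quoted in the article.

def sequence_length(duration_s: float, rate_per_s: float) -> int:
    """Number of LLM positions needed to cover `duration_s` of audio."""
    return round(duration_s * rate_per_s)

duration = 60.0
conventional = sequence_length(duration, 25.0)  # 12.5-25 acoustic frames/s
tada = sequence_length(duration, 2.5)           # 2-3 text tokens/s

print(conventional, tada, conventional / tada)  # 1500 150 10.0
```

At these rates, a conventional system processes roughly 10x more sequence positions per second of speech, which is where the context, memory, and latency costs come from.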

Most existing systems address this by reducing audio frame rates or introducing intermediate "semantic" tokens between text and audio. Both approaches introduce their own tradeoffs: degraded expressiveness, added complexity, or both.

TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.

For input audio, an encoder paired with an aligner extracts acoustic features from the audio segment corresponding to each text token. For output audio, the LLM's final hidden state serves as a conditioning vector for a flow-matching head, which generates acoustic features that are then decoded into audio and fed back into the model.

Since each LLM step corresponds to exactly one text token and one audio frame, TADA generates speech faster and with less computational effort. And because the architecture enforces a strict one-to-one mapping between text and audio, the model cannot skip or hallucinate content by construction.
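The control flow described above can be sketched as a loop. Everything below is a stand-in stub (random linear maps in place of the real LLM, aligner, and flow-matching head), not Hume's implementation; the point is the structure: one LLM step consumes one text token plus the previous acoustic vector, and emits exactly one new acoustic vector.

```python
import numpy as np

# Stub dimensions and weights; the real components are a Llama-based LLM
# and a flow-matching head, replaced here with random linear maps.
rng = np.random.default_rng(0)
HID, ACOUSTIC = 16, 8
W_lm = rng.normal(size=(HID + ACOUSTIC, HID))  # stub "LLM" transition
W_flow = rng.normal(size=(HID, ACOUSTIC))      # stub flow-matching head

def lm_step(text_emb: np.ndarray, prev_acoustic: np.ndarray) -> np.ndarray:
    # One language-model step conditioned on the current text token and
    # the acoustic vector fed back from the previous step.
    return np.tanh(np.concatenate([text_emb, prev_acoustic]) @ W_lm)

def flow_head(hidden: np.ndarray) -> np.ndarray:
    # The final hidden state conditions the head, which emits one
    # continuous acoustic vector per text token.
    return hidden @ W_flow

text_tokens = [rng.normal(size=HID) for _ in range(5)]  # 5 text embeddings
acoustic = np.zeros(ACOUSTIC)
frames = []
for emb in text_tokens:
    hidden = lm_step(emb, acoustic)
    acoustic = flow_head(hidden)  # decoded to audio, then fed back
    frames.append(acoustic)

# The loop structurally enforces the one-to-one text/audio mapping.
assert len(frames) == len(text_tokens)
```

Because the loop can only ever emit one acoustic vector per text token, there is no mechanism by which content can be skipped or repeated, which is the sense in which hallucination is ruled out "by construction."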

Evaluation

Speed

TADA generates speech at a real-time factor (RTF) of 0.09 — more than 5x faster than similar-grade LLM-based TTS systems. This is possible because TADA operates at just 2–3 frames (tokens) per second of audio, compared to 12.5–75 tokens per second in other approaches.
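For readers unfamiliar with the metric, RTF is synthesis time divided by audio duration, so lower is faster. A quick sketch of what 0.09 means in wall-clock terms:

```python
# Real-time factor: synthesis_time / audio_duration (lower is faster).
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

# At RTF 0.09, a 60-second clip takes about 5.4 s to synthesize;
# a system 5x slower (RTF 0.45) would take about 27 s.
print(rtf(5.4, 60.0))  # ~0.09
```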

Hallucination

Our model was trained on large-scale, in-the-wild data, without post-training, and achieves the same reliability as models trained on smaller curated datasets. We measured hallucination rate by flagging any sample with a character error rate (CER) above 0.15 — a threshold that captures unintelligible speech, skipped text, and inserted content. On the 1,000+ test samples from LibriTTS-R, TADA produced zero hallucinations.
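The metric itself is straightforward to reproduce: transcribe each generated sample with an ASR system (not shown here), compute CER against the input text, and flag samples above the threshold. A self-contained sketch:

```python
# CER-based hallucination flagging, as described above. The ASR step is
# assumed to have already produced `asr_transcript`.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def is_hallucination(reference: str, asr_transcript: str,
                     threshold: float = 0.15) -> bool:
    return cer(reference, asr_transcript) > threshold

print(is_hallucination("the quick brown fox", "the quick brown fox"))  # False
print(is_hallucination("the quick brown fox", "the quick"))            # True
```

Note that a single CER threshold conflates skipped text, inserted content, and plain mispronunciation, which is exactly why the article frames it as a conservative catch-all for reliability failures.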

Voice Quality

In human evaluation on expressive, long-form speech (EARS dataset), TADA scored 4.18/5.0 on speaker similarity and 3.78/5.0 on naturalness, placing second overall — ahead of several systems trained on significantly more data.

Potential Applications

On-device deployment: TADA is lightweight enough to run on mobile phones and edge devices without requiring cloud inference. For device manufacturers and app developers building voice interfaces, this means lower latency, better privacy, and no API dependency.

Long-form and conversational speech: TADA's synchronous tokenization is dramatically more context-efficient than existing approaches. Where a conventional system exhausts a 2048-token context window in about 70 seconds of audio, TADA can accommodate roughly 700 seconds in the same budget. This opens the door to long-form narration, extended dialogue, and multi-turn voice interactions.
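The ~70 s vs. ~700 s figures follow directly from the token rates. A sketch, using approximate rates of 30 tokens/s (conventional) and 3 tokens/s (TADA) consistent with the article's ranges:

```python
# How many seconds of audio fit in a fixed context window at each
# tokenization rate.

def seconds_of_audio(context_tokens: int, tokens_per_second: float) -> float:
    return context_tokens / tokens_per_second

print(round(seconds_of_audio(2048, 30.0)))  # ~68 s, conventional system
print(round(seconds_of_audio(2048, 3.0)))   # ~683 s, TADA
```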

Production reliability: Zero hallucinations in our tests suggest fewer edge cases to catch, fewer customer complaints, and less post-processing overhead in the product. This makes TADA well-suited for deploying voice in regulated or sensitive environments like healthcare, finance, and education.

Limitations and Future Work

Long-form degradation: While the model supports more than 10 minutes of context, we noticed occasional cases of speaker drift during long generations. Our online rejection sampling strategy reduces this significantly, but it's not fully resolved. We suggest resetting the context as an intermediate workaround.

The modality gap: When the model generates text alongside speech, language quality drops relative to text-only mode. We introduce Speech Free Guidance (SFG), a technique that blends logits from text-only and text-speech inference modes to help close this gap, but more work is required.
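A logit-blending step of this kind can be sketched as follows. The blending rule and the `guidance_scale` value below are assumptions for illustration (in the style of classifier-free guidance), not Hume's published SFG formulation:

```python
import numpy as np

def sfg_logits(text_only_logits: np.ndarray,
               text_speech_logits: np.ndarray,
               guidance_scale: float = 1.5) -> np.ndarray:
    # Push the joint text-speech distribution toward the (stronger)
    # text-only distribution before sampling the next text token.
    return text_speech_logits + guidance_scale * (
        text_only_logits - text_speech_logits)

rng = np.random.default_rng(1)
lt = rng.normal(size=8)  # logits from a text-only forward pass
ls = rng.normal(size=8)  # logits from the joint text+speech pass
blended = sfg_logits(lt, ls)
```

At `guidance_scale=0` this reduces to the joint-mode logits and at `1.0` to the text-only logits; values above 1 extrapolate past the text-only distribution, as in classifier-free guidance.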

Use cases: The model is pre-trained only on speech continuation; further fine-tuning is required for assistant scenarios. Get in touch to inquire about Hume's extensive library of fine-tuning data.

Scale: The current release covers English and seven additional languages, so there's clear room to expand. We're training larger models with broader language coverage using Hume AI data.

We're releasing TADA because we believe this architecture opens a productive direction for the field, and we want to accelerate progress. We invite researchers and developers to build on this work — whether that means extending the tokenizer to new modalities, solving the long-context problem, or adapting the framework for new applications.

Get Started

TADA is available now under an open-source license. We're releasing 1B and 3B parameter Llama-based models and the full audio tokenizer and decoder.

1B (English): huggingface.co/HumeAI/tada-1b

3B (multilingual): huggingface.co/HumeAI/tada-3b-ml

Demo: huggingface.co/spaces/HumeAI/tada

GitHub: github.com/HumeAI/tada

arXiv: https://arxiv.org/abs/2602.23068

Hume builds voice AI research infrastructure for frontier labs and AI-first enterprises. If you're working on voice models and need high-quality training data, evaluation systems, or reinforcement learning infrastructure, get in touch at hello@hume.ai.



Comments

  • By microtherion 2026-03-11 12:15 · 2 replies

    To me, the speech sounds impressively expressive, but there is something off about the audio quality that I can't quite put my finger on.

    The "Anger Speech" has an obvious lisp (Maybe a homage to Elmer Fudd?). But I hear a similar, but more subtle, speech impediment in the "Adoration Speech". The "Fearful Speech" might have a slight warble to it. And the "Long Speech" is difficult to evaluate because the speaker has vocal fry to an extent that I find annoying.

    • By sharyphil 2026-03-11 13:58

      > speaker has vocal fry to an extent that I find annoying.

      Was it trained on Sam Altman?

    • By sjcoles 2026-03-11 20:59

      There's a subtle modulation that happens on all of the samples. It sounds almost like some kind of harmonic or phase shift? This is something I notice with every AI generated speech out there.

  • By mpalmer 2026-03-11 12:19 · 1 reply

    "Long speech" is a faithful synthesis of a fairly irritating modern American English speech pattern.

    • By ggus 2026-03-11 15:27 · 1 reply

      "Vocal fry", aka "creaky voice". It's stereotypically associated with irritating young women.

      I like me a good rabbit hole that's interesting and also digs into stereotypes.

      Turns out, like many memes, it's not just that. It's (also?) a normal speech pattern, used by different genders, ages, and social groups, in many languages.

      This doesn't mean that vocal fry isn't used as social signaling. But complaining about it, well, isn't that social signalling too?

      Geoff Lindsey - Vocal Fry: what it is, who does it, and why people hate it! - https://www.youtube.com/watch?v=Q0yL2GezneU

      • By mpalmer 2026-03-11 16:59

        Not the fry, the cadence that makes everything sound like the same list of three or four things

  • By earthnail 2026-03-11 12:07 · 1 reply

    I don’t understand the approach

    > TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.

    So basically just concatenating the audio vectors without compression or discretization?

    I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.

    • By yorwba 2026-03-11 13:11

      It's a variable-rate codec. The audio is still compressed, but by how much depends on the duration of the segment corresponding to a particular text token. The TTS model predicts one audio token per text token and its duration, and the audio decoder fills in a waveform of the appropriate length.

HackerNews