Parakeet.cpp – Parakeet ASR inference in pure C++ with Metal GPU acceleration

2026-02-27 3:48 · github.com

Ultra-fast and portable Parakeet implementation for on-device inference in C++ using Axiom with MPS + Unified Memory and CUDA support - Frikallo/parakeet.cpp

Fast speech recognition with NVIDIA's Parakeet models in pure C++.

Built on axiom — a lightweight tensor library with automatic Metal GPU acceleration. No ONNX runtime, no Python runtime, no heavyweight dependencies. Just C++ and one tensor library that outruns PyTorch MPS.

~27ms encoder inference on Apple Silicon GPU for 10s audio (110M model) — 96x faster than CPU.

| Model | Class | Size | Type | Description |
|---|---|---|---|---|
| tdt-ctc-110m | ParakeetTDTCTC | 110M | Offline | English, dual CTC/TDT decoder heads |
| tdt-600m | ParakeetTDT | 600M | Offline | Multilingual, TDT decoder |
| eou-120m | ParakeetEOU | 120M | Streaming | English, RNNT with end-of-utterance detection |
| nemotron-600m | ParakeetNemotron | 600M | Streaming | Multilingual, configurable latency (80ms–1120ms) |
| sortformer | Sortformer | 117M | Streaming | Speaker diarization (up to 4 speakers) |

All ASR models share the same audio pipeline: 16kHz mono WAV → 80-bin Mel spectrogram → FastConformer encoder.

```cpp
#include <parakeet/parakeet.hpp>

parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu(); // optional — Metal acceleration
auto result = t.transcribe("audio.wav");
std::cout << result.text << std::endl;
```

Choose decoder at call site:

```cpp
auto result = t.transcribe("audio.wav", parakeet::Decoder::CTC); // fast greedy
auto result = t.transcribe("audio.wav", parakeet::Decoder::TDT); // better accuracy (default)
```

Word-level timestamps:

```cpp
auto result = t.transcribe("audio.wav", parakeet::Decoder::TDT, /*timestamps=*/true);
for (const auto &w : result.word_timestamps) {
    std::cout << "[" << w.start << "s - " << w.end << "s] " << w.word << std::endl;
}
```
Offline transcription (110M TDT-CTC):

```cpp
parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu();
auto result = t.transcribe("audio.wav");
```

600M multilingual TDT:

```cpp
parakeet::TDTTranscriber t("model.safetensors", "vocab.txt", parakeet::make_tdt_600m_config());
auto result = t.transcribe("audio.wav");
```

Streaming with end-of-utterance detection:

```cpp
parakeet::StreamingTranscriber t("model.safetensors", "vocab.txt", parakeet::make_eou_120m_config());

// Feed audio chunks (e.g., from microphone)
while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}
std::cout << t.get_text() << std::endl;
```

Nemotron streaming with configurable latency:

```cpp
// Latency modes: 0=80ms, 1=160ms, 6=560ms, 13=1120ms
auto cfg = parakeet::make_nemotron_600m_config(/*latency_frames=*/1);
parakeet::NemotronTranscriber t("model.safetensors", "vocab.txt", cfg);
while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}
```

Identify who spoke when — detects up to 4 speakers with per-frame activity probabilities:

```cpp
parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));

auto wav = parakeet::read_wav("meeting.wav");
auto features = parakeet::preprocess_audio(wav.samples, {.normalize = false});
auto segments = model.diarize(features);

for (const auto &seg : segments) {
    std::cout << "Speaker " << seg.speaker_id << ": [" << seg.start << "s - " << seg.end << "s]" << std::endl;
}
// Speaker 0: [0.56s - 2.96s]
// Speaker 0: [3.36s - 4.40s]
// Speaker 1: [4.80s - 6.24s]
```

Streaming diarization with arrival-order speaker tracking:

```cpp
parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));

parakeet::EncoderCache enc_cache;
parakeet::AOSCCache aosc_cache(4); // max 4 speakers

while (auto chunk = get_audio_chunk()) {
    auto features = parakeet::preprocess_audio(chunk, {.normalize = false});
    auto segments = model.diarize_chunk(features, enc_cache, aosc_cache);
    for (const auto &seg : segments) {
        std::cout << "Speaker " << seg.speaker_id << ": [" << seg.start << "s - " << seg.end << "s]" << std::endl;
    }
}
```

For full control over the pipeline:

CTC (English, punctuation & capitalization):

```cpp
auto cfg = parakeet::make_110m_config();
parakeet::ParakeetTDTCTC model(cfg);
model.load_state_dict(axiom::io::safetensors::load("model.safetensors"));

auto wav = parakeet::read_wav("audio.wav");
auto features = parakeet::preprocess_audio(wav.samples);

auto encoder_out = model.encoder()(features);
auto log_probs = model.ctc_decoder()(encoder_out);
auto tokens = parakeet::ctc_greedy_decode(log_probs);

parakeet::Tokenizer tokenizer;
tokenizer.load("vocab.txt");
std::cout << tokenizer.decode(tokens[0]) << std::endl;
```
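The `ctc_greedy_decode` step is standard CTC greedy decoding: per-frame argmax, collapse consecutive repeats, drop the blank token. A self-contained sketch of the algorithm (this is the textbook procedure, not the library's actual implementation, and `ctc_greedy` here operates on plain vectors rather than tensors):

```cpp
#include <cstddef>
#include <vector>

// CTC greedy decoding over per-frame log-probabilities:
// argmax each frame, drop repeats and the blank symbol.
std::vector<int> ctc_greedy(const std::vector<std::vector<float>>& log_probs,
                            int blank_id) {
    std::vector<int> out;
    int prev = -1;
    for (const auto& frame : log_probs) {
        // Argmax over the vocabulary (including blank).
        int best = 0;
        for (std::size_t i = 1; i < frame.size(); ++i)
            if (frame[i] > frame[best]) best = static_cast<int>(i);
        // Emit only when not blank and not a repeat of the previous frame.
        if (best != blank_id && best != prev) out.push_back(best);
        prev = best;
    }
    return out;
}
```

Per the notes below, the blank ID in the real models is 1024 (110M) or 8192 (600M).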

TDT (Token-and-Duration Transducer):

```cpp
auto encoder_out = model.encoder()(features);
auto tokens = parakeet::tdt_greedy_decode(model, encoder_out, cfg.durations);
std::cout << tokenizer.decode(tokens[0]) << std::endl;
```
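What makes TDT decoding different from plain RNNT is the duration head: at each step the joint network scores both a token and a duration, and the decoder jumps ahead by the predicted duration instead of advancing one frame at a time. A simplified self-contained sketch of that loop (the `JointOut` type and the pre-computed per-frame outputs are hypothetical stand-ins; the real decoder runs the joint network autoregressively and allows multiple emissions per frame):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical joint-network output for one step: a token and an index
// into the model's duration table (e.g., cfg.durations).
struct JointOut { int token; int duration_idx; };

std::vector<int> tdt_greedy(const std::vector<JointOut>& step_out,
                            const std::vector<int>& durations,
                            int blank_id, int num_frames) {
    std::vector<int> tokens;
    int t = 0;
    while (t < num_frames) {
        const JointOut& o = step_out[t];
        int skip = durations[o.duration_idx];
        if (o.token != blank_id) tokens.push_back(o.token);
        else skip = std::max(skip, 1); // blank must advance at least one frame
        if (skip == 0) skip = 1;      // simplification: always advance to avoid stalling
        t += skip;                    // jump ahead by the predicted duration
    }
    return tokens;
}
```

Skipping frames is why TDT decoding can be faster than frame-by-frame RNNT while also improving accuracy.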

Timestamps (CTC or TDT):

```cpp
// CTC timestamps
auto ts = parakeet::ctc_greedy_decode_with_timestamps(log_probs);

// TDT timestamps
auto ts = parakeet::tdt_greedy_decode_with_timestamps(model, encoder_out, cfg.durations);

// Group into word-level timestamps
auto words = parakeet::group_timestamps(ts[0], tokenizer.pieces());
```
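The grouping step relies on a SentencePiece convention: pieces that start a new word carry a leading "▁" marker. A sketch of how token-level stamps can be merged into word spans (the `TokenStamp`/`WordStamp` types are hypothetical; the library's `group_timestamps` signature is not shown above):

```cpp
#include <string>
#include <vector>

// Hypothetical token- and word-level timestamp records.
struct TokenStamp { std::string piece; double start, end; };
struct WordStamp  { std::string word;  double start, end; };

std::vector<WordStamp> group_words(const std::vector<TokenStamp>& toks) {
    std::vector<WordStamp> words;
    for (const auto& t : toks) {
        // SentencePiece marks word starts with "▁" (UTF-8: E2 96 81).
        bool starts_word = t.piece.rfind("\xE2\x96\x81", 0) == 0;
        if (starts_word || words.empty()) {
            std::string text = starts_word ? t.piece.substr(3) : t.piece;
            words.push_back({text, t.start, t.end});
        } else {
            // Continuation piece: extend the current word's span.
            words.back().word += t.piece;
            words.back().end = t.end;
        }
    }
    return words;
}
```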

GPU acceleration (Metal):

```cpp
model.to(axiom::Device::GPU);
auto features_gpu = features.gpu();
auto encoder_out = model.encoder()(features_gpu);

// Decode on CPU
auto tokens = parakeet::ctc_greedy_decode(
    model.ctc_decoder()(encoder_out).cpu()
);
```
CLI usage:

```
Usage: parakeet <model.safetensors> <audio.wav> [options]

Model types:
  --model TYPE     Model type (default: tdt-ctc-110m)
                   Types: tdt-ctc-110m, tdt-600m, eou-120m,
                          nemotron-600m, sortformer

Decoder options:
  --ctc            Use CTC decoder (default: TDT)
  --tdt            Use TDT decoder

Other options:
  --vocab PATH     SentencePiece vocab file
  --gpu            Run on Metal GPU
  --timestamps     Show word-level timestamps
  --streaming      Use streaming mode (eou/nemotron models)
  --latency N      Right context frames for nemotron (0/1/6/13)
  --features PATH  Load pre-computed features from .npy file
```

Examples:

```sh
# Basic transcription (TDT decoder, default)
./build/parakeet model.safetensors audio.wav --vocab vocab.txt

# CTC decoder
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --ctc

# GPU acceleration
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --gpu

# Word-level timestamps
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --timestamps

# 600M multilingual TDT model
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model tdt-600m

# Streaming with EOU
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model eou-120m

# Nemotron streaming with configurable latency
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model nemotron-600m --latency 6

# Speaker diarization
./build/parakeet sortformer.safetensors meeting.wav --model sortformer
# Speaker 0: [0.56s - 2.96s]
# Speaker 0: [3.36s - 4.40s]
# Speaker 1: [4.80s - 6.24s]
```

Requires C++20. Axiom is the only dependency (included as a submodule).

```sh
git clone --recursive https://github.com/noahkay13/parakeet.cpp
cd parakeet.cpp
make build
```

Download a NeMo checkpoint from NVIDIA and convert to safetensors:

```sh
# Download from HuggingFace (requires pip install huggingface_hub)
huggingface-cli download nvidia/parakeet-tdt_ctc-110m --include "*.nemo" --local-dir .

# Convert to safetensors
pip install safetensors torch
python scripts/convert_nemo.py parakeet-tdt_ctc-110m.nemo -o model.safetensors
```

The converter supports all model types via the --model flag:

```sh
# 110M TDT-CTC (default)
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 110m-tdt-ctc

# 600M multilingual TDT
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 600m-tdt

# 120M EOU streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model eou-120m

# 600M Nemotron streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model nemotron-600m

# 117M Sortformer diarization
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model sortformer
```

Also supports raw .ckpt files and inspection:

```sh
python scripts/convert_nemo.py model_weights.ckpt -o model.safetensors
python scripts/convert_nemo.py --dump model.nemo  # inspect checkpoint keys
```

Grab the SentencePiece vocab from the same HuggingFace repo. The file is inside the .nemo archive, or download directly:

```sh
# Extract from .nemo
tar xf parakeet-tdt_ctc-110m.nemo ./tokenizer.model
# or use the vocab.txt from the HF files page
```

Built on a shared FastConformer encoder (Conv2d 8x subsampling → N Conformer blocks with relative positional attention):

| Model | Class | Decoder | Use case |
|---|---|---|---|
| CTC | ParakeetCTC | Greedy argmax | Fast, English-only |
| RNNT | ParakeetRNNT | Autoregressive LSTM | Streaming capable |
| TDT | ParakeetTDT | LSTM + duration prediction | Better accuracy than RNNT |
| TDT-CTC | ParakeetTDTCTC | Both TDT and CTC heads | Switch decoder at inference |

Built on a cache-aware streaming FastConformer encoder with causal convolutions and bounded-context attention:

| Model | Class | Decoder | Use case |
|---|---|---|---|
| EOU | ParakeetEOU | Streaming RNNT | End-of-utterance detection |
| Nemotron | ParakeetNemotron | Streaming TDT | Configurable latency streaming |

| Model | Class | Architecture | Use case |
|---|---|---|---|
| Sortformer | Sortformer | NEST encoder → Transformer → sigmoid | Speaker diarization (up to 4 speakers) |

Measured on Apple M3 16GB with simulated audio input (Tensor::randn). Times are per-encoder-forward-pass (Sortformer: full forward pass).

Encoder throughput — 10s audio:

| Model | Params | CPU (ms) | GPU (ms) | GPU Speedup |
|---|---|---|---|---|
| 110m (TDT-CTC) | 110M | 2,581 | 27 | 96x |
| tdt-600m | 600M | 10,779 | 520 | 21x |
| rnnt-600m | 600M | 10,648 | 1,468 | 7x |
| sortformer | 117M | 3,195 | 479 | 7x |

110m GPU scaling across audio lengths:

| Audio | CPU (ms) | GPU (ms) | RTF | Throughput |
|---|---|---|---|---|
| 1s | 262 | 24 | 0.024 | 41x |
| 5s | 1,222 | 26 | 0.005 | 190x |
| 10s | 2,581 | 27 | 0.003 | 370x |
| 30s | 10,061 | 32 | 0.001 | 935x |
| 60s | 26,559 | 72 | 0.001 | 833x |

GPU acceleration powered by axiom's Metal graph compiler which fuses the full encoder into optimized MPSGraph operations.

```sh
# Full suite
make bench ARGS="--110m=models/model.safetensors --tdt-600m=models/tdt.safetensors"

# Single model
make bench-single ARGS="--110m=models/model.safetensors --benchmark_filter=110m"

# Markdown table output
./build/parakeet_bench --110m=models/model.safetensors --markdown

# Skip GPU benchmarks
./build/parakeet_bench --110m=models/model.safetensors --no-gpu
```

Available model flags: --110m, --tdt-600m, --rnnt-600m, --sortformer. All Google Benchmark flags (--benchmark_filter, --benchmark_format=json, --benchmark_repetitions=N) are passed through.

  • Audio: 16kHz mono WAV (16-bit PCM or 32-bit float)
  • Offline models have ~4-5 minute audio length limits; split longer files or use streaming models
  • Blank token ID is 1024 (110M) or 8192 (600M)
  • GPU acceleration requires Apple Silicon with Metal support
  • Timestamps use frame-level alignment: frame * 0.08s (8x subsampling × 160 hop / 16kHz)
  • Sortformer diarization uses unnormalized features (normalize = false) — this differs from ASR models
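On the audio-format note above: a 16-bit PCM WAV stores samples as signed integers, which a reader converts to floats before feature extraction. A minimal sketch of that standard scaling (`read_wav`'s internals aren't shown in this document; the function name here is hypothetical):

```cpp
#include <cstdint>
#include <vector>

// Standard 16-bit PCM -> float conversion: divide by 32768 so the
// int16 range [-32768, 32767] maps into [-1.0, ~1.0).
std::vector<float> pcm16_to_float(const std::vector<int16_t>& pcm) {
    std::vector<float> out;
    out.reserve(pcm.size());
    for (int16_t s : pcm)
        out.push_back(static_cast<float>(s) / 32768.0f);
    return out;
}
```

32-bit float WAV files already store samples in this range, so they need no conversion.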

MIT



Comments

  • By noahkay13 2026-02-27 3:48 | 2 replies

    I built a C++ inference engine for NVIDIA's Parakeet speech recognition models using Axiom (https://github.com/Frikallo/axiom), my tensor library.

    What it does:

      - Runs 7 model families: offline transcription (CTC, RNNT, TDT, TDT-CTC), streaming (EOU, Nemotron), and speaker diarization (Sortformer)
      - Word-level timestamps
      - Streaming transcription from microphone input
      - Speaker diarization detecting up to 4 speakers

    • By aaronbrethorst 2026-02-27 7:52 | 1 reply

      I see a number of references to macOS support in your docs for Axiom. Can this run on iOS?

      • By noahkay13 2026-02-27 8:32

        Theoretically, yes? This hasn't been tested, but Xcode has great C++ interop, and the goal with Axiom and now parakeet.cpp is to be used for portable deployments, so making that process easier is definitely on the roadmap.

    • By computerex 2026-02-27 11:26 | 1 reply

      Oh hey, I just implemented this in Go. My implementation is heavily optimized for CPU.

      • By pdyc 2026-02-27 14:53

        Can you share your repo?

  • By ghostpepper 2026-02-27 4:38

    Off topic but if anyone is looking for a nice web-GUI frontend for a locally-hosted transcription engine, Scriberr is nice

    https://github.com/rishikanthc/Scriberr

  • By antirez 2026-02-27 8:20 | 2 replies

    Related:

    https://github.com/antirez/qwen-asr

    https://github.com/antirez/voxtral.c

    Qwen-asr can easily transcribe live radio (see README) on any random laptop. It looks like we are going to see really cool things in local inference, now that automatic programming makes it a lot simpler to create solid pipelines for new models in C, C++, Rust, ..., in a matter of hours.

    • By T0mSIlver 2026-02-27 10:14 | 1 reply

      Your voxtral.c work was a big motivator for me. I built a macOS menu bar dictation app (https://github.com/T0mSIlver/localvoxtral) around Voxtral Realtime, currently using a voxmlx fork with an OpenAI Realtime WebSocket server I added on top.

      The thing that sold me on Voxtral Realtime over Whisper-based models for dictation is the causal encoder. Text streaming in as you speak rather than appearing after you stop is a fundamentally different UX. On M1 Pro with a 4-bit quant through voxmlx it feels responsive enough for natural dictation, though I haven't done proper latency benchmarks yet.

      Integrating voxtral.c as a backend is on my roadmap; compiling to a single native binary makes it much easier to bundle into a macOS app than a Python-based backend.

      • By solarkraft 2026-02-27 14:12

        > Text streaming in as you speak rather than appearing after you stop is a fundamentally different UX

        100%. I don’t understand how people are able to compromise on this.

    • By pjmlp 2026-02-27 9:43 | 1 reply

      Which is why, long term, current programming languages will eventually become less relevant in the whole programming stack: the point is to get the computer to automate tasks, regardless of how.

      • By FpUser 2026-02-27 10:04 | 2 replies

        Assuming RAM prices will not make it totally unaffordable. The current situation is atrocious, and big infrastructure corps seem to love it; they do not want independent computing. Alternatively, they might build specialized branded hardware which people could only use for what the corps allow them to do, for a nice monthly fee.

        Another problem is too much abstraction at the input spec level. The other day I asked Claude to generate a few classes. When reviewing the code I noticed it doing a full scan for ranges on one giant set. This would bring my backend to a halt. After I pointed it out, Claude smartened up and started with a lower_bound() call. When there are no people to notice such things, what do you think we are going to have?

        • By pjmlp 2026-02-27 10:25 | 1 reply

          Agreed in regards to prices; it appears to be the new gold. Let's see how this gets sorted out, with NPUs, FPGAs, analog (Cerebras), ...

          Now, on the abstraction, I am with you. I foresee a more formal way to give specifications, more suitable for natural language as input, or even proper mathematics, than the languages we have been using thus far.

          Naturally we aren't there yet.

          • By FpUser 2026-02-27 11:20 | 1 reply

            >"Naturally we aren't there yet."

            But we were. COBOL ;)

            On a more serious note: sure, we need a spec-development IDE which an LLM would compile to a language of choice (or print an ASIC). It would still not prevent that lower_bound thing from happening, and there will be no people to find out why.

            • By pjmlp 2026-02-27 11:54

              If you look at the amount of stuff people type on tiny chat windows, I would say AI is COBOL's revenge.

              Unfortunately that is already the case when debugging low-code and no-code tools, and good luck having any kind of versioning with those.

        • By MonkeyClub 2026-02-27 11:07

          > Alternatively they might build specialized branded hardware which people could only use for what corps allow them to do for nice monthly fee.

          That's why I'm still holding on to a bulky Core 2 Duo Management Engine-free Fujitsu workstation, for when personal computing finally goes underground again.

HackerNews