Rust implementation of Mistral's Voxtral Mini 4B Realtime runs in your browser

2026-02-10 · github.com


HuggingFace Live Demo

Streaming speech recognition running natively and in the browser. A pure Rust implementation of Mistral's Voxtral Mini 4B Realtime model using the Burn ML framework.

The Q4 GGUF quantized path (2.5 GB) runs entirely client-side in a browser tab via WASM + WebGPU. Try it live.

# Download model weights (~9 GB)
uv run --with huggingface_hub \
  hf download mistralai/Voxtral-Mini-4B-Realtime-2602 --local-dir models/voxtral

# Transcribe an audio file (f32 SafeTensors path)
cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
  --audio audio.wav --model models/voxtral

# Or use the Q4 quantized path (~2.5 GB)
cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
  --audio audio.wav --gguf models/voxtral-q4.gguf --tokenizer models/voxtral/tekken.json

# Build WASM package
wasm-pack build --target web --no-default-features --features wasm

# Generate a self-signed cert (WebGPU requires a secure context)
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
  -keyout /tmp/voxtral-key.pem -out /tmp/voxtral-cert.pem \
  -days 7 -nodes -subj "/CN=localhost"

# Start dev server
bun serve.mjs

Open https://localhost:8443, accept the certificate, and click Load from Server to download the model shards. Record from your microphone or upload a WAV file to transcribe.

Hosted demo on HuggingFace Spaces if you want to skip local setup.

Audio (16kHz mono)
  -> Mel spectrogram [B, 128, T]
    -> Causal encoder (32 layers, 1280 dim, sliding window 750)
      -> Conv 4x downsample -> Reshape [B, T/16, 5120]
        -> Adapter [B, T/16, 3072]
          -> Autoregressive decoder (26 layers, 3072 dim, GQA 32Q/8KV)
            -> Token IDs -> Text
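The pipeline above implies a fixed samples-to-tokens ratio. A minimal sketch of that arithmetic, assuming the 16 kHz input shown and the 12.5 Hz token rate mentioned in the padding notes (the 1,280-sample constant is derived here, not taken from the repo):

```rust
// Derived, not repo code: at 16 kHz input and 12.5 decoder tokens per
// second, each token corresponds to 16_000 / 12.5 = 1_280 audio samples.
const SAMPLES_PER_TOKEN: usize = 1_280;

fn decoder_tokens(num_samples: usize) -> usize {
    num_samples / SAMPLES_PER_TOKEN
}

fn main() {
    // Two seconds of 16 kHz audio -> 25 decoder tokens.
    assert_eq!(decoder_tokens(2 * 16_000), 25);
    println!("ok");
}
```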
            F32 (native)          Q4 GGUF (native + browser)
Weights     SafeTensors (~9 GB)   GGUF Q4_0 (~2.5 GB)
Linear ops  Burn tensor matmul    Custom WGSL shader (fused dequant + matmul)
Embeddings  f32 tensor (1.5 GiB)  Q4 on GPU (216 MB) + CPU bytes for lookups
Browser     No                    Yes (WASM + WebGPU)
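The Q4 path stores weights in GGUF Q4_0 blocks: 32 weights per block sharing one scale, with each 4-bit value decoded as scale * (nibble - 8). A hedged sketch of that dequantization in plain Rust (the repo's WGSL shader fuses this step with the matmul; the f16 scale decode is elided here by taking the scale as f32):

```rust
/// Dequantize one Q4_0 block: a per-block scale (stored as f16 in GGUF,
/// taken as f32 here for brevity) plus 16 bytes packing 32 4-bit quants.
/// ggml's layout puts the 16 low nibbles first, then the 16 high nibbles.
fn dequant_q4_0(scale: f32, quants: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (i, byte) in quants.iter().enumerate() {
        out[i] = scale * ((byte & 0x0F) as f32 - 8.0); // low nibble
        out[i + 16] = scale * ((byte >> 4) as f32 - 8.0); // high nibble
    }
    out
}

fn main() {
    // 0x18 packs low nibble 8 (-> 0.0) and high nibble 1 (-> -14.0 at scale 2).
    let out = dequant_q4_0(2.0, &[0x18; 16]);
    assert_eq!(out[0], 0.0); // 2.0 * (8 - 8)
    assert_eq!(out[16], -14.0); // 2.0 * (1 - 8)
    println!("ok");
}
```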

The upstream mistral-common library left-pads audio with 32 silence tokens (at 12.5 Hz). After the mel/conv/reshape pipeline, this covers only 16 of the 38 decoder prefix positions with silence — the remaining 22 contain actual audio. The f32 model handles this fine, but Q4_0 quantization makes the decoder sensitive to speech content in the prefix: audio that starts immediately with speech (mic recordings, clips with no leading silence) produces all-pad tokens instead of text.

The left padding is increased to 76 tokens, which maps to exactly 38 decoder tokens of silence and covers the full streaming prefix. See src/audio/pad.rs for details.
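A sketch of that padding arithmetic (illustrative only, not the contents of src/audio/pad.rs): 76 tokens at 12.5 Hz is 6.08 s of leading silence, or 97,280 samples at 16 kHz.

```rust
const SAMPLE_RATE: usize = 16_000; // input audio rate (Hz)
const PAD_TOKENS: usize = 76; // left padding, in 12.5 Hz audio tokens

/// Silence samples to prepend: 76 / 12.5 s * 16_000 Hz = 97_280.
fn pad_samples() -> usize {
    PAD_TOKENS * SAMPLE_RATE * 2 / 25 // integer form of PAD_TOKENS / 12.5 * SAMPLE_RATE
}

/// Prepend silence so the decoder prefix sees only padding tokens.
fn left_pad(audio: &[f32]) -> Vec<f32> {
    let mut padded = vec![0.0f32; pad_samples()];
    padded.extend_from_slice(audio);
    padded
}

fn main() {
    assert_eq!(pad_samples(), 97_280);
    assert_eq!(left_pad(&[0.5; 100]).len(), 97_380);
    println!("ok");
}
```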

Running a 4B model in a browser tab meant working around five hard constraints:

  1. 2 GB allocation limit — ShardedCursor reads across multiple Vec<u8> buffers
  2. 4 GB address space — Two-phase loading: parse weights, drop reader, then finalize
  3. 1.5 GiB embedding table — Q4 embeddings on GPU + CPU-side row lookups
  4. No sync GPU readback — All tensor reads use into_data_async().await
  5. 256 workgroup invocation limit — Patched cubecl-wgpu to cap reduce kernel workgroups
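A hedged sketch of the first workaround, assuming a ShardedCursor shaped roughly like a std::io::Read over several sub-2 GB buffers (the name comes from the list above; this implementation is illustrative, not the repo's):

```rust
use std::io::{self, Read};

/// Present several separately allocated buffers (each under the browser's
/// 2 GB allocation limit) as one contiguous readable stream.
struct ShardedCursor {
    shards: Vec<Vec<u8>>,
    shard_idx: usize, // which shard we are reading from
    offset: usize,    // position within the current shard
}

impl ShardedCursor {
    fn new(shards: Vec<Vec<u8>>) -> Self {
        Self { shards, shard_idx: 0, offset: 0 }
    }
}

impl Read for ShardedCursor {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        while self.shard_idx < self.shards.len() {
            let shard = &self.shards[self.shard_idx];
            if self.offset < shard.len() {
                let n = buf.len().min(shard.len() - self.offset);
                buf[..n].copy_from_slice(&shard[self.offset..self.offset + n]);
                self.offset += n;
                return Ok(n);
            }
            // Current shard exhausted: move to the next one.
            self.shard_idx += 1;
            self.offset = 0;
        }
        Ok(0) // EOF
    }
}

fn main() {
    let mut cur = ShardedCursor::new(vec![vec![1, 2, 3], vec![4, 5]]);
    let mut out = Vec::new();
    cur.read_to_end(&mut out).unwrap();
    assert_eq!(out, vec![1, 2, 3, 4, 5]);
    println!("ok");
}
```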
# Native (default features: wgpu + native-tokenizer)
cargo build --release

# With all features
cargo build --release --features "wgpu,cli,hub"

# WASM
wasm-pack build --target web --no-default-features --features wasm
Feature                     Description
wgpu (default)              GPU backend via Burn/CubeCL (WebGPU, Vulkan, Metal)
native-tokenizer (default)  Tekken tokenizer (C deps, not WASM-compatible)
wasm                        Browser support: wasm-bindgen, WebGPU device init, JS bindings
cli                         CLI binary with clap + indicatif
hub                         HuggingFace Hub model downloads
# Unit + integration tests (requires GPU for full suite)
cargo test --features "wgpu,cli,hub"

# Lint
cargo clippy --features "wgpu,cli,hub" -- -D warnings
cargo clippy --no-default-features --features wasm --target wasm32-unknown-unknown -- -D warnings

# E2E browser test (requires Playwright + model shards)
bunx playwright test tests/e2e_browser.spec.ts

GPU-dependent tests (model layer shapes, Q4 matmul, WGSL shader correctness) are skipped in CI since GitHub Actions runners lack a GPU adapter. These tests run locally on any machine with Vulkan, Metal, or WebGPU support.

The GGUF file must be split into shards of 512 MB or less to stay under the browser's ArrayBuffer limit:

split -b 512m models/voxtral-q4.gguf models/voxtral-q4-shards/shard-

The dev server and E2E test discover shards automatically from models/voxtral-q4-shards/.
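A minimal sketch of that discovery step (illustrative; the actual logic lives in the dev server and E2E test): collect `shard-*` files and sort them, relying on `split`'s lexicographic suffixes for concatenation order.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Find GGUF shards in `dir`, sorted so shard-aa, shard-ab, ... come
/// back in concatenation order (split's suffixes sort lexicographically).
fn discover_shards(dir: &str) -> io::Result<Vec<PathBuf>> {
    let mut shards: Vec<PathBuf> = fs::read_dir(dir)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|path| {
            path.file_name()
                .and_then(|name| name.to_str())
                .map_or(false, |name| name.starts_with("shard-"))
        })
        .collect();
    shards.sort();
    Ok(shards)
}

fn main() -> io::Result<()> {
    // Demo against a temp directory with two shards and one unrelated file.
    let dir = std::env::temp_dir().join("voxtral-shard-demo");
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("shard-ab"), b"b")?;
    fs::write(dir.join("shard-aa"), b"a")?;
    fs::write(dir.join("README"), b"ignored")?;
    let shards = discover_shards(dir.to_str().unwrap())?;
    assert_eq!(shards.len(), 2);
    assert!(shards[0].ends_with("shard-aa"));
    println!("ok");
    Ok(())
}
```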

Coming soon: accuracy (WER) and inference speed benchmarks across native and browser targets.

src/
  audio/          # Mel spectrogram, chunking, resampling, padding
  models/         # F32 model: encoder, decoder, adapter, attention, RoPE, KV cache
  gguf/           # Q4 GGUF: reader, loader, model, tensor, WGSL shader, tests
  web/            # WASM bindings: VoxtralQ4, initWgpuDevice, async decode loop
  tokenizer/      # Tekken tokenizer wrapper (native only)
  bin/transcribe  # CLI binary

web/              # Browser demo: index.html, worker.js, voxtral-client.js
tests/            # Integration tests + Playwright E2E spec
scripts/          # Dev scripts: reference implementations, weight inspection, E2E helpers
patches/          # cubecl-wgpu workgroup size fix for WebGPU

Apache-2.0


Read the original article

Comments

  • By HorizonXP 2026-02-10 4:56 · 2 replies

    If folks are interested, @antirez has open-sourced a C implementation of Voxtral Mini 4B here: https://github.com/antirez/voxtral.c

    I have my own fork here: https://github.com/HorizonXP/voxtral.c where I’m working on a CUDA implementation, plus some other niceties. It’s working quite well so far, but I haven’t got it to match Mistral AI’s API endpoint speed just yet.

    • By Ygg2 2026-02-10 6:16 · 1 reply

      There is also another Mistral implementation: https://github.com/EricLBuehler/mistral.rs Not sure what the difference is, but it seems to just be better received overall.

      • By NitpickLawyer 2026-02-10 6:39

        mistral.rs is more like llama.cpp, it's a full inference library written in rust that supports a ton of models and many hardware architectures, not just mistral models.

    • By kingreflex 2026-02-10 12:11 · 3 replies

      hey,

      how does someone get started with doing things like these (writing inference code/ cuda etc..). any guidance is appreciated. i understand one doesn't just directly write these things and this would require some kind of reading. would be great to receive some pointers.

      • By HorizonXP 2026-02-11 0:51 · 1 reply

        You know, I love this comment because you are where I was 15 years ago when I naively decided that I wanted to do my master's in medical biophysics and try to use NVIDIA CUDA to help accelerate some of the work that we were doing. So I have a very... storied history with NVIDIA CUDA, but frankly, it's been years since I've actually written C code at all, let alone CUDA.

        I have to admit that I wrote none of the code in this repo. I asked Codex to go and do it for me. I did a lot of prompting and guidance through some of the benchmarking and tools that I expected it to use to get the result that I was looking for.

        Most of the plans that it generated were outside of my wheelhouse and not something I'm particularly familiar with, but I know it well enough to understand that its plan roughly made sense to me and I just let it go. So the fact that this worked at all is a miracle, but I cannot take credit for it other than telling the AI: what I wanted, how to do it, in loose terms, and helping it when it got stuck.

        BTW, everything above was dictated with the code we generated, except for this sentence. And I added breaklines for paragraphs. That's it.

      • By briandw 2026-02-10 15:46

        These are good lectures and there is also a discord. https://github.com/gpu-mode/lectures

      • By Kilenaitor 2026-02-10 12:35

        Same! Would love any resources. I'm interested more in making models run vs making the models themselves :)

  • By simonw 2026-02-10 6:15 · 2 replies

    I tried the demo and it looks like you have to click Mic, then record your audio, then click "Stop and transcribe" in order to see the result.

    Is it possible to rig this up so it really is realtime, displaying the transcription within a second or two of the user saying something out loud?

    The Hugging Face server-side demo at https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim... manages that, but it's using a much larger (~8.5GB) server-side model running on GPUs.

    • By refulgentis 2026-02-10 6:57

      It's not fast enough to be realtime, though you could do a more advanced UI and a ring buffer and have it as you describe. (ex. I do this with Whisper in Flutter, and also inference GGUFs in llama.cpp via Dart)

      This isn't even close to realtime on M4 Max. Whisper's ~realtime on any device post-2022 with an ONNX implementation. The extra inference cost isn't worth the WER decrease on consumer hardware, or at least, wouldn't be worth the time implementing.

    • By adefa 2026-02-12 1:05

      Hello, I pushed up and merged a PR that greatly improves performance on CUDA, Metal, and in WASM.

      Depending on your hardware, the model is definitely real time (able to transcribe audio faster than the length of the audio).

  • By mentalgear 2026-02-10 6:27 · 1 reply

    Kudos, this is where it's at: open models running on-premise. Preferred by users and businesses. Glad Mistral's got that figured out.

    • By another_twist 2026-02-10 14:17

      Mistral can really end up having its RedHat moment. I think open models will only get more interesting from here.

HackerNews