
T0mSIlver

Karma: 9

Created: 2026-02-24

Recent Activity

  • Your voxtral.c work was a big motivator for me. I built a macOS menu bar dictation app (https://github.com/T0mSIlver/localvoxtral) around Voxtral Realtime, currently using a voxmlx fork with an OpenAI Realtime WebSocket server I added on top.

    The thing that sold me on Voxtral Realtime over Whisper-based models for dictation is the causal encoder. Text streaming in as you speak rather than appearing after you stop is a fundamentally different UX. On M1 Pro with a 4-bit quant through voxmlx it feels responsive enough for natural dictation, though I haven't done proper latency benchmarks yet.

    Integrating voxtral.c as a backend is on my roadmap; compiling to a single native binary makes it much easier to bundle into a macOS app than a Python-based backend.
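    For anyone curious what the wire format looks like: the client streams microphone audio as base64-encoded PCM16 inside `input_audio_buffer.append` events. A minimal Python sketch of building one such event (event and field names follow the public OpenAI Realtime API; a particular backend may differ):

```python
import base64
import json
import struct

def audio_append_event(pcm16_samples):
    """Wrap raw PCM16 mono samples in an OpenAI Realtime-style
    input_audio_buffer.append event. Event/field names follow the
    public Realtime API; treat this as a sketch, not the app's code."""
    raw = struct.pack(f"<{len(pcm16_samples)}h", *pcm16_samples)
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(raw).decode("ascii"),
    })
```

    The client sends a stream of these over the WebSocket as audio arrives from the mic.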

  • Congrats on the results. The streaming aspect is what I find most exciting here.

    I built a macOS dictation app (https://github.com/T0mSIlver/localvoxtral) on top of Voxtral Realtime, and the UX difference between streaming and offline STT is night and day. Words appearing while you're still talking completely changes the feedback loop. You catch errors in real time, you can adjust what you're saying mid-sentence, and the whole thing feels more natural. Going back to "record then wait" feels broken after that.

    Curious how Moonshine's streaming latency compares in practice. Do you have numbers on time-to-first-token for the streaming mode? And on the serving side, do any of the integration options expose an OpenAI Realtime-compatible WebSocket endpoint?
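    For comparison, the way I measure it client-side is just wall clock from stream start to the first partial-transcript event, something like this (the default event-name matcher is an assumption based on the OpenAI Realtime API):

```python
import time

def time_to_first_token(events,
                        is_delta=lambda e: e.get("type", "")
                        .endswith("transcription.delta")):
    """Seconds from stream start to the first partial-transcript
    event. `events` is any iterable of decoded server events; returns
    None if the stream ends without a delta event."""
    start = time.monotonic()
    for event in events:
        if is_delta(event):
            return time.monotonic() - start
    return None
```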

  • Some technical context and where this is headed.

    Why streaming matters for dictation. Whisper and most open-source STT models use bidirectional attention, meaning they need the full audio clip before they can transcribe anything. You get your text after you stop talking, usually with a noticeable delay. Voxtral Realtime takes a different approach: it has a causal audio encoder that processes audio left-to-right as it arrives. At 480ms delay it matches offline models on accuracy (FLEURS benchmark), but you see text appearing while you're still mid-sentence. For dictation this changes a lot. You can catch mistakes in real time, and the feedback loop feels natural instead of disconnected.
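    The encoder difference boils down to the attention mask. A toy sketch (not the actual model code): a causal encoder masks out future frames, so frame i can be processed as soon as it arrives, while a bidirectional encoder attends everywhere and must wait for the whole clip:

```python
def causal_mask(n):
    """Attention mask for a causal encoder: frame i may attend only
    to frames 0..i, so transcription can start before the audio ends.
    A bidirectional (Whisper-style) encoder uses an all-ones mask
    and needs the full clip first."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
```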

    The app connects to backends via the OpenAI Realtime API WebSocket protocol. It captures audio from your mic, streams it over the WebSocket, and receives partial transcripts that get inserted into your active text field live. Any OpenAI Realtime-compatible server works.
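    Concretely, the client folds server events into a running transcript, roughly like this (event names mirror the OpenAI Realtime API; a sketch rather than the app's actual code):

```python
def apply_transcript_event(current_text, event):
    """Fold one server event into the running transcript: delta
    events append text, a completed event replaces the running text
    with the final transcript, everything else is ignored."""
    etype = event.get("type", "")
    if etype.endswith("transcription.delta"):
        return current_text + event.get("delta", "")
    if etype.endswith("transcription.completed"):
        return event.get("transcript", current_text)
    return current_text
```

    The app then diffs the running transcript against what it has already typed and inserts only the new suffix into the active text field.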

    The voxmlx fork. The original voxmlx by Awni Hannun does local Voxtral inference on Apple Silicon via MLX, but it was CLI-only. I added a WebSocket server that speaks the OpenAI Realtime protocol so localvoxtral (or any compatible client) can connect to it. I also added memory management to avoid OOM on longer sessions. Fork is here: https://github.com/T0mSIlver/voxmlx. I'd like to get the server piece upstreamed eventually.
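    The memory-management piece is conceptually simple: bound how much audio the server retains so a long session can't grow without limit. A sketch of the idea (not the fork's actual code):

```python
from collections import deque

class RollingAudioBuffer:
    """Bounded buffer of audio chunks: once total retained samples
    exceed max_samples, the oldest chunks are dropped. Illustrates
    the kind of memory cap the fork adds for long sessions."""
    def __init__(self, max_samples):
        self.max_samples = max_samples
        self.chunks = deque()
        self.total = 0

    def append(self, chunk):
        self.chunks.append(chunk)
        self.total += len(chunk)
        # Evict oldest chunks, always keeping at least the newest one.
        while self.total > self.max_samples and len(self.chunks) > 1:
            self.total -= len(self.chunks.popleft())
```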

    Latency. On M1 Pro with a 4-bit quantized model, first words appear within roughly 200 to 400 ms. On an RTX 3090 via vLLM it's faster. Both feel responsive enough for natural dictation.

    What's next. Right now you have to start the server yourself before using the app. I want to add app-managed local serving (start/stop/model download) so it's truly one-click. If anyone has experience bundling Python/MLX processes into macOS apps cleanly, I'd love to hear your approach.
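    The app-managed serving piece is mostly process lifecycle: spawn the server, check it's alive, terminate cleanly on quit. A Python sketch of the pattern (the real app would do this from Swift, and the command invoked would be the actual server launch line, so this is purely illustrative):

```python
import subprocess

class LocalServer:
    """Start/stop a local inference server as a child process.
    Sketch of 'app-managed local serving'; the command is supplied
    by the caller and is hypothetical here."""
    def __init__(self, cmd):
        self.cmd = cmd
        self.proc = None

    def start(self):
        # Only spawn if not already running.
        if self.proc is None or self.proc.poll() is not None:
            self.proc = subprocess.Popen(self.cmd)

    def stop(self, timeout=5.0):
        if self.proc and self.proc.poll() is None:
            self.proc.terminate()          # polite SIGTERM first
            try:
                self.proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                self.proc.kill()           # escalate if it hangs
                self.proc.wait()
```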

    Happy to answer questions.

HackerNews