Handy – Free open source speech-to-text app

2026-01-15 5:23 · github.com

A free, open source, and extensible speech-to-text application that works completely offline. - cjpais/Handy


Handy is a cross-platform desktop application built with Tauri (Rust + React/TypeScript) that provides simple, privacy-focused speech transcription. Press a shortcut, speak, and have your words appear in any text field—all without sending your voice to the cloud.

Handy was created to fill the gap for a truly open source, extensible speech-to-text tool. As stated on handy.computer:

  • Free: Accessibility tooling belongs in everyone's hands, not behind a paywall
  • Open Source: Together we can build further. Extend Handy for yourself and contribute to something bigger
  • Private: Your voice stays on your computer. Get transcriptions without sending audio to the cloud
  • Simple: One tool, one job. Transcribe what you say and put it into a text box

Handy isn't trying to be the best speech-to-text app—it's trying to be the most forkable one.

How It Works:

  1. Press a configurable keyboard shortcut to start/stop recording (or use push-to-talk mode)
  2. Speak your words while the shortcut is active
  3. Release and Handy processes your speech using Whisper
  4. Get your transcribed text pasted directly into whatever app you're using

The process is entirely local:

  • Silence is filtered using VAD (Voice Activity Detection) with Silero
  • Transcription uses your choice of models:
    • Whisper models (Small/Medium/Turbo/Large) with GPU acceleration when available
    • Parakeet V3 - CPU-optimized model with excellent performance and automatic language detection
  • Works on Windows, macOS, and Linux

Getting Started:

  1. Download the latest release from the releases page or the website
  2. Install the application following platform-specific instructions
  3. Launch Handy and grant necessary system permissions (microphone, accessibility)
  4. Configure your preferred keyboard shortcuts in Settings
  5. Start transcribing!

For detailed build instructions including platform-specific requirements, see BUILD.md.

Handy is built as a Tauri application combining:

  • Frontend: React + TypeScript with Tailwind CSS for the settings UI
  • Backend: Rust for system integration, audio processing, and ML inference
  • Core Libraries:
    • whisper-rs: Local speech recognition with Whisper models
    • transcription-rs: CPU-optimized speech recognition with Parakeet models
    • cpal: Cross-platform audio I/O
    • vad-rs: Voice Activity Detection
    • rdev: Global keyboard shortcuts and system events
    • rubato: Audio resampling

Handy includes an advanced debug mode for development and troubleshooting. Access it by pressing:

  • macOS: Cmd+Shift+D
  • Windows/Linux: Ctrl+Shift+D

This project is actively being developed and has some known issues. We believe in transparency about the current state:

Whisper Model Crashes:

  • Whisper models crash on certain system configurations (Windows and Linux)
  • Does not affect all systems - issue is configuration-dependent
    • If you experience crashes, please provide debug logs; if you're a developer, please help us fix it!

Wayland Support (Linux):

  • Limited support for Wayland display server
  • Requires wtype or dotool for text input to work correctly (see Linux Notes below for installation)

Text Input Tools:

For reliable text input on Linux, install the appropriate tool for your display server:

Display Server   Recommended Tool   Install Command
X11              xdotool            sudo apt install xdotool
Wayland          wtype              sudo apt install wtype
Both             dotool             sudo apt install dotool (requires input group)

  • X11: Install xdotool for both direct typing and clipboard paste shortcuts
  • Wayland: Install wtype (preferred) or dotool for text input to work correctly
  • dotool setup: Requires adding your user to the input group: sudo usermod -aG input $USER (then log out and back in)

Without these tools, Handy falls back to enigo, which may have limited compatibility, especially on Wayland.
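
The table above can be turned into a quick check. This sketch picks the recommended tool from the session type reported by the login manager; XDG_SESSION_TYPE is a standard environment variable, but the apt commands assume a Debian-style distribution:

```shell
# Pick the recommended text-input tool for the current session.
# XDG_SESSION_TYPE is set by most display managers; default to x11 if unset.
session="${XDG_SESSION_TYPE:-x11}"
case "$session" in
    wayland) tool="wtype" ;;
    *)       tool="xdotool" ;;
esac
echo "Recommended tool for $session: $tool (install with: sudo apt install $tool)"
```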

Other Notes:

  • The recording overlay is disabled by default on Linux (Overlay Position: None) because certain compositors treat it as the active window. When the overlay is visible it can steal focus, which prevents Handy from pasting back into the application that triggered transcription. If you enable the overlay anyway, be aware that clipboard-based pasting might fail or end up in the wrong window.

  • If you are having trouble with the app, running it with the environment variable WEBKIT_DISABLE_DMABUF_RENDERER=1 may help.

  • You can manage global shortcuts outside of Handy and still control the app via signals. Sending SIGUSR2 to the Handy process toggles recording on/off, which lets Wayland window managers or other hotkey daemons keep ownership of keybindings. Example (Sway):

    bindsym $mod+o exec pkill -USR2 -n handy

    pkill here simply delivers the signal—it does not terminate the process.
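
The same toggle can be sent from any script or hotkey daemon, not just Sway. A minimal sketch, assuming the running process is named handy; it does nothing harmful if Handy is not running:

```shell
# Toggle Handy's recording via SIGUSR2. pkill -n signals only the newest
# matching process, and -USR2 delivers the signal without terminating it.
if pgrep -x handy >/dev/null 2>&1; then
    pkill -USR2 -n handy && echo "recording toggled"
else
    echo "handy is not running"
fi
```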

Supported Platforms:

  • macOS (both Intel and Apple Silicon)
  • x64 Windows
  • x64 Linux

The following are recommendations for running Handy on your own machine. If you don't meet the system requirements, the performance of the application may be degraded. We are working on improving the performance across all kinds of computers and hardware.

For Whisper Models:

  • macOS: Apple Silicon (M series) or Intel Mac
  • Windows: Intel, AMD, or NVIDIA GPU
  • Linux: Intel, AMD, or NVIDIA GPU

For Parakeet V3 Model:

  • CPU-only operation - runs on a wide variety of hardware
  • Minimum: Intel Skylake (6th gen) or equivalent AMD processors
  • Performance: ~5x real-time speed on mid-range hardware (tested on i5)
  • Automatic language detection - no manual language selection required

We're actively working on several features and improvements. Contributions and feedback are welcome!

Debug Logging:

  • Adding debug logging to a file to help diagnose issues

macOS Keyboard Improvements:

  • Support for Globe key as transcription trigger
  • A rewrite of global shortcut handling for macOS, and potentially other OSes too.

Opt-in Analytics:

  • Collect anonymous usage data to help improve Handy
  • Privacy-first approach with clear opt-in

Settings Refactoring:

  • Clean up and refactor the settings system, which is becoming bloated and messy
  • Implement better abstractions for settings management

Tauri Commands Cleanup:

  • Abstract and organize Tauri command patterns
  • Investigate tauri-specta for improved type safety and organization

If you're behind a proxy, firewall, or in a restricted network environment where Handy cannot download models automatically, you can manually download and install them. The URLs are publicly accessible from any browser.

  1. Open Handy settings
  2. Navigate to the About section
  3. Copy the "App Data Directory" path shown there, or use the shortcuts:
    • macOS: Cmd+Shift+D to open debug menu
    • Windows/Linux: Ctrl+Shift+D to open debug menu

The typical paths are:

  • macOS: ~/Library/Application Support/com.pais.handy/
  • Windows: C:\Users\{username}\AppData\Roaming\com.pais.handy\
  • Linux: ~/.config/com.pais.handy/

Inside your app data directory, create a models folder if it doesn't already exist:

# macOS
mkdir -p ~/Library/Application\ Support/com.pais.handy/models

# Linux
mkdir -p ~/.config/com.pais.handy/models

# Windows (PowerShell)
New-Item -ItemType Directory -Force -Path "$env:APPDATA\com.pais.handy\models"

Download the models you want from the links below.

Whisper Models (single .bin files):

  • Small (487 MB): https://blob.handy.computer/ggml-small.bin
  • Medium (492 MB): https://blob.handy.computer/whisper-medium-q4_1.bin
  • Turbo (1600 MB): https://blob.handy.computer/ggml-large-v3-turbo.bin
  • Large (1100 MB): https://blob.handy.computer/ggml-large-v3-q5_0.bin

Parakeet Models (compressed archives):

  • V2 (473 MB): https://blob.handy.computer/parakeet-v2-int8.tar.gz
  • V3 (478 MB): https://blob.handy.computer/parakeet-v3-int8.tar.gz
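
For a fully scripted install, the files can be fetched with curl. A sketch for the Small Whisper model on Linux; the other URLs work the same way. FETCH_MODELS is a hypothetical guard so the large download only runs when you explicitly ask for it:

```shell
# Download a Whisper model into Handy's models directory (Linux path;
# see the typical paths above for macOS and Windows).
MODELS_DIR="$HOME/.config/com.pais.handy/models"
mkdir -p "$MODELS_DIR"
URL="https://blob.handy.computer/ggml-small.bin"
DEST="$MODELS_DIR/$(basename "$URL")"   # keep the exact filename
# Guard the ~487 MB download behind FETCH_MODELS=1 and skip existing files.
if [ "${FETCH_MODELS:-0}" = "1" ] && [ ! -f "$DEST" ]; then
    curl -fL -o "$DEST" "$URL"   # -f: fail on HTTP errors, -L: follow redirects
fi
```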

For Whisper Models (.bin files):

Simply place the .bin file directly into the models directory:

{app_data_dir}/models/
├── ggml-small.bin
├── whisper-medium-q4_1.bin
├── ggml-large-v3-turbo.bin
└── ggml-large-v3-q5_0.bin

For Parakeet Models (.tar.gz archives):

  1. Extract the .tar.gz file
  2. Place the extracted directory into the models folder
  3. The directory must be named exactly as follows:
    • Parakeet V2: parakeet-tdt-0.6b-v2-int8
    • Parakeet V3: parakeet-tdt-0.6b-v3-int8
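
The extraction steps above can be sketched as follows, assuming the Linux app-data path and an archive downloaded to the current directory:

```shell
# Unpack a Parakeet archive into Handy's models directory (Linux path).
MODELS_DIR="$HOME/.config/com.pais.handy/models"
mkdir -p "$MODELS_DIR"
ARCHIVE="parakeet-v3-int8.tar.gz"
if [ -f "$ARCHIVE" ]; then
    # -x extract, -z gunzip, -C change into the target directory first.
    tar -xzf "$ARCHIVE" -C "$MODELS_DIR"
    # Handy looks for a directory named exactly parakeet-tdt-0.6b-v3-int8;
    # rename the extracted directory if it does not already match.
else
    echo "download $ARCHIVE first (see the URLs above)"
fi
```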

Final structure should look like:

{app_data_dir}/models/
├── parakeet-tdt-0.6b-v2-int8/     (directory with model files inside)
│   ├── (model files)
│   └── (config files)
└── parakeet-tdt-0.6b-v3-int8/     (directory with model files inside)
    ├── (model files)
    └── (config files)
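
A quick sanity check of this layout, assuming the Linux app-data path (swap in your platform's path from the list above):

```shell
# Report which of the required Parakeet directory names are present.
MODELS_DIR="$HOME/.config/com.pais.handy/models"
for name in parakeet-tdt-0.6b-v2-int8 parakeet-tdt-0.6b-v3-int8; do
    if [ -d "$MODELS_DIR/$name" ]; then
        echo "found:   $name"
    else
        echo "missing: $name"
    fi
done
```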

Important Notes:

  • For Parakeet models, the extracted directory name must match exactly as shown above
  • Do not rename the .bin files for Whisper models—use the exact filenames from the download URLs
  • After placing the files, restart Handy to detect the new models

Verifying Installation:

  1. Restart Handy
  2. Open Settings → Models
  3. Your manually installed models should now appear as "Downloaded"
  4. Select the model you want to use and test transcription

The goal is to create both a useful tool and a foundation for others to build upon—a well-patterned, simple codebase that serves the community.

We're grateful for the support of our sponsors who help make Handy possible:

Epicenter

MIT License - see LICENSE file for details.

  • Whisper by OpenAI for the speech recognition model
  • whisper.cpp and ggml for amazing cross-platform whisper inference/acceleration
  • Silero for great lightweight VAD
  • Tauri team for the excellent Rust-based app framework
  • Community contributors helping make Handy better

"Your search for the right speech-to-text tool can end here—not because Handy is perfect, but because you can make it perfect for you."



Comments

  • By d4rkp4ttern, 2026-01-15 11:49 (1 reply)

    I’ve tried several, including this one, and I’ve settled on VoiceInk (local, one-time payment), and with Parakeet V3 it’s stunningly fast (near-instant) and accurate enough to talk to LLMs/code-agents, in the sense that the slight drop in accuracy relative to Whisper Turbo3 is immaterial since they can “read between the lines” anyway.

    My regular cycle is to talk informally to the CLI agent and ask it to “say back to me what you understood”, and it almost always produces a nice clean and clear version. This simultaneously works as confirmation of its understanding and also as a sort of spec which likely helps keep the agent on track.

    UPDATE - just tried handy with Parakeet v3, and it works really well too, so I'll use this instead of VoiceInk for a few days. I just also discovered that turning on the "debug" UI with Cmd-shift-D shows additional options like post processing and appending trailing space.

    • By thethimble, 2026-01-15 16:20 (3 replies)

      I wish one of these models was fine tuned for programming.

      I want to be able to say things like "cd ~/projects" or "git push --force".

      • By netghost, 2026-01-15 17:10

        I'll bet you could take a relatively tiny model and get it to translate the transcribed "git force push" or "git push dash dash force" into "git push --force".

        Likewise "cd home slash projects" into "cd ~/projects".

        Maybe with some fine tuning, maybe without.

      • By vismit2000, 2026-01-16 5:31

        You can try the VS Code Speech extension, which works decently well in GitHub Copilot Chat as part of Microsoft's accessibility suite.

      • By swah, 2026-01-18 11:27

        Or just enjoy your last days of cd'ing, this shall pass soon!

  • By blutoot, 2026-01-15 8:28 (5 replies)

    I have dystonia which often stiffens my arms in a way that makes it impossible for me to type on a keyboard. Speech-to-text apps like SuperWhisper have proven to be very helpful for me in such situations. I am hoping to get a similar experience out of "Handy" (very apt naming from my perspective).

    I do, however, wonder if there is a way all these speech-to-text tools can get to the next level. The generated text should not be just a verbatim copy of what I said; depending on the context, it should elaborate. For example, if my cursor is actively inside an editor/IDE with some code, my coding-related verbal prompts should actually generate the right/desired code in that IDE.

    Perhaps this is a bit of combining speech-to-text with computer use.

    • By mritchie712, 2026-01-15 11:10 (3 replies)

      I made something called `ultraplan`. It's a CLI tool that records multi-modal context (audio transcription via local Whisper, screenshots, clipboard content, etc.) into a timeline that AI agents like Claude Code can consume.

      I have a claude skill `/record` that runs the CLI which starts a new recording. I debug, research, etc., then say "finito" (or choose your own stopword). It outputs a markdown file with your transcribed speech interleaved with screenshots and text that you copied. You can say other keywords like "marco" and it will take a screenshot hands-free.

      When the session ends, claude reads the timeline (e.g. looks at screenshots) and gets to work.

      I can clean it up and push to github if anyone would get use out of it.

    • By sipjca, 2026-01-15 8:50

      I totally agree with you and largely what you’re describing is one of the reasons I made Handy open source. I really want to see something like this and see someone go experiment with making it happen. I did hear some people playing with using some small local models (moondream, qwen) to get some more context of the computer itself

      I initially had a ton of keyboard shortcuts in Handy for myself when I had a broken finger and was in a cast. It let me play with the simplest form of this contextual thing, as shortcuts could effectively be mapped to certain apps with very clear use cases.

    • By eddyg, 2026-01-15 10:00

      There’s lots of existing work on “coding by voice” long before LLMs were a thing. For example (from 2013): http://xahlee.info/emacs/emacs/using_voice_to_code.html and the associated HN discussion (“Using Voice to Code Faster than Keyboard”): https://news.ycombinator.com/item?id=6203805

      There’s also more recent-ish research, like https://dl.acm.org/doi/fullHtml/10.1145/3571884.3597130

    • By hasperdi, 2026-01-15 8:50 (1 reply)

      What you said is possible by feeding the output of speech-to-text tools into an LLM. You can prompt the LLM to make sense of what you're trying to achieve and create sets of actions. With a CLI it’s trivial, you can have your verbal command translated into working shell commands. With a GUI it’s slightly more complicated because the LLM agent needs to know what you see on the screen, etc.

      That CLI bit I mentioned earlier is already possible. For instance, on macOS there’s an app called MacWhisper that can send dictation output to an OpenAI‑compatible endpoint.

      • By sipjca, 2026-01-15 8:51 (1 reply)

        Handy can post process with LLMs too! It’s just currently hidden behind a debug menu as an alpha feature (ctrl/cmd+shift+d)

        • By sanex, 2026-01-15 14:21 (1 reply)

          I was just thinking about building something like this; looks like you beat me to the punch, I will have to try it out. I'm curious if you're able to give commands just as well as some wording you want cleaned up. I could see a model being confused between editing the command input into text to be inserted and responding to the command. Sorry if that's unclear, might be better if I just try it.

          • By sipjca, 2026-01-16 0:14

            I’d just try it and fork handy if it doesn’t work how you want :)


  • By ryanshrott, 2026-01-25 21:50

    I've been going down this rabbit hole too. I ended up building DictaFlow (https://dictaflow.vercel.app/) because I needed something that specifically works in VDI/Citrix environments where clipboard pasting is blocked (I work in finance).

    It uses a 'character-typing' method instead of clipboard injection, so it's compatible with pretty much anything remote. Also kept it super lightweight (<50MB RAM) for Windows users who don't want to run a full local server stack.

    Cool to see Handy using the newer models—local voice tech is finally getting good.

HackerNews