Soundvibes

Open source voice-to-text under your control

Documentation

Everything you need to know.

Learn how to configure Soundvibes, set up your ideal workflow, and troubleshoot common issues.

Quick Start

Get up and running with just two commands. Soundvibes uses a daemon/client architecture where the daemon handles transcription and the client sends toggle commands.

Step 1

Start the daemon

sv daemon start

Launches the background service and downloads models on first run.

Step 2

Toggle recording

sv

Run this to start/stop capture. Bind it to a hotkey for easy access.

Frequently Asked Questions

Does it run fully offline?

Yes. After the first model download, everything stays local.

How do I start and stop capture?

Start the daemon with sv daemon start, then bind sv to a hotkey and toggle on/off.

How do I install it?

Run the install script: curl -fsSL https://raw.githubusercontent.com/kejne/soundvibes/main/install.sh | sh. It detects your distribution, installs dependencies, downloads the binary, and sets up configuration.

Installation

Automatic Install

The install script auto-detects your distribution, installs dependencies, downloads the binary, and sets up configuration.

curl -fsSL https://raw.githubusercontent.com/kejne/soundvibes/main/install.sh | sh

System Requirements

  • Linux x86_64
  • Working microphone
  • Vulkan libraries (optional)
  • wtype (Wayland) or xdotool (X11) for injection

Runtime Dependencies (Pre-built Binary)

If you install via the install script or download the binary from GitHub Releases, you only need these runtime libraries:

Arch Linux

sudo pacman -Syu alsa-lib vulkan-icd-loader

Ubuntu/Debian

sudo apt-get install -y libasound2 libvulkan1 mesa-vulkan-drivers

Fedora

sudo dnf install -y alsa-lib vulkan-loader mesa-vulkan-drivers

GPU Drivers (Optional)

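For GPU acceleration, also install Vulkan-capable GPU drivers. The Ubuntu/Debian and Fedora commands above include Mesa drivers for AMD/Intel; the Arch command installs only the Vulkan loader, so add vulkan-radeon or vulkan-intel there. For NVIDIA, install the proprietary drivers with Vulkan support. See the GPU Acceleration section for details.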

Building from source? See CONTRIBUTING.md for development dependencies.

Setting Up Your Workflow

The power of Soundvibes comes from binding the sv command to a hotkey. Here are setup instructions for popular desktop environments.

GNOME / KDE / XFCE

Go to Settings → Keyboard → Custom Shortcuts. Add a new shortcut with:

  • Command: sv
  • Binding: Your preferred key (e.g., Ctrl+Alt+Space, F12)

i3 / Sway (Window Managers)

Add to your config file:

bindsym $mod+Shift+v exec sv

Hyprland

Add to hyprland.conf:

bind = $mainMod, V, exec, sv

Auto-start the Daemon

To start the daemon automatically on login, add to your desktop environment's startup applications:

sv daemon start

Or use systemd: systemctl --user enable --now soundvibes (if you ran the install script with service setup)

Configuration File

Soundvibes uses a TOML configuration file located at ~/.config/soundvibes/config.toml. CLI flags override config file values, which override defaults.

Complete Example

# Model behavior
download_model = true          # Allow auto-download on first run
model_size = "small"           # tiny, base, small, medium, large, auto

# Transcription settings
language = "en"                # Default active language context
model_variants = "en"          # en, multilingual, both
device = "default"             # Audio device name
audio_host = "alsa"            # default, alsa
sample_rate = 16000            # Hz (16000 recommended)

# Output settings
format = "plain"               # plain, jsonl
mode = "inject"                # stdout, inject

# VAD (Voice Activity Detection) settings
vad = "on"                     # on, off (or true/false)
vad_silence_ms = 1200          # Silence timeout in milliseconds
vad_threshold = 0.01           # Energy threshold (0.001 - 0.1)
vad_chunk_ms = 100             # Chunk size in milliseconds

# Debug settings
debug_audio = false
debug_vad = false
dump_audio = false             # Save captured audio to WAV
list_devices = false

Model Variants

Soundvibes keeps model contexts per variant (English-only and multilingual), not per language code. Use model_variants to preload en, multilingual, or both.

If omitted, Soundvibes derives this from language (English -> en, otherwise multilingual). Use both to keep both variants warm for fast switching. You can switch active language at runtime with sv daemon set-language --lang <CODE> or sv --toggle-language <CODE>.
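
The derivation rule described above can be sketched as a small shell function. This is only an illustration: the name derive_variant is hypothetical, it assumes the language is given as a plain code such as en or fr, and the daemon's actual logic may differ.

```shell
# Hypothetical helper: mirrors the documented rule
# "English -> en, otherwise multilingual".
derive_variant() {
    case "$1" in
        en) echo "en" ;;
        *)  echo "multilingual" ;;
    esac
}

derive_variant en   # prints: en
derive_variant fr   # prints: multilingual
```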

Priority 1

CLI Flags

Highest priority, overrides everything

Priority 2

Config File

~/.config/soundvibes/config.toml

Priority 3

Defaults

Built-in fallback values
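
The priority order above (CLI flags, then config file, then defaults) amounts to "first non-empty value wins". A minimal sketch of that idea follows; resolve_setting is a hypothetical helper written for illustration and is not part of Soundvibes.

```shell
# Hypothetical helper: returns the first non-empty value among
# CLI flag, config file value, and built-in default (in that order).
resolve_setting() {
    cli_value=$1
    config_value=$2
    default_value=$3
    if [ -n "$cli_value" ]; then
        echo "$cli_value"
    elif [ -n "$config_value" ]; then
        echo "$config_value"
    else
        echo "$default_value"
    fi
}

# model_size: no CLI flag given, config says "medium", default is "small".
resolve_setting "" "medium" "small"      # prints: medium
# A CLI flag overrides both.
resolve_setting "tiny" "medium" "small"  # prints: tiny
```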

CLI Reference

Global Options

Option               Default   Description
--language           en        Transcription language code
--toggle-language    -         Override language for a single toggle call
--download-model     true      Allow downloading missing models automatically
--model-size         small     Model size for all variants: tiny, base, small, medium, large, auto
--model-variants     derived   Preload variants: en, multilingual, both
--device             -         Audio input device name
--audio-host         alsa      Audio host: default, alsa
--sample-rate        16000     Sample rate in Hz
--format             plain     Output format: plain, jsonl
--mode               inject    Output mode: stdout, inject
--vad                on        Voice Activity Detection: on, off
--vad-silence-ms     1200      VAD silence timeout (ms)
--vad-threshold      0.010     VAD energy threshold
--vad-chunk-ms       100       VAD chunk size (ms)
--list-devices       false     List available input devices
--debug-audio        false     Log audio capture details
--debug-vad          false     Log VAD decisions
--dump-audio         false     Save captured audio to WAV

Subcommands

sv daemon start

Start the background daemon process

sv daemon stop

Stop the daemon gracefully

sv daemon status

Show daemon state and active language

sv daemon set-language --lang <CODE>

Switch active language without toggling recording

sv (no arguments)

Send toggle command to daemon (start/stop recording)

Model Settings

Soundvibes uses Whisper models from Hugging Face. Models download automatically on first run, or you can download them manually.

Current Model Policy

Soundvibes uses one configured model size for all daemon language contexts. Set model_size to tiny, base, small, medium, or large; the language-specific model variant is then selected automatically and loaded from the local cache.

Model Language Variants

  • multilingual: Uses `ggml-<size>.bin` for all non-English languages (and English fallback).
  • en: Uses `ggml-<size>.en.bin` for English-only transcription.
  • both: Preloads both variants so language switching has no first-use delay.
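
The file naming in the list above can be expressed as a short shell function. This is a sketch for illustration: model_filename is a hypothetical helper, and it only reproduces the naming pattern stated here without checking that a given size/variant combination is actually published.

```shell
# Hypothetical helper: builds the model file name(s) for a size and variant,
# following the naming shown in the list above.
model_filename() {
    size=$1
    variant=$2
    case "$variant" in
        en)           echo "ggml-${size}.en.bin" ;;
        multilingual) echo "ggml-${size}.bin" ;;
        both)         echo "ggml-${size}.bin"; echo "ggml-${size}.en.bin" ;;
        *)            echo "unknown variant: $variant" >&2; return 1 ;;
    esac
}

model_filename small en            # prints: ggml-small.en.bin
model_filename small multilingual  # prints: ggml-small.bin
```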

Model Storage

Models are stored in: ~/.local/share/soundvibes/models/

You can override the download URL with the SV_MODEL_BASE_URL environment variable for custom mirrors or offline setups.
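
To see which models are already cached, you can resolve the storage directory yourself. The sketch below assumes the models directory follows XDG_DATA_HOME as listed under Environment Variables; model_dir is a hypothetical helper, and it does not account for an SV_MODEL_PATH override.

```shell
# Hypothetical helper: resolves the model directory, honoring XDG_DATA_HOME
# and falling back to ~/.local/share as documented.
model_dir() {
    echo "${XDG_DATA_HOME:-$HOME/.local/share}/soundvibes/models"
}

# List cached model files, if the directory exists.
if [ -d "$(model_dir)" ]; then
    ls -1 "$(model_dir)"
else
    echo "no model directory yet: $(model_dir)"
fi
```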

Per-Language Model Contexts

The daemon keeps model contexts by variant, while active language remains a runtime setting. Use model_variants = "both" to preload both contexts, then switch instantly with sv daemon set-language --lang <CODE>.

Model variant selection is automatic per language key. English uses the English-optimized model variant, while other languages use multilingual variants.

Audio & VAD Configuration

Audio Devices

List available devices to find the correct name for your microphone:

sv --list-devices

Then set it in your config: device = "Your Device Name"

Voice Activity Detection (VAD)

VAD automatically trims silence from the end of recordings. Configure these settings to match your environment:

Setting          Default   Description
vad_silence_ms   1200      How long to wait after speech stops before ending (ms)
vad_threshold    0.010     Energy threshold for detecting speech (0.001 - 0.1)
vad_chunk_ms     100       Audio chunk size for VAD analysis (ms)
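
To get a feel for how vad_silence_ms and vad_chunk_ms interact, the arithmetic below works out how many consecutive silent chunks fit in the silence timeout. It assumes the daemon makes one VAD decision per chunk, which this documentation does not state; silent_chunks_needed is a hypothetical helper.

```shell
# Hypothetical helper: number of whole chunks needed to cover the silence
# timeout (rounded up), assuming one VAD decision per chunk.
silent_chunks_needed() {
    silence_ms=$1
    chunk_ms=$2
    echo $(( (silence_ms + chunk_ms - 1) / chunk_ms ))
}

silent_chunks_needed 1200 100   # prints: 12 (the defaults)
silent_chunks_needed 1200 250   # prints: 5
```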

Quiet Environment

Lower threshold for better sensitivity:

vad_threshold = 0.005

Noisy Environment

Higher threshold to reduce false triggers:

vad_threshold = 0.02

Output Modes & Formats

Mode: inject

Type text at cursor (default)

Transcribed text is automatically typed at your cursor position. Uses wtype on Wayland or xdotool on X11.

Mode: stdout

Print to terminal

Output goes to standard output. Useful for piping to other commands or scripts.

Format: plain

Simple text output

Just the transcribed text, no extra formatting or metadata.

Format: jsonl

Structured JSON lines

Each utterance is emitted as one JSON object per line with the fields type, text, timestamp, utterance, and duration_ms.

Text Injection Requirements

For mode = "inject" to work, you need the appropriate tool for your display server:

  • Wayland: Install wtype (virtual keyboard)
  • X11: Install xdotool (XTest extension)
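
A quick way to check that the right tool is present for your session is sketched below. It relies on the XDG_SESSION_TYPE environment variable set by most login managers; how Soundvibes itself detects the display server is not described here, and pick_injector is a hypothetical helper.

```shell
# Hypothetical helper: picks the injection tool for a session type.
# Values follow the XDG_SESSION_TYPE convention ("wayland", "x11").
pick_injector() {
    case "$1" in
        wayland) echo "wtype" ;;
        x11)     echo "xdotool" ;;
        *)       return 1 ;;
    esac
}

# Check that the tool for the current session is installed.
if tool=$(pick_injector "${XDG_SESSION_TYPE:-}"); then
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool is installed"
    else
        echo "$tool is missing; install it or use --mode stdout"
    fi
else
    echo "could not determine session type from XDG_SESSION_TYPE"
fi
```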

Daemon Management

The daemon runs in the background, listening on a Unix socket for control commands. It handles audio capture, VAD processing, transcription, and active language state.

Socket Location

${XDG_RUNTIME_DIR}/soundvibes/sv.sock

Usually resolves to /run/user/1000/soundvibes/sv.sock
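
To check whether the socket exists, you can resolve the path the same way. This is a sketch: sv_socket_path is a hypothetical helper, and the fallback to /run/user/<UID> when XDG_RUNTIME_DIR is unset follows the default listed under Environment Variables.

```shell
# Hypothetical helper: resolves the control socket path as documented,
# falling back to /run/user/<UID> when XDG_RUNTIME_DIR is unset.
sv_socket_path() {
    echo "${XDG_RUNTIME_DIR:-/run/user/$(id -u)}/soundvibes/sv.sock"
}

# -S is true only if the path exists and is a socket.
if [ -S "$(sv_socket_path)" ]; then
    echo "daemon socket found: $(sv_socket_path)"
else
    echo "no socket at $(sv_socket_path); is the daemon running?"
fi
```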

Lifecycle Commands

sv daemon start                    Launch daemon
sv daemon stop                     Graceful shutdown
sv daemon status                   Current state + language
sv daemon set-language --lang fr   Switch active language

Language-aware Hotkeys

You can keep your default toggle command and add dedicated per-language hotkeys.

sv
sv --toggle-language fr
sv --toggle-language sv

Auto-start on Login

The install script can set up a systemd user service. Or add to your desktop environment's startup:

sv daemon start

GPU Acceleration

Soundvibes uses Vulkan for GPU acceleration, providing significant speedups for transcription. It automatically falls back to CPU if GPU is unavailable.

GPU Drivers

AMD: vulkan-radeon (Arch) or mesa-vulkan-drivers (Ubuntu/Fedora)

NVIDIA: nvidia-utils (Arch) or nvidia-driver with Vulkan support

Intel: vulkan-intel (limited support)

Verify Vulkan

vulkaninfo --summary

The output should list your GPU among the devices.

Performance Comparison

Configuration                       Processing time
Small model on GPU                  ~0.5-1x real-time
Small model on CPU (8 threads)      ~1-2x real-time
Medium model on GPU                 ~1-2x real-time
Medium model on CPU                 ~3-5x real-time

Lower is faster (values read as processing time relative to audio duration). GPU acceleration provides 3-5x speedup for larger models.
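
Reading the figures above as processing time relative to audio length (an interpretation based on "lower is faster"), you can estimate how long a clip takes to transcribe. The helper estimate_seconds is hypothetical and the result is only a rough guide.

```shell
# Hypothetical helper: estimated processing time in seconds for a clip,
# given its length in seconds and a real-time factor from the table.
estimate_seconds() {
    LC_ALL=C awk -v audio="$1" -v factor="$2" \
        'BEGIN { printf "%.1f\n", audio * factor }'
}

estimate_seconds 60 0.5   # prints: 30.0
estimate_seconds 60 3     # prints: 180.0
```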

Environment Variables

Variable                                Purpose
SV_MODEL_PATH                           Override default model path
SV_MODEL_BASE_URL                       Custom model download mirror (e.g., for offline/airgapped setups)
XDG_CONFIG_HOME                         Config directory (default: ~/.config)
XDG_DATA_HOME                           Data directory for models (default: ~/.local/share)
XDG_RUNTIME_DIR                         Runtime directory for socket (default: /run/user/UID)
SV_HARDWARE_TESTS, SV_GPU_TESTS, etc.   Test environment flags (see testing docs)

Troubleshooting

Daemon Issues

"Connection refused" or "No such file"

The daemon isn't running. Start it with: sv daemon start

Daemon won't start

Check if another instance is running: ps aux | grep sv. Kill stale processes if needed.

Permission denied on socket

Check that XDG_RUNTIME_DIR is set and writable. It is usually /run/user/1000 (that is, /run/user/<UID>).

Model Issues

Model download fails

Check your internet connection or set SV_MODEL_BASE_URL to a mirror. You can also download models manually to ~/.local/share/soundvibes/models/.

"Model file not found"

Enable auto-download with download_model = true, then restart the daemon so missing language models can be fetched.

Exit Codes

0 Success
1 Runtime error (general)
2 Config or model error
3 Audio device error
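
In scripts, the exit codes above can be turned into readable messages. The following is a minimal sketch; describe_exit is a hypothetical helper, and the sv call is commented out so the snippet runs without Soundvibes installed.

```shell
# Hypothetical helper: maps a documented exit code to its meaning.
describe_exit() {
    case "$1" in
        0) echo "success" ;;
        1) echo "runtime error (general)" ;;
        2) echo "config or model error" ;;
        3) echo "audio device error" ;;
        *) echo "undocumented exit code: $1" ;;
    esac
}

# Example use in a script:
# sv --list-devices; describe_exit "$?"
describe_exit 3   # prints: audio device error
```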

Debugging

Debug Flags

Enable detailed logging to diagnose issues:

--debug-audio

Log audio capture details, device selection, and sample rates

--debug-vad

Log VAD decisions, energy levels, and trimming behavior

--dump-audio

Save captured audio to WAV file for inspection

Example Debug Session

# Start daemon with debug logging
sv daemon start --debug-audio --debug-vad

# In another terminal, toggle capture
sv

# Check logs in ~/.local/share/soundvibes/logs/ or console output

Audio Problems

Device Not Found

List all available devices and verify your microphone is detected:

sv --list-devices

If your device isn't listed, check ALSA/PulseAudio configuration and ensure the device isn't muted.

No Audio Captured

  • Check microphone isn't muted in your system mixer (alsamixer, pavucontrol)
  • Verify the correct device is selected with --device or config
  • Use --dump-audio to verify audio is being captured
  • Check VAD threshold isn't too high for quiet speech

Poor Transcription Quality

  • Check microphone quality and reduce background noise for cleaner input
  • Ensure language matches your spoken language
  • Speak clearly and at a consistent volume

Text Injection Not Working

  • Verify wtype (Wayland) or xdotool (X11) is installed
  • On Wayland, ensure your compositor supports virtual keyboard protocols
  • Some applications (especially browsers) may block synthetic input for security
  • Use --mode stdout as a workaround

Still have questions? Check the technical documentation on GitHub.