Parleq

How it works

A press, a sentence, a paste.

When you hold the hotkey, four stages run in sequence. Three of them stay on your Mac. The middle one — LLM cleanup — talks to a cloud provider you configure (or skip entirely). This page walks through each stage and notes what gets persisted versus what doesn't.

Example flow

  • Audio: ~3 s WAV, 16 kHz mono
  • Transcript: "lets go to the beach this weekend"
  • Cleaned: "Let's go to the beach this weekend."
  • Pasted: into the focused app

  1. Capture

    on-device

    AVAudioEngine taps the system microphone at the hardware sample rate and converts each buffer in-line to 16 kHz mono Int16 — the format the local ASR model expects. Audio is accumulated as raw samples in process memory; nothing is ever written to disk.

    Bluetooth-aware
    When the system default input is Bluetooth, Parleq overrides to the built-in mic so your music stays in A2DP instead of dropping to HFP/SCO mid-dictation.
    Pre-warm
    The audio unit is instantiated at app launch and a 250 ms silent capture cycle runs once after the speech model finishes loading. This pays the cold-start cost up front, so your first real dictation doesn't begin with a truncated ~90 ms first buffer.
    Live level meter
    Per-buffer RMS is computed on the audio thread and pushed to the overlay so the sound-wave bars animate with your actual voice.
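The per-buffer RMS that drives the level meter reduces to a few lines. A minimal Swift sketch, assuming Float32 samples in a plain array (Parleq's actual audio-thread code reads AVAudioPCMBuffer channel data, which this sketch omits):

```swift
import Foundation

// Root-mean-square level of one buffer of Float32 samples.
// Returns 0 for an empty buffer; otherwise sqrt(mean(x^2)),
// so silence is 0 and a full-scale square wave is 1.
func bufferRMS(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}
```

On the real audio thread this runs once per tap callback and the result is pushed to the overlay; Accelerate's vDSP also has a vectorized RMS routine if a scalar loop ever shows up in profiles.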
  2. Transcribe

    on-device · Apple Neural Engine

    WAV bytes are POSTed to a sidecar process that runs Parakeet TDT v3 (CoreML) on the Apple Neural Engine. Typical latency is ~64 ms for a 5-second clip after the model is warm. The model is ~150 MB and is downloaded from Hugging Face on first launch.

    Custom dictionary biasing
    Terms you've added in Settings travel as a base64-JSON header to the sidecar, which runs an extra CTC keyword-spotting + rescoring pass after the TDT transcription. Aliases for variant spellings ("parlay", "parlez") all match the canonical "Parleq" — the rescorer always emits the canonical form.
    Per-term opt-out
    A term whose phonetics overlap a common word (and so triggers false positives at the speech-recognition layer) can be marked LLM-only. The transcribe stage skips it; the LLM hint still applies.
    Local-only
    No cloud transcription. No per-call cost. No cold-start delay once the model has loaded. Audio bytes never leave your Mac.
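In code, the hand-off to the sidecar is an ordinary local HTTP POST. A sketch under assumed names (the port, path, and header name are placeholders, not Parleq's actual wire contract):

```swift
import Foundation

// Base64-encoded JSON array of custom-dictionary terms, as carried in a header.
func dictionaryHeader(_ terms: [String]) throws -> String {
    try JSONEncoder().encode(terms).base64EncodedString()
}

// POST in-memory WAV bytes to the local ASR sidecar and return the transcript.
// The URL and "X-Custom-Dictionary" header name are illustrative placeholders.
func transcribe(wav: Data, terms: [String]) async throws -> String {
    var request = URLRequest(url: URL(string: "http://127.0.0.1:8765/transcribe")!)
    request.httpMethod = "POST"
    request.setValue("audio/wav", forHTTPHeaderField: "Content-Type")
    request.setValue(try dictionaryHeader(terms),
                     forHTTPHeaderField: "X-Custom-Dictionary")
    request.httpBody = wav  // in-memory body; nothing touches disk
    let (body, _) = try await URLSession.shared.data(for: request)
    return String(decoding: body, as: UTF8.self)
}
```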
  3. Clean up

    cloud LLM (or skipped)

    The raw transcript streams to a configurable AI provider that lightly cleans the text — capitalization, punctuation, filler-word removal, common transcription errors, and spoken numbers turned into digits when context is technical. Cleaned words stream into the overlay as they arrive.

    Pluggable providers
    Five options: Google Gemini direct API (default — free tier, ~500–700 ms TTFT), Google Vertex AI (same Gemini models on GCP for IAM + audit logs + data residency), AWS Bedrock (Anthropic Claude or OpenAI GPT-OSS), Azure OpenAI (GPT-4o + gpt-5 family on Microsoft's contract), or skip cleanup entirely and paste raw ASR. The choice is made in Settings or via the first-run setup wizard.
    Auth flexibility per provider
    Each cloud supports both pasted API keys (stored in the macOS Keychain) and your existing CLI session — gcloud Application Default Credentials for Vertex (or service-account JSON), AWS SSO / static IAM keys / scoped Bedrock API keys for Bedrock, az login or resource keys for Azure. Parleq never stores long-lived cloud session tokens directly; the AWS/GCP/Azure CLIs handle refresh through their own caches.
    Skippable
    Pick "None — paste raw ASR (skip cleanup)" from the Settings provider list. Parleq will paste the raw transcript exactly as the on-device speech model emitted it. Useful when transcript content must never leave the device.
    Custom dictionary hint
    Your dictionary feeds a smart-vocabulary addendum to the cleanup prompt — terms with optional context blurbs and aliases. The LLM judges topic alignment and prefers your canonical spellings without force-correcting genuine homophones.
    Refinement
    When the overlay is already open, the next hotkey press re-runs this stage with a different system prompt that takes your speech as an edit instruction over the existing text. Walked through in detail in the section below.
  4. Paste

    on-device

    On accept (auto-timer or manual hotkey tap), the cleaned text pastes into whatever app was focused when you pressed the hotkey originally — not whatever happens to be focused at accept time. Synthesized CGEvent keystrokes deliver the paste; the trailing-space heuristic adds a space after the pasted text by default.

    Focused-app capture
    The original target is captured at hotkey-down. If your focus drifted while dictating, the paste still lands in the right window.
    Trailing-space override
    Specific apps (your terminal, terminal-based editors, anything that handles its own spacing) skip the trailing space. Configurable per-app in Settings.
    Recent Dictations
    The last 20 cleaned dictations are kept in process memory and surfaced under the menu bar. Click any entry to copy it back to the clipboard if a paste landed somewhere unexpected. Never written to disk; wiped on app quit.
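The accept path can be sketched with AppKit and CoreGraphics. This is a minimal illustration assuming a pasteboard-plus-⌘V strategy; Parleq's real paste also re-targets the app captured at hotkey-down, which is omitted here:

```swift
import AppKit
import CoreGraphics

// The trailing-space heuristic, given the per-app override from Settings.
func applyTrailingSpace(_ text: String, skipForApp: Bool) -> String {
    skipForApp ? text : text + " "
}

// Put text on the pasteboard, then synthesize Cmd+V into the frontmost app.
func paste(_ text: String, skipTrailingSpace: Bool) {
    let pb = NSPasteboard.general
    pb.clearContents()
    pb.setString(applyTrailingSpace(text, skipForApp: skipTrailingSpace),
                 forType: .string)

    let src = CGEventSource(stateID: .combinedSessionState)
    let vKey = CGKeyCode(9)  // 'v' on ANSI keyboard layouts
    let down = CGEvent(keyboardEventSource: src, virtualKey: vKey, keyDown: true)
    let up = CGEvent(keyboardEventSource: src, virtualKey: vKey, keyDown: false)
    down?.flags = .maskCommand
    up?.flags = .maskCommand
    down?.post(tap: .cghidEventTap)
    up?.post(tap: .cghidEventTap)
}
```

Posting keyboard events like this requires the Accessibility permission the app already needs for its global hotkey.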

The loop

Refine until it's right.

Stages 3 and 4 don't have to run just once. While the overlay is open, each subsequent hotkey press re-runs stage 3 with a different system prompt — a refine prompt that takes the existing text plus your new utterance as an edit instruction and produces the smallest change that fully accomplishes it. Stage 4 only fires when you tap to accept.

It's a voice undo/redo loop. The visible text is the working state; each press replaces it with the LLM's edit. Tone, format, length, structure — anything you can describe in a sentence is a valid edit.

After stage 3 (initial cleanup)

"Yeah, we should probably move the meeting to next Thursday because too many people are out this week."

You press ⌥ and say

"make it more professional"

Refine pass — same LLM, different prompt

"I'd like to move our meeting to next Thursday — too many team members are out this week."

You press ⌥ and say

"shorter, end with a question mark"

"Could we move our meeting to next Thursday?"

You press ⌥ and say

"add 'attendance has been thin lately' as the reason"

"Could we move our meeting to next Thursday? Attendance has been thin lately."

↳ tap ⌥ — stage 4 runs, text pastes

Implementation note: the refine prompt lives in SystemPrompts.swift alongside the cleanup prompt. Both run through the same streaming LLM provider; the only differences are the system prompt text and the user-message shape (refine includes the prior text + edit instruction in a single user message).
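That user-message shape can be made concrete. A sketch with role strings and prompt wording invented for illustration (the real text lives in SystemPrompts.swift):

```swift
struct ChatMessage {
    let role: String
    let content: String
}

// Refine pass: one system prompt plus ONE user message that carries both the
// current text and the spoken edit instruction. Wording here is illustrative.
func refineMessages(systemPrompt: String,
                    currentText: String,
                    instruction: String) -> [ChatMessage] {
    [
        ChatMessage(role: "system", content: systemPrompt),
        ChatMessage(role: "user", content: """
            Current text:
            \(currentText)

            Edit instruction:
            \(instruction)
            """),
    ]
}
```

Keeping the prior text in the user message, rather than replaying the whole overlay history, is what makes each press a fresh single-turn call: the visible text is the only state.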

Privacy posture

What stays on your Mac.

Persisted

  • ~/.parleq/config.json — settings + custom dictionary, user-authored.
  • ~/.parleq/usage.jsonl — one line per LLM call: timestamp, model, token counts, latency. Metadata only.
  • ~/.parleq/app.log — diagnostic log. ASR/LLM metrics are length-only ("post-utterance 87 ms, 142 chars / 28 words"); never the transcript itself.
  • Provider secrets in the macOS Keychain — Gemini API key, Bedrock API key, AWS static credentials, Vertex service-account JSON, Azure resource API key. Whichever you've configured.
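For concreteness, one usage.jsonl record might look like the following. The field names and values here are invented for illustration; the actual schema may differ:

```json
{"ts":"2025-01-14T09:32:11Z","model":"gemini-2.0-flash","prompt_tokens":412,"completion_tokens":38,"latency_ms":642}
```

Nothing in the record identifies what was said — only how much, with which model, and how fast.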

Never on disk

  • Audio bytes. WAV / PCM stays in process memory only. The HTTP body to the sidecar is in-memory; there's no /tmp/parleq-*.wav.
  • Transcripts. Raw ASR output stays in memory.
  • Cleaned text. Held in the overlay during cleanup, then in a 20-entry in-memory ring for the menu's Recent Dictations submenu, then gone on app quit. Never serialized.
  • Cloud session tokens (AWS / GCP / Azure). CLI-session auth modes delegate token refresh to the official AWS / gcloud / az CLI caches at ~/.aws/sso/cache/, ~/.config/gcloud/, and ~/.azure/. Parleq stores no long-lived session tokens directly.

Read the full enterprise-review packet in SECURITY_REVIEW.md.