How We Run Two Neural Networks on Your MacBook
Resonant ships with two speech-to-text models: NVIDIA Parakeet TDT v3 and Qwen3 ASR. Both run entirely on your Mac, on the Apple Neural Engine, with no cloud fallback.
This post explains how we got there: the model selection, the compilation pipeline, the runtime architecture, and the performance characteristics.
Why two models
No single speech model is best at everything. Parakeet v3 leads on English and European languages — it achieves under 4% Word Error Rate on LibriSpeech test-clean, which is competitive with cloud APIs. But its language coverage stops at 25 languages.
Qwen3 ASR covers 30+ languages and handles code-switching — speaking Mandarin mid-sentence in an English dictation, for instance. Its accuracy on any single language is slightly lower than a specialist model, but its breadth is unmatched for a model this size.
We ship both so you get specialist accuracy for common languages and broad coverage for everything else. The default is Parakeet v3. Switch to Qwen3 ASR from the model picker — one click, no download wait (both are pre-installed).
CoreML compilation
Both models are published as PyTorch checkpoints. To run on Apple Neural Engine (ANE), we convert them to CoreML format using Apple's coremltools library. This involves:
1. Tracing the model — running a forward pass with representative input to capture the computation graph.
2. Quantization — reducing weights from FP32 to FP16 or INT8 where possible. This cuts model size roughly in half without measurable accuracy loss on our benchmarks.
3. ANE optimization — restructuring operations to map efficiently to ANE compute units. Certain ops (like grouped convolutions in Parakeet's Conformer encoder) need manual attention to avoid falling back to CPU.
4. Validation — comparing CoreML output against PyTorch output across 500+ test utterances to verify accuracy parity.
The compiled models are bundled with Resonant. No downloading on first launch, no model manager, no waiting.
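The validation step can be sketched as a simple parity check: transcribe the same utterances with both backends and measure how often the hypotheses disagree. This is an illustrative sketch, not our actual test harness — the `wer` edit-distance helper and the `tolerance` threshold are assumptions for the example.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def parity(pt_outputs, coreml_outputs, tolerance=0.01):
    """Compare the two backends against each other, not against ground truth.

    Passes if the mean disagreement (WER between the two hypothesis sets)
    stays under the tolerance.
    """
    disagreement = sum(wer(a, b) for a, b in zip(pt_outputs, coreml_outputs))
    return disagreement / len(pt_outputs) <= tolerance
```

Comparing backend-to-backend rather than against ground truth isolates conversion drift from model error: a passing parity run means the CoreML model says the same thing PyTorch says, whatever its absolute accuracy.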
Neural Engine, not GPU
Apple Silicon Macs have three compute targets: CPU, GPU, and Neural Engine. Most ML-on-Mac tools default to GPU because it's familiar. We target Neural Engine specifically.
Why:
- No resource contention. The Neural Engine is a dedicated chip. It doesn't compete with your GPU for rendering, compositing, or other ML workloads.
- Lower power draw. ANE is purpose-built for inference at lower wattage than GPU paths. Your battery takes less of a hit.
- Consistent performance. GPU throughput varies with what else is running. ANE throughput is stable because nothing else is using it.
The tradeoff: ANE has a more constrained op set. Some operations that are trivial on GPU need to be decomposed or rewritten for ANE. This is most of the work in our compilation pipeline.
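The decomposition mentioned above for grouped convolutions is a good example of this rewriting. A grouped conv is mathematically equivalent to splitting the channels, running one ordinary convolution per group, and concatenating the results — a form that maps onto ops an accelerator natively supports. This numpy sketch demonstrates the equivalence; it is an illustration of the technique, not the actual graph rewrite our pipeline performs.

```python
import numpy as np

def conv1d(x, w):
    """Plain valid-mode 1-D convolution: x is (C_in, T), w is (C_out, C_in, K)."""
    c_out, c_in, k = w.shape
    t_out = x.shape[1] - k + 1
    y = np.zeros((c_out, t_out))
    for o in range(c_out):
        for t in range(t_out):
            y[o, t] = np.sum(w[o] * x[:, t:t + k])
    return y

def grouped_conv1d(x, w, groups):
    """Grouped conv rewritten as split -> per-group standard conv -> concat.

    x is (C_in, T); w is (C_out, C_in // groups, K), the grouped-conv
    weight layout where each output channel sees only its group's inputs.
    """
    gi = x.shape[0] // groups   # input channels per group
    go = w.shape[0] // groups   # output channels per group
    outs = []
    for g in range(groups):
        xg = x[g * gi:(g + 1) * gi]          # this group's input channels
        wg = w[g * go:(g + 1) * go]          # this group's filters
        outs.append(conv1d(xg, wg))
    return np.concatenate(outs, axis=0)
```

With `groups=1` this degenerates to an ordinary convolution, which is a handy property for testing the rewrite.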
The runtime pipeline
When you press the trigger key and speak, here's what happens:
1. Audio capture — Raw PCM from your microphone. 16kHz mono. Held in a ring buffer in memory — no file written to disk.
2. Voice Activity Detection — Silero VAD runs on CPU (it's tiny) to detect speech boundaries. Clips silence from the start and end of the recording.
3. Neural Engine inference — The selected CoreML model processes the audio on ANE. Parakeet v3 uses a Fast Conformer encoder with a Token-and-Duration Transducer decoder. Output: a token sequence with timestamps.
4. Text formatting — A 19-stage Rust pipeline: filler removal, false-start removal, Inverse Text Normalization (numbers, currency, ordinals, times), custom dictionary corrections, smart punctuation, bullet-point detection, smart spacing. Compiled at startup. Sub-millisecond per dictation.
5. Paste — Clean text lands in the active text field via the system clipboard. Paste latency under 50ms.
6. Audio discard — The ring buffer is released. No audio file is written to disk. The raw waveform exists only in memory for the duration of transcription.
Performance
Some numbers from our benchmarks:
- Parakeet v3 WER: < 4% (LibriSpeech test-clean)
- Parakeet v3 RTF: 0.08–0.15x
- First-token latency: < 100ms (M1 Pro)
- Model load time: ~2s warm, ~5s cold
- Memory usage: 200–700 MB
- ANE utilization: ~80% during inference
Real-time factor (RTF) below 1.0x means the model processes audio faster than real-time. At 0.1x, a 10-second dictation takes about 1 second to transcribe. In practice, the perceived latency is dominated by the audio buffer flush, not the model.
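The RTF arithmetic is worth making explicit. These two helpers are illustrative, not Resonant's benchmark code:

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

def transcription_time(audio_seconds: float, rtf: float) -> float:
    """Expected wall-clock inference time for a clip at a given RTF."""
    return audio_seconds * rtf
```

At the measured 0.08–0.15x range, a 10-second dictation implies roughly 0.8–1.5 seconds of model time — small enough that buffer handling, not inference, sets the latency you feel.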
What this means for you
You get cloud-competitive accuracy with local-only privacy. No audio leaves your Mac. No account is required. No internet connection is needed. You can dictate on an airplane, in a classified facility, or in your therapist's office — and the quality is the same as sitting at your desk with full connectivity.
The hardware caught up to the models. We just did the engineering work to make them run well.