Models · Mar 3, 2026

SenseVoice Small: Offline Dictation for Chinese, Japanese, Korean, and Cantonese on Mac

Finding a good offline dictation model for East Asian languages has historically been harder than for English or other European languages. SenseVoice Small changes that. Developed by the speech team at Alibaba's Tongyi Lab, it covers Mandarin Chinese, Japanese, Korean, Cantonese, and English in a single 226 MB model, and it is among the fastest transcription models available on any platform.

Why SenseVoice is different

Most speech recognition models use autoregressive decoding: they generate tokens one at a time, each token conditioned on the previous ones. This produces accurate results, but decoding time grows with the length of the output because each step has to wait for the one before it.

SenseVoice uses a non-autoregressive CTC architecture, which processes the entire audio sequence in parallel instead of emitting tokens one by one. Its real-time factor (RTF, the ratio of processing time to audio duration) is approximately 0.10, meaning a ten-second clip transcribes in about one second. On Apple Silicon, transcription feels nearly instantaneous for normal dictation lengths.
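As a rough illustration of why CTC output needs no token-by-token loop, here is a minimal sketch of greedy CTC decoding in Python. It is a toy, not SenseVoice's actual decoder: the three-entry vocabulary, the `_` blank symbol, and the function name `ctc_greedy_decode` are all assumptions made for this example. The point is that each frame's label is chosen independently of every other frame, so the expensive part can run in one parallel pass.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this toy example


def ctc_greedy_decode(frame_scores, vocab):
    """Turn per-frame scores into text using greedy CTC decoding.

    frame_scores: one list of scores per audio frame, aligned with vocab.
    vocab: list of labels, including the blank symbol.
    """
    # Step 1: pick the best label for each frame. No frame depends on
    # another frame's result, so this step could run fully in parallel.
    best = [
        vocab[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    # Step 2: collapse runs of the same label into a single label.
    collapsed = [label for label, _ in groupby(best)]
    # Step 3: drop blanks, which only separate genuine repeats.
    return "".join(label for label in collapsed if label != BLANK)


# Toy input: five frames scored over a three-label vocabulary.
vocab = [BLANK, "你", "好"]
frames = [
    [0.10, 0.80, 0.10],  # 你
    [0.20, 0.70, 0.10],  # 你 (repeat, collapsed in step 2)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # 好
    [0.10, 0.20, 0.70],  # 好 (repeat, collapsed in step 2)
]
print(ctc_greedy_decode(frames, vocab))  # prints 你好
```

An autoregressive decoder, by contrast, would need one model call per output token, each waiting on the previous one.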

For languages like Mandarin, where character density is high and typing is particularly slow by comparison, that speed makes SenseVoice an exceptionally practical dictation tool.
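To put the speed figure in concrete terms, the real-time factor quoted above is processing time divided by audio duration. The short Python sketch below works through that arithmetic. The helper names are hypothetical, and 0.10 is the approximate figure from this article rather than a measurement of your machine.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration. Lower is faster."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def estimated_processing_seconds(audio_seconds, rtf=0.10):
    """Estimate transcription time for a clip at a given RTF."""
    return audio_seconds * rtf


# A ten-second clip at RTF 0.10 takes about one second to transcribe.
print(f"{estimated_processing_seconds(10.0):.1f} s")  # prints 1.0 s

# Going the other way: one second of processing for ten seconds of audio.
print(f"RTF {real_time_factor(1.0, 10.0):.2f}")  # prints RTF 0.10
```

Any RTF below 1.0 means the model transcribes faster than you can speak, so at 0.10 a thirty-second dictation would take roughly three seconds.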

Automatic language detection

SenseVoice auto-detects among its five supported languages. If your work involves switching between Mandarin and English in the same session — common in business, academic, or bilingual professional contexts — you don't need to change any settings. Speak in either language and SenseVoice follows.

The same applies to Japanese-English or Korean-English switching. You can dictate a sentence in Japanese and then continue in English within the same recording, and the model handles the transition.

SenseVoice vs. dedicated language models

Resonant also includes dedicated models for Japanese (Zipformer Japanese) and Korean (Zipformer Korean). For Japanese-only speakers, Zipformer Japanese — trained on 35,000 hours of ReazonSpeech data — will generally outperform SenseVoice on Japanese content. Similarly, a Mandarin-primary speaker who needs maximum accuracy should consider FireRedASR Large.

But SenseVoice wins on versatility. If you work across multiple East Asian languages or regularly mix them with English, a single model that handles all five is more practical than switching between dedicated ones. At 226 MB, the download cost is low.
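The trade-off in this section can be summarized as a small decision rule. The Python sketch below is only an illustration of the reasoning above, not a feature of Resonant: the function name, the short language codes, and the `need_max_accuracy` flag are all hypothetical. The Korean-only branch is also an assumption, since the article names Zipformer Korean as a dedicated model without comparing its accuracy.

```python
# Hypothetical short codes for the five languages SenseVoice Small covers.
SENSEVOICE_LANGUAGES = {"zh", "ja", "ko", "yue", "en"}


def suggest_model(languages, need_max_accuracy=False):
    """Suggest a model name for the set of languages you dictate in."""
    langs = set(languages)
    if not langs or not langs <= SENSEVOICE_LANGUAGES:
        raise ValueError("expected a non-empty subset of the five languages")
    if langs == {"ja"}:
        # Japanese only: the dedicated model generally does better.
        return "Zipformer Japanese"
    if langs == {"ko"}:
        # Assumption: treated the same way as the Japanese-only case.
        return "Zipformer Korean"
    if langs == {"zh"} and need_max_accuracy:
        # Mandarin only, with accuracy as the top priority.
        return "FireRedASR Large"
    # Several languages, or any mix with English: one model for all five.
    return "SenseVoice Small"


print(suggest_model({"ja"}))                          # prints Zipformer Japanese
print(suggest_model({"zh", "en"}))                    # prints SenseVoice Small
print(suggest_model({"zh"}, need_max_accuracy=True))  # prints FireRedASR Large
```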

How to enable it in Resonant

Go to Settings → Transcription and select “SenseVoice Small”. The 226 MB model downloads quickly, and Resonant switches to it as soon as the download completes. No language configuration is needed because auto-detection is always on.

Processing runs entirely on your Mac. Your audio never reaches a server. For workflows that involve sensitive content in any of SenseVoice's five languages, the local architecture means nothing you dictate is shared.

Download Resonant to try SenseVoice Small offline.