On-device speech AI runtime for ASR, TTS, VAD, and voice cloning. Python-simple, C++-native, GGUF-powered.
RapidSpeech.cpp runs speech recognition, text-to-speech, VAD, speaker embedding, and voice cloning on-device. It gives Python developers a simple API while keeping the runtime pure C/C++, backed by ggml and a unified GGUF model format. No cloud API, no speech server, no heavyweight Python model stack.
```bash
pip install rapidspeech
```

GPU wheels:

```bash
pip install rapidspeech-metal   # macOS / Apple Silicon
pip install rapidspeech-cuda    # Linux / NVIDIA
```

Text to speech (example script)

```bash
python python-api-examples/tts/tts-offline.py \
  --model /path/to/omnivoice-f16.gguf \
  --text "Hello, welcome to RapidSpeech." \
  --output output.wav
```

Speech recognition (example script)

```bash
python python-api-examples/asr/asr-offline.py \
  --model /path/to/funasr-nano-fp16.gguf \
  --audio /path/to/audio.wav
```

Text to speech (Python API)

```python
import rapidspeech

tts = rapidspeech.tts_synthesizer("/path/to/omnivoice-f16.gguf")
tts.set_params(instruct="male, young adult", language="English", seed=42)
pcm = tts.synthesize("Hello from a native speech engine.")
sample_rate = tts.get_sample_rate()
```

Speech recognition (Python API)

```python
import rapidspeech

asr = rapidspeech.asr_offline("/path/to/funasr-nano-fp16.gguf")
sample_rate = asr.get_model_meta()["audio_sample_rate"]
pcm = ...  # 1-D float32 mono PCM at sample_rate
asr.push_audio(pcm)
asr.process()
print(asr.get_text())
```

- Built for the edge: run speech models locally on laptops, servers, browsers, and device-class hardware.
- Python-simple, C++-native: write Python, run a C++/ggml engine underneath.
- One model format: ASR, TTS, VAD, and speaker models use GGUF.
- NumPy in, NumPy out: ASR takes float32 PCM; TTS returns float32 PCM.
- Edge-first backends: CPU, Metal, CUDA, Vulkan, CANN, OpenCL, and WebGPU.
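
The "NumPy in, NumPy out" point can be illustrated with a small helper pair for moving between WAV files and float32 arrays. This is a minimal sketch, not part of RapidSpeech: the function names `load_wav_as_float32` and `save_float32_as_wav` are hypothetical, the sketch assumes 16-bit PCM WAV input and a sample range of [-1, 1] (the README does not state the expected scale), and it does no resampling, so the file's sample rate still has to match the rate the model reports.

```python
import wave

import numpy as np


def load_wav_as_float32(path):
    """Read a 16-bit PCM WAV file; return (1-D mono float32 samples, sample rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # Little-endian int16 -> float32 in [-1, 1) (assumed scale).
    samples = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    if channels > 1:
        # Frames are interleaved: reshape to (n_frames, channels), average to mono.
        samples = samples.reshape(-1, channels).mean(axis=1)
    return samples.astype(np.float32), sample_rate


def save_float32_as_wav(path, pcm, sample_rate):
    """Write 1-D float PCM (assumed to be in [-1, 1]) as a 16-bit mono WAV file."""
    clipped = np.clip(np.asarray(pcm, dtype=np.float32), -1.0, 1.0)
    int16 = (clipped * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(int16.tobytes())
```

Under these assumptions, the array returned by `load_wav_as_float32` is the kind of input `asr.push_audio` is described as taking, and `save_float32_as_wav(path, pcm, tts.get_sample_rate())` would store the array returned by `tts.synthesize`.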
Test environment: Apple M1 Pro, funasr-nano-fp16.gguf, 15s audio.
| Configuration | RTF | Wall Time | Notes |
|---|---|---|---|
| CPU -t 4 | 0.465 | 12.4s | CPU-only inference |
| GPU -t 4 | 0.170 | 5.2s | Metal acceleration |
| GPU -t 4 Q4_K | 0.756 | - | Quantized model: GPU dequant overhead |
| CPU -t 4 Q4_K | 0.530 | - | Quantized model CPU inference, 596 MB (3.3x compression) |
RTF is processing time divided by audio duration. Lower is faster; RTF < 1 is faster than real time.
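
The RTF definition above is simple enough to sketch directly. The helper names below are illustrative, not part of any RapidSpeech tool; the second function just inverts the definition to estimate processing time for the 15 s test clip from the RTF values in the table.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def estimated_processing_seconds(rtf, audio_seconds):
    """Invert the definition: processing time implied by a given RTF."""
    return rtf * audio_seconds


# RTF values taken from the benchmark table, applied to the 15 s test clip.
for label, rtf in [("CPU -t 4", 0.465), ("GPU -t 4", 0.170)]:
    seconds = estimated_processing_seconds(rtf, 15.0)
    print(f"{label}: about {seconds:.1f} s of processing for 15 s of audio")
```

The Wall Time column in the table is larger than these products; the source does not say what else it includes.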
| Task | Models | Status |
|---|---|---|
| ASR | SenseVoice-small, FunASR-nano | Stable |
| VAD | Silero VAD, FireRedVAD | Stable |
| TTS | OmniVoice, OpenVoice2, Kokoro | Active |
| Speaker | CAMPPlus | Stable |
CosyVoice3, Qwen3-ASR, Qwen3-TTS.
- Python examples
- Technical Notes: architecture, design tradeoffs, backends, model conversion, and binding surfaces.
- Browser / WASM examples
- Node.js example
Models are available on:
- 🤗 Hugging Face: https://huggingface.co/RapidAI/RapidSpeech
- ModelScope: https://www.modelscope.cn/models/RapidAI/RapidSpeech
```bash
git clone https://github.com/RapidAI/RapidSpeech.cpp
cd RapidSpeech.cpp
git submodule sync && git submodule update --init --recursive
cmake -B build
cmake --build build --config Release
```

Build artifacts are located in the `build/` directory:

- `rs-asr-offline`: Offline ASR command-line tool
- `rs-asr-vad-online`: VAD-segmented quasi-streaming ASR command-line tool
- `rs-tts-offline`: Offline TTS command-line tool
- `rs-quantize`: Model quantization tool
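
After a build, a quick check that the four tools listed above exist can save a confusing "command not found" later. This is a hedged sketch: `missing_tools` is a hypothetical helper, and the `.exe` fallback is an assumption about Windows builds that the source does not describe.

```python
from pathlib import Path

# Tool names as listed in the build artifacts above.
EXPECTED_TOOLS = [
    "rs-asr-offline",
    "rs-asr-vad-online",
    "rs-tts-offline",
    "rs-quantize",
]


def missing_tools(build_dir="build"):
    """Return the expected tool names that are not present in build_dir."""
    build = Path(build_dir)
    missing = []
    for name in EXPECTED_TOOLS:
        # Accept either a bare name or a ".exe" suffix (assumed for Windows).
        candidates = [build / name, build / f"{name}.exe"]
        if not any(path.is_file() for path in candidates):
            missing.append(name)
    return missing
```

Calling `missing_tools("build")` from the repository root returns an empty list when all four binaries are in place.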
Offline ASR

```bash
./build/rs-asr-offline \
  -m /path/to/funasr-nano-fp16.gguf \
  -w /path/to/audio.wav \
  -t 4 \
  --gpu true
```

VAD-segmented ASR

```bash
./build/rs-asr-offline \
  -m /path/to/funasr-nano-fp16.gguf \
  -v /path/to/silero_vad_v6.gguf \
  -w /path/to/audio.wav \
  -t 4 \
  --vad-threshold 0.5 \
  --silence-ms 600
```

Text to speech

```bash
./build/rs-tts-offline \
  -m /path/to/omnivoice-f16.gguf \
  -t "Hello, welcome to RapidSpeech!" \
  --instruct "male, young adult, moderate pitch" \
  --lang English \
  --n-steps 32 \
  -o output.wav
```

Quantization

```bash
./build/rs-quantize /path/to/input-f16.gguf /path/to/output-q4_k.gguf q4_k
```

See Python examples for offline ASR, streaming ASR, offline TTS, streaming TTS, VAD, and voice cloning.
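
The command-line tools can also be driven from a script, for example to transcribe a batch of files. The sketch below only assembles the argument list from flags that appear in the commands above (`-m`, `-w`, `-t`, `-v`, `--gpu true`, `--vad-threshold`, `--silence-ms`); `build_asr_command` and `run_asr` are hypothetical helpers, and reading the transcript from standard output is an assumption about the tool's behavior, not something the source states.

```python
import subprocess


def build_asr_command(model, audio, threads=4, use_gpu=False, vad_model=None,
                      vad_threshold=0.5, silence_ms=600,
                      binary="./build/rs-asr-offline"):
    """Assemble an rs-asr-offline argument list from the flags shown above."""
    cmd = [binary, "-m", str(model), "-w", str(audio), "-t", str(threads)]
    if vad_model is not None:
        # VAD-segmented mode: pass the VAD model and its two tuning flags.
        cmd += ["-v", str(vad_model),
                "--vad-threshold", str(vad_threshold),
                "--silence-ms", str(silence_ms)]
    if use_gpu:
        # Only the "--gpu true" form appears in the source, so the flag is
        # simply omitted when use_gpu is False.
        cmd += ["--gpu", "true"]
    return cmd


def run_asr(*args, **kwargs):
    """Run the tool and return whatever it wrote to stdout (assumed transcript)."""
    cmd = build_asr_command(*args, **kwargs)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Passing the command as a list rather than a shell string avoids quoting problems with paths that contain spaces.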
If you are interested in the following areas, we welcome your PRs or participation in discussions:
- Adapting more models to the framework.
- Refining and optimizing the project architecture.
- Improving inference performance.
