RapidSpeech.cpp

On-device speech AI runtime for ASR, TTS, VAD, and voice cloning. Python-simple, C++-native, GGUF-powered.

RapidSpeech.cpp runs speech recognition, text-to-speech, VAD, speaker embedding, and voice cloning on-device. It gives Python developers a simple API while keeping the runtime pure C/C++, backed by ggml and a unified GGUF model format. No cloud API, no speech server, no heavyweight Python model stack.


Python In 60 Seconds

Install

pip install rapidspeech

GPU wheels:

pip install rapidspeech-metal   # macOS / Apple Silicon
pip install rapidspeech-cuda    # Linux / NVIDIA

Text to speech

python python-api-examples/tts/tts-offline.py \
  --model /path/to/omnivoice-f16.gguf \
  --text "Hello, welcome to RapidSpeech." \
  --output output.wav

Speech to text

python python-api-examples/asr/asr-offline.py \
  --model /path/to/funasr-nano-fp16.gguf \
  --audio /path/to/audio.wav

Python API

import rapidspeech

tts = rapidspeech.tts_synthesizer("/path/to/omnivoice-f16.gguf")
tts.set_params(instruct="male, young adult", language="English", seed=42)
pcm = tts.synthesize("Hello from a native speech engine.")
sample_rate = tts.get_sample_rate()

import rapidspeech

asr = rapidspeech.asr_offline("/path/to/funasr-nano-fp16.gguf")
sample_rate = asr.get_model_meta()["audio_sample_rate"]
pcm = ...  # 1-D float32 mono PCM at sample_rate
asr.push_audio(pcm)
asr.process()
print(asr.get_text())
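
The pcm placeholder in the ASR example needs 1-D float32 mono samples. Below is a minimal sketch of one way to produce them from a 16-bit PCM WAV file using only the Python standard library. The helper name load_wav_mono_float32 is hypothetical (it is not part of rapidspeech), and scaling samples to the range [-1.0, 1.0) is an assumption about what the engine expects.

```python
import array
import sys
import wave


def load_wav_mono_float32(path):
    """Read a 16-bit PCM WAV and return (float32 samples, sample_rate).

    Hypothetical helper, not part of rapidspeech. Multi-channel audio is
    averaged down to mono; samples are scaled to [-1.0, 1.0) (assumed range).
    """
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        channels = wf.getnchannels()
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())

    ints = array.array("h")
    ints.frombytes(raw)
    if sys.byteorder == "big":
        ints.byteswap()  # WAV sample data is little-endian

    frames = len(ints) // channels
    mono = array.array(
        "f",
        (
            sum(ints[i * channels + c] for c in range(channels))
            / (channels * 32768.0)
            for i in range(frames)
        ),
    )
    return mono, rate
```

If the returned rate differs from asr.get_model_meta()["audio_sample_rate"], resample before calling push_audio. If the binding requires a NumPy array, numpy.asarray(samples, dtype=numpy.float32) converts the result.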

Why RapidSpeech.cpp

  • Built for the edge: run speech models locally on laptops, servers, browsers, and device-class hardware.
  • Python-simple, C++-native: write Python, run a C++/ggml engine underneath.
  • One model format: ASR, TTS, VAD, and speaker models use GGUF.
  • NumPy in, NumPy out: ASR takes float32 PCM; TTS returns float32 PCM.
  • Edge-first backends: CPU, Metal, CUDA, Vulkan, CANN, OpenCL, and WebGPU.
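
Since TTS returns float32 PCM, saving the result takes a small conversion step. The following is a sketch, under stated assumptions, of writing such samples to a 16-bit mono WAV file with the standard library only. The helper write_wav_int16 is hypothetical, and it assumes the samples lie in [-1.0, 1.0]; values outside that range are clipped.

```python
import array
import sys
import wave


def write_wav_int16(path, samples, sample_rate):
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM WAV.

    Hypothetical helper, not part of rapidspeech. Accepts any iterable of
    floats, including a 1-D NumPy array.
    """
    pcm16 = array.array("h")
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip out-of-range values
        pcm16.append(int(round(s * 32767)))
    if sys.byteorder == "big":
        pcm16.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(int(sample_rate))
        wf.writeframes(pcm16.tobytes())
```

With the objects from the Python API example, a call would look like write_wav_int16("output.wav", pcm, tts.get_sample_rate()).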

Performance Snapshot

Test environment: Apple M1 Pro, funasr-nano-fp16.gguf, 15s audio.

| Configuration | RTF | Wall Time | Notes |
|---|---|---|---|
| CPU -t 4 | 0.465 | 12.4s | CPU-only inference |
| GPU -t 4 | 0.170 | 5.2s | Metal acceleration |
| GPU -t 4 Q4_K | 0.756 | - | Quantized model: GPU dequant overhead |
| CPU -t 4 Q4_K | 0.530 | - | Quantized model CPU inference, 596 MB (3.3x compression) |

RTF is processing time divided by audio duration. Lower is faster; RTF < 1 is faster than real time.
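
As a worked example of that definition, the sketch below computes RTF and, going the other way, the processing time implied by a given RTF. The function names are illustrative only. For the 15 s test clip, an RTF of 0.465 implies about 7.0 s of processing and 0.170 implies about 2.6 s. The wall times in the table are larger than these figures, so they presumably include work outside the timed inference (for example model loading); that reading is an inference, not something the benchmark states.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def implied_processing_seconds(rtf, audio_seconds):
    """Invert the definition: processing time = RTF * audio duration."""
    return rtf * audio_seconds


audio_seconds = 15.0               # test clip length from the table
cpu_rtf, gpu_rtf = 0.465, 0.170    # fp16 rows from the table

print(implied_processing_seconds(cpu_rtf, audio_seconds))
print(implied_processing_seconds(gpu_rtf, audio_seconds))
print(cpu_rtf / gpu_rtf)           # GPU speedup over CPU for this clip
```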


Supported Today

| Task | Models | Status |
|---|---|---|
| ASR | SenseVoice-small, FunASR-nano | Stable |
| VAD | Silero VAD, FireRedVAD | Stable |
| TTS | OmniVoice, OpenVoice2, Kokoro | Active |
| Speaker | CAMPPlus | Stable |

In Progress

CosyVoice3, Qwen3-ASR, Qwen3-TTS.


Documentation


Native C++ CLI

Download Models

Models are available on:

Build from Source

git clone https://github.com/RapidAI/RapidSpeech.cpp
cd RapidSpeech.cpp
git submodule sync && git submodule update --init --recursive
cmake -B build
cmake --build build --config Release

Build artifacts are located in the build/ directory:

  • rs-asr-offline — Offline ASR command-line tool
  • rs-asr-vad-online — VAD-segmented quasi-streaming ASR command-line tool
  • rs-tts-offline — Offline TTS command-line tool
  • rs-quantize — Model quantization tool

Core Commands

Offline ASR

./build/rs-asr-offline \
  -m /path/to/funasr-nano-fp16.gguf \
  -w /path/to/audio.wav \
  -t 4 \
  --gpu true
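
To script the command above, for example to transcribe many files, one option is a thin Python wrapper around the binary. This is a sketch: build_asr_command and transcribe are hypothetical helpers, only the flags shown in the command above are used, and the tool's output format is not described here, so stdout is returned unparsed.

```python
import subprocess


def build_asr_command(model, wav, threads=4, gpu=False,
                      binary="./build/rs-asr-offline"):
    """Assemble the argument list for the offline ASR tool.

    Uses only the flags shown in this README: -m, -w, -t, --gpu.
    """
    cmd = [binary, "-m", str(model), "-w", str(wav), "-t", str(threads)]
    if gpu:
        cmd += ["--gpu", "true"]
    return cmd


def transcribe(model, wav, **kwargs):
    """Run the tool and return its raw stdout (format not assumed)."""
    result = subprocess.run(
        build_asr_command(model, wav, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.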

VAD-segmented ASR

./build/rs-asr-offline \
  -m /path/to/funasr-nano-fp16.gguf \
  -v /path/to/silero_vad_v6.gguf \
  -w /path/to/audio.wav \
  -t 4 \
  --vad-threshold 0.5 \
  --silence-ms 600
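
The two VAD flags suggest a threshold-plus-silence-timeout scheme. The sketch below shows one common way such parameters are interpreted: a frame counts as speech when its probability reaches the threshold, and a segment closes once the trailing silence reaches the timeout. This is an illustration of the general technique, not the tool's actual implementation; segment_speech and frame_ms are assumptions introduced here.

```python
def segment_speech(probs, frame_ms, threshold=0.5, silence_ms=600):
    """Group per-frame speech probabilities into (start_ms, end_ms) segments.

    Illustrative only: a frame is speech if prob >= threshold, and an open
    segment is closed after silence_ms of consecutive non-speech.
    """
    segments = []
    start = None            # start time of the currently open segment
    last_speech_end = None  # end time of the most recent speech frame
    for i, p in enumerate(probs):
        frame_start = i * frame_ms
        frame_end = frame_start + frame_ms
        if p >= threshold:
            if start is None:
                start = frame_start
            last_speech_end = frame_end
        elif start is not None and frame_end - last_speech_end >= silence_ms:
            segments.append((start, last_speech_end))
            start = None
    if start is not None:   # flush a segment still open at end of audio
        segments.append((start, last_speech_end))
    return segments
```

Under this scheme, a larger silence timeout merges short pauses into one segment, while a smaller one splits speech into more, shorter segments.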

Text to speech

./build/rs-tts-offline \
  -m /path/to/omnivoice-f16.gguf \
  -t "Hello, welcome to RapidSpeech!" \
  --instruct "male, young adult, moderate pitch" \
  --lang English \
  --n-steps 32 \
  -o output.wav

Quantization

./build/rs-quantize /path/to/input-f16.gguf /path/to/output-q4_k.gguf q4_k
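
After quantizing, the size reduction can be checked by comparing the two files on disk. A minimal sketch, with compression_ratio as a hypothetical helper; by the same arithmetic, the 596 MB Q4_K file at 3.3x compression in the performance table implies an original of roughly 1,970 MB.

```python
import os


def compression_ratio(original_path, quantized_path):
    """Return original size divided by quantized size (e.g. 3.3 for 3.3x)."""
    quantized_size = os.path.getsize(quantized_path)
    if quantized_size == 0:
        raise ValueError("quantized file is empty")
    return os.path.getsize(original_path) / quantized_size
```

For the command above, the call would be compression_ratio("/path/to/input-f16.gguf", "/path/to/output-q4_k.gguf").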

Python

See Python examples for offline ASR, streaming ASR, offline TTS, streaming TTS, VAD, and voice cloning.


🤝 Contributing

We welcome PRs and discussion in the following areas:

  • Adapting more models to the framework.
  • Refining and optimizing the project architecture.
  • Improving inference performance.

Acknowledgements

  1. Fun-ASR
  2. llama.cpp
  3. ggml
  4. cppjieba — Chinese word segmentation
  5. WeText — text normalization (ITN/TN)
  6. miniaudio — single-file audio I/O

About

RapidSpeech.cpp is a high-performance, edge-native speech intelligence framework written in pure C++. Built atop the ggml tensor library, it is designed to bridge the gap between state-of-the-art Large Speech Models (LSMs) and resource-constrained environments.
