feat: add Qwen3-TTS backend for multilingual text-to-speech #290

Alex-Wengg wants to merge 6 commits into main
Conversation
Speaker Diarization Benchmark Results

Speaker Diarization Performance: evaluating "who spoke when" detection accuracy.
Diarization Pipeline Timing Breakdown: time spent in each stage of speaker diarization.
Speaker Diarization Research Comparison: research baselines typically achieve 18-30% DER on standard datasets.

Note: RTFx shown above is from the GitHub Actions runner. On Apple Silicon with ANE:

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 40.6s diarization time • Test runtime: 2m 39s • 02/13/2026, 03:53 PM EST
Parakeet EOU Benchmark Results

✅ Status: Benchmark passed

Performance Metrics
Streaming Metrics

Test runtime: 2m 19s • 02/13/2026, 03:57 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O
Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode): optimal clustering with the Hungarian algorithm for maximum accuracy.
Offline VBx Pipeline Timing Breakdown: time spent in each stage of batch diarization.
Speaker Diarization Research Comparison: offline VBx achieves competitive accuracy with batch processing.

Pipeline Details:

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 389.6s processing • Test runtime: 9m 13s • 02/13/2026, 03:57 PM EST
ASR Benchmark Results

✅ Status: All benchmarks passed

Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming. 25 files per dataset • Test runtime: 8m 12s • 02/13/2026, 03:50 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: total audio duration ÷ total processing time
Expected RTFx performance on physical M1 hardware:
- M1 Mac: ~28x (clean), ~25x (other)

Testing methodology follows the HuggingFace Open ASR Leaderboard.
Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Sortformer High-Latency • ES2004a • Runtime: 3m 37s • 2026-02-13T20:45:32.459Z
VAD Benchmark Results

Performance Comparison
Dataset Details

✅: Average F1-Score above 70%
force-pushed from ca5bc7c to acca996
force-pushed from acca996 to 37ef324
Add CoreML-based Qwen3-TTS inference pipeline supporting English and Chinese synthesis. The pipeline implements prefill → LM decode (CB0) → code predictor (CB1-15) → audio decoder with temperature + top-k sampling for natural speech generation and proper EOS detection.

Key components:
- Qwen3TtsSynthesizer: Full inference pipeline with KV-cache management, 16-codebook generation, and automatic silence trimming
- Qwen3TtsModelStore: CoreML model loading for prefill, decode, code predictor, and audio decoder models
- Qwen3TtsManager: High-level API for model loading and synthesis
- Qwen3TtsConstants: Model dimensions, special tokens, and generation parameters matching the PyTorch reference implementation
- CLI support via --backend qwen3 flag with bilingual test sentences
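A minimal sketch of the temperature + top-k sampling step this commit describes; the function name and exact shape are hypothetical, and the real implementation lives in Qwen3TtsSynthesizer.swift:

```swift
import Foundation

// Hypothetical sketch of temperature + top-k sampling for CB0 token selection.
func sampleTopK(logits: [Float], temperature: Float, topK: Int) -> Int {
    // 1. Temperature scaling: lower values sharpen the distribution.
    let scaled = logits.map { $0 / temperature }
    // 2. Keep only the topK highest-scoring token ids.
    let candidates = Array(scaled.enumerated()
        .sorted { $0.element > $1.element }
        .prefix(topK))
    // 3. Softmax over the survivors (subtract max for numerical stability).
    let maxLogit = candidates[0].element
    let weights = candidates.map { exp(Double($0.element - maxLogit)) }
    // 4. Draw one token from the resulting categorical distribution.
    var draw = Double.random(in: 0..<weights.reduce(0, +))
    for (candidate, weight) in zip(candidates, weights) {
        draw -= weight
        if draw < 0 { return candidate.offset }
    }
    return candidates[0].offset
}
```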
Add automatic model download from the alexwengg/qwen3-tts-coreml repo, matching the PocketTTS download pattern. Models are cached locally at ~/.cache/fluidaudio/Models/qwen3-tts/.

Changes:
- Add qwen3Tts repo to ModelNames.swift with model file definitions
- Add Qwen3TtsResourceDownloader for HuggingFace auto-download
- Update Qwen3TtsModelStore to use mlmodelc bundles and support both auto-download (loadIfNeeded) and local directory loading
- Add Qwen3TtsManager.initialize() for auto-download workflow
- Update CLI to auto-download by default (QWEN3_TTS_MODEL_DIR env var still supported for local override)
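A minimal sketch of the download-if-needed pattern, assuming the cache path and QWEN3_TTS_MODEL_DIR override described above; the type and helper names are hypothetical (the real logic is in Qwen3TtsResourceDownloader and Qwen3TtsModelStore):

```swift
import Foundation

// Hypothetical sketch of the cache-then-download flow described in this commit.
enum Qwen3TtsCache {
    static let repoId = "alexwengg/qwen3-tts-coreml"

    /// Resolve the local model directory, honoring the QWEN3_TTS_MODEL_DIR
    /// override before falling back to ~/.cache/fluidaudio/Models/qwen3-tts/.
    static func modelDirectory() -> URL {
        if let override = ProcessInfo.processInfo.environment["QWEN3_TTS_MODEL_DIR"] {
            return URL(fileURLWithPath: override)
        }
        return FileManager.default.homeDirectoryForCurrentUser
            .appendingPathComponent(".cache/fluidaudio/Models/qwen3-tts")
    }

    /// Return the cached .mlmodelc bundle, downloading only when it is missing.
    /// The download closure stands in for the HuggingFace fetch step.
    static func ensureModel(
        named name: String,
        download: (String, URL) async throws -> Void
    ) async throws -> URL {
        let bundle = modelDirectory().appendingPathComponent("\(name).mlmodelc")
        if !FileManager.default.fileExists(atPath: bundle.path) {
            try await download(repoId, bundle)
        }
        return bundle
    }
}
```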
- Add repetition_penalty=1.3 matching the PyTorch default
- Penalize the last 20 CB0 tokens to prevent repetitive loops
- Fix Chinese TTS producing silent audio
- Adjust temperature (0.7) and topK (30) for cleaner output
- Add audio post-processing with de-essing
- Document issues and fixes in docs/qwen3-tts-coreml-issues.md

Before: CB0 stuck at the same values, only 27/125 unique, Chinese silent.
After: 98% unique CB0, natural EOS, both EN/ZH transcribe correctly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
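A minimal sketch of the windowed repetition penalty this commit adds, following the transformers convention; the helper name and the generatedCb0 buffer are hypothetical:

```swift
// Hypothetical helper mirroring the repetition penalty described above:
// divide positive logits, multiply negative ones, so a repeated token's
// probability always drops regardless of sign.
func applyRepetitionPenalty(_ logits: inout [Float], history: some Sequence<Int>, penalty: Float) {
    for token in Set(history) {
        let score = logits[token]
        logits[token] = score > 0 ? score / penalty : score * penalty
    }
}

// Per this commit: penalize only the 20 most recent CB0 tokens at strength 1.3.
// applyRepetitionPenalty(&logits, history: generatedCb0.suffix(20), penalty: 1.3)
```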
- CB0: repetition_penalty 1.3→1.05 on ALL prior tokens (was last 20)
- CB0: add min_new_tokens=2 (suppress EOS for the first 2 steps)
- CB0: fix processing order to match transformers _get_logits_processor (rep_penalty → suppress → min_new_tokens → temp → top_k)
- CP: temperature 0.7→0.9, topK 30→50 (matches PyTorch CP generate)
- Disable audio post-processing (de-essing was muffling output)
- Add codebook dump for debugging comparison with the Python pipeline

The Python CoreML pipeline is verified byte-for-byte identical to PyTorch with these params. The Swift pipeline is untested with the new params.

Co-Authored-By: Claude <noreply@anthropic.com>
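A minimal sketch of the corrected processing order with this commit's parameter values; the function and parameter names are hypothetical, and top-k filtering is assumed to run inside the sampler afterwards:

```swift
// Hypothetical sketch of the CB0 logits-processing chain fixed in this commit,
// mirroring transformers' _get_logits_processor ordering:
// repetition penalty → EOS suppression (min_new_tokens) → temperature → top-k.
func processCb0Logits(
    _ logits: inout [Float],
    generated: [Int],
    eosTokenId: Int,
    minNewTokens: Int = 2,          // from this commit
    repetitionPenalty: Float = 1.05, // from this commit
    temperature: Float = 0.7
) {
    // 1. Repetition penalty over ALL previously generated CB0 tokens.
    for token in Set(generated) {
        let score = logits[token]
        logits[token] = score > 0 ? score / repetitionPenalty : score * repetitionPenalty
    }
    // 2. min_new_tokens: forbid EOS until at least minNewTokens steps have run.
    if generated.count < minNewTokens {
        logits[eosTokenId] = -.infinity
    }
    // 3. Temperature scaling; top-k filtering then happens in the sampler.
    for i in logits.indices {
        logits[i] /= temperature
    }
}
```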
FluidAudioTTS was renamed to FluidAudioEspeak on main. Move Qwen3TTS files to the new module location so the package builds correctly.
force-pushed from a2157d2 to bfbf3ac
Summary
New files
- Qwen3TtsSynthesizer.swift — Full inference pipeline: KV-cache prefill, CB0 sampling with EOS masking, CB1-15 code prediction, audio decoding, and silence trimming
- Qwen3TtsModelStore.swift — CoreML model loading for prefill, decode, code predictor, and audio decoder
- Qwen3TtsManager.swift — High-level API for model loading and synthesis
- Qwen3TtsConstants.swift — Model dimensions, special token IDs, and generation parameters

Modified files
- TtsBackend.swift — Add qwen3Tts case
- TTSCommand.swift — CLI support via --backend qwen3 with bilingual test sentences

Validation
Test plan
- swift build
- swift run fluidaudio tts --backend qwen3 "Hello world, this is a test of the text to speech system."
- swift run fluidaudio tts --backend qwen3 "你好世界，这是一个文字转语音系统的测试。"
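For programmatic use, a minimal hypothetical sketch: Qwen3TtsManager.initialize() is named in the commit messages above, but synthesize(_:) and its return type are assumptions, not the verified API.

```swift
import FluidAudioEspeak  // Qwen3TTS sources were moved into this module (see rename commit)

// Hypothetical usage sketch, not the verified API surface.
let tts = try await Qwen3TtsManager.initialize()  // auto-downloads CoreML models on first use
let samples: [Float] = try await tts.synthesize(  // assumed convenience; real signature may differ
    "Hello world, this is a test of the text to speech system.")
print("Generated \(samples.count) audio samples")
```

🤖 Generated with Claude Code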