feat: add int8 variant support for Qwen3-ASR #312
Claude finished @Alex-Wengg's task — PR Review: int8 variant support for Qwen3-ASR ✅
Code Review Summary

I've completed a thorough review of this PR, which adds int8 variant support for Qwen3-ASR models. The implementation is solid and follows established patterns well. Key strengths, minor suggestions, and technical notes are covered in the full review comment below.

The code quality is high, and the feature provides significant value for users with memory constraints. This is ready for merge after CI passes.
Add `Qwen3AsrVariant` enum (`.f32`, `.int8`) so users can choose between full-precision (1.75 GB) and int8-quantized (900 MB) Qwen3-ASR models.

- Add `Repo.qwen3AsrInt8` case following the parakeetEou pattern
- Add `--variant f32|int8` flag to `qwen3-benchmark` and `qwen3-transcribe`
- Update download/load APIs to accept a variant parameter (defaults to `.f32`)

Benchmarked on LibriSpeech test-clean (20 files):

| Variant | WER | Median RTFx | Decoder size |
|---------|-----|-------------|--------------|
| f32 | 0.8% | 2.8x | 1.1 GB |
| int8 | 1.3% | 2.5x | 571 MB |
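For reviewers reading without the diff, a minimal sketch of the shape this API takes. Only `Qwen3AsrVariant`, its two cases, `Repo.qwen3AsrInt8`, and the `.f32` default come from this PR; the raw `String` conformance and the commented call site are assumptions for illustration.

```swift
import Foundation

/// Precision variant for the Qwen3-ASR model weights.
/// The raw String values are an assumption here, convenient for
/// parsing the CLI's `--variant f32|int8` flag.
public enum Qwen3AsrVariant: String, Sendable {
    case f32   // full-precision weights, ~1.75 GB
    case int8  // int8-quantized weights, ~900 MB
}

// Hypothetical call site: the PR threads `variant` through the
// download/load APIs with `.f32` as the default, so existing
// callers keep compiling unchanged.
// let models = try await Qwen3AsrModels.downloadAndLoad(variant: .int8)
```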
Force-pushed from d230c3b to ee59912
VAD Benchmark Results

Performance Comparison

Dataset Details

✅: Average F1-Score above 70%
Speaker Diarization Benchmark Results

Speaker Diarization Performance
Evaluating "who spoke when" detection accuracy

Diarization Pipeline Timing Breakdown
Time spent in each stage of speaker diarization

Speaker Diarization Research Comparison
Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from the GitHub Actions runner; throughput on Apple Silicon with ANE is significantly higher.
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 72.7s diarization time • Test runtime: 2m 40s • 02/15/2026, 12:12 AM EST
Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)
Optimal clustering with the Hungarian algorithm for maximum accuracy

Offline VBx Pipeline Timing Breakdown
Time spent in each stage of batch diarization

Speaker Diarization Research Comparison
Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 261.7s processing • Test runtime: 5m 52s • 02/15/2026, 12:22 AM EST
This PR adds solid int8 variant support for Qwen3-ASR models with excellent API design consistency. The implementation follows established patterns and maintains backward compatibility while providing meaningful performance benefits.
The benchmark results are compelling: a ~50% RAM reduction (1.75 GB → 900 MB) for only a 0.5-point WER increase is an excellent trade-off for resource-constrained scenarios.
Key Strengths:
- ✅ Consistent API design following existing parakeetEou patterns
- ✅ Proper enum with Sendable conformance
- ✅ Backward compatibility maintained with sensible defaults
- ✅ Comprehensive CLI integration with helpful error handling
- ✅ No breaking changes to existing APIs
Minor Suggestions:
- Consider clarifying the `f32` case comment (it says "FP16 weights" but the case is named `f32`)
- CLI help could include specific RAM usage numbers for clarity
- Consider parameter ordering consistency in `downloadAndLoad()` (see the sketch after this review)
Overall, this is a well-implemented feature that adds valuable functionality without disrupting existing workflows. The code quality is high and follows project conventions.
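To make the parameter-ordering suggestion concrete, a hypothetical sketch; the labels, directory parameter, and return types here are illustrative, not the PR's actual signatures.

```swift
import Foundation

// Illustrative only: the point is keeping `variant` in the same
// position across both entry points so call sites read consistently.
protocol Qwen3AsrModelLoading {
    associatedtype Models

    static func download(
        variant: Qwen3AsrVariant,
        to cacheDirectory: URL?
    ) async throws -> URL

    static func downloadAndLoad(
        variant: Qwen3AsrVariant,
        from cacheDirectory: URL?
    ) async throws -> Models
}
```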
Runs f32 and int8 variants on LibriSpeech test-clean (25 files), posts WER/CER/RTFx comparison table as PR comment.
Parakeet EOU Benchmark Results ✅

Status: Benchmark passed

Performance Metrics
Streaming Metrics
Test runtime: 0m47s • 02/15/2026, 12:12 AM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O
ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming.
25 files per dataset • Test runtime: 5m10s • 02/15/2026, 12:13 AM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: total audio duration ÷ total processing time

Expected RTFx Performance on Physical M1 Hardware:
- M1 Mac: ~28x (clean), ~25x (other)

Testing methodology follows the HuggingFace Open ASR Leaderboard.
Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 1m 38s • 2026-02-15T05:14:18.222Z
The autoregressive decoder is very slow on CI VMs (no ANE, virtualized GPU); running 25 files per variant was causing timeouts.
Autoregressive decoder is too slow on CI VMs (no ANE, virtualized GPU) to run on every PR. Changed to manual trigger with configurable file count. Results shown in job summary instead of PR comment.
Dropped f32 from CI (only run int8 to validate the quantized variant). Reduced to 5 files to keep runtime under 45 min on CI VMs. Posts results as PR comment.
`swift run` defaults to a debug build, which is extremely slow for the autoregressive decoder loop. Use `.build/release/fluidaudiocli` directly. Also stream output so progress is visible in CI logs.
Qwen3-ASR int8 Smoke Test ✅
Runtime: 5m17s
Note: the CI VM lacks a physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
CI VMs produce 573% WER / 0.03x RTFx — the CoreML State API with GPU-resident KV cache doesn't work correctly in virtualized environments. Changed to a smoke test that verifies the int8 model download, load, and transcribe pipeline works, without checking accuracy metrics. Full benchmarks require physical Apple Silicon.
CoreML State API (GPU-resident KV cache) does not work on virtualized CI VMs, producing 100% WER / 0.03x RTFx. Convert to a smoke test that verifies the pipeline (download, load, transcribe) completes without crashing. Local benchmarks on physical Apple Silicon show ~1.3% WER / 2.5x RTFx.
Report full metrics (WER, CER, RTFx) for 1 file with a note that CI VM results are degraded due to lack of physical GPU.
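A sketch of what such a smoke test can look like, assuming a hypothetical `Qwen3AsrModels` facade and fixture helper; the repo's actual test names and API may differ.

```swift
import XCTest

final class Qwen3AsrInt8SmokeTests: XCTestCase {
    // Deliberately no WER/RTFx assertions: the GPU-resident KV cache
    // misbehaves on virtualized CI runners, so accuracy numbers are
    // only meaningful on physical Apple Silicon.
    func testInt8PipelineCompletes() async throws {
        let models = try await Qwen3AsrModels.downloadAndLoad(variant: .int8)
        let samples = try loadTestAudio()  // hypothetical fixture helper
        let result = try await models.transcribe(samples)
        XCTAssertFalse(result.text.isEmpty, "expected a non-empty transcript")
    }
}
```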
Summary
- Add `Qwen3AsrVariant` enum (`.f32`, `.int8`) so users can choose between full-precision (1.75 GB) and int8-quantized (900 MB) Qwen3-ASR models
- Add `Repo.qwen3AsrInt8` case following the existing parakeetEou160/320 pattern
- Add `--variant f32|int8` flag to `qwen3-benchmark` and `qwen3-transcribe` CLI commands
- Update `download()`, `downloadAndLoad()`, `defaultCacheDirectory()` APIs to accept a variant parameter (defaults to `.f32`)

Benchmark Results (LibriSpeech test-clean, 20 files)

| Variant | WER | Median RTFx | Decoder size |
|---------|-----|-------------|--------------|
| f32 | 0.8% | 2.8x | 1.1 GB |
| int8 | 1.3% | 2.5x | 571 MB |
Int8 gives ~50% RAM savings with negligible quality impact.
Usage
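Reconstructed from the flags described above; the exact invocations are illustrative:

- `fluidaudiocli qwen3-transcribe --variant int8 <audio-file>` transcribes with the 900 MB quantized model
- `fluidaudiocli qwen3-benchmark --variant f32 --max-files 20` benchmarks the full-precision model
- Omitting `--variant` defaults to `f32`, so existing invocations are unchanged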
Test plan
- `swift build -c release` compiles cleanly
- `qwen3-benchmark --variant int8 --max-files 5` runs and produces valid WER/RTFx
- `qwen3-benchmark --max-files 5` (default f32) still works

🤖 Generated with Claude Code