feat: add int8 variant support for Qwen3-ASR#312

Merged
Alex-Wengg merged 13 commits into main from feat/qwen3-asr-int8-variant on Feb 15, 2026

Conversation

@Alex-Wengg
Contributor

Summary

  • Add Qwen3AsrVariant enum (.f32, .int8) so users can choose between full-precision (1.75 GB) and int8-quantized (900 MB) Qwen3-ASR models
  • Add Repo.qwen3AsrInt8 case following the existing parakeetEou160/320 pattern
  • Add --variant f32|int8 flag to qwen3-benchmark and qwen3-transcribe CLI commands
  • Update the download(), downloadAndLoad(), and defaultCacheDirectory() APIs to accept a variant parameter (defaults to .f32); a minimal sketch follows this list
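
A minimal sketch of the new enum described above (the actual declaration lives in the package source and may differ; the String raw value is an assumption added only for illustration, while the CaseIterable/Sendable conformance is noted in the review below):

```swift
// Sketch only; the package's real declaration may differ.
public enum Qwen3AsrVariant: String, CaseIterable, Sendable {
    case f32   // full-precision Qwen3-ASR (~1.75 GB total)
    case int8  // int8-quantized Qwen3-ASR (~900 MB total)
}
```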

Benchmark Results (LibriSpeech test-clean, 20 files)

| Variant | Avg WER | Median RTFx | Decoder Size |
|---------|---------|-------------|--------------|
| f32     | 0.8%    | 2.8x        | 1.1 GB       |
| int8    | 1.3%    | 2.5x        | 571 MB       |

Int8 gives ~50% RAM savings with negligible quality impact.

Usage

```bash
# CLI
fluidaudio qwen3-benchmark --variant int8 --max-files 20
fluidaudio qwen3-transcribe audio.wav --variant int8
```

```swift
// Library API
let models = try await Qwen3AsrModels.downloadAndLoad(variant: .int8)
```
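
A slightly fuller, hedged example: picking the variant at runtime from available memory. Only Qwen3AsrModels.downloadAndLoad(variant:) comes from this PR; the FluidAudio import name, the memory threshold, and the top-level async context are illustrative assumptions.

```swift
import Foundation
import FluidAudio  // assumed module name

// Prefer the smaller int8 decoder (~571 MB) on memory-constrained machines,
// otherwise keep full precision (~1.1 GB). The 8 GB threshold is arbitrary.
let variant: Qwen3AsrVariant =
    ProcessInfo.processInfo.physicalMemory < 8_000_000_000 ? .int8 : .f32

let models = try await Qwen3AsrModels.downloadAndLoad(variant: variant)
```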

Test plan

  • swift build -c release compiles cleanly
  • qwen3-benchmark --variant int8 --max-files 5 runs and produces valid WER/RTFx
  • qwen3-benchmark --max-files 5 (default f32) still works
  • CI build passes

🤖 Generated with Claude Code

@claude
Contributor

claude bot commented Feb 15, 2026

Claude finished @Alex-Wengg's task


PR Review: int8 variant support for Qwen3-ASR ✅

Todo List

  • Create pending review
  • Get PR diff information
  • Review code changes for quality and best practices
  • Add inline feedback comments
  • Submit comprehensive review

Code Review Summary

I've completed a thorough review of this PR which adds int8 variant support for Qwen3-ASR models. The implementation is solid and follows established patterns well.

✅ Key Strengths:

  • Consistent API design following existing parakeetEou160/320 pattern
  • Proper enum implementation with CaseIterable and Sendable conformance
  • Backward compatibility maintained with default .f32 parameter values
  • Good logging integration showing variant selection
  • CLI help text updated appropriately
  • Excellent performance results (50% RAM savings, minimal quality impact)

🔍 Minor Suggestions:

  1. Naming consistency: Comment says "FP16 weights" but case is named f32
  2. Parameter ordering: Consider placing variant after the directory parameter (one possible shape is sketched after this list)
  3. Documentation enhancement: CLI help could include specific RAM usage numbers
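
To make suggestion 2 concrete, one possible shape (the parameter names and stub types below are illustrative, not the PR's actual code):

```swift
import Foundation

// Stub types so the sketch compiles on its own; the real ones live in the package.
enum Qwen3AsrVariant: String, CaseIterable, Sendable { case f32, int8 }
struct Qwen3AsrModels {}

// Keep the existing directory parameter first and append the new `variant`
// with a `.f32` default, so current call sites compile unchanged.
func downloadAndLoad(
    to directory: URL? = nil,
    variant: Qwen3AsrVariant = .f32
) async throws -> Qwen3AsrModels {
    Qwen3AsrModels()  // placeholder body for the sketch
}
```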

Technical Implementation ✅:

  • Repo enum properly handles both variants with shared remote path logic
  • ModelNames.swift correctly groups both variants for required models
  • CLI argument parsing is robust with proper error handling (a minimal parsing sketch follows this list)
  • No breaking changes to existing APIs
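
As a rough illustration of that parsing path (a sketch, not the CLI's actual code; the error type and the String raw values are assumptions):

```swift
// Map a --variant argument to the enum, rejecting anything unexpected.
enum Qwen3AsrVariant: String, CaseIterable { case f32, int8 }

struct VariantParseError: Error, CustomStringConvertible {
    let description: String
}

func parseVariant(_ raw: String) throws -> Qwen3AsrVariant {
    guard let variant = Qwen3AsrVariant(rawValue: raw.lowercased()) else {
        let valid = Qwen3AsrVariant.allCases.map(\.rawValue).joined(separator: "|")
        throw VariantParseError(description: "Unknown --variant '\(raw)'; expected \(valid)")
    }
    return variant
}
```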

The code quality is high and the feature provides significant value for users with memory constraints. This is ready for merge after CI passes.


Add `Qwen3AsrVariant` enum (.f32, .int8) so users can choose between
full-precision (1.75 GB) and int8-quantized (900 MB) Qwen3-ASR models.

- Add `Repo.qwen3AsrInt8` case following the parakeetEou pattern
- Add `--variant f32|int8` flag to qwen3-benchmark and qwen3-transcribe
- Update download/load APIs to accept variant parameter (defaults to .f32)

Benchmarked on LibriSpeech test-clean (20 files):
  f32:  0.8% WER, 2.8x median RTFx, 1.1 GB decoder
  int8: 1.3% WER, 2.5x median RTFx, 571 MB decoder
@Alex-Wengg force-pushed the feat/qwen3-asr-int8-variant branch from d230c3b to ee59912 on February 15, 2026 at 00:42
@github-actions

github-actions bot commented Feb 15, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---------|----------|-----------|--------|----------|------|-------|
| MUSAN   | 92.0%    | 86.2%     | 100.0% | 92.6%    | 695.9x faster | 50 |
| VOiCES  | 92.0%    | 86.2%     | 100.0% | 92.6%    | 764.5x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions

github-actions bot commented Feb 15, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value  | Target | Status | Description |
|--------|--------|--------|--------|-------------|
| DER    | 15.1%  | <30%   |        | Diarization Error Rate (lower is better) |
| JER    | 24.9%  | <25%   |        | Jaccard Error Rate |
| RTFx   | 14.44x | >1.0x  |        | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|-------|----------|---|-------------|
| Model Download | 9.102 | 12.5 | Fetching diarization models |
| Model Compile | 3.901 | 5.4 | CoreML compilation |
| Audio Load | 0.122 | 0.2 | Loading audio file |
| Segmentation | 21.797 | 30.0 | Detecting speech regions |
| Embedding | 36.328 | 50.0 | Extracting speaker voices |
| Clustering | 14.531 | 20.0 | Grouping same speakers |
| Total | 72.695 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|--------|-----|-------|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): runs at roughly 150x real-time (RTFx)
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 72.7s diarization time • Test runtime: 2m 40s • 02/15/2026, 12:12 AM EST

@github-actions

github-actions bot commented Feb 15, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|--------|-------|--------|--------|-------------|
| DER    | 14.5% | <20%   |        | Diarization Error Rate (lower is better) |
| RTFx   | 4.46x | >1.0x  |        | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|-------|----------|---|-------------|
| Model Download | 10.888 | 4.6 | Fetching diarization models |
| Model Compile | 4.666 | 2.0 | CoreML compilation |
| Audio Load | 0.071 | 0.0 | Loading audio file |
| Segmentation | 26.970 | 11.5 | VAD + speech detection |
| Embedding | 232.006 | 98.6 | Speaker embedding extraction |
| Clustering (VBx) | 2.738 | 1.2 | Hungarian algorithm + VBx clustering |
| Total | 235.272 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|--------|-----|------|-------------|
| FluidAudio (Offline) | 14.5% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 261.7s processing • Test runtime: 5m 52s • 02/15/2026, 12:22 AM EST

Contributor

@claude bot left a comment


This PR adds solid int8 variant support for Qwen3-ASR models with excellent API design consistency. The implementation follows established patterns and maintains backward compatibility while providing meaningful performance benefits.

The benchmark results are compelling - achieving 50% RAM reduction (1.75GB → 900MB) with only 0.5% WER increase is an excellent trade-off for resource-constrained scenarios.

Key Strengths:

  • ✅ Consistent API design following existing parakeetEou patterns
  • ✅ Proper enum with Sendable conformance
  • ✅ Backward compatibility maintained with sensible defaults
  • ✅ Comprehensive CLI integration with helpful error handling
  • ✅ No breaking changes to existing APIs

Minor Suggestions:

  1. Consider clarifying the f32 case comment (says "FP16 weights" but case is named f32)
  2. CLI help could include specific RAM usage numbers for clarity
  3. Consider parameter ordering consistency in downloadAndLoad()

Overall, this is a well-implemented feature that adds valuable functionality without disrupting existing workflows. The code quality is high and follows project conventions.

Runs f32 and int8 variants on LibriSpeech test-clean (25 files),
posts WER/CER/RTFx comparison table as PR comment.
@github-actions

github-actions bot commented Feb 15, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|--------|-------|-------------|
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 11.51x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 41.0s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
|--------|-------|-------------|
| Avg Chunk Time | 0.041s | Average chunk processing time |
| Max Chunk Time | 0.082s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 0m47s • 02/15/2026, 12:12 AM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@github-actions

github-actions bot commented Feb 15, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---------|---------|---------|------|--------|
| test-clean | 0.57% | 0.00% | 5.29x |  |
| test-other | 1.35% | 0.00% | 3.71x |  |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---------|---------|---------|------|--------|
| test-clean | 0.80% | 0.00% | 5.39x |  |
| test-other | 1.16% | 0.00% | 3.52x |  |

Streaming (v3)

| Metric | Value | Description |
|--------|-------|-------------|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.65x | Streaming real-time factor |
| Avg Chunk Time | 1.381s | Average time to process each chunk |
| Max Chunk Time | 1.503s | Maximum chunk processing time |
| First Token | 1.640s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|--------|-------|-------------|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.65x | Streaming real-time factor |
| Avg Chunk Time | 1.394s | Average time to process each chunk |
| Max Chunk Time | 1.577s | Maximum chunk processing time |
| First Token | 1.397s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 5m10s • 02/15/2026, 12:13 AM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
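
That calculation as a tiny self-contained sketch (the helper name is mine, not the benchmark code's):

```swift
// RTFx = total audio duration / total processing time (higher is faster).
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}

print(realTimeFactor(audioSeconds: 10, processingSeconds: 5))  // 2.0, i.e. 2x real-time
```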

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

@github-actions

github-actions bot commented Feb 15, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| DER | 33.4% | <35% |  |
| Miss Rate | 24.4% | - | - |
| False Alarm | 0.1% | - | - |
| Speaker Error | 8.9% | - | - |
| RTFx | 20.8x | >1.0x |  |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 1m 38s • 2026-02-15T05:14:18.222Z

Autoregressive decoder is very slow on CI VMs (no ANE, virtualized GPU).
25 files per variant were causing timeouts.
Autoregressive decoder is too slow on CI VMs (no ANE, virtualized GPU)
to run on every PR. Changed to manual trigger with configurable file
count. Results shown in job summary instead of PR comment.
Dropped f32 from CI (only run int8 to validate the quantized variant).
Reduced to 5 files to keep runtime under 45 min on CI VMs.
Posts results as PR comment.
swift run defaults to debug mode which is extremely slow for the
autoregressive decoder loop. Use .build/release/fluidaudiocli directly.
Also stream output so progress is visible in CI logs.
@github-actions

github-actions bot commented Feb 15, 2026

Qwen3-ASR int8 Smoke Test ✅

| Check | Result |
|-------|--------|
| Build |  |
| Model download |  |
| Model load |  |
| Transcription pipeline |  |
| Decoder size | 571 MB (vs 1.1 GB f32) |

Runtime: 5m17s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

CI VMs produce 573% WER / 0.03x RTFx — the CoreML State API with
GPU-resident KV cache doesn't work correctly in virtualized environments.
Changed to a smoke test that verifies int8 model download, load, and
transcribe pipeline works without checking accuracy metrics.
Full benchmarks require physical Apple Silicon.
CoreML State API (GPU-resident KV cache) does not work on
virtualized CI VMs, producing 100% WER / 0.03x RTFx. Convert
to a smoke test that verifies the pipeline (download, load,
transcribe) completes without crashing. Local benchmarks on
physical Apple Silicon show ~1.3% WER / 2.5x RTFx.
Report full metrics (WER, CER, RTFx) for 1 file with a note
that CI VM results are degraded due to lack of physical GPU.
@Alex-Wengg merged commit 76b4526 into main on Feb 15, 2026
11 checks passed
@Alex-Wengg deleted the feat/qwen3-asr-int8-variant branch on February 15, 2026 at 05:27