feat: add int8 variant support for Qwen3-ASR#312

Merged
Alex-Wengg merged 13 commits into main from feat/qwen3-asr-int8-variant on Feb 15, 2026

Conversation

@Alex-Wengg
Contributor

Summary

  • Add Qwen3AsrVariant enum (.f32, .int8) so users can choose between full-precision (1.75 GB) and int8-quantized (900 MB) Qwen3-ASR models
  • Add Repo.qwen3AsrInt8 case following the existing parakeetEou160/320 pattern
  • Add --variant f32|int8 flag to qwen3-benchmark and qwen3-transcribe CLI commands
  • Update the download(), downloadAndLoad(), and defaultCacheDirectory() APIs to accept a variant parameter (defaults to .f32); a minimal sketch follows this list
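
A minimal sketch of the new enum described above (the actual declaration lives in the package source and may differ; the String raw value is an assumption added only for illustration, while the CaseIterable/Sendable conformance is noted in the review below):

```swift
// Sketch only; the package's real declaration may differ.
public enum Qwen3AsrVariant: String, CaseIterable, Sendable {
    case f32   // full-precision Qwen3-ASR (~1.75 GB total)
    case int8  // int8-quantized Qwen3-ASR (~900 MB total)
}
```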

Benchmark Results (LibriSpeech test-clean, 20 files)

| Variant | Avg WER | Median RTFx | Decoder Size |
|---------|---------|-------------|--------------|
| f32     | 0.8%    | 2.8x        | 1.1 GB       |
| int8    | 1.3%    | 2.5x        | 571 MB       |

Int8 gives ~50% RAM savings with negligible quality impact.

Usage

```bash
# CLI
fluidaudio qwen3-benchmark --variant int8 --max-files 20
fluidaudio qwen3-transcribe audio.wav --variant int8
```

```swift
// Library API
let models = try await Qwen3AsrModels.downloadAndLoad(variant: .int8)
```
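
A slightly fuller, hedged example: picking the variant at runtime from available memory. Only Qwen3AsrModels.downloadAndLoad(variant:) comes from this PR; the FluidAudio import name, the memory threshold, and the top-level async context are illustrative assumptions.

```swift
import Foundation
import FluidAudio  // assumed module name

// Prefer the smaller int8 decoder (~571 MB) on memory-constrained machines,
// otherwise keep full precision (~1.1 GB). The 8 GB threshold is arbitrary.
let variant: Qwen3AsrVariant =
    ProcessInfo.processInfo.physicalMemory < 8_000_000_000 ? .int8 : .f32

let models = try await Qwen3AsrModels.downloadAndLoad(variant: variant)
```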

Test plan

  • swift build -c release compiles cleanly
  • qwen3-benchmark --variant int8 --max-files 5 runs and produces valid WER/RTFx
  • qwen3-benchmark --max-files 5 (default f32) still works
  • CI build passes

🤖 Generated with Claude Code

@claude
Contributor

claude bot commented Feb 15, 2026

Claude finished @Alex-Wengg's task


PR Review: int8 variant support for Qwen3-ASR ✅

Todo List

  • Create pending review
  • Get PR diff information
  • Review code changes for quality and best practices
  • Add inline feedback comments
  • Submit comprehensive review

Code Review Summary

I've completed a thorough review of this PR which adds int8 variant support for Qwen3-ASR models. The implementation is solid and follows established patterns well.

✅ Key Strengths:

  • Consistent API design following existing parakeetEou160/320 pattern
  • Proper enum implementation with CaseIterable and Sendable conformance
  • Backward compatibility maintained with default .f32 parameter values
  • Good logging integration showing variant selection
  • CLI help text updated appropriately
  • Excellent performance results (50% RAM savings, minimal quality impact)

🔍 Minor Suggestions:

  1. Naming consistency: Comment says "FP16 weights" but case is named f32
  2. Parameter ordering: Consider placing variant after the directory parameter (one possible shape is sketched after this list)
  3. Documentation enhancement: CLI help could include specific RAM usage numbers
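
To make suggestion 2 concrete, one possible shape (the parameter names and stub types below are illustrative, not the PR's actual code):

```swift
import Foundation

// Stub types so the sketch compiles on its own; the real ones live in the package.
enum Qwen3AsrVariant: String, CaseIterable, Sendable { case f32, int8 }
struct Qwen3AsrModels {}

// Keep the existing directory parameter first and append the new `variant`
// with a `.f32` default, so current call sites compile unchanged.
func downloadAndLoad(
    to directory: URL? = nil,
    variant: Qwen3AsrVariant = .f32
) async throws -> Qwen3AsrModels {
    Qwen3AsrModels()  // placeholder body for the sketch
}
```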

Technical Implementation ✅:

  • Repo enum properly handles both variants with shared remote path logic
  • ModelNames.swift correctly groups both variants for required models
  • CLI argument parsing is robust with proper error handling (a minimal parsing sketch follows this list)
  • No breaking changes to existing APIs
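
As a rough illustration of that parsing path (a sketch, not the CLI's actual code; the error type and the String raw values are assumptions):

```swift
// Map a --variant argument to the enum, rejecting anything unexpected.
enum Qwen3AsrVariant: String, CaseIterable { case f32, int8 }

struct VariantParseError: Error, CustomStringConvertible {
    let description: String
}

func parseVariant(_ raw: String) throws -> Qwen3AsrVariant {
    guard let variant = Qwen3AsrVariant(rawValue: raw.lowercased()) else {
        let valid = Qwen3AsrVariant.allCases.map(\.rawValue).joined(separator: "|")
        throw VariantParseError(description: "Unknown --variant '\(raw)'; expected \(valid)")
    }
    return variant
}
```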

The code quality is high and the feature provides significant value for users with memory constraints. This is ready for merge after CI passes.


Add `Qwen3AsrVariant` enum (.f32, .int8) so users can choose between
full-precision (1.75 GB) and int8-quantized (900 MB) Qwen3-ASR models.

- Add `Repo.qwen3AsrInt8` case following the parakeetEou pattern
- Add `--variant f32|int8` flag to qwen3-benchmark and qwen3-transcribe
- Update download/load APIs to accept variant parameter (defaults to .f32)

Benchmarked on LibriSpeech test-clean (20 files):
  f32:  0.8% WER, 2.8x median RTFx, 1.1 GB decoder
  int8: 1.3% WER, 2.5x median RTFx, 571 MB decoder
@Alex-Wengg force-pushed the feat/qwen3-asr-int8-variant branch from d230c3b to ee59912 on February 15, 2026 at 00:42
@github-actions

github-actions bot commented Feb 15, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---------|----------|-----------|--------|----------|------|-------|
| MUSAN   | 92.0%    | 86.2%     | 100.0% | 92.6%    | 695.9x faster | 50 |
| VOiCES  | 92.0%    | 86.2%     | 100.0% | 92.6%    | 764.5x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions

github-actions bot commented Feb 15, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value  | Target | Status | Description |
|--------|--------|--------|--------|-------------|
| DER    | 15.1%  | <30%   |        | Diarization Error Rate (lower is better) |
| JER    | 24.9%  | <25%   |        | Jaccard Error Rate |
| RTFx   | 14.44x | >1.0x  |        | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|-------|----------|---|-------------|
| Model Download | 9.102 | 12.5 | Fetching diarization models |
| Model Compile | 3.901 | 5.4 | CoreML compilation |
| Audio Load | 0.122 | 0.2 | Loading audio file |
| Segmentation | 21.797 | 30.0 | Detecting speech regions |
| Embedding | 36.328 | 50.0 | Extracting speaker voices |
| Clustering | 14.531 | 20.0 | Grouping same speakers |
| Total | 72.695 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|--------|-----|-------|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): runs at roughly 150x real-time (RTFx)
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 72.7s diarization time • Test runtime: 2m 40s • 02/15/2026, 12:12 AM EST

@github-actions

github-actions bot commented Feb 15, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|--------|-------|--------|--------|-------------|
| DER    | 14.5% | <20%   |        | Diarization Error Rate (lower is better) |
| RTFx   | 4.46x | >1.0x  |        | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|-------|----------|---|-------------|
| Model Download | 10.888 | 4.6 | Fetching diarization models |
| Model Compile | 4.666 | 2.0 | CoreML compilation |
| Audio Load | 0.071 | 0.0 | Loading audio file |
| Segmentation | 26.970 | 11.5 | VAD + speech detection |
| Embedding | 232.006 | 98.6 | Speaker embedding extraction |
| Clustering (VBx) | 2.738 | 1.2 | Hungarian algorithm + VBx clustering |
| Total | 235.272 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|--------|-----|------|-------------|
| FluidAudio (Offline) | 14.5% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 261.7s processing • Test runtime: 5m 52s • 02/15/2026, 12:22 AM EST

Contributor

@claude bot left a comment


This PR adds solid int8 variant support for Qwen3-ASR models with excellent API design consistency. The implementation follows established patterns and maintains backward compatibility while providing meaningful performance benefits.

The benchmark results are compelling - achieving 50% RAM reduction (1.75GB → 900MB) with only 0.5% WER increase is an excellent trade-off for resource-constrained scenarios.

Key Strengths:

  • ✅ Consistent API design following existing parakeetEou patterns
  • ✅ Proper enum with Sendable conformance
  • ✅ Backward compatibility maintained with sensible defaults
  • ✅ Comprehensive CLI integration with helpful error handling
  • ✅ No breaking changes to existing APIs

Minor Suggestions:

  1. Consider clarifying the f32 case comment (says "FP16 weights" but case is named f32)
  2. CLI help could include specific RAM usage numbers for clarity
  3. Consider parameter ordering consistency in downloadAndLoad()

Overall, this is a well-implemented feature that adds valuable functionality without disrupting existing workflows. The code quality is high and follows project conventions.

Runs f32 and int8 variants on LibriSpeech test-clean (25 files),
posts WER/CER/RTFx comparison table as PR comment.
@github-actions

github-actions bot commented Feb 15, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|--------|-------|-------------|
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 11.51x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 41.0s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
|--------|-------|-------------|
| Avg Chunk Time | 0.041s | Average chunk processing time |
| Max Chunk Time | 0.082s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 0m47s • 02/15/2026, 12:12 AM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@github-actions

github-actions bot commented Feb 15, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---------|---------|---------|------|--------|
| test-clean | 0.57% | 0.00% | 5.29x |  |
| test-other | 1.35% | 0.00% | 3.71x |  |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---------|---------|---------|------|--------|
| test-clean | 0.80% | 0.00% | 5.39x |  |
| test-other | 1.16% | 0.00% | 3.52x |  |

Streaming (v3)

| Metric | Value | Description |
|--------|-------|-------------|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.65x | Streaming real-time factor |
| Avg Chunk Time | 1.381s | Average time to process each chunk |
| Max Chunk Time | 1.503s | Maximum chunk processing time |
| First Token | 1.640s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|--------|-------|-------------|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.65x | Streaming real-time factor |
| Avg Chunk Time | 1.394s | Average time to process each chunk |
| Max Chunk Time | 1.577s | Maximum chunk processing time |
| First Token | 1.397s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 5m10s • 02/15/2026, 12:13 AM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
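
That calculation as a tiny self-contained sketch (the helper name is mine, not the benchmark code's):

```swift
// RTFx = total audio duration / total processing time (higher is faster).
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}

print(realTimeFactor(audioSeconds: 10, processingSeconds: 5))  // 2.0, i.e. 2x real-time
```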

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

@github-actions

github-actions bot commented Feb 15, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| DER | 33.4% | <35% |  |
| Miss Rate | 24.4% | - | - |
| False Alarm | 0.1% | - | - |
| Speaker Error | 8.9% | - | - |
| RTFx | 20.8x | >1.0x |  |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 1m 38s • 2026-02-15T05:14:18.222Z

Autoregressive decoder is very slow on CI VMs (no ANE, virtualized GPU).
25 files per variant were causing timeouts.
Autoregressive decoder is too slow on CI VMs (no ANE, virtualized GPU)
to run on every PR. Changed to manual trigger with configurable file
count. Results shown in job summary instead of PR comment.
Dropped f32 from CI (only run int8 to validate the quantized variant).
Reduced to 5 files to keep runtime under 45 min on CI VMs.
Posts results as PR comment.
swift run defaults to debug mode which is extremely slow for the
autoregressive decoder loop. Use .build/release/fluidaudiocli directly.
Also stream output so progress is visible in CI logs.
@github-actions

github-actions bot commented Feb 15, 2026

Qwen3-ASR int8 Smoke Test ✅

| Check | Result |
|-------|--------|
| Build |  |
| Model download |  |
| Model load |  |
| Transcription pipeline |  |
| Decoder size | 571 MB (vs 1.1 GB f32) |

Runtime: 5m17s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

CI VMs produce 573% WER / 0.03x RTFx — the CoreML State API with
GPU-resident KV cache doesn't work correctly in virtualized environments.
Changed to a smoke test that verifies int8 model download, load, and
transcribe pipeline works without checking accuracy metrics.
Full benchmarks require physical Apple Silicon.
CoreML State API (GPU-resident KV cache) does not work on
virtualized CI VMs, producing 100% WER / 0.03x RTFx. Convert
to a smoke test that verifies the pipeline (download, load,
transcribe) completes without crashing. Local benchmarks on
physical Apple Silicon show ~1.3% WER / 2.5x RTFx.
Report full metrics (WER, CER, RTFx) for 1 file with a note
that CI VM results are degraded due to lack of physical GPU.
@Alex-Wengg merged commit 76b4526 into main on Feb 15, 2026
11 checks passed
@Alex-Wengg deleted the feat/qwen3-asr-int8-variant branch on February 15, 2026 at 05:27