[TTS] MagpieTTS: Implement Frechet Codec Distance metric + some minor inference bugfixes #15223
base: main
nemo/collections/tts/modules/magpietts_inference/evaluate_generated_audio.py
Signed-off-by: Fejgin, Roy <[email protected]>
Instead of taking a codec instance, accept a codec name: local path or HF/NGC name. This simplifies the metric's integration in calling code.
Signed-off-by: Fejgin, Roy <[email protected]>
* address some CI linting issues
* include a file that was missed in last commit
Signed-off-by: Fejgin, Roy <[email protected]>
8d997ac to 3fc5f37
# Construct a length tensor: one batch element, all frames.
x_len = torch.tensor(x.shape[0], device=x.device, dtype=torch.long).unsqueeze(0)  # (1,)
tokens = x.permute(1, 0).unsqueeze(0)  # 1, C, B*T
embeddings = self.codec.dequantize(tokens=tokens, tokens_len=x_len)  # (B, D, T)
# we treat each time step as a separate example
embeddings = rearrange(embeddings, 'B D T -> (B T) D')
return embeddings
Don't we need some sort of masking here? If we are reducing B and T into one dimension, how do we ensure that no padding gets passed to the model?
If this only works with a batch size of 1, we should add a check that x has a batch dimension of 1, and otherwise raise an error or emit a warning.
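For illustration, one way to address the padding concern raised above is to build a boolean mask from per-item lengths and keep only the valid frames when flattening. The sketch below is not the PR's code: it uses NumPy rather than torch so it runs standalone, and the function name `flatten_valid_frames` and the `lengths` argument are hypothetical.

```python
import numpy as np


def flatten_valid_frames(embeddings: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Flatten (B, D, T) embeddings to (num_valid_frames, D), dropping padded frames.

    Hypothetical helper for illustration only; `lengths` holds the number of
    valid frames for each batch item.
    """
    batch, _, max_len = embeddings.shape
    if lengths.shape != (batch,):
        raise ValueError(f"expected lengths of shape ({batch},), got {lengths.shape}")
    if lengths.max() > max_len:
        raise ValueError("a length exceeds the time dimension of embeddings")

    # mask[b, t] is True where frame t of item b is real data, False where it is padding.
    mask = np.arange(max_len)[None, :] < lengths[:, None]  # (B, T)

    # Move the feature axis last so boolean indexing selects whole D-dimensional frames.
    return embeddings.transpose(0, 2, 1)[mask]  # (sum(lengths), D)


# Two items padded to T=4; the second item has only 2 valid frames.
demo_embeddings = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
demo_lengths = np.array([4, 2])
demo_frames = flatten_valid_frames(demo_embeddings, demo_lengths)
```

With a mask like this, padded frames never reach the statistics that the metric accumulates, so the batch size does not need to be restricted to 1.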
What does this PR do?
Adds the Frechet Codec Distance metric and integrates it in MagpieTTS inference scripts. Also fixes some minor MagpieTTS inference bugs.
Collection: TTS
Changelog
The Frechet Distance (FD) is commonly used to evaluate generative models (e.g. Frechet Inception Distance, Frechet Audio Distance). In this PR we implement FD in the embedding space of a neural codec. The metric measures how closely the distributions of real and generated codec frames match, at the single-frame level.
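As background, FD fits a Gaussian to each set of feature vectors and computes ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)), where mu and S are the mean and covariance of the real (r) and generated (g) features. The following is a minimal NumPy sketch of that formula, not the PR's implementation (which builds on TorchMetrics); the function name `frechet_distance` and the random demo data are assumptions made for illustration.

```python
import numpy as np


def frechet_distance(real: np.ndarray, generated: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    Both inputs have shape (num_frames, feature_dim), one row per frame.
    """
    mu_r, mu_g = real.mean(axis=0), generated.mean(axis=0)
    # rowvar=False: rows are observations (frames), columns are feature dimensions.
    cov_r = np.cov(real, rowvar=False)
    cov_g = np.cov(generated, rowvar=False)

    # Tr((cov_r cov_g)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_r @ cov_g. For PSD covariances these eigenvalues are real and
    # non-negative, so drop imaginary round-off and clip tiny negative values.
    eigvals = np.linalg.eigvals(cov_r @ cov_g).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


# Demo with synthetic "frames": shifting every feature by 1.0 keeps the
# covariance unchanged, so only the squared mean difference contributes.
rng = np.random.default_rng(0)
real_frames = rng.normal(size=(2000, 8))
shifted_frames = real_frames + 1.0
fd_same = frechet_distance(real_frames, real_frames)
fd_shifted = frechet_distance(real_frames, shifted_frames)
```

In this demo `fd_same` is zero up to floating-point error, and `fd_shifted` is the squared norm of the mean shift (one unit in each of 8 dimensions).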
Changes:
* frechet_codec_distance.py: An implementation of FD in codec embedding space. Builds on TorchMetrics' FID implementation. We provide the audio codec as a custom feature extractor.
* test_frechet_coec_distance.py: Unit test
* --disable_fcd command line argument to magpietts_inference.py
* titanet_small speaker representation model. This was present in earlier versions of the inference scripts and appears to have been accidentally lost in recent refactorings
PR Type: