Authors:

Sang-Hoon Lee, Ha-Yeong Choi, Seong-Whan Lee

Paper:

https://arxiv.org/abs/2408.08019

Introduction

The paper “Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization” introduces PeriodWave-Turbo, a high-fidelity and efficient waveform generation model. This model leverages adversarial flow matching optimization to enhance the performance of pre-trained Conditional Flow Matching (CFM) generative models. The primary goal is to address the limitations of existing models, such as the need for numerous Ordinary Differential Equation (ODE) steps and the lack of high-frequency information in generated samples.

Related Works

Accelerating Methods for Few-Step Generator

Diffusion-based generative models have shown impressive performance but suffer from slow inference speeds due to iterative sampling processes. Various methods have been proposed to accelerate synthesis speed, including:

  • Consistency Models (CM): Introduce one-step or few-step generation methods by directly mapping noise to data.
  • Consistency Trajectory Models (CTM): Integrate CM and score-based models with adversarial training.
  • FlashSpeech: Propose adversarial consistency training using SSL-based pre-trained models.
  • Denoising Diffusion GAN (DDGAN): Integrate the denoising process with a multimodal conditional GAN for faster sampling.
  • UFOGen: Enhance performance using improved reconstruction loss and pre-trained diffusion models.
  • Adversarial Diffusion Distillation (ADD): Leverage pre-trained teacher models for distillation and adversarial training.
  • Latent Adversarial Diffusion Distillation (LADD): Unify discriminator and teacher models for efficient training.
  • Distribution Matching Distillation (DMD): Demonstrate high-quality one-step generation via distribution matching distillation and reconstruction loss.

Adversarial Feedback for Waveform Generation

GAN-based models have dominated waveform generation tasks by utilizing various well-designed discriminators to capture specific characteristics of waveform signals. Notable models include:

  • MelGAN: Propose a multi-scale discriminator (MSD) reflecting features from different scales of waveform signals.
  • HiFi-GAN: Introduce the multi-period discriminator (MPD) to capture implicit period features.
  • UnivNet: Present the multi-resolutional spectrogram discriminator (MRD) to capture different features in spectral domains.
  • Fre-GAN: Utilize resolution-wise discriminators.
  • Avocodo: Propose a collaborative multi-band discriminator (CoMBD) and a sub-band discriminator (SBD).
  • EnCodec: Modify MRD using complex values of spectral features.
  • MS-SB-CQTD: Improve modeling of pitch and harmonic information in waveform signals.

PeriodWave-Turbo

Flow Matching for Waveform Generation

Flow matching techniques have shown potential in generating high-quality waveforms by aligning the probability flow between noise and target distribution. PeriodWave employs CFM to create a waveform generator and incorporates a period-aware generator architecture to capture temporal features of input signals with greater precision. However, iterative processing steps can be slow, posing challenges for real-time applications.
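To make the training signal concrete, the sketch below illustrates a generic optimal-transport conditional flow matching objective in NumPy. The function names and toy data are illustrative; the actual period-aware architecture, conditioning, and probability path used by PeriodWave are not modeled here.

```python
import numpy as np

def cfm_training_target(x0, x1, t):
    """Generic OT flow matching path: interpolate noise x0 toward data x1.

    Returns the point x_t on the path and the target velocity (x1 - x0)
    that the vector-field estimator is trained to regress.
    """
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0
    return x_t, u_t

rng = np.random.default_rng(0)
x1 = rng.standard_normal(8)   # toy stand-in for a waveform sample
x0 = rng.standard_normal(8)   # Gaussian noise
x_t, u_t = cfm_training_target(x0, x1, t=0.5)

# the CFM loss is the MSE between the model's predicted velocity and u_t
predicted = np.zeros_like(u_t)            # placeholder model output
cfm_loss = np.mean((predicted - u_t) ** 2)
```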

Adversarial Flow Matching Optimization

Few-step Generator Modification

To accelerate waveform generation, the pre-trained CFM generator is converted into a fixed-step generator. Its parameters are initialized from the pre-trained PeriodWave, and raw waveform signals are generated from noise by fixed few-step ODE sampling with the Euler method. Fine-tuning with a fixed number of steps specializes the model to that sampling budget, which makes optimization easier.
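A minimal sketch of fixed few-step Euler ODE sampling, assuming a `vector_field(x, cond, t)` callable that stands in for the period-aware generator G(x_t, c, t); the name, signature, and toy field below are illustrative.

```python
import numpy as np

def euler_sample(vector_field, x0, cond, n_steps=4):
    """Generate a sample from noise x0 with fixed-step Euler ODE sampling.

    `vector_field(x, cond, t)` stands in for the pre-trained generator G;
    its exact signature here is an assumption for illustration.
    """
    x = x0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * vector_field(x, cond, t)  # one Euler step
    return x

# toy vector field that flows every point toward a fixed target
target = np.ones(4)
x_hat = euler_sample(lambda x, c, t: target - x, np.zeros(4), cond=None)
```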

Reconstruction Loss

Unlike pre-training with the flow matching objective, reconstruction losses can be applied directly to the raw waveforms produced by fixed-step ODE sampling. A Mel-spectrogram reconstruction loss is adopted to focus on perceptually relevant frequencies:

\[ L_{mel} = \| \psi(x) - \psi(\hat{x}) \|_1 \]

where \(\hat{x}\) is generated from \(G(x_t, c, t)\) by the fixed-step Euler ODE solver, and \(\psi\) denotes the Mel-spectrogram transform.
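The sketch below makes the loss concrete, substituting a plain magnitude STFT for the Mel transform psi (the paper uses Mel-spectrograms, and the ablation study a multi-scale variant); all parameter values are illustrative.

```python
import numpy as np

def spec_mag(x, n_fft=64, hop=16):
    """Toy magnitude spectrogram standing in for the Mel transform psi."""
    frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=-1))

def mel_l1_loss(x, x_hat):
    """L1 spectrogram distance, as in L_mel = || psi(x) - psi(x_hat) ||_1."""
    return np.abs(spec_mag(x) - spec_mag(x_hat)).mean()

t = np.linspace(0.0, 1.0, 256, endpoint=False)
x = np.sin(2 * np.pi * 8 * t)                 # toy "ground-truth" waveform
loss_same = mel_l1_loss(x, x)                 # identical signals -> 0
loss_diff = mel_l1_loss(x, np.zeros_like(x))  # mismatch -> positive
```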

Adversarial Training

To ensure high-quality waveform generation, adversarial feedback is utilized by adopting multi-period discriminator (MPD) and multi-scale sub-band Constant-Q Transform discriminator (MS-SB-CQTD):

\[ L_{adv}(D) = \mathbb{E}_{x}\left[ (D(x) - 1)^2 + D(G(x_t, c, t))^2 \right] \]

\[ L_{adv}(G) = \mathbb{E}_{x}\left[ (D(G(x_t, c, t)) - 1)^2 \right] \]

Additionally, a feature matching loss \(L_{fm}\) is used: the L1 distance between the discriminator's intermediate feature maps for the ground-truth \(x\) and the generated \(\hat{x}\).
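These least-squares adversarial terms and the feature matching loss can be sketched as below; the arrays stand in for discriminator outputs and intermediate feature maps and are purely illustrative.

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Least-squares discriminator loss: (D(x) - 1)^2 + D(G(x_t, c, t))^2."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def g_loss(d_fake):
    """Least-squares generator loss: (D(G(x_t, c, t)) - 1)^2."""
    return np.mean((d_fake - 1.0) ** 2)

def fm_loss(feats_real, feats_fake):
    """Feature matching: summed L1 distance over discriminator feature maps."""
    return sum(np.abs(fr - ff).mean() for fr, ff in zip(feats_real, feats_fake))

# an ideal discriminator (real -> 1, fake -> 0) has zero D-loss;
# a generator that fully fools D (fake -> 1) has zero G-loss
zero_d = d_loss(np.ones(3), np.zeros(3))
zero_g = g_loss(np.ones(3))
```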

Distillation Method

For few-step text-to-image generation, recent works introduced distillation methods that rely on pre-trained diffusion models. As a point of comparison, a distillation variant for waveform generation is evaluated that uses the pre-trained FM generator as a fake vector-field estimator.

Final Loss

The total loss for PeriodWave-Turbo is expressed as:

\[ L_{final} = L_{adv}(G) + \lambda_{fm} L_{fm} + \lambda_{mel} L_{mel} \]

where \(\lambda_{fm}\) and \(\lambda_{mel}\) are set to 2 and 45, respectively.
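Combining the terms with the reported weights gives the total generator objective; a one-line sketch:

```python
def final_loss(l_adv_g, l_fm, l_mel, lam_fm=2.0, lam_mel=45.0):
    """Total PeriodWave-Turbo generator objective with the reported weights."""
    return l_adv_g + lam_fm * l_fm + lam_mel * l_mel

# e.g. with only adversarial and feature-matching terms active:
total = final_loss(l_adv_g=1.0, l_fm=0.5, l_mel=0.0)  # 1.0 + 2 * 0.5 = 2.0
```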

Model Size

PeriodWave and PeriodWave-Turbo are trained with different model sizes: Small (S, 7.57M), Base (B, 29.80M), and Large (L, 70.24M).

Experiment and Result

Dataset

The models are trained with the LJSpeech and LibriTTS datasets, widely used for waveform reconstruction tasks.

Pre-training

PeriodWave-S/B are pre-trained with the flow matching objective for 1M steps using the AdamW optimizer. PeriodWave-L is pre-trained with the same hyperparameters on four NVIDIA A100 80GB GPUs.

Training

PeriodWave-Turbo is trained from the pre-trained PeriodWave with adversarial flow matching optimization. The training steps are significantly reduced compared to fully GAN training.

ODE Sampling

PeriodWave-Turbo models utilize the Euler method with four sampling steps, compared to the teacher models using the Midpoint method with 16 steps.

LJSpeech: High-quality Single Speaker Dataset

PeriodWave-Turbo achieves state-of-the-art performance in all objective metrics on the LJSpeech benchmark, demonstrating the efficiency of adversarial flow matching optimization.

LibriTTS: Multi-speaker Dataset with 24,000 Hz

PeriodWave-Turbo achieves unprecedented performance on the LibriTTS benchmark, with a PESQ score of 4.454. The results show the robustness of the model structure, with better performance than larger GAN-based models.

Subjective Evaluation

PeriodWave-Turbo demonstrates better performance than previous models in subjective evaluation, including large-scale GAN-based models and teacher models.

Inference Speed and Memory Usage

PeriodWave-Turbo significantly accelerates inference speed compared to teacher models, with lower VRAM usage.

MUSDB18-HQ: Multi-track Music Audio Dataset for Out-Of-Distribution Robustness

PeriodWave-Turbo shows robustness on out-of-distribution samples, performing better on most objective metrics and achieving higher similarity MOS for each instrument.

Zero-shot TTS Results on LibriTTS dataset

PeriodWave-Turbo demonstrates better speech accuracy in terms of CER and WER, with better audio quality and naturalness in two-stage TTS scenarios.

Ablation Study

The ablation study compares different reconstruction losses, adversarial feedback, and distillation methods. The results highlight the importance of multi-scale Mel-spectrogram loss and adversarial feedback for stable training and high-quality waveform generation.

Conclusion

PeriodWave-Turbo, a novel ODE-based few-step waveform generator, successfully accelerates CFM-based waveform generation using adversarial flow matching optimization. The model achieves state-of-the-art performance across objective and subjective metrics, with superior robustness in out-of-distribution and two-stage TTS scenarios. Future work will focus on further optimizing inference speed and adapting the model for end-to-end text-to-speech and text-to-audio generation tasks.

Code:

https://github.com/sh-lee-prml/periodwave

Datasets:

LJSpeech, LibriTTS
