Authors:
Kai Qiu, Xiang Li, Hao Chen, Jie Sun, Jinglu Wang, Zhe Lin, Marios Savvides, Bhiksha Raj
Paper:
https://arxiv.org/abs/2408.09027
Introduction
Background
Autoregressive (AR) modeling has been a cornerstone of generative modeling, particularly for text and images. Its application to audio generation, however, has been limited by the long sequence lengths and continuous nature of audio data. Traditional AR models predict tokens one at a time, which becomes computationally expensive and slow for audio with high sampling rates.
Problem Statement
The primary challenge in AR-based audio generation is the efficiency of token prediction. Because audio sequences are so long, next-token prediction requires thousands of sequential decoding steps, making it impractical for real-time applications. This study addresses the issue with a novel Scale-level Audio Tokenizer (SAT) and a scale-level Acoustic AutoRegressive (AAR) modeling framework, which together reduce both the token length and the number of autoregressive steps, improving the efficiency and quality of audio generation.
Related Work
Raw Audio Discretization
Before the advent of Vector-Quantized Variational Autoencoders (VQ-VAEs), converting continuous audio signals into discrete representations was a significant challenge. VQ-VAEs addressed it with encoder-decoder networks that quantize inputs against a learned codebook. Subsequent advances such as VQGAN and RQGAN further improved model generalization and inspired numerous works in audio discretization. Notable examples include EnCodec, which pairs an encoder-decoder model with residual quantization, and HiFi-Codec, which employs group residual quantization.
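To make the residual idea concrete, here is a minimal sketch of plain residual quantization: each level quantizes the error left by the previous level against its own codebook, and the decoder sums the levels back up. The shapes, the four levels, and the codebook size are illustrative assumptions, not EnCodec's actual configuration.

```python
# Minimal residual quantization sketch (illustrative shapes, not EnCodec's).
import torch

def residual_quantize(z, codebooks):
    """z: (T, D) continuous latents; codebooks: list of (K, D) tensors.

    Returns the summed quantized latents and per-level code indices.
    """
    residual = z
    quantized = torch.zeros_like(z)
    indices = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (T, K) distances to codewords
        idx = dists.argmin(dim=-1)          # nearest codeword per frame
        q = cb[idx]                         # (T, D) quantized residual
        quantized = quantized + q
        residual = residual - q             # next level models the leftover error
        indices.append(idx)
    return quantized, indices

# Toy usage: 4 levels, codebook size 1024, latent dim 128, 50 frames.
torch.manual_seed(0)
codebooks = [torch.randn(1024, 128) for _ in range(4)]
q, idx = residual_quantize(torch.randn(50, 128), codebooks)
print(q.shape, [i.shape for i in idx])  # every level keeps all 50 positions
```

Note that every level keeps the full temporal resolution; this is exactly the token-count pressure that the multi-scale design described later relieves.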
Diffusion-Based Audio Generation
Diffusion models have shown promise in generating high-quality audio by progressively transforming noise into coherent signals. These models have been widely adopted in various audio applications, including speech synthesis and music generation. However, the iterative nature of the diffusion process presents challenges such as high computational costs and significant inference times.
Autoregressive Modeling
Autoregressive models, exemplified by modern Large Language Models (LLMs), have excelled in text generation and machine translation. These models construct output by predicting the next token sequentially. Their application to raw audio generation remains challenging, however, because of the large number of tokens required to represent audio data. This study mitigates these limitations by encoding raw audio at multiple scales with a Scale-level Audio Tokenizer and generating it with Acoustic AutoRegressive modeling via next-scale prediction.
Research Methodology
Scale-level Audio Tokenizer (SAT)
The proposed SAT shortens the audio token sequence by augmenting traditional residual quantization with a multi-scale design: each quantization level operates at a temporal resolution determined by its scale index, so coarse levels contribute only a handful of tokens and the total token count drops substantially.
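A minimal sketch of that idea, under simplifying assumptions of our own (a single codebook shared across scales, linear interpolation for resampling, made-up scale lengths): each residual level is quantized at a coarser resolution and upsampled back before subtraction, so coarse levels cost only a few tokens.

```python
# Multi-scale residual quantization sketch (assumed shapes and scales).
import torch
import torch.nn.functional as F

def multiscale_residual_quantize(z, codebook, scales):
    """z: (1, D, T) latents; codebook: (K, D); scales: token length per level."""
    T = z.shape[-1]
    residual, quantized, codes = z, torch.zeros_like(z), []
    for s in scales:
        # Downsample the residual to this scale's token length.
        r_s = F.interpolate(residual, size=s, mode="linear", align_corners=False)
        flat = r_s.squeeze(0).transpose(0, 1)             # (s, D)
        idx = torch.cdist(flat, codebook).argmin(dim=-1)  # nearest codewords
        q_s = codebook[idx].transpose(0, 1).unsqueeze(0)  # (1, D, s)
        # Upsample back to full length before computing the next residual.
        q = F.interpolate(q_s, size=T, mode="linear", align_corners=False)
        quantized, residual = quantized + q, residual - q
        codes.append(idx)
    return quantized, codes

z, codebook = torch.randn(1, 128, 64), torch.randn(1024, 128)
_, codes = multiscale_residual_quantize(z, codebook, scales=[1, 4, 16, 64])
print([c.numel() for c in codes])  # [1, 4, 16, 64]: 85 tokens vs 4 * 64 = 256
```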
Acoustic AutoRegressive (AAR) Modeling
Built on the multi-scale audio tokenizer, the AAR framework shifts from next-token to next-scale prediction: at each autoregressive step, the model predicts all tokens of the next (finer) scale in parallel, conditioned on all coarser scales. This drastically reduces the number of autoregressive steps during inference; together with the shorter token sequence, it yields superior audio quality at faster inference speed.
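A toy decoding loop makes the step-count argument concrete: one forward pass per scale samples all of that scale's tokens at once, so four scales mean four AR steps regardless of how many tokens they contain. TinyScaleAR and its reuse of the last position's logits are stand-ins for the paper's transformer, not its actual architecture.

```python
# Toy next-scale decoding loop; the model is a stand-in, not the paper's.
import torch
import torch.nn as nn

class TinyScaleAR(nn.Module):
    def __init__(self, vocab=1024, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.body = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, seq):               # seq: (1, L, dim) embedded context
        return self.head(self.body(seq))  # (1, L, vocab) per-position logits

@torch.no_grad()
def generate(model, start, scales):
    """One AR step per scale; each step samples that scale's tokens in parallel."""
    seq, out = start, []
    for s in scales:
        logits = model(seq)
        # Toy rule: reuse the last position's logits for all s new tokens.
        # The real model emits distinct logits per position of the next scale.
        probs = logits[:, -1].softmax(-1).repeat(s, 1)
        tokens = torch.multinomial(probs, 1).squeeze(-1)   # (s,) codes
        out.append(tokens)
        # Append this scale's embeddings as context for the next, finer scale.
        seq = torch.cat([seq, model.embed(tokens).unsqueeze(0)], dim=1)
    return out

model = TinyScaleAR()
start = torch.randn(1, 1, 64)  # e.g., a CLAP embedding as the start token
codes = generate(model, start, scales=[1, 4, 16, 64])
print(sum(c.numel() for c in codes), "tokens in", len(codes), "AR steps")
```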
Experimental Design
Evaluation Metrics
Performance is evaluated with Fréchet Audio Distance (FAD), MEL distance, and STFT distance for reconstruction, and FAD, Inception Score (ISc), and KL divergence for generation (lower is better for FAD, MEL, STFT, and KL; higher is better for ISc). Together these metrics provide a comprehensive assessment of reconstruction and generation quality.
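For reference, FAD is the Fréchet distance between Gaussians fitted to embedding sets (conventionally VGGish embeddings) of reference and generated audio. The sketch below computes that distance; the random embeddings are placeholders for real embedding features.

```python
# Fréchet distance between Gaussian fits of two embedding sets (FAD core).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x, y):
    """x, y: (N, D) embedding matrices for reference and generated audio."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x, cov_y = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

rng = np.random.default_rng(0)
ref = rng.normal(size=(500, 128))        # placeholder "reference" embeddings
gen = rng.normal(0.1, 1.0, (500, 128))   # placeholder "generated" embeddings
print(f"FAD ~ {frechet_distance(ref, gen):.3f}")  # lower is better
```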
Dataset
All experiments are conducted on the AudioSet dataset. The evaluation set is divided into segments matching the window size of the model for reconstruction. For autoregressive generation, a random segment from the evaluation set is used as the ground truth.
Implementation Details
Tokenizer
In the first stage, multi-scale residual quantization (MSRQ) is employed with a codebook size of 1024. The tokenizer is trained for 100 epochs with the Adam optimizer, using the learning-rate settings and loss weights specified in the paper.
Transformer
In the second stage, a GPT-2-style transformer with adaptive layer normalization (AdaLN) performs scale-level acoustic autoregressive modeling. CLAP audio embeddings serve as start tokens to provide richer conditioning context. The model is trained with the AdamW optimizer, using the learning-rate and weight-decay settings specified in the paper.
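A minimal sketch of what the adaptive normalization could look like in such a block, as an assumption about the general AdaLN recipe rather than the paper's exact module: the condition vector (e.g., a CLAP embedding) regresses a per-channel scale and shift applied after a parameter-free LayerNorm.

```python
# AdaLN sketch: conditioning regresses the LayerNorm scale and shift.
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        # x: (B, L, dim) token states; cond: (B, cond_dim), e.g. a CLAP embedding.
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

x, cond = torch.randn(2, 85, 512), torch.randn(2, 512)
print(AdaLN(512, 512)(x, cond).shape)  # torch.Size([2, 85, 512])
```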
Results and Analysis
Main Results
The proposed SAT tokenizer achieves a FAD 0.3 lower than the EnCodec baseline despite using fewer tokens, demonstrating that deeper quantization with a smaller token budget can efficiently improve reconstruction quality. The AAR framework is superior in both latency and audio quality, achieving roughly a 35x inference speedup and better FAD scores than next-token prediction.
Ablation Studies
Effect of Scale Setting
Different scale settings were tested to find the optimal SAT configuration. A quadratic schedule proved more efficient, requiring fewer tokens while achieving comparable reconstruction performance.
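To see why a quadratic schedule is cheaper, compare the token budgets of linear and quadratic schedules that end at the same finest scale. The formulas, the eight scales, and the target length of 256 are illustrative, not the paper's actual settings.

```python
# Token budgets under linear vs. quadratic scale schedules (assumed values).
def linear_scales(n, target):
    return [round(target * (k + 1) / n) for k in range(n)]

def quadratic_scales(n, target):
    return [max(1, round(target * ((k + 1) / n) ** 2)) for k in range(n)]

n, target = 8, 256  # 8 scales, finest scale holds 256 tokens
lin, quad = linear_scales(n, target), quadratic_scales(n, target)
print(lin, "->", sum(lin), "tokens")    # [32, 64, ..., 256] -> 1152 tokens
print(quad, "->", sum(quad), "tokens")  # [4, 16, ..., 256] -> 816 tokens
```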
Effect of Discriminator
Multiple discriminator configurations were explored to optimize the performance of SAT. The results indicate that using only a multi-scale STFT discriminator is sufficient for effective reconstruction.
Effect of Temporal Windows
The performance of SAT was validated across different temporal windows. The results suggest that SAT performs well across diverse time windows, maintaining consistent quality and demonstrating robustness in handling varying temporal scales.
Effect of Upsampling Function
Different configurations of the 1D convolutional layer applied after upsampling were evaluated. A partially shared architecture significantly improved generation quality.
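One plausible reading of "partially shared," sketched here purely as an assumption: a Conv1d shared across all scales plus a lightweight depthwise conv per scale, applied to the upsampled quantized tokens.

```python
# Assumed "partially shared" refinement after upsampling: shared + per-scale convs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartiallySharedUpsample(nn.Module):
    def __init__(self, dim, num_scales):
        super().__init__()
        self.shared = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.per_scale = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise
            for _ in range(num_scales)
        )

    def forward(self, q, scale_idx, target_len):
        # q: (B, D, s) quantized tokens of one scale; refine after upsampling.
        up = F.interpolate(q, size=target_len, mode="linear", align_corners=False)
        return self.shared(up) + self.per_scale[scale_idx](up)

m = PartiallySharedUpsample(dim=128, num_scales=4)
print(m(torch.randn(1, 128, 16), scale_idx=2, target_len=64).shape)
```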
Effect of AAR and Sampling Technique
The AAR framework, combined with adaptive normalization, classifier-free guidance, and refined sampling techniques, improved generation quality while significantly reducing inference time.
Classifier-Free Guidance
The trade-off between Inception Score (ISc) and Fréchet Audio Distance (FAD) was evaluated across Classifier-Free Guidance (CFG) scales. ISc improves as the CFG scale increases, while FAD and KL converge and stabilize beyond a certain scale.
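The guidance rule behind this trade-off is the standard classifier-free mixing of conditional and unconditional logits; larger scales sharpen the conditional signal. The random logits below are placeholders to illustrate only the mixing rule.

```python
# Standard CFG mixing of conditional and unconditional logits.
import torch

def cfg_logits(cond_logits, uncond_logits, scale):
    """scale = 1.0 recovers the conditional model; larger values sharpen it."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

torch.manual_seed(0)
cond, uncond = torch.randn(1, 1024), torch.randn(1, 1024)
for s in (1.0, 2.0, 4.0):
    probs = cfg_logits(cond, uncond, s).softmax(-1)
    print(f"scale={s}: max prob {probs.max().item():.3f}")  # typically sharper
```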
Overall Conclusion
This study introduces an efficient approach to autoregressive audio modeling via next-scale prediction. The proposed Scale-level Audio Tokenizer (SAT) and Acoustic AutoRegressive (AAR) framework significantly reduce both the token length and the number of autoregressive steps, improving the efficiency and quality of audio generation. Comprehensive experiments demonstrate superior performance over traditional next-token autoregressive methods, making multi-scale residual quantization an effective route to faster, less computationally demanding audio generation.