Authors: Xinyu Ning

Paper: https://arxiv.org/abs/2408.09158

Introduction

Background

Spatial-temporal forecasting, particularly traffic prediction, has garnered significant attention because of the complex correlations it exhibits across both space and time. Traditional methods treat road networks as spatial-temporal graphs and learn spatial and temporal representations independently. However, these approaches struggle to capture the dynamic topology of road networks, suffer from the limitations of message-passing mechanisms such as over-smoothing, and must learn spatial and temporal relationships separately.

Problem Statement

Existing methods struggle with:
1. Capturing the dynamic topology of road networks.
2. Overcoming the over-smoothing problem in Graph Neural Networks (GNNs).
3. Learning spatial and temporal relationships separately, which requires more layers and increases computational complexity.

To address these limitations, this study proposes treating nodes in road networks at different time steps as independent spatial-temporal tokens and feeding them into a vanilla Transformer to learn complex spatial-temporal patterns. The proposed models, STformer and NSTformer, aim to achieve state-of-the-art performance with reduced computational complexity.

Related Work

Transformer in Traffic Forecasting

Transformers have become the de facto standard in many applications, including traffic forecasting. Previous works use attention-based models to learn spatial and temporal representations separately. However, these models still operate within the spatial-temporal graph framework, which has inherent drawbacks.

Efficient Transformer

The self-attention mechanism in Transformers has quadratic complexity in the sequence length, which limits its scalability to large inputs. Various methods, such as Nyströmformer, approximate self-attention with linear complexity, making it feasible for large-scale applications.

Research Methodology

Problem Setting

The traffic forecasting task is defined as learning a mapping from past traffic flow to future traffic flow with a neural network. The road network is represented as a graph of sensors, and the traffic flow recorded over a period of time forms a spatial-temporal tensor.
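To make the setting concrete, the sketch below spells out the tensor shapes involved. The dimensions and variable names are illustrative (METR-LA's 207 sensors are used as an example), not the paper's notation.

```python
import torch

# Hypothetical dimensions for illustration (not taken from the paper).
batch, in_steps, out_steps = 32, 12, 12   # 12 past steps -> 12 future steps
num_nodes, channels = 207, 1              # e.g., METR-LA has 207 sensors

# Past traffic flow: one reading per sensor per time step.
x_past = torch.randn(batch, in_steps, num_nodes, channels)

# The forecasting task is to learn a mapping f such that
#   f(x_past) -> x_future with shape (batch, out_steps, num_nodes, channels).
x_future = torch.randn(batch, out_steps, num_nodes, channels)
```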

Model Architecture

STformer

STformer treats each sensor at each time step as an independent spatial-temporal token (ST-Token). These tokens are fed into a vanilla Transformer to capture spatial-temporal dependencies dynamically in a single attention operation. The model achieves state-of-the-art performance but has quadratic complexity in the number of tokens.
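A minimal sketch of the ST-token idea follows, assuming a standard PyTorch TransformerEncoder. The embedding dimension, head count, and layer count are placeholders, and details such as positional encodings and the output head of the actual model may differ.

```python
import torch
import torch.nn as nn

class STTokenTransformer(nn.Module):
    """Sketch: flatten (time, node) pairs into one token sequence and
    run a vanilla Transformer encoder over all spatial-temporal tokens."""

    def __init__(self, channels=1, d_model=64, nhead=4, num_layers=3):
        super().__init__()
        self.embed = nn.Linear(channels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, channels)

    def forward(self, x):
        # x: (batch, T, N, C) -> one token per (time step, sensor) pair
        b, t, n, c = x.shape
        tokens = self.embed(x.reshape(b, t * n, c))
        out = self.encoder(tokens)   # full attention over all T*N tokens
        return self.head(out).reshape(b, t, n, c)

model = STTokenTransformer()
y = model(torch.randn(2, 12, 207, 1))   # 12 * 207 = 2484 ST-tokens
```

Because every token attends to every other token, spatial and temporal dependencies are learned jointly rather than by separate spatial and temporal modules; the price is attention cost quadratic in T*N.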

NSTformer

To reduce the quadratic cost of STformer, NSTformer adapts the Nyström method to approximate self-attention with linear complexity. Surprisingly, it performs slightly better than STformer in some cases, suggesting that approximate attention may have additional benefits.
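A compact sketch of Nyström-approximated attention is shown below, with landmarks taken as segment means as in Nyströmformer. This is an assumption-laden illustration: the actual NSTformer implementation may differ (for example, Nyströmformer uses an iterative pseudoinverse approximation, whereas torch.linalg.pinv is used here for simplicity).

```python
import torch
import torch.nn.functional as F

def nystrom_attention(q, k, v, num_landmarks=36):
    """Approximate softmax attention in O(n * m) rather than O(n^2),
    following the Nystromformer construction (a sketch, not the paper's code).
    q, k, v: (batch, n, d); n must be divisible by num_landmarks here."""
    b, n, d = q.shape
    scale = d ** -0.5
    # Landmarks: mean-pool queries/keys over contiguous token segments.
    q_l = q.reshape(b, num_landmarks, n // num_landmarks, d).mean(dim=2)
    k_l = k.reshape(b, num_landmarks, n // num_landmarks, d).mean(dim=2)

    kernel1 = F.softmax(q @ k_l.transpose(-1, -2) * scale, dim=-1)    # (b, n, m)
    kernel2 = F.softmax(q_l @ k_l.transpose(-1, -2) * scale, dim=-1)  # (b, m, m)
    kernel3 = F.softmax(q_l @ k.transpose(-1, -2) * scale, dim=-1)    # (b, m, n)

    # out ~ kernel1 @ pinv(kernel2) @ (kernel3 @ v); linear in n for fixed m.
    return kernel1 @ (torch.linalg.pinv(kernel2) @ (kernel3 @ v))

out = nystrom_attention(torch.randn(2, 2484, 64),
                        torch.randn(2, 2484, 64),
                        torch.randn(2, 2484, 64))
```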

Experimental Design

Datasets

The study uses two commonly used traffic forecasting datasets: METR-LA and PEMS-BAY. These datasets provide extensive traffic data, allowing for robust evaluation of the proposed models.

Experimental Setup

The datasets are partitioned into training, validation, and test sets. The models are configured with specific embedding dimensions, numbers of Transformer layers, and attention heads. Experiments are conducted on a server with an NVIDIA A100 GPU, using Python, PyTorch, and the BasicTS platform.
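As a hedged illustration of the data preparation, the snippet below shows a standard chronological split. The 7:1:2 ratio is a common convention for METR-LA and PEMS-BAY in this literature and is an assumption here, not a figure taken from the paper.

```python
import numpy as np

def chronological_split(data, train=0.7, val=0.1):
    """Split a (T, N, C) traffic tensor chronologically into
    train/val/test. The 7:1:2 ratio is an assumed convention,
    not necessarily the paper's exact setting."""
    t = data.shape[0]
    i, j = int(t * train), int(t * (train + val))
    return data[:i], data[i:j], data[j:]

# e.g., METR-LA: 34,272 five-minute readings from 207 sensors
train, val, test = chronological_split(np.random.randn(34272, 207, 1))
```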

Baselines

The study compares the proposed models with various traditional and deep learning methods, including HA, HI, SVR, VAR, FC-LSTM, DCRNN, AGCRN, GTS, GWNet, MTGNN, STGCN, STSGCN, AST-GCN, GMAN, STID, STNorm, PDFormer, STAEformer, and ST-Mamba.

Results and Analysis

Performance Evaluation

The results show that NSTformer and STformer achieve the best and second-best performance, respectively, on almost all metrics. The slight performance advantage of NSTformer over STformer raises an open question about the potential regularization effect of approximate attention.

Model Complexity

The complexity analysis shows that NSTformer's attention cost is linear in the number of spatial-temporal tokens, whereas STformer's self-attention is quadratic, making NSTformer more computationally efficient than STformer and other models. This efficiency is achieved without compromising performance, highlighting the effectiveness of the proposed approach.
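A back-of-the-envelope comparison of the leading attention costs makes the asymptotic gap concrete. All sizes below are illustrative placeholders, not figures reported in the paper.

```python
# Leading multiply-add counts for one attention layer (rough estimates).
T, N, d, m = 12, 207, 64, 36
n = T * N                      # number of spatial-temporal tokens
full_attn = n * n * d          # STformer: quadratic in n
nystrom_attn = 3 * n * m * d   # NSTformer: linear in n for fixed m landmarks
print(f"tokens={n}, full~{full_attn:,}, nystrom~{nystrom_attn:,}")
```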

Overall Conclusion

The study demonstrates that pure self-attention is sufficient for spatial-temporal forecasting, achieving state-of-the-art performance with STformer. By adapting the Nyström method, NSTformer further reduces computational complexity while maintaining or even slightly improving performance. This work offers new insights into spatial-temporal forecasting and opens up avenues for future research in applying linear attention mechanisms to other spatial-temporal tasks.

Acknowledgments

The authors express gratitude to Le Zhang, Zezhi Shao, Zheng Dong, and Tsz Chiu Kwok for their valuable insights and assistance in the experiments.
