Authors:
Paper: https://arxiv.org/abs/2408.08437
Introduction
Vision Transformers (ViTs) have emerged as a powerful alternative to Convolutional Neural Networks (CNNs) for a range of image recognition tasks, delivering state-of-the-art accuracy. However, ViTs are compute- and memory-intensive, which makes them hard to deploy on resource-constrained mobile and edge systems. To address this challenge, the paper introduces PQV-Mobile, a combined pruning and quantization toolkit designed to optimize ViTs for mobile applications.
PQV-Mobile Tool
The PQV-Mobile tool supports various structured pruning strategies and quantization methods to optimize ViTs for mobile deployment. The tool’s workflow is illustrated in Figure 1.
Pruning Method
PQV-Mobile supports several structured pruning strategies, including:
- Magnitude Importance Based Grouping: Uses the L1 or L2 norm of a parameter group as its importance score, dropping connections whose magnitude falls below a threshold.
- Taylor Importance Based Grouping: Approximates the loss change caused by removing a parameter group with a first-order Taylor expansion (weight times gradient), which is cheap to compute from a single backward pass.
- Hessian Importance Based Grouping: Uses second-order (Hessian-based) sensitivity information to identify and prune parameters whose removal has little effect on the loss.
In structured pruning, a ‘group’ is the minimal set of coupled parameters that must be removed together. PQV-Mobile uses a dependency graph to model these inter-layer dependencies and derive the correct groupings for parameter pruning, as sketched below.
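To make the Taylor criterion concrete, here is a minimal sketch in plain PyTorch (not the paper's implementation): it scores the output channels of a single Linear layer by the first-order Taylor approximation of the loss change, |w · ∂L/∂w| summed per channel, and selects the lowest-scoring 9.375% for removal. The layer, data, and loss are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Toy stand-in for one ViT MLP projection; the real tool walks a dependency
# graph over the whole model rather than a single layer.
layer = nn.Linear(768, 3072)
x = torch.randn(8, 768)
target = torch.randn(8, 3072)

# One backward pass provides the gradients the Taylor criterion needs.
loss = nn.functional.mse_loss(layer(x), target)
loss.backward()

with torch.no_grad():
    # First-order Taylor importance per output channel (row of the weight
    # matrix): |w * dL/dw| summed over the row approximates the loss change
    # if that channel were removed.
    taylor_score = (layer.weight * layer.weight.grad).abs().sum(dim=1)

# Select the least important 9.375% of channels, matching the pruning ratio
# used in the experiments.
ratio = 0.09375
n_prune = int(ratio * taylor_score.numel())
prune_idx = torch.argsort(taylor_score)[:n_prune]
print(f"{n_prune} channels selected for removal")
```

In the full tool, the group for each pruned channel also includes the coupled rows and columns of downstream layers, which is exactly what the dependency graph resolves.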
Quantization Method
PQV-Mobile supports post-training quantization of both weights and activations from FP32 to FP16 and int8. The quantization process involves:
- Quantizing Models for Specific Backends: Selecting a quantization engine for the target backend, such as x86, FBGEMM, QNNPACK, or ONEDNN.
- Converting PyTorch Models to TorchScript Format: Exporting models to production environments through TorchScript.
- Optimizing Models for Mobile Applications: Using PyTorch’s mobile optimizer and Lite Interpreter to produce lightweight, deployable models.
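As a rough illustration of this workflow with stock PyTorch APIs, the sketch below selects a backend engine, applies dynamic int8 quantization to the Linear layers (a simplification; the paper quantizes both weights and activations post-training), traces the model to TorchScript, and saves a Lite Interpreter artifact. The timm model name and output filename are assumptions for the example.

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile
import timm  # assumed source of the DeiT checkpoint

# 1. Pick the quantization engine for the target backend
#    (typical options: 'fbgemm', 'qnnpack', 'x86', 'onednn').
torch.backends.quantized.engine = "qnnpack"

# 2. Post-training quantization. Dynamic int8 quantization of Linear layers
#    is used here for brevity.
model = timm.create_model("deit_base_patch16_224", pretrained=False).eval()
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# 3. Convert the PyTorch model to TorchScript.
example = torch.randn(1, 3, 224, 224)
scripted = torch.jit.trace(qmodel, example)

# 4. Optimize for mobile and save for the Lite Interpreter.
mobile_model = optimize_for_mobile(scripted)
mobile_model._save_for_lite_interpreter("deit_base_int8_mobile.ptl")
```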
Experimental Results
The effectiveness of PQV-Mobile is demonstrated through experiments on Facebook’s Data-efficient Image Transformers (DeiT). The experiments evaluate latency-memory-accuracy trade-offs at different pruning ratios and quantization levels.
Latency and Memory Results
Figure 2 shows the latency and memory results for the DeiT base patch16 model with varying pruning and quantization levels. Quantizing the original dense model to int8 leads to a 6.47× reduction in latency. Pruning the quantized model by 9.375% further reduces latency by 9.8%, resulting in an overall 7.14× latency reduction.
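As a sanity check, the two savings compose multiplicatively: 6.47 × 1 / (1 − 0.098) ≈ 7.17×, which matches the reported 7.14× overall reduction up to rounding of the intermediate figures.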
Accuracy Results
Figure 4 illustrates the accuracy results for the DeiT base patch16 model with varying pruning and quantization levels. Pruning the model by 9.375% results in a 1.25% accuracy loss, while further quantizing it to int8 leads to an additional 0.99% loss in accuracy.
Structured Pruning Groupings
Figure 5 compares the accuracy results for different structured pruning groupings. Taylor pruning outperforms L1-norm and Hessian-based pruning, making it the preferred method for all experiments.
Comparison of DeiT Models
Figure 6 compares the latency and accuracy of the pruned and quantized DeiT base patch16 model with the DeiT3 medium patch16 model. When both are pruned to the same levels, DeiT3 medium shows latency improvements of 18.65% and 13.55% over DeiT base.
Hardware Backend Evaluation
Figure 7 evaluates the latency for different int8 quantization hardware backends using the 9.375% pruned DeiT3 medium patch16 model. The x86 and FBGEMM backends perform the best, with FBGEMM slightly outperforming x86.
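A backend comparison of this kind can be approximated with stock PyTorch, as in the sketch below; the toy MLP stands in for the pruned DeiT3 model, and which engines appear in supported_engines depends on the PyTorch build and platform.

```python
import time
import torch
import torch.nn as nn

# Toy MLP standing in for the 9.375%-pruned DeiT3 medium patch16 model
# benchmarked in the paper.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).eval()
x = torch.randn(32, 512)

for engine in torch.backends.quantized.supported_engines:
    if engine == "none":
        continue
    torch.backends.quantized.engine = engine
    # Re-quantize under each engine so the weights are packed for that backend.
    qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
    with torch.no_grad():
        qmodel(x)  # warm-up
        start = time.perf_counter()
        for _ in range(100):
            qmodel(x)
    print(f"{engine}: {(time.perf_counter() - start) * 10:.2f} ms per batch")
```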
Conclusion and Future Work
PQV-Mobile is a powerful tool for optimizing ViTs for mobile applications through combined pruning and quantization. It supports various structured pruning strategies and quantization methods, demonstrating significant improvements in latency and memory with minimal accuracy loss. Future work includes extending PQV-Mobile to int4 quantization and targeting large language models.
Acknowledgements
This work was performed under the auspices of the U.S. Department of Energy by LLNL under contract DE-AC52-07NA27344 (LLNL-CONF-865054).