Authors:
Bei Ouyang, Shengyuan Ye, Liekang Zeng, Tianyi Qian, Jingyi Li, Xu Chen
Paper:
https://arxiv.org/abs/2408.10746
Introduction
Background
Large Language Models (LLMs) have revolutionized machine intelligence, enabling a wide range of applications, especially at the network edge. Intelligent personal assistants (IPAs) are a prime example, providing users with high-performance, privacy-preserving intelligent services. However, fine-tuning these models on edge devices is challenging because the process is both computation- and memory-intensive.
Problem Statement
While parameter-efficient fine-tuning (PEFT) techniques like Adapters and LoRA have been developed to mitigate resource constraints, they are not sufficiently resource-efficient for edge devices. Additionally, resource management optimization techniques are bottlenecked by the resource limitations of individual devices. To address these challenges, the paper proposes Pluto and Charon (PAC), a collaborative edge AI framework designed to efficiently fine-tune personal LLMs by leveraging multiple edge devices.
Related Work
Parameter-Efficient Fine-Tuning
PEFT techniques such as Adapters and LoRA aim to reduce the number of trainable parameters, thereby lowering the computational and memory requirements. However, these techniques still require full backward passes through the LLM backbone, limiting their efficiency on resource-constrained edge devices.
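To make this limitation concrete, below is a minimal sketch of a LoRA-style layer in PyTorch (dimensions, rank, and initialization are illustrative, not taken from the paper): only the low-rank matrices are trainable, yet the forward path still runs through the frozen base layer, so backpropagation must traverse the full backbone to reach adapters attached to earlier layers.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a low-rank trainable update (W + B @ A)."""
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # backbone weight stays frozen
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Only A and B are trainable, but the output still depends on the frozen
        # base layer, so the backward pass must traverse the whole backbone.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(768, 768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / total: {total}")  # a small fraction of the weights
```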
On-Device Fine-Tuning
On-device fine-tuning preserves user data privacy by leveraging idle resources at the edge. However, the computational and memory limitations of edge devices pose significant challenges. Techniques like CPU-DSP co-execution and memory budget adaptation have been explored, but they are still constrained by the resource limitations of single devices.
Collaborative Edge Computing
Collaborative edge computing leverages multiple edge devices to overcome individual resource limitations. Techniques like federated learning and pipeline parallelism have been explored, but they primarily focus on LLM inference rather than fine-tuning.
Research Methodology
Algorithmic Design
PAC employs an algorithm-system co-design to break the resource wall of personal LLM fine-tuning. The key components are:
- Parallel Adapters: These adapters provide a dedicated gradient “highway” for trainable parameters, eliminating the need for full backward passes through the LLM backbone.
- Activation Cache: This mechanism caches intermediate activations and reuses them across multiple epochs, thereby avoiding repeated forward passes through the frozen backbone. A combined sketch of both components follows this list.
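The following is a minimal PyTorch sketch of the general idea behind these two components, not PAC's actual implementation. It assumes the frozen backbone's activations do not depend on the adapter branch, so they can be computed once, cached, and reused in later epochs while gradients flow only through the small adapter modules.

```python
import torch
import torch.nn as nn

class AdapterBranch(nn.Module):
    """Trainable side branch fed by frozen backbone activations (the gradient "highway")."""
    def __init__(self, hidden, bottleneck=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden, bottleneck), nn.ReLU(), nn.Linear(bottleneck, hidden)
        )

    def forward(self, adapter_state, backbone_act):
        # The adapter consumes its own running state plus a (cached) backbone
        # activation; gradients never enter the frozen backbone.
        return adapter_state + self.proj(backbone_act)

def forward_backbone(blocks, x):
    """Run the frozen backbone once and record per-layer activations."""
    acts = []
    with torch.no_grad():
        h = x
        for blk in blocks:
            h = blk(h)
            acts.append(h)
    return acts

hidden = 768
backbone = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(4))
for p in backbone.parameters():
    p.requires_grad_(False)                         # backbone stays frozen
adapters = nn.ModuleList(AdapterBranch(hidden) for _ in range(4))

x = torch.randn(2, hidden)
activation_cache = forward_backbone(backbone, x)    # first epoch: compute and cache

# Every later epoch reuses the cache; only the adapter branch runs.
state = x.clone()
for adapter, act in zip(adapters, activation_cache):
    state = adapter(state, act)
loss = state.pow(2).mean()
loss.backward()                                     # backward touches adapters only
```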
System Design
PAC leverages edge devices in close proximity, pooling them as a collective resource for in-situ personal LLM fine-tuning. The framework employs a hybrid data and pipeline parallelism approach to orchestrate distributed training, further enhancing resource efficiency.
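As an illustration of what such a hybrid configuration might look like (device names, layer counts, and micro-batch settings below are hypothetical, not PAC's planner output), devices can be grouped into data-parallel replicas, each running a pipeline over a partition of the model's layers:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Stage:
    """A contiguous slice of transformer layers assigned to one edge device."""
    device: str
    layer_range: range

@dataclass
class HybridPlan:
    """Data-parallel replicas, each an independent pipeline over the model."""
    replicas: List[List[Stage]]
    micro_batches: int

# Hypothetical plan for 6 edge devices: 2 data-parallel replicas, each a
# 3-stage pipeline over a 24-layer model, with 4 micro-batches in flight so
# that stages overlap and stay busy.
plan = HybridPlan(
    replicas=[
        [Stage("dev0", range(0, 8)), Stage("dev1", range(8, 16)), Stage("dev2", range(16, 24))],
        [Stage("dev3", range(0, 8)), Stage("dev4", range(8, 16)), Stage("dev5", range(16, 24))],
    ],
    micro_batches=4,
)

for r, pipeline in enumerate(plan.replicas):
    for stage in pipeline:
        print(f"replica {r}: {stage.device} holds layers "
              f"{stage.layer_range.start}-{stage.layer_range.stop - 1}")
```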
Experimental Design
Prototype Implementation
PAC was implemented in a realistic testbed with a cluster of edge devices. The framework was evaluated using three LLMs (T5-Base, BART-Large, and T5-Large) and four tasks from the GLUE benchmark (SST-2, MRPC, STS-B, and QNLI).
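For readers who want to set up a similar evaluation, the sketch below shows how the listed models and GLUE tasks can be loaded with the Hugging Face transformers and datasets libraries; the model names and preprocessing shown are illustrative and not the paper's exact pipeline.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-base"                   # also evaluated: facebook/bart-large, t5-large
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset = load_dataset("glue", "sst2")   # other tasks: mrpc, stsb, qnli
sample = dataset["train"][0]
inputs = tokenizer("sst2 sentence: " + sample["sentence"], return_tensors="pt")
print(inputs["input_ids"].shape, sample["label"])
```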
Baseline Methods
The performance of PAC was compared against several baseline methods, including standalone fine-tuning on a single device, Eco-FL (pipeline parallelism), and EDDL (data parallelism). These baseline methods were enhanced with prevalent PEFT techniques like Adapters and LoRA for a fair comparison.
Results and Analysis
End-to-End Performance
PAC significantly outperformed the baseline methods, achieving up to 8.64× end-to-end speedup and up to 88.16% reduction in memory footprint. The fine-tuned models matched or exceeded the accuracy of full model fine-tuning and standard PEFT techniques across the evaluated models and datasets.
Time and Memory Efficiency
Parallel Adapters markedly reduced the average per-sample training time and memory usage compared to the baseline methods. The activation cache mechanism further decreased training time by up to 96.39% and reduced the peak memory footprint by up to 88.16%.
Scalability
PAC’s hybrid parallelism provided enhanced scalability and robustness across varying numbers of devices and workloads. The framework’s planning algorithm identified efficient parallel configurations, maximizing throughput within the constraints of available resources.
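The paper's planner is not reproduced here, but the following toy sketch conveys the flavor of such a search: it enumerates (data-parallel width, pipeline depth) combinations that fit the device count and a per-device memory budget, scores each with a simple placeholder cost model, and keeps the configuration with the highest estimated throughput. The cost model and all numbers are assumptions for illustration only.

```python
from itertools import product

def plan_parallelism(num_devices, num_layers, mem_per_device_gb,
                     layer_mem_gb, layer_time_s, micro_batches=4):
    """Illustrative exhaustive search over (replicas, pipeline depth) configurations.

    The cost model is a placeholder, not PAC's planner: per-stage memory is the
    memory of the layers it holds, and throughput follows a naive pipeline estimate.
    """
    best = None
    for replicas, depth in product(range(1, num_devices + 1), repeat=2):
        if replicas * depth > num_devices or num_layers % depth:
            continue
        layers_per_stage = num_layers // depth
        if layers_per_stage * layer_mem_gb > mem_per_device_gb:
            continue  # this stage would not fit on a single device
        stage_time = layers_per_stage * layer_time_s
        # Classic pipeline estimate: fill the pipe, then one micro-batch per stage step.
        batch_time = (depth + micro_batches - 1) * stage_time
        throughput = replicas * micro_batches / batch_time
        if best is None or throughput > best[0]:
            best = (throughput, replicas, depth)
    return best  # (estimated throughput, data-parallel replicas, pipeline depth)

print(plan_parallelism(num_devices=6, num_layers=24,
                       mem_per_device_gb=8, layer_mem_gb=0.5, layer_time_s=0.05))
```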
Overall Conclusion
PAC is a time- and memory-efficient collaborative edge AI framework for personal LLM fine-tuning. Through its algorithm-system co-design, PAC breaks the resource wall of personal LLM fine-tuning, achieving significant improvements in training speed and memory efficiency. The extensive evaluation demonstrates its potential to enable efficient, privacy-preserving fine-tuning of LLMs on resource-constrained edge devices.