Authors:
Hong Xie, Jinyu Mo, Defu Lian, Jie Wang, Enhong Chen
Paper:
https://arxiv.org/abs/2408.10865
Introduction
In the realm of distributed selection problems, the multi-player multi-armed bandit (MAB) model has been a cornerstone for addressing issues such as channel access in cognitive radio applications. However, traditional models often fall short when applied to real-world scenarios like ridesharing or food delivery services, where requests arrive stochastically and independently of the number of players. This paper introduces a novel variant of the multi-player MAB model that captures these stochastic arrivals and proposes efficient algorithms to optimize the allocation of requests to players.
Related Work
Stochastic Multi-player MAB with Collision
The literature on multi-player MAB models primarily focuses on scenarios where players know if they encounter a collision. Various algorithms have been proposed to minimize regret in these settings:
- The time-division fair sharing algorithm of Liu et al. [11], which achieves logarithmic total regret.
- Logarithmic total regret algorithms by Anandkumar et al. [1] and Rosenski et al. [16].
- Communication-free algorithms by Besson et al. [4] and Lugosi et al. [12], which rely on collision information as an implicit signaling channel.
Stochastic Multi-player MAB without Collision
In settings where players receive independent rewards upon collision, algorithms like DD-UCB by Martínez-Rubio et al. [14] and MP-UCB by Dubey et al. [8] have been proposed. These algorithms utilize communication graphs to share reward information among players.
Summary of Differences
Unlike previous works, this paper formulates a new variant of the multi-player MAB model that captures the stochastic arrival of requests and the allocation policy. The proposed algorithms introduce novel ideas for searching and committing to the optimal arm pulling profile and achieving consensus among players with different estimates.
Research Methodology
Platform Model
The platform consists of requests, players, and a platform operator. Request arrivals are modeled through a finite set of arms, each characterized by a pair of random vectors describing its stochastic request arrivals and rewards. Each player can pull only one arm per time slot, and the platform operator ensures that each player serves at most one request.
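To make this setup concrete, here is a minimal toy environment in Python matching the description above. The class name, the distributions (Poisson request arrivals, uniform rewards), and the method signatures are illustrative assumptions, not the paper's model specification.

```python
import math
import random
from typing import List

class ToyPlatform:
    """Illustrative toy platform: arms receive random requests and pay random rewards.

    Each arm m receives a random number of requests per time slot (Poisson here,
    purely for illustration) and pays a random reward per served request
    (uniform with mean mu_m here). The operator matches at most one request to
    each player who pulled that arm.
    """

    def __init__(self, request_rates: List[float], mean_rewards: List[float], seed: int = 0):
        self.request_rates = request_rates   # lambda_m: mean number of requests at arm m per slot
        self.mean_rewards = mean_rewards     # mu_m: mean reward per served request on arm m
        self.rng = random.Random(seed)

    def step(self, pulls: List[int]) -> float:
        """pulls[m] = number of players pulling arm m this slot; returns the total reward."""
        total = 0.0
        for m, n_players in enumerate(pulls):
            # Number of requests arriving at arm m in this slot.
            num_requests = self._poisson(self.request_rates[m])
            # Each player serves at most one request, so min(n_players, num_requests) are served.
            served = min(n_players, num_requests)
            for _ in range(served):
                total += self.rng.uniform(0.0, 2.0 * self.mean_rewards[m])  # mean mu_m
        return total

    def _poisson(self, lam: float) -> int:
        # Knuth's Poisson sampler (kept dependency-free on purpose).
        threshold = math.exp(-lam)
        k, p = 0, 1.0
        while True:
            p *= self.rng.random()
            if p <= threshold:
                return k
            k += 1
```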
Problem Formulation
The objective is to maximize the total reward accrued by all players over a given time horizon. An arm pulling profile specifies how many players pull each arm, and the optimal profile is the one that maximizes the expected total reward. The challenge lies in searching for this optimal profile and in ensuring that players commit to it without explicit communication.
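To make the objective concrete, the block below sketches one plausible way to write the per-slot expected reward of an arm pulling profile. The symbols D_m (random number of requests arriving at arm m), µ_m (mean reward per served request), and n_m (number of players pulling arm m) are illustrative notation, not necessarily the paper's.

```latex
% Illustrative formulation (notation is ours, not verbatim from the paper):
% n = (n_1, ..., n_M) with \sum_m n_m = K is an arm pulling profile,
% D_m is the random number of requests arriving at arm m, and
% \mu_m is the mean reward of serving one request on arm m.
\[
  R(\mathbf{n}) = \sum_{m=1}^{M} \mu_m \, \mathbb{E}\bigl[\min(n_m, D_m)\bigr]
                = \sum_{m=1}^{M} \mu_m \sum_{j=1}^{n_m} \mathbb{P}(D_m \ge j),
  \qquad
  \mathbf{n}^{*} \in \operatorname*{arg\,max}_{\mathbf{n}\,:\,\sum_{m} n_m = K} R(\mathbf{n}).
\]
```

The second equality is the standard tail-sum identity for nonnegative integer random variables; under this reading, the marginal gain of placing a j-th player on arm m is µ_m · P(D_m ≥ j), which is exactly the quantity the greedy search in the next subsection keeps track of.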
Experimental Design
Searching the Optimal Arm Pulling Profile
A greedy algorithm (Algorithm 1) is proposed to locate one of the optimal arm pulling profiles with computational complexity O(KM). The algorithm adds players to arms one at a time, each time choosing the arm with the largest marginal reward gain; a runnable sketch follows the pseudocode below.
```markdown
Algorithm 1 OptArmPulProfile(µ, P̃)
1: ∆_m ← µ_m, ∀m ∈ M;  n_greedy,m ← 0, ∀m ∈ M
2: for t = 1, . . . , K do
3:   i ← arg max_{m ∈ M} ∆_m   (break ties by an arbitrary tie-breaking rule)
4:   n_greedy,i ← n_greedy,i + 1;  ∆_i ← µ_i · P̃_{i, n_greedy,i}
5: end for
6: return n_greedy = [n_greedy,m : ∀m ∈ M]
```
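For readers who prefer an executable form, here is a minimal Python sketch of the same greedy search. It assumes `P[m][n]` approximates the probability that arm m receives more than n requests, so the (n+1)-th player placed on arm m contributes roughly `mu[m] * P[m][n]` in expectation; these names and the input format are assumptions, not the paper's notation.

```python
from typing import List

def opt_arm_pull_profile(mu: List[float], P: List[List[float]], K: int) -> List[int]:
    """Greedy search for an (estimated) optimal arm pulling profile, cf. Algorithm 1.

    mu[m]   : estimated mean reward per served request on arm m.
    P[m][n] : estimated probability that arm m receives more than n requests
              (assumed convention), so adding the (n+1)-th player to arm m
              gains roughly mu[m] * P[m][n] in expectation.
    K       : number of players to place.  Runs in O(K * M) time.
    """
    M = len(mu)
    n_greedy = [0] * M
    delta = list(mu)                         # marginal gain of the first player on each arm
    for _ in range(K):
        # Assign the next player to the arm with the largest marginal gain
        # (ties broken by the lowest arm index).
        i = max(range(M), key=lambda m: delta[m])
        n_greedy[i] += 1
        # Marginal gain of the player that would be added to arm i next.
        delta[i] = mu[i] * P[i][n_greedy[i]]
    return n_greedy

# Toy usage: 2 arms, 3 players.
if __name__ == "__main__":
    mu = [1.0, 0.8]
    P = [[1.0, 0.6, 0.2, 0.05],              # arm 0: P(more than 0, 1, 2, 3 requests)
         [1.0, 0.9, 0.5, 0.1]]               # arm 1
    print(opt_arm_pull_profile(mu, P, K=3))  # -> [1, 2] for this toy input
```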
Committing to Optimal Arm Pulling Profile
An iterative distributed algorithm (Algorithm 2) lets players commit to the optimal arm pulling profile. It guarantees that all players commit to the same profile within a constant number of rounds in expectation; a simplified runnable sketch follows the pseudocode below.
```markdown
Algorithm 2 CommitOptArmPulProfile(k, n*, T_start)
1: c_k ← 0, ∀k ∈ K;  n^-_{t,m} ← n*_m, ∀m ∈ M;  n^+_{t,m} ← 0, ∀m ∈ M
2: for t = T_start, . . . , T do
3:   if c_k ≠ 0 then
4:     a_{t+1,k} ← a_{t,k}
5:   else if n*_{a_{t,k}} ≥ n_{t, a_{t,k}} then
6:     c_k ← a_{t,k};  a_{t+1,k} ← a_{t,k}
7:   else
8:     for m = 1, . . . , M do
9:       if n*_m ≥ n_{t,m} then
10:        n^+_{t,m} ← n_{t,m};  n^-_{t,m} ← n*_m − n^+_{t,m}
11:      end if
12:    end for
13:    N^-_t ← Σ_{m ∈ M} n^-_{t,m};  a_{t+1,k} ← m with probability n^-_{t,m} / N^-_t
14:  end if
15:  player k pulls arm a_{t+1,k} and observes n_{t+1}
16: end for
```
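As a complement, here is a simplified, centralized simulation of the committing phase in Python. It follows the spirit of Algorithm 2 (commit when your arm is not over-subscribed, otherwise resample in proportion to the remaining deficits) but is not the paper's exact update rule; in particular, the feedback model, the deficit bookkeeping, and all names are assumptions made for illustration.

```python
import random
from typing import List

def commit_to_profile(n_star: List[int], max_rounds: int = 10_000, seed: int = 0) -> List[int]:
    """Toy simulation of the committing phase (in the spirit of Algorithm 2, simplified).

    n_star : target arm pulling profile; n_star[m] players should end up on arm m.
    Returns the final arm of each of the K = sum(n_star) players.

    Feedback model (an assumption): after every round each player observes the
    occupancy vector n_t, i.e. how many players pulled each arm.
    """
    rng = random.Random(seed)
    M, K = len(n_star), sum(n_star)
    arms = [rng.randrange(M) for _ in range(K)]    # initial arm of each player
    committed = [False] * K

    for _ in range(max_rounds):
        # Occupancy vector n_t of the current round (broadcast by the platform).
        n_t = [0] * M
        for a in arms:
            n_t[a] += 1
        if all(committed):
            break
        for k in range(K):
            if committed[k]:
                continue                           # committed players never move again
            a = arms[k]
            if n_t[a] <= n_star[a]:
                committed[k] = True                # arm not over-subscribed: commit to it
            else:
                # Resample an arm in proportion to the remaining deficit on
                # under-subscribed arms (a simplification of n^- in Algorithm 2).
                deficit = [max(n_star[m] - n_t[m], 0) for m in range(M)]
                arms[k] = rng.choices(range(M), weights=deficit)[0]
    return arms

# Toy usage: target profile [2, 1, 1] for K = 4 players.
if __name__ == "__main__":
    final = commit_to_profile([2, 1, 1])
    print(final)  # once all players have committed, arm m is held by exactly n_star[m] players
```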
Online Learning Algorithm
The online setting is addressed with the explore-then-commit (ETC) framework. During the exploration phase, players estimate the request and reward distributions; an iterative distributed algorithm (Algorithm 4) then ensures that players reach a consensus on the estimated optimal arm pulling profile even when their individual estimates differ. A runnable sketch of the exploration step follows the pseudocode below.
```markdown
Algorithm 4 Consensus(k, n*(k))
1: v_board,k ← 0;  v_board,k ← ∅;  Num ← 0
2: for t = T_0 + 1, . . . , T_0 + M do
3:   player k pulls arm (n*_{t−T_0}(k) mod M + 1)
4:   if max{m : n_{t,m} > 0} − min{m : n_{t,m} > 0} == 1 then
5:     update borderline element:  v_board,k ← (v_board,k, t − T_0)
6:     v_board,k ← v_board,k ∪ {(t − T_0,  n*_{t−T_0}(k) mod M + 1 == min{m : n_{t,m} > 0})}
7:   else if max{m : n_{t,m} > 0} − min{m : n_{t,m} > 0} > 1 then
8:     v_board,k ← (v_board,k, t − T_0), ∀k ∈ K
9:     v_board,k ← v_board,k ∪ {(t − T_0,  n*(k) mod M + 1 == max{m : n_{t,m} > 0})}, ∀k ∈ K
10:  end if
11: end for
12: for m = 1, . . . , M do
13:   if (m, n*_m(k)) ∈ v_board,k then
14:     n*_m(k) ← n*_m(k) − 1;  Num ← Num + 1
15:   end if
16: end for
17: v_board,k ← v_board,k sorted in descending order
18: n*_{v_board,k(i)} ← n*_{v_board,k(i)} + 1, ∀i = 1, . . . , Num
19: return n*(k) = [n*_m(k) : ∀m ∈ M]
```
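The exploration phase of the ETC scheme is not listed above. The sketch below shows one plausible way a player could turn round-robin exploration into the estimates that the greedy profile search needs; the feedback hook `sample_arm_feedback`, the estimator form, and the parameter names are assumptions, not the paper's specification.

```python
from typing import Callable, List, Optional, Tuple

def explore_phase(M: int, K: int, T0: int,
                  sample_arm_feedback: Callable[[int], Tuple[int, Optional[float]]],
                  ) -> Tuple[List[float], List[List[float]]]:
    """Toy exploration phase of an explore-then-commit scheme (illustrative only).

    For T0 rounds the player cycles through the M arms round-robin, recording
    (i) the number of requests it observes at the pulled arm and
    (ii) the reward of the request it served, if any.
    It returns the estimated mean reward mu_hat[m] and estimated tail
    probabilities P_hat[m][n] ~= P(arm m receives more than n requests),
    i.e. the kind of input a greedy profile search (cf. Algorithm 1) consumes.

    sample_arm_feedback(m) is an assumed environment hook returning
    (observed_request_count, reward_of_served_request_or_None).
    """
    counts: List[List[int]] = [[] for _ in range(M)]
    rewards: List[List[float]] = [[] for _ in range(M)]
    for t in range(T0):
        m = t % M                                  # round-robin exploration
        num_requests, reward = sample_arm_feedback(m)
        counts[m].append(num_requests)
        if reward is not None:
            rewards[m].append(reward)
    mu_hat = [sum(r) / len(r) if r else 0.0 for r in rewards]
    P_hat = [[sum(1 for x in counts[m] if x > n) / max(len(counts[m]), 1)
              for n in range(K + 1)]
             for m in range(M)]
    return mu_hat, P_hat
```

After the T0 exploration rounds, each player would run the greedy search on its own (mu_hat, P_hat) and then use a consensus step such as Algorithm 4 to reconcile the possibly different estimated profiles before committing.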
Results and Analysis
Impact of Exploration Length
The length of the exploration phase significantly impacts the regret and total reward. As shown in Figure 1, the regret initially increases sharply during the exploration phase and then stabilizes during the committing phase. The algorithm achieves higher rewards compared to baseline methods.
Impact of Number of Arms
The algorithm’s performance remains robust across different numbers of arms. As shown in Figure 2, the regret curves exhibit similar behavior, validating the algorithm’s convergence properties.
Impact of Number of Players
The algorithm also performs well with varying numbers of players. Figure 3 demonstrates that the regret curves stabilize after the exploration phase, and the algorithm consistently outperforms baseline methods in terms of total reward.
Impact of Standard Deviation
The algorithm’s performance is consistent across different levels of reward variability. Figure 4 shows that the regret curves stabilize after the exploration phase, and the algorithm achieves higher rewards compared to baseline methods.
Overall Conclusion
This paper introduces a new variant of the multi-player MAB model to address distributed selection problems with stochastic request arrivals. The proposed greedy algorithm efficiently locates one of the optimal arm pulling profiles, and the iterative distributed algorithm ensures that players commit to this profile within a constant number of rounds in expectation. The exploration strategy allows players to estimate the optimal profile accurately, and the consensus algorithm guarantees agreement among players. Experimental results validate the efficiency and robustness of the proposed algorithms across various settings.