Multi-agent actor-critic with time dynamical opponent model

In multi-agent reinforcement learning, multiple agents learn simultaneously while interacting with a common environment and with each other. Because the agents adapt their policies during learning, not only does the behavior of each individual agent become non-stationary, but so does the environment as perceived by that agent. This makes policy improvement particularly challenging. In this paper, we exploit the fact that the agents seek to improve their expected cumulative reward and introduce a novel Time Dynamical Opponent Model (TDOM) that encodes the knowledge that opponent policies tend to improve over time. We motivate TDOM theoretically by deriving a lower bound of the log objective of an individual agent, and we further propose Multi-Agent Actor-Critic with Time Dynamical Opponent Model (TDOM-AC). We evaluate TDOM-AC on a differential game and on the Multi-agent Particle Environment. We show empirically that TDOM achieves superior opponent behavior prediction at test time. TDOM-AC outperforms state-of-the-art actor-critic methods on the performed tasks in cooperative and, especially, in mixed cooperative-competitive environments, and it results in more stable training and faster convergence. Our code is available at https://github.com/Yuantian013/TDOM-AC.

10.1016/j.neucom.2022.10.045

0925-2312

165-172

Tian, Yuan

c66ed5b1-2e87-4c26-8bd8-5dc1314cc268

Kladny, Klaus-Rudolf

9c62dd91-9a32-4bcd-b7bc-24fbb7e5d2fe

Wang, Qin

b018eb23-13bc-4226-a1ed-7b4951cca7af

Huang, Zhiwu

84f477cd-9097-44dd-a33e-ff71f253d36b

Fink, Olga

1902ad46-555e-498e-8117-2bcb12b4958a

14 January 2023