Multiagent reinforcement learning using Non-Parametric Approximation

Aprendizaje por reforzamiento para sistemas multiagentes utilizando Aproximación No Paramétrica

David Luviano Cruz1*, Francesco José García-Luna2 , Luis Asunción Pérez-Domínguez3 .

This paper presents a hybrid control proposal for multi-agent systems, where the advantages of the reinforcement learning and nonparametric functions are exploited. A modified version of the Q-learning algorithm is used which will provide data training for a Kernel, this approach will provide a sub optimal set of actions to be used by the agents. The proposed algorithm is experimentally tested in a path generation task in an unknown environment for mobile robots.

Keywords:Multiagent systems, nonparametric approximator, reinforcement learning, trajectory generation


En este artículo se presenta una propuesta hibrida de algoritmo de control para sistemas multiagentes, en donde se aprovechan las ventajas del aprendizaje por reforzamiento y de las funciones de aproximación no paramétricas. Se utiliza una versión modificada del algoritmo Q-learning la cual proveerá de datos de entrenamiento para un Kernel, el cual ofrecerá una aproximación sub optima de acciones a realizar por los agentes. El algoritmo propuesto es probado experimentalmente en una tarea de generación de trayectoria en un entorno desconocido para robot móviles.

Palabras clave:Sistemas multiagentes, aproximador no paramétrico, aprendizaje por reforzamiento, generación de trayectorias.

1. Introduction

There are Research related to multiagent systems (MAS) is an emerging subfield of distributed artificial intelligence, which aims to provide principles for building complex systems through the integration of multiple agents.
There are features in MAS that distinguish it from a single-agent control system. First the agents are considered partially autonomous, this is due to the fact that the agents do not have available all the global information and therefore of the work environment, reason why they can only have access to a limited information, second in the MAS an individual agent cannot decide an optimal action only using his local knowledge [1].
Multi-agent systems have found applications in a wide variety of fields such as robot teams playing soccer, distributed control, unmanned aerial vehicles, training control, resource and traffic management, systems support, data mining, design engineering, intelligent search, medical diagnostics, product delivery, among others. The agents that make up a MAS have to deal with environments that can be static or dynamic, deterministic (an action has a single effect) or non-deterministic, discrete (there is a finite number of actions and states) or continuous [2].
For example, most existing artificial intelligence techniques for individual agents have been developed in static media as they are easier to handle and allow for rigorous mathematical treatment. In MAS, with the mere presence of multiple agents they make a static environment appear as dynamic from the point of view of other agents.
Although traditional control approaches seek to equip agents with MAS with pre-programmed or pre-designed behaviors, agents often need to learn new behaviors online so that MAS performance gradually improves. This is because the complexity of the working environment in which the agents operate and the tasks assigned to them make an a priori design of the control laws difficult or even impossible.
An agent that learns through Reinforcement Learning (RL) acquires knowledge through interaction with the dynamic environment where it performs its assigned task. In each step of time, the agent perceives the state of the environment and executes a determined action, which generates that the environment transitions to a new state. A reward signal scale evaluates the quality of each state transition so the agent must maximize the reward accumulated during the interaction with the environment, it is important to mention, the agents are not told what action to take, so they must explore the environment to find the actions that provide a greater reward in the long term [3].
One area where learning by reinforcement has been successful is trajectory planning for mobile robots, where a trajectory is generated from a starting point to an end point respecting certain restrictions imposed on movement, such as obstacles, delimitation of the trajectory area.
The problems addressed by RL are generally limited to problems with discrete states and with a finite number of actions available to agents. This is due to the so-called curse of dimensionality, which is the exponential growth of state-action pairs to learn as the number of states and actions increase in the problem, leading to an increase in computation time and the amount of memory needed to store the data associated with the algorithm [4].
Therefore, it is necessary to incorporate an additional strategy to learning by reinforcement, which offers the opportunity to generalize the results obtained.
There are two approaches to be used to approximate action-state values in RL, one of them is the parametric approximators, where the functional form of the mapping and the number of parameters are designed beforehand and do not depend on the data [5], on the other hand, in non-parametric approximators, the number of parameters and the shape of the approximator are derived as a function of the available data [6].
The article proposes a methodology that takes advantage of the RL along with a non-parametric approximator in the form of Kernel, the algorithm is integrated by two stages of learning, the first will use the Q-learning algorithm [7], where the model of the task is known, at this stage the agent will explore the task environment in order to collect information states-actions, that is to say which is the optimal action to take each one of the explored states, in the second stage, the information obtained by the algorithm of RL will be used to adjust the weights of the Kernel, which will offer us the optimal actions to take by the agent (robot) in states that were not explored in the first stage of learning.
The proposed algorithm is validated by means of simulation, where the generation of optimal trajectories for mobile robots is sought under a cooperative task scheme. Materials and methods

A generalized approach used to solve the problem of coordination in MAS is to make sure that any decision situation is solved in the same way by all the agents using some type of negotiation. In our proposal, implicit coordination is used, where agents learn to choose actions in a coordinated way through trial and error.
Q-learning algorithm
There are a large number of algorithms available for learning by reinforcement, one of the most popular methods in RL is the Q-Learning algorithm [7], which uses an iterative approach procedure. Q-Learning begins with an arbitrary functionQ, observe transitions (St, At, St+1, r t+1) and after each transition updates the function Q with:

The term in square brackets is called a temporal difference. The learning parameter α ∈(0,1] can be variant in time and usually decreases with time.
The sequence t Q converges on Q∗ under the following conditions [9]:
• Different function values Q are saved and updated for each action-state.
• The summation Σ∞ α t=0 is finite.
• Asymptotically all action-state pairs are visited infinitely.
The third point can be satisfied if agents are kept trying all actions in all states that have non-zero probability of happening. This requirement is called exploration, which can be done in several ways, one of them is choosing in each step a random action with probability. ε ∈(0,1) and choosing a greedy action with probability ε −1, this way we get a greedy exploration.

Learning by means of Non-Parametric Kernel approximator
The algorithms that use RL obtain a policy of optimal actions from the optimal values obtained during the learning process, most of these methods are based on discrete considerations of the environment and a limited number of states, actions and agents in order to avoid the problem of dimensionality.
Since most real applications have a large number of states and the Q-Learning algorithm is based on search tables, the non-parametric approximation method based on Kernel is used to approximate the unknown states that are not visited when the RL algorithm is carried out, also to make generalizations when the environment has been slightly modified, in both cases avoiding the need to recalculate the optimal policies.

A conceptually simple representation of the sequence of weights t W , is by describing the shape of the weight function by means of a density function with a scaling parameter that has the function to adjust the size and shape of the weights near the data points . t s It is common to refer to this shaping function as a kernel K[z]. The kernel is a real, continuous, dimensioned and symmetrical function which can be

In automatic control applications all random variables have a constant probability density function, with this type of random variables, the summation implied by the kernel function is equivalent to a Monte Carlo integration:

At every discreet step of time t , the states where the agents are located are observed and these are referred to a table of states-actions called Q -tabla. The Q-learning algorithm is used to obtain optimal actions, from the data of Q∗ which is obtained at the end of the convergence of the algorithm.
When the states where the agents are located are not available in the Q-table, either because they were not explored during the learning algorithm by reinforcement or by the occurrence of small changes in the environment, the actions to be performed by the agents will be approximate Kernel. The new approach will be sent to the agents in order to continue the task entrusted.
The previously trained approximator generates as output a sub-optimal action for each agent in the environment, thus avoiding to run the RL algorithm again when the agents face an unknown state.

The proposed method can be listed in the following steps:
1) The initial states of each training cycle of the RL Q-learning algorithm are captured: The current state of the agents’ state with respect to the environment is captured through sensors.
2) Limit the number of captured states: The limitation of captured states reduces the set of states agents require to complete the task which saves time and computing power.
3) Establish the actions available to the agents: At each moment the agents are required to carry out an action with a degree of coordination, therefore, it is necessary to select in advance the most reliable actions to be carried out by the agents, with the aim of keeping the space for actions minimized and avoiding dimensionality problems.
4) Estimate the Q-state-action values of each agent: The numerical reward of each action is calculated and given to the agent after a joint action is performed, the values obtained are saved in a search table called Q-table.
5) Repeat steps 2-4 until the agents reach the target: The training cycle ends if the agents reach the final objective or if an established limit of iterations is reached.
6) Repeat steps 2-5 until the Q values converge: This happens when the values remain unchanged or they are below a predetermined level.
7) Obtaining final Q-table of action-states: The final table of optimal states-actions is fine-tuned for the selection of the optimal actions by means of the location of the action that will generate the maximum value Q in every state.
8) Kernel training phase: The table of values is used Q obtained by the Q-Learning algorithm to train the kernel, each column of the table Q represents a state, which is entered as input and the optimal actions found as outputs of the system.
Once the kernel has been trained, it will provide a joint approximate action that the agents will implement when they are facing unknown states that had not been explored or learned in the previous stages of learning.

Results and discussion

In order to validate the performance of the proposed method, two Khepera IV mobile robots are used, whose objective is to generate a trajectory from an initial point to a goal. The software used was Matlab, the robots were used in slave mode with a bluetooth connection at 115200 bauds, the exchange of information between the robots Khepera and Matlab was through ASCII code. The task must be completed in a minimum time avoiding obstacles and coordinating among them, it is assumed that the agents have no prior knowledge about the position and shape of the obstacles present in the environment, the configuration of the working environment is shown (fig2)

The initial position of the agents is randomly selected and 50 learning steps are carried out, if this limit is reached, the experiment is stopped and restarted. As soon as the agents find the optimal trajectory without colliding with obstacles or other agents it will be said that the values of the Q-table have converged, so the learning stage will be stopped by reinforcement.
In order to complete the assigned cooperative task, each agent is required to choose one action from the 4 available actions

• Move forward 5 cm.
• Turn around 25° in the direction of the clock.
Matlab was through ASCII code. The task must be completed in a minimum time avoiding obstacles and coordinating among them, it is assumed that the agents have no knowledge about the position and shape of the obstacles present in the environment, configuration of the working environ • Turn around 25° in a counterclockwise direction.
• Don’t move.
The reward function ρ (x,u) is given by:
r k+1= 1 if the agent takes the target
r k+1= 1 0 if the agents takes the target to the base position
r k+1= 0 in any other situation
The reward function is designed in such a way that by assigning the numeric value 1 when the agent takes the target, the numeric reward of 10 for when the agent reaches the target prevents agents from finding suboptimal states motivated by intermediate rewards. (fig 3)

The convergence of the algorithm is shown in figure 3. In this figure it is shown that the algorithm converges after 16 trials, the duration of each trial depends on the number of steps that the agents perform before stopping the trial or reaching the learning objective. The set of data that will be used as training samples for the Kernel are taken from the optimal Q-table generated by the Q-learning algorithm, Table I. Each column of the Q-table represents a state and the output of the approximator will be the joint action that the agents must execute. (tab 1)

In order to compare the performance of the algorithm, we choose as initial position for the agents, an unknown state which is not in the Q-table, under this situation it would not be possible to provide the optimal actions to the agents, so the kernel will offer a coordinated suboptimal action for each agent, example of these situations are shown in figures 4 and 5, where the agents have initial position in states that are not in the Q-table. (fig 4)

The state vector x = [x1,x2 , x3 , x4 ]T and the control signal 1 2 u = [u ,u ] of agent 1 is shown in Figure 6, where 1 x is the position in the x, 2 x is the position on the y-axis, 3 x is the speed on the axis x, 4 x is the speed on the y-axis. 1 u is the force applied to the shaft x, 2 u is the force applied to the y-axis. (fig 6)

The state vector x= [x1,x2,x3,x4]^T,and the control signal u = [u1 ,u2 ] of agent 2 is shown in Figure 7. (fig 7)


A proposal for integration between 2 control strategies is presented. In a first stage MARL is used as a means to obtain data and information of the task and the environment in which the agents are developed, later these data are used to train a nonparametric approximator.
The experimental results confirm the reliability. and robustness of the controller for path planning when agents face unknown states, overcoming the need to re-execute the learning algorithm by reinforcement, which leads to time savings and computational power.
It should be noted that when using a non-parametric approximator, the number of weights to be tuned in the kernel increases as the size of the data available for training increases, so a balance should be sought between simplicity and accuracy of the approximator.
In addition, it is necessary to emphasize that the proposed method uses a model of discrete states of the system, this is possible due to the quantization of the states in the environment. The minimum number of data captured by the algorithm should be sufficient to describe the dynamics of the system and together with the design of an appropriate reward function ensure that there is a local maximum in the return function. The choice of data captured will depend on prior knowledge of the problem to be solved. The natural direction in the study of this topic would be to look for techniques where multi-agent systems with continuous states could be dealt with.


The authors extend their thanks to the Mexican Ministry of Public Education for funding this work through Research Agreement 511-6 / 17-7605.


[1] P. Stone, M. Veloso, “Multiagent systems: A survey from machine learning perspective”, Autonomous Robots, vol.8, no.3, pp. 345-383, 2000.
[2] M. Wooldridge, An Introduction to Multi Agent Systems, Baffins Lane, Chichester, England: John Wiley & Sons. 1992.
[3] L. Busoniu, R. Babuska and B. De Schuttert, “Multi-agent Reinforcement Learning: An Overview”, Delf Center for System and Control, Delf University of Technology, pp. 183-221, 2010.
[4] J.M. Vidal, “Learning in multiagent systems: An introduction from a game-theoretic perspective”, In: Alonso E., Kudenko D., Kazakov D. (eds) Adaptive Agents and Multi-Agent Systems. Lecture Notes in Computer Science, vol. 2636. Springer, Berlin, Heidelberg, pp. 202-215, 2003.
[5] R. Postoyan, L. Busoniu, D. Nesic and J. Daafouz, “Stability Analysis of Discrete- Time Infinite-Horizon Optimal Control with Discounted Cost”. IEEE Transactions on Automatic Control, vol. 62, no. 6, pp. 2736– 2749, 2017
[6] B. Kiumarsi, K.G. Vamvoudakis, M. Hamidreza and F.L. Lewis. “Optimal and Autonomous Control Using Reinforcement Learning: A Survey”, IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2042- 2062, 2018.
[7] C. Watkins, P. Dayan, “Q Learning: Technical Note”, Machine Learning, vol.8, pp. 279-292, 1992.
[8] C. Boutilier, “Planning Learning and Coordination in Multiagent Decision Processes”, In Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge (TARK96), 1996, pp. 195-202, 1996.
[9] Y. Ishiwaka, T. Sato and Y. Kakazu, “An approach to the pursuit problem on a heterogeneous multiagent system using reinforcement learning”, Robotics and Autonomous Systems, vol. 43, no. 4, pp.245-256, 2003.
[10] A. Nadaraya, “On Estimating Regression”, Theory of Probability and its Applications, vol. 9, no.1, pp. 141-142, 1964.

1* Doctor en Ciencias, Orcid: 0000-0002-4778-8873, Universidad Autónoma de Ciudad Juárez, Juárez, México,
2 Doctor en Ciencias, Orcid: 0000-0002-8571-914X, Universidad Autónoma de Ciudad Juárez, Juárez, México,
3 Doctor en Ciencias en Ingeniería, Orcid: 0000-0003-2541-4595, Universidad Autónoma de Ciudad Juárez, Juárez, México,

Received on January 21, 2018 - Approved on June 12, 2018.
How to cite :D.L. Cruz, F.J. García-Luna, L.A. Pérez-Domínguez, “Multiagent reinforcement learning using Non-Parametric Approximation”, Respuestas, vol. 23, no. 2, pp. 53-61, 2019

Received on January 10, 2018 - Approved on May 29, 2018.