Multiagent reinforcement learning using Non-Parametric Approximation

This paper presents a hybrid control proposal for multi-agent systems in which the advantages of reinforcement learning and non-parametric functions are exploited. A modified version of the Q-learning algorithm provides the training data for a kernel approximator; this approach yields a sub-optimal set of actions to be used by the agents. The proposed algorithm is experimentally tested on a path-generation task for mobile robots in an unknown environment.


Introduction
Research on multi-agent systems (MAS) is an emerging subfield of distributed artificial intelligence, which aims to provide principles for building complex systems through the integration of multiple agents.
There are features of MAS that distinguish them from single-agent control systems. First, the agents are considered partially autonomous: they do not have all the global information about the work environment available, and can therefore only access limited information. Second, in a MAS an individual agent cannot decide on an optimal action using only its local knowledge [1].
The problems addressed by RL are generally limited to problems with discrete states and a finite number of actions available to the agents. This is due to the so-called curse of dimensionality: the exponential growth of the state-action pairs to be learned as the number of states and actions in the problem increases, which leads to an increase in computation time and in the amount of memory needed to store the data associated with the algorithm [4].
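As an illustration of this growth (the figures are hypothetical and not taken from the experiments reported below): a single agent with 100 discrete states and 4 actions must learn $100 \times 4 = 400$ Q-values, whereas a team of 3 agents learning over the joint state-action space faces $100^3 \times 4^3 = 6.4 \times 10^7$ entries.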
Therefore, it is necessary to incorporate an additional strategy into reinforcement learning that offers the opportunity to generalize the results obtained. There are two approaches for approximating action-state values in RL. One of them is parametric approximators, where the functional form of the mapping and the number of parameters are designed beforehand and do not depend on the data [5]; in non-parametric approximators, on the other hand, the number of parameters and the shape of the approximator are derived from the available data.
This article proposes a methodology that combines RL with a non-parametric approximator in the form of a kernel. The algorithm consists of two learning stages. In the first, the Q-learning algorithm is used with a known model of the task; at this stage the agent explores the task environment in order to collect state-action information, that is, the optimal action to take in each of the explored states. In the second stage, the information obtained by the RL algorithm is used to adjust the weights of the kernel, which provides the actions to be taken by the agent (robot) in states that were not explored in the first learning stage.
The proposed algorithm is validated by means of simulation, where the generation of optimal trajectories for mobile robots is sought under a cooperative task scheme.
Multi-agent systems have found applications in a wide variety of fields, such as teams of robots playing soccer, distributed control, unmanned aerial vehicles, formation control, resource and traffic management, systems support, data mining, design engineering, intelligent search, medical diagnostics and product delivery, among others. The agents that make up a MAS have to deal with environments that can be static or dynamic, deterministic (an action has a single effect) or non-deterministic, and discrete (there is a finite number of actions and states) or continuous [2].
For example, most existing artificial intelligence techniques for individual agents have been developed for static environments, since these are easier to handle and allow rigorous mathematical treatment. In a MAS, the mere presence of multiple agents makes a static environment appear dynamic from the point of view of the other agents.
Although traditional control approaches seek to equip the agents of a MAS with pre-programmed or pre-designed behaviors, agents often need to learn new behaviors online so that the performance of the MAS gradually improves. This is because the complexity of the working environment in which the agents operate, and of the tasks assigned to them, makes an a priori design of the control laws difficult or even impossible.
An agent that learns through Reinforcement Learning (RL) acquires knowledge through interaction with the dynamic environment in which it performs its assigned task. At each time step, the agent perceives the state of the environment and executes a certain action, which causes the environment to transition to a new state. A scalar reward signal evaluates the quality of each state transition, so the agent must maximize the reward accumulated during its interaction with the environment. It is important to mention that the agents are not told which action to take, so they must explore the environment to find the actions that provide the greatest reward in the long term [3]. One area where reinforcement learning has been successful is trajectory planning for mobile robots, where a trajectory is generated from a starting point to an end point while respecting certain restrictions imposed on the movement, such as obstacles or the delimitation of the trajectory area.

Reinforcement learning for multi-agent Systems
The generalization of the Markov decision process to reinforcement learning in multi-agent systems (MARL) is the so-called stochastic game [7]. A stochastic game is defined as a tuple $\langle X, U_1, \dots, U_n, f, \rho_1, \dots, \rho_n \rangle$, where $X$ is the set of states, $U_i$ is the set of actions available to agent $i$, $f$ is the state transition function and $\rho_i$ is the reward function of agent $i$. The function $Q$ of each agent depends on the joint action $u = [u_1, \dots, u_n]$ and on the joint action policy. In fully cooperative stochastic games the reward functions are the same for all agents, $\rho_1 = \rho_2 = \cdots = \rho_n$, which implies that the returns are also the same, $R_1 = R_2 = \cdots = R_n$. Therefore, all agents have the same objective, which is to maximize the common return.
The optimal function is defined as $Q^*$, which satisfies the Bellman optimality equation:

$$Q^*(x, u) = \rho(x, u) + \gamma \max_{u'} Q^*\big(f(x, u), u'\big).$$

Once $Q^*$ is available, an optimal action policy can be calculated by choosing in each state an action with the highest optimal Q-value. A generalized approach used to solve the coordination problem in MAS is to make sure that any decision situation is resolved in the same way by all the agents using some type of negotiation. In our proposal, implicit coordination is used, where agents learn to choose actions in a coordinated way through trial and error.
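As a minimal illustration of this greedy extraction (a sketch with hypothetical values, not the authors' implementation), the following Python fragment recovers a deterministic joint policy from a tabular $Q^*$ by selecting, in each state, the joint action with the highest value:

```python
import numpy as np

def greedy_policy(q_star):
    """q_star[state, joint_action] holds the optimal Q-values.
    Return, for every state, the index of the maximizing joint action."""
    return np.argmax(q_star, axis=1)

# Hypothetical Q* for 3 states and 4 joint actions.
q_star = np.array([[0.1, 0.9, 0.3, 0.2],
                   [0.5, 0.4, 0.8, 0.1],
                   [0.2, 0.2, 0.1, 0.7]])
print(greedy_policy(q_star))  # -> [1 2 3]
```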

Q-learning algorithm
There is a large number of algorithms available for reinforcement learning; one of the most popular methods in RL is the Q-learning algorithm, which uses an iterative approximation procedure [8]. Q-learning starts from an arbitrary function $Q_0$ and, after observing each transition $(x_t, u_t, x_{t+1}, r_{t+1})$, updates the estimate as

$$Q_{t+1}(x_t, u_t) = Q_t(x_t, u_t) + \alpha \Big[ r_{t+1} + \gamma \max_{u'} Q_t(x_{t+1}, u') - Q_t(x_t, u_t) \Big].$$

The term in square brackets is called the temporal difference. The learning parameter $\alpha \in (0, 1]$ can be time-varying and usually decreases with time.
The sequence $Q_t$ converges to $Q^*$ under the following conditions [8]:
• Distinct Q-values are stored and updated for each state-action pair.
• The learning rate sequence satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
• Asymptotically, all state-action pairs are visited infinitely often.
The third point can be satisfied if the agents keep trying all actions in all states with non-zero probability. This requirement is called exploration, and it can be carried out in several ways; one of them is choosing at each step a random action with probability $\varepsilon \in (0, 1)$ and a greedy action with probability $1 - \varepsilon$, which yields an $\varepsilon$-greedy exploration.
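The following sketch illustrates the tabular update and the $\varepsilon$-greedy exploration described above; it is not the implementation used in the paper, and the environment interface `step(state, action) -> (next_state, reward, done)` is a hypothetical one:

```python
import numpy as np

def q_learning(step, n_states, n_actions, start_state, episodes=200,
               alpha=0.1, gamma=0.95, epsilon=0.1, max_steps=50, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative)."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x = start_state
        for _ in range(max_steps):
            # epsilon-greedy: random action with probability epsilon,
            # otherwise the current greedy action
            if rng.random() < epsilon:
                u = int(rng.integers(n_actions))
            else:
                u = int(np.argmax(q[x]))
            x_next, r, done = step(x, u)
            # temporal-difference update of the visited state-action pair
            td = r + gamma * np.max(q[x_next]) - q[x, u]
            q[x, u] += alpha * td
            x = x_next
            if done:
                break
    return q
```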

Learning by means of Non-Parametric Kernel approximator
Algorithms based on RL obtain a policy of optimal actions from the optimal values obtained during the learning process. Most of these methods rely on a discretized environment and a limited number of states, actions and agents in order to avoid the dimensionality problem.
Since most real applications have a large number of states and the Q-learning algorithm is based on lookup tables, a non-parametric kernel-based approximation method is used to approximate the values of the unknown states that are not visited while the RL algorithm is executed, and also to generalize when the environment has been slightly modified, in both cases avoiding the need to recalculate the optimal policies.
Consider observations generated by the regression model $y_i = \phi(t_i) + \varepsilon_i$, where $\phi$ is an unknown smooth response curve and $\varepsilon_i$ is the error. The goal is to find a kernel estimate $\hat{\phi}$ at some pre-specified point $t$. The kernel estimate is simply a weighted average of all data points,

$$\hat{\phi}(t) = \sum_{i=1}^{n} w_i(t)\, y_i,$$

and the sequence of weights for the kernel is defined by

$$w_i(t) = \frac{K\!\left(\frac{t - t_i}{h}\right)}{\sum_{j=1}^{n} K\!\left(\frac{t - t_j}{h}\right)},$$

where $K$ is the kernel function and $h$ its bandwidth. In automatic control applications the sample points have a constant probability density function; therefore the regression proposed by Nadaraya, used in the kernel above, can be replaced by the regression proposed by Priestley and Chao [9], defined as

$$\hat{\phi}(t) = \frac{1}{h} \sum_{i=1}^{n} (t_i - t_{i-1})\, K\!\left(\frac{t - t_i}{h}\right) y_i. \qquad (1)$$

Given the above, expression (1) is used to model the unknown or unexplored states of the environment. One possible approach is to design an expanded kernel function that encompasses a larger amount of data by means of a full amplitude $\sigma_{\max}$, which is chosen on the basis of the total number of available data.
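A minimal sketch of the Priestley-Chao estimate of expression (1), assuming a Gaussian kernel and sample points sorted in increasing order (both are assumptions; the paper does not fix the kernel shape):

```python
import numpy as np

def gaussian_kernel(z):
    """Standard Gaussian kernel K(z)."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def priestley_chao(t, t_samples, y_samples, h):
    """Priestley-Chao kernel regression estimate of phi(t).
    t_samples must be sorted in increasing order; the first spacing is
    taken as zero, which effectively drops the first sample."""
    t_samples = np.asarray(t_samples, dtype=float)
    y_samples = np.asarray(y_samples, dtype=float)
    spacings = np.diff(t_samples, prepend=t_samples[0])   # t_i - t_{i-1}
    weights = gaussian_kernel((t - t_samples) / h)         # K((t - t_i)/h)
    return float(np.sum(spacings * weights * y_samples) / h)

# Hypothetical noisy samples of a smooth curve.
ts = np.linspace(0.0, 1.0, 50)
ys = np.sin(2.0 * np.pi * ts) + 0.05 * np.random.default_rng(1).normal(size=ts.size)
print(priestley_chao(0.25, ts, ys, h=0.05))  # approximately sin(pi/2) = 1
```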

Proposed learning algorithm
The proposed learning flow is shown in Figure 1. At every discrete time step $t$, the states in which the agents are located are observed and referred to a table of state-action values called the Q-table. The Q-learning algorithm is used to obtain the optimal actions from the data of $Q^*$, which is obtained once the algorithm has converged.
When the states in which the agents are located are not available in the Q-table, either because they were not explored during the reinforcement learning algorithm or because of small changes in the environment, the actions to be performed by the agents are approximated by the kernel. The new approximation is sent to the agents so that they can continue the assigned task.
The previously trained approximator generates as output a sub-optimal action for each agent in the environment, thus avoiding having to run the RL algorithm again when the agents face an unknown state.
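A compact sketch of this decision rule (hypothetical names, not the authors' code): when the observed joint state is already in the Q-table the greedy action is taken, otherwise the previously trained kernel supplies a sub-optimal joint action:

```python
import numpy as np

def select_joint_action(state, q_table, kernel_predict):
    """q_table: dict mapping explored states to arrays of Q-values.
    kernel_predict: callable returning an approximate joint action for
    states not present in the table (hypothetical interface)."""
    if state in q_table:
        return int(np.argmax(q_table[state]))  # optimal action from the RL stage
    return kernel_predict(state)               # sub-optimal action from the kernel
```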
The proposed method can be summarized in the following steps:

1) The initial states of each training cycle of the RL Q-learning algorithm are captured:
The current state of the agents with respect to the environment is captured through sensors.

2) Limit the number of captured states:
Limiting the captured states reduces the set of states the agents require to complete the task, which saves time and computing power.
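One simple way to limit the captured states is to quantize the continuous sensor readings onto a fixed grid, as in the sketch below; the cell size and grid width are assumptions, since the paper does not specify them:

```python
def quantize_state(x, y, cell_size=0.05, grid_width=20):
    """Map a continuous position (x, y), in metres, to a single discrete
    state index on a grid_width x grid_width grid (hypothetical sizes)."""
    col = min(max(int(x / cell_size), 0), grid_width - 1)
    row = min(max(int(y / cell_size), 0), grid_width - 1)
    return row * grid_width + col
```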

3) Establish the actions available to the agents:
At each moment the agents are required to carry out an action with a certain degree of coordination; therefore, it is necessary to select in advance the most reliable actions to be carried out by the agents, with the aim of keeping the action space small and avoiding dimensionality problems.

4) Estimate the Q-state-action values of each agent:
The numerical reward of each action is calculated and given to the agents after a joint action is performed, and the values obtained are saved in a lookup table called the Q-table.

5) Repeat steps 2-4 until the agents reach the target:
The training cycle ends if the agents reach the final objective or if an established limit of iterations is reached.
6) Repeat the training cycles until the values of the Q-table converge:
This happens when the values remain unchanged or their changes are below a predetermined level.
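A minimal test consistent with this criterion (the tolerance value is an assumption) compares successive Q-tables:

```python
import numpy as np

def q_converged(q_old, q_new, tol=1e-3):
    """Return True when the largest change between successive Q-tables
    falls below a predetermined level `tol`."""
    return float(np.max(np.abs(np.asarray(q_new) - np.asarray(q_old)))) < tol
```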

7) Obtaining final Q-table of action-states:
The final table of optimal state-action values is used to select the optimal actions by locating, in every state, the action that yields the maximum Q-value.

8) Kernel training phase:
The table of Q-values obtained by the Q-learning algorithm is used to train the kernel: each column of the Q-table represents a state, which is entered as an input, and the optimal actions found are the outputs of the system.
Once the kernel has been trained, it will provide a joint approximate action that the agents will implement when they are facing unknown states that had not been explored or learned in the previous stages of learning.
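The sketch below illustrates this training stage under simplifying assumptions (states are encoded as scalar column indices and the optimal joint action is treated as a scalar regression target, which is then rounded back to a discrete action); it is an illustration, not the authors' implementation:

```python
import numpy as np

def gaussian_kernel(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def build_training_set(q_table):
    """Each column of the Q-table is a state; the training target is the
    joint action with the highest Q-value in that column."""
    states = np.arange(q_table.shape[1], dtype=float)        # column index = state
    best_actions = np.argmax(q_table, axis=0).astype(float)  # optimal joint action
    return states, best_actions

def kernel_action(state, states, best_actions, h=1.0):
    """Sub-optimal joint action for an unexplored state, obtained with the
    Priestley-Chao estimate of expression (1) and rounded to the nearest
    discrete joint action."""
    spacings = np.diff(states, prepend=states[0])
    weights = gaussian_kernel((state - states) / h)
    estimate = np.sum(spacings * weights * best_actions) / h
    return int(round(float(estimate)))

# Hypothetical Q-table with 4 joint actions (rows) and 6 states (columns).
q = np.random.default_rng(2).random((4, 6))
s, a = build_training_set(q)
print(kernel_action(2.5, s, a))  # action suggested for the unexplored state 2.5
```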

Results and Discussion
In order to validate the performance of the proposed method, two Khepera IV mobile robots are used, whose objective is to generate a trajectory from an initial point to a goal. The software used was Matlab; the robots were operated in slave mode over a Bluetooth connection at 115200 baud, and the exchange of information between the Khepera robots and Matlab was carried out through ASCII code. The task must be completed in minimum time while avoiding obstacles and coordinating among the robots. It is assumed that the agents have no prior knowledge about the position and shape of the obstacles present in the environment; the configuration of the working environment is shown in Figure 2.

Figure 2. Task Configuration
The initial position of the agents is randomly selected and 50 learning steps are carried out; if this limit is reached, the experiment is stopped and restarted. As soon as the agents find the optimal trajectory without colliding with obstacles or other agents, the values of the Q-table are considered to have converged, and the reinforcement learning stage is stopped.
In order to complete the assigned cooperative task, each agent is required to choose one action from the 4 available actions:
• Move forward 5 cm.
• Turn 25° clockwise.
• Turn 25° counterclockwise.
The reward function $\rho(x, u)$ is designed in such a way that the numeric reward of 10 given when an agent reaches the target, compared with the value of 1 assigned to intermediate transitions, prevents the agents from settling in suboptimal states motivated by intermediate rewards.
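Since the printed formula for $\rho(x, u)$ is not legible here, the following is only a sketch of one reward assignment consistent with the description (the collision penalty is an added assumption not stated in the text):

```python
def reward(reached_target, collided, moved_closer):
    """Hypothetical reward consistent with the description: a large
    terminal reward of 10 for reaching the target, a small value of 1
    for intermediate progress, and an assumed penalty for collisions."""
    if reached_target:
        return 10.0
    if collided:
        return -1.0  # assumed, not specified in the paper
    return 1.0 if moved_closer else 0.0
```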
The convergence of the algorithm is shown in Figure 3, where it can be seen that the algorithm converges after 16 trials; the duration of each trial depends on the number of steps the agents perform before the trial is stopped or the learning objective is reached.
The set of data that will be used as training samples for the Kernel are taken from the optimal Q-table generated by the Q-learning algorithm, Table I. Each column of the Q-table represents a state and the output of the approximator will be the joint action that the agents must execute.
In order to compare the performance of the algorithm, we choose as the initial position of the agents an unknown state that is not in the Q-table. In this situation it would not be possible to provide the optimal actions to the agents, so the kernel offers a coordinated sub-optimal action for each agent. Examples of these situations are shown in Figures 4 and 5, where the agents start in states that are not in the Q-table.
The control input $u = [u_1, u_2]$ of agent 1 is shown in Figure 6, where $x_1$ is the position on the x-axis, $x_2$ is the position on the y-axis, $x_3$ is the speed on the x-axis and $x_4$ is the speed on the y-axis; $u_1$ is the force applied on the x-axis and $u_2$ is the force applied on the y-axis.

Conclusions
A proposal for the integration of two control strategies is presented. In a first stage, MARL is used as a means to obtain data and information about the task and the environment in which the agents operate; these data are later used to train a non-parametric approximator.
The experimental results confirm the reliability and robustness of the controller for path planning when the agents face unknown states, overcoming the need to re-execute the reinforcement learning algorithm, which leads to savings in time and computational power.
It should be noted that when using a non-parametric approximator, the number of weights to be tuned in the kernel increases as the size of the data available for training increases, so a balance should be sought between simplicity and accuracy of the approximator.
In addition, it should be emphasized that the proposed method uses a discrete-state model of the system; this is possible thanks to the quantization of the states of the environment. The minimum number of data captured by the algorithm should be sufficient to describe the dynamics of the system and, together with the design of an appropriate reward function, ensure that there is a local maximum in the return function. The choice of the captured data will depend on prior knowledge of the problem to be solved. The natural direction for further study of this topic would be to look for techniques capable of dealing with multi-agent systems with continuous states.