This paper presents a hybrid control proposal for multi-agent systems, where the advantages of the reinforcement learning and nonparametric functions are exploited. A modified version of the Q-learning algorithm is used which will provide data training for a Kernel, this approach will provide a sub optimal set of actions to be used by the agents. The proposed algorithm is experimentally tested in a path generation task in an unknown environment for mobile robots.

**Keywords:**Multiagent
systems,
nonparametric
approximator,
reinforcement
learning, trajectory
generation

En este artículo se presenta una propuesta hibrida de algoritmo de control para sistemas multiagentes, en donde se aprovechan las ventajas del aprendizaje por reforzamiento y de las funciones de aproximación no paramétricas. Se utiliza una versión modificada del algoritmo Q-learning la cual proveerá de datos de entrenamiento para un Kernel, el cual ofrecerá una aproximación sub optima de acciones a realizar por los agentes. El algoritmo propuesto es probado experimentalmente en una tarea de generación de trayectoria en un entorno desconocido para robot móviles.

**Palabras clave:**Sistemas
multiagentes,
aproximador
no paramétrico,
aprendizaje por
reforzamiento,
generación de
trayectorias.

There are Research related to multiagent systems
(MAS) is an emerging subfield of distributed artificial
intelligence, which aims to provide principles for
building complex systems through the integration of
multiple agents.

There are features in MAS that distinguish it from
a single-agent control system. First the agents are
considered partially autonomous, this is due to the
fact that the agents do not have available all the global
information and therefore of the work environment,
reason why they can only have access to a limited
information, second in the MAS an individual agent
cannot decide an optimal action only using his local
knowledge [1].

Multi-agent systems have found applications in a
wide variety of fields such as robot teams playing
soccer, distributed control, unmanned aerial vehicles,
training control, resource and traffic management,
systems support, data mining, design engineering,
intelligent search, medical diagnostics, product
delivery, among others. The agents that make up a
MAS have to deal with environments that can be
static or dynamic, deterministic (an action has a
single effect) or non-deterministic, discrete (there is
a finite number of actions and states) or continuous
[2].

For example, most existing artificial intelligence
techniques for individual agents have been developed
in static media as they are easier to handle and allow
for rigorous mathematical treatment. In MAS, with
the mere presence of multiple agents they make a
static environment appear as dynamic from the point
of view of other agents.

Although traditional control approaches seek to
equip agents with MAS with pre-programmed or
pre-designed behaviors, agents often need to learn
new behaviors online so that MAS performance
gradually improves. This is because the complexity
of the working environment in which the agents
operate and the tasks assigned to them make an a
priori design of the control laws difficult or even
impossible.

An agent that learns through Reinforcement Learning
(RL) acquires knowledge through interaction with the
dynamic environment where it performs its assigned
task. In each step of time, the agent perceives the
state of the environment and executes a determined
action, which generates that the environment
transitions to a new state. A reward signal scale
evaluates the quality of each state transition so the
agent must maximize the reward accumulated during
the interaction with the environment, it is important
to mention, the agents are not told what action to
take, so they must explore the environment to find
the actions that provide a greater reward in the long
term [3].

One area where learning by reinforcement has been
successful is trajectory planning for mobile robots,
where a trajectory is generated from a starting point to
an end point respecting certain restrictions imposed
on movement, such as obstacles, delimitation of the
trajectory area.

The problems addressed by RL are generally limited
to problems with discrete states and with a finite
number of actions available to agents. This is due
to the so-called curse of dimensionality, which is
the exponential growth of state-action pairs to learn
as the number of states and actions increase in the
problem, leading to an increase in computation time
and the amount of memory needed to store the data
associated with the algorithm [4].

Therefore, it is necessary to incorporate an additional
strategy to learning by reinforcement, which offers
the opportunity to generalize the results obtained.

There are two approaches to be used to approximate
action-state values in RL, one of them is the
parametric approximators, where the functional
form of the mapping and the number of parameters
are designed beforehand and do not depend on
the data [5], on the other hand, in non-parametric
approximators, the number of parameters and the
shape of the approximator are derived as a function
of the available data [6].

The article proposes a methodology that takes
advantage of the RL along with a non-parametric
approximator in the form of Kernel, the algorithm
is integrated by two stages of learning, the first will
use the Q-learning algorithm [7], where the model of
the task is known, at this stage the agent will explore
the task environment in order to collect information
states-actions, that is to say which is the optimal
action to take each one of the explored states, in
the second stage, the information obtained by the
algorithm of RL will be used to adjust the weights
of the Kernel, which will offer us the optimal actions
to take by the agent (robot) in states that were not
explored in the first stage of learning.

The proposed algorithm is validated by means
of simulation, where the generation of optimal
trajectories for mobile robots is sought under a
cooperative task scheme.
**Materials and methods **

A generalized approach used to solve the problem
of coordination in MAS is to make sure that any
decision situation is solved in the same way by all
the agents using some type of negotiation. In our
proposal, implicit coordination is used, where agents
learn to choose actions in a coordinated way through
trial and error.

**Q-learning algorithm**

There are a large number of algorithms available
for learning by reinforcement, one of the most
popular methods in RL is the Q-Learning algorithm
[7], which uses an iterative approach procedure.
Q-Learning begins with an arbitrary functionQ,
observe transitions (St, At, St+1, r t+1) and after each
transition updates the function Q with:

The term in square brackets is called a temporal
difference. The learning parameter α ∈(0,1] can be
variant in time and usually decreases with time.

The sequence t Q converges on Q∗ under the
following conditions [9]:

• Different function values Q are saved and
updated for each action-state.

• The summation Σ∞ α
t=0 is finite.

• Asymptotically all action-state pairs are visited
infinitely.

The third point can be satisfied if agents are kept
trying all actions in all states that have non-zero
probability of happening. This requirement is called
exploration, which can be done in several ways, one
of them is choosing in each step a random action
with probability. ε ∈(0,1) and choosing a greedy
action with probability ε −1, this way we get a
greedy exploration.

**Learning by means of Non-Parametric Kernel
approximator**

The algorithms that use RL obtain a policy of optimal
actions from the optimal values obtained during the
learning process, most of these methods are based
on discrete considerations of the environment and a
limited number of states, actions and agents in order
to avoid the problem of dimensionality.

Since most real applications have a large number
of states and the Q-Learning algorithm is based on
search tables, the non-parametric approximation
method based on Kernel is used to approximate the
unknown states that are not visited when the RL
algorithm is carried out, also to make generalizations
when the environment has been slightly modified,
in both cases avoiding the need to recalculate the
optimal policies.

A conceptually simple representation of the sequence of weights t W , is by describing the shape of the weight function by means of a density function with a scaling parameter that has the function to adjust the size and shape of the weights near the data points . t s It is common to refer to this shaping function as a kernel K[z]. The kernel is a real, continuous, dimensioned and symmetrical function which can be

In automatic control applications all random variables have a constant probability density function, with this type of random variables, the summation implied by the kernel function is equivalent to a Monte Carlo integration:

At every discreet step of time t , the states where
the agents are located are observed and these are
referred to a table of states-actions called Q -tabla.
The Q-learning algorithm is used to obtain optimal
actions, from the data of Q∗ which is obtained at the
end of the convergence of the algorithm.

When the states where the agents are located are
not available in the Q-table, either because they
were not explored during the learning algorithm by
reinforcement or by the occurrence of small changes
in the environment, the actions to be performed
by the agents will be approximate Kernel. The
new approach will be sent to the agents in order to
continue the task entrusted.

The previously trained approximator generates as
output a sub-optimal action for each agent in the
environment, thus avoiding to run the RL algorithm
again when the agents face an unknown state.

The proposed method can be listed in the following
steps:

1) The initial states of each training cycle of the RL
Q-learning algorithm are captured: The current
state of the agents’ state with respect to the
environment is captured through sensors.

2) Limit the number of captured states: The
limitation of captured states reduces the set of
states agents require to complete the task which
saves time and computing power.

3) Establish the actions available to the agents: At
each moment the agents are required to carry
out an action with a degree of coordination,
therefore, it is necessary to select in advance
the most reliable actions to be carried out by the
agents, with the aim of keeping the space for
actions minimized and avoiding dimensionality
problems.

4) Estimate the Q-state-action values of each
agent: The numerical reward of each action is
calculated and given to the agent after a joint
action is performed, the values obtained are
saved in a search table called Q-table.

5) Repeat steps 2-4 until the agents reach the
target: The training cycle ends if the agents
reach the final objective or if an established limit
of iterations is reached.

6) Repeat steps 2-5 until the Q values converge:
This happens when the values remain unchanged
or they are below a predetermined level.

7) Obtaining final Q-table of action-states: The
final table of optimal states-actions is fine-tuned
for the selection of the optimal actions by means
of the location of the action that will generate the
maximum value Q in every state.

8) Kernel training phase: The table of values is
used Q obtained by the Q-Learning algorithm
to train the kernel, each column of the table
Q represents a state, which is entered as input
and the optimal actions found as outputs of the
system.

Once the kernel has been trained, it will provide
a joint approximate action that the agents will
implement when they are facing unknown states
that had not been explored or learned in the previous
stages of learning.

In order to validate the performance of the proposed
method, two Khepera IV mobile robots are used,
whose objective is to generate a trajectory from
an initial point to a goal. The software used was
Matlab, the robots were used in slave mode with a
bluetooth connection at 115200 bauds, the exchange
of information between the robots Khepera and
Matlab was through ASCII code. The task must be
completed in a minimum time avoiding obstacles
and coordinating among them, it is assumed that the
agents have no prior knowledge about the position
and shape of the obstacles present in the environment,
the configuration of the working environment is
shown (fig2)

The initial position of the agents is randomly selected
and 50 learning steps are carried out, if this limit is
reached, the experiment is stopped and restarted. As
soon as the agents find the optimal trajectory without
colliding with obstacles or other agents it will be said
that the values of the Q-table have converged, so the
learning stage will be stopped by reinforcement.

In order to complete the assigned cooperative task,
each agent is required to choose one action from the
4 available actions

• Move forward 5 cm.

• Turn around 25° in the direction of the clock.

Matlab was through ASCII code. The task must be completed in a minimum time avoiding
obstacles and coordinating among them, it is assumed that the agents have no knowledge about the position and shape of the obstacles present in the environment, configuration of the working environ
• Turn around 25° in a counterclockwise direction.

• Don’t move.

The reward function ρ (x,u) is given by:

r k+1= 1 if the agent takes the target

r k+1= 1 0 if the agents takes the target to the base position

r k+1= 0 in any other situation

The reward function is designed in such a way that
by assigning the numeric value 1 when the agent
takes the target, the numeric reward of 10 for when
the agent reaches the target prevents agents from
finding suboptimal states motivated by intermediate
rewards. (fig 3)

The convergence of the algorithm is shown in figure 3. In this figure it is shown that the algorithm converges after 16 trials, the duration of each trial depends on the number of steps that the agents perform before stopping the trial or reaching the learning objective. The set of data that will be used as training samples for the Kernel are taken from the optimal Q-table generated by the Q-learning algorithm, Table I. Each column of the Q-table represents a state and the output of the approximator will be the joint action that the agents must execute. (tab 1)

In order to compare the performance of the algorithm, we choose as initial position for the agents, an unknown state which is not in the Q-table, under this situation it would not be possible to provide the optimal actions to the agents, so the kernel will offer a coordinated suboptimal action for each agent, example of these situations are shown in figures 4 and 5, where the agents have initial position in states that are not in the Q-table. (fig 4)

The state vector x = [x1,x2 , x3 , x4 ]T and the control signal 1 2 u = [u ,u ] of agent 1 is shown in Figure 6, where 1 x is the position in the x, 2 x is the position on the y-axis, 3 x is the speed on the axis x, 4 x is the speed on the y-axis. 1 u is the force applied to the shaft x, 2 u is the force applied to the y-axis. (fig 6)

The state vector x= [x1,x2,x3,x4]^T,and the control signal u = [u1 ,u2 ] of agent 2 is shown in Figure 7. (fig 7)

A proposal for integration between 2 control
strategies is presented. In a first stage MARL is
used as a means to obtain data and information of
the task and the environment in which the agents are
developed, later these data are used to train a nonparametric
approximator.

The experimental results confirm the reliability. and
robustness of the controller for path planning when
agents face unknown states, overcoming the need to
re-execute the learning algorithm by reinforcement,
which leads to time savings and computational
power.

It should be noted that when using a non-parametric
approximator, the number of weights to be tuned in
the kernel increases as the size of the data available
for training increases, so a balance should be sought
between simplicity and accuracy of the approximator.

In addition, it is necessary to emphasize that the
proposed method uses a model of discrete states of
the system, this is possible due to the quantization of
the states in the environment. The minimum number
of data captured by the algorithm should be sufficient
to describe the dynamics of the system and together
with the design of an appropriate reward function
ensure that there is a local maximum in the return
function. The choice of data captured will depend
on prior knowledge of the problem to be solved. The
natural direction in the study of this topic would be
to look for techniques where multi-agent systems
with continuous states could be dealt with.

The authors extend their thanks to the Mexican Ministry of Public Education for funding this work through Research Agreement 511-6 / 17-7605.

[1] P. Stone, M. Veloso, “Multiagent systems: A
survey from machine learning perspective”,
Autonomous Robots, vol.8, no.3, pp. 345-383,
2000.

[2] M. Wooldridge, An Introduction to Multi Agent
Systems, Baffins Lane, Chichester, England:
John Wiley & Sons. 1992.

[3] L. Busoniu, R. Babuska and B. De Schuttert,
“Multi-agent Reinforcement Learning: An
Overview”, Delf Center for System and Control,
Delf University of Technology, pp. 183-221,
2010.

[4] J.M. Vidal, “Learning in multiagent systems: An
introduction from a game-theoretic perspective”,
In: Alonso E., Kudenko D., Kazakov D. (eds)
Adaptive Agents and Multi-Agent Systems.
Lecture Notes in Computer Science, vol. 2636.
Springer, Berlin, Heidelberg, pp. 202-215, 2003.

[5] R. Postoyan, L. Busoniu, D. Nesic and J.
Daafouz, “Stability Analysis of Discrete-
Time Infinite-Horizon Optimal Control with
Discounted Cost”. IEEE Transactions on
Automatic Control, vol. 62, no. 6, pp. 2736–
2749, 2017

[6] B. Kiumarsi, K.G. Vamvoudakis, M. Hamidreza
and F.L. Lewis. “Optimal and Autonomous
Control Using Reinforcement Learning: A
Survey”, IEEE Transactions on Neural Networks
and Learning Systems, vol. 29, no. 6, pp. 2042-
2062, 2018.

[7] C. Watkins, P. Dayan, “Q Learning: Technical
Note”, Machine Learning, vol.8, pp. 279-292,
1992.

[8] C. Boutilier, “Planning Learning and
Coordination in Multiagent Decision Processes”,
In Proceedings of the Sixth Conference
on Theoretical Aspects of Rationality and
Knowledge (TARK96), 1996, pp. 195-202, 1996.

[9] Y. Ishiwaka, T. Sato and Y. Kakazu, “An approach
to the pursuit problem on a heterogeneous
multiagent system using reinforcement
learning”, Robotics and Autonomous Systems,
vol. 43, no. 4, pp.245-256, 2003.

[10] A. Nadaraya, “On Estimating Regression”,
Theory of Probability and its Applications, vol.
9, no.1, pp. 141-142, 1964.

1* Doctor en Ciencias, Orcid: 0000-0002-4778-8873, Universidad Autónoma de Ciudad Juárez, Juárez, México, david.luviano@uacj.mx

2 Doctor en Ciencias, Orcid: 0000-0002-8571-914X, Universidad Autónoma de Ciudad Juárez, Juárez, México, francesco.garcia@uacj.mx

3 Doctor en Ciencias en Ingeniería, Orcid: 0000-0003-2541-4595, Universidad Autónoma de Ciudad Juárez, Juárez, México, luis.dominguez@uacj.mx

Received on January 21, 2018 - Approved on June 12, 2018.

Received on January 10, 2018 - Approved on May 29, 2018.