The Task
Our task domain is the NRL Navigation task developed by Alan Schultz at the Naval
Research Laboratory (NRL). It requires piloting an underwater vehicle
through a field of mines guided by a small suite of sonar, range,
bearing and fuel sensors. Sensor information is presented via an
instrument panel that is updated in real-time. The sensors are noisy.
Decisions about motion of the vehicle (speed and turn) are
communicated via a joystick interface. The task objective is to
rendezvous with a stationary target before exhausting fuel and without
hitting the mines. The mines may be stationary or drifting. A trial
or episode begins with the vehicle being randomly placed on one side
of a mine field and ends with one of three possible outcomes: the
vehicle reaches the target, hits a mine, or exhausts its fuel.
Reinforcement, in the form of a scalar reward dependent on the
outcome, is received at the end of each episode.
Since the mine configurations vary from episode to episode, it is
fruitless for subjects to memorize a sequence of actions that will get
the vehicle to the target. To solve the task, subjects must learn a
policy for choosing actions based on the sensor values presented to
them.
Mathematical characterization of the task
The Navigation task belongs to the family of partially observable
Markov decision processes. With the addition of the last action
taken, we can transform it into a fully observable Markov decision
process (MDP). This transformation lends theoretical tractability
because deterministic optimal decision procedures exist for MDPs.
However, the size of the state space is about 10^{18} and there are
153 choices of action at each time step, which make the Navigation
task extremely challenging both for humans as well as for present-day
learning algorithms like reinforcement learning.
Why the task is hard for humans
There are four major sources of complexity in the Navigation task
from a cognitive perspective:
- the need for rapid decision making with incomplete information.
- the sheer number of distinct sensor configurations for
which an action choice has to be computed, and the need to learn a
partition in the sensor space while acquiring a policy.
- limited
binary feedback at the end of each episode.
- a tightly coupled
action space in which the different components (turn and speed) cannot
be learned independently.
Together, these make the task difficult for our human subjects; one
out of every three never acquires the task with our current training
protocols.