To begin, your only task is to get the smartcab to move around in the environment. At this point, you will not be concerned with any sort of optimal driving policy. Note that the driving agent is given the following information at each intersection:
The next waypoint location relative to its current location and heading.
The state of the traffic light at the intersection and the presence of oncoming vehicles from other directions.
The current time left from the allotted deadline.
To complete this task, simply have your driving agent choose a random action from the set of possible actions (None, 'forward', 'left', 'right') at each intersection, disregarding the input information above. Set the simulation deadline enforcement, enforce_deadline to False and observe how it performs.
import random
from environment import Agent, Environment
from planner import RoutePlanner
from simulator import Simulator


class LearningAgent(Agent):
    """An agent that learns to drive in the smartcab world."""

    def __init__(self, env):
        super(LearningAgent, self).__init__(env)
        # sets self.env = env, state = None, next_waypoint = None, and a default color
        self.color = 'red'  # override color
        self.planner = RoutePlanner(self.env, self)  # simple route planner to get next_waypoint
        # TODO: Initialize any additional variables here

    def reset(self, destination=None):
        self.planner.route_to(destination)
        # TODO: Prepare for a new trip; reset any variables here, if required

    def update(self, t):
        # Gather inputs
        self.next_waypoint = self.planner.next_waypoint()  # from route planner, also displayed by simulator
        inputs = self.env.sense(self)
        deadline = self.env.get_deadline(self)

        # TODO: Update state

        # TODO: Select action according to your policy
        action = random.choice([None, 'forward', 'left', 'right'])

        # Execute action and get reward
        reward = self.env.act(self, action)

        # TODO: Learn policy based on state, action, reward
        # print "LearningAgent.update(): deadline = {}, inputs = {}, action = {}, reward = {}".format(deadline, inputs, action, reward)  # [debug]


def run():
    """Run the agent for a finite number of trials."""

    # Set up environment and agent
    e = Environment()  # create environment (also adds some dummy traffic)
    a = e.create_agent(LearningAgent)  # create agent
    e.set_primary_agent(a, enforce_deadline=True)  # specify agent to track
    # NOTE: You can set enforce_deadline=False while debugging to allow longer trials

    # Now simulate it
    sim = Simulator(e, update_delay=0, display=False)  # create simulator (uses pygame when display=True, if available)
    # NOTE: To speed up simulation, reduce update_delay and/or set display=False
    sim.run(n_trials=100)  # run for a specified number of trials
    # NOTE: To quit midway, press Esc or close pygame window, or hit Ctrl+C on the command-line


run()
QUESTION:
Observe what you see with the agent's behavior as it takes random actions. Does the smartcab eventually make it to the destination? Are there any other interesting observations to note?
ANSWER:
The smartcab does pretty much what you would expect: it behaves randomly. It would occasionally stop for no reason, run red lights and completely disregard traffic. Sometimes it would do the right thing, and in very rare cases it would make it to the destination.
There are 3 other agents that exist simultaneously in the world. The observed rewards range from negative penalties for traffic violations to positive rewards for correct moves toward the next waypoint.
Now that your driving agent is capable of moving around in the environment, your next task is to identify a set of states that are appropriate for modeling the smartcab and environment. The main source of state variables are the current inputs at the intersection, but not all may require representation. You may choose to explicitly define states, or use some combination of inputs as an implicit state. At each time step, process the inputs and update the agent's current state using the self.state variable. Continue with the simulation deadline enforcement enforce_deadline being set to False, and observe how your driving agent now reports the change in state as the simulation progresses.
class LearningAgent(Agent):
    """An agent that learns to drive in the smartcab world."""

    def __init__(self, env):
        super(LearningAgent, self).__init__(env)  # sets self.env = env, state = None, next_waypoint = None, and a default color
        self.color = 'red'  # override color
        self.planner = RoutePlanner(self.env, self)  # simple route planner to get next_waypoint
        # TODO: Initialize any additional variables here

    def reset(self, destination=None):
        self.planner.route_to(destination)
        # TODO: Prepare for a new trip; reset any variables here, if required

    def update(self, t):
        # Gather inputs
        self.next_waypoint = self.planner.next_waypoint()  # from route planner, also displayed by simulator
        inputs = self.env.sense(self)
        deadline = self.env.get_deadline(self)

        # TODO: Update state
        inputs.pop('right')
        self.state = inputs
        self.state['waypoint'] = self.next_waypoint

        # TODO: Select action according to your policy
        action = random.choice([None, 'forward', 'left', 'right'])

        # Execute action and get reward
        reward = self.env.act(self, action)

        # TODO: Learn policy based on state, action, reward
        # print "LearningAgent.update(): deadline = {}, inputs = {}, action = {}, reward = {}".format(deadline, inputs, action, reward)  # [debug]


run()
QUESTION:
What states have you identified that are appropriate for modeling the smartcab and environment? Why do you believe each of these states to be appropriate for this problem?
OPTIONAL:
How many states in total exist for the smartcab in this environment? Does this number seem reasonable given that the goal of Q-Learning is to learn and make informed decisions about each state? Why or why not?
ANSWER:
The identified state consists of most of the inputs plus the next waypoint. Its parameters and possible values are:
light: 'green' or 'red' (2 values)
oncoming: None, 'forward', 'left' or 'right' (4 values)
left: None, 'forward', 'left' or 'right' (4 values)
waypoint: 'forward', 'left' or 'right' (3 values)
Generally speaking, our problem is to get our smartcab to a specified location in our grid world. Normally, this would be much more complicated. However, pathing has already been solved for us by the planner module, which tells us where the next waypoint is. This also effectively lets us bypass the problem of localization, so we don't have to worry about sensing and ascertaining exactly where our smartcab is. What we do have to worry about are the conditions of its immediate vicinity (the inputs) and where it must go next (the next waypoint). So the inputs plus the waypoint are really all we need to take an action.
We have chosen to exclude two parameters:
Deadline was excluded since we would only need to take it into account if we wanted to control pathing and optimize our route to the destination, but that problem is already taken care of by the planner. We would also like the cab to obey traffic rules even if it were running out of time, making this parameter irrelevant. Moreover, the current range of deadline is 0 to 60; including it would cause the number of states to blow up and scale poorly if we were to consider a larger world.
Right can be safely excluded since, based on the right-of-way rules, traffic from the right need not be considered whether the cab is moving forward, turning left or turning right.
Altogether there are 2 x 4 x 4 x 3 = 96 total states. That is a small number compared to the situations that can occur in a real-world environment. If we were to take into account different vehicles, pedestrians, road signs, weather, road elevation, etc., there would certainly be a lot more than 96 states.
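As a quick sanity check on that count, the state space can be enumerated directly. This is a small sketch using itertools, with the value sets taken from the parameter list above:

from itertools import product

LIGHTS = ['green', 'red']
TRAFFIC = [None, 'forward', 'left', 'right']  # possible values for oncoming and left
WAYPOINTS = ['forward', 'left', 'right']

states = list(product(LIGHTS, TRAFFIC, TRAFFIC, WAYPOINTS))
print len(states)  # 2 * 4 * 4 * 3 = 96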
With your driving agent being capable of interpreting the input information and having a mapping of environmental states, your next task is to implement the Q-Learning algorithm for your driving agent to choose the best action at each time step, based on the Q-values for the current state and action. Each action taken by the smartcab will produce a reward which depends on the state of the environment. The Q-Learning driving agent will need to consider these rewards when updating the Q-values. Once implemented, set the simulation deadline enforcement enforce_deadline to True. Run the simulation and observe how the smartcab moves about the environment in each trial.
import operator

Q_INITIAL = 2
ALPHA = 1


class hashabledict(dict):
    """A dict that hashes by its items, so a state dict can be used as a Q-table key."""
    def __hash__(self):
        return hash(frozenset(self.iteritems()))


class LearningAgent(Agent):
    """An agent that learns to drive in the smartcab world."""

    def __init__(self, env):
        super(LearningAgent, self).__init__(env)  # sets self.env = env, state = None, next_waypoint = None, and a default color
        self.color = 'red'  # override color
        self.planner = RoutePlanner(self.env, self)  # simple route planner to get next_waypoint
        # TODO: Initialize any additional variables here
        self.policy = {}
        self.alpha = ALPHA

    def reset(self, destination=None):
        self.planner.route_to(destination)
        # TODO: Prepare for a new trip; reset any variables here, if required

    def update(self, t):
        # Gather inputs
        self.next_waypoint = self.planner.next_waypoint()  # from route planner, also displayed by simulator
        inputs = self.env.sense(self)
        deadline = self.env.get_deadline(self)

        # TODO: Update state
        inputs.pop('right')
        self.state = inputs
        self.state['waypoint'] = self.next_waypoint

        # TODO: Select action according to your policy
        state_hash = hashabledict(frozenset(self.state.iteritems()))
        q = self.policy.get(state_hash, {
            None: Q_INITIAL,
            'forward': Q_INITIAL,
            'left': Q_INITIAL,
            'right': Q_INITIAL,
        })
        action = max(q.iteritems(), key=operator.itemgetter(1))[0]

        # Execute action and get reward
        reward = self.env.act(self, action)

        # TODO: Learn policy based on state, action, reward
        q[action] = (1 - self.alpha) * q[action] + self.alpha * reward
        self.policy[state_hash] = q


run()
QUESTION:
What changes do you notice in the agent's behavior when compared to the basic driving agent when random actions were always taken? Why is this behavior occurring?
ANSWER:
The smartcab's behavior starts out random but slowly corrects itself once negative rewards are encountered. For example, after it runs a red light and receives a negative reward, it stops doing that and tries something else the next time around. This is because the Q-value of that state-action pair is degraded, making it a less appealing option for the smartcab.
Interestingly, the cab encounters the more common cases first (usually intersections without traffic) and therefore builds up Q-values for those state-action pairs first. It can get confused when it encounters an entirely new state, since all actions in that state start out with the same uniform Q-value. For example, even though it previously ran a red light with no traffic and received a negative reward, the presence of traffic in any direction produces a brand-new state, so it may run the light again and seemingly repeat the same mistake.
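For a concrete feel of the update rule used above, here is a single-step illustration. The -1.0 penalty is just an example value; with ALPHA = 1 the old Q-value is completely overwritten by the latest reward:

Q_INITIAL = 2
ALPHA = 1

# Q-values for one state; every action starts out optimistic
q = {None: Q_INITIAL, 'forward': Q_INITIAL, 'left': Q_INITIAL, 'right': Q_INITIAL}

reward = -1.0  # example penalty, e.g. for running the red light
q['forward'] = (1 - ALPHA) * q['forward'] + ALPHA * reward

print q  # 'forward' is now -1.0, so any not-yet-tried action looks better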
Your final task for this project is to enhance your driving agent so that, after sufficient training, the smartcab is able to reach the destination within the allotted time safely and efficiently. Parameters in the Q-Learning algorithm, such as the learning rate (alpha), the discount factor (gamma) and the exploration rate (epsilon) all contribute to the driving agent’s ability to learn the best action for each state. To improve on the success of your smartcab:
This task is complete once you have arrived at what you determine is the best combination of parameters required for your driving agent to learn successfully.
import pandas as pd


class LearningAgent(Agent):
    """An agent that learns to drive in the smartcab world."""

    def __init__(self, env, alpha, q_initial):
        super(LearningAgent, self).__init__(env)  # sets self.env = env, state = None, next_waypoint = None, and a default color
        self.color = 'red'  # override color
        self.planner = RoutePlanner(self.env, self)  # simple route planner to get next_waypoint
        # TODO: Initialize any additional variables here
        self.policy = {}
        self.trip_log = []
        self.trip = None
        self.alpha = alpha
        self.q_initial = q_initial

    def reset(self, destination=None):
        self.planner.route_to(destination)
        # TODO: Prepare for a new trip; reset any variables here, if required
        if self.trip:
            self.trip_log.append(self.trip)
        self.trip = {}
        self.trip['Deadline'] = self.env.get_deadline(self)
        self.trip['Reward'] = 0
        self.trip['Penalty'] = 0

    def update(self, t):
        # Gather inputs
        self.next_waypoint = self.planner.next_waypoint()  # from route planner, also displayed by simulator
        inputs = self.env.sense(self)
        deadline = self.env.get_deadline(self)

        # TODO: Update state
        inputs.pop('right')
        self.state = inputs
        self.state['waypoint'] = self.next_waypoint

        # TODO: Select action according to your policy
        state_hash = hashabledict(frozenset(self.state.iteritems()))
        q = self.policy.get(state_hash, {
            None: self.q_initial,
            'forward': self.q_initial,
            'left': self.q_initial,
            'right': self.q_initial,
        })
        action = max(q.iteritems(), key=operator.itemgetter(1))[0]

        # Execute action and get reward
        reward = self.env.act(self, action)

        # Update trip stats
        self.trip['Reward'] += reward
        self.trip['Remaining'] = self.env.get_deadline(self)
        self.trip['Success'] = self.planner.next_waypoint() is None
        if reward < 0:
            self.trip['Penalty'] += reward

        # TODO: Learn policy based on state, action, reward
        q[action] = (1 - self.alpha) * q[action] + self.alpha * reward
        self.policy[state_hash] = q
        # print "LearningAgent.update(): deadline = {}, inputs = {}, action = {}, reward = {}".format(deadline, inputs, action, reward)  # [debug]


def run():
    """Run the agent for a finite number of trials."""
    record = []
    for q_initial in [0, 2, 10]:
        for alpha in range(1, 6):
            # Set up environment and agent
            e = Environment()  # create environment (also adds some dummy traffic)
            a = e.create_agent(LearningAgent, alpha * 0.2, q_initial)  # create agent
            e.set_primary_agent(a, enforce_deadline=True)  # specify agent to track
            # NOTE: You can set enforce_deadline=False while debugging to allow longer trials

            # Now simulate it
            sim = Simulator(e, update_delay=0, display=False)  # create simulator (uses pygame when display=True, if available)
            # NOTE: To speed up simulation, reduce update_delay and/or set display=False
            sim.run(n_trials=100)  # run for a specified number of trials
            # NOTE: To quit midway, press Esc or close pygame window, or hit Ctrl+C on the command-line

            a.reset()  # log the final trip
            trip_log = pd.DataFrame(a.trip_log)
            trip_log['Efficiency'] = trip_log['Remaining'] / trip_log['Deadline'] * 100
            record.append({
                'Success Rate': trip_log[trip_log.Success == True].shape[0],
                'Alpha': alpha * 0.2,
                'Q Initial': q_initial,
                'Efficiency': trip_log['Efficiency'].mean(),
                'Ave Reward': trip_log['Reward'].mean(),
                'Ave Penalty': trip_log['Penalty'].mean(),
            })
    return pd.DataFrame(record)


record = run()
QUESTION:
Report the different values for the parameters tuned in your basic implementation of Q-Learning. For which set of parameters does the agent perform best? How well does the final driving agent perform?
ANSWER:
I have chosen to implement the 'optimistic' version of the Q-learning algorithm, where the agent begins with sufficiently high Q-values that are pulled down toward their correct values as it collects penalties for wrong actions. This is an elegant solution to this particular problem because the optimistic initial values build exploration directly into the Q-table: every untried action in a state looks attractive until it has actually been tried, so the agent explores systematically without a separate exploration rate (epsilon). And since the update rule above uses only the immediate reward, there is no discount factor (gamma) to tune either.
As such, we only need to tune two parameters: the learning rate alpha and the initial Q-value. We run the simulation for a combination of learning rates and initial Q-values:
The results are as follows:
print record
We note that the initial Q-value is extremely important: the cab performs best with an initial Q-value of 2, reaching success rates close to 100 out of 100 trials. As for the learning rate, values between 0.6 and 1.0 work well.
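To pick out the best combination at a glance, the record can be sorted by success rate. A small sketch (sort_values assumes a reasonably recent pandas version):

print record.sort_values(['Success Rate', 'Efficiency'], ascending=False).head()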
QUESTION:
Does your agent get close to finding an optimal policy, i.e. reach the destination in the minimum possible time, and not incur any penalties? How would you describe an optimal policy for this problem?
ANSWER:
All in all, the cab has a very good success rate of around 99 out of 100 trials. At best, the cab has an efficiency of 60%, meaning it uses up only 40% of the steps allotted to arrive at the destination. The average penalty per trip is around 0.3, which means it still makes a few mistakes along the way, so the policy is not perfect. This is because the cab has not yet encountered all possible states, so it behaves naively when faced with a new one.
An optimal policy, put simply, is: proceed toward the next waypoint when it is safe to do so (no violations), and stay in place otherwise. We could arrive at this policy either by letting the cab 'experience' all 96 x 4 state-action combinations, or by applying some heuristics.
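As a rough illustration of the heuristic route, a hand-coded policy might look like the sketch below. The safety checks paraphrase the US right-of-way rules the environment enforces and may differ from environment.py in detail:

def heuristic_policy(waypoint, inputs):
    """Go toward the next waypoint when it is safe to do so; otherwise stay put."""
    light, oncoming, left = inputs['light'], inputs['oncoming'], inputs['left']
    if waypoint == 'forward':
        return 'forward' if light == 'green' else None
    if waypoint == 'left':
        # turn left only on green, and only if oncoming traffic is not going straight or right
        return 'left' if light == 'green' and oncoming in (None, 'left') else None
    if waypoint == 'right':
        # right turns are allowed on red as long as traffic from the left is not going straight
        return 'right' if light == 'green' or left != 'forward' else None
    return None  # at the destination the planner returns no waypoint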