To begin, your only task is to get the smartcab to move around in the environment. At this point, you will not be concerned with any sort of optimal driving policy. Note that the driving agent is given the following information at each intersection:
The next waypoint location relative to its current location and heading.
The state of the traffic light at the intersection and the presence of oncoming vehicles from other directions.
The current time left from the allotted deadline.
To complete this task, simply have your driving agent choose a random action from the set of possible actions (None, 'forward', 'left', 'right') at each intersection, disregarding the input information above. Set the simulation deadline enforcement, enforce_deadline to False and observe how it performs.
import random
from environment import Agent, Environment
from planner import RoutePlanner
from simulator import Simulator


class LearningAgent(Agent):
    """An agent that learns to drive in the smartcab world."""

    def __init__(self, env):
        super(LearningAgent, self).__init__(env)
        # sets self.env = env, state = None, next_waypoint = None, and a default color
        self.color = 'red'  # override color
        self.planner = RoutePlanner(self.env, self)  # simple route planner to get next_waypoint
        # TODO: Initialize any additional variables here

    def reset(self, destination=None):
        self.planner.route_to(destination)
        # TODO: Prepare for a new trip; reset any variables here, if required

    def update(self, t):
        # Gather inputs
        self.next_waypoint = self.planner.next_waypoint()  # from route planner, also displayed by simulator
        inputs = self.env.sense(self)
        deadline = self.env.get_deadline(self)

        # TODO: Update state

        # TODO: Select action according to your policy
        action = random.choice([None, 'forward', 'left', 'right'])

        # Execute action and get reward
        reward = self.env.act(self, action)

        # TODO: Learn policy based on state, action, reward
        # print "LearningAgent.update(): deadline = {}, inputs = {}, action = {}, reward = {}".format(deadline, inputs, action, reward)  # [debug]


def run():
    """Run the agent for a finite number of trials."""

    # Set up environment and agent
    e = Environment()  # create environment (also adds some dummy traffic)
    a = e.create_agent(LearningAgent)  # create agent
    e.set_primary_agent(a, enforce_deadline=True)  # specify agent to track
    # NOTE: You can set enforce_deadline=False while debugging to allow longer trials

    # Now simulate it
    sim = Simulator(e, update_delay=0, display=False)  # create simulator (uses pygame when display=True, if available)
    # NOTE: To speed up simulation, reduce update_delay and/or set display=False
    sim.run(n_trials=100)  # run for a specified number of trials
    # NOTE: To quit midway, press Esc or close pygame window, or hit Ctrl+C on the command-line


run()
QUESTION:
Observe what you see with the agent's behavior as it takes random actions. Does the smartcab eventually make it to the destination? Are there any other interesting observations to note?
ANSWER:
The smartcab does pretty much what you would expect: it behaves randomly. It would occasionally stop for no reason, run red lights and completely disregard traffic. Sometimes it would do the right thing, and in very rare cases it would make it to the destination.
There are 3 other agents that exist simultaneously in the world. The observed rewards range from negative penalties for traffic violations to positive rewards for correct moves toward the next waypoint.
Now that your driving agent is capable of moving around in the environment, your next task is to identify a set of states that are appropriate for modeling the smartcab and environment. The main source of state variables are the current inputs at the intersection, but not all may require representation. You may choose to explicitly define states, or use some combination of inputs as an implicit state. At each time step, process the inputs and update the agent's current state using the self.state variable. Continue with the simulation deadline enforcement enforce_deadline being set to False, and observe how your driving agent now reports the change in state as the simulation progresses.
class LearningAgent(Agent):
    """An agent that learns to drive in the smartcab world."""

    def __init__(self, env):
        super(LearningAgent, self).__init__(env)  # sets self.env = env, state = None, next_waypoint = None, and a default color
        self.color = 'red'  # override color
        self.planner = RoutePlanner(self.env, self)  # simple route planner to get next_waypoint
        # TODO: Initialize any additional variables here

    def reset(self, destination=None):
        self.planner.route_to(destination)
        # TODO: Prepare for a new trip; reset any variables here, if required

    def update(self, t):
        # Gather inputs
        self.next_waypoint = self.planner.next_waypoint()  # from route planner, also displayed by simulator
        inputs = self.env.sense(self)
        deadline = self.env.get_deadline(self)

        # TODO: Update state
        inputs.pop('right')
        self.state = inputs
        self.state['waypoint'] = self.next_waypoint

        # TODO: Select action according to your policy
        action = random.choice([None, 'forward', 'left', 'right'])

        # Execute action and get reward
        reward = self.env.act(self, action)

        # TODO: Learn policy based on state, action, reward
        # print "LearningAgent.update(): deadline = {}, inputs = {}, action = {}, reward = {}".format(deadline, inputs, action, reward)  # [debug]


run()
QUESTION:
What states have you identified that are appropriate for modeling the smartcab and environment? Why do you believe each of these states to be appropriate for this problem?
OPTIONAL:
How many states in total exist for the smartcab in this environment? Does this number seem reasonable given that the goal of Q-Learning is to learn and make informed decisions about each state? Why or why not?
ANSWER:
The identified state consists of most of the inputs plus the next waypoint. Its parameters and possible values are:
light: 'green' or 'red' (2 values)
oncoming: None, 'forward', 'left' or 'right' (4 values)
left: None, 'forward', 'left' or 'right' (4 values)
waypoint: 'forward', 'left' or 'right' (3 values)
Generally speaking, our problem is to get our smartcab to a specified location in our grid world. Normally, this would be much more complicated. However, pathing has already been solved for us by the planner module, which tells us where the next waypoint is. This also effectively lets us bypass the problem of localization, so we don't have to worry about sensing and ascertaining exactly where our smartcab is. What we do have to worry about are the conditions of its immediate vicinity (the inputs) and where it must go next (the next waypoint). So the inputs plus the waypoint are really all we need to take an action.
We have chosen to exclude two parameters:
Deadline was excluded since we would only need to take it into account if we wanted to control pathing and optimize our route to the destination, but that problem is already taken care of by the planner. We would also like the cab to obey traffic rules even if it were running out of time, making this parameter irrelevant. Moreover, the current range of deadline is 0 to 60; including it would cause the number of states to blow up and scale poorly if we were to consider a larger world.
Right can be safely excluded since, based on the right-of-way rules, traffic from the right need not be considered whether the cab is moving forward, turning left or turning right.
Altogether there are 2 x 4 x 4 x 3 = 96 total states. That is a small number compared to the situations that can occur in a real-world environment. If we were to take into account different vehicles, pedestrians, road signs, weather, road elevation, etc., there would certainly be a lot more than 96 states.
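As a quick sanity check on that count, the state space can be enumerated directly. This is a small sketch using itertools, with the value sets taken from the parameter list above:

from itertools import product

LIGHTS = ['green', 'red']
TRAFFIC = [None, 'forward', 'left', 'right']  # possible values for oncoming and left
WAYPOINTS = ['forward', 'left', 'right']

states = list(product(LIGHTS, TRAFFIC, TRAFFIC, WAYPOINTS))
print len(states)  # 2 * 4 * 4 * 3 = 96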
With your driving agent being capable of interpreting the input information and having a mapping of environmental states, your next task is to implement the Q-Learning algorithm for your driving agent to choose the best action at each time step, based on the Q-values for the current state and action. Each action taken by the smartcab will produce a reward which depends on the state of the environment. The Q-Learning driving agent will need to consider these rewards when updating the Q-values. Once implemented, set the simulation deadline enforcement enforce_deadline to True. Run the simulation and observe how the smartcab moves about the environment in each trial.
import operator

Q_INITIAL = 2
ALPHA = 1


class hashabledict(dict):
    """A dict that hashes by its items, so a state dict can be used as a Q-table key."""
    def __hash__(self):
        return hash(frozenset(self.iteritems()))


class LearningAgent(Agent):
    """An agent that learns to drive in the smartcab world."""

    def __init__(self, env):
        super(LearningAgent, self).__init__(env)  # sets self.env = env, state = None, next_waypoint = None, and a default color
        self.color = 'red'  # override color
        self.planner = RoutePlanner(self.env, self)  # simple route planner to get next_waypoint
        # TODO: Initialize any additional variables here
        self.policy = {}
        self.alpha = ALPHA

    def reset(self, destination=None):
        self.planner.route_to(destination)
        # TODO: Prepare for a new trip; reset any variables here, if required

    def update(self, t):
        # Gather inputs
        self.next_waypoint = self.planner.next_waypoint()  # from route planner, also displayed by simulator
        inputs = self.env.sense(self)
        deadline = self.env.get_deadline(self)

        # TODO: Update state
        inputs.pop('right')
        self.state = inputs
        self.state['waypoint'] = self.next_waypoint

        # TODO: Select action according to your policy
        state_hash = hashabledict(frozenset(self.state.iteritems()))
        q = self.policy.get(state_hash, {
            None: Q_INITIAL,
            'forward': Q_INITIAL,
            'left': Q_INITIAL,
            'right': Q_INITIAL,
        })
        action = max(q.iteritems(), key=operator.itemgetter(1))[0]

        # Execute action and get reward
        reward = self.env.act(self, action)

        # TODO: Learn policy based on state, action, reward
        q[action] = (1 - self.alpha) * q[action] + self.alpha * reward
        self.policy[state_hash] = q


run()
QUESTION:
What changes do you notice in the agent's behavior when compared to the basic driving agent when random actions were always taken? Why is this behavior occurring?
ANSWER:
The smartcab's behavior starts out random but slowly corrects itself once negative rewards are encountered. For example, after it runs a red light and receives a negative reward, it stops doing that and tries something else the next time around. This is because the Q-value of that state-action pair is degraded, making it a less appealing option for the smartcab.
Interestingly, the cab encounters the more common cases first (usually intersections without traffic) and therefore builds up Q-values for those state-action pairs first. It can get confused when it encounters an entirely new state, since all actions in that state start out with the same uniform Q-value. For example, even though it previously ran a red light with no traffic and received a negative reward, the presence of traffic in any direction produces a brand-new state, so it may run the light again and seemingly repeat the same mistake.
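For a concrete feel of the update rule used above, here is a single-step illustration. The -1.0 penalty is just an example value; with ALPHA = 1 the old Q-value is completely overwritten by the latest reward:

Q_INITIAL = 2
ALPHA = 1

# Q-values for one state; every action starts out optimistic
q = {None: Q_INITIAL, 'forward': Q_INITIAL, 'left': Q_INITIAL, 'right': Q_INITIAL}

reward = -1.0  # example penalty, e.g. for running the red light
q['forward'] = (1 - ALPHA) * q['forward'] + ALPHA * reward

print q  # 'forward' is now -1.0, so any not-yet-tried action looks better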
Your final task for this project is to enhance your driving agent so that, after sufficient training, the smartcab is able to reach the destination within the allotted time safely and efficiently. Parameters in the Q-Learning algorithm, such as the learning rate (alpha), the discount factor (gamma) and the exploration rate (epsilon) all contribute to the driving agent’s ability to learn the best action for each state. To improve on the success of your smartcab:
This task is complete once you have arrived at what you determine is the best combination of parameters required for your driving agent to learn successfully.
import pandas as pd


class LearningAgent(Agent):
    """An agent that learns to drive in the smartcab world."""

    def __init__(self, env, alpha, q_initial):
        super(LearningAgent, self).__init__(env)  # sets self.env = env, state = None, next_waypoint = None, and a default color
        self.color = 'red'  # override color
        self.planner = RoutePlanner(self.env, self)  # simple route planner to get next_waypoint
        # TODO: Initialize any additional variables here
        self.policy = {}
        self.trip_log = []
        self.trip = None
        self.alpha = alpha
        self.q_initial = q_initial

    def reset(self, destination=None):
        self.planner.route_to(destination)
        # TODO: Prepare for a new trip; reset any variables here, if required
        if self.trip:
            self.trip_log.append(self.trip)
        self.trip = {}
        self.trip['Deadline'] = self.env.get_deadline(self)
        self.trip['Reward'] = 0
        self.trip['Penalty'] = 0

    def update(self, t):
        # Gather inputs
        self.next_waypoint = self.planner.next_waypoint()  # from route planner, also displayed by simulator
        inputs = self.env.sense(self)
        deadline = self.env.get_deadline(self)

        # TODO: Update state
        inputs.pop('right')
        self.state = inputs
        self.state['waypoint'] = self.next_waypoint

        # TODO: Select action according to your policy
        state_hash = hashabledict(frozenset(self.state.iteritems()))
        q = self.policy.get(state_hash, {
            None: self.q_initial,
            'forward': self.q_initial,
            'left': self.q_initial,
            'right': self.q_initial,
        })
        action = max(q.iteritems(), key=operator.itemgetter(1))[0]

        # Execute action and get reward
        reward = self.env.act(self, action)

        # Update trip stats
        self.trip['Reward'] += reward
        self.trip['Remaining'] = self.env.get_deadline(self)
        self.trip['Success'] = self.planner.next_waypoint() is None
        if reward < 0:
            self.trip['Penalty'] += reward

        # TODO: Learn policy based on state, action, reward
        q[action] = (1 - self.alpha) * q[action] + self.alpha * reward
        self.policy[state_hash] = q
        # print "LearningAgent.update(): deadline = {}, inputs = {}, action = {}, reward = {}".format(deadline, inputs, action, reward)  # [debug]


def run():
    """Run the agent for a finite number of trials."""
    record = []
    for q_initial in [0, 2, 10]:
        for alpha in range(1, 6):
            # Set up environment and agent
            e = Environment()  # create environment (also adds some dummy traffic)
            a = e.create_agent(LearningAgent, alpha * 0.2, q_initial)  # create agent
            e.set_primary_agent(a, enforce_deadline=True)  # specify agent to track
            # NOTE: You can set enforce_deadline=False while debugging to allow longer trials

            # Now simulate it
            sim = Simulator(e, update_delay=0, display=False)  # create simulator (uses pygame when display=True, if available)
            # NOTE: To speed up simulation, reduce update_delay and/or set display=False
            sim.run(n_trials=100)  # run for a specified number of trials
            # NOTE: To quit midway, press Esc or close pygame window, or hit Ctrl+C on the command-line

            a.reset()  # log the final trip
            trip_log = pd.DataFrame(a.trip_log)
            trip_log['Efficiency'] = trip_log['Remaining'] / trip_log['Deadline'] * 100
            record.append({
                'Success Rate': trip_log[trip_log.Success == True].shape[0],
                'Alpha': alpha * 0.2,
                'Q Initial': q_initial,
                'Efficiency': trip_log['Efficiency'].mean(),
                'Ave Reward': trip_log['Reward'].mean(),
                'Ave Penalty': trip_log['Penalty'].mean(),
            })
    return pd.DataFrame(record)


record = run()
QUESTION:
Report the different values for the parameters tuned in your basic implementation of Q-Learning. For which set of parameters does the agent perform best? How well does the final driving agent perform?
ANSWER:
I have chosen to implement the 'optimistic' version of the Q-learning algorithm, where the agent begins with sufficiently high Q-values that are pulled down toward their correct values as it collects penalties for wrong actions. This is an elegant solution to this particular problem because the optimistic initial values build exploration directly into the Q-table: every untried action in a state looks attractive until it has actually been tried, so the agent explores systematically without a separate exploration rate (epsilon). And since the update rule above uses only the immediate reward, there is no discount factor (gamma) to tune either.
As such, we only need to tune two parameters: the learning rate alpha and the initial Q-value. We run the simulation for a combination of learning rates and initial Q-values:
The results are as follows:
print record
We note that the initial Q-value is extremely important: the cab performs best with an initial Q-value of 2, reaching success rates close to 100 out of 100 trials. As for the learning rate, values between 0.6 and 1.0 work well.
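To pick out the best combination at a glance, the record can be sorted by success rate. A small sketch (sort_values assumes a reasonably recent pandas version):

print record.sort_values(['Success Rate', 'Efficiency'], ascending=False).head()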
QUESTION:
Does your agent get close to finding an optimal policy, i.e. reach the destination in the minimum possible time, and not incur any penalties? How would you describe an optimal policy for this problem?
ANSWER:
All in all, the cab has a very good success rate of around 99 out of 100 trials. At best, the cab has an efficiency of 60%, meaning it uses up only 40% of the steps allotted to arrive at the destination. The average penalty per trip is around 0.3, which means it still makes a few mistakes along the way, so the policy is not perfect. This is because the cab has not yet encountered all possible states, so it behaves naively when faced with a new one.
An optimal policy, put simply, is: proceed toward the next waypoint when it is safe to do so (no violations), and stay in place otherwise. We could arrive at this policy either by letting the cab 'experience' all 96 x 4 state-action combinations, or by applying some heuristics.
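As a rough illustration of the heuristic route, a hand-coded policy might look like the sketch below. The safety checks paraphrase the US right-of-way rules the environment enforces and may differ from environment.py in detail:

def heuristic_policy(waypoint, inputs):
    """Go toward the next waypoint when it is safe to do so; otherwise stay put."""
    light, oncoming, left = inputs['light'], inputs['oncoming'], inputs['left']
    if waypoint == 'forward':
        return 'forward' if light == 'green' else None
    if waypoint == 'left':
        # turn left only on green, and only if oncoming traffic is not going straight or right
        return 'left' if light == 'green' and oncoming in (None, 'left') else None
    if waypoint == 'right':
        # right turns are allowed on red as long as traffic from the left is not going straight
        return 'right' if light == 'green' or left != 'forward' else None
    return None  # at the destination the planner returns no waypoint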