Video

Project Summary

Project background:

Suppose a Minecraft player wanted to collect items around the map and is aware that there are hostile mobs nearby. This scenario is similar to the setup of PacMan; a player collects items in a maze while evading enemies. Thus, we framed our project around this setup.

Goal:

The goal of our project is to train our agent to play a modified version of Pac-Man on Minecraft. We will develop our AI using the Malmo platform. The agent will be placed in a maze to explore. The pellets from the original game will be substituted with diamonds and the ghosts will be substituted by a Zombie. The agent's score will be based on how many diamonds they collect in a given episode. The agent will "win" if it is able to collect all diamonds.

Of course, the agent needs to do more than collect diamonds. The agent will also have to learn to avoid the zombie as it picks up diamonds. If it gets near a zombie it will receive a penalty score. If it gets touched by the zombie, the agent will "die" and received a larger penalty score. Thus, the agent will have to learn to maneuver itself with this in mind. Because the position of the zombie changes over time and the diamonds disappear when they are collected, ML and/or RL algorithms can help the agent make reasonable actions in an environment with dynamic entities. It will be interesting to see if our agent will develop strategies for collecting diamonds similar to how a human might do.

We will create the environment ourselves and train our agent using two different approaches. We will then evaluate our agent based on several metrics.

Environement Setup

Compared to the previous version of the map in the status report, we made some changes to the environment. Because the agent is enclosed by walls, one behavior it learned was to not move to minimize the amount of negative rewards that it receives from touching the wall. Thus, it learned to stand around instead of exploring the maze. To encourage the agent to explore, we added more walking space for the agent. This also allows it to maneuver around a Zombie if it learns to do so.

Environment

- Enclosed 21 x 21 Map
- 31 Diamonds
- Zombie spawned randomly in one of three locations on the map. 
Below is the layout of Map 1

Rewards

We defined the following rewards:

Positive rewards

Negative Rewards

Approach

Our minecraft agent was developed on Malmo. We trained on two different reinforcement learning algorithms. We wanted to explore on-policy and off-policy algorithms, so we decicded to train one agent using Proximal Policy Optimization (on-policy) and train another using Q-Learning (off-policy).

The main advantage of PPO is that it tends to directly optimize for policy, which might be more stable for our scenario. Q-learning methods learn an action-value function approximation, which might be unstable for our scenario since it involves moving entities and large number of states. However, a disadvantage of PPO is that it may get trapped in local minima/maxima, such as when exploiting vs exploring. On the other hand, Q-Learning needs more data and may take more time to learn in our environment. We expect PPO to perform better in our experiments.

Baseline:

Our baseline is our agent from the status report. Before training, the agent mostly performs random actions, often running into the walls and into the zombie. It is not good at collecting diamonds either.

Approach 1: PPO

One of the algorithms we used is Proximal Policy Optimization or PPO for short. We used the pre-implemented version of the PPO algorithm trainer from RLlib. PPO is a on-policy algorithm, meaning that it explores by sampling actions based on the latest version of its policy. Essentially our agent learns from the actions it took with its current policy and then updates its policy in small batches and in multiple training steps. PPO uses a on-policy update and clips the gradient descent step so learning is improved. Initially the actions the agent will perform will be based on it's initial conditions and training procedure, but should get less random as more training goes on.

PPO uses the update function:

where r(θ) is the ratio of the current policy and the old policy

Observation Space

In our scenario, we used a 3 x 17 x 17 image shape for the observation. We utilized 3 channels: one each for diamond, zombie, and wall blocks.

To preserve spatial information, we defined a custom NN model with three convutional layers.

class MyModel(TorchModelV2, nn.Module):
    def __init__(self, *args, **kwargs):
        TorchModelV2.__init__(self, *args, **kwargs)
        nn.Module.__init__(self)

        self.obs_size = 17

        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1) # 32, self.obs_size, self.obs_size
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1) # 32, self.obs_size, self.obs_size
        self.conv3 = nn.Conv2d(32, 32, kernel_size=3, padding=1) # 32, self.obs_size, self.obs_size

        self.policy_layer = nn.Linear(32*self.obs_size*self.obs_size, 4) # input is flattened, action size 4
        self.value_layer = nn.Linear(32*self.obs_size*self.obs_size, 1)

        self.value = None
    
    def forward(self, input_dict, state, seq_lens):
        x  = input_dict['obs'] # BATCH, 3, self.obs_size, self.obs_size

        x = F.relu(self.conv1(x)) # BATCH, 32, self.obs_size, self.obs_size
        x = F.relu(self.conv2(x))  # BATCH, 32, self.obs_size, self.obs_size
        x = F.relu(self.conv3(x)) # BATCH, 32, self.obs_size, self.obs_size
        
        x = x.flatten(start_dim=1) # Flattened

        policy = self.policy_layer(x) 
        self.value = self.value_layer(x) 

        return policy, state
    
    def value_function(self):
        return self.value.squeeze(1) 

We used discrete actions and defined the action space for PPO as follows:

Action Space

self.action_dict = {
    0: 'move 1', 
    1: 'turn 1',  
    2: 'turn -1', 
    3: 'move 1.5', 
}

Adjusting Observations

Because our agent and the zombie are moving entities, their positions on the grid will change as they move around on the map. Diamonds will also disappear from the map as they are collected by the agent. Thus, we defined our observations with ObservationFromNearbyEntities and ObservationFromGrid. To adjust the observation grid in relation to the agent, we used the following equation.

index = math.floor((self.obs_size**2)/2) + math.floor(X-x) + math.floor(Z-z) * self.obs_size
# Where X and Z are the x,z coordinates of the entity and x and z are the x,z coordinates of the agent

After updating the observation array, the distance between the agent and zombie are checked. If the agent and the zombie are within touching distance or has been attacked by the zombie, the agent is considered to be "dead" and the mission will end.

Approach 2: Q-Learning

Q-Learning is an off-policy algorithm, meaning the updated policy is different from the behavior policy. Unlike an on-policy algorithm, off-policy algorithms learn the value of the optimal policy independently of the agent’s actions. It updates its q-table using the q-value estimate of the next state. We used a simple tabular approach to implement Q-Learning.

Q-Learning uses the following equation to update its table where Q(S, A) is the expected value of performing action a in state s

Updating q-table

We used the following code snippet to update the q-table. This is computed by adding the old q-value with an estimate shown above in the diagram.

if self.training and self.prev_s is not None and self.prev_a is not None:
    old_q = self.q_table[self.prev_s][self.prev_a]
    self.q_table[self.prev_s][self.prev_a] = old_q + self.alpha * (current_r + self.gamma * max(self.q_table[current_s]) - old_q)

Q-table being filled at the start

Q-table after agent explores the map

Evaluation

PPO: Quantitative

We generated three graphs to visualize and analyze the agent’s performance over time. We will evaluate the number of diamonds the agent is able to collect, the total rewards, and the number of steps it takes to reach the solution (collect all diamonds). Overall, all of these metrics should improve with time. One constraint we enforced on our agent is that the maximum number of steps it can perform is 500.

PPO Map 1

Map 1 is basic maze with one path that goes around the maze.

Number of diamonds collected:

At the start of training, the agent mostly performs random actions and collects only a few diamonds. As indicated on the graph, the number of diamonds the agent is able to collect increases over time. As the agent’s score is based off the number of diamonds it collects, the agent should be able to collect more diamonds as more training goes on. The graph is a good indication of improvement as the agent is able to increase the number of diamonds it can collect over time.

Returns:

At the start of training, the agent mostly performs random actions, resulting it in running into walls and the zombie. This resulted in the agent receiving large negative rewards. Over time, the rewards increase as the agent is able to collect more diamonds while avoiding the zombie. As indicated on the graph, the agent progresses from mostly negative rewards to mostly positive rewards. This is a good indication of improvement as the agent should be able to receive higher rewards the longer it stays alive.

Number of steps to find solution:

The following graph shows the number of steps it took to collect all diamonds. The total steps the agent is allowed to perform is 500 steps. Initially, the agent requires a lot of steps to reach the solution, with the maximum being 251 steps. Over time, the agent requires fewer steps to reach the solution, with the minimum number being 58 steps. The average amount of steps the agent performed was 101 steps, which is about 1/5th of the total steps the agent is allowed to perform.

Steps taken to reach solution

PPO Map 2

We also tested our model on a map where the agent has multiple path options available. There are now 38 diamonds for the agent to collect.

Number of diamonds collected:

We tested our model on map 2. Initially, the performance of the agent is unstable as it has not learned to explore the new path added to the map. Over time, the agent’s performance becomes less unstable and is able to collect diamonds more consistently compared to at the start.

Returns:

Likewise, the number of returns increases with time.

Number of steps to find solution:

Similar to the analysis for Map 1, the number of steps the agent requires decreases with time, with the maximum number of steps being 493 and the minimum number of steps being 88. Since there is an extra path the agent has to explore, the agent has to perform more steps compared to Map 1. The total amount of steps allowed is still 500 steps. The average number of steps the agent required to reach the solution was 198, which is about 2/5ths the total allowed steps.

Steps taken to reach solution

PPO: Qualitative

We will evaluate our agent’s performance based on the expected behaviours: Avoiding walls and zombies. The agent should remain adept at collecting diamonds while performing the expected behaviors and should not be negatively impacted by them.

Reviewing the total steps vs returns graph once more, we see a spike in the returns around the 10,000 to 20,000 steps mark. Although the rewards are still negative, this indicates that with time, the agent has learned to avoid walls. Since touching walls results in the agent receiving -10 reward points, it accounts for most of the negative returns initially. After the 30,000 steps mark, the returns are mostly positive, which indicates that the agent has succesfully learned to avoid touching the walls.

When the agent gets touched by a zombie and dies, the current mission will end. The graph indicates that over time, the agent is able to receive higher returns. Higher returns indicate that the agent is able to survive long enough to accumulate more rewards. Thus, our agent accomplishes the expected behaviour.

Below are videos demonstrating our agent performing the expected behavior.

Below is a video of an example run where a solution is found on Map 1

Below is a video of an example run where a solution is found on Map 2

As shown in the demonstrations, the agent is adept at collecting diamonds. When the agent observes a nearby zombie, it will avoid it. This behavior does not negatively impact the agent’s ability to collect diamonds, so our agent also accomplishes this expected behavior.

Evaluation: Q-Learning

Quantitative

Overall, the agent trained with Q-Learning took longer to reach a solution (collect all diamonds). As the number of runs increased, our agent's performance did not show significant improvement. This might be due to the large number of states or the inaccuracies of q-table entries due to the agent dying to a moving zombie.

Qualitative

Compared to PPO, the agent trained with tabular Q-Learning was unable to effectively learn to avoid the zombie while collecting diamonds. Its performance was unstable and performed poorly in our new environment. A major limitation of tabular Q-learning is that it applies only to discrete action and state spaces. Due to the zombie being a moving entity, it made q-value entries involving player death inaccurate and resulted in inefficient learning. Thus, the agent trained under Q-Learning did not accomplish the expected behavior of avoiding zombies.

Nevertheless, the agent was able to collect all the diamonds on the map within 500 steps in a map with no zombie.

Below is a video of an example run where a solution is found

Overall, tabular Q-Learning is limited and could be improved upon in conjunction with other models.

Resources Used