An embedded video of our project can be found below:
The video includes a brief problem description using images, screenshots, or screen captures, along with examples of a simple policy (e.g., random) and our best-performing approach. The video is under three minutes, recorded in at least 720p resolution, and contains comprehensible speech where applicable.
Our project focuses on developing an intelligent self-navigation system for DuckieBot within the DuckieTown simulation environment. The goal is to enable the DuckieBot to autonomously navigate through a miniature city-like environment using reinforcement learning (RL) techniques. This requires the robot to recognize lanes, follow them accurately, stop at traffic signals, and avoid obstacles such as walls, trees, and buses.
The primary challenge in this project stems from the complexity of autonomous driving, even in a constrained environment. Unlike traditional rule-based approaches, reinforcement learning offers a scalable solution but requires extensive training, fine-tuning, and robust environment modeling. In our case, the DuckieBot must learn to navigate dynamically without explicit programming for each scenario. To achieve this, we employ RL algorithms such as Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO), training the DuckieBot in a simulated environment before deploying it onto a real physical robot.
A major motivation for using AI/ML techniques is the inherent difficulty in defining precise rules for navigation in diverse and unpredictable conditions. The DuckieBot must recognize lane boundaries while adapting to different road layouts, varying obstacle placements, and dynamically changing environments. It needs to anticipate and react to traffic signals, avoid collisions with moving and stationary objects, and navigate turns effectively. Hand-coding these behaviors for every possible scenario is impractical, making RL a natural choice for training an adaptable and generalizable navigation policy.
With recent progress, we successfully ran the DuckieBot in an empty loop environment, demonstrating basic movement capabilities. The next steps involve refining policies for better performance in complex environments.
Our initial exploration included implementing a Soft Actor-Critic (SAC) model to train the DuckieBot. SAC is an off-policy reinforcement learning algorithm that uses entropy regularization to balance exploration with stable learning.
We defined the following observation and action spaces for the SAC agent:

Observation space:
- (x, z): the 2D position of the DuckieBot.
- sin(θ), cos(θ): an encoding of the DuckieBot's orientation.
- velocity: the DuckieBot's speed.

Action space:
- Throttle in [-1, 1] (reverse to forward motion).
- Steering in [-1, 1] (right to left turns).

Despite our efforts in fine-tuning SAC, we observed unstable training and poor lane adherence, leading us to explore alternative approaches.
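Before moving on, the sketch below shows how such an SAC setup can be assembled with Stable Baselines 3. The simulator attribute names (cur_pos, cur_angle, speed), the Simulator keyword arguments, and the SAC hyperparameters shown are assumptions for illustration, not our exact training script.

```python
import numpy as np
import gym
from gym import spaces
from stable_baselines3 import SAC
from gym_duckietown.simulator import Simulator  # assumes gym-duckietown is installed

class StateObsWrapper(gym.ObservationWrapper):
    """Expose the low-dimensional state [x, z, sin(theta), cos(theta), velocity]
    instead of raw camera images. The simulator attribute names used here
    (cur_pos, cur_angle, speed) are assumptions for illustration."""

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32)

    def observation(self, obs):
        sim = self.env.unwrapped
        x, _, z = sim.cur_pos                   # 2D position on the ground plane
        theta = sim.cur_angle                   # heading angle
        v = float(getattr(sim, "speed", 0.0))   # forward speed (assumed attribute)
        return np.array([x, z, np.sin(theta), np.cos(theta), v], dtype=np.float32)

# Actions are [throttle, steering], each in [-1, 1], as described above.
env = StateObsWrapper(Simulator(map_name="loop_empty", domain_rand=False))
model = SAC("MlpPolicy", env, learning_rate=3e-4, batch_size=256, gamma=0.99, verbose=1)
model.learn(total_timesteps=1_000_000)
```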
Through extensive experimentation, we determined that Proximal Policy Optimization (PPO) was a better fit for our project. PPO's on-policy learning and clipped policy updates made training noticeably more stable for lane-following and collision avoidance.
Initially, we worked individually to set up simulations and train models using both SAC and PPO. However, as our experiments progressed, we observed better results with PPO, leading us to focus on refining that approach. Our PPO models were trained with Stable Baselines 3 for 5,000,000 timesteps using the following hyperparameters (the sketch below shows how they plug into the training code):

- Learning rate: 3e-4
- Rollout length (n_steps): 1024
- Batch size: 256
- Epochs per update: 10
- Discount factor (gamma): 0.99
- GAE lambda: 0.95
- Clip range: 0.3
- Entropy coefficient: 0.01
- Value function coefficient: 0.5
- Max gradient norm: 0.5
- Input image size: 84×84
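As a rough sketch, this configuration plugs into Stable Baselines 3 as follows. The environment factory make_duckietown_env, the CnnPolicy choice, the frame-stack depth of 4, and the TensorBoard log path are placeholders for illustration rather than our exact training script.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

def make_duckietown_env():
    # Placeholder factory: build and wrap the Duckietown simulator here
    # (84x84 camera frames, scaled actions, shaped rewards, etc.).
    from gym_duckietown.simulator import Simulator
    return Simulator(map_name="loop_empty", domain_rand=False)

venv = DummyVecEnv([make_duckietown_env])
venv = VecFrameStack(venv, n_stack=4)  # frame stacking; the stack depth shown is illustrative

model = PPO(
    "CnnPolicy",
    venv,
    learning_rate=3e-4,
    n_steps=1024,
    batch_size=256,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.3,
    ent_coef=0.01,
    vf_coef=0.5,
    max_grad_norm=0.5,
    tensorboard_log="./ppo_duckietown_tb",
    verbose=1,
)
model.learn(total_timesteps=5_000_000)
model.save("ppo_duckietown")
```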
While our model now successfully follows lanes, obstacle avoidance remains a major challenge and is the main focus of our work moving forward.
Metrics used for evaluation:
- Mean episode reward (ep_rew_mean): measures policy effectiveness. Currently, SAC struggles with lane adherence, leading to lower rewards.
- Mean episode length (ep_len_mean): shorter episodes indicate frequent collisions.

Performance Charts:
In the beginning, absolutely nothing worked. In our early experiments with PPO, we trained several models by adjusting various factors, including the model’s policy architecture, using the vanilla implementation with Duckietown’s native wrappers, and experimenting with different image sizes. We applied a reward model that provided positive rewards while clipping very negative rewards to a reasonable value to prevent the Duckiebot from becoming “stale.” Additionally, we used an action wrapper to reduce turning speeds and tested with a 64x64 image size instead of the standard 84x84. While these experiments were introductory and did not go very far in terms of results, they were crucial for exploring our search space and understanding the nuances of Duckietown. They significantly contributed to the development of our lane-following model, allowing us to better train and optimize it for achieving higher rewards.
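A condensed sketch of the kinds of wrappers described above (reward clipping, reduced turning speed, and image resizing) is shown below. The clipping bound, steering scale, and default image size are illustrative values, not our exact settings.

```python
import cv2
import numpy as np
import gym
from gym import spaces

class ClipRewardWrapper(gym.RewardWrapper):
    """Clip very negative rewards so a single crash does not dominate learning.
    The lower bound of -10 is illustrative."""
    def reward(self, reward):
        return float(np.clip(reward, -10.0, None))

class SteeringScaleWrapper(gym.ActionWrapper):
    """Scale down the steering command to reduce turning speed; 0.5 is illustrative."""
    def action(self, action):
        throttle, steering = action
        return np.array([throttle, 0.5 * steering], dtype=np.float32)

class ResizeWrapper(gym.ObservationWrapper):
    """Downsample camera frames (e.g., to 64x64 or 84x84)."""
    def __init__(self, env, size=(84, 84)):
        super().__init__(env)
        self.size = size
        self.observation_space = spaces.Box(low=0, high=255, shape=(size[0], size[1], 3), dtype=np.uint8)

    def observation(self, obs):
        return cv2.resize(obs, self.size, interpolation=cv2.INTER_AREA)
```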
This model is frame-stacked, uses the hyperparameters listed above, and is trained for 6 million timesteps on the small_loop map in Duckietown before transitioning to the more sophisticated loop_empty map. The results show a sudden drop in rewards around 6 million timesteps, which occurs because the reward dynamics change when the model starts training on the new map. After this drop, the model recovers and reaches higher rewards, demonstrating that a policy trained on the small loop transfers to and successfully adapts to the larger, more complex loop_empty map.
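A sketch of this two-stage transfer using Stable Baselines 3's save/load API is shown below; make_env is a hypothetical helper that builds the wrapped, frame-stacked environment for a given map, and the second-stage timestep budget is illustrative.

```python
from stable_baselines3 import PPO

# Stage 1: train on the simpler map. make_env(map_name) is a hypothetical
# helper that returns the wrapped, frame-stacked Duckietown environment.
model = PPO("CnnPolicy", make_env("small_loop"), verbose=1)
model.learn(total_timesteps=6_000_000)
model.save("ppo_small_loop")

# Stage 2: reload the same policy and continue training on the harder map,
# keeping the timestep counter so the TensorBoard curves stay continuous.
model = PPO.load("ppo_small_loop", env=make_env("loop_empty"))
model.learn(total_timesteps=4_000_000, reset_num_timesteps=False)
model.save("ppo_loop_empty_transfer")
```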
This new TensorBoard plot shows two graphs: one for the mean episode length (ep_len_mean) and one for the mean episode reward (ep_rew_mean). It compares two implementations of the SAC model: 1) vanilla SAC and 2) vanilla SAC with rewards clipped to the range (-1, 1). Both models are frame-stacked with a history of 3 frames and use an updated reward function (penalizing reversing and encouraging lane centering). The plots reveal that SAC performed poorly on Duckietown, showing instability and failing to accumulate enough reward. Clipping the rewards diminished their effectiveness, as it truncated valuable information, while the unclipped SAC learned slowly. Ultimately, due to these issues, we decided to focus on PPO instead.
Initial experiments using a baseline PPO model.


Trained for 3M and 5M timesteps on loop_empty after fixing the camera resolution and adding frame stacking.
3M Training:

5M Training:

Trained for 3M and 5M timesteps on small_loop after integrating the DDPG reward function (an illustrative sketch of this reward shaping follows the plots below).
3M Training with DDPG Reward:

5M Training with DDPG Reward:

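As a rough illustration of what this kind of reward remapping looks like, a wrapper in the style of the reward shaping used by gym-duckietown's DDPG example can be written as below; the thresholds and offsets are illustrative, not the exact values we used.

```python
import gym

class DDPGStyleRewardWrapper(gym.RewardWrapper):
    """Remap the simulator's raw reward: cap the large crash penalty and shift
    positive rewards upward so progress along the lane dominates.
    All constants here are illustrative."""
    def reward(self, reward):
        if reward <= -1000:      # large crash / invalid-pose penalty from the simulator
            return -10.0
        if reward > 0:
            return reward + 10.0
        return reward + 4.0
```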
Further training for 3M and 5M timesteps using the DtReward function for better lane-following stability (an illustrative sketch follows the plots below).
3M Training with DtReward:

5M Training with DtReward:

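For the lane-following shaping, the underlying idea is to reward forward progress along the lane while penalizing lateral offset from the lane center and heading error. The sketch below assumes the simulator exposes get_lane_pos2 with dist, dot_dir, and angle_deg fields plus a speed attribute; the weights are illustrative rather than our exact DtReward settings.

```python
import gym

class LaneFollowingReward(gym.Wrapper):
    """Shaped reward for lane following: reward speed aligned with the lane,
    penalize lateral offset from the lane center and heading misalignment.
    The simulator accessors and weights used here are assumptions."""
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        sim = self.env.unwrapped
        try:
            lane = sim.get_lane_pos2(sim.cur_pos, sim.cur_angle)
            reward = (
                1.0 * sim.speed * lane.dot_dir   # forward progress along the lane
                - 10.0 * abs(lane.dist)          # lateral distance from the lane center
                - 0.1 * abs(lane.angle_deg)      # heading misalignment, in degrees
            )
        except Exception:
            # If the lane position cannot be computed (e.g., off the drivable
            # area), keep the simulator's own reward.
            pass
        return obs, reward, done, info
```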
Applied a basic obstacle-avoidance reward to improve real-world applicability (a sketch follows the plot below).

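A minimal sketch of how such an obstacle-avoidance penalty can be added on top of the base reward is shown below; it assumes the simulator exposes a proximity_penalty2(pos, angle) method that returns a non-positive value near obstacles, and the weight is illustrative.

```python
import gym

class ObstacleAvoidanceReward(gym.Wrapper):
    """Add a penalty for driving close to obstacles. proximity_penalty2 is an
    assumed simulator method (non-positive, larger in magnitude near objects);
    the weight of 20 is illustrative."""
    def __init__(self, env, weight=20.0):
        super().__init__(env)
        self.weight = weight

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        sim = self.env.unwrapped
        penalty_fn = getattr(sim, "proximity_penalty2", None)
        if penalty_fn is not None:
            reward += self.weight * penalty_fn(sim.cur_pos, sim.cur_angle)
        return obs, reward, done, info
```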
We used AI tools in the following ways:
Further refinements will involve tuning hyperparameters and deploying the final trained model onto the DuckieBot for real-world testing.
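As a sketch of that deployment step, the trained policy can be loaded and run greedily as below; make_env is the same hypothetical helper as in the earlier sketches, and on the physical DuckieBot the simulated environment would be replaced by the robot's camera and motor interface.

```python
from stable_baselines3 import PPO

# Load the final trained policy and run it greedily. The environment must be
# wrapped exactly as during training (resizing, frame stacking, etc.).
model = PPO.load("ppo_loop_empty_transfer")
env = make_env("loop_empty")  # hypothetical factory from the earlier sketches

obs = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
```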