Video

Embedded video of our project can be found below:

The video includes a brief problem description using images, screenshots, or screen captures, along with examples of a simple policy (e.g., random) and our best-performing approach. The video is under three minutes, recorded in at least 720p resolution, and contains comprehensible speech where applicable.

Project Summary

Our project focuses on developing an intelligent self-navigation system for DuckieBot within the DuckieTown simulation environment. The goal is to enable the DuckieBot to autonomously navigate through a miniature city-like environment using reinforcement learning (RL) techniques. This requires the robot to recognize lanes, follow them accurately, stop at traffic signals, and avoid obstacles such as walls, trees, and buses.

The primary challenge in this project stems from the complexity of autonomous driving, even in a constrained environment. Unlike traditional rule-based approaches, reinforcement learning offers a scalable solution but requires extensive training, fine-tuning, and robust environment modeling. In our case, the DuckieBot must learn to navigate dynamically without explicit programming for each scenario. To achieve this, we employ RL algorithms such as Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO), training the DuckieBot in a simulated environment before deploying it onto a real physical robot.

A major motivation for using AI/ML techniques is the inherent difficulty in defining precise rules for navigation in diverse and unpredictable conditions. The DuckieBot must recognize lane boundaries while adapting to different road layouts, varying obstacle placements, and dynamically changing environments. It needs to anticipate and react to traffic signals, avoid collisions with moving and stationary objects, and navigate turns effectively. Hand-coding these behaviors for every possible scenario is impractical, making RL a natural choice for training an adaptable and generalizable navigation policy.

Key Objectives:

Our key objectives are (1) accurate lane detection and lane following, (2) avoiding obstacles such as walls, trees, and buses, (3) reacting appropriately to traffic signals, and (4) transferring the trained policy from simulation onto the physical DuckieBot.

With recent progress, we successfully ran the DuckieBot in an empty loop environment, demonstrating basic movement capabilities. The next steps involve refining our policies for better performance in more complex environments.

Approaches

SAC Approach

Our initial exploration included implementing a Soft Actor-Critic (SAC) model to train the DuckieBot. SAC is an off-policy reinforcement learning algorithm that maximizes an entropy-regularized objective, encouraging broad exploration while reusing past experience from a replay buffer.

Despite our efforts in fine-tuning SAC, we observed unstable training and poor lane adherence, leading us to explore alternative approaches.

PPO Approach

Through extensive experimentation, we determined that Proximal Policy Optimization (PPO) was a better fit for our project. PPO’s clipped on-policy updates gave us noticeably more stable training for lane-following and collision avoidance.
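
For reference, here is a minimal sketch of the kind of PPO setup this describes, using stable-baselines3 on a gym-duckietown environment. The environment id, wrapper stack, and hyperparameters are illustrative placeholders rather than our final configuration, and the sketch assumes an SB3 version compatible with the classic gym API that gym-duckietown uses.

```python
import gym
import gym_duckietown  # noqa: F401  (registers the Duckietown-* environment ids)
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

# Illustrative setup only: map, stack size, and hyperparameters are placeholders.
def make_env():
    return gym.make("Duckietown-small_loop-v0")

venv = VecFrameStack(DummyVecEnv([make_env]), n_stack=4)  # stacked frames give the policy motion cues

model = PPO(
    "CnnPolicy",          # CNN policy over raw camera images
    venv,
    n_steps=2048,
    batch_size=256,
    learning_rate=3e-4,
    tensorboard_log="./tb_ppo/",
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
model.save("ppo_duckietown_lane_following")
```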

Custom Reward Function
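
The specifics of our reward shaping are discussed in the evaluation below (penalizing reversing, encouraging lane centering, and clipping extreme penalties). As a hedged sketch of what such a wrapper can look like, assuming a gym-duckietown Simulator underneath that exposes get_lane_pos2, cur_pos, and cur_angle; the weights here are illustrative, not our final values:

```python
import gym
import numpy as np

class ShapedRewardWrapper(gym.Wrapper):
    """Illustrative reward shaping: penalize reversing, encourage lane centering,
    and clip extreme penalties. Weights are placeholders, not our final values."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        sim = self.env.unwrapped  # assumed to be a gym_duckietown Simulator

        # Penalize reversing: with wheel-velocity actions, a negative mean
        # velocity means the robot is driving backwards.
        if float(np.mean(action)) < 0:
            reward -= 1.0

        # Encourage lane centering: lp.dist is ~0 when centered in the lane.
        try:
            lp = sim.get_lane_pos2(sim.cur_pos, sim.cur_angle)
            reward -= 2.0 * abs(lp.dist)
        except Exception:
            pass  # off the drivable area; keep the simulator's own penalty

        # Clip very negative rewards so a single bad step does not dominate.
        reward = max(reward, -10.0)
        return obs, reward, done, info
```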

Process and Iterative Improvements

Initially, we worked individually to set up simulations and train models using both SAC and PPO. However, as our experiments progressed, we observed better results with PPO, leading us to focus on refining that approach.

Key Iterations and Learnings

Our key iterations, detailed in the Evaluation section below, moved from vanilla PPO with Duckietown’s native wrappers, to a frame-stacked PPO model, to alternative reward functions (our clipped reward, then the DDPG reward, then DtReward), and finally to a basic obstacle avoidance reward.

Next Steps

While our model now successfully follows lanes, obstacle avoidance remains a major challenge. Our focus moving forward includes improving the obstacle avoidance reward, further hyperparameter tuning, and deploying the trained model onto the physical DuckieBot for real-world testing.

Evaluation

Quantitative Evaluation

Metrics used for evaluation: mean episode reward (ep_rew_mean) and mean episode length (ep_len_mean), logged to TensorBoard over the course of training.

Performance Charts:

PPO Experimentation - Early Phases: Testing the Waters!

In the beginning, absolutely nothing worked. In our early PPO experiments we trained several models while varying the policy architecture, running the vanilla implementation with Duckietown’s native wrappers, and experimenting with different image sizes. We used a reward wrapper that passed positive rewards through but clipped very negative rewards to a reasonable value to prevent the DuckieBot from becoming “stale,” added an action wrapper to reduce turning speeds, and tested a 64x64 image size instead of the standard 84x84. While these introductory experiments did not produce strong results on their own, they were crucial for exploring our search space and understanding the nuances of Duckietown, and they fed directly into the development of our lane-following model, which we could then train and tune for higher rewards.
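
For concreteness, here is a sketch of the kind of wrappers described above, written against the classic gym API; the clip value, the turn-damping factor, and the resize target are illustrative stand-ins for the values we experimented with.

```python
import gym
import numpy as np
from gym.wrappers import ResizeObservation

class ClipNegativeRewardWrapper(gym.RewardWrapper):
    """Pass positive rewards through, clip very negative rewards to a floor."""
    def reward(self, reward):
        return max(reward, -10.0)  # illustrative clip value

class SlowTurnActionWrapper(gym.ActionWrapper):
    """Damp the difference between wheel velocities to reduce turning speed."""
    def action(self, action):
        action = np.asarray(action, dtype=np.float32)
        mean = action.mean()
        return mean + 0.5 * (action - mean)  # 0.5 is an illustrative damping factor

env = gym.make("Duckietown-small_loop-v0")    # assumes gym-duckietown is installed
env = ResizeObservation(env, shape=(64, 64))  # 64x64 instead of the 84x84 default
env = ClipNegativeRewardWrapper(SlowTurnActionWrapper(env))
```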

PPO Lane Following Model

This model is frame-stacked, uses the specified hyperparameters, and was trained for 6 million timesteps on the small_loop map before transitioning to the more sophisticated loop_empty map. The plot shows a sudden drop in rewards around 6 million timesteps, which occurs because the reward dynamics change when training moves to the new map. After the drop, the model recovers and reaches higher rewards, demonstrating that a policy trained on the small loop can be transferred to the larger, more complex loop_empty map and successfully adapt.
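
The transfer step itself is mechanically simple; below is a hedged sketch of how continued training on a new map can be done with stable-baselines3 (file names, the stack size, and timestep counts are illustrative).

```python
import gym
import gym_duckietown  # noqa: F401
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

# Load the policy trained on small_loop (path is illustrative).
model = PPO.load("ppo_small_loop")

# Build the new environment; the frame-stack size must match the original training.
loop_empty = VecFrameStack(
    DummyVecEnv([lambda: gym.make("Duckietown-loop_empty-v0")]), n_stack=4
)
model.set_env(loop_empty)

# reset_num_timesteps=False keeps the TensorBoard curve continuous, which is why
# the reward dip at the map switch appears around the 6M mark on the same plot.
model.learn(total_timesteps=2_000_000, reset_num_timesteps=False)
model.save("ppo_loop_empty_transfer")
```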

Unsuccessful SAC Attempt

This TensorBoard plot shows two graphs: mean episode length (ep_len_mean) and mean episode reward (ep_rew_mean). It compares two implementations of the SAC model: 1) vanilla SAC and 2) vanilla SAC with rewards clipped to the range (-1, 1). Both models are frame-stacked with a history of 3 frames and use an updated reward function (penalizing reversing and encouraging lane centering). The plots reveal that SAC performed poorly on Duckietown, training unstably and failing to accumulate much reward. Clipping the rewards truncated useful information and diminished their effectiveness, while the unclipped SAC learned slowly. Given these issues, we decided to focus on PPO instead.
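
A hedged sketch of this comparison, assuming stable-baselines3’s SAC and a simple reward-clipping wrapper; the map, buffer size, and timestep counts are illustrative, and in practice the camera frames would also be resized before stacking.

```python
import gym
import numpy as np
import gym_duckietown  # noqa: F401
from stable_baselines3 import SAC
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

class ClipRewardWrapper(gym.RewardWrapper):
    """Clip rewards to (-1, 1) for the second SAC variant."""
    def reward(self, reward):
        return float(np.clip(reward, -1.0, 1.0))

def make_env(clip):
    def _init():
        env = gym.make("Duckietown-loop_empty-v0")
        return ClipRewardWrapper(env) if clip else env
    return _init

for clip in (False, True):
    venv = VecFrameStack(DummyVecEnv([make_env(clip)]), n_stack=3)  # 3-frame history
    model = SAC(
        "CnnPolicy",
        venv,
        buffer_size=20_000,   # kept small here; image observations are memory-heavy
        tensorboard_log="./tb_sac/",
        verbose=1,
    )
    model.learn(total_timesteps=500_000,
                tb_log_name="sac_clipped" if clip else "sac_vanilla")
```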

Qualitative Evaluation

Training Progress and GIFs

Basic PPO Training

Initial experiments using a baseline PPO model.

Framestacked PPO Training

Trained for 3M and 5M timesteps on loop_empty after fixing the camera resolution and adding frame stacking.
3M Training:

5M Training:

PPO with DDPG Reward

Trained for 3M and 5M timesteps on small_loop after integrating the DDPG reward function.
3M Training with DDPG Reward:

5M Training with DDPG Reward:

PPO with DtReward

Further training for 3M and 5M timesteps using the DtReward function for better lane-following stability.
3M Training with DtReward:

5M Training with DtReward:

Latest Update: Obstacle Avoidance

Applied a basic obstacle avoidance reward to improve real-world applicability.
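
As a rough illustration of what such a shaping term can look like, the sketch below assumes the gym-duckietown Simulator exposes a proximity penalty (proximity_penalty2); if the attribute is absent, the wrapper leaves the reward untouched, and the weight is an arbitrary placeholder.

```python
import gym

class ObstaclePenaltyWrapper(gym.Wrapper):
    """Illustrative obstacle-avoidance shaping: subtract a penalty that grows
    as the robot approaches an obstacle. Assumes the underlying Simulator
    provides proximity_penalty2(pos, angle) returning a non-positive value."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        sim = self.env.unwrapped
        if hasattr(sim, "proximity_penalty2"):
            reward += 5.0 * sim.proximity_penalty2(sim.cur_pos, sim.cur_angle)
        return obs, reward, done, info
```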

References

Libraries Used

GitHub Repositories

Papers and Documentation

AI Tool Usage

We used AI tools in the following ways:

Further refinements will involve tuning hyperparameters and deploying the final trained model onto the DuckieBot for real-world testing.