- Adjust the exploration noise scale so the agent actually samples actions across the entire action range instead of saturating at the bound (see the exploration-noise sketch after this list).
- Check that the reward function does not implicitly favor the maximum action value: if larger actions always yield higher immediate reward with no offsetting cost, the policy will rationally saturate at the bound.
- Reward clipping is sometimes added to TD3 pipelines to keep value targets bounded, although it is not part of the vanilla algorithm. If the clipping bounds truncate the large negative rewards that penalize extreme actions, the agent never feels the downside of those actions and can get stuck at the maximum action value (see the reward-clipping sketch after this list).
- Implement gradient clipping to prevent large updates that can lead to value explosion (see the gradient-clipping sketch after this list).
- Ensure the target network update rate (the soft-update coefficient tau) is small enough to keep the bootstrapped targets stable (see the soft-update sketch after this list).
- Consider using batch normalization and weight regularization to stabilize the learning process (see the normalization and weight-decay sketch after this list).
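
For the exploration-noise point, one thing to check is how the noise scale relates to the action range. Below is a minimal NumPy sketch, assuming a hypothetical `actor` callable that returns an action array and a symmetric bound `max_action`; scaling the noise standard deviation by the action range keeps exploration meaningful across the whole interval rather than only near the bound.

```python
import numpy as np

def select_action(actor, state, max_action, noise_scale=0.1):
    """Deterministic action plus Gaussian exploration noise, clipped to the bounds."""
    action = actor(state)  # hypothetical actor returning an array in [-max_action, max_action]
    # Scale the noise std by the action range so exploration covers the whole interval.
    noise = np.random.normal(0.0, noise_scale * max_action, size=np.shape(action))
    return np.clip(action + noise, -max_action, max_action)
```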
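
For the reward-clipping point, it can help to print a few clipped transitions and see whether the penalties for extreme actions survive the clipping. A toy illustration with hypothetical bounds:

```python
import numpy as np

def clip_reward(reward, low=-1.0, high=1.0):
    """Clip a scalar reward to [low, high] before storing it in the replay buffer."""
    return float(np.clip(reward, low, high))

# With these bounds, a -50.0 penalty for slamming the actuator against its limit
# is stored as -1.0, so the critic can barely distinguish it from a mild cost.
print(clip_reward(-50.0))  # -1.0
```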
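
For gradient clipping, a common PyTorch pattern is to clip the global gradient norm between `backward()` and `optimizer.step()`. A sketch with placeholder names (the network, optimizer, and loss are assumed to come from your training loop):

```python
import torch

def clipped_update(network, optimizer, loss, max_norm=1.0):
    """One optimization step with global-norm gradient clipping."""
    optimizer.zero_grad()
    loss.backward()
    # Rescale gradients whose combined norm exceeds max_norm, so a single
    # bad batch cannot produce a huge update and blow up the Q-values.
    torch.nn.utils.clip_grad_norm_(network.parameters(), max_norm=max_norm)
    optimizer.step()
```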
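
The target network update rate in TD3 is the Polyak coefficient tau; a small value (0.005 is the commonly used default) keeps the bootstrapped Q-targets stable. A minimal sketch of the soft update:

```python
import torch

def soft_update(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float = 0.005):
    """Polyak-average online parameters into the target network."""
    with torch.no_grad():
        for target_param, param in zip(target_net.parameters(), online_net.parameters()):
            # A small tau means the target network moves slowly, so the
            # Q-targets change gradually between updates.
            target_param.mul_(1.0 - tau).add_(param, alpha=tau)
```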
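
Normalization and weight-decay can both be added without touching the TD3 update itself. Below is a sketch of a critic MLP with batch normalization on the hidden layers and L2 regularization applied through the optimizer's `weight_decay`; the state and action sizes are arbitrary example values. Note that batch-norm layers behave differently in train and eval mode, so the target critic should be kept in `eval()` mode when computing targets, and some implementations prefer LayerNorm for that reason.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 17, 6  # arbitrary example dimensions

# Hypothetical critic: batch normalization on the hidden activations,
# L2 weight regularization via the optimizer's weight_decay.
critic = nn.Sequential(
    nn.Linear(state_dim + action_dim, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 1),
)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=3e-4, weight_decay=1e-4)
```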