Why is `ep_rew_mean` much larger than the reward returned by the `evaluate_policy()` function?


I wrote a custom gym environment and trained an agent on it with the PPO implementation provided by stable-baselines3. The `ep_rew_mean` recorded by TensorBoard is as follows:

[Figure: the `ep_rew_mean` curve over a total of 100 million steps; each episode has 50 steps]
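For reference, my training setup looks roughly like the sketch below (CartPole-v1 is only a stand-in for my custom environment, and the timestep count and paths are placeholders, not my exact values):

```python
from stable_baselines3 import PPO

# CartPole-v1 stands in for my custom gym environment here.
model = PPO("MlpPolicy", "CartPole-v1", verbose=1, tensorboard_log="./ppo_tb/")

# ep_rew_mean appears in TensorBoard under rollout/ep_rew_mean during learning.
model.learn(total_timesteps=1_000_000)  # placeholder; I actually trained for 100 million steps
model.save("ppo_custom_env")
```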

As shown in the figure, the reward converges to around 15.5 after training. However, when I run `evaluate_policy()` on the trained model, the reward is much smaller than the `ep_rew_mean` value. The first value is the mean reward, the second is the standard deviation of the reward:

4.349947246664763 1.1806464511030819

The way I call `evaluate_policy()` is:

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10000)

According to my understanding, the initial state is randomly distributed within an area when the `reset()` function is called, so there should not be an overfitting problem.
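To illustrate what I mean by the randomized reset, here is a hypothetical sketch of such an environment, written against the gymnasium API that recent stable-baselines3 versions use (this is not my actual environment; the state space, reward, and dynamics are made up):

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class RandomStartEnv(gym.Env):
    """Hypothetical example: the initial state is sampled uniformly
    from an area on every reset(), so each episode starts somewhere new."""

    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self.state = None
        self.steps = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random
        # Sample a random initial state inside the area
        self.state = self.np_random.uniform(-1.0, 1.0, size=2).astype(np.float32)
        self.steps = 0
        return self.state, {}

    def step(self, action):
        self.state = np.clip(self.state + 0.1 * action, -1.0, 1.0).astype(np.float32)
        self.steps += 1
        reward = -float(np.linalg.norm(self.state))  # placeholder reward
        terminated = False
        truncated = self.steps >= 50  # each episode has 50 steps
        return self.state, reward, terminated, truncated, {}
```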

I have also tried different learning rates and other parameters, but the problem is not solved.

I have checked my environment, and I think there is no error.

I have searched the internet, read the stable-baselines3 documentation and the GitHub issues, but did not find a solution.


1 Answer

tacon

`evaluate_policy()` has `deterministic` set to `True` by default (https://stable-baselines3.readthedocs.io/en/master/common/evaluation.html).

If you sample from the distribution during training, it may help to evaluate the policy without it selecting actions via an argmax, i.e. by passing `deterministic=False`.
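For example, comparing both settings (a minimal sketch, assuming `model` and `env` are the trained model and environment from the question):

```python
from stable_baselines3.common.evaluation import evaluate_policy

# Greedy actions (mode of the action distribution) -- the default behaviour
mean_det, std_det = evaluate_policy(model, env, n_eval_episodes=1000, deterministic=True)

# Stochastic actions sampled from the policy distribution,
# which matches how the rollouts behind ep_rew_mean were collected
mean_sto, std_sto = evaluate_policy(model, env, n_eval_episodes=1000, deterministic=False)

print(f"deterministic=True : {mean_det:.3f} +/- {std_det:.3f}")
print(f"deterministic=False: {mean_sto:.3f} +/- {std_sto:.3f}")
```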