I have a custom environment for Stable Baselines 3 that is a digital twin of a fermentation reaction. The agent observes the enzyme activity, which is the output of the fermentation, and takes a binary action on substrate addition, i.e. whether or not substrate should be added at a given timestep.
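For context, the observation and action spaces look roughly like this (a simplified sketch assuming the Gymnasium API that current SB3 versions expect; the class name, the episode length, and the bounds are illustrative, and the digital-twin dynamics themselves are stubbed out):

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class FermentationEnv(gym.Env):
    """Digital twin of the fermentation reaction (sketch; the twin model itself is stubbed out)."""

    def __init__(self, episode_length=96):
        super().__init__()
        # Binary action: 1 = add substrate at this timestep, 0 = do not add
        self.action_space = spaces.Discrete(2)
        # Observation: current enzyme activity in U/L
        self.observation_space = spaces.Box(
            low=0.0, high=np.finfo(np.float32).max, shape=(1,), dtype=np.float32
        )
        self.episode_length = episode_length  # timesteps per experiment (illustrative value)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.enzyme_activity = 0.0
        self.max_enzyme_activity = 0.0
        return np.array([self.enzyme_activity], dtype=np.float32), {}

    def _simulate_step(self, activity, add_substrate):
        # Placeholder: the real digital-twin fermentation model goes here
        raise NotImplementedError
```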
The reward function looks at the slope of the enzyme activity at each timestep: if the slope is high the agent is rewarded more, and if the slope is low it is rewarded little. The slope never goes below 0. I also add a term at the end of each experiment during training that checks how close the highest enzyme activity got to the target value of 4.5 U/L. The closer the final enzyme activity is to the target at the end of the experiment, the more the agent is rewarded, which is why the difference is inverted. The factors 10000 and 10 are used to scale the values.
# Per-step reward: proportional to the slope of the enzyme activity curve
reward = slope * 10000
# End-of-episode term: gap between the peak enzyme activity and the 4.5 U/L target,
# inverted so that a smaller gap yields a larger bonus
reward_at_end_of_episode = 4.5 - max_enzyme_activity
invert_end_of_episode_reward = (1 / reward_at_end_of_episode) * 10
reward = reward + invert_end_of_episode_reward
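Put together, the reward is applied inside `step()` roughly like this (a simplified sketch continuing the class above; here the slope is taken as the change in enzyme activity between consecutive timesteps, and the twin model is still the stub):

```python
    def step(self, action):
        prev_activity = self.enzyme_activity
        # self._simulate_step stands in for the digital-twin fermentation model
        self.enzyme_activity = self._simulate_step(prev_activity, add_substrate=bool(action))
        self.max_enzyme_activity = max(self.max_enzyme_activity, self.enzyme_activity)
        self.t += 1

        # Per-step reward: slope of the enzyme activity curve (never negative in my model)
        slope = self.enzyme_activity - prev_activity
        reward = slope * 10000

        terminated = self.t >= self.episode_length
        if terminated:
            # End-of-episode bonus: the closer the peak activity got to 4.5 U/L, the bigger the bonus
            reward_at_end_of_episode = 4.5 - self.max_enzyme_activity
            reward += (1 / reward_at_end_of_episode) * 10

        obs = np.array([self.enzyme_activity], dtype=np.float32)
        return obs, reward, terminated, False, {}
```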
I train with PPO; the setup is sketched below, and this is how the agent performed during training.
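The training itself is the standard Stable Baselines 3 PPO setup, roughly like this (sketch; the `total_timesteps` value and the default hyperparameters are illustrative, not necessarily what I used):

```python
from stable_baselines3 import PPO

env = FermentationEnv()
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)  # training budget shown here is illustrative
model.save("ppo_fermentation")
```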

The agent explored a lot of ways of adding substrate that gave enzyme activity higher than expected, but it was not able to converge on any of them. Instead, it converged on a different set of actions (adding substrate every hour) that gave a lower enzyme activity.
I am not able to understand why it behaves like this, and I am not sure how to approach solving it. Is there a problem in my reward function? What changes can I make to debug this issue?