I’m implementing a Reinforcement Learning solution to control a home battery, similar to a Model Predictive Control (MPC) approach. The observation includes the current state of charge (SoC) and N-step forecasts for PV generation, electrical demand, import price, and export price.
In MPC, I calculate an optimal charge/discharge trajectory over the prediction horizon and output the entire plan. Now, I’m trying to implement the same using a DDPG agent in MATLAB.
My questions:
Should the DDPG agent output a scalar action (charging/discharging power) at each timestep, which is then used to update the SoC over one sampling period,
or should it output a full trajectory, of which I execute only the first action and discard the rest, while still using the full trajectory for the reward calculation?
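To make the first option concrete, here is a minimal sketch of the scalar-action loop I have in mind, written in plain Python for brevity (the same logic should map onto a MATLAB step function). Everything named here is an assumption for illustration: the function names, the sign convention (positive power charges the battery), SoC expressed as a fraction of capacity, no conversion losses, and `policy` as a stand-in for the trained actor.

```python
def build_observation(soc, pv, demand, import_price, export_price):
    """Stack the SoC and the four N-step forecast windows into one flat vector."""
    return [soc] + list(pv) + list(demand) + list(import_price) + list(export_price)


def step_soc(soc, power_kw, dt_h, capacity_kwh):
    """Apply one scalar action for one sampling period (assumed: power > 0 charges).

    Returns the new SoC (clipped to [0, 1]) and the power that was actually
    feasible after clipping. Losses are ignored in this sketch.
    """
    soc_next = min(max(soc + power_kw * dt_h / capacity_kwh, 0.0), 1.0)
    applied_kw = (soc_next - soc) * capacity_kwh / dt_h
    return soc_next, applied_kw


def step_cost(applied_kw, pv_kw, demand_kw, import_price, export_price, dt_h):
    """Energy cost of one step: pay for grid import, earn from grid export."""
    grid_kw = demand_kw - pv_kw + applied_kw  # > 0 means importing from the grid
    price = import_price if grid_kw >= 0 else export_price
    return grid_kw * dt_h * price  # negative value means revenue


def rollout(policy, soc0, pv, demand, imp, exp, horizon, dt_h, capacity_kwh):
    """Run the scalar-action loop: one action per step, forecasts slide forward."""
    soc, total_cost = soc0, 0.0
    for t in range(len(pv) - horizon + 1):
        obs = build_observation(
            soc,
            pv[t:t + horizon],
            demand[t:t + horizon],
            imp[t:t + horizon],
            exp[t:t + horizon],
        )
        power_kw = policy(obs)  # hypothetical actor: observation -> scalar power
        soc, applied_kw = step_soc(soc, power_kw, dt_h, capacity_kwh)
        total_cost += step_cost(applied_kw, pv[t], demand[t], imp[t], exp[t], dt_h)
    return soc, total_cost
```

In this variant the per-step reward would simply be the negative of `step_cost`, and the forecast window enters only through the observation, not through the action.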
My thoughts:
In MPC, I get the entire optimal trajectory for charging and discharging over the horizon. Initially, I considered using the same approach with the DDPG agent. However, I’m wondering whether this is necessary, because the value function already accounts for downstream benefits (future prices/loads) since they are included in the state, right? But if the agent returns just one action for the next step, it seems like this would lead to a result similar to what I would get if I had no prediction horizon at all.
Thanks in advance for any suggestions.