
How does PPO+LSTM work? Can anyone explain my confusion?

67 views (last 30 days)
Hello, everyone!
When I read about PPO in the official MATLAB documentation, I found this sentence: “When the agent uses a recurrent neural network, MiniBatchSize is treated as the training trajectory length.”
I'm puzzled: how does PPO+LSTM sample and learn from the current set of experiences?
How should I understand "MiniBatchSize is treated as the training trajectory length"?

Answers (1)

Prasanna on 22 May 2024
Hi,
When training reinforcement learning (RL) agents, MiniBatchSize typically refers to the number of experiences used for one iteration of learning; for an on-policy agent such as PPO these come from the most recently collected set of experiences. In the case of standard (non-recurrent) neural networks, these samples can be selected at random because the network treats each input independently.
When the documentation mentions that "MiniBatchSize is treated as the training trajectory length" for an agent using an LSTM, it implies a shift in how data samples are structured and utilized during training:
  • Trajectory-Based Learning: Instead of learning from randomly sampled individual experiences, the agent learns from sequences (or trajectories) of experiences. Each trajectory consists of states, actions, rewards, and next states that are sequentially connected, reflecting the temporal dependencies inherent in the decision-making process.
  • MiniBatchSize Interpretation: The MiniBatchSize value specifies the length of these trajectories. For example, if MiniBatchSize is set to 256, it means that the LSTM network will be trained on sequences of experiences where each sequence is 256 steps long. This setup allows the LSTM to effectively learn policies that depend on temporal sequences of events.
Therefore, when configuring a PPO agent with an LSTM network in MATLAB using rlPPOAgentOptions, setting the MiniBatchSize appropriately is crucial for effective learning. The choice of MiniBatchSize affects how well the LSTM can learn from the temporal dependencies in the data. Sequences that are too short might not capture enough of the temporal context, while sequences that are too long can be computationally expensive and harder to learn from due to the vanishing-gradient problem common in RNNs.
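For illustration, here is a minimal sketch of configuring a PPO agent with a default LSTM network and a trajectory length of 128 steps. The observation/action specs and the numbers are placeholders, not values from the documentation:

% Hypothetical environment specs -- replace with those of your environment.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);
% Request default actor/critic networks that contain an LSTM layer.
initOpts = rlAgentInitializationOptions(UseRNN=true);
% With a recurrent network, MiniBatchSize is the training trajectory length:
% each learning update uses sequences of 128 consecutive time steps.
agentOpts = rlPPOAgentOptions(MiniBatchSize=128);
agent = rlPPOAgent(obsInfo, actInfo, initOpts, agentOpts);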
For more information, refer to the rlPPOAgentOptions documentation.
Hope this helps.
  1 Comment
Lance about 13 hours ago
Hi,
Here are two methods of creating a PPO agent:
Method 1: rlPPOAgent(Actor, Critic, AgentOpts), in which custom actor and critic networks are defined, in my case with LSTMs (a sketch follows below)
Method 2: rlPPOAgent(ObsInfo, ActInfo, rlAgentInitializationOptions(UseRNN=true), AgentOpts), in which default networks are used (a single LSTM) but RNN is specified as true
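As an illustration only (the specs, layer sizes, and variable names below are placeholders, not taken from the original post), Method 1 with custom LSTM actor/critic networks might look like this:

% Hypothetical specs -- replace with those of your environment.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);
% Critic: value network containing an LSTM layer.
criticNet = [
    sequenceInputLayer(prod(obsInfo.Dimension))
    lstmLayer(64)
    fullyConnectedLayer(1)];
critic = rlValueFunction(dlnetwork(criticNet), obsInfo);
% Actor: discrete stochastic policy network containing an LSTM layer.
actorNet = [
    sequenceInputLayer(prod(obsInfo.Dimension))
    lstmLayer(64)
    fullyConnectedLayer(numel(actInfo.Elements))];
actor = rlDiscreteCategoricalActor(dlnetwork(actorNet), obsInfo, actInfo);
% MiniBatchSize acts as the trajectory length when the networks are recurrent.
agentOpts = rlPPOAgentOptions(MiniBatchSize=128);
agent = rlPPOAgent(actor, critic, agentOpts);   % Method 1: custom networks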
My question is, if Method 1 is utilized, does the MATLAB PPO algorithm recognize the actor/critic networks as being RNNs? This is important because you cannot use rlAgentInitializationOptions in Method 1, so I am wondering whether MiniBatchSize is treated correctly.
Thanks

