
How does PPO+LSTM work? Can anyone explain my confusion?

67 views (last 30 days)
Hello, everyone!
When I read about PPO in the official MATLAB documentation, I found this sentence: “When the agent uses a recurrent neural network, MiniBatchSize is treated as the training trajectory length.”
I'm puzzled: how does PPO+LSTM sample and learn from the current set of experiences?
How should I understand "MiniBatchSize is treated as the training trajectory length"?

Answers (1)

Prasanna on 22 May 2024
Hi,
When training reinforcement learning (RL) agents, MiniBatchSize typically refers to the number of experiences used for one iteration of learning; for an on-policy agent such as PPO these come from the most recently collected set of experiences. In the case of standard (non-recurrent) neural networks, these samples can be selected at random because the network treats each input independently.
When the documentation mentions that "MiniBatchSize is treated as the training trajectory length" for an agent using an LSTM, it implies a shift in how data samples are structured and utilized during training:
  • Trajectory-Based Learning: Instead of learning from randomly sampled individual experiences, the agent learns from sequences (or trajectories) of experiences. Each trajectory consists of states, actions, rewards, and next states that are sequentially connected, reflecting the temporal dependencies inherent in the decision-making process.
  • MiniBatchSize Interpretation: The MiniBatchSize value specifies the length of these trajectories. For example, if MiniBatchSize is set to 256, it means that the LSTM network will be trained on sequences of experiences where each sequence is 256 steps long. This setup allows the LSTM to effectively learn policies that depend on temporal sequences of events.
Therefore, when configuring a PPO agent with an LSTM network in MATLAB using rlPPOAgentOptions, setting the MiniBatchSize appropriately is crucial for effective learning. The choice of MiniBatchSize affects how well the LSTM can learn from the temporal dependencies in the data. Sequences that are too short might not capture enough of the temporal context, while sequences that are too long can be computationally expensive and harder to learn from due to the vanishing-gradient problem common in RNNs.
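For illustration, here is a minimal sketch of configuring a PPO agent with a default LSTM network and a trajectory length of 128 steps. The observation/action specs and the numbers are placeholders, not values from the documentation:

% Hypothetical environment specs -- replace with those of your environment.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);
% Request default actor/critic networks that contain an LSTM layer.
initOpts = rlAgentInitializationOptions(UseRNN=true);
% With a recurrent network, MiniBatchSize is the training trajectory length:
% each learning update uses sequences of 128 consecutive time steps.
agentOpts = rlPPOAgentOptions(MiniBatchSize=128);
agent = rlPPOAgent(obsInfo, actInfo, initOpts, agentOpts);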
For more information, refer to the rlPPOAgentOptions documentation.
Hope this helps.
  1 Comment
Lance about 13 hours ago
Hi,
Here are two methods of creating a PPO agent:
Method 1: rlPPOAgent(Actor, Critic, AgentOpts), in which custom actor and critic networks are defined, in my case with LSTMs (a sketch follows below)
Method 2: rlPPOAgent(ObsInfo, ActInfo, rlAgentInitializationOptions(UseRNN=true), AgentOpts), in which default networks are used (a single LSTM) but RNN is specified as true
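As an illustration only (the specs, layer sizes, and variable names below are placeholders, not taken from the original post), Method 1 with custom LSTM actor/critic networks might look like this:

% Hypothetical specs -- replace with those of your environment.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);
% Critic: value network containing an LSTM layer.
criticNet = [
    sequenceInputLayer(prod(obsInfo.Dimension))
    lstmLayer(64)
    fullyConnectedLayer(1)];
critic = rlValueFunction(dlnetwork(criticNet), obsInfo);
% Actor: discrete stochastic policy network containing an LSTM layer.
actorNet = [
    sequenceInputLayer(prod(obsInfo.Dimension))
    lstmLayer(64)
    fullyConnectedLayer(numel(actInfo.Elements))];
actor = rlDiscreteCategoricalActor(dlnetwork(actorNet), obsInfo, actInfo);
% MiniBatchSize acts as the trajectory length when the networks are recurrent.
agentOpts = rlPPOAgentOptions(MiniBatchSize=128);
agent = rlPPOAgent(actor, critic, agentOpts);   % Method 1: custom networks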
My question is, if Method 1 is utilized, does the MATLAB PPO algorithm recognize the actor/critic networks as being RNNs? This is important because you cannot use rlAgentInitializationOptions in Method 1, so I am wondering whether MiniBatchSize is treated correctly.
Thanks

