Can intermediate rewards be used in reinforcement learning?

7

Is it common practice in RL that only one reward is given, at the end of the task? Or is it also possible to introduce intermediate tasks/goals, so that feedback is not delayed, which would require additional reward (functions)?

machine-learning reinforcement-learning

— cái tôi

2

Is it common practice in RL to have only one reward function, given at the end of a task?

This isn't precisely the definition of a reward function. An MDP has a single reward function, $R(s,a,s'): S \times A \times S \to \mathbb{R}$, where $S, A$ are the sets of states and actions in the problem. Sometimes you'll see versions with fewer arguments, say $R(s,a)$ or $R(s)$.

$R$ returns a reward for every state transition. Many of these rewards, or even all but one, can be zero. Alternatively, transitions into intermediate states can carry positive or negative rewards. Both are possible, and the choice depends on the particular application.
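As a minimal sketch of the two cases, assume a hypothetical five-state corridor whose goal is the last state. The names `GOAL`, `sparse_reward`, and `dense_reward` are illustrative and not taken from any library. Both functions share the signature $R(s,a,s')$; they differ only in which transitions get a nonzero value.

```python
# Hypothetical 5-state corridor: states 0..4, goal at state 4.
N_STATES = 5
GOAL = N_STATES - 1

def sparse_reward(s: int, a: int, s_next: int) -> float:
    """Nonzero only on the transition that enters the goal."""
    return 1.0 if s_next == GOAL and s != GOAL else 0.0

def dense_reward(s: int, a: int, s_next: int) -> float:
    """Same goal bonus, plus a small penalty on every other step (an assumed design)."""
    if s_next == GOAL and s != GOAL:
        return 1.0
    return -0.01

# Rewards collected along the trajectory 0 -> 1 -> 2 -> 3 -> 4 (action 1 = "right").
trajectory = [(s, 1, s + 1) for s in range(GOAL)]
sparse_rewards = [sparse_reward(*t) for t in trajectory]
dense_rewards = [dense_reward(*t) for t in trajectory]
```

Here `sparse_rewards` is zero everywhere except the last transition, while `dense_rewards` gives feedback on every step. Both are valid reward functions for the same MDP structure.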

This is the definition you'll find at the start of most reinforcement learning papers, e.g. this one on reward shaping, the related study of how one can alter the reward function without affecting the optimal policy.
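One well-known technique from that area is potential-based shaping (Ng, Harada, and Russell, 1999), which adds $F(s,s') = \gamma\Phi(s') - \Phi(s)$ to the original reward and leaves the optimal policy unchanged. The sketch below assumes a hypothetical corridor with the goal at state 4 and a hand-picked potential `phi`; the environment, the discount, and all function names are assumptions made for illustration.

```python
GAMMA = 0.9
GOAL = 4  # hypothetical goal state in a corridor 0..4

def base_reward(s, a, s_next):
    """Original sparse reward: 1 only when entering the goal."""
    return 1.0 if s_next == GOAL and s != GOAL else 0.0

def phi(s):
    """Assumed potential: negative distance to the goal (zero at the goal)."""
    return -float(abs(GOAL - s))

def shaped_reward(s, a, s_next):
    """Base reward plus the potential-based shaping term gamma*phi(s') - phi(s)."""
    return base_reward(s, a, s_next) + GAMMA * phi(s_next) - phi(s)

step_toward = shaped_reward(1, +1, 2)  # 0 + 0.9*(-2) - (-3), roughly 1.2
step_away = shaped_reward(2, -1, 1)    # 0 + 0.9*(-3) - (-2), roughly -0.7
```

Moving toward the goal now yields immediate positive feedback and moving away yields negative feedback, even though the base reward is zero on both transitions.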

— Sean Easter

I was thinking of Q-learning. Eventually the reward that starts at the transition to target from one-step-away propagates/diffuses along the trajectories toward all viable initial states. It can be thought of as a partial reward. ... I wonder if heterogeneous agents could be contrived in Q-learning, one to learn, and one to more efficiently weight the trajectory to the target.
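A minimal sketch of that propagation, assuming a hypothetical deterministic five-state corridor with a single reward on the transition into the goal. The learning rate of 1, the random behaviour policy, and every name below are assumptions chosen to keep the example small.

```python
import random

N_STATES, GOAL = 5, 4
LEFT, RIGHT = 0, 1
GAMMA, ALPHA = 0.9, 1.0  # ALPHA = 1 is safe only because this toy environment is deterministic

def step(s, a):
    """Move one cell; reward 1 only on the transition into the goal."""
    s_next = min(s + 1, GOAL) if a == RIGHT else max(s - 1, 0)
    return s_next, (1.0 if s_next == GOAL else 0.0)

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(50):
    s = 0
    while s != GOAL:
        a = rng.choice([LEFT, RIGHT])  # random behaviour policy; Q-learning is off-policy
        s_next, r = step(s, a)
        # Q-learning update: bootstrap from the best action in the next state.
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
        s = s_next

right_values = [Q[s][RIGHT] for s in range(GOAL)]
```

In this sketch `right_values` ends up close to `[0.729, 0.81, 0.9, 1.0]`: the value of moving right from a state $k$ steps before the rewarding transition is $\gamma^{k}$, so the single terminal reward is discounted back along the trajectory one state per episode.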

— EngrStudent

0

If you're interested in subtasks, you want to look at options. Aside from options, there is one reward function.
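In the options framework, an option is usually described as a triple: an initiation set, an internal policy, and a termination condition. A rough sketch under assumptions: the `Option` class, `run_option`, and the corridor environment are all hypothetical, and the termination condition is a deterministic stand-in for the termination probability $\beta(s)$.

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation_set: Set[int]           # states where the option may be started
    policy: Callable[[int], int]       # maps a state to a primitive action
    terminates: Callable[[int], bool]  # deterministic stand-in for beta(s)

def run_option(option, s, step, max_steps=100):
    """Follow the option's policy until it terminates; return final state and step count."""
    assert s in option.initiation_set
    steps = 0
    while not option.terminates(s) and steps < max_steps:
        s = step(s, option.policy(s))
        steps += 1
    return s, steps

# Hypothetical subtask in a corridor 0..9: "walk right to the door at state 5".
def corridor_step(s, a):
    return max(0, min(9, s + a))

go_to_door = Option(
    initiation_set=set(range(5)),
    policy=lambda s: +1,
    terminates=lambda s: s == 5,
)
final_state, n_steps = run_option(go_to_door, 2, corridor_step)
```

The agent can then choose among options as if they were extended actions, while the underlying MDP still has its one reward function.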

— Neil G

Options framework by Rich Sutton?

— information_interchange

0

I think the short answer to your question is yes: it appears to be common practice to reward an agent only for full completion of a task. But be careful with your wording, since, as Sean pointed out in his answer, a reward function is defined for all possible combinations of states, actions, and next states.

To add to Sean's answer, consider these snippets taken from Richard Sutton and Andrew Barto's intro book on Reinforcement Learning:

The reward signal is your way of communicating to the [agent] what you want it to achieve, not how you want it achieved (author emphasis).

For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent's pieces or gaining control of the center.

Although it does appear to be the recommended approach in their book, I'm sure you can find others who disagree.

— Mitch

1

I don't understand how, for very large games like Go or games with many moves, it's even possible for the agent to accomplish anything if it only gets rewards at the very end (e.g. in Go it only gets a reward if it wins...). I guess for those games the agent is guaranteed to get a reward eventually because there is a finite number of pieces...

— Pinocchio