Policies
A policy is a rule an agent uses to decide which action to take. A deterministic policy is typically denoted $μ(s_t)$; when the action is selected stochastically, the policy is written $π(⋅|s_t)$ at timestep $t$.
When the policy is stochastic, the action is sampled from a categorical distribution if the action space is discrete, and from a Gaussian distribution if the action space is continuous.
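As a rough illustration, here is a minimal NumPy sketch of both sampling schemes; the `probs`, `mean`, and `std` values stand in for a policy's outputs and are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete action space: π(·|s) is a probability vector over actions,
# and the action is drawn from a categorical distribution.
probs = np.array([0.1, 0.7, 0.2])     # hypothetical π(a|s) over 3 actions
discrete_action = rng.choice(len(probs), p=probs)

# Continuous action space: the policy outputs a mean and standard deviation
# per action dimension, and the action is drawn from a (diagonal) Gaussian.
mean = np.array([0.5, -1.0])          # hypothetical mean of π(·|s)
std = np.array([0.2, 0.3])            # hypothetical std of π(·|s)
continuous_action = rng.normal(mean, std)

print(discrete_action, continuous_action)
```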
Value Functions
Value functions return the expected return of a given state or state-action pair.
On-Policy Value Function $V^π(s)$: the expected return when starting in state $s$ and always acting according to policy $π$:
$$ V^π(s) = \mathbb E_{τ\sim π} [R(τ)|s_0=s]$$
On-Policy Action-Value Function $Q^π(s,a)$: the expected return when starting in state $s$, taking an action $a$, and then acting according to $π$:
$$ Q^π(s,a) = \mathbb E_{τ\sim π} [R(τ)|s_0=s,a_0=a]$$
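To make the two definitions concrete, below is a minimal Monte Carlo sketch that estimates both by averaging sampled returns. The `env` and `policy` objects are assumed gym-like interfaces (not a specific library), and the `start_state` reset argument and discount factor `gamma` are assumptions added for illustration.

```python
import numpy as np

def rollout_return(env, policy, s, first_action=None, gamma=0.99, horizon=200):
    """Discounted return of one trajectory starting from state s.
    If first_action is given it is taken at t=0 (for Q^π(s, a));
    otherwise every action is sampled from π (for V^π(s))."""
    state = env.reset(start_state=s)   # hypothetical: reset the env to a chosen state
    total, discount = 0.0, 1.0
    for t in range(horizon):
        if t == 0 and first_action is not None:
            action = first_action
        else:
            action = policy(state)     # a ~ π(·|state)
        state, reward, done = env.step(action)
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

def mc_estimate(env, policy, s, a=None, n_rollouts=500):
    """Average over rollouts: approximates V^π(s) when a is None, else Q^π(s, a)."""
    return float(np.mean([rollout_return(env, policy, s, first_action=a)
                          for _ in range(n_rollouts)]))
```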
Advantage Functions
Advantage function $A^π(s,a)$: how much better an action is than acting according to $π$ on average.
$A^π(s,a)$ describes how much better it is to take a specific action $a$ in state $s$ than to select an action at random according to $π(\cdot|s)$. Hence it can be written as
$$ A^π(s,a) = Q^π(s,a) - V^π(s)$$
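For a discrete action space where $Q^π(s,\cdot)$ and $π(\cdot|s)$ are known, $V^π(s)$ is just the $π$-weighted average of the action values, so the advantage can be computed directly. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical values for one state s with 3 discrete actions.
q_sa = np.array([1.0, 2.5, 0.5])   # Q^π(s, a) for each action a
pi_s = np.array([0.2, 0.5, 0.3])   # π(a|s)

# V^π(s) = Σ_a π(a|s) Q^π(s, a)
v_s = np.sum(pi_s * q_sa)          # = 1.6

# A^π(s, a) = Q^π(s, a) - V^π(s)
advantage = q_sa - v_s             # = [-0.6, 0.9, -1.1]

# Sanity check: the expected advantage under π is zero.
print(v_s, advantage, np.sum(pi_s * advantage))
```

Note that the $π$-weighted average of the advantages is zero, which is why a positive advantage singles out actions that are better than the policy's average behavior.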