Policies
A policy is a rule an agent uses to decide which action to take. A deterministic policy is typically denoted $μ(s_t)$; when the action is selected stochastically, the policy is written $π(⋅|s_t)$ at timestep $t$.
When the policy is stochastic, the action is sampled from a categorical distribution if the action space is discrete, and from a Gaussian distribution if the action space is continuous.
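As a rough illustration, here is a minimal NumPy sketch of both sampling schemes; the `probs`, `mean`, and `std` values stand in for a policy's outputs and are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete action space: π(·|s) is a probability vector over actions,
# and the action is drawn from a categorical distribution.
probs = np.array([0.1, 0.7, 0.2])     # hypothetical π(a|s) over 3 actions
discrete_action = rng.choice(len(probs), p=probs)

# Continuous action space: the policy outputs a mean and standard deviation
# per action dimension, and the action is drawn from a (diagonal) Gaussian.
mean = np.array([0.5, -1.0])          # hypothetical mean of π(·|s)
std = np.array([0.2, 0.3])            # hypothetical std of π(·|s)
continuous_action = rng.normal(mean, std)

print(discrete_action, continuous_action)
```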
Value Functions
Value functions return the expected return of a given state or state-action pair.
On-Policy Value Function $V^π(s)$: the expected return when starting in state $s$ and always acting according to policy $π$:
$$ V^π(s) = \mathbb E_{τ\sim π} [R(τ)|s_0=s]$$
On-Policy Action-Value Function $Q^π(s,a)$: the expected return when starting in state $s$, taking an action $a$, and then acting according to $π$:
$$ Q^π(s,a) = \mathbb E_{τ\sim π} [R(τ)|s_0=s,a_0=a]$$
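To make the two definitions concrete, below is a minimal Monte Carlo sketch that estimates both by averaging sampled returns. The `env` and `policy` objects are assumed gym-like interfaces (not a specific library), and the `start_state` reset argument and discount factor `gamma` are assumptions added for illustration.

```python
import numpy as np

def rollout_return(env, policy, s, first_action=None, gamma=0.99, horizon=200):
    """Discounted return of one trajectory starting from state s.
    If first_action is given it is taken at t=0 (for Q^π(s, a));
    otherwise every action is sampled from π (for V^π(s))."""
    state = env.reset(start_state=s)   # hypothetical: reset the env to a chosen state
    total, discount = 0.0, 1.0
    for t in range(horizon):
        if t == 0 and first_action is not None:
            action = first_action
        else:
            action = policy(state)     # a ~ π(·|state)
        state, reward, done = env.step(action)
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

def mc_estimate(env, policy, s, a=None, n_rollouts=500):
    """Average over rollouts: approximates V^π(s) when a is None, else Q^π(s, a)."""
    return float(np.mean([rollout_return(env, policy, s, first_action=a)
                          for _ in range(n_rollouts)]))
```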
Advantage Functions
Advantage function $A^π(s,a)$: how much better an action is than acting according to $π$ on average.
$A^π(s,a)$ describes how much better it is to take a specific action $a$ in state $s$ than to select an action at random according to $π(\cdot|s)$. Hence it can be written as
$$ A^π(s,a) = Q^π(s,a) - V^π(s)$$
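For a discrete action space where $Q^π(s,\cdot)$ and $π(\cdot|s)$ are known, $V^π(s)$ is just the $π$-weighted average of the action values, so the advantage can be computed directly. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical values for one state s with 3 discrete actions.
q_sa = np.array([1.0, 2.5, 0.5])   # Q^π(s, a) for each action a
pi_s = np.array([0.2, 0.5, 0.3])   # π(a|s)

# V^π(s) = Σ_a π(a|s) Q^π(s, a)
v_s = np.sum(pi_s * q_sa)          # = 1.6

# A^π(s, a) = Q^π(s, a) - V^π(s)
advantage = q_sa - v_s             # = [-0.6, 0.9, -1.1]

# Sanity check: the expected advantage under π is zero.
print(v_s, advantage, np.sum(pi_s * advantage))
```

Note that the $π$-weighted average of the advantages is zero, which is why a positive advantage singles out actions that are better than the policy's average behavior.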