
Reinforcement Learning Basics

Policies

A policy is the rule an agent uses to decide what action to take. A deterministic policy is typically denoted $μ$, so that $a_t = μ(s_t)$; when the action is selected stochastically, the policy is instead written $π(\cdot|s_t)$, and the action at timestep $t$ is sampled as $a_t \sim π(\cdot|s_t)$.


When the policy is stochastic, the action is sampled from a categorical distribution if the action space is discrete, and typically from a Gaussian distribution if the action space is continuous.
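
A minimal sketch of both cases, assuming NumPy; the logits, mean, and standard deviation below are hypothetical stand-ins for the outputs of a learned policy network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete action space: sample an action categorically from π(.|s).
logits = np.array([2.0, 0.5, -1.0])            # hypothetical per-action scores for state s
probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> categorical probabilities
discrete_action = rng.choice(len(probs), p=probs)

# Continuous action space: sample from a diagonal Gaussian π(.|s).
mean = np.array([0.3, -0.7])                   # hypothetical policy mean for state s
std = np.array([0.2, 0.5])                     # hypothetical per-dimension std
continuous_action = rng.normal(mean, std)

print(discrete_action, continuous_action)
```

Sampling each continuous dimension independently, as above, corresponds to the common diagonal-Gaussian policy.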


Value Functions

Value functions return the expected return of a specified state or state-action pair, assuming the agent acts according to a particular policy.

  1. On-Policy Value Function $V^π(s)$: the expected return when starting in state $s$ and always acting according to policy $π$,
    $$ V^π(s) = \mathbb E_{τ\sim π} [R(τ)|s_0=s]$$

  2. On-Policy Action-Value Function $Q^π(s,a)$: the expected return when starting in state $s$, taking an arbitrary action $a$, and acting according to $π$ thereafter (a Monte Carlo sketch of both follows the list),
    $$ Q^π(s,a) = \mathbb E_{τ\sim π} [R(τ)|s_0=s,a_0=a]$$
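
As referenced above, a minimal Monte Carlo sketch of both definitions on a toy two-state MDP; the dynamics, horizon, and uniform-random policy are hypothetical choices made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA, HORIZON, N_EPISODES = 0.9, 20, 5000

def step(s, a):
    # Hypothetical dynamics: in state 0, action 1 usually reaches state 1;
    # every transition into state 1 pays reward 1, and state 1 is "sticky".
    if s == 0:
        return (1, 1.0) if (a == 1 and rng.random() < 0.8) else (0, 0.0)
    return (1, 1.0) if rng.random() < 0.9 else (0, 0.0)

def policy(s):
    # Uniform random π(.|s) over the two actions {0, 1}.
    return int(rng.integers(2))

def rollout_return(s, first_action=None):
    # Discounted return R(τ) of one sampled trajectory from s;
    # the first action is fixed only when estimating Q.
    ret = 0.0
    for t in range(HORIZON):
        a = first_action if (t == 0 and first_action is not None) else policy(s)
        s, r = step(s, a)
        ret += GAMMA ** t * r
    return ret

# V(0): average return over trajectories starting in state 0 under π.
v0 = np.mean([rollout_return(0) for _ in range(N_EPISODES)])
# Q(0, a): same, but with the first action fixed to a.
q0 = {a: np.mean([rollout_return(0, a) for _ in range(N_EPISODES)]) for a in (0, 1)}
print(f"V(0) ≈ {v0:.2f}, Q(0,0) ≈ {q0[0]:.2f}, Q(0,1) ≈ {q0[1]:.2f}")
```

Fixing only the first action is exactly what distinguishes $Q^π$ from $V^π$.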

Advantage Functions

Advantage function $A^π(s,a)$: how much better a specific action is than the policy's average action


$A^π(s,a)$ describes how much better it is to take a specific action $a$ in state $s$ than to select an action at random according to $π(\cdot|s)$, assuming $π$ is followed thereafter. Hence it can be written as


$$ A^π(s,a) = Q^π(s,a) - V^π(s)$$
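
A minimal numeric sketch, assuming hypothetical Q-values and policy probabilities for a single state: since $V^π(s) = \mathbb E_{a\sim π}[Q^π(s,a)]$, the advantage is zero in expectation under $π(\cdot|s)$.

```python
import numpy as np

q = np.array([1.0, 3.0, 2.0])   # hypothetical Q(s, a) for actions a = 0, 1, 2
pi = np.array([0.2, 0.5, 0.3])  # hypothetical π(a|s)

v = pi @ q                       # V(s) = Σ_a π(a|s) Q(s, a) = 2.3
advantage = q - v                # A(s, a) = Q(s, a) - V(s), per action
print(v, advantage, pi @ advantage)  # expected advantage under π is 0 (up to float error)
```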

