Data-Efficient Hierarchical Reinforcement Learning
Ofir Nachum and Shixiang Gu and Honglak Lee and Sergey Levine
arXiv e-Print archive - 2018
Keywords: cs.LG, cs.AI, stat.ML

Summary by Felipe Martins 4 years ago

Keypoints

  • Proposes the HIerarchical Reinforcement learning with Off-policy correction (HIRO) algorithm.
    • Does not require careful task-specific design.
    • Generic goal representation to make it broadly applicable, without any manual design of goal spaces, primitives, or controllable dimensions.
  • Reuses off-policy experience via a novel off-policy correction.
  • Uses a two-level hierarchical architecture:
    • A higher-level controller outputs a goal for the lower-level controller every $c$ time steps and collects the rewards given by the environment; the goal is the desired change in state space.
    • The lower-level controller receives the goal as an additional input and acts directly in the environment; its reward is parameterized by the current state and the goal.

Background

This paper adopts a standard continuous-control reinforcement learning setting, in which an agent acts on an environment that returns a next state and a reward from unknown functions. The policies are trained with the TD3 learning algorithm.

General and Efficient Hierarchical Reinforcement Learning

Hierarchy of Two Policies

The higher-level policy $\mu^{hi}$ outputs a goal $g_t$, which corresponds directly to a desired relative change in state that the lower-level policy $\mu^{lo}$ attempts to achieve. $\mu^{hi}$ operates at a coarser time abstraction, updating the goal $g_t$ and collecting the environment rewards $R_t$ every $c$ environment steps; the higher-level transition $(s_{t:t+c-1}, g_{t:t+c-1}, a_{t:t+c-1}, R_{t:t+c-1}, s_{t+c})$ is stored for off-policy training.

The lower-level policy $\mu^{lo}$ takes the current environment observation $s_t$ and the goal $g_t$ as input and outputs an action applied directly to the environment. The goal $g_t$ is provided by $\mu^{hi}$ every $c$ environment steps; for the steps in between, the goal used by $\mu^{lo}$ is given by the goal transition function $g_t = h(s_{t-1}, g_{t-1}, s_t)$, and the lower-level reward is provided by the parameterized reward function $r_t = r(s_t, g_t, a_t, s_{t+1})$. The lower-level transition $(s_t, g_t, a_t, r_t, s_{t+1}, g_{t+1})$ is stored for off-policy training.
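To make the interplay between the two levels concrete, here is a minimal Python sketch of the data-collection loop, assuming a gym-style environment; `mu_hi`, `mu_lo`, `goal_transition`, and `intrinsic_reward` are placeholder callables, not the authors' implementation.

```python
# Hedged sketch of the two-level data-collection loop described above, assuming
# a gym-style `env`; `mu_hi`, `mu_lo`, `goal_transition`, and `intrinsic_reward`
# are placeholder callables.

def collect_episode(env, mu_hi, mu_lo, goal_transition, intrinsic_reward,
                    c=10, max_steps=1000):
    s = env.reset()
    g = mu_hi(s)                          # higher-level goal, resampled every c steps
    hi_buffer, lo_buffer = [], []
    states, goals, actions, rewards = [s], [g], [], []

    for t in range(max_steps):
        a = mu_lo(s, g)                   # lower-level action conditioned on the goal
        s_next, R, done, _ = env.step(a)

        # lower-level transition with the parameterized (intrinsic) reward
        g_next = goal_transition(s, g, s_next)
        r = intrinsic_reward(s, g, a, s_next)
        lo_buffer.append((s, g, a, r, s_next, g_next))

        actions.append(a)
        rewards.append(R)
        s, g = s_next, g_next
        states.append(s)
        goals.append(g)

        if (t + 1) % c == 0 or done:
            # higher-level transition (s_{t:t+c-1}, g_{t:t+c-1}, a_{t:t+c-1}, R_{t:t+c-1}, s_{t+c})
            hi_buffer.append((states[:-1], goals[:-1], actions, rewards, s))
            g = mu_hi(s)                  # new goal from the higher-level policy
            states, goals, actions, rewards = [s], [g], [], []
        if done:
            break
    return hi_buffer, lo_buffer
```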

Parameterized Rewards

The goal $g_t$ indicates a desired relative change in state observations: the lower-level agent's task is to take actions from state $s_t$ that yield an observation $s_{t+c}$ close to $s_t + g_t$. To keep the absolute goal position fixed as the state changes, the goal transition model, used between $\mu^{hi}$ updates every $c$ steps, is defined as:

$h(s_t, g_t, s_{t+1}) = s_t + g_t - s_{t+1}$

The reward given to the lower-level controller is defined to reinforce reaching a state close to the goal $g_t$; the paper parameterizes it by the function $r(s_t, g_t, a_t, s_{t+1}) = -\|s_t + g_t - s_{t+1}\|_2$.
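A minimal sketch of these two functions, assuming states and goals are NumPy vectors of the same dimensionality:

```python
import numpy as np

# Minimal sketch of the goal transition h and the parameterized reward r
# defined above, assuming states and goals are NumPy vectors of equal length.

def goal_transition(s_t, g_t, s_next):
    # keep the absolute target s_t + g_t fixed as the state changes
    return s_t + g_t - s_next

def intrinsic_reward(s_t, g_t, a_t, s_next):
    # negative Euclidean distance between the reached state and the target
    return -np.linalg.norm(s_t + g_t - s_next)
```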

Off-Policy Corrections for Higher-Level Training

The stored higher-level transitions $(s_{t:t+c-1}, g_{t:t+c-1}, a_{t:t+c-1}, R_{t:t+c-1}, s_{t+c})$ have to be converted to state-action-reward transitions $(s_t, g_t, \sum R_{t:t+c-1}, s_{t+c})$ so they can be used in standard off-policy RL algorithms. However, since the lower-level controller keeps evolving, these past transitions do not accurately reflect the actions that the current lower-level policy would take for the same goal, so they must be corrected.

The correction technique used in this paper is to replace the goal $g_t$ of past transitions, collected with an out-of-date lower-level controller, with a relabeled goal $\tilde{g}_t$ that is likely to induce the same lower-level behavior under the updated $\mu^{lo}$. In other words, we want to find a goal $\tilde{g}_t$ that maximizes the probability $\mu^{lo}(a_{t:t+c-1} \mid s_{t:t+c-1}, \tilde{g}_{t:t+c-1})$, where $\mu^{lo}$ is the current policy and the actions $a_{t:t+c-1}$ and states $s_{t:t+c-1}$ come from the stored higher-level transition.

To approximately maximize this quantity in practice, the authors compute the probability for 10 candidate goals $\tilde{g}_t$: eight candidates sampled from a Gaussian centered at $s_{t+c} - s_t$, the original goal $g_t$, and the goal corresponding to the difference $s_{t+c} - s_t$.
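A hedged sketch of this relabeling step, assuming a deterministic `mu_lo` so that the log-probability can be approximated (up to a constant) by the negative squared error between the stored actions and the actions the current policy would take; the candidate spread `sigma` and the helper names are illustrative assumptions:

```python
import numpy as np

# Hedged sketch of the goal relabeling: among 10 candidates, keep the one that
# maximizes a log-probability surrogate for the stored low-level actions under
# the current (deterministic) mu_lo. `states` holds [s_t, ..., s_{t+c}] and
# `actions` holds [a_t, ..., a_{t+c-1}]; `sigma` is an assumed value.

def relabel_goal(mu_lo, goal_transition, states, actions, original_goal,
                 n_random=8, sigma=0.5):
    diff = states[-1] - states[0]                     # s_{t+c} - s_t
    candidates = [original_goal, diff]
    candidates += [np.random.normal(diff, sigma) for _ in range(n_random)]

    best_goal, best_score = original_goal, -np.inf
    for g0 in candidates:
        g, score = g0, 0.0
        for s, a, s_next in zip(states[:-1], actions, states[1:]):
            score -= np.sum((a - mu_lo(s, g)) ** 2)   # log-prob surrogate
            g = goal_transition(s, g, s_next)         # propagate the goal forward
        if score > best_score:
            best_goal, best_score = g0, score
    return best_goal
```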

Experiments

The authors compared the HIRO method to prior methods in 4 different environments:

  • Ant Gather;
  • Ant Maze;
  • Ant Push;
  • Ant Fall.

They also performed an ablative analysis with the following variants:

  • With lower-level re-labeling;
  • With pre-training;
  • No off-policy correction;
  • No HRL.

Closing Points

  • The proposed method is interesting in the hierarchical reinforcement learning setting because it does not need task-specific design; the generic goal representation makes it applicable without manually designing a goal space;
  • The off-policy correction method makes this algorithm sample-efficient;
  • The hierarchical structure with intermediate goals in state space makes it easier to visualize the agent's goals;
  • The paper's Appendix elaborates on possible alternative off-policy corrections.