Estimating Gradients for Discrete Random Variables by Sampling without Replacement
Kool, Wouter and van Hoof, Herke and Welling, Max
International Conference on Learning Representations - 2020 via Local Bibsonomy
Keywords: readings, optimization, sampling


Summary by Gavin Gray 5 years ago

It's a shame that the authors weren't able to continue their series of great paper titles, although it looks like they thought about calling this paper "Put Replacement In Your Basement". Also, although they don't say it in the title or abstract, this paper introduces an estimator the authors call the "unordered set estimator", which, as a name, is not the best. However, this is one of the most exciting estimators for gradients of non-differentiable expectations over structured discrete distributions (that sounds narrow, but it covers a vast number of important problems; see the first paragraph of the paper, and, less important but fun, adversarial examples or GANs on sequences).

For a distribution $p_\theta(s)$, where $s$ takes values in some discrete set of states, this paper is concerned with how to compute the gradient of the expectation of a function $f$ over $p_\theta(s)$:

$$ \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] $$

To jump directly to the definition of the estimator I've got to define the leave-one-out ratio. Let $S^k$ be an unordered sample of $k$ elements drawn from $p$ without replacement; the leave-one-out ratio is $$ R(S^k, s) = \frac{p^{D \backslash \{ s \}} (S^k \backslash \{ s \} )}{p(S^k)}, $$ where $p^{D \backslash \{ s \}}$ is the distribution over such sets when $s$ is removed from the domain (the remaining probabilities renormalised by $1 - p(s)$). For a given element $s \in S^k$, it's the probability of sampling the other elements of $S^k$ from that restricted domain, divided by the probability of sampling $S^k$ itself; equivalently, $p(s) R(S^k, s)$ is the probability that $s$ was the first element sampled, given that the unordered sample came out as $S^k$.
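
To make the definition concrete, here is a minimal sketch (my own illustration, not the authors' code) that computes $p(S^k)$ and $R(S^k, s)$ for a small categorical distribution by brute-force enumeration over the orderings in which the set could have been drawn; the function names are mine.

```python
import itertools
import numpy as np

def prob_unordered_sample(p, S):
    """Probability of the unordered set S when sampling |S| elements without
    replacement from the categorical distribution p (sum over all orderings)."""
    total = 0.0
    for order in itertools.permutations(S):
        prob, remaining_mass = 1.0, 1.0
        for s in order:
            prob *= p[s] / remaining_mass   # renormalise over the not-yet-drawn mass
            remaining_mass -= p[s]
        total += prob
    return total

def leave_one_out_ratio(p, S, s):
    """R(S^k, s): probability of the rest of the set under the domain with s
    removed, divided by the probability of the whole set."""
    p_restricted = p / (1.0 - p[s])          # domain without s, renormalised
    p_restricted[s] = 0.0
    rest = [t for t in S if t != s]
    return prob_unordered_sample(p_restricted, rest) / prob_unordered_sample(p, S)
```

A handy sanity check, which follows from decomposing $p(S^k)$ on which element was drawn first, is that $\sum_{s \in S^k} p(s) R(S^k, s) = 1$.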

Then the unordered set estimator is $$ \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] = \nabla_\theta \mathbb{E}_{S^k} \left[ \sum_{s \in S^k} p_\theta(s) R(S^k, s) f(s) \right] \\ = \mathbb{E}_{S^k} \left[ \sum_{s \in S^k} \nabla_\theta p_\theta (s) R(S^k, s) f(s) \right], $$ where the outer expectation is over the unordered sample $S^k$ drawn from $p_\theta$ without replacement. They proceed to show that it's a Rao-Blackwellization of (and therefore has variance no higher than) the single-sample estimator, importance-weighted estimators and the RB discrete estimator.
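
Reusing the helpers above, here is a sketch of the estimator for a softmax-parameterised categorical (again my own illustration, not the paper's code): each draw of $S^k$ gives the estimate $\sum_{s \in S^k} \nabla_\theta p_\theta(s) R(S^k, s) f(s)$, and averaging many such estimates should recover the exact gradient.

```python
def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def sample_without_replacement(p, k, rng):
    """Draw k distinct categories sequentially, proportional to p,
    renormalising after each draw."""
    p, S = p.copy(), []
    for _ in range(k):
        s = rng.choice(len(p), p=p / p.sum())
        S.append(s)
        p[s] = 0.0
    return S

def unordered_set_estimate(theta, f, k, rng):
    """Single-sample unordered set estimate of d E[f(s)] / d theta."""
    p = softmax(theta)
    S = sample_without_replacement(p, k, rng)
    grad = np.zeros_like(theta)
    for s in S:
        grad_log_p = -p.copy()
        grad_log_p[s] += 1.0                   # d log p_s / d theta for a softmax
        grad += p[s] * grad_log_p * leave_one_out_ratio(p, S, s) * f[s]
    return grad

rng = np.random.default_rng(0)
theta, f = rng.normal(size=5), rng.normal(size=5)
p = softmax(theta)
exact = p * (f - p @ f)                        # closed-form gradient for a softmax
approx = np.mean([unordered_set_estimate(theta, f, 3, rng) for _ in range(20000)], axis=0)
print(exact, approx, sep="\n")                 # should agree to a couple of decimals
```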

The authors also show that they can use the multiple samples to compute a built-in baseline and further reduce variance. If we define the leave-one-out ratio on the domain with one element $s$ already left out as $R^{D \backslash \{ s \} } (S^k, s')$ (the second-order leave-one-out ratio), then: $$ \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] \\ = \mathbb{E}_{S^k} \left[ \sum_{s \in S^k} \nabla_\theta p_\theta (s) R(S^k, s) \left( f(s) - \sum_{s' \in S^k} p_{\theta} (s') R^{D \backslash \{ s \} } (S^k, s') f(s') \right) \right] $$
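
Transcribing the displayed formula directly into the sketch above (so inheriting its conventions; treat this as a hedged illustration rather than the paper's implementation), the second-order ratio can be computed with the same helper by restricting the domain first, and the baseline just replaces $f(s)$ with $f(s)$ minus a weighted sum of the sampled function values, so no extra function evaluations are needed:

```python
def leave_one_out_ratio_restricted(p, S, s, s_prime):
    """Second-order ratio: the leave-one-out ratio of s', computed on the
    domain with s already removed (and S^k with s removed)."""
    p_restricted = p / (1.0 - p[s])
    p_restricted[s] = 0.0
    return leave_one_out_ratio(p_restricted, [t for t in S if t != s], s_prime)

def unordered_set_estimate_with_baseline(theta, f, k, rng):
    """Single-sample estimate with the built-in baseline from the formula above."""
    p = softmax(theta)
    S = sample_without_replacement(p, k, rng)
    grad = np.zeros_like(theta)
    for s in S:
        baseline = sum(p[t] * leave_one_out_ratio_restricted(p, S, s, t) * f[t] for t in S)
        grad_log_p = -p.copy()
        grad_log_p[s] += 1.0
        grad += p[s] * grad_log_p * leave_one_out_ratio(p, S, s) * (f[s] - baseline)
    return grad
```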

Thanks to stochastic beam search these sets can be sampled quickly and easily, even for sequence models, so it all adds up to a powerful generic method. If I need gradients of an expectation over discrete variables in an SGD setting, and I can take multiple samples, this is the most promising method I know about at the moment.
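
Stochastic beam search is the sequence-model generalisation; for a flat categorical like the one in the sketches above it reduces to the Gumbel-top-$k$ trick, which takes only a couple of lines (again just my illustration):

```python
def gumbel_top_k(log_p, k, rng):
    """Sample k categories without replacement in one pass: perturb each
    log-probability with independent Gumbel noise and keep the k largest."""
    g = rng.gumbel(size=log_p.shape)
    return np.argsort(-(log_p + g))[:k]

S = gumbel_top_k(np.log(softmax(theta)), k=3, rng=rng)
```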
