Estimating Gradients for Discrete Random Variables by Sampling without Replacement
Kool, Wouter and van Hoof, Herke and Welling, Max
International Conference on Learning Representations - 2020 via Local Bibsonomy
Keywords: readings, optimization, sampling


Summary by Gavin Gray 5 years ago

It's a shame that the authors weren't able to continue their series of great paper titles, although it looks like they thought about calling this paper "Put Replacement In Your Basement". Also, although they don't say it in the title or abstract, this paper introduces an estimator the authors call the "unordered set estimator", which, as a name, is not the best. However, this is one of the most exciting estimators for gradients of non-differentiable expectations over structured discrete distributions (that sounds narrow, but it covers a vast number of important problems; see the first paragraph of the paper, and, less important but fun, adversarial examples or GANs on sequences).

For a distribution $p_\theta(s)$, where $s$ takes values in some discrete set of states, this paper is concerned with how to compute the gradient of the expectation of a function $f$ over $p_\theta(s)$:

$$ \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] $$

To jump directly to the definition of the estimator I've got to define the leave-one-out ratio. Let $S^k$ be an unordered sample of $k$ elements drawn from $p$ without replacement; the leave-one-out ratio is $$ R(S^k, s) = \frac{p^{D \backslash \{ s \}} (S^k \backslash \{ s \} )}{p(S^k)}, $$ where $p^{D \backslash \{ s \}}$ is the distribution over such sets when $s$ is removed from the domain (the remaining probabilities renormalised by $1 - p(s)$). For a given element $s \in S^k$, it's the probability of sampling the other elements of $S^k$ from that restricted domain, divided by the probability of sampling $S^k$ itself; equivalently, $p(s) R(S^k, s)$ is the probability that $s$ was the first element sampled, given that the unordered sample came out as $S^k$.
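
To make the definition concrete, here is a minimal sketch (my own illustration, not the authors' code) that computes $p(S^k)$ and $R(S^k, s)$ for a small categorical distribution by brute-force enumeration over the orderings in which the set could have been drawn; the function names are mine.

```python
import itertools
import numpy as np

def prob_unordered_sample(p, S):
    """Probability of the unordered set S when sampling |S| elements without
    replacement from the categorical distribution p (sum over all orderings)."""
    total = 0.0
    for order in itertools.permutations(S):
        prob, remaining_mass = 1.0, 1.0
        for s in order:
            prob *= p[s] / remaining_mass   # renormalise over the not-yet-drawn mass
            remaining_mass -= p[s]
        total += prob
    return total

def leave_one_out_ratio(p, S, s):
    """R(S^k, s): probability of the rest of the set under the domain with s
    removed, divided by the probability of the whole set."""
    p_restricted = p / (1.0 - p[s])          # domain without s, renormalised
    p_restricted[s] = 0.0
    rest = [t for t in S if t != s]
    return prob_unordered_sample(p_restricted, rest) / prob_unordered_sample(p, S)
```

A handy sanity check, which follows from decomposing $p(S^k)$ on which element was drawn first, is that $\sum_{s \in S^k} p(s) R(S^k, s) = 1$.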

Then the unordered set estimator is $$ \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] = \nabla_\theta \mathbb{E}_{S^k} \left[ \sum_{s \in S^k} p_\theta(s) R(S^k, s) f(s) \right] \\ = \mathbb{E}_{S^k} \left[ \sum_{s \in S^k} \nabla_\theta p_\theta (s) R(S^k, s) f(s) \right], $$ where the outer expectation is over the unordered sample $S^k$ drawn from $p_\theta$ without replacement. They proceed to show that it's a Rao-Blackwellization of (and therefore has variance no higher than) the single-sample estimator, importance-weighted estimators and the RB discrete estimator.
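
Reusing the helpers above, here is a sketch of the estimator for a softmax-parameterised categorical (again my own illustration, not the paper's code): each draw of $S^k$ gives the estimate $\sum_{s \in S^k} \nabla_\theta p_\theta(s) R(S^k, s) f(s)$, and averaging many such estimates should recover the exact gradient.

```python
def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def sample_without_replacement(p, k, rng):
    """Draw k distinct categories sequentially, proportional to p,
    renormalising after each draw."""
    p, S = p.copy(), []
    for _ in range(k):
        s = rng.choice(len(p), p=p / p.sum())
        S.append(s)
        p[s] = 0.0
    return S

def unordered_set_estimate(theta, f, k, rng):
    """Single-sample unordered set estimate of d E[f(s)] / d theta."""
    p = softmax(theta)
    S = sample_without_replacement(p, k, rng)
    grad = np.zeros_like(theta)
    for s in S:
        grad_log_p = -p.copy()
        grad_log_p[s] += 1.0                   # d log p_s / d theta for a softmax
        grad += p[s] * grad_log_p * leave_one_out_ratio(p, S, s) * f[s]
    return grad

rng = np.random.default_rng(0)
theta, f = rng.normal(size=5), rng.normal(size=5)
p = softmax(theta)
exact = p * (f - p @ f)                        # closed-form gradient for a softmax
approx = np.mean([unordered_set_estimate(theta, f, 3, rng) for _ in range(20000)], axis=0)
print(exact, approx, sep="\n")                 # should agree to a couple of decimals
```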

The authors also show that they can use the multiple samples to compute a built-in baseline and further reduce variance. If we define the leave-one-out ratio on the domain with one element $s$ already left out as $R^{D \backslash \{ s \} } (S^k, s')$ (the second-order leave-one-out ratio), then: $$ \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] \\ = \mathbb{E}_{S^k} \left[ \sum_{s \in S^k} \nabla_\theta p_\theta (s) R(S^k, s) \left( f(s) - \sum_{s' \in S^k} p_{\theta} (s') R^{D \backslash \{ s \} } (S^k, s') f(s') \right) \right] $$
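
Transcribing the displayed formula directly into the sketch above (so inheriting its conventions; treat this as a hedged illustration rather than the paper's implementation), the second-order ratio can be computed with the same helper by restricting the domain first, and the baseline just replaces $f(s)$ with $f(s)$ minus a weighted sum of the sampled function values, so no extra function evaluations are needed:

```python
def leave_one_out_ratio_restricted(p, S, s, s_prime):
    """Second-order ratio: the leave-one-out ratio of s', computed on the
    domain with s already removed (and S^k with s removed)."""
    p_restricted = p / (1.0 - p[s])
    p_restricted[s] = 0.0
    return leave_one_out_ratio(p_restricted, [t for t in S if t != s], s_prime)

def unordered_set_estimate_with_baseline(theta, f, k, rng):
    """Single-sample estimate with the built-in baseline from the formula above."""
    p = softmax(theta)
    S = sample_without_replacement(p, k, rng)
    grad = np.zeros_like(theta)
    for s in S:
        baseline = sum(p[t] * leave_one_out_ratio_restricted(p, S, s, t) * f[t] for t in S)
        grad_log_p = -p.copy()
        grad_log_p[s] += 1.0
        grad += p[s] * grad_log_p * leave_one_out_ratio(p, S, s) * (f[s] - baseline)
    return grad
```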

Thanks to stochastic beam search these sets can be sampled quickly and easily, even for sequence models, so it all adds up to a powerful generic method. If I need gradients of an expectation over discrete variables in an SGD setting, and I can take multiple samples, this is the most promising method I know about at the moment.
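
Stochastic beam search is the sequence-model generalisation; for a flat categorical like the one in the sketches above it reduces to the Gumbel-top-$k$ trick, which takes only a couple of lines (again just my illustration):

```python
def gumbel_top_k(log_p, k, rng):
    """Sample k categories without replacement in one pass: perturb each
    log-probability with independent Gumbel noise and keep the k largest."""
    g = rng.gumbel(size=log_p.shape)
    return np.argsort(-(log_p + g))[:k]

S = gumbel_top_k(np.log(softmax(theta)), k=3, rng=rng)
```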
