[link]
Variational Inference builds around the ELBO (Evidence Lower BOund) -- a lower bound on the marginal log-likelihood of the observed data, $\log p(x) = \log \int p(x, z) dz$, which is typically intractable. The ELBO makes use of an approximate posterior to form a lower bound:

$$ \log p(x) \ge \mathbb{E}_{q(z|x)} \log \frac{p(x, z)}{q(z|x)} $$

# Introduction to Quasi Monte Carlo

It's assumed that both the joint $p(x, z)$ (or, equivalently, the likelihood $p(x|z)$ and the prior $p(z)$) and the approximate posterior $q(z|x)$ are tractable (have closed-form densities and are easy to sample from). Then one can estimate the ELBO via Monte Carlo as

$$ \text{ELBO} \approx \frac{1}{N} \sum_{n=1}^N \log \frac{p(x, z_n)}{q(z_n|x)}, \quad\quad z_n \sim q(z|x) $$

This estimate can be used in stochastic optimization: stochastically maximizing the ELBO either increases the marginal log-likelihood or decreases the gap between the true posterior $p(z|x)$ and the approximate one $q(z|x)$. The efficiency of stochastic optimization depends on the amount of stochasticity: the larger the variance, the harder it is to locate the optimum. It's well known that the variance of a typical Monte Carlo estimate scales as $1/N$ for a sample of size $N$, and hence the typical error of such an "approximation" is of order $1/\sqrt{N}$.

However, there are more efficient schemes for evaluating integrals that have the form of an expectation. To build some intuition, consider

$$ \mathbb{E}_{q(z)} f(z) = \int_\mathcal{Z} f(z) q(z) dz = \int_{[0, 1]^d} f(z(u)) du $$

Here I used the fact that any random variable can be expressed as a deterministic transformation of a uniform r.v. (by applying the inverse CDF), so estimating the expectation via MC essentially means sampling a bunch of uniform r.v.s $u_1, \dots, u_N$ and transforming them into the corresponding $z$s. However, uniformly distributed random variables sometimes clump together and leave some areas uncovered: https://i.imgur.com/fejsl2t.png

Low-discrepancy sequences are designed to cover the unit cube more uniformly, in the sense that the points are unlikely to clump and should not leave "holes" in the landscape, effectively facilitating better exploration. Quasi Monte Carlo then evaluates the integrand at the points of such a sequence, giving a (deterministic!) approximation with an error of order $\tfrac{(\log N)^d}{N}$. If you want some randomness back, there are clever randomization techniques that give you Randomized Quasi Monte Carlo with roughly the same guarantees.
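To make the difference concrete, here is a small toy sketch (my own, not the authors' code): estimating the ELBO of a conjugate Gaussian model from plain i.i.d. uniforms versus scrambled Sobol points via `scipy.stats.qmc`. The model, the variational parameters and the sample sizes are arbitrary choices for illustration.

```python
# Toy comparison: plain Monte Carlo vs. randomized QMC (scrambled Sobol)
# for estimating an ELBO. Not the paper's code; everything here is a toy setup.
import numpy as np
from scipy.stats import norm, qmc

# Model: p(z) = N(0, 1), p(x|z) = N(z, 1), observed x = 0.5.
# Approximate posterior q(z|x) = N(mu, sigma^2) with arbitrary parameters.
x_obs, mu, sigma = 0.5, 0.3, 0.8

def log_joint(z):
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x_obs, z, 1.0)

def log_q(z):
    return norm.logpdf(z, mu, sigma)

def elbo_from_uniforms(u):
    z = mu + sigma * norm.ppf(u)     # inverse-CDF transform: u -> z ~ q(z|x)
    return np.mean(log_joint(z) - log_q(z))

n, repeats = 2 ** 7, 200
rng = np.random.default_rng(0)

# Plain Monte Carlo: i.i.d. uniforms.
mc = [elbo_from_uniforms(rng.random(n)) for _ in range(repeats)]

# Randomized QMC: scrambled Sobol points, re-randomized on every repeat.
rqmc = [elbo_from_uniforms(qmc.Sobol(d=1, scramble=True, seed=s).random(n).ravel())
        for s in range(repeats)]

print("MC   std:", np.std(mc))
print("RQMC std:", np.std(rqmc))
```

For a smooth one-dimensional integrand like this, the spread of the RQMC estimates across repetitions should be much smaller than that of plain MC -- exactly the kind of reduced stochasticity the paper exploits during optimization.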
# RQMC applied to VI

The authors estimate the ELBO using samples obtained from Randomized QMC (a scrambled Sobol sequence, in particular), and show experimentally that this leads to lower gradient variance and faster convergence.

# Theoretical Properties

The authors also analyse Stochastic Gradient Descent with RQMC and prove several convergence theorems. To the best of my knowledge, this is the first work considering stochastic optimization using QMC (which is understandable, given that one needs to be able to control the gradients to do so).

# Critique

The paper was a great read and spurred great interest in me. I find the idea of using QMC very intriguing; however, in my opinion there are several problems on the road to mass adoption:

1. The authors use RQMC to get the stochastic nature of $z_n$; however, that essentially changes the effective distribution of the generated $z$s, which should be accounted for in the ELBO. Otherwise the objective they're maximizing is not an ELBO (except, perhaps, asymptotically) and hence not necessarily a lower bound on the marginal log-likelihood. However, finding the correct proposal density $q(z|x)$ (and successfully using it) does not seem easy, as most randomization schemes give you a degenerate support, on which the KL is not well-defined.
2. The authors have an experiment on a Bayesian Neural Network, but a very small one; there are reasons to doubt the results will translate to real-scale models, as the positive effect of QMC vanishes as the dimension grows (because it becomes harder for uniform samples to clump together).
3. Standard control variates might no longer reduce the variance; further research is needed.
[link]
A central question of this paper is: under what circumstances will agents that have been trained to optimize their own reward implement strategies - like tit-for-tat - that are more sophisticated, and yield higher overall reward, than each agent simply pursuing its dominant strategy? The games under consideration here are “general sum” games like the Iterated Prisoner’s Dilemma, where each agent’s dominant strategy is to defect, but where, with some amount of coordination or reciprocity, better overall outcomes are possible. Previously, models have achieved this via explicit hardcoding, but this paper strove to use a simpler, more general approach: allowing each agent A to optimize its reward not simply with regard to a fixed opponent, but with regard to an opponent that will make a predictable update in response to the action A is about to take.

Specifically, this model - shorthanded as LOLA, Learning with Opponent-Learning Awareness - maximizes a given agent’s expected discounted reward, but looks at that reward *conditional on* the way the opponent will update in response to a given action. In a simplified world where the explicit reward function is known, it’s possible to literally take the derivative through the opponent’s expected update step, taking into account the ways your expected reward is changed by the response you expect from your opponent (a toy sketch of this is given below). Outside of this simplified framework, in the world of policy gradients, there’s no analytic loss function; you can no longer directly differentiate your reward function with respect to your opponent’s actions, but you can differentiate your expected-reward estimator with respect to them. This is quite similar to a 2016 paper, Metz et al., which used the same idea to train a more effective GAN by allowing each network in the adversarial pair to “look ahead” to its opponent’s expected response, as a way to avoid getting stuck in repetitive action/response cycles. In circumstances where the parameters of the opponent are not known - obviously closer to realistic for an adversarial scenario - the paper demonstrates a proof-of-concept ability to model an opponent’s strategy based on their past actions, and to use that to conduct response-step estimates. https://i.imgur.com/5xddJRj.png

It should of course be said in all this: even though this setup did produce results closer to what we would expect from rational reciprocity, it’s still very simplified. In most of the experiments, each agent had perfect knowledge of the opponent’s priorities and likely responses; in most game-theory scenarios, constructing a model of your opponent is a nontrivial part of the difficulty. Nonetheless, I found it an interesting proof of concept.
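Below is a minimal PyTorch sketch (my own toy, not the authors' code) of the core mechanic: differentiating your own value through the opponent's anticipated naive gradient step. To keep the autograd bookkeeping visible it uses the one-shot Prisoner's Dilemma with scalar policy parameters; the paper's actual results (e.g. the emergence of tit-for-tat) come from memory-one policies on the iterated game, with policy-gradient estimators when exact values aren't available.

```python
# Minimal LOLA-style update on a one-shot Prisoner's Dilemma (toy illustration).
import torch

# Payoff matrix for (cooperate, defect) x (cooperate, defect), row player's view.
R1 = torch.tensor([[-1., -3.], [0., -2.]])   # agent 1's payoffs
R2 = R1.t()                                  # symmetric game: agent 2's payoffs

def values(theta1, theta2):
    p1 = torch.sigmoid(theta1)               # prob. that agent 1 cooperates
    p2 = torch.sigmoid(theta2)
    d1 = torch.stack([p1, 1 - p1])
    d2 = torch.stack([p2, 1 - p2])
    return d1 @ R1 @ d2, d1 @ R2 @ d2        # expected payoffs V1, V2

theta1 = torch.tensor(0., requires_grad=True)
theta2 = torch.tensor(0., requires_grad=True)
lr, opp_lr = 0.1, 1.0

for step in range(200):
    _, v2 = values(theta1, theta2)
    # Anticipate the opponent's naive gradient-ascent step on *their* value
    # (create_graph=True keeps this step differentiable w.r.t. theta1) ...
    grad2 = torch.autograd.grad(v2, theta2, create_graph=True)[0]
    theta2_look_ahead = theta2 + opp_lr * grad2
    # ... and ascend our own value evaluated *after* that anticipated update.
    v1_look_ahead, _ = values(theta1, theta2_look_ahead)
    grad1 = torch.autograd.grad(v1_look_ahead, theta1)[0]
    with torch.no_grad():
        theta1 += lr * grad1
    # Agent 2 here just takes a naive step on its own value (LOLA-vs-LOLA,
    # which the paper also studies, would mirror the look-ahead for agent 2).
    _, v2_now = values(theta1, theta2)
    grad2_naive = torch.autograd.grad(v2_now, theta2)[0]
    with torch.no_grad():
        theta2 += opp_lr * grad2_naive
```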
[link]
The overall goal of the paper is to measure how similar different layers’ activation profiles are to one another, in hopes of being able to quantify the similarity of the representations that different layers are learning. If you had a measure that captured this, you could ask questions like: “how similar are the representations learned by different networks on the same task?”, and “what is the dynamic of representational change in a given layer throughout training?”

Canonical Correlation Analysis (CCA) is one way of approaching this question, and the way taken by this paper. The premise of CCA is that you have two multidimensional variable sets, where each set is made up of vectors representing dimensions within that variable set. Concretely, in this paper, the sets under examination are the activation profiles of two layers (either the same layer at different points in training, or different layers in the same network, or layers in different networks). An activation profile is thought of in terms of multiple vectors, where each vector represents a given neuron’s activation value, evaluated over some observation set X. Importantly, for the two layers that you’re comparing, the set of observations X needs to be of the same length, but the layers can have different numbers of neurons (and, consequently, different numbers of vectors making up that layer’s multivariate set).

Given this setup, the goal of CCA is to find vectors that are linear combinations of the basis vectors of each set, chosen to satisfy some constraint. In that broad sense, this is similar to the project of PCA, which also constructs linear-combination principal components to better represent the underlying data space. However, in PCA, the constraints that define these combinations are based on one multidimensional feature space, not two. In CCA, instead of generating k principal components, you generate k *pairs* of canonical correlates. Each canonical correlate pair, (U1, V1), is a linear combination of the activation vectors of sets L1 and L2 respectively, and is chosen with the goal of minimizing the angle (cosine) distance between the correlates in each pair. If you think of L1 and L2 as each only having two activations (that is: as being two-dimensional spaces), then the goal of CCA is to find the cosine distance between the planes defined by the two activation spaces.

An important intuition here is that, in this framing, vector sets that are just linear transformations of one another (scalings, rotations, swaps in the arbitrary order of activations) will look the same, which wouldn’t be the case if you just looked at raw correlations between the individual activations. This is connected to the linear-algebra idea that, if you have two vectors, and a third that is just a linear combination of the first two, the span of those vectors is still just that two-dimensional space. This property is important for the analysis of neural network representations because it means the measure will be able to capture similarities between representational spaces that have fundamental geometric similarities, even if they differ on a more surface level.

In prior papers, CCA had been used by calculating the CCA vectors between varying sets of layers, and then taking the mean CCA value over all of the pairs of vectors. This paper argues against that approach, on the theory that network layers are probably not using the full representational capacity of their activation dimensions (think, as an analogy: a matrix with three columns that only actually spans two), and so including very low-order correlations in your average mostly adds uninformative noise to your similarity measure. Instead, this paper weights the correlation coefficients according to the magnitudes of the correlate vectors in the pair; as best I can tell, this is roughly analogous to weighting according to eigenvalues, in a PCA setting.
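As a concrete illustration, here is a small numpy sketch (mine, not the paper’s code) of computing canonical correlations between two activation matrices via QR + SVD, and of replacing the plain mean with a weighted mean. The weighting below is only a stand-in for the idea of down-weighting low-signal directions; the paper’s exact projection-based weighting differs in its details.

```python
# Toy CCA between two "layers" (activation matrices), plus a weighted mean of
# the canonical correlations. Illustration only; the weights are a stand-in.
import numpy as np

def cca(X, Y):
    """X: (n_examples, n_neurons_1), Y: (n_examples, n_neurons_2)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # Singular values of Qx^T Qy are the canonical correlations.
    U, rho, _ = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    return rho, Qx @ U        # correlations and X's canonical variates (unit norm)

rng = np.random.default_rng(0)
n, d1, d2 = 1000, 64, 48
shared = rng.normal(size=(n, 16))                              # shared "signal"
L1 = shared @ rng.normal(size=(16, d1)) + 0.1 * rng.normal(size=(n, d1))
L2 = shared @ rng.normal(size=(16, d2)) + 0.1 * rng.normal(size=(n, d2))

rho, H = cca(L1, L2)
plain_mean = rho.mean()
# Stand-in weighting: weight each correlation by how strongly the corresponding
# canonical variate projects onto L1's (centered) neuron activations.
w = np.abs(H.T @ (L1 - L1.mean(axis=0))).sum(axis=1)
weighted_mean = (w / w.sum()) @ rho
print(plain_mean, weighted_mean)
```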
Using this weighted-average similarity measure, the authors do some really interesting investigations into learning dynamics. These include:

* Comparing the intermediate-layer representations learned by networks that achieve low train error via memorization vs. via actually-generalizing solutions, and showing that, during training, the intermediate representations of generalizing networks are more similar to one another than those of memorizing networks. Intuitively, this aligns with the idea that there are many ways to noisily memorize, but a more constrained number of ways to actually learn meaningful information about a dataset. A super interesting implication of this is the idea that representational similarity *on the training set*, across multiple bootstrapped or randomized trainings, could be used as a proxy for test-set performance, which could be particularly valuable in contexts where test data is limited. https://i.imgur.com/JwyHFmN.png
* Across networks, lower layers tend to be more similar to one another than layers closer to the output; said another way, the very simple representations (e.g. edge detectors) tend to be quite similar across networks, but the higher-level representations are more divergent and more influenced by smaller quirks of the training set.
* Within a given dataset, you can cluster learned internal representations across many training runs and recover groups trained with the same learning rate, even though the final-layer softmax is inherently similar across models that achieve the same training error. This implies that metrics like this can give us some idea of the different minima the optimization algorithm finds, as a function of different learning rates.

Overall, I found this paper a great example of a straightforward idea used to clearly answer important and interesting questions, which is always refreshing amidst a sea of “tiny hack for an extra 0.05 accuracy”.
[link]
A Bayesian best expresses beliefs about the next observation $x_{n+1}$, after observing $x_1, \dots, x_n$, using the **posterior predictive distribution**: $p(x_{n+1}\vert x_1, \dots, x_n)$. Typically one invokes the de Finetti theorem and assumes there exists an underlying model $p(x\vert\theta)$, hence $p(x_{n+1}\vert x_1, \dots, x_n) = \int p(x_{n+1} \vert \theta) p(\theta \vert x_1, \dots, x_n) d\theta$; however, this integral is far from tractable in most cases. Nevertheless, having a tractable posterior predictive is useful in cases like few-shot generative learning, where we only observe a few instances of a given class and are asked to produce more of it.

In this paper the authors take a slightly different approach and build a neural model with a tractable posterior predictive distribution $p(x_{n+1} | x_1, \dots, x_n)$ suited for complex objects like images. To do so, the authors take a simple model with a tractable posterior predictive $p(z_{n+1} | z_1, \dots, z_n)$ (like a Gaussian Process, but not quite) and use it to model a latent code, which is obtained from observations using an analytically invertible encoder $f$. This setup lets you take a complex $x$ like an image and run it through $f$ to obtain $z = f(x)$ -- a simplified latent representation for which it's easier to build a joint density, and hence easier to model the posterior predictive. By feeding the latent representations of $x_1, \dots, x_n$ (namely, $z_1, \dots, z_n$) to the posterior predictive $p(z_{n+1} | f(x_1), \dots, f(x_n))$ we obtain a distribution over latent representations coherent with those of the already observed $x$s. By sampling a $z$ from this distribution and running it through $f^{-1}$ we recover an object in the observation space, $x_\text{pred} = f^{-1}(z)$ -- a sample coherent with the previous observations.

The important choices are:
* Model for the latent representations $z$: one could use a Gaussian Process, however the authors claim it lacks some helpful properties and go for the more general [Student-T Process](http://www.shortscience.org/paper?bibtexKey=journals/corr/1402.4306). The authors then assume that each component of $z$ is a univariate sample from this process (and hence is independent of the other components).
* Encoder $f$: it has to be easily invertible and have an easy-to-evaluate Jacobian (the determinant of the Jacobian matrix). The former is needed to decode predictions from the latent representation space, and the latter is used to efficiently compute the density of observations $p(x_1, \dots, x_n)$ using the standard change of variables formula $$p(x_1, \dots, x_n) = p(z_1, \dots, z_n) \left\vert\text{det} \frac{\partial f(x)}{\partial x} \right\vert$$ The architecture of choice for this task is [RealNVP](http://www.shortscience.org/paper?bibtexKey=journals/corr/1605.08803).

Done this way, it's possible to write out the marginal density $p(x_1, \dots, x_n)$ of all the observed $x$s and maximize it (as in Maximum Likelihood Estimation).
The authors choose to factor the joint density in an autoregressive fashion (via the chain rule)

$$p(x_1, \dots, x_n) = p(x_1) p(x_2 \vert x_1) p(x_3 \vert x_1, x_2) \dots p(x_n \vert x_1, \dots, x_{n-1})$$

with all the conditional marginals $p(x_i \vert x_1, \dots, x_{i-1})$ having an analytic density (Student-t times the Jacobian term). This allows one to form a fully differentiable recurrent computation graph whose parameters (the parameters of the Student-t Processes for each component of $z$, plus the parameters of the encoder $f$) can be learned using any stochastic gradient method. https://i.imgur.com/yRrRaMs.png
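To spell out the computation described above, here is a small hypothetical sketch. The interfaces `encoder` (an invertible flow with a tractable log-Jacobian, RealNVP-style) and `latent_process` (the per-dimension Student-t process with a closed-form predictive) are placeholders of my own, not the authors' API; the sketch only makes explicit how the autoregressive log-likelihood and a few-shot sample would be assembled.

```python
# Hypothetical interfaces: encoder.forward(x) -> (z, log|det df/dx|),
# encoder.inverse(z) -> x, latent_process.predictive_logpdf(z, context),
# latent_process.predictive_sample(context, rng).

def log_likelihood(xs, encoder, latent_process):
    """log p(x_1..x_n) = sum_i [ log p(z_i | z_<i) + log|det df/dx (x_i)| ]."""
    total, context = 0.0, []
    for x in xs:
        z, log_det_jac = encoder.forward(x)          # z_i = f(x_i)
        total += latent_process.predictive_logpdf(z, context) + log_det_jac
        context.append(z)                            # condition the next factor on z_<i
    return total

def sample_next(xs, encoder, latent_process, rng):
    """Few-shot generation: z_{n+1} ~ p(z | z_1..z_n), decoded through f^{-1}."""
    context = [encoder.forward(x)[0] for x in xs]
    z_next = latent_process.predictive_sample(context, rng)
    return encoder.inverse(z_next)                   # x_pred = f^{-1}(z_{n+1})
```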
[link]
This paper is about an interactive Visual Question Answering (VQA) setting in which agents must ask questions about images in order to learn. This closely mimics how people learn from each other using natural language, and it has strong potential to learn much faster with less data. It is referred to as learning by asking (LBA) throughout the paper. The approach is composed of three models: http://imisra.github.io/projects/lba/approach_HQ.jpeg

1. **Question proposal module** is responsible for generating _important_ questions about the image. It is a combination of 2 models:
 - **Question generator** model produces a question. It is an LSTM that takes image features and a question type (a random choice from the available options) as input and outputs a question.
 - **Question relevance** model selects questions relevant to the image. It is a stacked attention network (shown below) that takes in the generated question and image features and filters out questions irrelevant to the image. https://i.imgur.com/awPcvYz.png
2. **VQA module** learns to predict the answer given the image features and the question. It is implemented as the stacked attention architecture shown above.
3. **Question selection module** selects the most informative question to ask. It takes the current state of the VQA module and its output to calculate the expected accuracy improvement (details are in the paper), which measures how quickly the VQA module can potentially improve for each answer. The single-question selection strategy (i.e. picking the best question for the VQA module to improve the fastest) is based on an epsilon-greedy policy (see the toy sketch below).

This method (i.e. LBA) is shown to be about 50% more data efficient than the naive VQA method. As an interesting future direction of this work, the authors propose to use real-world images and to include a human in the training loop as the answer provider.
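For concreteness, here is a toy sketch of the epsilon-greedy selection step mentioned above (my own illustration, not the paper's code). The candidate questions and their expected-accuracy-improvement scores are hypothetical stand-ins for what the question proposal module and the improvement estimate would produce.

```python
# Toy epsilon-greedy question selection; scores are hypothetical placeholders.
import random

def select_question(candidates, expected_improvement, epsilon=0.1):
    """With prob. epsilon explore a random relevant question, otherwise exploit
    the question from which the VQA module is expected to improve the fastest."""
    if random.random() < epsilon:
        return random.choice(candidates)
    best = max(range(len(candidates)), key=lambda i: expected_improvement[i])
    return candidates[best]

questions = ["What color is the cat?", "How many chairs are there?", "Is it raining?"]
scores = [0.02, 0.11, 0.05]   # hypothetical expected accuracy improvements
print(select_question(questions, scores))
```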