ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

arxiv.org
arxiv-vanity.com
scholar.google.com

Reward Augmented Maximum Likelihood for Neural Structured Prediction
Mohammad Norouzi and Samy Bengio and Zhifeng Chen and Navdeep Jaitly and Mike Schuster and Yonghui Wu and Dale Schuurmans
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG
more

[link] Summary by Jon Gauthier 9 years ago

(See also a more thorough summary in [a LaTeX PDF][1].)

This paper has some nice clear theory which bridges maximum likelihood (supervised) learning and standard reinforcement learning. It focuses on *structured prediction* tasks, where we want to learn to predict $p_\theta(y \mid x)$ where $y$ is some object with complex internal structure.

We can agree on some deficiencies of maximum likelihood learning:

- ML training fails to assign **partial credit**. Models are trained to maximize the likelihood of the ground-truth outputs in the dataset, and all other outputs are equally wrong. This is an increasingly important problem as the space of possible solutions grows.
- ML training is potentially disconnected from **downstream task reward**. In machine translation, we usually want to optimize relatively complex metrics like BLEU or TER. Since these metrics are non-differentiable, we have to settle for optimizing proxy losses that we hope are related to the metric of interest.

Reinforcement learning offers an attractive alternative in theory. RL algorithms are designed to optimize non-differentiable (even stochastic) reward functions, which sounds like just what we want. But RL algorithms have their own problems with this sort of structured output space:

- Standard RL algorithms rely on samples from the model we are learning, $p_\theta(y \mid x)$. This becomes intractable when our output space is very complex (e.g. 80-token sequences where each word is drawn from a vocabulary of 80,000 words).
- The reward spaces for problems of interest are extremely sparse. Our metrics will assign 0 reward to most of the 80^80K possible outputs in the translation problem in the paper.
- Vanilla RL doesn't take into account the ground-truth outputs available to us in structured prediction.

This paper designs a solution which combines supervised learning with a reinforcement learning-inspired smoothing method. Concretely, the authors design an **exponentiated payoff distribution** $q(y \mid y^*; \tau)$ which assigns high mass to high-reward outputs $y$ and low mass elsewhere. This distribution is used to effectively smooth the loss function established by the ground-truth outputs in the supervised data. We end up optimizing the following objective:

$$\mathcal L_\text{RML} = - \mathbb E_{x, y^* \sim \mathcal D}\left[ \sum_y q(y \mid y^*; \tau) \log p_\theta(y \mid x) \right]$$

This optimization depends on samples from our dataset $\mathcal D$ and, more importantly, the stationary payoff distribution $q$. This contrasts strongly with standard RL training, where the objective depends on samples from the non-stationary model distribution $p_\theta$. To make that clear, we can rewrite the above with another expectation:

$$\mathcal L_\text{RML} = - \mathbb E_{x, y^* \sim \mathcal D, y \sim q(y \mid y^*; \tau)}\left[ \log p_\theta(y \mid x) \right]$$

### Model details

If you're interested in the low-level details, I wrote up the gist of the math in [this PDF][1].

### Analysis

#### Relationship to label smoothing

This training approach is mathematically equivalent to label smoothing, applied here to structured output problems. In next-word prediction language modeling, a popular trick involves smoothing the target distributions by combining the ground-truth output with some simple base model, e.g. a unigram word frequency distribution. (This just means we take a weighted sum of the one-hot vector from our supervised data and a normalized frequency vector calculated on some corpus.) Mathematically, the cross entropy with label smoothing is

$$\mathcal L_\text{ML-smooth} = - \mathbb E_{x, y^* \sim \mathcal D} \left[ \sum_y p_\text{smooth}(y; y^*) \log p_\theta(y \mid x) \right]$$

(The equation above leaves out a constant entropy term.)

The gradient of this objective looks exactly the same as the reward-augmented ML gradient from the paper:

$$\nabla_\theta \mathcal L_\text{ML-smooth} = \mathbb E_{x, y^* \sim \mathcal D, y \sim p_\text{smooth}} \left[ \log p_\theta(y \mid x) \right]$$

So reward-augmented likelihood is equivalent to label smoothing, where our smoothing distribution is log-proportional to our downstream reward function.

#### Relationship to distillation

Optimizing the reward-augmented maximum likelihood is equivalent to minimizing the KL divergence $$D_\text{KL}(q(y \mid y^*; \tau) \mid\mid p_\theta(y \mid x))$$

This divergence reaches zero iff $q = p$. We can say, then, that the effect of optimizing on $\mathcal L_\text{RML}$ is to **distill** the reward function (which parameterizes $q$) into the model parameters $\theta$ (which parameterize $p_\theta$).

It's exciting to think about other sorts of more complex models that we might be able to distill in this framework. The unfortunate (?) restriction is that the "source" model of the distillation ($q$ in this paper) must admit to efficient sampling.

#### Relationship to adversarial training

We can also view reward-augmented maximum likelihood training as a data augmentation technique: it synthesizes new "partially correct" examples using the reward function as a guide. We then train on all of the original and synthesized data, again weighting the gradients based on the reward function.

Adversarial training is a similar data augmentation technique which generates examples that force the model to be robust to changes in its input space (robust to changes of $x$). Both adversarial training and the RML objective encourage the model to be robust "near" the ground-truth supervised data. A high-level comparison:

- Adversarial training can be seen as data augmentation in the input space; RML training performs data augmentation in the output space.
- Adversarial training is a **model-based data augmentation**: the samples are generated from a process that depends on the current parameters during training. RML training performs **data-based augmentation**, which could in theory be done independent of the actual training process.

---

Thanks to Andrej Karpathy, Alec Radford, and Tim Salimans for interesting discussion which contributed to this summary.

[1]: https://drive.google.com/file/d/0B3Rdm_P3VbRDVUQ4SVBRYW82dU0/view

papers.nips.cc
scholar.google.com

Variational Dropout and the Local Reparameterization Trick
Blum, Avrim and Haghtalab, Nika and Procaccia, Ariel D.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 10 years ago

This paper starts by introducing a trick to reduce the variance of stochastic gradient variational Bayes (SGVB) estimators. In neural networks, SGVB consists in learning a variational (e.g. diagonal Gaussian) posterior over the weights and biases of neural networks, through a procedure that (for the most part) alternates between adding (Gaussian) noise to the model's parameters and then performing a model update with backprop.

The authors present a local reparameterization trick, which exploits the fact that the Gaussian noise added into the weights could instead be added directly into the pre-activation (i.e. before the activation fonction) vectors during forward propagation. This is due to the fact that computing the pre-activation is a linear operation, thus noise at that level is also Gaussian. The advantage of doing so is that, in the context of minibatch training, one can efficiently then add independent noise to the pre-activation vectors for each example of the minibatch. The nature of the local reparameterization trick implies that this is equivalent to using one corrupted version of the weights for each example in the minibatch, something that wouldn't be practical computationally otherwise. This is in fact why, in normal SGVB, previous work would normally use a single corrupted version of the weights for all the minibatch.

The authors demonstrate that using the local reparameterization trick yields stochastic gradients with lower variance, which should improve the speed of convergence.

Then, the authors demonstrate that the Gaussian version of dropout (one that uses multiplicative Gaussian noise, instead of 0-1 masking noise) can be seen as the local reparameterization trick version of a SGVB objective, with some specific prior and variational posterior. In this SGVB view of Gaussian dropout, the dropout rate is an hyper-parameter of this prior, which can now be tuned by optimizing the variational lower bound of SGVB. In other words, we now have a method to also train the dropout rate! Moreover, it becomes possible to tune an individual dropout rate parameter for each layer, or even each parameter of the model.

Experiments on MNIST confirm that tuning that parameter works and allows to reach good performance of various network sizes, compared to using a default dropout rate.

##### My two cents

This is another thought provoking connection between Bayesian learning and dropout. Indeed, while Deep GPs have allowed to make a Bayesian connection with regular (binary) dropout learning \cite{journals/corr/GalG15}, this paper sheds light on a neat Bayesian connection for the Gaussian version of dropout. This is great, because it suggests that Gaussian dropout training is another legit way of modeling uncertainty in the parameters of neural networks. It's also nice that that connection also yielded a method for tuning the dropout rate automatically.

I hope future work (by the authors or by others) can evaluate the quality of the corresponding variational posterior in terms of estimating uncertainty in the network and, in particular, in obtaining calibrated output probabilities.

Little detail: I couldn't figure out whether the authors tuned a single dropout rate for the whole network, or used many rates, for instance one per parameter, as they suggest can be done.

arxiv.org
scholar.google.com

Building Machines That Learn and Think Like People
Brenden M. Lake and Tomer D. Ullman and Joshua B. Tenenbaum and Samuel J. Gershman
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.AI, cs.CV, cs.LG, cs.NE, stat.ML
more

[link] Summary by Abhishek Das 8 years ago

This paper performs a comparitive study of recent advances in deep learning with human-like learning from a cognitive science point of view. Since natural intelligence is still the best form of intelligence, the authors list a core set of ingredients required to build machines that reason like humans.

- Cognitive capabilities present from childhood in humans.
- Intuitive physics; for example, a sense of plausibility of object trajectories, affordances.
- Intuitive psychology; for example, goals and beliefs.
- Learning as rapid model-building (and not just pattern recognition).
- Based on compositionality and learning-to-learn.
- Humans learn by inferring a general schema to describe goals, object types and interactions. This enables learning from few examples.
- Humans also learn richer conceptual models.
- Indicator: variety of functions supported by these models: classification, prediction, explanation, communication, action, imagination and composition.
- Models should hence have strong inductive biases and domain knowledge built into them; structural sharing of concepts by compositional reuse of primitives.
- Use of both model-free and model-based learning.
- Model-free, fast selection of actions in simple associative learning and discriminative tasks.
- Model-based learning when a causal model has been built to plan future actions or maximize rewards.
- Selective attention, augmented working memory, and experience replay are low-level promising trends in deep learning inspired from cognitive psychology.
- Need for higher-level aforementioned ingredients.

arxiv.org
arxiv-vanity.com
scholar.google.com

Online Continual Learning with Maximally Interfered Retrieval
Rahaf Aljundi and Lucas Caccia and Eugene Belilovsky and Massimo Caccia and Laurent Charlin and Tinne Tuytelaars
arXiv e-Print archive - 2019 via Local arXiv
Keywords: cs.LG, stat.ML
more

[link] Summary by Massimo Caccia 6 years ago

Disclaimer: I am an author

# Intro

Experience replay (ER) and generative replay (GEN) are two effective continual learning strategies. In the former, samples from a stored memory are replayed to the continual learner to reduce forgetting. In the latter, old data is compressed with a generative model and generated data is replayed to the continual learner. Both of these strategies assume a random sampling of the memories. But learning a new task doesn't cause **equal** interference (forgetting) on the previous tasks!  

In this work, we propose a controlled sampling of the replays. Specifically, we retrieve the samples which are most interfered, i.e. whose prediction will be most negatively impacted by the foreseen parameters update. The method is called Maximally Interfered Retrieval (MIR).

## Cartoon for explanation

https://i.imgur.com/5F3jT36.png

Learning about dogs and horses might cause more interference on lions and zebras than on cars and oranges. Thus, replaying lions and zebras would be a more efficient strategy.

# Method

1) incoming data: $(X_t,Y_t)$

2) foreseen parameter update: $\theta^v= \theta-\alpha\nabla\mathcal{L}(f_\theta(X_t),Y_t)$

### applied to ER (ER-MIR)
3) Search for the top-$k$ values $x$ in the stored memories using the criterion $$s_{MI}(x) = \mathcal{L}(f_{\theta^v}(x),y) -\mathcal{L}(f_{\theta}(x),y)$$

### or applied to GEN (GEN-MIR)
3)   
$$
     \underset{Z}{\max} \, \mathcal{L}\big(f_{\theta^v}(g_\gamma(Z)),Y^*\big) -\mathcal{L}\big(f_{\theta}(g_\gamma(Z)),Y^*\big)
$$
$$
         \text{s.t.}   \quad ||z_i-z_j||_2^2 > \epsilon \forall  z_i,z_j \in Z \,\text{with} \, z_i\neq z_j
$$
i.e. search in the latent space of a generative model $g_\gamma$ for samples that are the most forgotten given the foreseen update.

4) Then add theses memories to incoming data $X_t$ and train $f_\theta$

# Results

### qualitative
https://i.imgur.com/ZRNTWXe.png

Whilst learning 8s and 9s (first row), GEN-MIR mainly retrieves 3s and 4s (bottom two rows) which are similar to 8s and 9s respectively.

### quantitative 

GEN-MIR was tested on MNIST SPLIT and Permuted MNIST, outperforming the baselines in both cases.

ER-MIR was tested on MNIST SPLIT, Permuted MNIST and Split CIFAR-10, outperforming the baselines in all cases.


# Other stuff
### (for avid readers)

We propose a hybrid method (AE-MIR) in which the generative model is replaced with an autoencoder to facilitate the compression of harder dataset like e.g. CIFAR-10.

arxiv.org
arxiv-vanity.com
scholar.google.com

Language GANs Falling Short
Massimo Caccia and Lucas Caccia and William Fedus and Hugo Larochelle and Joelle Pineau and Laurent Charlin
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CL, cs.LG
more

[link] Summary by CodyWild 7 years ago

This paper’s high-level goal is to evaluate how well GAN-type structures for generating text are performing, compared to more traditional maximum likelihood methods. In the process, it zooms into the ways that the current set of metrics for comparing text generation fail to give a well-rounded picture of how models are performing.

In the old paradigm, of maximum likelihood estimation, models were both trained and evaluated on a maximizing the likelihood of each word, given the prior words in a sequence. That is, models were good when they assigned high probability to true tokens, conditioned on past tokens. However, GANs work in a fundamentally new framework, in that they aren’t trained to increase the likelihood of the next (ground truth) word in a sequence, but to generate a word that will make a discriminator more likely to see the sentence as realistic. Since GANs don’t directly model the probability of token t, given prior tokens, you can’t evaluate them using this maximum likelihood framework.

This paper surveys a range of prior work that has evaluated GANs and MLE models on two broad categories of metrics, occasionally showing GANs to perform better on one or the other, but not really giving a way to trade off between the two.
- The first type of metric, shorthanded as “quality”, measures how aligned the generated text is with some reference corpus of text: to what extent your generated text seems to “come from the same distribution” as the original. BLEU, a heuristic frequently used in translation, and also leveraged here, measures how frequently certain sets of n-grams occur in the reference text, relative to the generated text. N typically goes up to 4, and so in addition to comparing the distributions of single tokens in the reference and generated, BLEU also compares shared bigrams, trigrams, and quadgrams (?) to measure more precise similarity of text.
- The second metric, shorthanded as “diversity” measures how different generated sentences are from one another. If you want to design a model to generate text, you presumably want it to be able to generate a diverse range of text - in probability terms, you want to fully sample from the distribution, rather than just taking the expected or mean value. Linguistically, this would be show up as a generator that just generates the same sentence over and over again. This sentence can be highly representative of the original text, but lacks diversity. One metric used for this is the same kind of BLEU score, but for each generated sentence against a corpus of prior generated sentences, and, here, the goal is for the overlap to be as low as possible

The trouble with these two metrics is that, in their raw state, they’re pretty incommensurable, and hard to trade off against one another. The authors of this paper try to address this by observing that all models trade off diversity and quality to some extent, just by modifying the entropy of the conditional token distribution they learn. If a distribution is high entropy, that is, if it spreads probability out onto more tokens, it’s likelier to bounce off into a random place, which increases diversity, but also can make the sentence more incoherent. By contrast, if a distribution is too low entropy, only ever putting probability on one or two words, then it will be only ever capable of carving out a small number of distinct paths through word space.
The below table shows a good example of what language generation can look like at high and low levels of entropy
https://i.imgur.com/YWGXDaJ.png

The entropy of a softmax distribution be modified, without changing the underlying model, by changing the *temperature* of the softmax calculation. So, the authors do this, and, as a result, they can chart out that model’s curve on the quality/diversity axis. Conceptually, this is asking “at a range of different quality thresholds, how good is this model’s diversity,” and vice versa. I mentally analogize this to a ROC curve, where it’s not really possible to compare, say, precision of models that use different thresholds, and so you instead need to compare the curve over a range of different thresholds, and compare models on that.

https://i.imgur.com/C3zdEjm.png

When they do this for GANs and MLEs, they find that, while GANs might dominate on a single metric at a time, when you modulate the temperature of MLE models, they’re able to achieve superior quality when you tune them to commensurate levels of diversity.