[link]
It's a commonly understood problem in Reinforcement Learning that it is difficult to fully specify your exact reward function for an agent you're training, especially when that agent will need to operate in conditions potentially different from those it was trained in. The canonical example of this, used throughout the Inverse Reward Design paper, is that of an agent trained on an environment of grass and dirt that now encounters an environment with lava. In a typical problem setup, the agent would be indifferent to passing or not passing over the lava, because it was never disincentivized from doing so during training.

The fundamental approach this paper takes is to explicitly assume that there exists a reward designer who gave the agent some proxy reward, and that that proxy reward is a good approximation of the true reward on the training environment, but might not be on the test environment. This framing, of the reward as a noisy signal, allows the model to formalize its uncertainty about scenarios where the proxy reward might be a poor mapping to the real one.

The way the paper tests this is through a pretty simplified model. In the example, the agent is given a reward function expressed by a weighting of the different squares it could navigate into: it has a strong positive weight on dirt, and a strong negative one on grass. The agent then enters an environment where there is lava, which, implicitly, it has a 0 penalty for in its reward function. However, if you integrate over all possible weight values for "lava", none of them would have produced different behavior over the training trajectories, since lava never appeared there. Thus, if you assume high uncertainty over that weight, and adopt a risk-averse policy where under uncertainty you assume bad outcomes, the agent ends up avoiding values of the environment feature vector that it had no data weighting against during training.

Overall, the intuition of this paper makes sense to me, but it's unclear to me whether the formulation it uses generalizes outside of a fairly trivial setting, where your reward function is an explicit and given function of your feature vectors, rather than (as is typical) a scalar score not explicitly parametrized by the states of the game prior to the very last one. It's certainly possible that it might, but I don't feel like I quite have the confidence to say at this point.
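To make that risk-averse idea concrete, here is a minimal, hypothetical Python sketch. The feature layout, candidate weight vectors, and function name are invented for illustration; the paper itself infers a posterior over true reward weights rather than using a fixed candidate set.

```python
import numpy as np

def risk_averse_value(trajectory_features, candidate_weights):
    """Score a trajectory by its worst-case reward over candidate weight vectors.

    trajectory_features: (d,) feature counts along a trajectory,
        e.g. [num_dirt_cells, num_grass_cells, num_lava_cells].
    candidate_weights: (k, d) weight vectors, all consistent with the agent's
        behavior on the training environment (a hypothetical stand-in for the
        paper's posterior over true reward weights).
    """
    rewards = candidate_weights @ trajectory_features  # reward under each hypothesis
    return rewards.min()  # risk-averse: plan against the worst plausible reward

# The hypotheses agree on dirt (+1) and grass (-1), but disagree wildly on lava,
# because no training trajectory ever constrained that weight.
candidates = np.array([
    [1.0, -1.0,  0.0],
    [1.0, -1.0, -5.0],
    [1.0, -1.0,  0.5],
])
path_through_lava = np.array([3.0, 0.0, 2.0])  # more dirt, but crosses lava twice
path_around_lava = np.array([2.0, 1.0, 0.0])   # crosses one grass cell instead

print(risk_averse_value(path_through_lava, candidates))  # -7.0: dragged down by the -5 hypothesis
print(risk_averse_value(path_around_lava, candidates))   #  1.0: identical under every hypothesis
```

Under this worst-case scoring, the path that avoids the never-seen feature wins, which is exactly the behavior described above.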
[link]
This paper has an unusual and interesting goal, compared to those I more typically read: it wants to develop a "translation" between the messages produced by a model and the natural language used by a human. More specifically, the paper seeks to do this in the context of a two-player game, where one player needs to communicate information to the other. A few examples of this are:

- Being shown a color, and needing to communicate to your partner so they can choose that color
- Driving, in an environment where you can't see the other car, but you have to send a coordinating message so that you don't collide

Recently, people have started training multi-agent systems that play games like these, where they send "message" vectors back and forth, in a way fully integrated with the rest of the backpropagation procedure. From just observing the agents' actions, it's not necessarily clear which communication strategy they're using. That's why this paper poses an explicit problem: how can we map between the communication vectors produced by the agents and the words that would be produced by a human in a similar environment?

Interestingly, the paper highlights two different ways you could think about structuring a translation objective. The first is "pragmatic interpretation," under which you optimize what you communicate about something according to the operation that needs to be performed afterwards. To make that more clear, take a look at the attached picture. Imagine that player one is shown a shape, and needs to use a phrase from the bottom language (based on how many sides the shape has) to describe it to player two, who then needs to guess the size of the shape (big or small), and is rewarded for guessing correctly. Because "many" corresponds to both a large and a small shape, the strategy that optimizes the action player two takes, conditional on getting player one's message, is to lie and describe a hexagon as "few", since that will lead to correct inference about the size of the shape, which is what's most salient here. This example shows how, if you optimize a translation mapping by trying to maximize the reward that the post-translation agent can get, you might end up with a semantically incorrect translation. That might be good for the task at hand, but, because it leaves you with incorrect beliefs about the true underlying mapping, it will generalize poorly to different tasks.

The alternate approach, championed by the paper, is to train a translation such that the utterances in both languages are similar insofar as, conditional on hearing them, and having some value for their own current state, the listening player arrives at similar beliefs about the current state of the player sending the message. This is mathematically framed by defining a metric q, representing the quality of the translation between two z vectors, as: "taking an expectation over all possible contextual states of (player 1, player 2), what is the difference between the distributions of beliefs about the state of player 1 (the sending player) induced in player 2 by hearing each of the two z vectors?" Because taking the full expectation over this joint distribution is intractable, it is instead approximated by sampling.

These equations require that you have reasonable models of human language production, and of human language understanding, in the context of these games. To do this, the authors used two types of datasets:

1. Linguistic descriptions of objects, like the xkcd color dataset. Here, the player's hidden state is the color that they are trying to describe using some communication scheme.
2. Mechanical Turk runs of the aforementioned driving game, where players have to communicate with the other driver. Here, the player's "hidden state" represents a combination of its current location and intentions.

From these datasets, they can train simple emulator models that learn "what terms is a human most likely to use for a given color" [p(z|x)], and "what colors will a human guess, conditional on those terms" [p(x|z)].

The paper closes by providing a proof as to how much reward-based value is lost by optimizing for the true semantic meaning, rather than the most pragmatically useful translation. They find that there is a bound on the gap, and that, in many empirical cases, the observed gap is quite small.

Overall, this paper was limited in scope, but provided an interesting conceptual framework for thinking about how you might structure a translation, and the different implications that structure might have on your results.
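As a rough sketch of the belief-matching criterion described above, the sampled version of q might look something like this in Python. The belief-model signatures and the use of total variation distance are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def translation_quality(z_robot, z_human, sampled_listener_states,
                        robot_listener_belief, human_listener_belief):
    """Monte Carlo estimate of how similarly two messages shape a listener's beliefs.

    robot_listener_belief(z, state) and human_listener_belief(z, state) are assumed
    to return a distribution (numpy array) over the speaker's possible hidden states,
    as inferred by a listener in `state` after hearing message z. They stand in for
    the learned agent model and the human model fit on the color / driving datasets.
    """
    total = 0.0
    for state in sampled_listener_states:
        belief_from_robot_msg = robot_listener_belief(z_robot, state)
        belief_from_human_msg = human_listener_belief(z_human, state)
        # Smaller distance between the induced beliefs = better translation.
        total += 0.5 * np.abs(belief_from_robot_msg - belief_from_human_msg).sum()
    return total / len(sampled_listener_states)
```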
[link]
At NIPS 2017, Ali Rahimi was invited on stage to give a keynote after a paper he was on received the "Test of Time" award. While there, in front of several thousand researchers, he gave an impassioned argument for more rigor: more small problems to validate our assumptions, more visibility into why our optimization algorithms work the way they do. The now-famous catchphrase of the talk was "alchemy"; he argued that the machine learning community has been effective at finding things that work, but less effective at understanding why the techniques we use work. A central example he used in his talk is that of Batch Normalization: a now nearly-universal step in optimizing deep nets, but one where our accepted explanation of "reducing internal covariate shift" is less rigorous than one might hope.

With apologies for the long preamble, this is the context in which today's paper is such a welcome push in the direction of what Rahimi was advocating for: small, focused experimentation that tries to build up knowledge from principles, and, specifically, asks the question "Does Batch Norm really work via reducing covariate shift?"

To answer the question of whether internal covariate shift is a likely mechanism of the (empirically very solid) improved performance of Batch Norm, the authors run a few simple experiments. First, and most straightforwardly, they train a basic convolutional net with and without Batch Norm, pick a layer, and visualize the activation distribution of that layer over time, in both the Batch Norm and non-Batch Norm case. While they saw the expected performance boost, the Batch Norm case didn't seem to be meaningfully more stable over time, relative to the normal case.

Second, the authors tested what would happen if they added non-zero-mean random noise *after* Batch Norm in the network. The upshot of this was that they were explicitly engineering internal covariate shift, and, if control thereof were the primary useful purpose of Batch Norm, you would expect it to neutralize BN's good performance. In this experiment, the authors did indeed see noisier, less stable activation distributions in the noise + BN case (in particular: look at the layer 13 activations in the attached image), but noisy BN performed nearly as well as non-noisy BN, and meaningfully better than the standard model without noise but also without BN.

As a final test, they approached the idea of "internal covariate shift" from a different definitional standpoint. Maybe a better way of thinking about it is in terms of the stability of your gradients in the face of updates made by lower layers of the network. That is to say: each parameter of the network pushes itself in the direction of lower loss, all else held equal, but in practice you change lower-level parameters simultaneously, which can make the update direction a higher-layer parameter computed no longer the right one. So, the authors calculated the "gradient delta" between the gradient the model trains on, and what the gradient would be if you estimated it *after* all of the lower layers of the model had updated, such that the distribution of inputs to that layer has changed. Although the expectation would be that this gradient delta is smaller for Batch Norm, in fact the authors found that, if anything, the opposite was true.

So, in the face of none of these ideas panning out, the authors then introduce the best idea they've found for what drives BN's improved performance: a smoothing out of the loss function that SGD is optimizing. A smoother curve means, generally speaking, that the magnitudes of your gradients will be smaller, and also that the value of the gradient will change more slowly (i.e. low second derivative). As support for this idea, they show markedly different results for BN vs standard models in terms of, for example, how predictive a gradient at one point is of the gradient taken after you take a step in the direction of the first gradient. BN has meaningfully more predictive gradients, tied to lower variance in the values of the loss function in the direction of the gradient. The logic for why the mechanism of BN would cause this outcome is a bit tied up in math that's hard to explain without LaTeX visuals, but it basically comes from the idea that Batch Norm decreases the magnitude of the gradient of each layer's output with respect to individual weight parameters, by averaging out those magnitudes over the batch.

As Rahimi said in his initial talk, a lot of modern modeling is "applying brittle optimization techniques to loss surfaces we don't understand." And, by and large, that is in fact true: it's devilishly difficult to get a good handle on what loss surfaces are doing when they're doing it in several-million-dimensional space. But the fact that it's hard doesn't mean we should just give up on searching for principles we can build our understanding on, and I think this paper is a really fantastic example of how that can be done well.
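To make the "predictive gradients" measurement concrete, here is a rough PyTorch sketch of that idea (not the authors' exact protocol): compute the gradient, take a step along it, recompute the gradient at the new point, and compare.

```python
import torch

def gradient_predictiveness(model, loss_fn, inputs, targets, step_size=0.1):
    """How much does the gradient change after a step along itself?

    A smaller change suggests a smoother loss surface and more 'predictive'
    gradients, which is the property the paper attributes to Batch Norm.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient at the current parameters.
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, params)

    # Take a step along the (negative) gradient, then re-evaluate the gradient.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=-step_size)
    loss_after = loss_fn(model(inputs), targets)
    grads_after = torch.autograd.grad(loss_after, params)

    # Undo the step so the model is left unchanged.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=step_size)

    # L2 distance between the two gradients, aggregated over all parameters.
    sq_diff = sum(((g - ga) ** 2).sum() for g, ga in zip(grads, grads_after))
    return sq_diff.sqrt()
```

Running this on otherwise-identical models with and without Batch Norm is one simple way to reproduce the comparison the paper describes.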
[link]
If you were to survey researchers, and ask them to name the 5 most broadly influential ideas in Machine Learning from the last 5 years, I'd bet good money that Batch Normalization would be somewhere on everyone's lists. Before Batch Norm, training meaningfully deep neural networks was an unstable process, and one that often took a long time to converge to success. When we added Batch Norm to models, it allowed us to increase our learning rates substantially (leading to quicker training) without the risk of activations either collapsing or blowing up in value. It had this effect because it addressed one of the key difficulties of deep networks: internal covariate shift.

To understand this, imagine the smaller problem of a one-layer model that's trying to classify based on a set of input features. Now, imagine that, over the course of training, the input distribution of features moved around, so that, perhaps, a value that was at the 70th percentile of the data distribution initially is now at the 30th. We have an obvious intuition that this would make the model quite hard to train, because it would learn some mapping between feature values and class at the beginning of training, but that mapping would become invalid by the end. This is, fundamentally, the problem faced by the higher layers of deep networks, since, if the distribution of activations in a lower layer changes even by a small amount, that can cause a "butterfly effect" style outcome, where the activation distributions of higher layers change more dramatically. Batch Normalization - which takes each feature "channel" a network learns, and normalizes [normalize = subtract mean, divide by variance] it by the mean and variance of that feature over spatial locations and over all the observations in a given batch - helps solve this problem because it ensures that, throughout the course of training, the distribution of inputs that a given layer sees stays roughly constant, no matter what the lower layers get up to.

On the whole, Batch Norm has been wildly successful at stabilizing training, and is now canonized - along with the likes of ReLU and Dropout - as one of the default sensible training procedures for any given network. However, it does have its difficulties and downsides. A salient one of these comes about when you train using very small batch sizes - in the range of 2-16 examples per batch. Under these circumstances, the mean and variance calculated off of that batch are noisy and high variance (for the general reason that statistics calculated off of small sample sizes are noisy and high variance), which takes away from the stability that Batch Norm is trying to provide.

One proposed alternative to Batch Norm, which doesn't run into this problem of small sample sizes, is Layer Normalization. This operates under the assumption that the activations of all feature "channels" within a given layer hopefully have roughly similar distributions, and so you can normalize all of them using the mean and variance computed over all channels, *for a given observation*. Because there are typically many channels in a given layer, this means that you have many "samples" that go into the mean and variance. However, this assumption - that the distributions for each feature channel are roughly the same - can be an incorrect one.

A useful model I have for thinking about the distinction between these two approaches is the idea that both are calculating approximations of an underlying abstract notion: the in-the-limit mean and variance of a single feature channel, at a given point in time. Batch Normalization is an approximation of that insofar as it only has a small sample of points to work with, and so its estimate will tend to be high variance. Layer Normalization is an approximation insofar as it makes the assumption that feature distributions are aligned across channels: if this turns out not to be the case, individual channels will have normalizations that are biased, due to being pulled towards the mean and variance calculated over an aggregate of channels that are different from them.

Group Norm tries to find a balance point between these two approaches: one that uses multiple channels, and normalizes within a given instance (to avoid the problems of small batch size), but, instead of calculating the mean and variance over all channels, calculates them over a group of channels that represents a subset. The inspiration for this idea comes from the fact that, in old school computer vision, it was typical to have parts of your feature vector that - for example - represented a histogram of some value (say: localized contrast) over the image, with these multiple values all corresponding to a larger, shared "group" feature. If a group of features all represent a similar idea, then their distributions will be more likely to be aligned, and therefore you have less of the bias issue.

One confusing element of this paper for me was that the motivation section strongly implied that the reason Group Norm is sensible is that you are able to combine statistically dependent channels into a group together. However, as far as I can tell, there's no actual clustering or similarity analysis of channels done to place certain channels into certain groups; channels are just grouped semi-arbitrarily based on their index location within the feature channel vector. So, under this implementation, it seems like the benefits of Group Norm are less about any explicit seeking out of dependent channels, and more that just having fewer channels in each group means that each individual channel makes up more of the weight in its group, which does something to reduce the bias effect anyway.

The upshot of the Group Norm paper, results-wise, is that Group Norm performs better than both Batch Norm and Layer Norm at very low batch sizes. This is useful if you're training on very dense data (e.g. high res video), where it might be difficult to store more than a few observations in memory at a time. However, once you get to batch sizes of ~24, Batch Norm starts to do better, presumably since that's a large enough sample size to reduce variance, and you get to the point where the variance of BN is preferable to the bias of GN.
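For concreteness, here is a toy numpy sketch of which axes each scheme computes its statistics over, for an (N, C, H, W) activation tensor. Learnable scale and shift parameters are omitted; this is an illustration of the idea, not the paper's reference implementation.

```python
import numpy as np

def normalize(x, kind="group", num_groups=8, eps=1e-5):
    """Normalize an (N, C, H, W) activation tensor under one of three schemes."""
    n, c, h, w = x.shape
    if kind == "batch":
        # Batch Norm: statistics per channel, over the batch and spatial dims.
        mean = x.mean(axis=(0, 2, 3), keepdims=True)
        var = x.var(axis=(0, 2, 3), keepdims=True)
    elif kind == "layer":
        # Layer Norm: statistics per example, over all channels and spatial dims.
        mean = x.mean(axis=(1, 2, 3), keepdims=True)
        var = x.var(axis=(1, 2, 3), keepdims=True)
    else:
        # Group Norm: statistics per example, over a contiguous group of channels
        # (grouped by index, with no clustering of similar channels).
        g = x.reshape(n, num_groups, c // num_groups, h, w)
        mean = g.mean(axis=(2, 3, 4), keepdims=True)
        var = g.var(axis=(2, 3, 4), keepdims=True)
        return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    return (x - mean) / np.sqrt(var + eps)
```

The only real difference between the three is which dimensions the mean and variance are pooled over, which is exactly the variance/bias trade-off discussed above.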
[link]
I have a lot of fondness for this paper as a result of its impulse towards clear explanations, simplicity, and pushing back against complexity for complexity's sake.

The goal of the paper is pretty straightforward. Long Short Term Memory networks (LSTMs) work by having a memory vector, and pulling information into and out of that vector through a gating system. These gates take as input the context of the network at a given timestep (the prior hidden state, and the current input), apply weight matrices and a sigmoid activation, and produce "mask" vectors with values between 0 and 1. A typical LSTM learns three separate gates: a "forget" gate that controls how much of the old memory vector is remembered, an "input" gate that controls how much new contextual information is added to the memory, and an "output" gate that controls how much of the output (a sum of the gated memory information and the gated input information) is passed outward into a hidden state context that's visible to the rest of the network. Note that "hidden" is an unfortunate word here, since this is actually the state that is visible to the rest of the network, whereas the "memory" vector is only visible to the next-step memory updating calculations. Also note that "forget gate" is an awkward name insofar as the higher the value of the forget gate, the more the model *remembers* of its past memory. This is confusing, but we appear to be stuck with this terminology.

The Gated Recurrent Unit, or GRU, did away with the output gate. In this system, the difference between the "hidden" and "memory" vectors is removed, so the network no longer has separate information channels for communicating with subsequent layers and for memory passed to future timesteps. On a wide range of problems, the GRU has performed comparably to the LSTM. This makes the authors ask: if a two-gate model can do as well, can a single-gate model? In particular: how well does an LSTM-style model perform if it only has a forget gate?

The answer, to not bury the probably-obvious lede, is: quite well. Models that only have a forget gate perform comparably to or better than traditional LSTM models on the tasks at which they were tried. On a mechanical level, not having an input gate means that, instead of having individual scalings for "how much old memory do you remember" and "how much new context do you take in" - so that those values could be, for example, 0.2 and 0.15 - these numbers are defined as a convex combination controlled by a single value, the forget gate. That's a fancy way of saying: we calculate some x between 0 and 1, and that's the weight on the old memory, and then (1-x) is the weight on the new input. This model, for reasons that are entirely unjustified, and obviously the result of some In Joke, is called JANET, because with a single gate, it's Just Another NETwork. Image is attached to prove I'm Not Making This Shit Up.

The authors go down a few pathways of explaining why this forget-only model performs well, of which the most compelling is that it gives the model an easier and more efficient way to learn a skip connection, where information is passed down more or less intact to a future point in the model. It's more straightforward to learn because the "skip-ness" of the connection - or, how strongly the information wants to propagate into the future - is controlled by just one set of parameters, and not a complex interaction of input, forget, and output gates.
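A minimal numpy sketch of that convex-combination update, written as a simplified cell in the spirit of JANET; the parameter names are hypothetical and this is not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def janet_step(c_prev, x_t, params):
    """One step of a forget-gate-only recurrent cell.

    params holds weight matrices W_f, U_f, W_c, U_c and biases b_f, b_c with
    compatible shapes; c_prev doubles as both the memory and the output, since
    there is no separate output gate.
    """
    # The single (forget) gate, computed from the current input and previous state.
    f = sigmoid(params["W_f"] @ x_t + params["U_f"] @ c_prev + params["b_f"])
    # Candidate new memory content.
    c_tilde = np.tanh(params["W_c"] @ x_t + params["U_c"] @ c_prev + params["b_c"])
    # Convex combination: f weights the old memory, (1 - f) the new candidate.
    return f * c_prev + (1.0 - f) * c_tilde
```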
An interesting side investigation they perform is how the initialization of the bias term in the forget gate (which is calculated by applying weights to the input and the former hidden state, and then adding a constant bias term) affects the model's ability to learn long term dependencies. In particular, they discuss the situation where the model gets some signal, and then a long string of 0 values. If the bias term is quite low, then with those 0 inputs the forget gate reduces to roughly the sigmoid of the bias, and the more times that value is multiplied into the memory, the smaller and closer to 0 the remembered signal gets. The paper suggests initializing the bias of the forget gate according to the longest dependencies you expect the model to have, with the idea that you should more strongly bias your model towards remembering old information, regardless of what new information comes in, if you expect long term dependencies to be strongly relevant.
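A back-of-the-envelope illustration of that point, using the simplification above that with zero inputs the forget gate collapses to the sigmoid of its bias; the bias values and sequence length here are arbitrary, not taken from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# With a run of zero inputs, the memory is rescaled by roughly sigmoid(b_f) each
# step, so after t steps only sigmoid(b_f) ** t of the original signal remains.
for b_f in [0.0, 3.0, 5.0]:
    retained = sigmoid(b_f) ** 100
    print(f"forget-gate bias {b_f}: {retained:.3f} of the signal survives 100 zero-input steps")
```

A bias near zero wipes out the signal almost immediately, while a larger bias lets a meaningful fraction survive, which is why the suggested initialization scales with the longest dependency you expect.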