[link]
The general goal of meta-learning systems is to learn useful shared structure across a broad distribution of tasks, in such a way that learning on a new task can be faster. Historically this has been done through learned initializations (i.e. initializing the network at a point from which it is easy to further optimize on each individual task drawn from some distribution of tasks), and through recurrent network structures (where you treat the multiple timesteps of a recurrent network as the training iterations on a single task, and train the recurrent weights of the network based on generalization performance across a wide range of tasks). This paper proposes a different approach: a learned proxy loss function.

The idea here is that, early in the learning process, handcoded rewards often aren't the best or most valuable signal to use to guide a network, both because they may be high variance, and because they might not natively incentivize things like exploration rather than just exploitation. A better situation would be if we had some more far-sighted loss function that had proved to be a good proxy over a variety of different rewards. This is exactly what this method proposes to give us.

Training consists of an inner loop and an outer loop. Each instantiation of the inner loop corresponds to a single RL task, drawn from a distribution over tasks (for example, all tasks involving the robot walking to a position, with a single instantiated task being the task of walking to one specific position). Within the inner loop, we run a typical policy gradient procedure to optimize the parameters of our policy, except that, instead of expected rewards, we optimize our policy parameters according to a loss function we specifically parametrize. Within the outer loop, we take as signal the final reward of the trained policy on this task, and use that to update our parametrized loss.

This parametrized loss is itself a neural network that takes in the agent's states, actions, and rewards over a rolling window of recent timesteps, and performs temporal convolutions on them to get a final loss value out the other side. In short, this auxiliary network takes in information about the agent's recent behavior, and outputs an assessment of how well the agent is doing according to this longer-view loss criterion. Because it's not possible to directly formulate the test performance of a policy in terms of the loss function that was used to train the policy (which would be necessary for backprop), the weights of this loss-calculating network are instead learned via evolution strategies. At a zoomed-out level, this means: making small random perturbations to the current parameters of the network, and moving in the direction of the random changes that work best. So, ultimately, you end up with a loss network that takes in recent environmental states and the behavior of the agent, and returns an estimate of the proxy loss value, trained such that it hopefully captures environmental factors that indicate progress on the task over a wide variety of similar tasks. Then, at test time, the RL agent can use that loss function to adapt its behavior.
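To make the inner/outer structure concrete, here is a minimal toy sketch of how such a two-level loop could look. Everything here is a stand-in I made up for illustration: the policy and loss network are tiny linear functions, the "task" is a one-dimensional target-reaching problem, and the inner update uses finite differences rather than a real policy gradient. Only the overall shape (inner training against the learned loss, outer evolution-strategies update of the loss parameters using the true final reward) reflects the method described above.

```python
# Hedged sketch of an EPG-style inner/outer loop: an evolution-strategies outer
# loop adapts the parameters of a learned loss, while the inner loop trains a
# policy against that loss. All pieces are toy placeholders, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
OBS = 3
LOSS_PARAMS = 3 + 1 + 1   # toy loss net sees (state, action, reward) features

def sample_task():
    """A toy task: reach a random target position on a line."""
    return {"target": rng.uniform(-1, 1)}

def rollout_reward(policy_w, task, steps=20):
    """True final-performance signal, used only by the outer loop."""
    pos, total = 0.0, 0.0
    for _ in range(steps):
        obs = np.array([pos, task["target"], 1.0])
        action = float(obs @ policy_w)
        pos += 0.1 * np.tanh(action)
        total += -abs(pos - task["target"])          # handcoded task reward
    return total

def learned_loss(loss_w, obs, action, reward):
    """Stand-in for the temporal-conv loss net: scores recent behaviour."""
    feats = np.concatenate([obs, [action, reward]])
    return float(feats @ loss_w)

def proxy_loss(policy_w, loss_w, task, steps=20):
    """Accumulated learned loss over a rollout with the current policy."""
    pos, total = 0.0, 0.0
    for _ in range(steps):
        obs = np.array([pos, task["target"], 1.0])
        action = float(obs @ policy_w)
        pos += 0.1 * np.tanh(action)
        reward = -abs(pos - task["target"])
        total += learned_loss(loss_w, obs, action, reward)
    return total

def inner_loop(loss_w, task, inner_steps=10, lr=0.05):
    """Train a fresh policy against the *learned* loss (finite differences for brevity)."""
    policy_w = np.zeros(OBS)
    for _ in range(inner_steps):
        grad = np.zeros_like(policy_w)
        for i in range(OBS):
            eps = np.zeros(OBS); eps[i] = 1e-3
            grad[i] = (proxy_loss(policy_w + eps, loss_w, task) -
                       proxy_loss(policy_w - eps, loss_w, task)) / 2e-3
        policy_w -= lr * grad
    return policy_w

# Outer loop: evolution strategies on the loss net's parameters.
loss_w = np.zeros(LOSS_PARAMS)
for generation in range(50):
    perturbs = rng.normal(size=(8, LOSS_PARAMS))
    returns = []
    for eps in perturbs:
        task = sample_task()
        policy_w = inner_loop(loss_w + 0.1 * eps, task)
        returns.append(rollout_reward(policy_w, task))   # real reward, only seen out here
    returns = np.array(returns)
    advantage = (returns - returns.mean()) / (returns.std() + 1e-8)
    loss_w += 0.01 * (advantage @ perturbs) / len(perturbs)  # ES gradient estimate
```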
An interesting note here is that, for tasks where the parameters of the task being learned are inferable from the environment (for example, where the goal is "move towards the green dot"), you don't actually need to give the agent the rewards from a new task; ideally, it will have learned how to infer the task from the environment. One of the examples they use to show their method has done something useful is training their model entirely on tasks where an ant-agent's goal is to move towards various different targets on the right, and then shifting it to a scenario where its target is towards the left. In the EPG case, the ant was able to quickly learn to move left, because its loss function was able to adapt to the new environment where the target had moved. By contrast, RL^2 (a trained learning algorithm implemented as a recurrent network) kept on moving right as its initial strategy, and seemed unable to learn the specifics of a task outside its original task distribution of "always move right". I think this paper could benefit from being a little more concrete about what its expected use cases are (like: what kinds of environments lend themselves to having proxy loss functions inferred from environmental data? Which don't?), but overall, I find the kernel of the idea this model introduces interesting, and will be interested to see if other researchers run with it.
[link]
Meta learning is an area sparking a lot of research curiosity these days. It's framed in different ways: models that can adapt, models that learn to learn, models that can learn a new task quickly. This paper uses a somewhat different lens: that of neural plasticity, and argues that applying the concept to modern neural networks will give us an effective, and biologically inspired, way of building adaptable models.

The basic premise of plasticity from a neurobiology perspective (at least as framed in the paper: I'm not a neuroscientist myself, and may be misunderstanding) is that plasticity performs a kind of gating function on the strength with which a neural link gets upregulated by experience. The more plastic a connection is, the more quickly it can get modified by new data; the less plastic, the more fixed it is. In concrete terms, this is implemented by subdividing the weight on each connection in the network into two parts: the "fixed" component and the "plastic" component. The fixed component acts like a typical weight: it gets modified during training, but stays fixed once training is done. The plastic component is composed of an alpha weight, multiplied by a term H. H is basically a decaying running average of the past input*output activations of this connection. Activations that are high in magnitude, and of the same sign, for both the input and the output will push H higher. Note that this H can continue to be updated even after the model is done training, because it builds up information whenever you pass a new input X through the network.

The plastic component's learned weight, alpha, controls how strong the influence of this is on the model. If alpha is near zero, then the connection behaves basically identically to a "typical" neural network connection, with weights that don't change as a function of activation values. If alpha is positive, strong co-activation within H will tend to make the connection weight higher. If alpha is negative, the opposite is true, and strong co-activation will make the connection weight more negative. (As an aside, I'd be really interested to see the distribution over alpha values in a trained model, relative to the weight values, and look at how often they go in the same direction as the weights and increase magnitude, versus how often they point in the opposite direction and attenuate the weight towards zero.)

These models are trained by running them for fixed-size "episodes" during which the H value gets iteratively updated, and then the alpha parameters (which scale H) get updated in the way that would have reduced error over the episode. One area in which they seem to show strong performance is memorization (where the network is shown an image once, and needs to reconstruct it later). The theory for why this works is that the weights are able to store short-term information about which pixels are in the images the network sees, by temporarily boosting themselves higher for inputs and activations they've recently seen.

There are definitely some intuitional gaps for me in this paper. The core one is: this framework just makes weights able to update themselves as a function of the values of their activations, not as a function of an actual loss function. That is to say: it seems like a potentially better analogy to neural plasticity is just a network that periodically gets more training data, and has some amount of connection plasticity to update as a result of that.
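As a way of grounding the decomposition, here's a minimal sketch of what a "plastic" linear layer along these lines might look like. The class name, shapes, tanh nonlinearity, and the exact decay rule are my own simplifications for illustration, not the paper's implementation; in the actual method, the fixed weights and alphas would be trained by backprop over episodes and then frozen, while H keeps updating.

```python
# Hedged sketch of a "plastic" linear layer: each connection's effective weight is
# a fixed part w plus alpha * H, where H is a decaying running average of recent
# input*output co-activations. Shapes and decay rule are my own simplification.
import numpy as np

class PlasticLinear:
    def __init__(self, n_in, n_out, eta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=(n_in, n_out))       # fixed component (trained, then frozen)
        self.alpha = rng.normal(scale=0.01, size=(n_in, n_out))  # plasticity coefficients (trained, then frozen)
        self.H = np.zeros((n_in, n_out))                          # Hebbian trace (keeps updating at test time)
        self.eta = eta                                            # decay rate of the running average

    def forward(self, x):
        effective_w = self.w + self.alpha * self.H   # plastic part gates in recent co-activation
        y = np.tanh(x @ effective_w)
        # Update the trace with the outer product of this input and output:
        # strongly co-active (same-sign, large) pairs push H up.
        self.H = (1 - self.eta) * self.H + self.eta * np.outer(x, y)
        return y

layer = PlasticLinear(n_in=4, n_out=3)
x = np.ones(4)
print(layer.forward(x))   # first pass: H is zero, behaves like an ordinary layer
print(layer.forward(x))   # second pass: H now reflects the previous co-activation
```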
[link]
DeepMind's recently released paper (one of a boatload coming out in the wake of ICLR, which just finished in Vancouver) addresses the problem of building an algorithm that can perform well on tasks that don't stay fixed in their definition, but instead evolve and change, without giving the agent a chance to re-train in the middle. An example used at various points in the paper is an agent trying to run East that finds two of its legs (a different two each time) slowly becoming less functional.

The theoretical framework they use to approach this problem is that of meta learning. Meta learning is typically formulated as: how can I learn to do well on a new task, given only a small number of examples of that task? That's why it's called "meta": it's an extra, higher-level optimization loop applied around the process of learning. Typical learning learns the parameters of some task; meta learning learns longer-scale parameters that make the short-scale, typical learning work better. Here, a task that evolves and changes over time (i.e. a nonstationary task) is seen as a close variant of the multi-task problem. And so the hope is that a model that can quickly adapt to arbitrary new tasks can also be used to learn the ability to adapt to a gradually changing task environment.

The meta learning algorithm that got most directly adapted for this paper is MAML: Model Agnostic Meta Learning. This algorithm works by, for a large number of tasks, initializing the model at some parameter set theta, evaluating the loss on a few examples from that task, and taking a gradient step from the initialization theta to a task-specific parameter set phi. Then, it calculates the "test set" performance of the one-step phi parameters on the task. But then - the crucial thing here - the meta learning model updates its initialization parameters, theta. So, the meta learning model is learning a set of parameters that provides a good jumping-off point for any given task within the distribution of tasks the model is trained on. In order to do this well, the theta parameters need to both 1) learn any general information shared across all tasks, and 2) position the parameters such that an initial update step moves the model in the profitable direction.

They adapted this idea, of training a model that can quickly update to multiple tasks, to the setting of a slowly/continuously changing environment, where certain parameters of the task the agent is facing drift over time. In this formulation, our set of tasks is no longer random draws from the distribution of possible tasks, but a smooth, Markov-chain walk over tasks. The main change the authors made to the original MAML algorithm was to say that each general task would start at theta, but then, as that task gradually evolved, it would perform multiple updates: theta to phi1, phi1 to phi2, and so on. The original theta parameters would then be updated according to a similar principle as in MAML: so as to make the loss, summed over the full non-stationary task (notionally composed of many little sub-tasks), as low as possible.
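For concreteness, here's a toy numeric sketch of the adapted loop described above: a shared initialization theta, an inner chain of updates theta -> phi1 -> phi2 as a single non-stationary task drifts, and a meta-update of theta based on performance on the final stage. The regression task, the single-parameter model, and the first-order approximation of the meta-gradient are all my own simplifications, not the paper's setup.

```python
# Hedged toy sketch of a MAML-style loop extended to a multi-step adaptation chain
# for non-stationary tasks. Uses a 1-parameter linear regressor and the first-order
# MAML approximation for brevity -- an illustration, not the paper's exact method.
import numpy as np

rng = np.random.default_rng(0)

def sample_task_sequence(n_stages=3):
    """A 'non-stationary task': the target slope drifts a little at each stage."""
    slope = rng.uniform(-2, 2)
    return [slope + 0.3 * i * rng.standard_normal() for i in range(n_stages)]

def loss_and_grad(theta, slope, n=16):
    """Squared error of y = theta * x against the stage's target slope."""
    x = rng.uniform(-1, 1, size=n)
    err = theta * x - slope * x
    return np.mean(err ** 2), np.mean(2 * err * x)

theta, inner_lr, meta_lr = 0.0, 0.1, 0.01
for meta_iter in range(1000):
    meta_grad = 0.0
    for _ in range(8):                         # batch of non-stationary tasks
        stages = sample_task_sequence()
        phi = theta
        for slope in stages[:-1]:              # theta -> phi1 -> phi2 as the task drifts
            _, g = loss_and_grad(phi, slope)
            phi = phi - inner_lr * g
        # "Test" the adapted parameters on the final stage; first-order meta-gradient.
        _, g_test = loss_and_grad(phi, stages[-1])
        meta_grad += g_test
    theta -= meta_lr * meta_grad / 8           # update the shared initialization
```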
[link]
The problem setting of the paper is the desire to perform translation in a monolingual setting, where datasets exist for each language independently, but there is little or no paired sentence data (paired here meaning that you know you have the same sentence or text in both languages). The paper outlines the prior methods in this area as being, first, training a single-language language model (i.e. training a model to take in a sentence and return how coherent a sentence it is in a given language) and using that to supplement a machine translation system. The authors honestly don't go into this much, so I can't tell exactly what they mean by it. The second baseline they talk about is bootstrapping additional training data: training a model using a small amount of training data, then using that mediocre model to translate additional sentences, which are used as additional training data to train the mediocre model to higher performance. It doesn't seem like this should work, but I've seen this or similar approaches used in a few cases, and it typically does add benefit. But, the authors claim, they can do better.

The core intuition of this paper is pretty simple, and will be familiar to anyone who read my summary of CycleGAN, lo these many weeks ago. Their approach rests on the idea that, even if you can't push translation models to be objectively correct in a paired sense, you can push translation models to be symmetric with one another, insofar as translating from language A to B (let's say English to French), and then back from French to English, gets you something in English that looks like your original input. This forces the model to maintain an informative mapping, so that enough information about the English sentence is stored to allow it to be reconstructed. However, unconstrained, the model could just develop a 1:1 word mapping that gives you information about the English input, but doesn't actually map to the translation in French. If you can additionally confirm that the translation into French looks like a coherent French sentence (which, recall, we can do with a language model trained on French independently), we can get closer to generating a mapping that is hopefully more coherent.

One interesting aspect of this paper is the fact that the model they describe is trained with reinforcement learning. Typically, reinforcement learning is used for scenarios where you don't have direct visibility into how the actions you take impact your loss function. Compare this to a supervised network (where you can take the derivative of your loss with respect to the last layer, and backpropagate that back through to your inputs), or even a GAN, where you can take the derivative of the discriminator-created loss back through the input to the discriminator, and through into the generator that created it. This model treats the translation models that are learned as policies; that is, probability distributions over sets of words. It samples multiple A -> B translations using beam search, which, rather than committing to a single word at each timestep, keeps several candidate continuations alive and keeps sampling along each of them. This helps the sequential translation avoid the trap where it samples one highly probable word that, as more words are added, doesn't lead towards a good sentence.
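For readers who haven't seen beam search before, here's a minimal sketch of the decoding idea referred to above, against a toy stand-in for a translation model's next-token distribution. The function names and the deterministic top-k pruning are illustrative, not the paper's exact sampling scheme.

```python
# Hedged sketch of beam search over a toy next-token distribution.
# `next_token_logprobs` is a stand-in for a real translation model's
# conditional distribution over the next word.
import numpy as np

def next_token_logprobs(prefix, vocab_size=5):
    """Toy stand-in: a deterministic pseudo-random distribution over the next token."""
    rng = np.random.default_rng(hash(tuple(prefix)) % (2**32))
    logits = rng.normal(size=vocab_size)
    return logits - np.log(np.exp(logits).sum())   # log-softmax

def beam_search(beam_width=3, max_len=6, vocab_size=5):
    beams = [([], 0.0)]                             # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            logps = next_token_logprobs(seq, vocab_size)
            for tok in range(vocab_size):
                candidates.append((seq + [tok], score + logps[tok]))
        # Keep only the `beam_width` highest-scoring partial translations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

for seq, score in beam_search():
    print(seq, round(score, 3))
```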
This ultimately results in taking multiple (k=12, in this case) samples from each translation distribution, and so the model uses as its objective the expected reward over these samples, where the reward is constructed as a combination of an in-language coherence term (scored using the log likelihood of a trained single-language model) and a reconstruction term (scored by how close the A -> B -> A' reconstruction is to the original A). My confusion about the use of reinforcement learning here mostly comes from the question of whether it just wasn't possible to build an end-to-end model where, like a GAN, you backpropagated through a constructed input, back to the model that constructed it (in this case, through both translator models). Is the issue just that a sequence of words is fundamentally discrete, in a way that images aren't, and in a way that impedes backprop? That seems possible, but also, I think it's the case that a typical encoder-decoder model, which outputs softmaxes over words, can be backpropagated through. Overall, it's hard for me to tell if I'm missing something basic about why reinforcement learning is the obvious choice here, or if other more GAN-like approaches were an option, but that meme hadn't spread into the literature yet, and RL was the more historically canonical choice.

One other minor disappointing note: it looks like their results are based on a scenario that does use a small number of bilingual training pairs, as a way to pretrain the translation models to a reasonable, non-random starting point. It's not clear whether this method would have worked with an actual cold start, i.e. a translation model that has no idea what it's doing and is using only this as signal. That said, they used a much smaller number of bilingual pairs than a true supervised method would, and so even with a need for a warm start, a method like this could still give you leverage on language pairs where there exists some paired data, but not enough to build a full, sophisticated model on top of.
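As a gloss on how that combined objective might be computed, here's a hedged toy sketch: for each of the k sampled translations, mix a language-model coherence term with a reconstruction term and average. The function bodies, the mixing coefficient alpha, and all names are placeholders standing in for the paper's actual models and hyperparameters.

```python
# Hedged sketch of the combined reward described above: coherence of the sampled
# B-language sentence plus how well the original A sentence can be reconstructed,
# averaged over the k samples. All functions are toy placeholders.
import numpy as np

def lm_logprob(sentence_b):
    """Stand-in for log-likelihood under a monolingual language model of language B."""
    return -0.5 * len(sentence_b)                       # toy: shorter = "more fluent"

def translate_back_logprob(sentence_b, original_a):
    """Stand-in for log P(original A | sampled B) under the B -> A translator."""
    return -abs(len(sentence_b) - len(original_a))      # toy reconstruction score

def dual_learning_reward(original_a, sampled_bs, alpha=0.5):
    rewards = []
    for b in sampled_bs:
        coherence = lm_logprob(b)                                # is the B sentence plausible B?
        reconstruction = translate_back_logprob(b, original_a)   # can we get A back?
        rewards.append(alpha * coherence + (1 - alpha) * reconstruction)
    return np.mean(rewards)                              # expected reward over the k samples

samples = [["le", "chat", "dort"], ["chat", "dormir"]]   # e.g. k beam-searched translations
print(dual_learning_reward(["the", "cat", "sleeps"], samples))
```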
[link]
This paper builds on "Learned in Translation: Contextualized Word Vectors", which learned contextualized word representations by using the sequence of encodings generated by a bidirectional LSTM as the representation of a sequence of input words. This paper asks: "if we're learning a deep LSTM, i.e. one with more than one layer, why should we use only the last layer it produces as the representation of the word?" It instead suggests that it could be valuable for transfer learning if each task can learn the weighting of layer encodings that works best for that task. In a prime example of "your model is a special case of my model," they note that this framework can easily recover the approach of only using the final encoding layer, by giving that layer the only non-zero weight. As intuition for why this might be a valuable thing to do: different layers tend to capture different levels of meaning, with lower layers more likely to capture part-of-speech information, and higher layers more likely to capture richer semantic context. https://i.imgur.com/s8Qn6YY.png

One difficulty in comparing this paper directly to the "take the top-layer encoding from an LSTM" paper is that they were trained on different problems: the top-layer paper learned using a machine translation objective, whereas this one learns using a much simpler language model. Here, a simple language model means an RNN that is trained to predict the next word, given the hidden state built up over all prior words. Because we want to pull in word context from both directions, this isn't just an LSTM but a bidirectional LSTM, whose backward direction - surprise surprise - tries to predict the word *before* a position by using all of the words that come after it. This has the advantage of not requiring parallel data, the way machine translation does, but also makes it difficult to make direct comparisons to prior work that isolate the effect of multi-layer combination, as separate from the switch between machine translation and direct language modeling. Although this is likely also a benefit you see with just top-layer contextual vectors, it is interesting to examine the attached table and look at how effectively the model learns different representations of the word "play" depending on the context in which it appears; in each case, the nearest neighbor of a given context of "play" is a sentence in which the word is used in the same sense.
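To make the layer-weighting idea concrete, here's a minimal sketch of the kind of combination described above: a softmax over learned per-layer scalars, plus a global scale, mixes each word's per-layer vectors. Shapes, names, and the random placeholder representations are mine, not the paper's implementation.

```python
# Hedged sketch of per-task layer weighting: softmax-normalized scalars mix the
# per-layer representations of each word. Illustrative shapes and names only.
import numpy as np

def combine_layers(layer_reps, layer_scores, gamma=1.0):
    """
    layer_reps: (n_layers, seq_len, dim) -- one vector per word per biLSTM layer.
    layer_scores: (n_layers,) -- task-specific scalars, learned with the downstream task.
    """
    s = np.exp(layer_scores - layer_scores.max())
    s = s / s.sum()                                   # softmax-normalized layer weights
    # Weighted sum across layers; a task that only wants the top layer can learn
    # scores that put (nearly) all the mass on the last entry.
    return gamma * np.tensordot(s, layer_reps, axes=1)   # (seq_len, dim)

n_layers, seq_len, dim = 3, 5, 8
reps = np.random.default_rng(0).normal(size=(n_layers, seq_len, dim))
scores = np.array([0.1, 0.2, 2.0])                    # this task leans on the top layer
print(combine_layers(reps, scores).shape)             # (5, 8)
```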