Welcome to ShortScience.org!
[link]
This paper explores the problem of question answering based on natural text. While this has been explored recently in the context of Memory Networks, the problems tackled so far have been synthetically generated. In this paper, the authors propose to extract more realistic question answering examples from news sites, by treating the main body of a news article as the content (the "facts") and extracting questions from the article's bullet point summaries. Specifically, by detecting the entities in these bullet points and replacing them with a question placeholder (e.g. "Producer X will not press charges"), they are able to generate queries which, while not grammatically questions, do require a form of question answering (a small sketch of this construction is given after the summary). Thanks to this procedure, two large *supervised* datasets are created, with several thousand questions each, based on the CNN and Daily Mail news sites.

Then, the authors investigate neural network based systems for solving this task. They consider a fairly simple Deep LSTM network, which is first fed the article's content and then the query. They also consider two architectures that incorporate an attentional mechanism based on softmax weighting. The first ("Attentive Reader") attends once over the document (i.e. uses a single softmax weight vector) while the second ("Impatient Reader") attends after every word in the query (akin to the soft attention architecture in the "Show, Attend and Tell" paper). These neural network architectures are also compared with simpler baselines, which are closer to what a more "classical" statistical NLP solution might look like. Results on both datasets demonstrate that the neural network approaches have superior performance, with the attentional models being significantly better than the simpler Deep LSTM model.

#### My two cents

This is a welcome development in the research on reasoning models based on neural networks. I've always thought it was unfortunate that the best benchmark available is based on synthetically generated cases. This work fixes this problem in a really clever way, while still being able to generate a large amount of training data.

Particularly clever is the random permutation of entity markers when processing each case. Thanks to that, a system cannot simply use general statistics on words to answer questions (e.g. just from the query "The hi-tech bra that helps you beat breast X" it's obvious that "cancer" is an excellent answer). In this setup, the system is forced to exploit the content of the article, thus ensuring that the benchmark is indeed measuring the system's question-answering abilities. Since the dataset itself is an important contribution of this paper, I hope the authors release it publicly in the near future.

The evaluation of the different neural architectures is also really thoroughly done. The non-neural baselines are reasonable, and the comparison between the neural nets is itself interesting, bringing more evidence that the softmax-weighted attentional mechanism (which has been gaining in popularity) indeed brings something over a regular LSTM approach.
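To make the construction above concrete, here is a minimal, hypothetical sketch of the cloze-style query generation and entity anonymization; the entity list, the "X" placeholder, and the `@entityN` marker format are assumptions for illustration, not the paper's exact pipeline.

```python
import random

def make_cloze_query(bullet_point, entities, rng=random):
    """Turn a summary bullet point into a (query, answer) pair by replacing
    one detected entity with a placeholder token."""
    present = [e for e in entities if e in bullet_point]
    if not present:
        return None
    answer = rng.choice(present)
    return bullet_point.replace(answer, "X"), answer

def entity_mapping(entities, rng=random):
    """Build a per-example, randomly permuted entity -> marker mapping
    (@entity0, @entity1, ...), so answers cannot be guessed from word
    statistics alone."""
    markers = [f"@entity{i}" for i in range(len(entities))]
    rng.shuffle(markers)
    return dict(zip(entities, markers))

def anonymize(text, mapping):
    """Apply the same per-example mapping to the article, query and answer."""
    for name, marker in mapping.items():
        text = text.replace(name, marker)
    return text

# Hypothetical example (not taken from the actual dataset):
bullet = "The hi-tech bra that helps you beat breast cancer"
entities = ["cancer"]  # entities found by an NER/coreference system
query, answer = make_cloze_query(bullet, entities)
mapping = entity_mapping(entities)
print(anonymize(query, mapping), "->", anonymize(answer, mapping))
```

In the real dataset the article body, the query, and the answer are all passed through the same per-example entity permutation, which is what forces systems to read the article rather than rely on word co-occurrence statistics.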
[link]
The authors introduce their contribution as an alternative way to approximate the KL divergence between prior and variational posterior used in [Variational Dropout and the Local Reparameterization Trick][kingma], one which allows unbounded variance on the multiplicative noise. When the noise variance parameter associated with a weight tends to infinity, you can say that the weight is effectively being removed, and in their implementation this is what they do.

There are some important details differing from the [original algorithm][kingma] on per-weight variational dropout. For both methods we have the following initialization for each dense layer:

```
theta = initialize weight matrix with shape (number of input units, number of hidden units)
log_alpha = initialize zero matrix with shape (number of input units, number of hidden units)
b = biases initialized to zero with length the number of hidden units
```

Where `log_alpha` is going to parameterise the variational posterior variance. In the original paper the algorithm was the following:

```
mean = dot(input, theta) + b  # standard dense layer
# marginal variance over activations (eq. 10 in the original paper)
variance = dot(input^2, theta^2 * exp(log_alpha))
# sample from the marginal distribution by scaling normal noise
activations = mean + sqrt(variance) * unit_normal(number of output units)
```

The final step is a standard [reparameterization trick][shakir], but since it is applied to the marginal distribution over activations it is referred to as a local reparameterization trick (directly inspired by the [fast dropout paper][fast]).

The sparsifying algorithm starts with an alternative parameterisation for `log_alpha`:

```
log_sigma2 = matrix filled with a negative constant (default -8) with shape (number of input units, number of hidden units)
log_alpha = log_sigma2 - log(theta^2)
log_alpha = log_alpha clipped between -8 and 8
```

The authors discuss this in section 4.1: the $\sigma_{ij}^2$ term corresponds to an additive noise variance on each weight, with $\sigma_{ij}^2 = \alpha_{ij}\theta_{ij}^2$. Since this can then be reversed to define `log_alpha`, the forward pass remains unchanged, but the variance of the gradient is reduced. It is quite a counter-intuitive trick, so much so I can't quite believe it works.

They then define a mask removing the contribution of weights whose noise variance has gone too high:

```
clip_mask = matrix with the shape of log_alpha, equal to 1 where log_alpha is greater than thresh (default 3)
```

The clip mask is used to set elements of `theta` to zero, and then the forward pass is exactly the same as in the original paper.

The difference in the approximation to the KL divergence is illustrated in figure 1 of the paper; the sparsifying version tends to zero as the variance increases, which matches the true KL divergence. In the [original paper][kingma] the KL divergence would explode, forcing them to clip the variances at a certain point.

[kingma]: https://arxiv.org/abs/1506.02557
[shakir]: http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/
[fast]: http://proceedings.mlr.press/v28/wang13a.html
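To make the pseudocode above concrete, here is a minimal numpy sketch of the sparsified forward pass under the `log_sigma2` parameterisation; the shapes, the -8 initialization, and the threshold of 3 follow the pseudocode, while the 1e-8 stabilisers and everything else are assumptions rather than the authors' reference implementation.

```python
import numpy as np

def sparse_vd_dense(x, theta, log_sigma2, b, thresh=3.0, rng=None):
    """Forward pass of a dense layer with sparse variational dropout,
    using the local reparameterization trick sketched above."""
    if rng is None:
        rng = np.random.default_rng()
    # Recover log_alpha from the additive-noise parameterisation and clip it.
    log_alpha = np.clip(log_sigma2 - np.log(theta ** 2 + 1e-8), -8.0, 8.0)
    # Prune weights whose multiplicative noise variance has grown too large.
    theta = theta * (log_alpha <= thresh)
    # Sample activations from their marginal distribution (local reparameterization).
    mean = x @ theta + b
    var = (x ** 2) @ (theta ** 2 * np.exp(log_alpha))
    return mean + np.sqrt(var + 1e-8) * rng.standard_normal(mean.shape)

# Hypothetical shapes, matching the pseudocode above:
x = np.random.randn(32, 64)               # batch of 32 examples, 64 input units
theta = 0.1 * np.random.randn(64, 128)    # 128 hidden units
log_sigma2 = np.full((64, 128), -8.0)     # the default negative-constant initialization
b = np.zeros(128)
h = sparse_vd_dense(x, theta, log_sigma2, b)
```

At test time one would typically drop the noise term and keep only the pruned mean weights, which is where the sparsity pays off.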
[link]
If you were to survey researchers, and ask them to name the 5 most broadly influential ideas in Machine Learning from the last 5 years, I’d bet good money that Batch Normalization would be somewhere on everyone’s lists. Before Batch Norm, training meaningfully deep neural networks was an unstable process, and one that often took a long time to converge to success. When we added Batch Norm to models, it allowed us to increase our learning rates substantially (leading to quicker training) without the risk of activations either collapsing or blowing up in value. It had this effect because it addressed one of the key difficulties of deep networks: internal covariate shift. To understand this, imagine the smaller problem of a one-layer model that’s trying to classify based on a set of input features. Now, imagine that, over the course of training, the input distribution of features moved around, so that, perhaps, a value that was at the 70th percentile of the data distribution initially is now at the 30th. We have an obvious intuition that this would make the model quite hard to train, because it would learn some mapping between feature values and class at the beginning of training, but that would become invalid by the end. This is, fundamentally, the problem faced by higher layers of deep networks, since, if the distribution of activations in a lower layer changes even by a small amount, that can cause a “butterfly effect” style outcome, where the activation distributions of higher layers change more dramatically.

Batch Normalization - which takes each feature “channel” a network learns, and normalizes it [normalize = subtract the mean, divide by the standard deviation] using the mean and variance of that feature over spatial locations and over all the observations in a given batch - helps solve this problem because it ensures that, throughout the course of training, the distribution of inputs that a given layer sees stays roughly constant, no matter what the lower layers get up to. On the whole, Batch Norm has been wildly successful at stabilizing training, and is now canonized - along with the likes of ReLU and Dropout - as one of the default sensible training procedures for any given network. However, it does have its difficulties and downsides. One salient one of these comes about when you train using very small batch sizes - in the range of 2-16 examples per batch. Under these circumstances, the mean and variance calculated from that batch are noisy and high variance (for the general reason that statistics calculated from small sample sizes are noisy and high variance), which takes away from the stability that Batch Norm is trying to provide.

One proposed alternative to Batch Norm, which doesn’t run into this problem of small sample sizes, is Layer Normalization. This operates under the assumption that the activations of all feature “channels” within a given layer have roughly similar distributions, and so you can normalize all of them by taking the aggregate mean over all channels, *for a given observation*, and use that as the mean and variance you normalize by. Because there are typically many channels in a given layer, this means that you have many “samples” that go into the mean and variance. However, this assumption - that the distributions for each feature channel are roughly the same - can be an incorrect one.
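A small numpy sketch may make the difference in averaging axes concrete: for an `(N, C, H, W)` activation tensor, Batch Norm averages over the batch and spatial axes per channel, while Layer Norm averages over channels and spatial positions per example. The epsilon and the omission of the learned scale/shift parameters are simplifications of my own.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """One mean/variance per channel, computed over batch and spatial axes."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    """One mean/variance per example, computed over channel and spatial axes."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(4, 32, 8, 8)  # a small batch: exactly where BN's statistics get noisy
bn_out, ln_out = batch_norm(x), layer_norm(x)
```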
A useful model I have for thinking about the distinction between these two approaches is the idea that both are calculating approximations of an underlying abstract notion: the in-the-limit mean and variance of a single feature channel, at a given point in time. Batch Normalization is an approximation of that insofar as it only has a small sample of points to work with, and so its estimate will tend to be high variance. Layer Normalization is an approximation insofar as it makes the assumption that feature distributions are aligned across channels: if this turns out not to be the case, individual channels will have normalizations that are biased, due to being pulled towards the mean and variance calculated over an aggregate of channels that are different from them.

Group Norm tries to find a balance point between these two approaches: one that uses multiple channels, and normalizes within a given instance (to avoid the problems of small batch size), but, instead of calculating the mean and variance over all channels, calculates them over a group of channels that represents a subset. The inspiration for this idea comes from the fact that, in old-school computer vision, it was typical to have parts of your feature vector that - for example - represented a histogram of some value (say: localized contrast) over the image, and these multiple values all corresponded to a larger shared “group” feature. If a group of features all represent a similar idea, then their distributions will be more likely to be aligned, and therefore you have less of the bias issue.

One confusing element of this paper for me was that the motivation part of the paper strongly implied that the reason group norm is sensible is that you are able to combine statistically dependent channels into a group together. However, as far as I can tell, there’s no actual clustering or similarity analysis of channels done to place certain channels into certain groups; it’s just done semi-randomly based on the index location within the feature channel vector. So, under this implementation, it seems like the benefits of group norm are less because of any explicit seeking out of dependent channels, and more that just having fewer channels in each group means that each individual channel makes up more of the weight in its group, which does something to reduce the bias effect anyway.

The upshot of the Group Norm paper, results-wise, is that Group Norm performs better than both Batch Norm and Layer Norm at very low batch sizes. This is useful if you’re training on very dense data (e.g. high-res video), where it might be difficult to store more than a few observations in memory at a time. However, once you get to batch sizes of ~24, Batch Norm starts to do better, presumably since that’s a large enough sample size to reduce variance, and you get to the point where the variance of BN is preferable to the bias of GN.
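Following the same sketch as above, Group Norm only changes which axes the statistics are taken over: channels are split into index-based groups (the group count here is an assumed parameter, not a recommendation), and each (example, group) block is normalized on its own.

```python
import numpy as np

def group_norm(x, num_groups=4, eps=1e-5):
    """One mean/variance per (example, group-of-channels) block; channels are
    grouped by index, not by any similarity analysis."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

x = np.random.randn(4, 32, 8, 8)
gn_out = group_norm(x)  # num_groups=1 recovers the layer norm computation above
```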
[link]
This paper continues in the tradition of curiosity-based models, which try to reward models for exploring novel parts of their environment, in the hope that this can intrinsically motivate learning. However, this paper argues that it’s insufficient to just treat novelty as an occasional bonus on top of a normal reward function, and that instead you should figure out a process that’s more specifically designed to increase novelty. Specifically: you should design a policy whose goal is to experience transitions and world-states that are high novelty.

In this setup, like in other curiosity-based papers, “high novelty” is defined in terms of a state being unpredictable given a prior state, history, and action. However, where other papers saw novelty reward as something only applied when the agent arrived at somewhere novel, here the authors build a model (technically, an ensemble of models) to predict the state at various future points. The ensemble is important here because it’s (quasi) bootstrapped, and thus gives us a measure of uncertainty. States where the predictions of the ensemble diverge represent places of uncertainty, and thus of high value to explore.

I don’t 100% follow the analytic specification of this idea (even though the heuristic/algorithmic description makes sense). The authors frame the utility function of a state and action as being equivalent to the Jensen-Shannon Divergence (~distance between probability distributions) shown below.

https://i.imgur.com/YIuomuP.png

Here, P(S' | S, a, T) is the probability of the next state given the prior state and action under a given model of the environment (transition model), and P(gamma) is the distribution over the space of possible transition models one might learn. A “model” here is one network out of the ensemble of networks that makes up our bootstrapped (trained on different sets) distribution over models. Conceptually, I think this calculation is measuring “how different is each sampled model/state distribution from all the other models in the distribution over possible models”. If the models within the distribution diverge from one another, that indicates a location of higher uncertainty (a rough sketch of this disagreement score is given after this summary).

What’s important about this is that, by building a full transition model, the authors can calculate the expected novelty or “utility” of future transitions the agent might take, because it can make a best guess based on this transition model (which, while called a “prior”, is really something trained on all data up to the current iteration). My understanding is that these kinds of models function similarly to a Q(s,a) or V(s) in a pure-reward case: they estimate the “utility reward” of different states and actions, and then the policy is updated to increase that expected reward.

I’ve recently read papers on ICM, and I was a little disappointed that this paper didn’t appear to benchmark against that, but against Bootstrapped DQN and Exploration Bonus DQN, which I know less well and can less speak to the conceptual differences from this approach. Another difficulty in actually getting a good sense of the results was that the task being tested on is fairly specific, and different from RL results coming out of the world of, e.g., Atari and DeepMind Lab. All of that said, this is a cautiously interesting idea, if the results generalize to beat more baselines on more environments.
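As a rough illustration of the disagreement signal described above (not the paper's exact utility function), the sketch below scores a state-action pair by the Jensen-Shannon divergence between the next-state distributions predicted by a bootstrapped ensemble; the discrete state space and the `predict_next_state_dist` interface are assumptions made for clarity.

```python
import numpy as np

def jensen_shannon(dists):
    """JSD over a set of discrete distributions: entropy of the mixture minus
    the average entropy of the members (zero iff all members agree)."""
    dists = np.asarray(dists)                        # shape: (ensemble_size, num_states)
    entropy = lambda p: -np.sum(p * np.log(p + 1e-12), axis=-1)
    mixture = dists.mean(axis=0)
    return entropy(mixture) - entropy(dists).mean()

def exploration_utility(ensemble, state, action):
    """Disagreement between ensemble members' predicted next-state distributions;
    high values mark transitions that are worth seeking out."""
    preds = [model.predict_next_state_dist(state, action) for model in ensemble]
    return jensen_shannon(preds)
```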
[link]
* GANs are based on adversarial training.
* Adversarial training is a basic technique to train generative models (so here primarily models that create new images).
* In adversarial training, one model (G, Generator) generates things (e.g. images). Another model (D, Discriminator) sees real things (e.g. real images) as well as fake things (e.g. images from G) and has to learn how to differentiate the two.
* Neural networks are models that can be trained in an adversarial way (and are the only models discussed here).

### How

* G is a simple neural net (e.g. just one fully connected hidden layer). It takes a vector as input (e.g. 100 dimensions) and produces an image as output.
* D is a simple neural net (e.g. just one fully connected hidden layer). It takes an image as input and produces a quality rating as output (0-1, so sigmoid).
* You need a training set of things to be generated, e.g. images of human faces.
* Let the batch size be B.
* G is trained the following way:
  * Create B vectors of 100 random values each, e.g. sampled uniformly from [-1, +1]. (The number of values per vector depends on the chosen input size of G.)
  * Feed forward the vectors through G to create new images.
  * Feed forward the images through D to create ratings.
  * Use a cross entropy loss on these ratings. All of these (fake) images should be viewed as label=0 by D; if D gives them label=1, the error for G will be low (G did a good job).
  * Perform a backward pass of the errors through D (without training D). That generates gradients/errors per image and pixel.
  * Perform a backward pass of these errors through G to train G.
* D is trained the following way:
  * Create B/2 images using G (again, B/2 random vectors, feed forward through G). These fake images get label=0.
  * Choose B/2 images from the training set. These real images get label=1.
  * Merge the fake and real images into one batch.
  * Feed forward the batch through D.
  * Measure the error using cross entropy.
  * Perform a backward pass with the error through D.
* Train G for one batch, then D for one (or more) batches. Sometimes D can be too slow to catch up with G; then you need more iterations of D per batch of G. (A minimal training-loop sketch is given at the end of these notes.)

### Results

* Good-looking images of MNIST digits and human faces. (Grayscale, rather homogeneous datasets.)
* Not so good-looking images for CIFAR-10. (Color, rather heterogeneous dataset.)

*Faces generated by MLP GANs. (Rightmost column shows examples from the training set.)*

-------------------------

### Rough chapter-wise notes

* Introduction
  * Discriminative models have performed well so far, generative models not so much.
  * Their suggested new architecture involves a generator and a discriminator.
  * The generator learns to create content (e.g. images), the discriminator learns to differentiate between real content and generated content.
  * Analogy: the generator produces counterfeit art, the discriminator's job is to judge whether a piece of art is counterfeit.
  * This principle could be used with many techniques, but they use neural nets (MLPs) for both the generator and the discriminator.
* Adversarial Nets
  * They have a Generator G (a simple neural net).
    * G takes a random vector as input (e.g. a vector of 100 random values between -1 and +1).
    * G creates an image as output.
  * They have a Discriminator D (a simple neural net).
    * D takes an image as input (can be real or generated by G).
    * D creates a rating as output (quality, i.e. a value between 0 and 1, where 0 means "probably fake").
  * Outputs from G are fed into D.
  * The result can then be backpropagated through D and then G. G is trained to maximize log(D(image)), i.e. to create a high value of D(image).
  * D is trained to produce 0s for images from G and 1s for images from the training set.
  * Both are trained simultaneously, i.e. one batch for G, then one batch for D, then one batch for G...
  * D can also be trained multiple times in a row. That allows it to catch up with G.
* Theoretical Results
  * Let
    * pd(x): probability that image `x` appears in the training set.
    * pg(x): probability that image `x` appears in the images generated by G.
  * If G is fixed, then the best possible D classifies according to: `D(x) = pd(x) / (pd(x) + pg(x))`
  * It is provable that there is only one global optimum for GANs, which is reached when G perfectly replicates the training set's probability distribution. (Assuming unlimited capacity of the models and unlimited training time.)
  * It is provable that G and D will converge to the global optimum, so long as D gets enough steps per training iteration to model the distribution generated by G. (Again, assuming unlimited capacity/time.)
  * Note that these things are proven for the general GAN principle. Implementing GANs with neural nets can then introduce problems typical of neural nets (e.g. getting stuck in saddle points).
* Experiments
  * They tested on MNIST, the Toronto Face Database (TFD) and CIFAR-10.
  * They used MLPs for G and D.
    * G contained ReLUs and sigmoids.
    * D contained maxouts.
    * D had dropout, G didn't.
  * They use a Parzen window estimate aka KDE (sigma obtained via cross validation) to estimate the quality of their images.
  * They note that KDE is not really a great technique for such high-dimensional spaces, but it's the only one known.
  * Results on MNIST and TFD are great. (Note: both grayscale.)
  * CIFAR-10 seems to match more the texture but not really the structure.
  * Noise is noticeable in CIFAR-10 (a bit in TFD too). It comes from the MLPs (no convolutions).
  * Their KDE score for MNIST and TFD is competitive with or better than other approaches.
* Advantages and Disadvantages
  * Advantages
    * No Markov chains, only backprop.
    * Inference-free training.
    * A wide variety of functions can be incorporated into the model (?)
    * The generator never sees any real example. It only gets gradients. (Prevents overfitting?)
    * Can represent a wide variety of distributions, including sharp ones (Markov chains only work with blurry images).
  * Disadvantages
    * No explicit representation of the distribution modeled by G (?)
    * D and G must be well synchronized during training.
    * If G is trained too much (i.e. D can't catch up), it can collapse many components of the random input vectors to the same output ("Helvetica").
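To tie the training procedure from the "How" section together, here is a minimal PyTorch sketch of the alternating updates on flattened images; the layer sizes, optimizer, learning rate, and the `real_images` batch are assumptions for illustration rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

Z_DIM, IMG_DIM, B = 100, 28 * 28, 64  # assumed noise size, image size, batch size

G = nn.Sequential(nn.Linear(Z_DIM, 256), nn.ReLU(), nn.Linear(256, IMG_DIM), nn.Sigmoid())
D = nn.Sequential(nn.Linear(IMG_DIM, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.SGD(G.parameters(), lr=0.01)
opt_d = torch.optim.SGD(D.parameters(), lr=0.01)
bce = nn.BCELoss()

def train_step(real_images, d_steps=1):
    """One alternating update: d_steps batches for D, then one batch for G."""
    # D: real images get label 1, fake images (detached from G) get label 0.
    for _ in range(d_steps):
        fake = G(torch.rand(B // 2, Z_DIM) * 2 - 1).detach()   # uniform noise in [-1, 1]
        inputs = torch.cat([real_images[: B // 2], fake])
        labels = torch.cat([torch.ones(B // 2, 1), torch.zeros(B // 2, 1)])
        loss_d = bce(D(inputs), labels)
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

    # G: backprop through D without updating it; G "wins" when D rates fakes as real.
    fake = G(torch.rand(B, Z_DIM) * 2 - 1)
    loss_g = bce(D(fake), torch.ones(B, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```

Using label 1 as G's target implements the "maximize log(D(image))" objective from the notes, rather than minimizing log(1 - D(image)), which the paper notes tends to saturate early in training.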