[link]
Variational Autoencoders are a type of generative model that learn to generate new data by incentivizing the model to reconstruct its input after compressing it to a low-dimensional space. Typically, the reconstruction is scored against the original by comparing pixel values: a reconstruction gets a high score if it places pixels of the right color in the same positions as the original. However, there are compelling reasons why this is a sub-par way of scoring images. The central one is that it focuses on, and penalizes, superficial differences: if the model accurately reproduces the focal object of the image but does so, say, 10 pixels to the right of where it was previously, that incurs a penalty we might not actually want to apply. The flip side is that a direct pixel-comparison loss doesn't differentiate between pixel differences that do or don't change the fundamental substance of the image. For instance, having 100 pixels wrong around the border of a dog, making it seem very slightly larger, would count as the same amount of error as having 100 wrong pixels concentrated in a weird bulb growing out of the dog's ear, even though the former does a better job of remaining recognizable as a dog.

The authors of the VAE/GAN paper have a clever approach to solving this problem: take the typical pixel loss and break it into two conceptual parts.

The first part focuses on aligning the conceptual features of the reconstructed image with those of the input image. It does so by running both the input and the reconstruction through a discriminative convolutional model which - in the typical way of deep learning - learns ever more abstract features at each layer of the network. These "conceptual features" abstract away the precise pixel values and instead capture the higher-level content of the image. So, instead of calculating the pixelwise squared loss between the specific input x and its after-bottleneck reconstruction x~, you take the squared loss between the feature maps at some layer for both x and x~, and push them to be closer together, so that the reconstruction shares the same features as the original.

The second part focuses on detail-level specifics of images, but, cleverly, does so in a general rather than an observation-specific way. This is done by training a GAN-style discriminator to tell the difference between generated images and original images, and then using that loss to train the decoder part of the VAE. The cleverness here is that the model still enforces that the details and structural features of the reconstructed image are indistinguishable from those of real images, but it does so in a general sense, rather than requiring the details to exactly match the details found in a given input x.

https://i.imgur.com/Bmtmac2.png
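To make the feature-level reconstruction idea concrete, here is a minimal PyTorch sketch. The architecture, layer sizes, and the choice of which layer's feature maps to match are my own illustration, not the paper's exact setup; the point is only that the reconstruction loss is computed between intermediate discriminator activations rather than between raw pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical discriminator: a small conv net whose intermediate
# activations serve as the "conceptual features" of an image.
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Linear(128 * 8 * 8, 1)  # real-vs-fake score

    def forward(self, x):
        h = self.features(x)                     # feature maps at the chosen layer
        return self.classifier(h.flatten(1)), h  # (real/fake logit, features)

def feature_reconstruction_loss(disc, x, x_recon):
    # Squared error between the feature maps of the input and of its
    # reconstruction, rather than between their raw pixels.
    _, feat_real = disc(x)
    _, feat_recon = disc(x_recon)
    return F.mse_loss(feat_recon, feat_real)

disc = Discriminator()
x = torch.randn(8, 3, 64, 64)        # a batch of (random stand-in) 64x64 RGB images
x_recon = torch.randn(8, 3, 64, 64)  # pretend these came from the VAE decoder
loss = feature_reconstruction_loss(disc, x, x_recon)
```

In the full model this term would be combined with the usual VAE KL term and with the discriminator's real/fake loss on the decoder's outputs.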
The authors freely admit that existing metrics for scoring images (which themselves *use* pixelwise similarity) rate their method as worse than existing VAEs. However, they argue, that is inherently a flawed metric, one that doesn't capture the kind of clean visual quality we want in a generated image. The metric they propose instead uses a dataset where a list of attributes is attached to each image (old, black, blond, etc).

They add these attributes as additional input while training the network, so that whatever signals the decoder part of the model needs to turn someone blond, it gets them from the externally given attribute vector rather than from a learned representation. This means that, once the model is trained, we can set some value of the attribute vector and have the decoder generate samples conditioned on it. The metric is constructed by taking decoded samples conditioned on some attribute set, and then taking a classifier model, trained on the real images, that predicts attribute values from images. The generated images are scored by how closely the classifier's predictions match the true values of the attributes. If the generator were working perfectly, this error rate would be as low as it is for real data. By this metric (which: grain of salt, since they invented it), the VAE/GAN model is superior to both GANs and vanilla VAEs.
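A rough sketch of that evaluation procedure, with toy stand-in models. The decoder, classifier, attribute count, and image size here are all placeholders of mine, not the paper's; the only point is the shape of the metric.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: a conditional decoder mapping (z, attributes) -> image,
# and an attribute classifier that would have been trained on real images.
class ToyDecoder(nn.Module):
    def __init__(self, z_dim=128, n_attrs=40):
        super().__init__()
        self.net = nn.Linear(z_dim + n_attrs, 3 * 64 * 64)
    def forward(self, z, attrs):
        return torch.sigmoid(self.net(torch.cat([z, attrs], dim=1))).view(-1, 3, 64, 64)

class ToyAttrClassifier(nn.Module):
    def __init__(self, n_attrs=40):
        super().__init__()
        self.net = nn.Linear(3 * 64 * 64, n_attrs)
    def forward(self, imgs):
        return torch.sigmoid(self.net(imgs.flatten(1)))

decoder, classifier = ToyDecoder(), ToyAttrClassifier()
target_attrs = torch.randint(0, 2, (16, 40)).float()       # desired binary attributes
generated = decoder(torch.randn(16, 128), target_attrs)    # samples conditioned on them
predicted = (classifier(generated) > 0.5).float()          # attributes the classifier recovers
error_rate = (predicted != target_attrs).float().mean()    # compare to the same rate on real images
```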
[link]
There are mathematicians, still today, who look at deep learning and get real salty over the lack of convex optimization. That is to say: convex functions are ones where you have actual guarantees that gradient descent will converge, and mathematicians of olden times (i.e. 2006) spent reams of paper arguing that this or that function had convex properties, and thus could be guaranteed to converge, under this or that set of arcane conditions. And then Deep Learning came along, with its huge, nonlinear, very much nonconvex objective functions that it was nonetheless trying to optimize via gradient descent. From the perspective of an optimization theorist, this had the whiff of heresy, but exceptionally effective heresy. And so the field of DL has half-exploded, half-stumbled along, showcasing a portfolio of very impressive achievements, but with theory very much a secondary priority relative to performance.

Something else that gradient descent isn't supposed to be able to do is learn models that include discrete (i.e. non-continuous) operators. Without continuous gradients, the functions don't have an obvious way to "push" in a certain direction to modulate the loss at the end of the network. Discrete nodes mean that the value just jumps from one state to another, with no intermediate values. This has historically posed a problem for algorithms fueled by gradient descent. The authors of this paper came up with a solution that is 60% cleverness and 40% just guessing that "even if we ignore the theory, things will probably work well enough".

But, first, their overall goal: to create a Variational AutoEncoder where the latent state, the compressed internal representation that is typically an array of continuous values, is instead an array of categorical values. The goal of this was 1) to have a representation type that is a better match for the discrete nature of data types like speech (which has distinct phonemes we might like to discretely capture), and 2) to have a more compressed latent space that would, of necessity, focus on more global information, leaving local pixel-level information to be learned by the expressive PixelCNN decoder.

The way they do this is remarkably simple. First, they learn a typical VAE encoder, mapping from the input pixels to a continuous z space. (An interesting side note here is that this paper uses a spatially organized z: instead of using one single z vector to represent the whole image, they may have 32x32 spatial locations, each of which has its own z vector, to represent a 128x128 image.) Then, for each of the spatial regions, they take the continuous vector produced by the network and compare it to a shared set of "embedding" vectors of the same shape. That spatial location is then lumped into the category of the embedding it's closest to, meaning you end up with a compressed layer of 32x32 (in this case) spatial regions, each represented by a categorical number between 0 and max-num-categories. The network then passes forward the embedding that the input vector was just "snapped" to, and the decoder uses the full spatial grid of embeddings to do its decoding. A minimal sketch of this quantization step is below.

https://i.imgur.com/P8LQRYJ.png
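Here is a minimal sketch of that nearest-neighbor quantization step. The codebook size, embedding dimension, and grid size are illustrative, and in the real model the codebook is itself learned (how, is discussed next); this only shows the "snap each continuous vector to its nearest embedding" operation.

```python
import torch

def quantize(z_e, codebook):
    """Snap each spatial encoder output to its nearest codebook embedding.

    z_e:      (batch, H, W, D) continuous encoder outputs
    codebook: (K, D) embedding vectors, one per discrete category
    """
    flat = z_e.reshape(-1, z_e.shape[-1])            # (batch*H*W, D)
    dists = torch.cdist(flat, codebook)              # distance to every embedding
    codes = dists.argmin(dim=1)                      # categorical index per location
    z_q = codebook[codes].reshape(z_e.shape)         # replace each vector with its embedding
    return z_q, codes.reshape(z_e.shape[:-1])

codebook = torch.randn(512, 64)       # e.g. 512 categories, 64-dim embeddings
z_e = torch.randn(8, 32, 32, 64)      # a 32x32 grid of encoder vectors per image
z_q, codes = quantize(z_e, codebook)  # z_q feeds the decoder; codes are the discrete latents
```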
The clever thing comes when you ask how to train the encoder when there is this discrete "jump" in the middle. The authors choose to just avoid the problem, more or less. They do that by taking the gradient signals that come back from the end of the network to the embedding, and passing them straight through to the continuous vector that was used to look up that embedding by nearest neighbors. Basically, they pretend that they passed the encoder's vector through the rest of the network, rather than the embedding. The embeddings themselves are then trained in a K-Means-Clustering kind of way, iteratively updated to be closer to the points that were assigned to them in each round of training. This is the "Vector Quantization" part of VQ-VAE.

Overall, this seems to perform quite well: the low capacity of the latent space means it is incentivized to handle more global structure, while leaving low-level pixel details to the decoder. It is also much easier to fit an after-the-fact distribution over the latent space: once we've trained a VQ-VAE, we can easily learn a global model that represents the location-by-location dependencies between the categories (i.e. a 1 in this corner means a 5 in that other corner is more probable). This gives us the ability to have an analytically specified distribution, in latent space, that actually represents the structure of how these "concept level categories" relate to each other. By contrast, with most continuous latent spaces, it's intractable to learn an explicit density function after the fact, and thus, if we want to be able to sample, we need to specify and enforce a prior distribution over z ahead of time.
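A common way to implement that "pretend the vector went through" trick in an autodiff framework is the straight-through pattern below. This is a sketch of the general pattern, not the paper's exact code: the forward pass uses the quantized embedding, while the backward pass copies gradients onto the encoder output.

```python
import torch

def straight_through(z_e, z_q):
    """Forward pass evaluates to z_q; backward pass sends gradients to z_e."""
    return z_e + (z_q - z_e).detach()

# z_e: continuous encoder output; z_q: its nearest codebook embedding.
z_e = torch.randn(8, 32, 32, 64, requires_grad=True)
z_q = torch.randn(8, 32, 32, 64)             # pretend this came from the codebook lookup
decoder_input = straight_through(z_e, z_q)   # numerically equal to z_q
decoder_input.sum().backward()               # gradients flow to z_e as if z_e had been used
print(z_e.grad.abs().sum())                  # non-zero: the encoder still gets trained
```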
[link]
I've spent the last few days pretty deep in the weeds of GAN theory - with all its attendant sample-squinting and arcane training diagnosis - and so today I'm shifting gears to an applied paper that mostly showcases some clever modifications of an underlying technique. The goal of the MusicVAE is what you might expect: to make music. But the goal isn't just the ability to produce patterns of notes that sound musical; it's the ability to learn a vector space where we can modify the values along each dimension and cause the music we produce to vary along conceptually meaningful directions. In an ideal world, we might learn a dimension that corresponds to tempo, another that corresponds to the key we're in, etc.

To achieve this goal, the modelers use the structure of a Variational AutoEncoder: a model where we pass in the input, compress it down to some latent code (read: a low-dimensional vector of continuous values), and then, starting from that latent code, use a decoder to try to recreate (or "reconstruct") the original input. Think of this as describing a scene to a friend whose back is turned to it, trying to describe it in a maximally informative way so that they can draw it themselves and get as close as possible to the original. Ideally, this set of constraints incentivizes the model to learn an informative code, one that contains the kind of conceptually meaningful information we want.

One problem this can run into is that, given certain mathematical facts about the structure of autoencoders, if you use a decoder with a lot of capacity, like an RNN, the model can "decide" to use the RNN to model the data directly, storing all the conceptual information we'd like to have pulled out into the latent code in the parameters of the RNN instead. To solve this, the authors of the paper came up with a clever solution: instead of generating the full piece of music at once, they build a hierarchical model, with a "conductor" layer that prescribes what a medium-sized chunk of the reconstructed piece will sound like, and a lower-level "decoder" layer that takes the conductor's direction for that chunk and unspools it into a series of notes.

On a more mechanical level, when the encoder spits out a latent code for a given piece of music, we pass that to the conductor. The conductor then produces - say - 10 embeddings, with each embedding corresponding to a set of 4 measures. Each decoder only sees the embedding for its chunk, and is only responsible for mapping that embedding into a series of concrete notes. This inability of each decoder to see what the decoders before and after it are doing means that, in order for the piece to sound coherent, the network needs to learn to pack a condensed set of instructions into the code it gives the conductor.

https://i.imgur.com/PQKoraX.png

In practice, they come up with some really neat results: the example they show on the linked page demonstrates a learned concept-dimension that maps to "how much is this piece composed of long, held notes, vs short staccato ones". They show that they can "interpolate" across this dimension (that is: slowly change its value) and see the output slowly morph from very long held notes to a dense run of short ones.
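A minimal sketch of that conductor/decoder hierarchy is below. Dimensions, chunk counts, and module names are my own simplification (the real model's per-chunk decoders are autoregressive and initialized from the conductor state, among other details); the sketch only shows the key constraint that each chunk's decoder sees nothing but its own conductor embedding.

```python
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, z_dim=256, embed_dim=128, n_chunks=10, notes_per_chunk=64, vocab=90):
        super().__init__()
        self.n_chunks = n_chunks
        self.notes_per_chunk = notes_per_chunk
        self.conductor = nn.LSTM(z_dim, embed_dim, batch_first=True)     # one embedding per chunk
        self.note_decoder = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.to_notes = nn.Linear(embed_dim, vocab)                      # logits over note events

    def forward(self, z):
        # The conductor reads the latent code once per chunk it must plan.
        z_steps = z.unsqueeze(1).repeat(1, self.n_chunks, 1)             # (B, n_chunks, z_dim)
        chunk_embeds, _ = self.conductor(z_steps)                        # (B, n_chunks, embed_dim)
        notes = []
        for c in range(self.n_chunks):
            # Each chunk's decoder sees only its own conductor embedding.
            inp = chunk_embeds[:, c:c+1, :].repeat(1, self.notes_per_chunk, 1)
            h, _ = self.note_decoder(inp)
            notes.append(self.to_notes(h))                               # (B, notes_per_chunk, vocab)
        return torch.cat(notes, dim=1)                                   # full sequence of note logits

z = torch.randn(4, 256)               # latent codes from the encoder
logits = HierarchicalDecoder()(z)     # shape (4, 10 * 64, 90)
```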
[link]
Despite their difficulties in training, Generative Adversarial Networks are still one of the most exciting recent ideas in machine learning: a way to generate data without the fuzziness and averaging of earlier methods. However, up until recently, there had been a major way in which the GAN's primary competitor in the field, the Variational Autoencoder, was superior: it could do inference.

Intuitively, inference is the inverse of generation. Whereas generation works by taking some source of randomness - a random vector, the setting of some latent code - and transforming that recipe into an observation, an inference process works in reverse, taking in the observation as input and trying to guess what "recipe" was used to generate it. (As a note: in real-world data, it's generally not the case that explicit numerical factors were used to generate the data; this framing is a simplified model meant to represent the way a small set of latent settings of an object jointly cause a lot of that object's feature values.) The authors of this paper proposed the BiGAN to fix that deficiency in the GAN literature.

https://i.imgur.com/vZZzWH5.png

The BiGAN - short for Bidirectional GAN - works by having two generators, not one. One generator works in the typical fashion of a GAN: taking in a random vector z and transforming it into G(z) = x. The second works in reverse, taking in data from the underlying dataset and transforming it into a code, E(x) = z. Once these generators are in place, the discriminator works not by trying to differentiate the x and z values separately, but jointly: it is given a pair (x, z) and asked to decide whether that pair came from the z -> x generator or the x -> z encoder. If this model fully converges, G(z) and E(x) become inverse transformations, giving us a way to take in a new input x and infer its underlying factors z. This is valuable because it's been shown that, in typical GANs, changes in z often correspond to latent factors we care about, and being able to recover z from x is useful for representation learning.

The authors offer quite a nice intuitive argument for why the model learns this inverse mapping. For each pair (x, z), it's either the case that E(x) = z (if the pair came from the encoder) or that G(z) = x (if the pair came from the generator). But if only one of those holds, it's easy for the discriminator to tell which process produced the pair. So, in order to fool the discriminator, G and E need to synchronize their decoding and encoding processes.

The authors also tried a method where, instead of this bidirectional GAN structure, they simply built a network on top of the generated samples that tries to predict the original z used, taking the generated x as input. They show that this performs less well on subjective quality measures of the learned representation, which they attribute to the fact that GANs notoriously only learn some modes of the data, and thus an x -> z encoder that is only ever trained on generated samples will not have good coverage over the full distribution of x.
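To make the joint-pair setup concrete, here is a rough PyTorch sketch of the discriminator objective over (x, z) pairs. The architectures and dimensions are placeholders of mine, not the paper's; only the pairing structure is the point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z_dim, x_dim = 64, 784
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))      # z -> x
E = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))      # x -> z
D = nn.Sequential(nn.Linear(x_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, 1))  # scores (x, z) pairs

x_real = torch.randn(32, x_dim)   # stand-in for a batch of real data
z_noise = torch.randn(32, z_dim)

# Encoder pairs: (real x, inferred z). Generator pairs: (generated x, sampled z).
pair_enc = torch.cat([x_real, E(x_real)], dim=1)
pair_gen = torch.cat([G(z_noise), z_noise], dim=1)

# The discriminator is trained to label which process produced each pair...
d_loss = F.binary_cross_entropy_with_logits(D(pair_enc), torch.ones(32, 1)) + \
         F.binary_cross_entropy_with_logits(D(pair_gen), torch.zeros(32, 1))
# ...while G and E are trained with the labels flipped, to make the two
# kinds of pairs indistinguishable.
```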
[link]
Generative Adversarial Networks (GANs) are an exciting technique, a kernel of an effective concept that has been shown to be able to overcome many of the problems of previous generative models, particularly the fuzziness of VAEs. But, as I've mentioned before, and as you've doubtless read if you've read any material about the topic, they're finicky things, difficult to train in a stable way, and particularly prone to devolving into mode collapse. Mode collapse is a phenomenon where, at each iteration, the generator places all of its mass on one single output, or one dense cluster of outputs, instead of representing the full distribution of output space the way we'd like it to.

One proposed solution to this is the one I discussed yesterday: explicitly optimizing the generator according to not only what the discriminator thinks about its current allocation of probability, but what the discriminator's next move will be (thus incentivizing the generator not to take indefensible strategies like "put all your mass in one location the discriminator can push down next round"). An orthogonal approach is the one described in LSGANs: change the objective function of the network away from sigmoid cross-entropy and to a least squares loss instead. While I don't have the latex capabilities to walk through the exact mathematics in this format, what this means on a conceptual level is that instead of incentivizing the generator to put all of its mass on places the discriminator is sure are a "true data" region, we're incentivizing the generator to put mass right on the true/fake decision boundary. Likely this doesn't make very much sense yet (it didn't for me, at this point in reading).

Occasionally, delving deeper into the math and theory behind an idea provides you rigor, but without much intuition. I found the opposite to be true in this case, where learning more (for the first time!) about f-divergences actually made this method make more sense. So, bear with me, and hopefully trust me not to take you too deep into the weeds without a good reason. On a theoretical level, this paper's loss function means that you end up minimizing a chi-squared divergence between the distributions, instead of a KL divergence. An f-divergence is a quantity that measures how different two distributions are from one another, and it does so by taking an average of f(p/q) - where f is some function of the ratio of the densities p and q - weighted at each point by the density q. (You could also think of this as the expected value of the function f under the density q; they're equivalent statements.) For the KL divergence, this function f is x*logx. For chi-squared, it's (x-1)^2.

All of this starts to coalesce into meaning with the information that the behavior of a typical GAN looks like minimizing the divergence FROM the generator's probability mass TO the data's probability mass. That means we take the ratio of how much mass the generator puts somewhere to how much mass the data has there, and plug it into the x*logx function seen below.

https://i.imgur.com/BYRfi0u.png

Now, look how much the function value spikes when that ratio goes over 1. Intuitively, what this means is that we heavily punish the generator when it puts mass in a place that's unrealistic, i.e. where there isn't representation from the data distribution.
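As a quick numerical illustration of those two f functions (just evaluating the formulas above, nothing taken from the paper):

```python
import math

# r is the density ratio (generator mass / data mass) at some point.
for r in [0.1, 0.5, 1.0, 2.0, 10.0]:
    kl_f = r * math.log(r)   # f for KL: near zero, or even negative, when r < 1
    chi2_f = (r - 1) ** 2    # f for chi-squared: positive on both sides of r = 1
    print(f"r={r:5.1f}  x*log(x)={kl_f:7.2f}  (x-1)^2={chi2_f:7.2f}")
```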
But - and this is the important thing - we don't symmetrically punish it when its mass at a point is far lower than the mass the real data puts there, i.e. when the ratio is much smaller than one. This means we don't have a way of punishing mode collapse, the scenario where the generator puts all of its mass on one of the modes of the data; we don't do a good job of pushing the generator to have mass everywhere the data has mass. By contrast, the Chi Squared divergence pushes the ratio of (generator/data) to be equal to 1 *from both directions*. So, if there's more generator mass than data mass somewhere, that's bad, but it's also bad for there to be more data mass than generator mass. This gives the network a stronger incentive not to learn mode-collapsed solutions.
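For reference, the least-squares objectives themselves are simple to write down. This is a generic sketch of the commonly used LSGAN formulation (target label 1 for real, 0 for fake, with the generator pushed toward 1); the paper also discusses other label codings, so treat the constants here as an assumption rather than a line-by-line reproduction.

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    # Discriminator: push scores on real data toward 1 and on generated data toward 0.
    return 0.5 * ((d_real - 1) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    # Generator: push the discriminator's score on generated data toward 1,
    # with a squared penalty instead of a saturating sigmoid cross-entropy.
    return 0.5 * ((d_fake - 1) ** 2).mean()

d_real = torch.rand(32, 1)   # stand-in discriminator outputs on real data
d_fake = torch.rand(32, 1)   # stand-in discriminator outputs on generated data
print(lsgan_d_loss(d_real, d_fake), lsgan_g_loss(d_fake))
```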