Welcome to ShortScience.org!

  • ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
  • The website has 1583 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
  • Reading summaries of papers is a useful way to obtain another reader's perspective and insight: why they liked or disliked a paper, and how they attempted to demystify its complicated sections.
  • Writing summaries is also a good exercise for understanding the content of a paper, because explaining it forces you to challenge your assumptions.
  • Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.


Summary by CodyWild 3 years ago

I'm a little embarrassed that I'm only just now reading what seems like a fairly important paper from a year and a half ago, but, in my defense, March 2020 was not the best time for keeping up with the literature in a disciplined way.

Anyhow, musings aside: this paper proposes an alternative training procedure for large language models, which the authors claim results in models that reach strong performance more efficiently than previous BERT, XLNet, or RoBERTa baselines. As some background context, the previously-canonical Masked Language Model (MLM) task works by:

  • Replacing some percentage of tokens with a [MASK] indicator
  • Using the final-layer representation at the locations of those [MASK]s to predict the true input token
  • Using the likelihood of that prediction as the training signal, i.e., how high a probability the model places on the true input token.

ELECTRA authors argue that there are a few notable disadvantages to this structure, if your goal is to train useful representations for downstream tasks. Firstly, your loss only consists of information (i.e. the true token) from the tokens you randomly masked, so a good amount of the data is going in some sense unused (except as context). Secondly, learning a full generative model of language requires a lot of data and training time, and it may not be all that beneficial for performance on your downstream tasks of interest.

As an alternative, they propose:

  • Co-learning a (small) generator, trained in typical MLM fashion, alongside a discriminator. Tokens are randomly selected from the input and replaced with fake tokens sampled from the generator's predicted distribution (a sketch of the resulting objective follows this list)
  • The goal of the discriminator is to distinguish the true tokens from the fake ones. (Minor note: if the generator happens to get lucky and generate the real token, that's counted as a "real" rather than "fake" token, even though it was produced by the generator.) This uses more of the training data in the loss, since you can ask "real or fake" for every token in the input, not just the ones that were actually replaced
  • An important note for those familiar with GANs is that the generator isn't trained to confuse the discriminator (as is GAN-standard), but is simply trained with its own maximum likelihood loss, independent of the discriminator's performance.
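To make that concrete, here is a minimal sketch of the two losses, assuming BERT-style encoders with an MLM head on the generator and a per-token binary head on the discriminator; names like MASK_ID and the loss weighting are my own fill-ins, not the authors' code.

```python
# Hedged sketch of ELECTRA-style replaced-token detection (an illustration, not the paper's code).
import torch
import torch.nn.functional as F

MASK_ID = 103  # assumed [MASK] id for a BERT-style vocabulary

def electra_step(generator, discriminator, tokens, mask_prob=0.15, disc_weight=50.0):
    # 1. Randomly pick positions to mask.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    masked_tokens = tokens.masked_fill(mask, MASK_ID)

    # 2. The generator is trained with an ordinary MLM loss on the masked positions only.
    gen_logits = generator(masked_tokens)                       # [batch, seq, vocab]
    gen_loss = F.cross_entropy(gen_logits[mask], tokens[mask])

    # 3. Sample replacement tokens from the generator's distribution; no gradient flows
    #    from the discriminator back into the generator through these samples.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits[mask]).sample()
    corrupted = tokens.clone()
    corrupted[mask] = sampled

    # 4. A "lucky" sample that reproduces the true token is labeled real, not fake.
    is_fake = (corrupted != tokens).float()

    # 5. The discriminator loss is computed over *every* position, not just the masked ones.
    disc_logits = discriminator(corrupted)                      # [batch, seq]
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_fake)

    return gen_loss + disc_weight * disc_loss
```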

They argue, and show fairly convincingly, that ELECTRA is able to reach a higher efficiency-to-performance trade-off curve compared to BERT - matching the performance of previous models with notably less training, and outperforming them with comparable amounts of training.

They go on to perform a few ablations, some of which felt more convincing than others. The most confusing ablation, which I'm not sure if I just misunderstood, was meant to ask how much of the value of ELECTRA came from calculating its loss over all the tokens in the training data, rather than just the masked ones. So, they tried calculating the loss only for the masked/replaced tokens. The resulting discriminator performs very poorly downstream. But I find this a little odd as a design choice, since couldn't the discriminator learn to almost always predict that a replaced token was fake, given that the only way it could be real is if the generator got lucky and produced the true word? They also did the (more sensible, to me) experiment of calculating the loss on a similarly-sized percentage of tokens, but not fully overlapping with the replacement mask, and that performed much more similarly to base ELECTRA.

They also tested training a combined MLM/ELECTRA loss, where generated tokens were used in lieu of masking, and the full-sized MLM generator predicts the true token at every point in the sequence (which could be the token it receives as input, or could not be, in the case of a replacement). That model performed more similarly to ELECTRA than to BERT, which suggests that the efficiency gain of calculating a loss on every element of the training data was more important in practice than the gain from focusing the discriminator directly on what's valuable for downstream tasks, rather than on generation.

Perceiver: General Perception with Iterative Attention
Andrew Jaegle and Felix Gimeno and Andrew Brock and Andrew Zisserman and Oriol Vinyals and Joao Carreira
arXiv e-Print archive - 2021 via Local arXiv
Keywords: cs.CV, cs.AI, cs.LG, cs.SD, eess.AS

Summary by CodyWild 3 years ago

This new architecture out of DeepMind combines information extraction and bottlenecks with a traditional Transformer base to get a model that can theoretically apply self-attention to meaningfully larger input sizes than earlier architectures allowed.

Currently, self-attention models are quite powerful and capable, but because attention is quadratic in sequence length in both time and, often more saliently, memory, it's infeasible to use on long sequences without some modification. This paper proposes what the authors call "cross-attention," where a smaller-dimensional latent array attends to the input (the latent generates the queries, the input the keys and values; see the sketch below). This lets the network pull information out of the larger-dimensional input into a smaller latent whose size is fixed as a hyperparameter. From there, multiple self-attention layers are applied to generate a new latent, which can be fed back into the beginning of the process to query new information from the input, accounting for the "iterative" in the title of this work.
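As a rough illustration of that asymmetric attention pattern (my own sketch, not DeepMind's implementation; the shapes and the use of nn.MultiheadAttention are assumptions):

```python
# Cross-attention bottleneck: a small latent array queries a large input array, so the
# cost scales with (latent length x input length) rather than (input length)^2.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, latent_dim, input_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=input_dim, vdim=input_dim,
                                          batch_first=True)

    def forward(self, latent, inputs):
        # latent: [batch, N, latent_dim] with N fixed by hyperparameter
        # inputs: [batch, M, input_dim]  with M potentially very large
        out, _ = self.attn(query=latent, key=inputs, value=inputs)
        return latent + out  # residual connection, as in standard Transformer blocks
```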

The authors argue this approach lets them take larger inputs, and create deeper models, because the cost of each self-attention layer (going from latent-dim to latent-dim) is small and controlled. Like many other Transformer-based architectures, they use positional encodings, theirs based on Fourier features at different frequencies.

My overall take from the results presented is that the model is competitive on many of the audio and vision tasks tested, with none of the convolutional priors that even something like Vision Transformer (which does coarse convolution-style preprocessing before going into Transformer layers) requires, though it didn't dramatically outperform the state of the art on any of the tested tasks. One thing that was strange to me was that they didn't (at least in the main paper; I haven't read the appendix) seem to evaluate on text, which would seem like an obvious benchmark if you're proposing a Transformer-alternative architecture.

Summary by CodyWild 3 years ago

This was an amusingly-timed paper for me to read, because just yesterday I was listening to a different paper summary where the presenter offhandedly mentioned the idea of compressing the sequence length in Transformers through subsequent layers (the way a ConvNet does pooling to a smaller spatial dimension in the course of learning), and it made me wonder why I hadn't heard much about that as an approach. And, lo, I came on this paper in my list the next day, which does exactly that.

As a refresher, Transformers work by starting out with one embedding per token in the first layer, and, on each subsequent layer, they create new representations for each token by calculating an attention mechanism over all tokens in the prior layer. This means you have one representation per token for the full sequence length, and for the full depth of the network. In addition, you typically have a CLS token that isn't connected to any particular word, but is the designated place where sequence-level representations aggregate and are used for downstream tasks.

This paper notices that many applications of trained transformers care primarily about that aggregated representation, rather than precise per-word representations. For cases where that's true, you're spending a lot of computation power on continually calculating the SeqLength^2 attention maps in later layers, when they might not be bringing you that much value in your downstream transfer tasks.

A central reason why you do generally need per-token representations in training Transformers, though, even if your downstream tasks need them less, is that the canonical Masked Language Model and newer ELECTRA loss functions require token-level predictions for the specific tokens being masked. To accommodate this need, the authors of this paper structure their "Funnel" Transformer as more of an hourglass: essentially a VAE-esque encoder/decoder structure, where attention-based downsampling layers reduce the length of the internal representation, and a "decoder" then amplifies it back to the full sequence length, so you have one representation per token for training purposes (more on the exact way this works in a bit). The nifty thing here is that, for downstream tasks, you can chop off the decoder and be left with a network with comparatively less computation cost per layer of depth.

The exact mechanisms of downsampling and upsampling in this paper are quite clever.

To perform downsampling at a given attention layer, you take a sequence of representations h and downsample it to h' of half the length by mean-pooling adjacent tokens. However, in the attention calculation, you only use h' for the queries, and use the full sequence h for the keys and values. Essentially, this means you have an attention layer where the downsampled representations attend to, and pull information from, the full scope of the (non-downsampled) representations of the layer below. This gives you a much more flexible downsampling operation, since the attention mechanism can choose which information to pull into the downsampled representation, rather than that being determined automatically by a pooling operation (a sketch of this pooled-query attention is below).
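A minimal sketch of that pooled-query attention, under my own assumptions about shapes and modules (an illustration, not the paper's code):

```python
# Downsampling block: queries come from the mean-pooled, half-length sequence; keys and
# values come from the full-length sequence, so the shorter stream can still attend to
# everything in the layer below.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledQueryAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h):  # h: [batch, L, dim]
        h_pooled = F.avg_pool1d(h.transpose(1, 2), kernel_size=2).transpose(1, 2)  # [batch, L/2, dim]
        out, _ = self.attn(query=h_pooled, key=h, value=h)
        return h_pooled + out  # residual on the pooled stream
```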

The paper inflates the bottlenecked representations back up to the full sequence length by first tiling the downsampled representation (for example, if you had downsampled from 20 to 5, you would tile the first representation 4 times, then the second representation 4 times, and so on until you hit 20). That tiled representation, which can roughly be thought of as representing a large region of the sequence, is then added, ResNet-style, to the full-length sequence of representations that came out of the first attention layer, essentially combining shallow token-level representations with deep region-level representations. This aggregated representation is then used for token-level loss prediction (sketched below).
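Using the same 20-to-5 example, the up-sampling step might look something like this (a sketch under assumed shapes, not the authors' code):

```python
import torch

def upsample_by_tiling(deep, shallow, factor=4):
    # deep:    [batch, L/factor, dim]  bottlenecked, region-level representations
    # shallow: [batch, L, dim]         full-length output of the first attention layer
    tiled = deep.repeat_interleave(factor, dim=1)   # repeat each deep vector `factor` times
    return shallow + tiled                          # ResNet-style residual combination
```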

The authors benchmark against common baseline models, using deeper models with fewer tokens per layer, and find that they can reach similar or higher levels of performance with fewer FLOPs on text aggregation tasks. They fall short of full-sequence models on tasks that require strong per-token representations, which fits with my expectations.

Nerfies: Deformable Neural Radiance Fields
Keunhong Park and Utkarsh Sinha and Jonathan T. Barron and Sofien Bouaziz and Dan B Goldman and Steven M. Seitz and Ricardo Martin-Brualla
arXiv e-Print archive - 2020 via Local arXiv
Keywords: cs.CV, cs.GR

Summary by CodyWild 3 years ago

This summary builds substantially on my summary of NERFs, so if you haven't yet read that, I recommend doing so first!

The idea of a NERF is to learn a neural network that represents a 3D scene, and from which you can, once the model is trained, sample an image of that scene from any desired angle. This involves structuring your neural network as a function that predicts the RGB color and density/opacity for a given point in 3D space (x, y, z), from a given viewing angle (theta, phi). With such a function, you can generate predictions of what images taken from certain angles would look like by sampling along a viewing ray and integrating the combined hue and opacity into an aggregated view. This prediction can then be compared to a true image taken from that direction, and gradients passed backwards into the prediction model.

An important assumption of this model is that the scene being photographed is static; specifically, that every point in space is always inhabited by the same part of the 3D object, regardless of what angle it's viewed from. This is a reasonable assumption for photos of inanimate objects, or of humans in highly controlled lab settings, but it is often not true for humans when you, say, ask them to take a selfie video of themselves. Even if they're trying to keep roughly still, there will be slight shifts in the location and position of their head between frames, and the authors of this paper show that this can lead to strange artifacts if you naively try to train a NERF from the images (including a particularly odd one where it hallucinates tiny copies of the image in the air surrounding the face).

The fix proposed by this paper is to apply a learnable deformation field to each image, where the notion is to deform each view into one canonical position (fixed per network, since, again, one network corresponds to a single scene). This means that, along with learning the parameters of the NERF itself, you're also learning what deformation to apply to each training image to get it into this canonical position. This is done by parametrizing the deformation in a particular way, and then having that deformation be conditioned on a latent vector that's trained similarly to how you'd train an embedding (one learned vector per image example); a toy sketch of this setup follows this paragraph. The parametrization of the deformation is honestly a little bit over my head, given my lack of grounding in 3D modeling, but my general sense is that it applies some constraints and regularization to ensure that the learned deformations are realistic, insofar as humans are mostly rigid (one patch of skin on my forehead generally doesn't move except in concordance with the rest of my forehead), but with some possibility for elasticity (skin can stretch if I, say, smile). The authors also include an annealing scheme whereby, early in training, the model focuses on learning coarse (large-scale) deformations, and later in training it's allowed to also learn weights for more precise deformations. This is to hopefully match macro-scale shifts before adding the noise of precise changes.
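My loose mental model of that setup, reduced to a toy sketch (the real parametrization of the deformation is more structured and regularized, as noted above; all module names and sizes here are my own assumptions):

```python
# Toy version: a per-image latent code conditions a deformation MLP that warps observed
# points into a shared canonical frame before they are fed to the canonical NERF.
import torch
import torch.nn as nn

class DeformedNerf(nn.Module):
    def __init__(self, num_images, latent_dim=64, hidden=256):
        super().__init__()
        self.deform_codes = nn.Embedding(num_images, latent_dim)  # one learned code per training image
        self.deform = nn.Sequential(                              # (observed point, code) -> 3D offset
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )
        self.nerf = nn.Sequential(                                # (canonical point, view dir) -> RGB + density
            nn.Linear(3 + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, xyz, view_dir, image_idx):
        code = self.deform_codes(image_idx)                           # [batch, latent_dim]
        offset = self.deform(torch.cat([xyz, code], dim=-1))          # warp into the canonical frame
        return self.nerf(torch.cat([xyz + offset, view_dir], dim=-1)) # [batch, 4] = RGB + density
```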

This addition of a learned deformation is most of the contribution of this method: with it applied, they show that they're able to learn realistic NERFs from selfies, which they term "NERFIES". They mention a few pieces of concurrent work that try to solve the same problem of non-static human subjects in different ways, but I haven't had a chance to read those, so I can't really comment on how NERFIES stacks up against alternate approaches; it does appear to be at least one empirically convincing solution to the problem it's aiming at.

Summary by CodyWild 3 years ago

This summary builds extensively on my prior summary of SIRENs, so if you haven't read that summary or the underlying paper yet, I'd recommend doing that first!

At a high level, the idea of SIRENs is to use a neural network to learn a compressed, continuous representation of an image, where the neural network encodes a mapping from (x, y) to the pixel value at that location, and the image can be reconstructed (or, potentially, expanded in size) by sampling from that function across the full range of the image. To do this effectively, they use sinusoidal activation functions, which let them match not just the output of the neural network f(x, y) to the true image, but also the first and second derivatives of the neural network to the first and second derivatives of the true image, which provides a more robust training signal.
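For reference, a bare-bones sketch of what such a network looks like (my own minimal version; the frequency scale w0 is an assumed hyperparameter, and the derivative-matching losses are omitted):

```python
# SIREN-style layer: a linear map followed by a sine activation, which keeps the network
# smoothly differentiable so its derivatives can also be fit to the image's derivatives.
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_dim, out_dim, w0=30.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.w0 = w0

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

# (x, y) -> RGB mapping for a single image
siren = nn.Sequential(SineLayer(2, 256), SineLayer(256, 256), nn.Linear(256, 3))
```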

NERF builds on this idea, but instead of trying to learn a continuous representation of an image (a mapping from 2D position to RGB color), it tries to learn a continuous representation of a scene, mapping from a position (specified with three coordinates) and viewing direction (specified with two angles) to the RGB color at a given point in a 3D grid (or "voxel", analogous to "pixel"), as well as the density or opacity of that point.

Why is this interesting? Because if you have a NERF that has learned a good underlying function of a particular 3D scene, you can theoretically take samples of that scene from arbitrary angles, even angles not seen during training. It essentially functions as a usable 3D model of a scene, but one that, because it's stored in the weights of a neural network and specified as a continuous function, is far smaller than actually storing all the values of all the voxels in a 3D scene (the authors give an example of 5MB vs 15GB for a NERF vs a full 3D model). To get some intuition for this, consider that if you wanted to store the curve represented by a particular third-degree polynomial function between 0 and 10,000, it would be much more space-efficient to simply store the four coefficients of that polynomial, and be able to sample from it at your desired granularity at will, rather than storing many empirically sampled points from along the curve.

How is a NERF model learned?

  • The (x, y, z) position of each point is encoded as a combination of sine-wave, Fourier-style curves of increasingly higher frequency. This is similar to the positional encoding used by Transformers. In practical terms, this means a location in space will be represented as a vector calculated as [some point on a low-frequency curve, some point on a slightly higher-frequency curve, ..., some point on the highest-frequency curve]. This doesn't contain any more information than the (x, y, z) representation, but it does empirically seem to help training when you separate out the frequencies like this
  • You take a dataset of images for which the viewing direction is known, and simulate sending a ray through the scene in that direction, hitting some line (or possibly tube?) of voxels on the way. You then calculate the perceived color along that ray, which is an integral over the color and density/opacity values your model returns at each sampled point (see the sketch after this list). Intuitively, if you have a high opacity weight early on, that part of the object blocks any voxels further along the ray, whereas if the opacity weight is lower, more of the voxels behind will contribute to the overall perceived color. You then compare these predicted perceived colors to the actual colors captured by the 2D image, and train on the prediction error.
  • (One note on sampling: the paper proposes a hierarchical sampling scheme to help with sampling efficiently along the ray, first taking a coarse sample, and then adding additional samples in regions of high predicted density)
  • At the end of training, you have a network that hopefully captures the information from that particular scene. A notable downside of this approach is that it's quite slow for any use case that requires training on many scenes, since each individual scene network takes about 1-2 days of GPU time to train.
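To make the first two bullets concrete, here is a toy sketch of the Fourier-style position encoding and the compositing-along-a-ray step (my own simplified version of the standard formulation; all names and shapes are assumptions):

```python
import torch

def fourier_encode(xyz, num_freqs=10):
    # Encode each coordinate at increasingly higher frequencies: [sin(x), cos(x), sin(2x), ...]
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32)   # 1, 2, 4, ...
    angles = xyz.unsqueeze(-1) * freqs                            # [batch, 3, num_freqs]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).flatten(1)

def composite_ray(rgb, density, deltas):
    # rgb:     [num_samples, 3]  color predicted at each sample along the ray
    # density: [num_samples]     opacity/density predicted at each sample
    # deltas:  [num_samples]     distance between adjacent samples
    alpha = 1.0 - torch.exp(-density * deltas)                    # chance the ray "stops" at this sample
    # Transmittance: probability the ray reaches sample i without being blocked earlier,
    # so high-opacity samples early on suppress the contribution of everything behind them.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)               # perceived RGB for this ray
```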
