[link]
### What is BN:

* Batch Normalization (BN) is a normalization method/layer for neural networks.
* Usually inputs to neural networks are normalized to either the range of [0, 1] or [-1, 1] or to mean=0 and variance=1. The latter is called *Whitening*.
* BN essentially performs Whitening to the intermediate layers of the networks.

### How it's calculated:

* The basic formula is $x^* = (x - E[x]) / \sqrt{\text{var}(x)}$, where $x^*$ is the new value of a single component, $E[x]$ is its mean within a batch and $\text{var}(x)$ is its variance within a batch.
* BN extends that formula further to $x^{**} = \gamma x^* + \beta$, where $x^{**}$ is the final normalized value. `gamma` and `beta` are learned per layer. They make sure that BN can learn the identity function, which is needed in a few cases.
* For convolutions, every filter/kernel (feature map) is normalized on its own (linear layers: each neuron/node/component). That means that every generated value ("pixel") is treated as an example. If we have a batch size of N and the image generated by the convolution has width=P and height=Q, we would calculate the mean (E) over `N*P*Q` examples (same for the variance). (A code sketch of this computation follows below.)
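The normalization just described is small enough to write out directly. Below is a minimal NumPy sketch of the training-time forward pass for a fully-connected layer and for a convolutional feature map; function and variable names are illustrative, and `eps` is the small constant the paper adds inside the square root for numerical stability.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a fully-connected layer.

    x:     (N, D) mini-batch, N examples with D features each
    gamma: (D,) learned scale per feature
    beta:  (D,) learned shift per feature
    """
    mean = x.mean(axis=0)                      # E[x] per feature, over the batch
    var = x.var(axis=0)                        # var(x) per feature, over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # x* in the notes
    return gamma * x_hat + beta                # x** = gamma * x* + beta

def batchnorm_conv_forward(x, gamma, beta, eps=1e-5):
    """Same idea for feature maps of shape (N, C, H, W): the statistics of each
    channel are taken over the N*H*W values of that channel."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # one mean per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)     # one variance per channel
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```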
### Theoretical effects:

* BN reduces *Covariate Shift*. That is the change in the distribution of the activations of a component. By using BN, each neuron's activation becomes (more or less) a gaussian distribution, i.e. it's usually not active, sometimes a bit active, rarely very active.
* Covariate Shift is undesirable, because the later layers have to keep adapting to the change of the type of distribution (instead of just to new distribution parameters, e.g. new mean and variance values for gaussian distributions).
* BN reduces the effects of exploding and vanishing gradients, because every activation becomes roughly normally distributed. Without BN, low activations of one layer can lead to lower activations in the next layer, and then even lower ones in the next layer and so on.

### Practical effects:

* BN reduces training times. (Because of less Covariate Shift, less exploding/vanishing gradients.)
* BN reduces the demand for regularization, e.g. dropout or L2 norm. (Because the means and variances are calculated over batches and therefore every normalized value depends on the current batch. I.e. the network can no longer just memorize values and their correct answers.)
* BN allows higher learning rates. (Because of less danger of exploding/vanishing gradients.)
* BN enables training with saturating nonlinearities in deep networks, e.g. sigmoid. (Because the normalization prevents them from getting stuck in saturating ranges, e.g. very high/low values for sigmoid.)

*BN applied to MNIST (a), and activations of a randomly selected neuron over time (b, c), where the middle line is the median activation, the top line is the 85th percentile and the bottom line is the 15th percentile.*

-------------------------

### Rough chapter-wise notes

* (2) Towards Reducing Covariate Shift
  * Batch Normalization (*BN*) is a special normalization method for neural networks.
  * In neural networks, the inputs to each layer depend on the outputs of all previous layers.
  * The distributions of these outputs can change during the training. Such a change is called a *covariate shift*.
  * If the distributions stayed the same, it would simplify the training. Then only the parameters would have to be readjusted continuously (e.g. mean and variance for normal distributions).
  * If using sigmoid activations, it can happen that one unit saturates (very high/low values). That is undesired as it leads to vanishing gradients for all units below in the network.
  * BN fixes the means and variances of layer inputs to specific values (zero mean, unit variance).
  * That accomplishes:
    * No more covariate shift.
    * Fixes problems with vanishing gradients due to saturation.
  * Effects:
    * Networks learn faster. (As they don't have to adjust to covariate shift any more.)
    * Optimizes gradient flow in the network. (As the gradient becomes less dependent on the scale of the parameters and their initial values.)
    * Higher learning rates are possible. (Optimized gradient flow reduces risk of divergence.)
    * Saturating nonlinearities can be safely used. (Optimized gradient flow prevents the network from getting stuck in saturated modes.)
    * BN reduces the need for dropout. (As it has a regularizing effect.)
  * How BN works:
    * BN normalizes layer inputs to zero mean and unit variance. That is called *whitening*.
    * Naive method: Train on a batch. Update model parameters. Then normalize. Doesn't work: leads to exploding biases while the distribution parameters (mean, variance) don't change.
    * A proper method has to include the current example *and* all previous examples in the normalization step.
    * This leads to calculating the covariance matrix and its inverse square root. That's expensive. The authors found a faster way.
* (3) Normalization via Mini-Batch Statistics
  * Each feature (component) is normalized individually. (Due to cost, differentiability.)
  * Normalization according to: `componentNormalizedValue = (componentOldValue - E[component]) / sqrt(Var(component))`
  * Normalizing each component can reduce the expressivity of the nonlinearities. Hence the formula is changed so that it can also learn the identity function.
  * Full formula: `newValue = gamma * componentNormalizedValue + beta` (gamma and beta learned per component)
  * E and Var are estimated for each mini-batch.
  * BN is fully differentiable. Formulas for the gradients/backpropagation are at the end of chapter 3 (page 4, left).
* (3.1) Training and Inference with Batch-Normalized Networks
  * During test time, E and Var of each component can be estimated using all examples or alternatively with moving averages estimated during training.
  * During test time, the BN formulas can be simplified to a single linear transformation. (A sketch of this folding is given at the end of this summary.)
* (3.2) Batch-Normalized Convolutional Networks
  * The authors recommend placing BN layers after linear/fully-connected layers and before the nonlinearities.
  * They argue that the linear layers produce a distribution that is more likely to be similar to a gaussian.
  * Placing BN after the nonlinearity would also not eliminate covariate shift (for some reason).
  * Learning a separate bias isn't necessary as BN's formula already contains a bias-like term (beta).
  * For convolutions they apply BN equally to all features on a feature map. That creates effective batch sizes of `m*p*q`, where `m` is the number of examples in the batch and `p`, `q` are the feature map dimensions (height, width). BN for linear layers has a batch size of `m`.
  * gamma and beta are then learned per feature map, not per single pixel. (Linear layers: per neuron.)
* (3.3) Batch Normalization enables higher learning rates
  * BN normalizes activations.
  * Result: Changes to early layers don't amplify towards the end.
  * BN makes it less likely to get stuck in the saturating parts of nonlinearities.
  * BN makes training more resilient to parameter scales.
  * Usually, large learning rates cannot be used as they tend to scale up the parameters. Then any change to a parameter amplifies through the network and can lead to gradient explosions.
  * With BN gradients actually go down as parameters increase. Therefore, higher learning rates can be used.
  * (something about singular values and the Jacobian)
* (3.4) Batch Normalization regularizes the model
  * Usually: Examples are seen on their own by the network.
  * With BN: Examples are seen in conjunction with other examples (mean, variance).
  * Result: The network can't easily memorize the examples any more.
  * Effect: BN has a regularizing effect. Dropout can be removed or decreased in strength.
* (4) Experiments
  * (4.1) Activations over time
    * They tested BN on MNIST with a 100x100x10 network. (One network with BN before each nonlinearity, another network without BN for comparison.)
    * Batch size was 60.
    * The network with BN learned faster. Activations of neurons (their means and variances over several examples) seemed to be more consistent during training.
    * Generalization of the BN network seemed to be better.
  * (4.2) ImageNet classification
    * They applied BN to the Inception network.
    * Batch size was 32.
    * During training they used (compared to the original Inception training) a higher learning rate with more decay, no dropout, less L2, no local response normalization and less distortion/augmentation.
    * They shuffle the data during training (i.e. each batch contains different examples).
    * Depending on the learning rate, they either achieve the same accuracy (as the non-BN network) in 14 times fewer steps (5x learning rate) or a higher accuracy in 5 times fewer steps (30x learning rate).
    * BN enables training of Inception networks with sigmoid units (still a bit lower accuracy than ReLU).
    * An ensemble of 6 Inception networks with BN achieved better accuracy than the previously best network for ImageNet.
* (5) Conclusion
  * BN is similar to a normalization layer suggested by Gülcehre and Bengio. However, they applied it to the outputs of nonlinearities.
  * They also didn't have the beta and gamma parameters (i.e. their normalization could not learn the identity function).
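As noted under (3.1), at test time the batch statistics are replaced by fixed population estimates (e.g. moving averages collected during training), so the whole BN layer collapses into a single affine transformation per feature. A minimal NumPy sketch of that folding, with illustrative names:

```python
import numpy as np

def fold_batchnorm(gamma, beta, running_mean, running_var, eps=1e-5):
    """Collapse test-time BN into y = a * x + b using fixed statistics."""
    a = gamma / np.sqrt(running_var + eps)
    b = beta - a * running_mean
    return a, b

def batchnorm_inference(x, a, b):
    # a single linear (affine) transformation; no batch statistics needed
    return a * x + b
```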
[link]
* They propose a CNN-based approach to detect faces in a wide range of orientations using a single model. However, since the training set is skewed, the network is more confident about up-right faces.
* The model does not require additional components such as segmentation, bounding-box regression, or SVM classifiers.

### How

* __Data augmentation__: to increase the number of positive samples (24K face annotations), the authors used randomly sampled sub-windows of the images with IOU > 50% and also randomly flipped these images. In total, there were 20K positive and 20M negative training samples.
* __CNN Architecture__: 5 convolutional layers followed by 3 fully-connected layers. The fully-connected layers were converted to convolutional layers. Non-Maximal Suppression (NMS) is applied to merge predicted bounding boxes.
* __Training__: the CNN was trained using the Caffe library on the AFLW dataset with the following parameters:
  * Fine-tuning with the AlexNet model
  * Input image size = 227x227
  * Batch size = 128 (32+, 96-)
  * Stride = 32
* __Test__: the model was evaluated on the PASCAL FACE, AFW, and FDDB datasets.
* __Running time__: since the fully-connected layers were converted to convolutional layers, the input image at test time may be of any size, and the output is a heat map of face scores. To detect faces of different sizes, the image is scaled up/down and new heatmaps are obtained. The authors found that rescaling the image 3 times per octave gives reasonably good performance. (A rough sketch of this pyramid-of-heatmaps inference is given at the end of this summary.)

* The authors observed that the model is more confident about up-right faces than rotated/occluded ones. This is because of the lack of good training examples representing such faces in the training process. Better results could be achieved by using better sampling strategies and more sophisticated data augmentation techniques.
* The authors tested different strategies for NMS and the effect of bounding-box regression on face detection. They found that NMS-avg had better performance than NMS-max in terms of average precision. On the other hand, adding a bounding-box regressor degraded the performance for both NMS strategies due to the mismatch between the annotations of the training set and the test set. This mismatch affects mostly side-view faces.

### Results:

* In comparison to R-CNN, the proposed face detector had significantly better performance, independent of the NMS strategy. The authors attribute the inferior performance of R-CNN to a loss of recall, since selective search may miss some of the face regions, and to a loss in localization, since bounding-box regression is not perfect and may not be able to fully align the bounding boxes provided by selective search with the ground truth.
* In comparison to other state-of-the-art methods like structural models, TSM and cascade-based methods, DDFD achieves similar or better results. However, this comparison is not completely fair, since most of the other methods use extra information such as pose annotations or facial landmarks during training.
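The running-time scheme above (a fully-convolutional scorer applied to an image pyramid with 3 scales per octave) can be sketched roughly as follows. This is only an illustration: `score_net` is a tiny stand-in convolution rather than the converted AlexNet, all names are hypothetical, and only down-scaling is shown for brevity (the authors also scale the image up to catch smaller faces).

```python
import torch
import torch.nn.functional as F

# Stand-in for the fully-convolutional face scorer: any network that maps a
# 3xHxW image to a 1 x H' x W' score map works for this sketch.
score_net = torch.nn.Conv2d(3, 1, kernel_size=11, stride=4)

def heatmaps_over_scales(image, scales_per_octave=3, n_octaves=3):
    """Run the scorer on an image pyramid and collect one heatmap per scale."""
    factor = 2.0 ** (1.0 / scales_per_octave)   # 3 scales per octave
    heatmaps = []
    scale = 1.0
    for _ in range(scales_per_octave * n_octaves):
        h = int(image.shape[-2] / scale)
        w = int(image.shape[-1] / scale)
        resized = F.interpolate(image, size=(h, w), mode='bilinear',
                                align_corners=False)
        with torch.no_grad():
            heatmaps.append((scale, score_net(resized)))
        scale *= factor
    return heatmaps

image = torch.rand(1, 3, 227, 227)   # dummy input
for s, hm in heatmaps_over_scales(image):
    print(s, tuple(hm.shape))        # one score map per pyramid level
```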
[link]
* They describe an architecture that merges classical convolutional networks and residual networks.
* The architecture can (theoretically) learn anything that a classical convolutional network or a residual network can learn, as it contains both of them.
* The architecture can (theoretically) learn how many convolutional layers it should use per residual block (up to the number of convolutional layers in the whole network).

### How

* Just like residual networks, they have "blocks". Each block contains convolutional layers.
* Each block contains residual units and non-residual units.
* They have two "streams" of data in their network (just matrices generated by each block):
  * Residual stream: The residual units write to this stream (i.e. it's their output).
  * Transient stream: The non-residual units write to this stream.
* Residual and non-residual units receive *both* streams as input, but only write to *their* stream as output.
* Because of this architecture, their model can learn the number of layers per residual block (though BN and ReLU might cause problems here?).
* The easiest way to implement this should be along the lines of the following (some of the visualized convolutions can be merged; a hedged PyTorch sketch is given at the end of this summary):
  * Input of size CxHxW (both streams, each C/2 planes)
  * Concat
  * Residual block: Apply C/2 convolutions to the C input planes, with shortcut addition afterwards.
  * Transient block: Apply C/2 convolutions to the C input planes.
  * Apply BN
  * Apply ReLU
  * Output of size CxHxW.
* The whole operation can also be implemented with just a single convolutional layer, but then one has to make sure that some weights stay at zero.

### Results

* They test on CIFAR-10 and CIFAR-100.
* They search for optimal hyperparameters (learning rate, optimizer, L2 penalty, initialization method, type of shortcut connection in residual blocks) using a grid search.
* Their model improves upon a wide ResNet and an equivalent non-residual CNN by a good margin (CIFAR-10: 0.5-1%, CIFAR-100: 1-2%).
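Below is a hedged PyTorch sketch of the two-stream block from the list above. The channel split, kernel size and the BN/ReLU placement simply follow that list and are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralizedResidualBlock(nn.Module):
    """Two-stream block: a residual stream with a shortcut connection and a
    transient stream without one. Both units read both streams but each
    writes only to its own stream."""

    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        # each unit reads all C input planes and writes C/2 output planes
        self.residual_conv = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.transient_conv = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.bn_r = nn.BatchNorm2d(half)
        self.bn_t = nn.BatchNorm2d(half)

    def forward(self, r, t):
        x = torch.cat([r, t], dim=1)           # concat: both streams as input
        r_out = self.residual_conv(x) + r      # shortcut addition (residual)
        t_out = self.transient_conv(x)         # no shortcut (transient)
        return F.relu(self.bn_r(r_out)), F.relu(self.bn_t(t_out))

# usage: the C input planes are split evenly between the two streams
block = GeneralizedResidualBlock(channels=32)
r = torch.rand(4, 16, 8, 8)
t = torch.rand(4, 16, 8, 8)
r, t = block(r, t)
```

Stacking several such blocks and concatenating the two streams at the end yields a network that contains both a plain convolutional path and a residual path, which is the point of the architecture.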
[link]
* They suggest a new architecture for GANs.
* Their architecture adds another Generator for a reverse branch (from images to noise vector `z`).
* Their architecture takes some ideas from VAEs/variational neural nets.
* Overall they can improve on the previous state of the art (DCGAN).

### How

* Architecture
  * Usually, in GANs one feeds a noise vector `z` into a Generator (G), which then generates an image (`x`) from that noise.
  * They add a reverse branch (G2), in which another Generator takes a real image (`x`) and generates a noise vector `z` from that.
  * The noise vector can now be viewed as a latent space vector.
  * Instead of letting G2 generate *discrete* values for `z` (as is usually done), they take the approach commonly used in VAEs and use *continuous* variables instead.
  * That is, if `z` represents `N` latent variables, they let G2 generate `N` means and `N` variances of gaussian distributions, with each distribution representing one value of `z`.
  * So the model could e.g. represent something along the lines of "this face looks a lot like a female, but with very low probability could also be male".
* Training
  * The Discriminator (D) is now trained on pairs of either `(real image, generated latent space vector)` or `(generated image, randomly sampled latent space vector)` and has to tell them apart from each other.
  * Both Generators are trained to maximally confuse D.
  * G1 (from `z` to `x`) confuses D maximally, if it generates new images that (a) look real and (b) fit well to the latent variables in `z` (e.g. if `z` says "image contains a cat", then the image should contain a cat).
  * G2 (from `x` to `z`) confuses D maximally, if it generates good latent variables `z` that fit to the image `x`.
* Continuous variables
  * The variables in `z` follow gaussian distributions, which makes the training more complicated, as you can't trivially backpropagate through gaussians.
  * When training G1 (from `z` to `x`) the situation is easy: You draw a random `z`-vector following a gaussian distribution (`N(0, I)`). (This is basically the same as in "normal" GANs. They just often use uniform distributions instead.)
  * When training G2 (from `x` to `z`) the situation is a bit harder.
    * Here we need to use the reparameterization trick.
    * That roughly means that G2 predicts the means and variances of the gaussian variables in `z` and then we draw a sample of `z` according to exactly these means and variances. (A small code sketch of this trick follows at the end of this summary.)
    * That sample gives us concrete values for our backpropagation.
    * If we do that sampling often enough, we get a good approximation of the true gradient (of the continuous variables). (Monte Carlo approximation.)
* Results
  * Images generated based on the Celeb-A dataset.
  * Left column per pair: real image; right column per pair: reconstruction (`x -> z` via G2, then `z -> x` via G1).
  * Reconstructions of SVHN, notice how the digits often stay the same, while the font changes.
  * CIFAR-10 samples, still lots of errors, but some quite correct.
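A small PyTorch sketch of the reparameterization trick used for G2. The log-variance parameterization and the shapes here are illustrative assumptions, not taken from the paper; the point is only that the sample is a differentiable function of the predicted distribution parameters.

```python
import torch

def sample_latent(mu, log_var):
    """Reparameterization trick: draw z ~ N(mu, sigma^2) as a differentiable
    function of mu and log_var, so gradients can flow back into the encoder
    (G2 in the notes above)."""
    sigma = torch.exp(0.5 * log_var)
    epsilon = torch.randn_like(sigma)   # noise, independent of the parameters
    return mu + sigma * epsilon         # differentiable w.r.t. mu and log_var

# G2 would predict mu and log_var per latent variable for a batch of images;
# dummy tensors stand in for its outputs here.
mu = torch.zeros(8, 64, requires_grad=True)
log_var = torch.zeros(8, 64, requires_grad=True)
z = sample_latent(mu, log_var)
z.sum().backward()   # gradients reach mu and log_var
```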
[link]
* They suggest a new stochastic optimization method, similar to the existing SGD, Adagrad or RMSProp.
  * Stochastic optimization methods have to find parameters that minimize/maximize a stochastic function.
  * A function is stochastic (non-deterministic), if the same set of parameters can generate different results. E.g. the loss of different mini-batches can differ, even when the parameters remain unchanged. Even for the same mini-batch the results can change due to e.g. dropout.
* Their method tends to converge faster to optimal parameters than the existing competitors.
* Their method can deal with non-stationary distributions (similar to e.g. SGD, Adadelta, RMSProp).
* Their method can deal with very sparse or noisy gradients (similar to e.g. Adagrad).

### How

* Basic principle
  * Standard SGD just updates the parameters based on `parameters = parameters - learningRate * gradient`.
  * Adam operates similar to that, but adds more "cleverness" to the rule.
  * It assumes that the gradient values have means and variances and tries to estimate these values.
    * Recall here that the function to optimize is stochastic, so there is some randomness in the gradients.
    * The mean is also called "the first moment".
    * The variance is also called "the second (raw) moment".
  * Then an update rule very similar to SGD would be `parameters = parameters - learningRate * means`.
  * They instead use the update rule `parameters = parameters - learningRate * means/sqrt(variances)`.
    * They call `means/sqrt(variances)` a "Signal to Noise Ratio".
    * Basically, if the variance of a specific parameter's gradient is high, it is pretty unclear how it should be changed. So we choose a small step size in the update rule via `learningRate * mean/sqrt(highValue)`.
    * If the variance is low, it is easier to predict how far to "move", so we choose a larger step size via `learningRate * mean/sqrt(lowValue)`.
* Exponential moving averages
  * In order to approximate the mean and variance values you could simply save the last `T` gradients and then average the values.
  * That however is a pretty bad idea, because it can lead to high memory demands (e.g. for millions of parameters in CNNs).
  * A simple average also has the disadvantage that it would completely ignore all gradients before `T` and weight all of the last `T` gradients identically. In reality, you might want to give more weight to the last couple of gradients.
  * Instead, they use an exponential moving average, which fixes both problems and simply updates the average at every timestep via a formula of the form `avg = alpha * avg + (1 - alpha) * newValue`.
  * Let the gradient at timestep (batch) `t` be `g`, then we can approximate the mean and variance values using:
    * `mean = beta1 * mean + (1 - beta1) * g`
    * `variance = beta2 * variance + (1 - beta2) * g^2`
  * `beta1` and `beta2` are hyperparameters of the algorithm. Good values for them seem to be `beta1=0.9` and `beta2=0.999`.
  * At the start of the algorithm, `mean` and `variance` are initialized to zero-vectors.
* Bias correction
  * Initializing the `mean` and `variance` vectors to zero is an easy and logical step, but has the disadvantage that bias is introduced.
  * E.g. at the first timestep, the mean of the gradient would be `mean = beta1 * 0 + (1 - beta1) * g`, with `beta1=0.9` then: `mean = 0.1 * g`. So `0.1g`, not `g`. Both the mean and the variance are biased (towards 0).
  * This seems pretty harmless, but it can be shown that it lowers the convergence speed of the algorithm by quite a bit.
  * So to fix this, they perform bias corrections of the mean and the variance:
    * `correctedMean = mean / (1 - beta1^t)` (where `t` is the timestep).
    * `correctedVariance = variance / (1 - beta2^t)`.
    * Both formulas are applied at every timestep after the exponential moving averages (they do not influence the next timestep). (The full update step is sketched below.)
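Putting the update rule, the exponential moving averages and the bias correction together gives the following minimal NumPy sketch of one Adam step. Variable names are illustrative; `eps` is the small constant the paper adds to the denominator to avoid division by zero.

```python
import numpy as np

def adam_step(params, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are the exponential moving averages of the
    gradient and the squared gradient (first and second moment); t is the
    1-based timestep used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # mean estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # (raw) variance estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected mean
    v_hat = v / (1 - beta2 ** t)                # bias-corrected variance
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# usage: m and v start as zero vectors, t starts at 1
params = np.zeros(10)
m = np.zeros_like(params)
v = np.zeros_like(params)
for t in range(1, 101):
    grad = np.random.randn(10)        # stand-in for a mini-batch gradient
    params, m, v = adam_step(params, grad, m, v, t)
```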