Welcome to ShortScience.org!
[link]
This paper explores the problem of question answering based on natural text. While this has been explored recently in the context of Memory Networks, the problems tackled so far have been synthetically generated. In this paper, the authors propose to extract more realistic question answering examples from news sites, by treating the main body of a news article as the content (the "facts") and extracting questions from the article's bullet point summaries. Specifically, by detecting the entities in these bullet points and replacing them with a question placeholder (e.g. "Producer X will not press charges"), they are able to generate queries which, while not grammatically questions, do require a form of question answering (a small sketch of this construction is given after the summary). Thanks to this procedure, two large *supervised* datasets are created, with several thousand questions each, based on the CNN and Daily Mail news sites.

Then, the authors investigate neural network based systems for solving this task. They consider a fairly simple Deep LSTM network, which is first fed the article's content and then the query. They also consider two architectures that incorporate an attentional mechanism based on softmax weighting. The first ("Attentive Reader") attends once over the document (i.e. uses a single softmax weight vector) while the second ("Impatient Reader") attends after every word in the query (akin to the soft attention architecture in the "Show, Attend and Tell" paper). These neural network architectures are also compared with simpler baselines, which are closer to what a more "classical" statistical NLP solution might look like. Results on both datasets demonstrate that the neural network approaches have superior performance, with the attentional models being significantly better than the simpler Deep LSTM model.

#### My two cents

This is a welcome development in the research on reasoning models based on neural networks. I've always thought it was unfortunate that the best benchmark available is based on synthetically generated cases. This work fixes this problem in a really clever way, while still being able to generate a large amount of training data.

Particularly clever is the random permutation of entity markers when processing each case. Thanks to that, a system cannot simply use general statistics on words to answer questions (e.g. just from the query "The hi-tech bra that helps you beat breast X" it's obvious that "cancer" is an excellent answer). In this setup, the system is forced to exploit the content of the article, thus ensuring that the benchmark is indeed measuring the system's question-answering abilities. Since the dataset itself is an important contribution of this paper, I hope the authors release it publicly in the near future.

The evaluation of the different neural architectures is also really thoroughly done. The non-neural baselines are reasonable, and the comparison between the neural nets is itself interesting, bringing more evidence that the softmax-weighted attentional mechanism (which has been gaining in popularity) indeed brings something over a regular LSTM approach.
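To make the construction above concrete, here is a minimal, hypothetical sketch of the cloze-style query generation and entity anonymization; the entity list, the "X" placeholder, and the `@entityN` marker format are assumptions for illustration, not the paper's exact pipeline.

```python
import random

def make_cloze_query(bullet_point, entities, rng=random):
    """Turn a summary bullet point into a (query, answer) pair by replacing
    one detected entity with a placeholder token."""
    present = [e for e in entities if e in bullet_point]
    if not present:
        return None
    answer = rng.choice(present)
    return bullet_point.replace(answer, "X"), answer

def entity_mapping(entities, rng=random):
    """Build a per-example, randomly permuted entity -> marker mapping
    (@entity0, @entity1, ...), so answers cannot be guessed from word
    statistics alone."""
    markers = [f"@entity{i}" for i in range(len(entities))]
    rng.shuffle(markers)
    return dict(zip(entities, markers))

def anonymize(text, mapping):
    """Apply the same per-example mapping to the article, query and answer."""
    for name, marker in mapping.items():
        text = text.replace(name, marker)
    return text

# Hypothetical example (not taken from the actual dataset):
bullet = "The hi-tech bra that helps you beat breast cancer"
entities = ["cancer"]  # entities found by an NER/coreference system
query, answer = make_cloze_query(bullet, entities)
mapping = entity_mapping(entities)
print(anonymize(query, mapping), "->", anonymize(answer, mapping))
```

In the real dataset the article body, the query, and the answer are all passed through the same per-example entity permutation, which is what forces systems to read the article rather than rely on word co-occurrence statistics.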
[link]
The authors introduce their contribution as an alternative way to approximate the KL divergence between prior and variational posterior used in [Variational Dropout and the Local Reparameterization Trick][kingma], one which allows unbounded variance on the multiplicative noise. When the noise variance parameter associated with a weight tends to infinity, you can say that the weight is effectively being removed, and in their implementation this is what they do.

There are some important details differing from the [original algorithm][kingma] on per-weight variational dropout. For both methods we have the following initialization for each dense layer:

```
theta = initialize weight matrix with shape (number of input units, number of hidden units)
log_alpha = initialize zero matrix with shape (number of input units, number of hidden units)
b = biases initialized to zero with length the number of hidden units
```

Where `log_alpha` is going to parameterise the variational posterior variance. In the original paper the algorithm was the following:

```
mean = dot(input, theta) + b  # standard dense layer
# marginal variance over activations (eq. 10 in the original paper)
variance = dot(input^2, theta^2 * exp(log_alpha))
# sample from the marginal distribution by scaling normal noise
activations = mean + sqrt(variance) * unit_normal(number of output units)
```

The final step is a standard [reparameterization trick][shakir], but since it is applied to the marginal distribution over activations it is referred to as a local reparameterization trick (directly inspired by the [fast dropout paper][fast]).

The sparsifying algorithm starts with an alternative parameterisation for `log_alpha`:

```
log_sigma2 = matrix filled with a negative constant (default -8) with shape (number of input units, number of hidden units)
log_alpha = log_sigma2 - log(theta^2)
log_alpha = log_alpha clipped between -8 and 8
```

The authors discuss this in section 4.1: the $\sigma_{ij}^2$ term corresponds to an additive noise variance on each weight, with $\sigma_{ij}^2 = \alpha_{ij}\theta_{ij}^2$. Since this can then be reversed to define `log_alpha`, the forward pass remains unchanged, but the variance of the gradient is reduced. It is quite a counter-intuitive trick, so much so I can't quite believe it works.

They then define a mask removing the contribution of weights whose noise variance has gone too high:

```
clip_mask = matrix with the shape of log_alpha, equal to 1 where log_alpha is greater than thresh (default 3)
```

The clip mask is used to set elements of `theta` to zero, and then the forward pass is exactly the same as in the original paper.

The difference in the approximation to the KL divergence is illustrated in figure 1 of the paper; the sparsifying version tends to zero as the variance increases, which matches the true KL divergence. In the [original paper][kingma] the KL divergence would explode, forcing them to clip the variances at a certain point.

[kingma]: https://arxiv.org/abs/1506.02557
[shakir]: http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/
[fast]: http://proceedings.mlr.press/v28/wang13a.html
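To make the pseudocode above concrete, here is a minimal numpy sketch of the sparsified forward pass under the `log_sigma2` parameterisation; the shapes, the -8 initialization, and the threshold of 3 follow the pseudocode, while the 1e-8 stabilisers and everything else are assumptions rather than the authors' reference implementation.

```python
import numpy as np

def sparse_vd_dense(x, theta, log_sigma2, b, thresh=3.0, rng=None):
    """Forward pass of a dense layer with sparse variational dropout,
    using the local reparameterization trick sketched above."""
    if rng is None:
        rng = np.random.default_rng()
    # Recover log_alpha from the additive-noise parameterisation and clip it.
    log_alpha = np.clip(log_sigma2 - np.log(theta ** 2 + 1e-8), -8.0, 8.0)
    # Prune weights whose multiplicative noise variance has grown too large.
    theta = theta * (log_alpha <= thresh)
    # Sample activations from their marginal distribution (local reparameterization).
    mean = x @ theta + b
    var = (x ** 2) @ (theta ** 2 * np.exp(log_alpha))
    return mean + np.sqrt(var + 1e-8) * rng.standard_normal(mean.shape)

# Hypothetical shapes, matching the pseudocode above:
x = np.random.randn(32, 64)               # batch of 32 examples, 64 input units
theta = 0.1 * np.random.randn(64, 128)    # 128 hidden units
log_sigma2 = np.full((64, 128), -8.0)     # the default negative-constant initialization
b = np.zeros(128)
h = sparse_vd_dense(x, theta, log_sigma2, b)
```

At test time one would typically drop the noise term and keep only the pruned mean weights, which is where the sparsity pays off.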
[link]
If you were to survey researchers, and ask them to name the 5 most broadly influential ideas in Machine Learning from the last 5 years, I’d bet good money that Batch Normalization would be somewhere on everyone’s lists. Before Batch Norm, training meaningfully deep neural networks was an unstable process, and one that often took a long time to converge to success. When we added Batch Norm to models, it allowed us to increase our learning rates substantially (leading to quicker training) without the risk of activations either collapsing or blowing up in value. It had this effect because it addressed one of the key difficulties of deep networks: internal covariate shift. To understand this, imagine the smaller problem of a one-layer model that’s trying to classify based on a set of input features. Now, imagine that, over the course of training, the input distribution of features moved around, so that, perhaps, a value that was at the 70th percentile of the data distribution initially is now at the 30th. We have an obvious intuition that this would make the model quite hard to train, because it would learn some mapping between feature values and class at the beginning of training, but that would become invalid by the end. This is, fundamentally, the problem faced by higher layers of deep networks, since, if the distribution of activations in a lower layer changes even by a small amount, that can cause a “butterfly effect” style outcome, where the activation distributions of higher layers change more dramatically.

Batch Normalization - which takes each feature “channel” a network learns, and normalizes it [normalize = subtract the mean, divide by the standard deviation] using the mean and variance of that feature over spatial locations and over all the observations in a given batch - helps solve this problem because it ensures that, throughout the course of training, the distribution of inputs that a given layer sees stays roughly constant, no matter what the lower layers get up to. On the whole, Batch Norm has been wildly successful at stabilizing training, and is now canonized - along with the likes of ReLU and Dropout - as one of the default sensible training procedures for any given network. However, it does have its difficulties and downsides. One salient one of these comes about when you train using very small batch sizes - in the range of 2-16 examples per batch. Under these circumstances, the mean and variance calculated from that batch are noisy and high variance (for the general reason that statistics calculated from small sample sizes are noisy and high variance), which takes away from the stability that Batch Norm is trying to provide.

One proposed alternative to Batch Norm, which doesn’t run into this problem of small sample sizes, is Layer Normalization. This operates under the assumption that the activations of all feature “channels” within a given layer have roughly similar distributions, and so you can normalize all of them by taking the aggregate mean over all channels, *for a given observation*, and use that as the mean and variance you normalize by. Because there are typically many channels in a given layer, this means that you have many “samples” that go into the mean and variance. However, this assumption - that the distributions for each feature channel are roughly the same - can be an incorrect one.
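A small numpy sketch may make the difference in averaging axes concrete: for an `(N, C, H, W)` activation tensor, Batch Norm averages over the batch and spatial axes per channel, while Layer Norm averages over channels and spatial positions per example. The epsilon and the omission of the learned scale/shift parameters are simplifications of my own.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """One mean/variance per channel, computed over batch and spatial axes."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    """One mean/variance per example, computed over channel and spatial axes."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(4, 32, 8, 8)  # a small batch: exactly where BN's statistics get noisy
bn_out, ln_out = batch_norm(x), layer_norm(x)
```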
A useful model I have for thinking about the distinction between these two approaches is the idea that both are calculating approximations of an underlying abstract notion: the in-the-limit mean and variance of a single feature channel, at a given point in time. Batch Normalization is an approximation of that insofar as it only has a small sample of points to work with, and so its estimate will tend to be high variance. Layer Normalization is an approximation insofar as it makes the assumption that feature distributions are aligned across channels: if this turns out not to be the case, individual channels will have normalizations that are biased, due to being pulled towards the mean and variance calculated over an aggregate of channels that are different from them.

Group Norm tries to find a balance point between these two approaches: one that uses multiple channels, and normalizes within a given instance (to avoid the problems of small batch size), but, instead of calculating the mean and variance over all channels, calculates them over a group of channels that represents a subset. The inspiration for this idea comes from the fact that, in old-school computer vision, it was typical to have parts of your feature vector that - for example - represented a histogram of some value (say: localized contrast) over the image, and these multiple values all corresponded to a larger shared “group” feature. If a group of features all represent a similar idea, then their distributions will be more likely to be aligned, and therefore you have less of the bias issue.

One confusing element of this paper for me was that the motivation part of the paper strongly implied that the reason group norm is sensible is that you are able to combine statistically dependent channels into a group together. However, as far as I can tell, there’s no actual clustering or similarity analysis of channels done to place certain channels into certain groups; it’s just done semi-randomly based on the index location within the feature channel vector. So, under this implementation, it seems like the benefits of group norm are less because of any explicit seeking out of dependent channels, and more that just having fewer channels in each group means that each individual channel makes up more of the weight in its group, which does something to reduce the bias effect anyway.

The upshot of the Group Norm paper, results-wise, is that Group Norm performs better than both Batch Norm and Layer Norm at very low batch sizes. This is useful if you’re training on very dense data (e.g. high-res video), where it might be difficult to store more than a few observations in memory at a time. However, once you get to batch sizes of ~24, Batch Norm starts to do better, presumably since that’s a large enough sample size to reduce variance, and you get to the point where the variance of BN is preferable to the bias of GN.
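Following the same sketch as above, Group Norm only changes which axes the statistics are taken over: channels are split into index-based groups (the group count here is an assumed parameter, not a recommendation), and each (example, group) block is normalized on its own.

```python
import numpy as np

def group_norm(x, num_groups=4, eps=1e-5):
    """One mean/variance per (example, group-of-channels) block; channels are
    grouped by index, not by any similarity analysis."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

x = np.random.randn(4, 32, 8, 8)
gn_out = group_norm(x)  # num_groups=1 recovers the layer norm computation above
```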
[link]
This paper continues in the tradition of curiosity-based models, which try to reward models for exploring novel parts of their environment, in the hope that this can intrinsically motivate learning. However, this paper argues that it’s insufficient to just treat novelty as an occasional bonus on top of a normal reward function, and that instead you should figure out a process that’s more specifically designed to increase novelty. Specifically: you should design a policy whose goal is to experience transitions and world-states that are high novelty.

In this setup, like in other curiosity-based papers, “high novelty” is defined in terms of a state being unpredictable given a prior state, history, and action. However, where other papers saw novelty reward as something only applied when the agent arrived at somewhere novel, here the authors build a model (technically, an ensemble of models) to predict the state at various future points. The ensemble is important here because it’s (quasi) bootstrapped, and thus gives us a measure of uncertainty. States where the predictions of the ensemble diverge represent places of uncertainty, and thus of high value to explore.

I don’t 100% follow the analytic specification of this idea (even though the heuristic/algorithmic description makes sense). The authors frame the utility function of a state and action as being equivalent to the Jensen-Shannon Divergence (~distance between probability distributions) shown below.

https://i.imgur.com/YIuomuP.png

Here, P(S' | S, a, T) is the probability of the next state given the prior state and action under a given model of the environment (transition model), and P(gamma) is the distribution over the space of possible transition models one might learn. A “model” here is one network out of the ensemble of networks that makes up our bootstrapped (trained on different sets) distribution over models. Conceptually, I think this calculation is measuring “how different is each sampled model/state distribution from all the other models in the distribution over possible models”. If the models within the distribution diverge from one another, that indicates a location of higher uncertainty (a rough sketch of this disagreement score is given after this summary).

What’s important about this is that, by building a full transition model, the authors can calculate the expected novelty or “utility” of future transitions the agent might take, because it can make a best guess based on this transition model (which, while called a “prior”, is really something trained on all data up to the current iteration). My understanding is that these kinds of models function similarly to a Q(s,a) or V(s) in a pure-reward case: they estimate the “utility reward” of different states and actions, and then the policy is updated to increase that expected reward.

I’ve recently read papers on ICM, and I was a little disappointed that this paper didn’t appear to benchmark against that, but against Bootstrapped DQN and Exploration Bonus DQN, which I know less well and can less speak to the conceptual differences from this approach. Another difficulty in actually getting a good sense of the results was that the task being tested on is fairly specific, and different from RL results coming out of the world of, e.g., Atari and DeepMind Lab. All of that said, this is a cautiously interesting idea, if the results generalize to beat more baselines on more environments.
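As a rough illustration of the disagreement signal described above (not the paper's exact utility function), the sketch below scores a state-action pair by the Jensen-Shannon divergence between the next-state distributions predicted by a bootstrapped ensemble; the discrete state space and the `predict_next_state_dist` interface are assumptions made for clarity.

```python
import numpy as np

def jensen_shannon(dists):
    """JSD over a set of discrete distributions: entropy of the mixture minus
    the average entropy of the members (zero iff all members agree)."""
    dists = np.asarray(dists)                        # shape: (ensemble_size, num_states)
    entropy = lambda p: -np.sum(p * np.log(p + 1e-12), axis=-1)
    mixture = dists.mean(axis=0)
    return entropy(mixture) - entropy(dists).mean()

def exploration_utility(ensemble, state, action):
    """Disagreement between ensemble members' predicted next-state distributions;
    high values mark transitions that are worth seeking out."""
    preds = [model.predict_next_state_dist(state, action) for model in ensemble]
    return jensen_shannon(preds)
```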
[link]
* GANs are based on adversarial training.
* Adversarial training is a basic technique to train generative models (so here primarily models that create new images).
* In adversarial training, one model (G, Generator) generates things (e.g. images). Another model (D, Discriminator) sees real things (e.g. real images) as well as fake things (e.g. images from G) and has to learn how to differentiate the two.
* Neural networks are models that can be trained in an adversarial way (and are the only models discussed here).

### How

* G is a simple neural net (e.g. just one fully connected hidden layer). It takes a vector as input (e.g. 100 dimensions) and produces an image as output.
* D is a simple neural net (e.g. just one fully connected hidden layer). It takes an image as input and produces a quality rating as output (0-1, so sigmoid).
* You need a training set of things to be generated, e.g. images of human faces.
* Let the batch size be B.
* G is trained the following way:
  * Create B vectors of 100 random values each, e.g. sampled uniformly from [-1, +1]. (The number of values per vector depends on the chosen input size of G.)
  * Feed forward the vectors through G to create new images.
  * Feed forward the images through D to create ratings.
  * Use a cross entropy loss on these ratings. All of these (fake) images should be viewed as label=0 by D; if D gives them label=1, the error for G will be low (G did a good job).
  * Perform a backward pass of the errors through D (without training D). That generates gradients/errors per image and pixel.
  * Perform a backward pass of these errors through G to train G.
* D is trained the following way:
  * Create B/2 images using G (again, B/2 random vectors, feed forward through G). These fake images get label=0.
  * Choose B/2 images from the training set. These real images get label=1.
  * Merge the fake and real images into one batch.
  * Feed forward the batch through D.
  * Measure the error using cross entropy.
  * Perform a backward pass with the error through D.
* Train G for one batch, then D for one (or more) batches. Sometimes D can be too slow to catch up with G; then you need more iterations of D per batch of G. (A minimal training-loop sketch is given at the end of these notes.)

### Results

* Good-looking images of MNIST digits and human faces. (Grayscale, rather homogeneous datasets.)
* Not so good-looking images for CIFAR-10. (Color, rather heterogeneous dataset.)

*Faces generated by MLP GANs. (Rightmost column shows examples from the training set.)*

-------------------------

### Rough chapter-wise notes

* Introduction
  * Discriminative models have performed well so far, generative models not so much.
  * Their suggested new architecture involves a generator and a discriminator.
  * The generator learns to create content (e.g. images), the discriminator learns to differentiate between real content and generated content.
  * Analogy: the generator produces counterfeit art, the discriminator's job is to judge whether a piece of art is counterfeit.
  * This principle could be used with many techniques, but they use neural nets (MLPs) for both the generator and the discriminator.
* Adversarial Nets
  * They have a Generator G (a simple neural net).
    * G takes a random vector as input (e.g. a vector of 100 random values between -1 and +1).
    * G creates an image as output.
  * They have a Discriminator D (a simple neural net).
    * D takes an image as input (can be real or generated by G).
    * D creates a rating as output (quality, i.e. a value between 0 and 1, where 0 means "probably fake").
  * Outputs from G are fed into D.
  * The result can then be backpropagated through D and then G. G is trained to maximize log(D(image)), i.e. to create a high value of D(image).
  * D is trained to produce 0s for images from G and 1s for images from the training set.
  * Both are trained simultaneously, i.e. one batch for G, then one batch for D, then one batch for G...
  * D can also be trained multiple times in a row. That allows it to catch up with G.
* Theoretical Results
  * Let
    * pd(x): probability that image `x` appears in the training set.
    * pg(x): probability that image `x` appears in the images generated by G.
  * If G is fixed, then the best possible D classifies according to: `D(x) = pd(x) / (pd(x) + pg(x))`
  * It is provable that there is only one global optimum for GANs, which is reached when G perfectly replicates the training set's probability distribution. (Assuming unlimited capacity of the models and unlimited training time.)
  * It is provable that G and D will converge to the global optimum, so long as D gets enough steps per training iteration to model the distribution generated by G. (Again, assuming unlimited capacity/time.)
  * Note that these things are proven for the general GAN principle. Implementing GANs with neural nets can then introduce problems typical of neural nets (e.g. getting stuck in saddle points).
* Experiments
  * They tested on MNIST, the Toronto Face Database (TFD) and CIFAR-10.
  * They used MLPs for G and D.
    * G contained ReLUs and sigmoids.
    * D contained maxouts.
    * D had dropout, G didn't.
  * They use a Parzen window estimate aka KDE (sigma obtained via cross validation) to estimate the quality of their images.
  * They note that KDE is not really a great technique for such high-dimensional spaces, but it's the only one known.
  * Results on MNIST and TFD are great. (Note: both grayscale.)
  * CIFAR-10 seems to match more the texture but not really the structure.
  * Noise is noticeable in CIFAR-10 (a bit in TFD too). It comes from the MLPs (no convolutions).
  * Their KDE score for MNIST and TFD is competitive with or better than other approaches.
* Advantages and Disadvantages
  * Advantages
    * No Markov chains, only backprop.
    * Inference-free training.
    * A wide variety of functions can be incorporated into the model (?)
    * The generator never sees any real example. It only gets gradients. (Prevents overfitting?)
    * Can represent a wide variety of distributions, including sharp ones (Markov chains only work with blurry images).
  * Disadvantages
    * No explicit representation of the distribution modeled by G (?)
    * D and G must be well synchronized during training.
    * If G is trained too much (i.e. D can't catch up), it can collapse many components of the random input vectors to the same output ("Helvetica").
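To tie the training procedure from the "How" section together, here is a minimal PyTorch sketch of the alternating updates on flattened images; the layer sizes, optimizer, learning rate, and the `real_images` batch are assumptions for illustration rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

Z_DIM, IMG_DIM, B = 100, 28 * 28, 64  # assumed noise size, image size, batch size

G = nn.Sequential(nn.Linear(Z_DIM, 256), nn.ReLU(), nn.Linear(256, IMG_DIM), nn.Sigmoid())
D = nn.Sequential(nn.Linear(IMG_DIM, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.SGD(G.parameters(), lr=0.01)
opt_d = torch.optim.SGD(D.parameters(), lr=0.01)
bce = nn.BCELoss()

def train_step(real_images, d_steps=1):
    """One alternating update: d_steps batches for D, then one batch for G."""
    # D: real images get label 1, fake images (detached from G) get label 0.
    for _ in range(d_steps):
        fake = G(torch.rand(B // 2, Z_DIM) * 2 - 1).detach()   # uniform noise in [-1, 1]
        inputs = torch.cat([real_images[: B // 2], fake])
        labels = torch.cat([torch.ones(B // 2, 1), torch.zeros(B // 2, 1)])
        loss_d = bce(D(inputs), labels)
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

    # G: backprop through D without updating it; G "wins" when D rates fakes as real.
    fake = G(torch.rand(B, Z_DIM) * 2 - 1)
    loss_g = bce(D(fake), torch.ones(B, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```

Using label 1 as G's target implements the "maximize log(D(image))" objective from the notes, rather than minimizing log(1 - D(image)), which the paper notes tends to saturate early in training.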