[link]
Narodytska and Kasiviswanathan propose a local search-based black-box adversarial attack against deep networks. In particular, they address the problem of k-misclassification, defined as follows: Definition (k-misclassification). A neural network k-misclassifies an image if the true label is not among the k likeliest labels. To this end, they propose a local search algorithm which, in each round, randomly perturbs individual pixels in a local search area around the last perturbation. If a perturbed image satisfies the k-misclassification condition, it is returned as an adversarial perturbation. While the approach is very simple, it is applicable to black-box models where gradients or internal representations are not accessible and only the final score/probability is available. Still, the approach seems to be quite inefficient, taking one or more seconds to generate an adversarial example. Unfortunately, the authors do not discuss qualitative results and do not give examples of multiple adversarial examples (except for the four in Figure 1). https://i.imgur.com/RAjYlaQ.png Figure 1: Examples of adversarial attacks. Top: original image, bottom: perturbed image.
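The search procedure described above can be sketched roughly as follows. This is a simplified illustration only: the function names, parameter values, and the pixel-selection heuristic are my own assumptions, not the paper's exact algorithm.

```python
import numpy as np

def local_search_attack(image, predict_topk, true_label, k=1,
                        perturb_value=1.0, n_pixels=5, radius=2,
                        max_rounds=50, rng=None):
    """Sketch of a local-search black-box attack (hypothetical helper
    names; the paper's scoring/selection heuristics are simplified away).
    predict_topk(image, k) is assumed to return the k likeliest labels."""
    rng = rng or np.random.default_rng(0)
    adv = image.copy()
    h, w = adv.shape[:2]
    # start the search around a random pixel
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    for _ in range(max_rounds):
        # perturb a few random pixels inside the local search area
        for _ in range(n_pixels):
            y = int(np.clip(cy + rng.integers(-radius, radius + 1), 0, h - 1))
            x = int(np.clip(cx + rng.integers(-radius, radius + 1), 0, w - 1))
            adv[y, x] = perturb_value
            cy, cx = y, x  # recentre the search area on the last perturbation
        # k-misclassification: the true label must leave the top-k labels
        if true_label not in predict_topk(adv, k):
            return adv
    return None  # attack failed within the round budget
```

Only the black-box interface `predict_topk` is needed, which is the point of the method: no gradients or internal representations are queried.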
[link]
### Contribution

The author conducts five experiments on EC2 to assess the impact of software-defined virtual networking (SDVN) with HTTP on composite container applications. Compared to previous container performance studies, it contributes new insight into the overlay networking aspect specifically for VM-hosted containers. Evidently, the SDVN causes a major performance loss, whereas the container itself as well as the encryption cause minor (but still not negligible) losses. The results indicate that further practical work on container networking tools and stacks is needed for performance-critical distributed applications.

### Strong points

The methodology of measuring the performance against a baseline performance result is appropriate. The author provides the benchmark tooling (ppbench) and reference results (in dockerised form) to enable recomputable research.

### Weak points

The title mentions microservices and the abstract promises design recommendations for microservice architectures. Yet the paper only discusses containers, which are a potential implementation technology but neither necessary for nor guaranteed to be microservices. Reducing the paper's scope to just containers would be fair.

The introduction contains an unnecessarily redundant mention of Kubernetes, CoreOS, Mesos and reference [9] around the column wrap. The notation of SDN vs. SDVN is inconsistent between text and images; since SDN is a wide area of research, the consistent use of SDVN is recommended.

Fig. 3b is not clearly labelled: for the resulting transfer losses, 100% means no loss, which is confusing. The y axis should presumably be inverted so that losses show highest for SDVN at about 70%. The performance breakdown around 300 kB messages in Fig. 2 is not sufficiently explained. Is it a repeating phenomenon which might be related to packet scheduling? The "just Docker" networking configuration is not explained: does it run in host or bridge mode? Which version of Docker was used? The size and time distribution of the 6 million HTTP requests should also be explained in greater detail to see how much randomness was involved.

### Further comments

The work assumes that containers are always hosted in virtual machines, while bare-metal container hosting in the form of CaaS becomes increasingly available (Triton, CoreOS OnMetal, etc.). The results by Felter et al. are mentioned but not put into perspective. A comparison of how the networking is affected by VM/BM hosting would be a welcome addition, although AWS would probably not be a likely environment due to ECS running atop EC2.
[link]
**Idea:** With the growing use of visual explanation systems for machine learning models, such as saliency maps, there needs to be a standardized method of verifying whether a saliency method correctly describes the underlying ML model.

**Solution:** In this paper two sanity checks are proposed to verify the accuracy and faithfulness of a saliency method:

* *Model parameter randomization test:* In this sanity check, the outputs of a saliency method on a trained model are compared to those of the same method on an untrained, randomly parameterized model. If these images are similar/identical, then the saliency method does not correctly describe the model. In the course of this experiment it is found that certain methods, such as Guided BackProp, are constant in their explanations despite alterations to the model.
* *Data randomization test:* This test explores the relationship of saliency methods to the data and their associated labels. Here, the labels of the training data are randomized, so there should be no definite pattern describing the model (since the model is as good as randomly guessing an output label). If there is a definite pattern, the saliency method is independent of the underlying model/training data labels. In this test as well, Guided BackProp did not fare well, implying this saliency method acts more like an edge detector than an ML explainer.

Thus this paper makes a valid argument toward having standardized tests that an interpretation method must satisfy to be deemed accurate or faithful.
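As a rough illustration of the first check, here is a minimal numpy sketch. It uses a plain linear "model" and gradient saliency as stand-ins; the paper evaluates real networks and many saliency methods, so everything here is my own simplification.

```python
import numpy as np

def saliency(W, x):
    """Gradient saliency for a linear 'model' with scores W @ x: the
    gradient of the top class score w.r.t. the input is just the
    corresponding row of W (a stand-in for a real saliency method)."""
    top = np.argmax(W @ x)
    return np.abs(W[top])

def parameter_randomization_test(W, x, rng):
    """Sketch of the model-parameter randomization check: compute the
    saliency map for the trained weights and for freshly randomized
    weights, then compare by rank correlation. High similarity flags a
    saliency method that does not actually depend on the learned model."""
    trained_map = saliency(W, x)
    W_random = rng.normal(size=W.shape)     # re-initialized 'model'
    random_map = saliency(W_random, x)
    # Spearman-style rank correlation between the two maps
    r1 = np.argsort(np.argsort(trained_map)).astype(float)
    r2 = np.argsort(np.argsort(random_map)).astype(float)
    return float(np.corrcoef(r1, r2)[0, 1])
```

A faithful method should give a correlation near zero here; a method that behaves like an edge detector would score high regardless of the weights.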
[link]
* The original R-CNN had three major disadvantages:
  1. Two-stage training pipeline: Instead of only training a CNN, one had to first train a CNN and then multiple SVMs.
2. Expensive training: Training was slow and required lots of disk space (feature vectors needed to be written to disk for all region proposals (2000 per image) before training the SVMs).
3. Slow test: Each region proposal had to be handled independently.
* Fast R-CNN is an improved version of R-CNN and tackles the mentioned problems.
* It no longer uses SVMs, only CNNs (single-stage).
* It does one single feature extraction per image instead of per region, making it much faster (9x faster at training, 213x faster at test).
* It is more accurate than R-CNN.
### How
* The basic architecture, training and testing methods are mostly copied from R-CNN.
* For each image at test time they do:
* They generate region proposals via selective search.
* They feed the image once through the convolutional layers of a pre-trained network, usually VGG16.
* For each region proposal they extract the respective region from the features generated by the network.
* The regions can have different sizes, but the following steps need fixed size vectors. So each region is downscaled via max-pooling so that it has a size of 7x7 (so apparently they ignore regions of sizes below 7x7...?).
* This is called Region of Interest Pooling (RoI-Pooling).
    * During the backward pass, partial derivatives are routed to the maximum value (as usual in max pooling). These derivative values are summed up over different regions (in the same image).
* They reshape the 7x7 regions to vectors of length `F*7*7`, where `F` was the number of filters in the last convolutional layer.
* They feed these vectors through another network which predicts:
1. The class of the region (including background class).
2. Top left x-coordinate, top left y-coordinate, log height and log width of the bounding box (i.e. it fine-tunes the region proposal's bounding box). These values are predicted once for every class (so `K*4` values).
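A minimal numpy sketch of the RoI max-pooling step described above (the names and the cell-boundary scheme are my own simplifications, not the paper's implementation):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    """Minimal RoI max-pooling sketch (single image, one RoI).
    feature_map: (F, H, W) conv features; roi: (y0, x0, y1, x1) in
    feature-map coordinates. The RoI is split into an out_size x out_size
    grid and max-pooled per cell, giving a fixed-size output regardless
    of the region's shape."""
    y0, x0, y1, x1 = roi
    region = feature_map[:, y0:y1, x0:x1]
    F, h, w = region.shape
    out = np.zeros((F, out_size, out_size))
    ys = np.linspace(0, h, out_size + 1).astype(int)  # grid cell boundaries
    xs = np.linspace(0, w, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            if cell.size:  # regions smaller than 7x7 leave some cells empty
                out[:, i, j] = cell.reshape(F, -1).max(axis=1)
    return out.reshape(-1)  # flatten to the F*7*7 vector fed to the FC layers
```

Note that with this grid scheme, regions smaller than 7x7 simply produce empty (zero) cells rather than being discarded.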
* Architecture as image:
* Sampling for training
* Efficiency
* If batch size is `B` it is inefficient to sample regions proposals from `B` images as each image will require a full forward pass through the base network (e.g. VGG16).
* It is much more efficient to use few images to share most of the computation between region proposals.
* They use two images per batch (each 64 region proposals) during training.
* This technique introduces correlations between examples in batches, but they did not observe any problems from that.
* They call this technique "hierarchical sampling" (first images, then region proposals).
* IoUs
* Positive examples for specific classes during training are region proposals that have an IoU with ground truth bounding boxes of `>=0.5`.
* Examples for background region proposals during training have IoUs with any ground truth box in the interval `(0.1, 0.5]`.
* Not picking IoUs below 0.1 is similar to hard negative mining.
* They use 25% positive examples, 75% negative/background examples per batch.
* They apply horizontal flipping as data augmentation, nothing else.
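The IoU-based labelling rule above can be illustrated with a small sketch (helper names are my own):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union for boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_proposal(proposal, gt_boxes):
    """Fast R-CNN sampling rule: best IoU with any ground-truth box
    >= 0.5 -> positive; in (0.1, 0.5] -> background; below 0.1 the
    proposal is not sampled at all (akin to hard negative mining)."""
    best = max(iou(proposal, gt) for gt in gt_boxes)
    if best >= 0.5:
        return "positive"
    if best > 0.1:
        return "background"
    return "discard"
```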
* Outputs
  * For their class predictions they use a simple softmax with negative log likelihood.
* For their bounding box regression they use a smooth L1 loss (similar to mean absolute error, but switches to mean squared error for very low values).
* Smooth L1 loss is less sensitive to outliers and less likely to suffer from exploding gradients.
* The smooth L1 loss is only active for positive examples (not background examples). (Not active means that it is zero.)
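The smooth L1 loss has a simple closed form; this is the standard formulation (the paper additionally weights it and sums it over the four box-regression targets):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss used for box regression: quadratic for |x| < 1
    (like squared error), linear beyond (like absolute error), which
    keeps gradients bounded for outliers."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)
```

The two branches meet at |x| = 1 with matching value and slope, so the loss is smooth at the switch-over point.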
* Training schedule
  * They use SGD.
* They train 30k batches with learning rate 0.001, then 0.0001 for another 10k batches. (On Pascal VOC, they use more batches on larger datasets.)
* They use twice the learning rate for the biases.
* They use momentum of 0.9.
* They use parameter decay of 0.0005.
* Truncated SVD
* The final network for class prediction and bounding box regression has to be applied to every region proposal.
* It contains one large fully connected hidden layer and one fully connected output layer (`K+1` classes plus `K*4` regression values).
* For 2000 proposals that becomes slow.
* So they compress the layers after training to less weights via truncated SVD.
  * A weight matrix W (`u x v`) is approximated via `W ≈ U Sigma V^T`, where:
* U (`u x t`) are the first `t` left-singular vectors of W.
* Sigma is a `t x t` diagonal matrix of the top `t` singular values.
* V (`v x t`) are the first `t` right-singular vectors of W.
* W is then replaced by two layers: One contains `Sigma V^T` as weights (no biases), the other contains `U` as weights (with original biases).
* Parameter count goes down to `t(u+v)` from `uv`.
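A small numpy sketch of this compression step (the function name is my own; Fast R-CNN applies this to the fully connected layers after training):

```python
import numpy as np

def compress_fc_layer(W, t):
    """Truncated-SVD compression of a fully connected weight matrix W
    (u x v) into two smaller layers: W ~ U_t Sigma_t V_t^T.
    Returns the two replacement weight matrices."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    layer1 = np.diag(s[:t]) @ Vt[:t]   # (t x v), applied first, no biases
    layer2 = U[:, :t]                  # (u x t), keeps the original biases
    return layer1, layer2
```

For small `t` the two layers hold `t(u+v)` parameters instead of `uv`, at the cost of an approximation error controlled by the discarded singular values.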
### Results
* They try three base models:
* AlexNet (Small, S)
* VGG-CNN-M-1024 (Medium, M)
* VGG16 (Large, L)
* On VGG16 and Pascal VOC 2007, compared to original R-CNN:
* Training time down to 9.5h from 84h (8.8x faster).
* Test rate *with SVD* (1024 singular values) improves from 47 seconds per image to 0.22 seconds per image (213x faster).
* Test rate *without SVD* improves similarly to 0.32 seconds per image.
* mAP improves from 66.0% to 66.6% (66.9% without SVD).
* Per class accuracy results:
* Fixing the weights of VGG16's convolutional layers and only fine-tuning the fully connected layers (those are applied to each region proposal), decreases the accuracy to 61.4%.
* This decrease in accuracy is most significant for the later convolutional layers, but marginal for the first layers.
  * Therefore they only train the convolutional layers starting with `conv3_1` (9 out of 13 layers), which speeds up training.
* Multi-task training
* Training models on classification and bounding box regression instead of only on classification improves the mAP (from 62.6% to 66.9%).
  * Doing this in one network instead of two separate models (one for classification, one for bounding box regression) increases mAP by roughly 2-3 percentage points.
* They did not find a significant benefit of training the model on multiple scales (e.g. same image sometimes at 400x400, sometimes at 600x600, sometimes at 800x800 etc.).
* Note that their raw CNN (everything before RoI-Pooling) is fully convolutional, so they can feed the images at any scale through the network.
* Increasing the amount of training data seemed to improve mAP a bit, but not as much as one might hope for.
* Using a softmax loss instead of an SVM seemed to marginally increase mAP (0-1 percentage points).
* Using more region proposals from selective search does not simply increase mAP. Instead it can lead to higher recall, but lower precision.
* Using densely sampled region proposals (as in sliding window) significantly reduces mAP (from 59.2% to 52.9%). If SVMs instead of softmaxes are used, the results are even worse (49.3%).
[link]
This work attempts to use meta-learning to learn an update rule for a reinforcement learning agent. In this context, "learning an update rule" means learning the parameters of an LSTM module that takes in information about the agent's recent reward and current model, and outputs two values - a scalar and a vector - that are used to update the agent's model. I'm not going to go too deep into meta-learning here, but, at a high level, meta-learning methods optimize parameters governing an agent's learning and, over the course of many training processes over many environments, optimize those parameters such that the reward over the full lifetime of training is higher.

To be more concrete, the agent in a given environment learns two things:

- A policy, that is, a distribution over predicted actions given a state.
- A "prediction vector". This fits in the conceptual slot where most RL algorithms would learn some kind of value or Q function, to predict how much future reward can be expected from a given state. However, in this context, this vector is *very explicitly* not a value function, but is just a vector that the agent-model generates and updates. The notion here is that maybe our human-designed construction of a value function isn't actually the best quantity for an agent to be predicting, and, if we meta-learn, we might find something more optimal.
I'm a little bit confused about the structure of this vector, but I think it's *intended* to be a categorical 1-of-m prediction. At each step, after acting in the environment, the agent passes to an LSTM:

- The reward at the step
- A binary indicating whether the trajectory is done
- The discount factor
- The probability of the action that was taken from state t
- The prediction vector evaluated at state t
- The prediction vector evaluated at state t+1

Given that as input (and given access to its past history from earlier in the training process), the LSTM predicts two things:

- A scalar, pi-hat
- A prediction vector, y-hat

These two quantities are used to update the existing policy and prediction model according to the rule below.

https://i.imgur.com/xx1W9SU.png

Conceptually, the scalar governs whether to increase or decrease the probability assigned to the taken action under the policy, and y-hat serves as a target for the prediction vector to be pulled towards. An important thing to note about the LSTM structure is that none of the quantities it takes as input are dependent on the action or observation space of the environment, so, once it is learned, it can (hopefully) generalize to new environments. Given this, the basic meta-learning objective falls out fairly easily - optimize the parameters of the LSTM to maximize lifetime reward, taken in expectation over training runs. However, things don't turn out to be quite that easy. The simplest version of this meta-learning objective is wildly unstable and difficult to optimize, and the authors had to add a number of training hacks in order to get something that would work. (It really is dramatic, by the way, how absolutely essential these are to training something that actually learns a prediction vector.)
These include:

- An entropy bonus, pushing the meta-learned parameters to learn policies and prediction vectors that have higher entropy (which is to say: are less deterministic)
- An L2 penalty on both pi-hat and y-hat
- A removal of the softmax that had originally been taken over the k-dimensional prediction vector categorical, switching that target from a KL divergence to a straight mean squared error loss. As far as I can tell, this makes the prediction vector no longer actually a 1-of-k categorical, but instead just a continuous vector, with each value between 0 and 1, which makes it make more sense to think of it as k separate binaries? This I was definitely confused about in the paper overall.

https://i.imgur.com/EL8R1yd.png

With the help of all of these regularizers, the authors were able to get something that trained, and that appeared to perform comparably to or better than A2C - the human-designed baseline - across the simple grid-worlds it was being trained in. However, the two most interesting aspects of the evaluation were:

1. The authors showed that, given the values of the prediction vector, you could predict the true value of a state quite well, suggesting that the vector captures most of the information about which states are high value. Beyond that, they found that the meta-learned vector could be used to predict values calculated with discount rates different than the one used in the meta-learned training, which the hand-engineered alternative, TD-lambda, wasn't able to do (it could only predict values well at the same discount rate used to calculate it). This suggests that the network really is learning some more robust notion of value that isn't tied to a specific discount rate.
2. They also found that they were able to deploy the LSTM update rule learned on grid worlds to Atari games, and have it perform reasonably well - beating A2C in a few cases, though certainly not all.
This is fairly impressive, since it's an example of a rule learned on a different, much simpler set of environments generalizing to more complex ones, and suggests that there's something intrinsic to reinforcement learning that it's capturing.
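For reference, the per-step input to the learned update rule, as listed in the summary above, depends only on scalars and the two prediction vectors - nothing tied to a particular action or observation space - which is exactly what makes the grid-world-to-Atari transfer possible. A tiny sketch (the function name and ordering are my own assumptions):

```python
import numpy as np

def lstm_input(reward, done, discount, action_prob, y_t, y_tp1):
    """Assemble the environment-agnostic per-step input to the
    meta-learned LSTM: reward, done flag, discount factor, probability
    of the taken action, and the prediction vector at states t and t+1."""
    return np.concatenate([
        [reward, float(done), discount, action_prob],
        y_t,    # prediction vector at state t
        y_tp1,  # prediction vector at state t+1
    ])
```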