Welcome to ShortScience.org!
[link]
* They use an implementation of Q-learning (a reinforcement learning technique) with CNNs to automatically play Atari games.
* The algorithm receives the raw pixels as its input and has to choose buttons to press as its output. No hand-engineered features are used. So the model "sees" the game and "uses" the controller, just like a human player would.
* The model achieves good results on various games, beating all previous techniques and sometimes even surpassing human players.
### How
* Deep Q Learning
* *This is yet another explanation of deep Q-learning; see also [this blog post](http://www.nervanasys.com/demystifying-deep-reinforcement-learning/) for a longer explanation.*
* While playing, sequences of the form (`state1`, `action1`, `reward`, `state2`) are generated.
* `state1` is the current game state. The agent only sees the pixels of that state. (Example: Screen shows enemy.)
* `action1` is an action that the agent chooses. (Example: Shoot!)
* `reward` is the direct reward received for picking `action1` in `state1`. (Example: +1 for a kill.)
* `state2` is the next game state, after the action was chosen in `state1`. (Example: Screen shows dead enemy.)
* One can pick actions at random for some time to generate lots of such tuples. That leads to a replay memory.
* Direct reward
* After playing randomly for some time, one can train a model to predict the direct reward given a screen (we don't want to use the whole state, just the pixels) and an action, i.e. `Q(screen, action) -> direct reward`.
* That function would need a forward pass for each possible action that we could take. So for e.g. 8 buttons that would be 8 forward passes. To make things more efficient, we can let the model directly predict the direct reward for each available action, e.g. for 3 buttons `Q(screen) -> (direct reward of action1, direct reward of action2, direct reward of action3)`.
* We can then sample examples from our replay memory. The input per example is the screen. The output is the reward as a tuple. E.g. if we picked button 1 of 3 in one example and received a reward of +1 then our output/label for that example would be `(1, 0, 0)`.
* We can then train the model by playing completely randomly for some time, then sample some batches and train using a mean squared error. Then play a bit less randomly, i.e. start to use the action which the network thinks would generate the highest reward. Then train again, and so on.
* Indirect reward
* Doing the previous steps, the model will learn to anticipate the *direct* reward correctly. However, we also want it to predict indirect rewards. Otherwise, the model e.g. would never learn to shoot rockets at enemies, because the reward from killing an enemy would come many frames later.
* To learn the indirect reward, one simply adds the value of the highest-valued action according to `Q(state2)` to the direct reward.
* I.e. if we have a tuple (`state1`, `action1`, `reward`, `state2`), we would not add (`state1`, `action1`, `reward`) to the replay memory, but instead (`state1`, `action1`, `reward + highestReward(Q(screen2))`), where `screen2` are the pixels of `state2` and `highestReward()` returns the value of the action with the highest predicted reward according to `Q()`.
* By training to predict `reward + highestReward(Q(screen2))` the network learns to anticipate the direct reward *and* the indirect reward. It takes a leap of faith to accept that this will ever converge to a good solution, but it does.
* We then add `gamma` to the equation: `reward + gamma*highestReward(Q(screen2))`. `gamma` may be set to 0.9. It is a discount factor that devalues future states, e.g. because the world is not deterministic and therefore we can't exactly predict what's going to happen. Note that Q automatically learns to compound the discount, e.g. `state3` is discounted by `gamma^2` when seen from `state1`. (A small sketch of this target construction follows below.)
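A minimal sketch of this target construction, under a few assumptions not in the summary: `q_net` is a hypothetical function mapping a batch of screens to a NumPy array of per-action value estimates, transitions carry an extra `done` flag, and untaken actions keep the network's own prediction so they contribute zero error (a common variant of the `(1, 0, 0)`-style labels above).

```python
import numpy as np

GAMMA = 0.9  # discount factor mentioned above

def build_targets(q_net, batch):
    """Turn sampled (screen1, action, reward, screen2, done) transitions
    into regression targets for Q, including the discounted indirect reward."""
    screens1 = np.stack([t[0] for t in batch])
    screens2 = np.stack([t[3] for t in batch])

    targets = q_net(screens1).copy()      # start from the current predictions
    next_q = q_net(screens2)              # Q(screen2) for every action
    for i, (_, action, reward, _, done) in enumerate(batch):
        best_future = 0.0 if done else next_q[i].max()   # highestReward(Q(screen2))
        targets[i, action] = reward + GAMMA * best_future
    return screens1, targets              # train q_net on these with a squared error
```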
* This paper
* They use the mentioned Deep Q Learning to train their model Q.
* They use a frame-skipping technique, i.e. they let the model choose an action only at every k-th (here: 4th) frame.
* Q is implemented via a neural net. It receives 84x84x4 grayscale pixels that show the game and projects them onto the expected rewards (Q-values) of 4 to 18 actions.
* The input is HxWx4 because they actually feed the last 4 frames into the network, instead of just 1 frame, so the network has information about how things are moving.
* The network architecture (sketched in code after this list) is:
* 84x84x4 (input)
* 16 convs, 8x8, stride 4, ReLU
* 32 convs, 4x4, stride 2, ReLU
* 256 fully connected neurons, ReLU
* <N_actions> fully connected neurons, linear
* They use a replay memory of 1 million frames.
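The paper predates today's deep learning frameworks; as an illustration only, a PyTorch sketch of the architecture listed above (layer sizes from the summary, everything else my choice) could look like:

```python
import torch.nn as nn

class DQN(nn.Module):
    """84x84x4 input -> two conv layers -> 256 hidden units -> one linear Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 16x20x20
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 16x20x20 -> 32x9x9
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),                               # linear output layer
        )

    def forward(self, x):          # x: (batch, 4, 84, 84) grayscale frames
        return self.net(x)

q_net = DQN(n_actions=18)          # games expose between 4 and 18 actions
```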
### Results
* They ran experiments on the Atari games Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest and Space Invaders.
* Same architecture and hyperparameters for all games.
* Rewards were based on score changes in the games, i.e. +1 whenever the score increased and -1 whenever it decreased.
* Optimizer: RMSProp, Batch Size: 32.
* Trained for 10 million examples/frames per game.
* They had no problems with instability and their average Q value per game increased smoothly.
* Their method beats all other state of the art methods.
* They managed to beat a human player in the games that did not require long-term strategies (the fewer frames a strategy has to span, the better).
* Video (starts at 46:05): https://youtu.be/dV80NAlEins?t=46m05s

*The original full algorithm, as shown in the paper.*
--------------------
### Rough chapter-wise notes
* (1) Introduction
* Problems when using neural nets in reinforcement learning (RL):
* The reward signal is often sparse, noisy and delayed.
* Supervised learning usually assumes that data samples are independent, while in RL they are correlated.
* Data distribution can change when the algorithm learns new behaviours.
* They use Q-learning with a CNN and stochastic gradient descent.
* They use an experience replay mechanism (i.e. memory) from which they can sample previous transitions (for training).
* They apply their method to Atari 2600 games in the Arcade Learning Environment (ALE).
* They use only the visible pixels as input to the network, i.e. no manual feature extraction.
* (2) Background
* blablabla, standard deep q learning explanation
* (3) Related Work
* TD-Gammon: "solved" backgammon. Worked similarly to Q-learning and used a multi-layer perceptron.
* Attempts to transfer the TD-Gammon approach to other games failed.
* Research was focused on linear function approximators as there were problems with non-linear ones diverging.
* Recently there has been renewed interest in using neural nets for reinforcement learning, including attempts to fix the divergence problems with gradient temporal-difference methods.
* NFQ is a very similar method (to the one in this paper), but worked on the whole batch instead of minibatches, making it slow. It also first applied dimensionality reduction via autoencoders on the images instead of training on them end-to-end.
* HyperNEAT was applied to Atari games and evolved a neural net for each game. The networks learned to exploit design flaws.
* (4) Deep Reinforcement Learning
* They want to connect a reinforcement learning algorithm with a deep neural network, e.g. to get rid of handcrafted features.
* The network is supposed to run on the raw RGB images.
* They use experience replay, i.e. they store transition tuples (pixels, chosen action, received reward, next pixels) in a memory and sample from it during training.
* They use Q-learning.
* They use an epsilon-greedy policy.
* Advantages from using experience replay instead of learning "live" during game playing:
* Experiences can be reused many times (more efficient).
* Samples are less correlated.
* The parameters learned from one batch have less influence on the distribution of the examples in the next batch.
* They save the last N experiences and sample uniformly from them during training.
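A minimal sketch of such a memory (capacity and batch size taken from the experiments section; class and method names are mine):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the last N experiences and samples uniformly from them."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # old experiences fall out automatically

    def store(self, screen, action, reward, next_screen, done):
        self.buffer.append((screen, action, reward, next_screen, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```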
* (4.1) Preprocessing and Model Architecture
* Raw Atari images are 210x160 pixels with 128 possible colors.
* They downsample them to 110x84 pixels and then crop the 84x84 playing area out of them.
* They also convert the images to grayscale.
* They use the last 4 frames as input and stack them.
* So their network input has shape 84x84x4.
* They use one output neuron per possible action. So they can compute the Q-value (expected reward) of each action with one forward pass.
* Architecture: 84x84x4 (input) => 16 8x8 convs, stride 4, ReLU => 32 4x4 convs stride 2 ReLU => fc 256, ReLU => fc N actions, linear
* 4 to 18 actions/outputs (depends on the game).
* Aside from the outputs, the architecture is the same for all games.
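A sketch of that preprocessing pipeline (OpenCV is used purely as an example library, and the exact crop offset is my guess; the paper only says the playing area is cropped):

```python
import numpy as np
import cv2  # OpenCV, an assumption; the paper does not prescribe a library

def preprocess(rgb_frame: np.ndarray) -> np.ndarray:
    """210x160 RGB Atari frame -> 84x84 grayscale crop of the playing area."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)   # 210x160
    small = cv2.resize(gray, (84, 110))                  # cv2 takes (width, height)
    return small[18:102, :]                              # 84x84; offset is illustrative

def network_input(last_four_frames) -> np.ndarray:
    """Stack the last 4 preprocessed frames into the 84x84x4 input."""
    return np.stack(last_four_frames, axis=-1)
```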
* (5) Experiments
* Games that they played: Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest, Space Invaders
* They use the same architecture and hyperparameters for all games.
* They give a reward of +1 whenever the in-game score increases and -1 whenever it decreases.
* They use RMSProp.
* Mini batch size was 32.
* They train for 10 million frames/examples.
* They initialize epsilon (in their epsilon greedy strategy) to 1.0 and decrease it linearly to 0.1 at one million frames.
* They let the agent decide upon an action at every 4th in-game frame (3rd in space invaders).
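A sketch of that epsilon-greedy policy with linear annealing (function names are mine):

```python
import random

def epsilon_at(frame: int, start=1.0, end=0.1, anneal_frames=1_000_000) -> float:
    """Linearly anneal epsilon from 1.0 to 0.1 over the first million frames."""
    if frame >= anneal_frames:
        return end
    return start + (end - start) * frame / anneal_frames

def pick_action(q_values, frame: int, n_actions: int) -> int:
    """Random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon_at(frame):
        return random.randrange(n_actions)
    return int(q_values.argmax())   # q_values: per-action estimates from the network
```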
* (5.1) Training and stability
* They plot the average reward and Q-value per N games to evaluate the agent's training progress.
* The average reward increases in a noisy way.
* The average Q value increases smoothly.
* They did not experience any divergence issues during their training.
* (5.2) Visualizating the Value Function
* The agent learns to predict the value function accurately, even for rather long sequences (here: ~25 frames).
* (5.3) Main Evaluation
* They compare to three other methods that use hand-engineered features and/or use the pixel data combined with significant prior knowledge.
* They mostly outperform the other methods.
* They managed to beat a human player in three games. The ones where the human won seemed to require strategies that stretched over longer time frames.
[link]
In this article, the authors provide a framework for training two translation models on large amounts of readily available monolingual data. Traditional machine translation models require large parallel corpora to reach good quality, and such corpora are expensive to acquire, while the massive monolingual data remains underused: monolingual corpora are typically only used to pretrain the NMT decoder RNN or to augment the initial parallel corpus with self-generated translations. The authors instead embed the translation task in a reinforcement learning framework in which two agents act as native speakers of two different languages, know little about each other, and learn to translate by trying to communicate. **The two speakers**, `A` and `B`, each know their own language well; this is simulated by a well-trained language model for each of `A` and `B`. Speaker `A` tries to tell a sentence $x$ to `B` by translating it into $y$ in `B`'s language. Since they do not know each other's language, `B` is uncertain about what `A` truly means by saying $y$, but `B` can evaluate how sensible $y$ sounds from his own understanding. Next, `B` informs `A` of this sensibility score and tries to recover what `A` truly meant in `A`'s language, producing $x'$; similarly, `A` can evaluate how close $x'$ is to his original meaning. In general, the original idea that `A` tried to convey is passed through a noisy channel to `B`, and then back to `A` through another noisy channel: the former noisy channel is the `A-B` translation model and the latter the `B-A` translation model. Think about how the first American learned Chinese in history; the principle in this work is intuitively similar.
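A sketch of one such communication round (all names and the reward weighting here are hypothetical stand-ins for the idea described above, not the authors' code):

```python
from typing import Callable

def communication_round(
    x: str,
    translate_ab: Callable[[str], str],                   # A-B translation model (samples y from x)
    lm_b_logprob: Callable[[str], float],                 # B's language model: how sensible does y sound?
    backtranslate_logprob: Callable[[str, str], float],   # B-A model: log P(x | y)
    alpha: float = 0.5,                                   # illustrative trade-off, not from the summary
) -> tuple[str, float]:
    y = translate_ab(x)                            # A says x, translated into B's language
    r_sense = lm_b_logprob(y)                      # B's sensibility score for y
    r_reconstruct = backtranslate_logprob(x, y)    # how well A's meaning is recovered
    reward = alpha * r_sense + (1 - alpha) * r_reconstruct
    return y, reward                               # the reward drives updates of both translation models
```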
[link]
Wang et al. discuss an alternative definition of adversarial examples, taking into account an oracle classifier. Adversarial perturbations are usually constrained in their norm (e.g., $L_\infty$ norm for images); however, the main goal of this constraint is to ensure label invariance: if the image didn't change notably, the label didn't change either. As an alternative formulation, the authors consider an oracle for the task, e.g., humans for image classification. An adversarial example is then defined as a slightly perturbed input whose predicted label changes but whose true label (i.e., the oracle's label) does not. Additionally, the perturbation can be constrained in some norm; specifically, it can be constrained on the true manifold of the data, as represented by the oracle classifier. Based on this notion of adversarial examples, Wang et al. argue that deep neural networks are not robust because they utilize over-complete feature representations. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
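In symbols (the notation here is mine, not necessarily the paper's): given a classifier $f$, an oracle $\mathcal{O}$ and a perturbation budget $\epsilon$, a perturbed input $x' = x + \delta$ is adversarial if
$$
f(x') \neq f(x), \qquad \mathcal{O}(x') = \mathcal{O}(x), \qquad \|\delta\| \leq \epsilon,
$$
where the norm may additionally be measured along the data manifold represented by the oracle.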
[link]
**TL;DR**: Rearranging the terms in Maximum Mean Discrepancy yields a much better loss function for the discriminator of Generative Adversarial Nets.
**Keywords**: Generative adversarial nets, Maximum Mean Discrepancy, spectral normalization, convolutional neural networks, Gaussian kernel, local stability.
**Summary**
Generative adversarial nets (GANs) are widely used to learn the data sampling process and are notoriously difficult to train. The training of GANs may be improved from three aspects: loss function, network architecture, and training process.
This study focuses on a loss function called the Maximum Mean Discrepancy (MMD), defined as:
$$
MMD^2(P_X,P_G)=\mathbb{E}_{P_X}k_{D}(x,x')+\mathbb{E}_{P_G}k_{D}(y,y')-2\mathbb{E}_{P_X,P_G}k_{D}(x,y)
$$
where $G,D$ are the generator and discriminator networks, $x,x'$ are real samples, $y,y'$ are generated samples, and $k_D=k\circ D$ is a learned kernel that calculates the similarity between two samples. Overall, MMD calculates the distance between the real and the generated sample distributions. Thus, traditionally, the generator is trained to minimize $L_G=MMD^2(P_X,P_G)$, while the discriminator minimizes $L_D=-MMD^2(P_X,P_G)$.
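A minimal PyTorch sketch of this quantity under a single-bandwidth Gaussian kernel (the biased estimator and the fixed bandwidth are simplifications of mine; the paper's kernel and estimator differ in the details):

```python
import torch

def gaussian_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Pairwise k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)) for inputs of shape (batch, features)."""
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(dx: torch.Tensor, dy: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate between discriminator outputs of real (dx) and generated (dy) samples."""
    k_xx = gaussian_kernel(dx, dx, sigma).mean()
    k_yy = gaussian_kernel(dy, dy, sigma).mean()
    k_xy = gaussian_kernel(dx, dy, sigma).mean()
    return k_xx + k_yy - 2 * k_xy
```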
This study makes three contributions:
- It argues that $L_D$ encourages the discriminator to ignore the fine details in real data. By minimizing $L_D$, $D$ attempts to maximize $\mathbb{E}_{P_X}k_{D}(x,x')$, the similarity between the scores of real samples. Thus, $D$ has to focus on common features shared by real samples rather than on the fine details that separate them, which may slow down training. Instead, a repulsive loss is proposed, at no additional computational cost over MMD (see the sketch after this list):
$$
L_D^{rep}=\mathbb{E}_{P_X}k_{D}(x,x')-\mathbb{E}_{P_G}k_{D}(y,y')
$$
- Inspired by the hinge loss, this study proposes a bounded Gaussian kernel for the discriminator to facilitate stable training of MMD-GAN.
- The spectral normalization method divides the weight matrix at each layer by its spectral norm to enforce that each layer is Lipschitz continuous. This study proposes a simple method to calculate the spectral norm of a convolutional kernel.
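Contrasting the original and repulsive discriminator objectives in the same style (reusing `gaussian_kernel` from the sketch above; this only illustrates the term rearrangement, it is not the authors' implementation):

```python
def discriminator_losses(dx, dy, sigma: float = 1.0):
    """dx, dy: discriminator outputs for real and generated samples, shape (batch, features)."""
    k_xx = gaussian_kernel(dx, dx, sigma).mean()
    k_yy = gaussian_kernel(dy, dy, sigma).mean()
    k_xy = gaussian_kernel(dx, dy, sigma).mean()

    l_d_original  = -(k_xx + k_yy - 2 * k_xy)   # L_D = -MMD^2 (attractive)
    l_d_repulsive = k_xx - k_yy                 # L_D^{rep}: same terms, rearranged
    l_g           = k_xx + k_yy - 2 * k_xy      # generator still minimizes MMD^2
    return l_d_original, l_d_repulsive, l_g
```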
The results show the effectiveness of the proposed methods on the CIFAR-10, STL-10, CelebA and LSUN-bedroom datasets. In the Appendix, we prove that MMD-GAN training using the gradient method is locally exponentially stable (a property that the Wasserstein loss does not have), and show that the repulsive loss works well with gradient penalty.
The paper has been accepted at ICLR 2019 ([OpenReview link](https://openreview.net/forum?id=HygjqjR9Km)). The code is available at [GitHub link](https://github.com/richardwth/MMD-GAN).
[link]
Offline reinforcement learning is a potentially high-value capability for the machine learning community to learn to do well, because there are many applications where it would be useful to learn a policy for responding to a dynamic environment, but where it would be too unsafe or expensive to learn in an on-policy or online way, continually evaluating our actions in the environment to test their value. In such settings, we'd like to be able to take a batch of existing data, collected from a human demonstrator or from some other algorithm, and learn a policy from those pre-collected transitions, without being able to query the environment further by taking arbitrary actions.

There are two broad strategies for learning a policy from pre-collected transitions. One is to simply mimic the action policy used by the demonstrator, predicting the action the demonstrator would take in a given state, without making use of reward data at all. This is Behavioral Cloning, and it has the advantage of being somewhat more conservative (in terms of not experimenting with possibly-unsafe or low-reward actions the demonstrator never took), but that is also a disadvantage, because it's not possible to get higher reward than the demonstrator did if you're simply copying their behavior. Another approach is to learn a Q function, estimating the value of a given action in a given state, using the reward data from the pre-collected transitions. This can also have downsides, mostly in the direction of overconfidence. Q-value temporal difference learning uses the current reward plus the max Q value over possible next actions as the target for the current-state Q estimate. This tends to lead to overestimates, because regression-to-the-mean effects mean that the highest Q estimates are disproportionately likely to be noisy (possibly because they correspond to an action with little data in the demonstrator dataset). In on-policy Q learning, this is less problematic, because the agent can take the action associated with its noisily inaccurate estimate and, as a result, get more data for that action and a less noisy estimate in the future. But in a fully offline setting, all our learning is completed before we actually start taking actions with our policy, so taking high-uncertainty actions isn't a valuable source of new information, just risky.

The approach suggested by this DeepMind paper, Critic Regularized Regression (CRR), is essentially a synthesis of these two approaches. The method learns a Q function as normal, using temporal difference methods. The distinction comes from how a policy is obtained given the learned Q function. Rather than simply taking the action your Q estimate says is highest-value at a particular point, CRR optimizes a policy according to the formula shown below. The f() function is a stand-in for various potential functions, all of which are monotonic with respect to the Q function, meaning they increase when the Q function does.

https://i.imgur.com/jGmhYdd.png

This basically amounts to a form of behavioral cloning loss (the part that maximizes the probability under your policy of the actions sampled from the demonstrator dataset), but weighted or, as the paper terms it, filtered, by the learned Q function. The higher the estimated Q value for a transition, the more weight is placed on that transition from the demo dataset having high probability under your policy.
Rather than trying to mimic all of the actions of the demonstrator, the policy preferentially tries to mimic the demonstrator actions that it estimates were particularly high-quality. Different f() functions lead to different kinds of filtration. The `binary` version is an indicator function for whether the advantage of an action (the Q value for that action at that state minus some reference value for that state, i.e. how much better the action is than the alternatives there) is greater than zero. Another, `exp`, uses exponential weightings that do a softer up- or down-weighting of transitions based on advantage, rather than the sharp binary cut at zero advantage. The authors demonstrate that, on multiple environments from three different environment suites, CRR outperforms other off-policy baselines, whether purer behavioral cloning or purer RL, and in many cases does so quite dramatically. They find that the sharper binary weighting scheme does better on simpler tasks, where the trade-off of fewer but higher-quality samples to learn from pays off. However, on more complex tasks, the policy benefits from the exp weighting, which still uses and learns from more samples (albeit at lower weights); this introduces some mimicking of lower-quality transitions, but in exchange for a larger effective dataset size to learn from.
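A sketch of the filtered behavioral-cloning objective described above (tensor names, the advantage baseline and the clipping value are my choices, not necessarily the paper's):

```python
import torch

def crr_policy_loss(log_probs, q_sa, state_values, mode="exp", beta=1.0):
    """Weighted behavioral cloning.
    log_probs:    log pi(a|s) for the demonstrator's actions
    q_sa:         learned Q(s, a) for those same state-action pairs
    state_values: a per-state reference value, e.g. the mean Q over sampled actions
    """
    advantage = q_sa - state_values
    if mode == "binary":
        weights = (advantage > 0).float()                       # keep only better-than-average actions
    else:  # "exp"
        weights = torch.exp(advantage / beta).clamp(max=20.0)   # soft up-/down-weighting
    return -(weights.detach() * log_probs).mean()               # maximize weighted log-likelihood
```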