ShortScience.org - Making Science Accessible!

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Howard, Andrew G. and Zhu, Menglong and Chen, Bo and Kalenichenko, Dmitry and Wang, Weijun and Weyand, Tobias and Andreetto, Marco and Adam, Hartwig
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

Summary by Alexander Jung 8 years ago

* They suggest a factorization of standard 3x3 convolutions that is more efficient.
* They build a model based on that factorization. The model has hyperparameters to trade off speed (lower computational cost) against accuracy.

### How
* Factorization
  * They factorize the standard 3x3 convolution into one depthwise 3x3 convolution, followed by a pointwise convolution.
  * Normal 3x3 convolution:
    * Computes per filter and location a weighted sum over all input filters/planes.
    * For kernel height `kH`, width `kW` and number of input filters/planes `Fin`, it requires `kH*kW*Fin` computations per filter and location.
  * Depthwise 3x3 convolution:
    * Computes per filter and location a weighted sum over *one* input filter/plane. E.g. the 13th filter would only compute weighted sums over the 13th input filter/plane and ignore all the other input filters/planes.
    * This requires `kH*kW*1` computations per filter and location, i.e. drastically fewer than a normal convolution.
  * Pointwise convolution:
    * This is just another name for a normal 1x1 convolution.
    * This is placed after a depthwise convolution in order to compensate for the fact that every (depthwise) filter only sees a single input plane.
    * As the kernel size is `1`, this is rather fast to compute.
  * Visualization of normal vs factorized convolution:
    * ![architecture](https://github.com/aleju/papers/blob/master/neural-nets/images/MobileNets/architecture.jpg?raw=true "architecture")
* Models
  * They use two hyperparameters for their models.
    * `alpha`: Multiplier for the width in the range `(0, 1]`. A value of 0.5 means that every layer has half as many filters.
    * `roh` (the paper's ρ): Multiplier for the resolution. In practice this is simply the input image size, taking one of the values `{224, 192, 160, 128}`.
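
The factorization described above can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the function names, the `(H, W, Fin)` layout, stride 1 and "valid" padding are all assumptions made for the example. It checks that a depthwise convolution followed by a pointwise convolution equals a normal convolution whose kernel is constrained to the product of the two, and compares the mult-add counts.

```python
import numpy as np


def depthwise_conv(x, dw_kernels):
    """x: (H, W, Fin), dw_kernels: (kH, kW, Fin). Each plane only sees its own kernel."""
    k_h, k_w, f_in = dw_kernels.shape
    out_h, out_w = x.shape[0] - k_h + 1, x.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w, f_in))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k_h, j:j + k_w, :]
            out[i, j, :] = np.sum(patch * dw_kernels, axis=(0, 1))
    return out


def pointwise_conv(x, pw_weights):
    """1x1 convolution: mixes the planes at every location. pw_weights: (Fin, Fout)."""
    return x @ pw_weights


def standard_conv(x, kernels):
    """Normal convolution. kernels: (kH, kW, Fin, Fout)."""
    k_h, k_w, _, f_out = kernels.shape
    out_h, out_w = x.shape[0] - k_h + 1, x.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w, f_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k_h, j:j + k_w, :]
            out[i, j, :] = np.tensordot(patch, kernels, axes=3)
    return out


def mult_adds_standard(k_h, k_w, f_in, f_out, out_h, out_w):
    return k_h * k_w * f_in * f_out * out_h * out_w


def mult_adds_separable(k_h, k_w, f_in, f_out, out_h, out_w):
    depthwise = k_h * k_w * f_in * out_h * out_w
    pointwise = f_in * f_out * out_h * out_w
    return depthwise + pointwise


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
dw = rng.normal(size=(3, 3, 4))
pw = rng.normal(size=(4, 6))

separable_out = pointwise_conv(depthwise_conv(x, dw), pw)
# The same mapping written as one normal convolution with a constrained kernel.
equivalent_kernels = dw[:, :, :, None] * pw[None, None, :, :]
standard_out = standard_conv(x, equivalent_kernels)
print("outputs match:", np.allclose(separable_out, standard_out))

# Cost ratio for an illustrative layer (sizes chosen for the example only).
ratio = mult_adds_separable(3, 3, 256, 256, 14, 14) / mult_adds_standard(3, 3, 256, 256, 14, 14)
print(f"separable / standard mult-adds: {ratio:.4f}")
```

The ratio works out to `1/Fout + 1/(kH*kW)`, so for 3x3 kernels the saving approaches a factor of 9 as the number of output filters grows.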

### Results
* ImageNet
  * Compared to VGG16, they achieve about 1 percentage point less accuracy, while using only about 4% of VGG's multiply-adds (mult-adds) and only about 3% of the parameters.
  * Compared to GoogleNet, they achieve about 1 percentage point more accuracy, while using only about 36% of the mult-adds and 61% of the parameters.
  * Note that they don't compare to ResNet.
  * Results for architecture choices vs. accuracy on ImageNet:
    * ![results imagenet](https://github.com/aleju/papers/blob/master/neural-nets/images/MobileNets/results_imagenet.jpg?raw=true "results imagenet")
  * Relation between mult-adds and accuracy on ImageNet:
    * ![mult-adds vs accuracy](https://github.com/aleju/papers/blob/master/neural-nets/images/MobileNets/mult-adds_vs_accuracy.jpg?raw=true "mult-adds vs accuracy")
* Object Detection
  * Their mAP is a bit worse on COCO when combining MobileNet with SSD (as opposed to using VGG or Inception v2).
  * Their mAP is quite a bit worse on COCO when combining MobileNet with Faster R-CNN.
* Reducing the number of filters (`alpha`) influences the results more than reducing the input image resolution (`roh`).
* Making the models shallower influences the results more than making them thinner.

Building Machines That Learn and Think Like People
Lake, Brenden M. and Ullman, Tomer D. and Tenenbaum, Joshua B. and Gershman, Samuel J.
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

Summary by Denny Britz 10 years ago

TLDR; The authors explore the gap between Deep Learning methods and human learning. They argue that natural intelligence is still the best example of intelligence, so it's worth exploring. To demonstrate their points they explore two challenges: 1. Recognizing new characters and objects 2. Learning to play the game Frostbite. The authors make several arguments:

-  Humans have an intuitive understanding of physics and psychology (understanding goals and agents) very early on. These two types of "software" help them to learn new tasks quickly.
-  Humans build causal models of the world instead of just performing pattern recognition. These models allow humans to learn from far fewer examples than current Deep Learning methods. For example, AlphaGo played a billion games or so, Lee Sedol perhaps 50,000. Incorporating compositionality, learning-to-learn (transfer learning) and causality helps humans to build these models.
- Humans use both model-free and model-based learning algorithms.

Robustness of classifiers: from adversarial to random noise
Alhussein Fawzi and Seyed-Mohsen Moosavi-Dezfooli and Pascal Frossard
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, cs.CV, stat.ML

Summary by David Stutz 7 years ago

Fawzi et al. study robustness in the transition from random samples to semi-random and adversarial samples. Specifically, they present bounds relating the norm of an adversarial perturbation to the norm of random perturbations; for the exact form I refer to the paper. Personally, I find the definition of semi-random noise most interesting, as it gives an intuition for distinguishing random noise from adversarial examples. As in related literature, adversarial perturbations are defined as

$r_S(x_0) = \arg\min_{r \in S} \|r\|_2$ s.t. $f(x_0 + r) \neq f(x_0)$

where $f$ is the classifier to attack and $S$ the set of allowed perturbations (e.g. requiring that the perturbed samples are still images). If $S$ is mostly unconstrained regarding the direction of $r$ in high-dimensional space, Fawzi et al. consider $r$ to be an adversarial perturbation; intuitively, an adversary can choose $r$ arbitrarily to fool the classifier. If, however, the directions considered in $S$ are constrained to an $m$-dimensional subspace, Fawzi et al. consider $r$ to be semi-random noise. In the extreme case, if $m = 1$, $r$ is random noise. In this case, we can intuitively think of $S$ as a randomly chosen one-dimensional subspace, i.e. a random direction in multi-dimensional space.
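
To get a feeling for the three regimes, here is a small sketch for the special case of a linear binary classifier $f(x) = \text{sign}(w^T x + b)$, where the minimal perturbation inside a subspace has a closed form. The linear classifier, the helper name `min_flip_perturbation`, the dimensions and the small overshoot past the decision boundary are my assumptions for illustration; the paper treats more general classifiers.

```python
import numpy as np


def min_flip_perturbation(w, b, x0, basis, overshoot=1e-3):
    """Smallest L2 perturbation inside span(basis) that flips sign(w @ x + b).

    Hypothetical helper, valid for a *linear* classifier only.
    """
    q, _ = np.linalg.qr(basis)                  # orthonormal basis of S, shape (d, m)
    w_in_s = q @ (q.T @ w)                      # projection of w onto S
    margin = w @ x0 + b                         # signed score of the clean sample
    r = -margin * w_in_s / (w_in_s @ w_in_s)    # lands exactly on the decision boundary
    return (1.0 + overshoot) * r                # step slightly past it so the label changes


rng = np.random.default_rng(0)
d = 200
w, b = rng.normal(size=d), 0.1
x0 = rng.normal(size=d)
# Nested random subspaces: S_m is spanned by the first m random directions.
directions = rng.normal(size=(d, d))

norms = {}
for m in (1, 10, 100, d):
    r = min_flip_perturbation(w, b, x0, directions[:, :m])
    assert np.sign(w @ (x0 + r) + b) != np.sign(w @ x0 + b)
    norms[m] = float(np.linalg.norm(r))
    print(f"m = {m:3d}: ||r||_2 = {norms[m]:.3f}")
```

With $m = 1$ the perturbation has to follow a single random direction (random noise), with $m = d$ it may follow $w$ itself (adversarial), and the required norm shrinks as $m$ grows.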

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

Adversarial Robustness: Softmax versus Openmax
Andras Rozsa and Manuel Günther and Terrance E. Boult
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CV

Summary by David Stutz 7 years ago

Rozsa et al. describe an adversarial attack against OpenMax [1] by directly targeting the logits. Specifically, they assume a network using an OpenMax layer instead of a SoftMax layer to compute the final class probabilities. OpenMax enables “open-set” networks by also allowing input samples to be rejected. By directly targeting the logits of the trained network, i.e. iteratively pushing the logits in a target direction, the network can be fooled regardless of whether a SoftMax or an OpenMax layer is used on top.
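
The idea of iteratively pushing the logits towards a target can be sketched as follows. This is only a toy illustration under strong assumptions: a random linear map stands in for the trained network, the objective is a squared distance between current and target logits, and all names (`push_logits`, `target_logits`) are hypothetical. It is not the authors' exact procedure; it only shows that the attack never touches the layer placed on top of the logits.

```python
import numpy as np


def push_logits(W, b, x, target_logits, n_steps=500):
    """Gradient descent on 0.5 * ||W @ x + b - target_logits||^2 with respect to x."""
    step = 1.0 / np.linalg.norm(W, 2) ** 2    # 1 / (largest singular value)^2
    x_adv = x.copy()
    for _ in range(n_steps):
        logits = W @ x_adv + b
        x_adv -= step * (W.T @ (logits - target_logits))
    return x_adv


rng = np.random.default_rng(0)
n_classes, n_features = 3, 20
W = rng.normal(size=(n_classes, n_features))   # stand-in for the trained network
b = rng.normal(size=n_classes)
x = rng.normal(size=n_features)

clean_logits = W @ x + b
source = int(np.argmax(clean_logits))
target = (source + 1) % n_classes

# Target logits: same as the clean ones, except that the target class dominates.
target_logits = clean_logits.copy()
target_logits[target] = clean_logits.max() + 5.0

x_adv = push_logits(W, b, x, target_logits)
adv_logits = W @ x_adv + b
print("clean class:", source, "-> adversarial class:", int(np.argmax(adv_logits)))
```

Any layer that is a function of the logits only, whether SoftMax or OpenMax, receives the manipulated logits as its input.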

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

On the Suitability of Lp-Norms for Creating and Preventing Adversarial Examples
Sharif, Mahmood and Bauer, Lujo and Reiter, Michael K.
Conference on Computer Vision and Pattern Recognition - 2018 via Local Bibsonomy
Keywords: dblp

Summary by David Stutz 7 years ago

Sharif et al. study the effectiveness of $L_p$ norms for creating adversarial perturbations. In this context, their main discussion revolves around whether $L_p$ norms are sufficient and/or necessary for perceptual similarity. Their main conclusion is that $L_p$ norms are neither necessary nor sufficient to ensure perceptual similarity. For example, an adversarial example might be within a specific $L_p$ ball, but humans might still identify it as not similar enough to the originally attacked sample; on the other hand, there are also some imperceptible perturbations that usually extend beyond a reasonable $L_p$ ball. Such transformations might for example include small rotations or translations. These findings are interesting because they indicate that our current model, or approximation, of perceptual similarity is not meaningful in all cases.
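
The translation case is easy to reproduce numerically. The sketch below is my own illustration, not an experiment from the paper: the striped test image, the circular one-pixel shift via `np.roll`, and the $L_\infty$ budget of `8 / 255` are all assumptions chosen for the example.

```python
import numpy as np

size = 32
columns = np.arange(size)
row = ((columns // 4) % 2).astype(float)     # vertical stripes, 4 pixels wide, values in {0, 1}
image = np.tile(row, (size, 1))

# A one-pixel horizontal translation (circular, i.e. with wrap-around).
shifted = np.roll(image, shift=1, axis=1)
shift_delta = (shifted - image).ravel()

# Uniform noise that stays inside an illustrative L_inf ball.
eps_linf = 8 / 255
rng = np.random.default_rng(0)
noise_delta = rng.uniform(-eps_linf, eps_linf, size=image.size)

shift_linf = float(np.linalg.norm(shift_delta, ord=np.inf))
shift_l2 = float(np.linalg.norm(shift_delta, ord=2))
noise_linf = float(np.linalg.norm(noise_delta, ord=np.inf))
noise_l2 = float(np.linalg.norm(noise_delta, ord=2))

print(f"1-pixel shift: L_inf = {shift_linf:.3f}, L_2 = {shift_l2:.3f}")
print(f"bounded noise: L_inf = {noise_linf:.3f}, L_2 = {noise_l2:.3f}")
```

The shifted image shows the same stripes, yet its $L_\infty$ distance to the original is the maximum possible for pixel values in $[0, 1]$, far outside the ball that contains the noise.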

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).