ShortScience.org - Making Science Accessible!


Progress & Compress: A scalable framework for continual learning
Jonathan Schwarz and Jelena Luketina and Wojciech M. Czarnecki and Agnieszka Grabska-Barwinska and Yee Whye Teh and Razvan Pascanu and Raia Hadsell
arXiv e-Print archive - 2018 via Local arXiv
Keywords: stat.ML, cs.LG

Summary by devin132 5 years ago

Proposes a two-stage approach for continual learning: an active learning phase and a consolidation phase. The active learning phase optimizes for a specific task, which is then consolidated into the knowledge base network via Elastic Weight Consolidation (Kirkpatrick et al., 2016). The active learning phase uses a separate network from the knowledge base, but this network is not always trained from scratch; the authors suggest a heuristic based on task similarity. The paper also improves EWC by deriving a new online variant, so that the number of stored parameters does not increase linearly with the number of tasks.
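To make the online EWC idea above more concrete, here is a minimal sketch, assuming the commonly described form in which a single running diagonal Fisher estimate is decayed and updated after each task, and a quadratic penalty anchors the parameters to the most recent consolidated values. The function names, the decay factor `gamma`, and all numbers are illustrative assumptions, not taken from the paper's code.

```python
def update_fisher(running_fisher, task_fisher, gamma=0.9):
    """Fold the newest task's diagonal Fisher estimate into one running estimate."""
    return [gamma * old + new for old, new in zip(running_fisher, task_fisher)]


def ewc_penalty(params, anchor_params, running_fisher):
    """Quadratic penalty that keeps important parameters close to their anchor."""
    return 0.5 * sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, running_fisher)
    )


# Toy example with three parameters and two tasks (all numbers are made up).
fisher = [0.0, 0.0, 0.0]
fisher = update_fisher(fisher, [1.0, 0.0, 2.0])  # after task 1
fisher = update_fisher(fisher, [0.0, 4.0, 2.0])  # after task 2

anchor = [0.5, -1.0, 2.0]     # knowledge-base parameters after the last consolidation
candidate = [0.5, -0.5, 2.5]  # parameters being evaluated now
penalty = ewc_penalty(candidate, anchor, fisher)
```

Because only one Fisher vector and one anchor vector are kept, the storage in this sketch stays the same size no matter how many tasks have been seen.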

Desiderata for a continual learning solution:

- A continual learning method should not suffer from catastrophic forgetting. That is, it should be able to perform reasonably well on previously learned tasks.

- It should be able to learn new tasks while taking advantage of knowledge extracted from previous tasks, thus exhibiting positive forward transfer to achieve faster learning and/or better final performance.

- It should be scalable, that is, the method should be trainable on a large number of tasks.

- It should enable positive backward transfer as well, which means gaining improved performance on previous tasks after learning a new task which is similar or relevant.

- Finally, it should be able to learn without requiring task labels, and ideally, it should even be applicable in the absence of clear task boundaries.

Experiments:

- Sequential learning of handwritten characters of 50 alphabets taken from the Omniglot dataset.
- Sequential learning of 6 games in the Atari suite (Bellemare et al., 2012) (“Space Invaders”, “Krull”, “Beamrider”, “Hero”, “Stargunner” and “Ms. Pac-man”).
- 8 navigation tasks in 3D environments inspired by experiments with Distral (Teh et al., 2017).


Towards Neural Network-based Reasoning
Peng, Baolin and Lu, Zhengdong and Li, Hang and Wong, Kam-Fai
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Hugo Larochelle 9 years ago

This paper presents a neural network architecture that takes as input a question and a sequence of facts expressed in natural language (i.e. as sequences of words) and produces as its output the answer to that question. The main components of the architecture are as follows:

* The question (q) and the facts (f_1, ... , f_K) are each individually transformed into a fixed size vector using the same GRU RNN (with the last hidden layer serving as the vector representation).
* These vectors are each passed through "reasoning layers", where each layer transforms the question q and the facts f_k into a new vector representation. This is done by feeding each question-fact pair (q,f_k) to a neural network that outputs a new representation for the fact f_k (which replaces its old representation in the layer), as well as a new representation for the question. All K new question representations are then pooled to obtain a single question representation that replaces the old one in the layer.
* The last reasoning layer is either fed to a softmax layer for binary questions, or to a scoring layer for questions with multiple and varying candidate answers.
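The components listed above can be sketched roughly as follows, for the variant in which the fact vectors stay fixed and only the question vector is updated. This is a toy illustration under stated assumptions: the per-pair network is reduced to a single `tanh` layer with hypothetical scalar weights `w_q` and `w_f`, the GRU encoder is skipped (vectors are given directly), and max pooling is assumed for combining the K question updates.

```python
import math


def interact(q, f, w_q, w_f):
    """Toy stand-in for the per-pair network: one tanh layer over (q, f)."""
    return [math.tanh(w_q * qi + w_f * fi) for qi, fi in zip(q, f)]


def reasoning_layer(q, facts, w_q=1.0, w_f=1.0):
    """Update the question against each fact, then max-pool the K results."""
    candidates = [interact(q, f, w_q, w_f) for f in facts]
    return [max(values) for values in zip(*candidates)]


def reason(q, facts, num_layers=2):
    """Apply several reasoning layers while the fact vectors stay unchanged."""
    for _ in range(num_layers):
        q = reasoning_layer(q, facts)
    return q


question = [0.1, -0.2, 0.3]                   # made-up question encoding
facts = [[0.5, 0.0, -0.5], [-0.1, 0.4, 0.2]]  # made-up fact encodings
final_q = reason(question, facts)
```

In a full model, `final_q` would then be passed to a softmax or scoring layer to produce the answer.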

This so-called Neural Reasoner can be trained by backpropagation, in an end-to-end, supervised way. The authors also suggest the use of auxiliary tasks to improve results. The first ("original") adds an autoencoder reconstruction cost that reproduces the question and facts from their first-layer encoding. The second ("abstract") instead reconstructs a more abstract version of the sentences (e.g. "The triangle is above the pink rectangle." becomes "x is above y").

Importantly, while the Neural Reasoner framework is presented in this paper as covering many different variants, the version that is experimentally tested is one where the fact representations f_k are actually left unchanged throughout the reasoning layers, with only the question representation being changed.

The paper presents experiments on two synthetic reasoning tasks and reports performance that compares favorably with previously published alternatives (based on the general Memory Network architecture). The experiments also show that the auxiliary tasks can substantially improve the performance of the model.

#### My two cents

The proposed Neural Reasoner framework is actually very close to work published on arXiv at about the same time on End-to-End Memory Networks \cite{conf/nips/SukhbaatarSWF15}. In fact, the version tested in the paper, with unchanged fact representations throughout layers, is extremely close to End-to-End Memory Networks.

That said, there are also lots of differences. For instance, this paper proposes the use of multilayer networks within each Reasoning Layer, to produce updated question representations. In fact, experiments suggest that using several layers can be very beneficial for the path finding task. The sentence representation at the first layer is also different, being based on a non-linear RNN instead of being based on linear operations on embeddings as in Memory Networks.

The most interesting aspect of this paper to me is probably the demonstration that the use of an auxiliary task such as "original", which is unsupervised, can substantially improve the performance, again for the path finding task. That is, to me, probably the most exciting direction of future research that this paper highlights as promising.

I also liked how the model is presented. It didn't take me much time to understand the model, and I actually found it easier to absorb than the Memory Network model, despite both being very similar. I think this model is indeed a bit simpler than Memory Networks, which is a good thing. It also suggests a different approach to the problem, one where the fact representations are also updated during forward propagation, not just the question's representation (which is the version initially described in the paper... I hope experiments on that variant are eventually presented).

It's unfortunate that the authors only performed experiments on 2 of the 20 synthetic question-answering tasks. I hope a future version of this work can report results on the full benchmark and directly compare with End-to-End Memory Networks.

I was also unable to find out which of the question representation pooling mechanisms (section 3.2.2) was used in the experiments. Perhaps the authors forgot to state it?

Overall, a pretty interesting paper that opens different doors towards reasoning with neural networks.


Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders
Antonio Valerio Miceli Barone
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.LG, cs.NE

Summary by Jon Gauthier 8 years ago

This is a simple unsupervised method for learning word-level translation
between embeddings of two different languages.

That's right -- unsupervised.

The basic motivating hypothesis is that there should be an isomorphism between
the "semantic spaces" of different languages:

> we hypothesize that, if languages are used to convey thematically similar information in similar contexts, these random processes should be approximately isomorphic between languages, and that this isomorphism can be learned from the statistics of the realizations of these processes, the monolingual corpora, in principle without any form of explicit alignment.

If you squint a bit, you can make the more aggressive claim from this premise
that there should be a nonlinear / MLP mapping between *word embedding spaces*
that gets us the same result.

The author uses the adversarial autoencoder (AAE, from Makhzani last year)
framework in order to enforce a cross-lingual semantic mapping in word
embedding spaces. The basic setup for adversarial training between a source and
a target language:

1. Sample a batch of words from the source language according to the language's
word frequency distribution.
2. Sample a batch of words from the target language according to its word
frequency distribution. (No sort of relationship is enforced between the two
samples here.)
3. Feed the word embeddings corresponding to the source words through an
*encoder* MLP. This corresponds to the standard "generator" in a GAN setup.
4. Pass the generator output to a *discriminator* MLP along with the
target-language word embeddings.
5. Also pass the generator output to a *decoder* which maps back to the source
embedding distribution.
6. Update weights based on a combination of GAN loss + reconstruction loss.
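The six steps above can be sketched for a single batch as follows. This is only an illustration under stated assumptions: the encoder and decoder are plain linear maps, the discriminator is a single logistic unit, the generator term uses the usual non-saturating GAN loss, the reconstruction term is squared error, and the weight update itself is left out. All names, embeddings, and frequencies are made up.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_words(freqs, k, rng):
    """Steps 1-2: draw k words in proportion to their corpus frequency."""
    words = list(freqs)
    return rng.choices(words, weights=[freqs[w] for w in words], k=k)


def batch_losses(src_vecs, tgt_vecs, enc, dec, disc_w, recon_weight=1.0):
    """Steps 3-6: encode, discriminate, decode, then combine the losses."""
    eps = 1e-9

    def score(v):
        return sigmoid(sum(w * x for w, x in zip(disc_w, v)))

    encoded = [matvec(enc, v) for v in src_vecs]   # step 3: encoder ("generator")
    d_fake = [score(e) for e in encoded]           # step 4: discriminator on encoder output
    d_real = [score(t) for t in tgt_vecs]          # step 4: discriminator on target embeddings
    decoded = [matvec(dec, e) for e in encoded]    # step 5: decoder back to source space
    recon = sum(
        sum((a - b) ** 2 for a, b in zip(v, d)) for v, d in zip(src_vecs, decoded)
    ) / len(src_vecs)
    disc_loss = -(
        sum(math.log(r + eps) for r in d_real) / len(d_real)
        + sum(math.log(1.0 - f + eps) for f in d_fake) / len(d_fake)
    )
    gen_loss = -sum(math.log(f + eps) for f in d_fake) / len(d_fake)
    return disc_loss, gen_loss + recon_weight * recon  # step 6: losses to minimize


rng = random.Random(0)
src_emb = {"chien": [1.0, 0.0], "chat": [0.0, 1.0]}  # made-up source embeddings
tgt_emb = {"dog": [0.9, 0.1], "cat": [0.1, 0.9]}     # made-up target embeddings
src_batch = sample_words({"chien": 3, "chat": 1}, k=4, rng=rng)
tgt_batch = sample_words({"dog": 2, "cat": 2}, k=4, rng=rng)

identity = [[1.0, 0.0], [0.0, 1.0]]
disc_loss, enc_loss = batch_losses(
    [src_emb[w] for w in src_batch],
    [tgt_emb[w] for w in tgt_batch],
    enc=identity, dec=identity, disc_w=[0.0, 0.0],
)
```

In training, the discriminator would take a gradient step on `disc_loss` while the encoder and decoder take a step on `enc_loss`; the relative weighting `recon_weight` is a free choice in this sketch.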

### Does it work?

We don't really know. The paper is unfortunately short on evaluation --- we
just see a few examples of success and failure on a trained model. One easy
evaluation would be to compare accuracy in lexical mapping vs. corpus frequency
of the source word. I would bet that this would reveal the model hasn't done
much more than learn to align a small set of high-frequency words.
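The evaluation suggested above could look something like the sketch below: group source words into frequency buckets and report translation accuracy per bucket. The helper name, the bucket boundaries, and the toy results are all hypothetical; no such numbers appear in the paper.

```python
from collections import defaultdict


def accuracy_by_frequency(results, boundaries):
    """Compute accuracy per frequency bucket.

    results: iterable of (source_word_frequency, is_correct) pairs.
    boundaries: ascending frequency cut-offs that separate the buckets.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for freq, correct in results:
        bucket = sum(freq >= b for b in boundaries)  # index of the bucket for this word
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}


# Made-up results, purely to show the shape of the output.
toy_results = [(5, False), (8, False), (50, True), (70, False), (900, True), (1200, True)]
table = accuracy_by_frequency(toy_results, boundaries=[10, 100])
```

If accuracy were high only in the top bucket, that would support the suspicion that the model mostly aligns frequent words.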


You Only Look Once: Unified, Real-Time Object Detection
Redmon, Joseph and Divvala, Santosh Kumar and Girshick, Ross B. and Farhadi, Ali
Conference on Computer Vision and Pattern Recognition - 2016 via Local Bibsonomy
Keywords: dblp

Summary by Abhishek Das 7 years ago

This paper models object detection as a regression problem for bounding
boxes and object class probabilities with a single pass through the CNN. The
main contribution is the idea of dividing the image into a 7x7 grid, and having
each cell predict a distribution over class labels as well as a bounding box
for the object whose center falls into it. It's much faster than R-CNN and
Fast R-CNN, as the additional step of extracting region proposals has been
removed.
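As a small illustration of the grid idea described above, the sketch below finds which cell is responsible for an object, assuming the box center is given in image coordinates normalized to [0, 1]. The function names are hypothetical, and the offset encoding is a common convention rather than something stated in this summary.

```python
def responsible_cell(cx, cy, grid_size=7):
    """Return (row, col) of the grid cell containing a box center in [0, 1] coordinates."""
    col = min(int(cx * grid_size), grid_size - 1)  # clamp so that cx == 1.0 stays in range
    row = min(int(cy * grid_size), grid_size - 1)
    return row, col


def cell_offsets(cx, cy, grid_size=7):
    """Return the center position relative to the top-left corner of its cell."""
    row, col = responsible_cell(cx, cy, grid_size)
    return cx * grid_size - col, cy * grid_size - row


row, col = responsible_cell(0.5, 0.25)
x_off, y_off = cell_offsets(0.5, 0.25)
```

Only the cell at `(row, col)` is asked to predict this object's bounding box and class distribution.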

## Strengths

- Runs in real time. The base model runs at 45 fps and a faster version goes up
to 150 fps; the authors claim it is more than twice as fast as other work on
real-time detection.

- End-to-end model; localization and classification errors can be jointly
optimized.

- YOLO makes more localization errors but fewer background mistakes than
Fast R-CNN, so using YOLO to eliminate false background detections from
Fast R-CNN yields a ~3% mAP gain (at little extra computational cost, since
YOLO is much faster than Fast R-CNN).

## Weaknesses / Notes

- Results fall short of the state of the art: 57.9% mAP vs. 70.4% mAP for Faster R-CNN.

- Performs worse at detecting small objects, as at most one object per grid
cell can be detected.


Out-distribution training confers robustness to deep neural networks
Abbasi, Mahdieh and Gagné, Christian
arXiv e-Print archive - 2018 via Local Bibsonomy
Keywords: dblp

Summary by David Stutz 6 years ago

Abbasi and Gagné propose explicit but natural out-distribution training as a defense against adversarial examples. Specifically, as also illustrated on the toy dataset in Figure 1, they argue that networks commonly produce high-confidence predictions in regions that are clearly outside of the data manifold (i.e., the training data distribution). As a mitigation strategy, the authors propose to explicitly train on out-of-distribution data, allowing the network to additionally classify this data as “dustbin” data. On MNIST, for example, this data comes from NotMNIST, a dataset of the letters A-J; on CIFAR-10, this data could be CIFAR-100. Experiments show that this out-of-distribution training allows networks to identify adversarial examples as “dustbin” and thus improves robustness.

https://i.imgur.com/nUSDZay.png
Figure 1: Illustration of a naive model versus an augmented model, i.e., trained on out-of-distribution data, on a toy dataset (left) and on MNIST (right).
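The "dustbin" setup summarized above can be sketched as a labeling and rejection rule: with K in-distribution classes, out-of-distribution training examples get the extra label K, and at test time an input is rejected when that extra class scores highest. The helper names and the toy scores are assumptions for illustration only.

```python
def add_dustbin_labels(in_dist_labels, num_out_dist, num_classes):
    """Label every out-of-distribution example with the extra class index K."""
    dustbin = num_classes
    return list(in_dist_labels) + [dustbin] * num_out_dist


def predict_or_reject(scores, num_classes):
    """Return the predicted class, or None if the dustbin class scores highest."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return None if best == num_classes else best


# Ten digit classes plus one dustbin class (index 10); scores are made up.
labels = add_dustbin_labels([3, 7], num_out_dist=2, num_classes=10)
clean_scores = [0.0] * 11
clean_scores[7] = 5.0
suspicious_scores = [0.1] * 11
suspicious_scores[10] = 4.0
```

Here `predict_or_reject(clean_scores, 10)` returns class 7, while `predict_or_reject(suspicious_scores, 10)` returns `None`, meaning the input is treated as "dustbin".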

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).