Welcome to ShortScience.org!

  • ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
  • The website has 1583 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
  • Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
  • Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
  • Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.


Summary by David Stutz 6 years ago

Abbasi and Gagné propose explicit but natural out-distribution training as a defense against adversarial examples. Specifically, as also illustrated on the toy dataset in Figure 1, they argue that networks commonly produce high-confidence predictions in regions that are clearly outside of the data manifold (i.e., the training data distribution). As a mitigation strategy, the authors propose to explicitly train on out-of-distribution data, allowing the network to additionally classify this data as “dustbin” data. On MNIST, for example, this data comes from NotMNIST, a dataset of letters A-J; on CIFAR-10, this data could be CIFAR-100. Experiments show that this out-of-distribution training allows networks to identify adversarial examples as “dustbin” and thus improves robustness.

Figure 1: Illustration of a naive model versus an augmented model, i.e., trained on out-of-distribution data, on a toy dataset (left) and on MNIST (right).
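
To make the "dustbin" idea concrete, here is a minimal sketch of how the augmented training set and the rejection rule could be set up. The function names, the toy data, and the arg-max rejection rule are assumptions for illustration, not the authors' code.

```python
import random

def build_augmented_dataset(in_dist, out_dist, num_classes, seed=0):
    """Merge labeled in-distribution pairs with out-of-distribution inputs.

    Every out-of-distribution input receives the extra label `num_classes`
    (the "dustbin" class), so the classifier needs num_classes + 1 outputs.
    """
    dustbin = num_classes
    data = list(in_dist) + [(x, dustbin) for x in out_dist]
    random.Random(seed).shuffle(data)
    return data, dustbin

def is_rejected(probabilities, dustbin):
    """Assumed rule: reject an input when the dustbin class scores highest."""
    predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
    return predicted == dustbin

# Toy usage: 10 digit classes (0..9) plus one dustbin class (10).
# The strings stand in for images; they are placeholders, not real data.
mnist_like = [("digit_image_a", 3), ("digit_image_b", 7)]
notmnist_like = ["letter_image_a", "letter_image_b", "letter_image_c"]
train_set, DUSTBIN = build_augmented_dataset(mnist_like, notmnist_like, num_classes=10)
```

At test time, an adversarial example counts as detected under this sketch whenever `is_rejected` returns `True` for the network's output on it.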

Also find this summary at davidstutz.de.

Summary by Hugo Larochelle 9 years ago

This paper presents a neural network architecture that can take as input a question and a sequence of facts expressed in natural language (i.e. a sequence of words) and produce as its output the answer to that question. The main components of the architecture are as follows:

  • The question (q) and the facts (f_1, ... , f_K) are each individually transformed into a fixed size vector using the same GRU RNN (with the last hidden layer serving as the vector representation).
  • These vectors are each passed through "reasoning layers", where each layer transforms the question q and the facts f_k into a new vector representation. This is done by feeding each question-fact pair (q,f_k) to a neural network that outputs a new representation for the fact f_k (which replaces its old representation in the layer), as well as a new representation for the question. All K new question representations are then pooled to obtain a single question representation that replaces the old one in the layer.
  • The last reasoning layer is either fed to a softmax layer for binary questions, or to a scoring layer for questions with multiple and varying candidate answers.

This so-called Neural Reasoner can be trained by backpropagation, in an end-to-end, supervised way. The authors also suggest the use of auxiliary tasks to improve results. The first ("original") adds an autoencoder reconstruction cost that reproduces the question and facts from their first-layer encoding. The second ("abstract") instead reconstructs a more abstract version of the sentences (e.g. "The triangle is above the pink rectangle." becomes "x is above y").

Importantly, while the Neural Reasoner framework is presented in this paper as covering many different variants, the version that is experimentally tested is one where the fact representations f_k are actually left unchanged throughout the reasoning layers, with only the question representation being changed.
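
As a rough illustration of the tested variant (facts fixed, only the question updated), one reasoning layer could look like the sketch below. The single tanh layer, the weight matrix `W`, and the max/mean pooling options are assumptions made to keep the example short; the paper uses multilayer networks inside each layer, and the pooling actually used is not stated in this summary.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def reasoning_layer(q, facts, W, pool="max"):
    """Update the question against every fact, then pool the K candidates.

    q: question vector of size d; facts: K fact vectors of size d;
    W: d x 2d weights applied to the concatenation [q; f_k].
    The facts are returned unchanged, as in the tested variant.
    """
    candidates = [[math.tanh(z) for z in matvec(W, q + f)] for f in facts]
    if pool == "max":
        new_q = [max(column) for column in zip(*candidates)]
    elif pool == "mean":
        new_q = [sum(column) / len(column) for column in zip(*candidates)]
    else:
        raise ValueError("unknown pooling: " + pool)
    return new_q, facts

# Toy usage with d = 2; this W simply computes tanh(q + f_k).
W = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
q0 = [0.0, 0.0]
facts0 = [[1.0, -1.0], [-1.0, 1.0]]
q_max, _ = reasoning_layer(q0, facts0, W, pool="max")
q_mean, _ = reasoning_layer(q0, facts0, W, pool="mean")
```

Stacking several such calls, each with its own weights, gives the multi-step reasoning described above; the final question vector would then go to the softmax or scoring layer.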

The paper presents experiments on two synthetic reasoning tasks and reports performance that compares favorably with previously published alternatives (based on the general Memory Network architecture). The experiments also show that the auxiliary tasks can substantially improve the performance of the model.

My two cents

The proposed Neural Reasoner framework is actually very close to work published on arXiv at about the same time on End-to-End Memory Networks [conf/nips/SukhbaatarSWF15]. In fact, the version tested in the paper, with unchanged fact representations throughout layers, is extremely close to End-to-End Memory Networks.

That said, there are also lots of differences. For instance, this paper proposes the use of multilayer networks within each Reasoning Layer, to produce updated question representations. In fact, experiments suggest that using several layers can be very beneficial for the path finding task. The sentence representation at the first layer is also different, being based on a non-linear RNN instead of being based on linear operations on embeddings as in Memory Networks.

The most interesting aspect of this paper to me is probably the demonstration that the use of an auxiliary task such as "original", which is unsupervised, can substantially improve the performance, again for the path finding task. That is, to me, probably the most exciting direction of future research that this paper highlights as promising.

I also liked how the model is presented. It didn't take me much time to understand the model, and I actually found it easier to absorb than the Memory Network model, despite both being very similar. I think this model is indeed a bit simpler than Memory Networks, which is a good thing. It also suggests a different approach to the problem, one where the fact representations are also updated during forward propagation, not just the question's representation (which is the version initially described in the paper... I hope experiments on that variant are eventually presented).

It's unfortunate that the authors only performed experiments on 2 of the 20 synthetic question-answering tasks. I hope a future version of this work can report results on the full benchmark and directly compare with End-to-End Memory Networks.

I was also unable to find out which of the question representation pooling mechanisms (section 3.2.2) was used in the experiments. Perhaps the authors forgot to state it?

Overall, a pretty interesting paper that opens different doors towards reasoning with neural networks.

Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders
Antonio Valerio Miceli Barone
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.LG, cs.NE

Summary by Jon Gauthier 8 years ago

This is a simple unsupervised method for learning word-level translation between embeddings of two different languages.

That's right -- unsupervised.

The basic motivating hypothesis is that there should be an isomorphism between the "semantic spaces" of different languages:

we hypothesize that, if languages are used to convey thematically similar information in similar contexts, these random processes should be approximately isomorphic between languages, and that this isomorphism can be learned from the statistics of the realizations of these processes, the monolingual corpora, in principle without any form of explicit alignment.

If you squint a bit, you can make the more aggressive claim from this premise that there should be a nonlinear / MLP mapping between word embedding spaces that gets us the same result.

The author uses the adversarial autoencoder (AAE, from Makhzani last year) framework in order to enforce a cross-lingual semantic mapping in word embedding spaces. The basic setup for adversarial training between a source and a target language:

  1. Sample a batch of words from the source language according to the language's word frequency distribution.
  2. Sample a batch of words from the target language according to its word frequency distribution. (No sort of relationship is enforced between the two samples here.)
  3. Feed the word embeddings corresponding to the source words through an encoder MLP. This corresponds to the standard "generator" in a GAN setup.
  4. Pass the generator output to a discriminator MLP along with the target-language word embeddings.
  5. Also pass the generator output to a decoder which maps back to the source embedding distribution.
  6. Update weights based on a combination of GAN loss + reconstruction loss.
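
The six steps can be sketched for a single batch as follows. This is a simplified illustration, not the paper's implementation: the encoder and decoder are plain linear maps rather than MLPs, the discriminator is a single logistic unit, the reconstruction loss is squared error, and all names and toy embeddings are made up. The weight update itself is left out.

```python
import math
import random

def linear(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_words(freqs, batch_size, rng):
    """Steps 1-2: sample words in proportion to their corpus frequency."""
    words = list(freqs)
    return rng.choices(words, weights=[freqs[w] for w in words], k=batch_size)

def batch_losses(src_batch, tgt_batch, src_emb, tgt_emb, enc, dec, disc_w):
    """Return (discriminator loss, generator loss, reconstruction loss)."""
    eps = 1e-12  # keeps log() away from zero
    score = lambda v: sigmoid(sum(a * b for a, b in zip(disc_w, v)))
    # Step 3: encode source embeddings (the "generator").
    encoded = [linear(enc, src_emb[w]) for w in src_batch]
    # Step 4: discriminator sees real target embeddings and encoded source ones.
    d_real = sum(-math.log(score(tgt_emb[w]) + eps) for w in tgt_batch) / len(tgt_batch)
    d_fake = sum(-math.log(1.0 - score(z) + eps) for z in encoded) / len(encoded)
    g_loss = sum(-math.log(score(z) + eps) for z in encoded) / len(encoded)
    # Step 5: decode back to the source space and measure reconstruction error.
    rec = sum(
        sum((a - b) ** 2 for a, b in zip(src_emb[w], linear(dec, z)))
        for w, z in zip(src_batch, encoded)
    ) / len(encoded)
    return d_real + d_fake, g_loss, rec

# Toy usage with made-up two-dimensional embeddings.
rng = random.Random(0)
src_freqs, tgt_freqs = {"cat": 5, "dog": 3}, {"gatto": 4, "cane": 4}
src_emb = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
tgt_emb = {"gatto": [0.0, 1.0], "cane": [1.0, 0.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
src_batch = sample_words(src_freqs, 4, rng)
tgt_batch = sample_words(tgt_freqs, 4, rng)
d_loss, g_loss, rec_loss = batch_losses(
    src_batch, tgt_batch, src_emb, tgt_emb, identity, identity, [0.0, 0.0])
rec_weight = 1.0  # hypothetical weighting between the two terms
encoder_objective = g_loss + rec_weight * rec_loss  # step 6 would minimize this
```

In a real setup, step 6 would alternate gradient updates: the discriminator on `d_loss`, and the encoder and decoder on `encoder_objective`.
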
Does it work?

We don't really know. The paper is unfortunately short on evaluation: we just see a few examples of success and failure on a trained model. One easy evaluation would be to compare accuracy in lexical mapping vs. corpus frequency of the source word. I would bet that this would reveal the model hasn't done much more than learn to align a small set of high-frequency words.
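
The frequency-bucketed evaluation suggested above could be as simple as the following sketch. The gold dictionary, the predictions, and the frequency thresholds are all hypothetical; the paper reports no such numbers.

```python
import bisect

def accuracy_by_frequency(predicted, gold, freqs, thresholds):
    """Translation accuracy per source-word frequency bucket.

    thresholds: ascending frequency cut-offs; [10, 1000] yields the buckets
    freq < 10, 10 <= freq < 1000, and freq >= 1000.
    Returns one accuracy per bucket, or None for an empty bucket.
    """
    hits = [0] * (len(thresholds) + 1)
    totals = [0] * (len(thresholds) + 1)
    for word, translation in gold.items():
        bucket = bisect.bisect_right(thresholds, freqs[word])
        totals[bucket] += 1
        hits[bucket] += int(predicted.get(word) == translation)
    return [h / t if t else None for h, t in zip(hits, totals)]

# Toy usage with made-up words, translations, and counts.
gold = {"the": "il", "house": "casa", "zebra": "zebra_it"}
freqs = {"the": 50000, "house": 800, "zebra": 3}
predicted = {"the": "il", "house": "cane", "zebra": "gatto"}
per_bucket = accuracy_by_frequency(predicted, gold, freqs, thresholds=[10, 1000])
```

If accuracy collapses outside the highest-frequency bucket, that would support the suspicion that only a handful of frequent words are being aligned.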

Summary by Tiago Vinhoza 7 years ago
Motivation:
  • Take advantage of the fact that missing values can be very informative about the label.
  • Sampling a time series generates many missing values.

[Figure: Sampling]

Model (indicator flag):
  • Indicator of occurrence of missing value.

[Figure: Indicator]

  • An RNN can learn about missing values and their importance only by using the indicator function. The nonlinearity from this type of model helps capture these dependencies.
  • If one wants to use a linear model, feature engineering is needed to overcome its limitations.
    • indicator for whether a variable was measured at all
    • mean and standard deviation of the indicator
    • frequency with which a variable switches from measured to missing and vice-versa.
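
For one variable in one episode, the three hand-engineered features listed above can be computed as in this sketch. The convention that 1 means "missing", the population standard deviation, and the feature names are assumptions.

```python
import math

def indicator_features(indicator):
    """Summarize a 0/1 missingness indicator (1 = missing at that time step)."""
    n = len(indicator)
    mean = sum(indicator) / n
    std = math.sqrt(sum((m - mean) ** 2 for m in indicator) / n)
    # Count transitions between measured (0) and missing (1), in either direction.
    switches = sum(a != b for a, b in zip(indicator, indicator[1:]))
    return {
        "ever_measured": int(any(m == 0 for m in indicator)),
        "missing_mean": mean,
        "missing_std": std,
        "switch_rate": switches / (n - 1) if n > 1 else 0.0,
    }

# Toy usage: missing, measured, measured, missing.
features = indicator_features([1, 0, 0, 1])
never_measured = indicator_features([1, 1, 1])
```

These summaries give a linear model a fixed-size view of the missingness pattern, which an RNN can instead pick up directly from the indicator sequence.
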
Architecture:
  • RNN with target replication following the work "Learning to Diagnose with LSTM Recurrent Neural Networks" by the same authors.

[Figure: Architecture]

Dataset:
  • Children's Hospital LA
  • Episode is a multivariate time series that describes the stay of one patient in the intensive care unit
Dataset properties:
  • Number of episodes: 10,401
  • Duration of episodes: from 12h to several months
  • Time series variables: Systolic blood pressure, Diastolic blood pressure, Peripheral capillary refill rate, End tidal CO2, Fraction of inspired O2, Glasgow coma scale, Blood glucose, Heart rate, pH, Respiratory rate, Blood O2 Saturation, Body temperature, Urine output.
Experiments and Results:

Goal

  • Predict 128 diagnoses.
  • Multilabel: patients can have more than one diagnosis.

Methodology

  • Split: 80% training, 10% validation, 10% test.
  • Normalized data to be in the range [0,1].

  • LSTM RNN:

    • 2 hidden layers with 128 cells. Dropout = 0.5, L2-regularization: 1e-6
    • Training for 100 epochs. The parameters kept are those from the epoch with the smallest error on the validation dataset.
  • Baselines:
    • Logistic Regression (L2 regularization)
    • MLP with 3 hidden layers and 500 hidden neurons / layer (parameters chosen via validation set)
    • Tested with raw-features and hand-engineered features.
  • Strategies for missing values:
    • Zeroing
    • Impute via forward / backfilling
    • Impute with zeros and use indicator function
    • Impute via forward / backfilling and use indicator function
    • Use indicator function only
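
The strategies above can be sketched for a single variable as follows, with `None` marking a missing measurement. Falling back to zero when a variable is never measured is an assumption of this sketch, not something stated in the summary.

```python
def zero_impute(series):
    """Zeroing: replace every missing value with 0."""
    return [0.0 if v is None else v for v in series]

def forward_back_fill(series):
    """Carry the last observation forward, then fill leading gaps backward."""
    filled = list(series)
    last = None
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = last
        else:
            last = v
    nxt = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    # Assumed fallback: a variable that was never measured becomes all zeros.
    return [0.0 if v is None else v for v in filled]

def missing_indicator(series):
    """Indicator function: 1 where the value is missing, 0 where it was measured."""
    return [1 if v is None else 0 for v in series]

# Toy usage on one normalized variable.
raw = [None, 0.4, None, None, 0.9, None]
zeros = zero_impute(raw)
filled = forward_back_fill(raw)
flags = missing_indicator(raw)
# "Impute and use indicator": feed both channels to the model at each time step.
filled_with_flags = list(zip(filled, flags))
```

The "indicator only" strategy corresponds to feeding just `flags` and dropping the measured values altogether.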
Results
  • Metrics:
    • Micro AUC, Micro F1: calculated by adding the TPs, FPs, TNs and FNs for the entire dataset and for all classes.
    • Macro AUC, Macro F1: Arithmetic mean of AUCs and F1 scores for each of the classes.
    • Precision at 10: Fraction of correct diagnoses among the top 10 predictions of the model.
      • The upper bound for precision at 10 is 0.2281 since in the test set there are on average 2.281 diagnoses per patient.
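
A small sketch of the F1 and precision-at-k definitions above, on made-up multilabel data (AUC is omitted to keep it short). The helper names are hypothetical.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: one 0/1 label vector per patient, all of equal length."""
    num_classes = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(num_classes)]  # tp, fp, fn per class
    for truth, pred in zip(y_true, y_pred):
        for c, (t, p) in enumerate(zip(truth, pred)):
            if t == 1 and p == 1:
                counts[c][0] += 1
            elif p == 1:
                counts[c][1] += 1
            elif t == 1:
                counts[c][2] += 1
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro: compute F1 per class, then take the arithmetic mean.
    macro = sum(f1(*c) for c in counts) / num_classes
    return micro, macro

def precision_at_k(scores, truth, k):
    """Fraction of the k highest-scoring labels that are true labels."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(truth[i] for i in top) / k

# Toy usage: 2 patients, 3 possible diagnoses.
micro, macro = micro_macro_f1([[1, 0, 0], [1, 1, 0]], [[1, 0, 0], [1, 0, 1]])
p_at_2 = precision_at_k([0.9, 0.1, 0.8, 0.3], [1, 0, 0, 1], k=2)
# With 2.281 true diagnoses per patient on average, at most that many of the
# top 10 predictions can be correct on average.
upper_bound_p10 = 2.281 / 10
```

The gap between `micro` and `macro` in the toy data shows how macro averaging penalizes rare classes that the model misses entirely.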

[Figure: Results]

Discussion:
  • Predictive model based on data collected following a given routine. This routine can change if the model is put into practice. Will the model predictions in this new routine remain valid?

  • Missing values in a way give an indication of the type of treatment being followed.

  • Trade-off between complex models operating on raw features and more interpretable models operating on very complex (hand-engineered) features.

Summary by Hugo Larochelle 9 years ago

This paper suggests a method (NoBackTrack) for training recurrent neural networks in an online way, i.e. without having to do backprop through time. One way of understanding the method is that it applies the forward method for automatic differentiation, but since it requires maintaining a large Jacobian matrix (nb. of hidden units times nb. of parameters), they propose a way of obtaining a stochastic (but unbiased!) estimate of that matrix. Moreover, the method is improved by using Kalman filtering on that estimate, effectively smoothing the estimate over time.

My two cents

Online training of RNNs is a big, unsolved problem. The current approach people use is to truncate backprop to only a few steps in the past, which is more of a heuristic.

This paper makes progress towards a more principled approach. I really like the "rank-one trick" of Equation 7, really cute! And it is quite central to this method too, so good job on connecting those dots!
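
For readers who want the gist of a rank-one trick of this kind, here is a sketch of the random-sign version: a sum of outer products is replaced by a single outer product whose expectation equals the sum. Whether this matches Equation 7 term by term is an assumption, and any variance-reduction details from the paper are omitted.

```python
import itertools

def outer(v, w):
    return [[a * b for b in w] for a in v]

def signed_sum(signs, vectors):
    """Compute sum_i signs[i] * vectors[i], coordinate by coordinate."""
    return [sum(s * vec[k] for s, vec in zip(signs, vectors))
            for k in range(len(vectors[0]))]

def rank_one_estimate(vs, ws, signs):
    """One rank-one sample: (sum_i s_i v_i) outer (sum_i s_i w_i)."""
    return outer(signed_sum(signs, vs), signed_sum(signs, ws))

def exact_sum(vs, ws):
    """The quantity being estimated: sum_i (v_i outer w_i)."""
    rows, cols = len(vs[0]), len(ws[0])
    return [[sum(v[i] * w[j] for v, w in zip(vs, ws)) for j in range(cols)]
            for i in range(rows)]

# Toy check: averaging over all equally likely sign patterns gives the
# expectation, which should equal the exact sum because cross terms cancel.
vs = [[1, 2], [3, 0], [0, -1]]
ws = [[1, 0, 2], [0, 1, 1], [2, 2, 0]]
patterns = list(itertools.product([-1, 1], repeat=len(vs)))
estimates = [rank_one_estimate(vs, ws, s) for s in patterns]
mean_estimate = [[sum(e[i][j] for e in estimates) / len(estimates)
                  for j in range(len(ws[0]))] for i in range(len(vs[0]))]
```

Each individual sample is a poor approximation, but it only needs two vectors of storage instead of a full matrix, which is what makes the online setting tractable.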

The authors present this work as being preliminary, and indeed they do not compare with truncated backprop. I really hope they do in a future version of this work.

Also, I don't think I buy their argument that the "theory of stochastic gradient descent applies". Here's the reason. The method tracks the Jacobian of the hidden state wrt the parameters, which they denote $G(t)$. It is updated into $G(t+1)$, using a recursion which is based on the chain rule. However, between computing $G(t)$ and $G(t+1)$, a gradient step is performed during training. This means that $G(t)$ is now slightly stale, and corresponds to the gradient with respect to the old value of the parameters, not the current value. As far as I understand, this implies that $G(t+1)$ (more specifically, its stochastic estimate as proposed in this paper) isn't unbiased anymore. So, unless I'm missing something (which I might!), I don't think we can invoke the theory of SGD as they suggest.
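
For reference, the chain-rule recursion in question can be written as follows, assuming the hidden state evolves as $h(t+1) = F(h(t), x(t), \theta)$ (the symbols $F$, $h$, $x$ and $\theta$ are notation chosen here and may differ from the paper's):

$$G(t+1) = \frac{\partial F}{\partial h}\big(h(t), x(t), \theta\big)\, G(t) + \frac{\partial F}{\partial \theta}\big(h(t), x(t), \theta\big), \qquad G(t) = \frac{\partial h(t)}{\partial \theta}.$$

With a fixed $\theta$ this is exact. With online updates, $\theta$ changes at every step, so $G(t)$ has been accumulated using earlier parameter values while the two partial derivatives are evaluated at the current one.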

But frankly, that last issue seems pretty unavoidable in the online setting. I suspect this will never be solved, and future research will somehow have to design learning algorithms that are robust to this issue (or develop new theory that shows it isn't one).

So overall, kudos to the authors, and I'm really looking forward to reading more about where this research goes!
