Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning
Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, Steffen Udluft
arXiv e-Print archive, 2017
Keywords: stat.ML, cs.LG

Summary by luyuchen 6 years ago

The paper starts from a BNN with latent variables and proposes an entropy-based and a variance-based measure of predictive uncertainty. For each measure, the authors derive a decomposition into an aleatoric term and an epistemic term. A toy regression experiment illustrates the decomposition and what each term captures. The authors then use the uncertainty measures in an active learning scheme on the same toy problems: for each batch, the learner actively selects which data points to label. The results show that using epistemic uncertainty alone outperforms using total uncertainty, and both outperform a simple Gaussian process baseline. This is intuitive, since epistemic uncertainty directly reflects uncertainty over the model weights, whereas sampling from regions of high aleatoric uncertainty does not help supervised learning (that noise is irreducible).
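To make the variance-based decomposition concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code): given predictions sampled for several weight draws $w_i$ and, per weight draw, several latent-noise draws $z_j$, the law of total variance splits the predictive variance into the two terms. All names below are assumptions.

```python
import numpy as np

def variance_decomposition(pred_samples):
    """Split predictive variance into epistemic and aleatoric parts.

    pred_samples: shape (n_weights, n_latents), holding predictions
    y ~ p(y | x, w_i, z_j) for weight samples w_i and latent samples z_j.
    """
    # Aleatoric: mean (over weights) of the variance induced by latent noise.
    aleatoric = np.mean(np.var(pred_samples, axis=1))
    # Epistemic: variance (over weights) of each model's mean prediction.
    epistemic = np.var(np.mean(pred_samples, axis=1))
    return epistemic, aleatoric  # total variance ~ epistemic + aleatoric

# Toy check: per-weight mean shifts (epistemic) plus observation noise (aleatoric).
rng = np.random.default_rng(0)
means = rng.normal(0.0, 1.0, size=(50, 1))             # one mean per weight draw
samples = rng.normal(loc=means, scale=0.5, size=(50, 200))
epi, ale = variance_decomposition(samples)             # epi ~ 1.0, ale ~ 0.25
```

In the active-learning scheme, one would then label the candidate inputs with the largest epistemic term.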

The authors then extend model-based RL by adding a risk term to the objective that considers both the aleatoric and the epistemic components; these relate to noise aversion and model bias, respectively. Experiments on the Industrial Benchmark show that the method prevents the policy from overfitting the learned model and transfers better to the real system, but it appears quite sensitive to the hyperparameters $\beta$ and $\gamma$.
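The paper's exact risk criterion is more involved, but the idea can be sketched as follows; in particular, the way $\gamma$ mixes the two risk terms below is my own assumption for illustration, not the paper's formula.

```python
import numpy as np

def risk_sensitive_objective(returns, beta, gamma):
    """Risk-adjusted return for model-based policy search (illustrative form).

    returns: array (n_weights, n_rollouts) of returns from rollouts in the
    learned model, one row per posterior weight sample.
    """
    # Aleatoric risk: average within-model spread of returns (noise aversion).
    aleatoric_std = np.mean(np.std(returns, axis=1))
    # Epistemic risk: spread of per-model mean returns (model-bias aversion).
    epistemic_std = np.std(np.mean(returns, axis=1))
    risk = gamma * aleatoric_std + (1.0 - gamma) * epistemic_std
    return np.mean(returns) - beta * risk
```

Penalizing the epistemic spread keeps the policy away from regions where the learned model is unreliable (model bias), while penalizing the aleatoric spread encodes ordinary noise aversion.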

Summary by luyuchen 7 years ago

This paper proposes a simple method for training on new tasks sequentially while avoiding catastrophic forgetting. It starts from the Bayesian formulation of learning a model,

$$ \log P(\theta | D) = \log P(D | \theta) + \log P(\theta) - \log P(D) $$

Swapping the prior for the posterior of the previous task(s) gives

$$ \log P(\theta | D) = \log P(D | \theta) + \log P(\theta | D_{prev}) - \log P(D) $$

The paper approximates the previous posterior as a Gaussian,

$$ P(\theta | D_{prev}) \approx N(\theta^{prev*}, \mathrm{diag}(F)^{-1}) $$

where $F$ is the Fisher Information matrix $E_x[ \nabla_\theta \log P(x|\theta) (\nabla_\theta \log P(x|\theta))^T ]$ (the covariance is the inverse Fisher, since the Fisher approximates the posterior precision). The resulting objective function is

$$ L(\theta) = L_{new}(\theta) + \frac{\lambda}{2}\sum_i F_{ii} (\theta_i - \theta^{prev*}_i)^2 $$

where $L_{new}$ is the loss on the new task and $\theta^{prev*}$ is the best parameter found on the previous task(s). The penalty can be read as a squared distance in which the Fisher Information matrix scales each dimension appropriately, and the paper shows empirically that this scaling matters by comparing against a plain $L_2$ distance.
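A short PyTorch sketch of the two ingredients, the diagonal Fisher estimate and the quadratic penalty (my own illustration of the method, not the authors' code):

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader):
    """Diagonal Fisher, F_ii = E[(d log p(y|x,theta) / d theta_i)^2].

    Assumes data_loader yields single examples (batch size 1), so squared
    gradients are per-example, as the expectation requires.
    """
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    count = 0
    for x, _ in data_loader:
        model.zero_grad()
        log_probs = F.log_softmax(model(x), dim=1)
        # Sample y from the model's own predictive distribution, not the labels.
        y = torch.multinomial(log_probs.exp(), 1).squeeze(1)
        F.nll_loss(log_probs, y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        count += 1
    return {n: f / count for n, f in fisher.items()}

def ewc_penalty(model, fisher, theta_prev, lam):
    """lambda/2 * sum_i F_ii (theta_i - theta_prev_i)^2, added to L_new."""
    penalty = sum((fisher[n] * (p - theta_prev[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return 0.5 * lam * penalty
```

Training on a new task then minimizes `loss_new + ewc_penalty(model, fisher, theta_prev, lam)`, with `fisher` and `theta_prev` frozen from the previous task.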

Summary by luyuchen 7 years ago

This paper proposes a method to obtain a non-vacuous bound on the generalization error by optimizing a PAC-Bayes bound directly. The interesting part is that the authors leverage the "black magic" of neural-net training itself to bound the neural net. To find the optimal posterior $Q$, their loss function is an empirical error term plus $KL(Q \| P)$, where they choose the prior $P$ to be $N(0, \lambda I)$ and also provide a justification for choosing $\lambda$. Overall, the objective is similar to variational inference in a Bayesian neural net, and the authors obtain a test error bound of $17\%$ on MNIST, where traditional bounds are mostly vacuous.
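A hedged PyTorch sketch of the optimization idea, using a McAllester-style relaxation of the PAC-Bayes bound (the paper optimizes a tighter variant) with a factorized Gaussian $Q$ trained via the reparameterization trick; all names here are my own:

```python
import math
import torch

def kl_gaussian(mu_q, logvar_q, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(0, exp(logvar_p) * I) )."""
    var_q = logvar_q.exp()
    var_p = math.exp(logvar_p)
    return 0.5 * (var_q / var_p + mu_q ** 2 / var_p - 1.0
                  + logvar_p - logvar_q).sum()

def pac_bayes_objective(emp_loss, mu, logvar, logvar_prior, n, delta=0.025):
    """Empirical surrogate risk plus a McAllester-style complexity term.

    emp_loss is computed on a weight sample w = mu + exp(0.5 * logvar) * eps
    (reparameterization), so the objective is differentiable in (mu, logvar).
    """
    kl = kl_gaussian(mu, logvar, logvar_prior)
    complexity = torch.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta))
                            / (2.0 * n))
    return emp_loss + complexity
```

Minimizing this with SGD over `(mu, logvar)` is the "use the network's own optimizer to bound the network" trick the summary alludes to: the bound being optimized is itself the training objective.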
