Welcome to ShortScience.org!
[link]
* Deep plain/ordinary networks usually perform better than shallow networks.
* However, when they get too deep their performance on the *training* set decreases. That should never happen and is a shortcoming of current optimizers.
* If the "good" insights of the early layers could be transferred through the network unaltered, while changing/improving the "bad" insights, that effect might disappear.
### What residual architectures are
* Residual architectures use identity functions to transfer results from previous layers unaltered.
* They change these previous results based on results from convolutional layers.
* So while a plain network might do something like `output = convolution(image)`, a residual network will do `output = image + convolution(image)`.
* If the convolution resorts to just doing nothing, that will make the result a lot worse in the plain network, but not alter it at all in the residual network.
* So in the residual network, the convolution can focus fully on learning what positive changes it has to perform, while in the plain network it *first* has to learn the identity function and then what positive changes it can perform.
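The contrast above can be sketched in a few lines (a toy illustration, not the paper's architecture; the `do_nothing` "layer" stands in for a convolution that has learned nothing):

```python
import numpy as np

def residual_block(x, f):
    # output = x + f(x): if f collapses to zero, the input passes through unchanged
    return x + f(x)

def plain_block(x, f):
    # output = f(x): if f collapses to zero, the input is destroyed
    return f(x)

x = np.array([1.0, 2.0, 3.0])
do_nothing = lambda v: np.zeros_like(v)  # a "layer" that has learned nothing

print(residual_block(x, do_nothing))  # input preserved
print(plain_block(x, do_nothing))     # input lost
```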
### How it works
* Residual architectures can be implemented in most frameworks. You only need something like a split layer and an element-wise addition.
* Use one branch with an identity function and one with 2 or more convolutions (1 is also possible, but seems to perform poorly). Merge them with the element-wise addition.
* Rough example block (for a 64x32x32 input):
https://i.imgur.com/NJVb9hj.png
* An example block when you have to change the dimensionality (e.g. here from 64x32x32 to 128x32x32):
https://i.imgur.com/9NXvTjI.png
* The authors seem to prefer using either two 3x3 convolutions or the chain of 1x1 then 3x3 then 1x1. They use the latter one for their very deep networks.
* The authors also tested:
* To use 1x1 convolutions instead of identity functions everywhere. Performed a bit better than using 1x1 only for dimensionality changes. However, it also increases computation and memory demands.
* To use zero-padding for dimensionality changes (no 1x1 convs, just fill the additional dimensions with zeros). Performed only a bit worse than 1x1 convs and a lot better than plain network architectures.
* Pooling can be used as in plain networks. No special architectures are necessary.
* Batch normalization can be used as usual (before nonlinearities).
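A minimal sketch of the split + element-wise addition wiring (hypothetical shapes; the "convolutions" here are plain 1x1 channel mixes standing in for the real 3x3 convolutions, and batch normalization is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in) -- pure channel mixing
    return np.einsum('oc,chw->ohw', w, x)

def residual_block(x, w1, w2, w_proj=None):
    f = np.maximum(conv1x1(x, w1), 0.0)  # first "conv" + ReLU
    f = conv1x1(f, w2)                   # second "conv"
    # shortcut branch: identity, or a 1x1 projection when channels change
    shortcut = x if w_proj is None else conv1x1(x, w_proj)
    return np.maximum(f + shortcut, 0.0)  # element-wise add, then ReLU

x = rng.normal(size=(64, 32, 32))
# same dimensionality (64x32x32 -> 64x32x32): identity shortcut
y = residual_block(x, rng.normal(size=(64, 64)), rng.normal(size=(64, 64)))
# dimensionality change (64x32x32 -> 128x32x32): projection shortcut
y2 = residual_block(x, rng.normal(size=(128, 64)), rng.normal(size=(128, 128)),
                    w_proj=rng.normal(size=(128, 64)))
print(y.shape, y2.shape)
```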
### Results
* Residual networks seem to perform generally better than similarly sized plain networks.
* They seem to be able to achieve similar results with less computation.
* They enable well-trainable very deep architectures with up to 1000 layers and more.
* The activations of the residual layers are low compared to plain networks. That indicates that the residual networks indeed only learn to make "good" changes and default to "if in doubt, change nothing".

*Examples of basic building blocks (other architectures are possible). The paper doesn't discuss the placement of the ReLU (after add instead of after the layer).*

*Activations of layers (after batch normalization, before nonlinearity) throughout the network for plain and residual nets. Residual networks have on average lower activations.*
-------------------------
### Rough chapter-wise notes
* (1) Introduction
* In classical architectures, adding more layers can cause the network to perform worse on the training set.
* That shouldn't be the case. (E.g. a shallower network could be trained and then get a few layers of identity functions on top of it to create a deep network.)
* To combat that problem, they stack residual layers.
* A residual layer is an identity function and can learn to add something on top of that.
* So if `x` is an input image and `f(x)` is a convolution, they do something like `x + f(x)` or even `x + f(f(x))`.
* The classical architecture would be more like `f(f(f(f(x))))`.
* Residual architectures can be easily implemented in existing frameworks using skip connections with identity functions (split + merge).
* Residual architectures outperformed all others in ILSVRC 2015 and COCO 2015.
* (3) Deep Residual Learning
* If some layers have to fit a function `H(x)` then they should also be able to fit `H(x) - x` (change between `x` and `H(x)`).
* The latter case might be easier to learn than the former one.
* The basic structure of a residual block is `y = x + F(x, W)`, where `x` is the input image, `y` is the output image (`x + change`) and `F(x, W)` is the residual subnetwork that estimates a good change of `x` (W are the subnetwork's weights).
* `x` and `F(x, W)` are added using element-wise addition.
* `x` and the output of `F(x, W)` must have equal dimensions (channels, height, width).
* If different dimensions are required (mainly change in number of channels) a linear projection `V` is applied to `x`: `y = F(x, W) + Vx`. They use a 1x1 convolution for `V` (without nonlinearity?).
* `F(x, W)` subnetworks can contain any number of layers. They suggest 2+ convolutions. Using only 1 layer seems to be useless.
* They run some tests on a network with 34 layers and compare to a 34 layer network without residual blocks and with VGG (19 layers).
* They say that their architecture requires only 18% of the FLOPs of VGG. (Though a lot of that probably comes from VGG's 2x4096 fully connected layers? They don't use any fully connected layers, only convolutions.)
* A critical part is the change in dimensionality (e.g. from 64 kernels to 128). They test (A) adding the new dimensions empty (padding), (B) using the mentioned linear projection with 1x1 convolutions and (C) using the same linear projection, but on all residual blocks (not only for dimensionality changes).
* (A) doesn't add parameters, (B) does (i.e. breaks the pattern of using identity functions).
* They use batch normalization before each nonlinearity.
* Optimizer is SGD.
* They don't use dropout.
* (4) Experiments
* When testing on ImageNet an 18 layer plain (i.e. not residual) network has lower training set error than a deep 34 layer plain network.
* They argue that this effect does probably not come from vanishing gradients, because they (a) checked the gradient norms and they looked healthy and (b) use batch normalization.
* They guess that deep plain networks might have exponentially low convergence rates.
* For the residual architectures it's the other way round. Stacking more layers improves the results.
* The residual networks also perform better (in error %) than plain networks with the same number of parameters and layers. (Both for training and validation set.)
* Regarding the previously mentioned handling of dimensionality changes:
* (A) Pad new dimensions: Performs worst. (Still far better than plain network though.)
* (B) Linear projections for dimensionality changes: Performs better than A.
* (C) Linear projections for all residual blocks: Performs better than B. (Authors think that's due to introducing new parameters.)
* They also test on very deep residual networks with 50 to 152 layers.
* For these deep networks their residual block has the form `1x1 conv -> 3x3 conv -> 1x1 conv` (i.e. dimensionality reduction, convolution, dimensionality increase).
* These deeper networks perform significantly better.
* In further tests on CIFAR-10 they can observe that the activations of the convolutions in residual networks are lower than in plain networks.
* So the residual networks default to doing nothing and only change (activate) when something needs to be changed.
* They test a network with 1202 layers. It is still easily optimizable, but overfits the training set.
* They also test on COCO and get significantly better results than a Faster-R-CNN+VGG implementation.
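Option (A) from the notes, zero-padding the shortcut on a channel increase, can be sketched as follows (toy shapes, channels-first layout assumed):

```python
import numpy as np

def pad_shortcut(x, c_out):
    # On a channel increase, keep the identity shortcut parameter-free by
    # filling the new feature maps with zeros instead of learning a 1x1 projection.
    c_in = x.shape[0]
    pad = np.zeros((c_out - c_in,) + x.shape[1:])
    return np.concatenate([x, pad], axis=0)

x = np.ones((64, 32, 32))
s = pad_shortcut(x, 128)   # 64x32x32 -> 128x32x32, no new parameters
print(s.shape)
```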
[link]
TLDR; The authors train a word-level NMT model where UNK tokens in both source and target sentences are replaced by character-level RNNs that produce word representations. The authors can thus train a fast word-based system that still generalizes and doesn't produce unknown words. The best system achieves a new state-of-the-art BLEU score of 19.9 on WMT'15 English-to-Czech translation.

#### Key Points

- Source sentence: The final hidden state of the character-RNN is used as the word representation.
- Source sentence: Character-RNNs are always initialized with a zero state to allow efficient pre-training.
- Target: Produce the word-level sentence, including UNK tokens, first, and then run the char-RNNs.
- Target: Two ways to initialize the char-RNN: with the same hidden state as the word-RNN (same-path), or with its own representation (separate-path).
- The authors find that the attention mechanism is critical for pure character-based NMT models.

#### Notes

- Given that the authors demonstrate the potential of character-based models, is the hybrid approach the right direction? If we had more compute power, would pure character-based models win?
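The source-side idea, using the final hidden state of a character-RNN as the word representation, might look roughly like this (a vanilla RNN with made-up dimensions instead of the paper's GRU; all names and shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 4-dim character embeddings, 8-dim hidden state.
chars = "abcdefghijklmnopqrstuvwxyz"
char_emb = {c: rng.normal(size=4) for c in chars}
W_h = rng.normal(size=(8, 8)) * 0.1
W_x = rng.normal(size=(8, 4)) * 0.1

def char_rnn_embed(word):
    h = np.zeros(8)  # zero initial state, as for source-side words in the paper
    for ch in word:
        h = np.tanh(W_h @ h + W_x @ char_emb[ch])
    return h  # final hidden state serves as the word representation

vec = char_rnn_embed("unknownword")  # any OOV word gets a representation
print(vec.shape)
```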
[link]
Amazon’s platform is built upon different techniques working together to provide a single powerful, highly available system. One of the core components powering this system is Dynamo. Many services in the Amazon ecosystem store and retrieve data by primary key. Take the example of the shopping cart service: customers should be able to view and update their cart at any time. For these services, sophisticated solutions like RDBMS, GFS, etc. are overkill, as they do not need a complex query and data management system. Instead, they need a service that only supports read and write operations on a (key, value) store where the value is a small object (less than 1 MB in size) uniquely identified by the key. The service should be scalable and highly available with a well-defined consistency window. This is what Dynamo is: a scalable, distributed, highly available key-value store that provides an “always-on” experience.

#### Design Considerations

Dynamo achieves high availability at the cost of weaker consistency. Changes propagate to the replicas in the background, and conflict resolution is done at read time to make sure none of the write operations can fail. Dynamo uses simple policies like “last write wins” for conflict resolution, though applications using Dynamo may override these with their own methods. E.g. the cart application may choose to add items across all versions to make sure none of the items is lost. A service could depend on multiple other services to compute its results. To guarantee that a service returns its results in a bounded time, each of its dependencies has to return results with even tighter bounds. As a result, clients enter into a contract with servers regarding service-related characteristics like expected request rate, expected latency, and so on. Such an agreement is called a Service Level Agreement (SLA) and must be met to ensure efficiency. SLAs apply in the context of Dynamo as well.
Dynamo supports incremental scaling, where the system is able to scale out one node at a time. Moreover, all nodes are symmetrical in the sense that they have the same set of responsibilities. Since Dynamo is used only by Amazon’s internal applications, there are no security-related requirements like authentication and authorization.

#### Architecture

Dynamo exposes two operations: get() and put(). get(key) returns the value or list of values, along with context objects, corresponding to the key. put(key, context, value) stores the value and the context corresponding to the key. Context objects are used for conflict resolution. To support incremental scaling, Dynamo uses consistent hashing for its partitioning scheme. In consistent hashing, the output range of a hash function is treated as a fixed circular space. Each node and data object is assigned a random value or position within this space. A data object is mapped to the first node placed clockwise from the position of the data object. Every data item is replicated at N hosts. So every time a data item is assigned to a node, it is replicated to the N-1 clockwise successor nodes as well. The list of nodes storing a data item is called its preference list. Generally the preference list contains more than N nodes to account for system and network failures. An example case is shown with N = 3: any key between A and B would be mapped to B (by the consistent hashing logic) and to C and D (by the replication logic). https://cdn-images-1.medium.com/max/800/1*66VMYcQfvG3Z2acQD7aeYQ.png Each time data is created or updated, a new version of the data is created, so for a given key several versions of the data (or value) can exist. For versioning, Dynamo uses vector clocks. A vector clock is a list of (node, counter) pairs. When a put operation reaches node X, the node uses the context from the put request to know which version it is updating.
If there is an entry corresponding to X in the vector clock, the counter is incremented; otherwise a new entry is created for node X with counter = 1. When retrieving the value corresponding to a key, the node resolves conflicts among all versions based on Dynamo’s logic or the client’s logic. A likely issue with this approach is that the vector clock list may grow very large. To mitigate this, Amazon keeps evicting pairs from the list, in ascending order of the time when each entry was created, until the size drops below a threshold. Amazon has not faced any issues related to loss of accuracy with this approach. They also observed that the percentage of data with at least 2 versions is about 0.06%. Dynamo uses a quorum system to maintain consistency. For a read (or write) operation to be successful, R (or W) replicas out of N must participate in the operation successfully, with the condition that R + W > N. If some of the first N replicas are not available, say due to network failure, the read and write operations are performed on the first N healthy nodes. E.g. if node A is down, node B can be included in its place for the quorum. In this case, B would keep track of the data it received on behalf of A, and when A comes back online, B would hand this data over to A. This way a sloppy quorum is achieved. It is possible that B itself becomes unavailable before it can return the data to A. In this case, anti-entropy protocols are used to keep replicas synchronized. In Dynamo, each node maintains a Merkle tree for each key range it hosts. A Merkle tree is a hash tree whose leaves are hash values of individual keys and whose parents are hash values of their children. Nodes A and B exchange the roots of the Merkle trees corresponding to the set of keys they both host. This allows branches to be checked for inconsistencies without having to traverse the entire tree: a branch is traversed only when the hash values at the top of the branch differ.
This way the amount of data to be transferred for synchronization is minimized. The nodes in a cluster communicate via a gossip-based protocol, in which each node contacts a random peer and the two nodes reconcile their persisted membership history. This ensures an eventually consistent membership view. Apart from this, some nodes are marked as seed nodes, which are known to all nodes, including ones that join later. Seed nodes ensure that logical partitions are not created within the network even when new nodes are added. Since consistent hashing is used, the overhead of key reallocation when adding a new node is quite low.

#### Routing

There are 2 modes of routing requests in Dynamo. In the first mode, servers route the request. The node fulfilling the request is called the coordinator. For a read request, any node can act as the coordinator. For a write request, the coordinator is one of the nodes in the key’s current preference list, so if the write request reaches a node that is not in the preference list, it routes the request to one of the nodes that is. In the alternate mode, the client downloads the current membership state from any Dynamo node and determines itself which node to send the write request to. This approach saves an extra hop within the server cluster, but it assumes the membership state is fresh.

#### Optimizations

Apart from the architecture described above, Dynamo uses optimizations like read-repair: during a quorum read, if a node returns a stale response, it is updated with the latest version of the data. Similarly, since writes usually follow reads, the coordinator for a write operation is the node that replied fastest to the previous read operation. This increases the chances of read-your-writes consistency. To further reduce latency, each node maintains an object buffer in its main memory where write operations are stored, to be written to disk by a separate thread.
Read operations also check the in-memory buffer before going to disk. There is an added risk of a node crashing before writing the objects from its buffer to disk. To mitigate this, one of the N replicas performs a durable write, i.e. the data is written to disk. Since the quorum requires only W responses, the latency of this one node does not affect performance. Amazon also experimented with different partitioning schemes to ensure uniform load distribution, and adopted the scheme in which the hash space is divided into Q equally sized partitions and the placement of partitions is decoupled from the partitioning scheme.

#### Lessons Learnt

Although Dynamo is primarily designed as a write-intensive data store, N, R and W provide ample control to modify its behavior for other scenarios as well. For example, setting R = 1 and W = N makes it a high-performance read engine; services maintaining product catalogs and promotional items can use this mode. Similarly, setting W = 1 means a write request is never rejected as long as at least one server is up, though this increases the risk of inconsistency. Given that Dynamo allows clients to override the conflict resolution methods, it becomes a general solution for many more scenarios than it was originally intended for. One limitation is the small size of data for which it is designed. The choice makes sense in the context of Amazon, but it would be interesting to see how storing larger values affects performance. The response time would obviously increase, as more data needs to be transferred and the in-memory buffers would hold fewer objects. But with caching and larger in-memory buffers, the response time might be brought down enough that Dynamo could be used with somewhat larger data objects as well.
Dynamo scales well to a few hundred nodes, but it will not scale equally well to tens of thousands of nodes because of the large overhead of maintaining and distributing the routing table, whose size increases with the number of nodes. Another problem Amazon did not have to face was a high conflict rate: they observed that around 99.94% of requests saw exactly one version. Had this number been higher, latency would have been higher as well. All in all, Dynamo is not a universal solution for a distributed key-value store, but it solves one problem and solves it very well.
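The partitioning and replication scheme described above can be sketched as follows (a simplified illustration, not Amazon's implementation; virtual nodes and the extra nodes kept in the preference list for failure handling are omitted):

```python
import hashlib
from bisect import bisect_right

def ring_pos(name, space=2**32):
    # Hash a node name or key onto the fixed circular space.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % space

class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        # Each node gets one position on the ring (real Dynamo uses virtual nodes).
        self.ring = sorted((ring_pos(node), node) for node in nodes)

    def preference_list(self, key):
        # A key maps to the first node clockwise from its position,
        # plus the next N-1 successors as replicas.
        positions = [pos for pos, _ in self.ring]
        i = bisect_right(positions, ring_pos(key))
        return [self.ring[(i + k) % len(self.ring)][1] for k in range(self.n)]

ring = Ring(["A", "B", "C", "D", "E"], n_replicas=3)
pl = ring.preference_list("cart:alice")
print(pl)  # 3 distinct nodes, clockwise from the key's position
```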
[link]
This paper shows how to train a character-level RNN to generate text using only the GAN objective (reinforcement learning and the maximum-likelihood objective are not used). The baseline WGAN is made up of:

* A recurrent **generator** that first embeds the previously emitted token and inputs this into a GRU, which outputs a state that is then transformed into a distribution over the character vocabulary (representing the model's belief about the next output token).
* A recurrent **discriminator** that embeds each input token and then feeds them into a GRU. A linear transformation is applied to the final hidden state to give a "score" to the input (a correctly-trained discriminator should give a high score to real sequences of text and a low score to fake ones).

The paper shows that if you try to train this baseline model to generate sequences of length 32, it just won't work (only gibberish is generated). In order to get the model to work, the baseline model is augmented in three ways:

1. **Curriculum Learning**: At first the generator has to generate sequences of length 1 and the discriminator only trains on real and generated sequences of length 1. After a while, the model moves on to sequences of length 2, then 3, and so on, until length 32 is reached.
2. **Teacher Helping**: In GANs the generator is usually too weak. To help it, this paper proposes a method in which at stage $i$ in the curriculum, when the generator should generate sequences of length $i$, it is fed a real sequence of length $i-1$ and asked to generate just 1 more character.
3. **Variable Lengths**: At each stage $i$ of the curriculum, sequences of length $k$ are generated and discriminated for every $1 \leq k \leq i$ in each batch (instead of only sequences of length exactly $i$).

[[code]](https://github.com/amirbar/rnn.wgan)
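The interaction of curriculum learning and variable lengths can be sketched schematically (no real training here; `train_step` is a hypothetical placeholder for one generator/discriminator update at a given sequence length):

```python
MAX_LEN = 32  # target sequence length from the paper

def curriculum(train_step):
    # Curriculum: advance from length-1 sequences up to MAX_LEN.
    for stage in range(1, MAX_LEN + 1):
        # Variable lengths: at stage i, also train on every length k <= i.
        for k in range(1, stage + 1):
            train_step(seq_len=k)

calls = []
curriculum(lambda seq_len: calls.append(seq_len))
print(len(calls))  # total number of train_step calls across all stages
```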
[link]
* Stochastic Depth (SD) is a method for residual networks, which randomly removes/deactivates residual blocks during training.
* As such, it is similar to dropout.
* While dropout removes neurons, SD removes blocks (roughly the layers of a residual network).
* One can argue that dropout randomly changes the width of layers, while SD randomly changes the depth of the network.
* One can argue that using dropout is similar to training an ensemble of networks with different layer widths, while using SD is similar to training an ensemble of networks with different depths.
* Using SD has the following advantages:
* It decreases the effects of vanishing gradients, because on average the network is shallower during training (per batch), thereby increasing the gradient that reaches the early blocks.
* It increases training speed, because on average less convolutions have to be applied (due to blocks being removed).
* It has a regularizing effect, because blocks cannot easily co-adapt any more. (Similar to dropout avoiding co-adaption of neurons.)
* If using an increasing removal probability for later blocks: It spends more training time on the early (and thus most important) blocks than on the later blocks.
### How
* Normal formula for a residual block (test and train):
* `output = ReLU(f(input) + identity(input))`
* `f(x)` is usually one or two convolutions.
* Formula with SD (during training):
* `output = ReLU(b * f(input) + identity(input))`
* `b` is either exactly `1` (block survived, i.e. is not removed) or exactly `0` (block was removed).
* `b` is sampled from a Bernoulli random variable with hyperparameter `p`.
* `p` is the survival probability of a block (i.e. chance to *not* be removed). (Note that this is the opposite of dropout, where higher values lead to more removal.)
* Formula with SD (during test):
* `output = ReLU(p * f(input) + input)`
* `p` is the average probability with which this residual block survives during training, i.e. the hyperparameter of the Bernoulli variable.
* The test formula has to be changed, because the network adapts during training to blocks being missing; activating all blocks at the same time can lead to overly strong signals. This is similar to dropout, where weights also have to be rescaled at test time.
* There are two simple schemas to set `p` per layer:
* Uniform schema: Every block gets the same `p` hyperparameter, i.e. the last block has the same chance of survival as the first block.
* Linear decay schema: Survival probability is higher for early layers and decreases towards the end.
* The formula is `p = 1 - (l/L)(1-q)`.
* `l`: Number of the block for which to set `p`.
* `L`: Total number of blocks.
* `q`: Desired survival probability of the last block (0.5 is a good value).
* For linear decay with `q=0.5` and `L` blocks, on average `(3/4)L` blocks will be trained per minibatch.
* For linear decay with `q=0.5` the average speedup will be about `1/4` (25%). If using `q=0.2` the speedup will be ~40%.
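The train/test rules and the linear decay schema above can be sketched as follows (toy 1-D "activations" and an arbitrary `f` standing in for a real convolution block):

```python
import numpy as np

rng = np.random.default_rng(0)

def sd_block(x, f, p, training):
    if training:
        b = rng.random() < p  # Bernoulli sample: keep (True) or drop (False) the block
        return np.maximum((f(x) if b else 0.0) + x, 0.0)
    # Test time: scale the residual branch by its survival probability.
    return np.maximum(p * f(x) + x, 0.0)

def survival_prob(l, L, q=0.5):
    # Linear decay schema: p = 1 - (l/L)(1-q)
    return 1.0 - (l / L) * (1.0 - q)

L = 54  # number of residual blocks
probs = [survival_prob(l, L) for l in range(1, L + 1)]
# Expected number of surviving blocks per batch = sum of probabilities,
# roughly (3/4) * L for q=0.5.
print(sum(probs))
```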
### Results
* 152 layer networks with SD outperform identical networks without SD on CIFAR-10, CIFAR-100 and SVHN.
* The improvement in test error is quite significant.
* SD seems to have a regularizing effect. Networks with SD are not overfitting where networks without SD already are.
* Even networks with >1000 layers are well trainable with SD.
* The gradients that reach the early blocks of the networks are consistently significantly higher with SD than without SD (i.e. less vanishing gradient).
* The linear decay schema consistently outperforms the uniform schema (in test error). The best value seems to be `q=0.5`, though values between 0.4 and 0.8 all seem to be good. For the uniform schema only 0.8 seems to be good.

*Performance on SVHN with 152 layer networks with SD (blue, bottom) and without SD (red, top).*

*Performance on CIFAR-10 with 1202 layer networks with SD (blue, bottom) and without SD (red, top).*

*Optimal choice of the survival probability `p_L` (in this summary `q`) for the last layer, for the uniform schema (same for all other layers) and the linear decay schema (decreasing towards `p_L`). Linear decay performs consistently better and allows for lower `p_L` values, leading to more speedup.*
--------------------
### Rough chapter-wise notes
* (1) Introduction
* Problems of deep networks:
* Vanishing Gradients: During backpropagation, gradients approach zero due to being repeatedly multiplied with small weights. Possible counter-measures: Careful initialization of weights, "hidden layer supervision" (?), batch normalization.
* Diminishing feature reuse: Equivalent problem to vanishing gradients, but during forward propagation. The results of early layers are repeatedly multiplied with later layers' (randomly initialized) weights. The total result then becomes meaningless noise and doesn't have a clear/strong gradient to fix it.
* Long training time: The time of each forward-backward increases linearly with layer depth. Current 152-layer networks can take weeks to train on ImageNet.
* I.e.: Shallow networks can be trained effectively and fast, but deep networks would be much more expressive.
* During testing we want deep networks, during training we want shallow networks.
* They randomly "drop out" (i.e. remove) complete layers during training (per minibatch), resulting in shallow networks.
* Result: Lower training time *and* lower test error.
* While dropout randomly removes width from the network, stochastic depth randomly removes depth from the networks.
* While dropout can be thought of as training an ensemble of networks with different widths, stochastic depth can be thought of as training an ensemble of networks with different depths.
* Stochastic depth acts as a regularizer, similar to dropout and batch normalization. It allows deeper networks without overfitting (because 1000 layers clearly wasn't enough!).
* (2) Background
* Some previous methods to train deep networks: Greedy layer-wise training, careful initializations, batch normalization, highway connections, residual connections.
* <Standard explanation of residual networks>
* <Standard explanation of dropout>
* Dropout loses effectiveness when combined with batch normalization. Seems to have basically no benefit any more for deep residual networks with batch normalization.
* (3) Deep Networks with Stochastic Depth
* They randomly skip entire layers during training.
* To do that, they use residual connections. They select random layers and use only the identity function for these layers (instead of the full residual block of identity + convolutions + add).
* ResNet architecture: They use standard residual connections. ReLU activations, 2 convolutional layers (conv->BN->ReLU->conv->BN->add->ReLU). They use <= 64 filters per conv layer.
* While the standard formula for residual connections is `output = ReLU(f(input) + identity(input))`, their formula is `output = ReLU(b * f(input) + identity(input))` with `b` being either 0 (inactive/removed layer) or 1 (active layer), i.e. a sample of a Bernoulli random variable.
* The probabilities of the Bernoulli random variables are now hyperparameters, similar to dropout.
* Note that the probability here means the probability of *survival*, i.e. high value = more survivors.
* The probabilities could be set uniformly, e.g. to 0.5 for each variable/layer.
* They can also be set with a linear decay, so that the first layer has a very high probability of survival, while the last layer has a very low probability of survival.
* Linear decay formula: `p = 1 - (l/L)(1-q)` where `l` is the current layer's number, `L` is the total number of layers, `p` is the survival probability of layer `l` and `q` is the desired survival probability of the last layer (e.g. 0.5).
* They argue that linear decay is better, as the early layers extract low-level features and are therefore more important.
* The expected number of surviving layers is simply the sum of the probabilities.
* For linear decay with `q=0.5` and `L=54` (i.e. 54 residual blocks = 110 total layers) the expected number of surviving blocks is roughly `(3/4)L = (3/4)54 = 40`, i.e. on average 14 residual blocks will be removed per training batch.
* With linear decay and `q=0.5` the expected speedup of training is about 25%. `q=0.2` leads to about 40% speedup (while in one test still achieving the test error of the same network without stochastic depth).
* Depending on the `q` setting, they observe significantly lower test errors. They argue that stochastic depth has a regularizing effect (training an ensemble of many networks with different depths).
* Similar to dropout, the forward pass rule during testing must be slightly changed, because the network was trained with missing blocks. The residual formula during test time becomes `output = ReLU(p * f(input) + input)` where `p` is the average probability with which this residual block survives during training.
* (4) Results
* Their model architecture:
* Three chains of 18 residual blocks each, so 3*18 = 54 blocks per model.
* Number of filters per conv. layer: 16 (first chain), 32 (second chain), 64 (third chain)
* Between each block they use average pooling. Then they zero-pad the new dimensions (e.g. from 16 to 32 at the end of the first chain).
* CIFAR-10:
* Trained with SGD (momentum=0.9, dampening=0, lr=0.1 after 1st epoch, 0.01 after epoch 250, 0.001 after epoch 375).
* Weight decay/L2 of 1e-4.
* Batch size 128.
* Augmentation: Horizontal flipping, crops (4px offset).
* They achieve 5.23% error (compared to 6.41% in the original paper about residual networks).
* CIFAR-100:
* Same settings as before.
* 24.58% error with stochastic depth, 27.22% without.
* SVHN:
* They use both the hard and easy sub-datasets of images.
* They preprocess to zero-mean, unit-variance.
* Batch size 128.
* Learning rate is 0.1 (start), 0.01 (after epoch 30), 0.001 (after epoch 35).
* 1.75% error with stochastic depth, 2.01% error without.
* Network without stochastic depth starts to overfit towards the end.
* Stochastic depth with linear decay and `q=0.5` gives ~25% speedup.
* 1202-layer CIFAR-10:
* They trained a 1202-layer deep network on CIFAR-10 (previous tests: 152 layers).
* Without stochastic depth: 6.72% test error.
* With stochastic depth: 4.91% test error.
* (5) Analytic experiments
* Vanishing Gradient:
* They analyzed the gradient that reaches the first layer.
* The gradient with stochastic depth is consistently higher (throughout the epochs) than without stochastic depth.
* The difference is very significant after decreasing the learning rate.
* Hyper-parameter sensitivity:
* They evaluated with test error for different choices of the survival probability `q`.
* Linear decay schema: Values between 0.4 and 0.8 perform best. 0.5 is suggested (nearly the best value, good speedup). Even 0.2 improves the test error (compared to no stochastic depth).
* Uniform schema: 0.8 performs best, other values mostly significantly worse.
* Linear decay performs consistently better than the uniform schema.