[link]
* Task: extracting representations of language tied to the physical world.
* Learns new grounded concepts from a set of scenes containing only sentences, images, and indications of which objects are being referred to.
* The system includes:
  * A *semantic parsing model* that defines a distribution over logical meaning representations for each given sentence.
  * A set of visual attribute classifiers for each possible object in the scene.
  * A joint model that learns a mapping from logical constants in the logical form to the set of visual attribute classifiers.
* Depth and RGB values extracted from the images serve as features (shape and color attributes).
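As a rough illustration of the grounding step (not code from the paper), the sketch below scores the candidate objects in a scene with a per-concept attribute classifier; the classifier weights, feature layout, and threshold are assumptions for illustration only.

```python
import numpy as np

# Hypothetical attribute classifier: a logistic model over simple
# RGB + depth features, standing in for the paper's per-concept
# visual attribute classifiers.
def attribute_score(weights, bias, features):
    """Probability that an object exhibits the attribute."""
    return 1.0 / (1.0 + np.exp(-(np.dot(weights, features) + bias)))

# Toy scene: each object is described by (mean R, mean G, mean B, mean depth).
scene_objects = {
    "obj1": np.array([0.9, 0.1, 0.1, 0.5]),   # reddish object
    "obj2": np.array([0.2, 0.2, 0.8, 0.3]),   # bluish object
}

# Assume the semantic parser produced a logical form containing the
# constant `red`, and the joint model has mapped it to this classifier.
red_weights, red_bias = np.array([4.0, -2.0, -2.0, 0.0]), -1.0

# Ground the constant: keep the objects the classifier accepts.
grounding = {
    name: attribute_score(red_weights, red_bias, feats)
    for name, feats in scene_objects.items()
}
denotation = [name for name, p in grounding.items() if p > 0.5]
print(grounding, denotation)  # obj1 should be accepted, obj2 rejected
```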
[link]
This paper introduced Dropout, a new layer type. It has a parameter $\alpha \in (0, 1)$. The output dimensionality of a dropout layer is equal to its input dimensionality. With probability $\alpha$, each neuron's output is set to 0. At test time, the output of every neuron is multiplied by $(1 - \alpha)$ to compensate for the fact that no outputs are dropped. A much better paper, by the same authors but 2 years later, is [Dropout: a simple way to prevent neural networks from overfitting](http://www.shortscience.org/paper?bibtexKey=journals/jmlr/SrivastavaHKSS14). Dropout can be interpreted as training an ensemble of many networks which share weights. It was notably used by [ImageNet Classification with Deep Convolutional Neural Networks](http://www.shortscience.org/paper?bibtexKey=krizhevsky2012imagenet).
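A minimal NumPy sketch of the train/test behaviour described above (not the paper's code); $\alpha$ here is the drop probability:

```python
import numpy as np

def dropout(x, alpha, training, rng=np.random.default_rng(0)):
    """Dropout as described above: drop with probability alpha while
    training, scale by (1 - alpha) at test time."""
    if training:
        mask = rng.random(x.shape) >= alpha  # keep a unit with prob. 1 - alpha
        return x * mask
    return x * (1.0 - alpha)  # compensate for units never being dropped

x = np.ones(8)
print(dropout(x, alpha=0.5, training=True))   # roughly half the entries zeroed
print(dropout(x, alpha=0.5, training=False))  # all entries scaled to 0.5
```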
[link]
#### Introduction

* Recurrent Neural Networks (RNNs) are very powerful at modelling sequences but they are not good at learning long-term dependencies.
* The paper discusses the reasons behind this difficulty and some suggestions to mitigate it.
* [Link to the paper.](https://arxiv.org/abs/1212.0901)

#### Optimization Difficulty

* An RNN forms a deterministic state variable h<sup>t</sup> as a function of the input observation and the previous state.
* Learnable parameters decide what will be remembered about the past sequence.
* Local optimisation techniques like Stochastic Gradient Descent (SGD) are unlikely to find optimal values of the tunable parameters.
* When the computations performed by an RNN are unfolded through time, a deep neural network with shared weights is realised.
* The cost function of this deep network depends on the output of the hidden layers.
* Gradient descent updates can "explode" (become very large) or "vanish" (become very small).

#### Training Recurrent Networks

* **Clip gradients** - when the norm of the gradient vector ($g$) is above a threshold, the update is done in the direction $g/||g||$, rescaled to the threshold. This normalisation implements a simple form of second-order normalisation (the second-order derivative will also be large in regions of exploding gradient). A minimal sketch appears at the end of this summary.
* Use a **leaky integration** state-to-state map: $h_{t, i} = \alpha_{i}h_{t-1, i} + (1-\alpha_{i})F_{i}(h_{t-1}, x_{t})$. Different values of $\alpha_i$ allow a different amount of the previous state to "leak" through the unfolded layers to further in time. This only expands the time-scale of vanishing gradients and does not remove them entirely.
* Use **output probability models** like a Restricted Boltzmann Machine or NADE to capture higher-order dependencies between variables in the case of multivariate prediction.
* With **rectifier non-linearities**, the gradient on hidden units becomes sparse, and these sparse gradients help the hidden units specialise. The basic idea is that if the gradient is concentrated in fewer paths (in the unfolded computational graph), the vanishing gradient effect is limited.
* A **simplified Nesterov Momentum** rule is proposed that allows storing past velocities for a longer time while actually using these velocities more conservatively. The new formulation is also easier to implement.

#### Results

* SGD with these optimisations outperforms vanilla SGD.
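A minimal sketch of the gradient-clipping and leaky-integration tricks above (not the paper's code; the cell output `f_out` is assumed to be computed elsewhere):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g to the threshold when its norm exceeds it,
    keeping the direction g / ||g|| unchanged."""
    norm = np.linalg.norm(g)
    return g * (threshold / norm) if norm > threshold else g

def leaky_state(h_prev, f_out, alpha):
    """Leaky integration: h_t = alpha * h_{t-1} + (1 - alpha) * F(h_{t-1}, x_t),
    where f_out stands in for the RNN cell output F(h_{t-1}, x_t)."""
    return alpha * h_prev + (1.0 - alpha) * f_out

g = np.array([3.0, 4.0])                      # ||g|| = 5
print(clip_gradient(g, threshold=1.0))        # [0.6, 0.8]: norm clipped to 1
print(clip_gradient(g, threshold=10.0))       # unchanged

h_prev, f_out = np.array([0.2, -0.5]), np.array([1.0, 1.0])
print(leaky_state(h_prev, f_out, alpha=0.9))  # mostly the old state leaks through
```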
[link]
#### Introduction

* The GraphLab abstraction exposes an asynchronous, dynamic, graph-parallel computation model in the shared-memory setting.
* This paper extends the abstraction to the distributed setting.
* [Link](http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf) to the paper.

#### Characteristics of MLDM (Machine Learning and Data Mining)

* Graph Structured Computation
  * Sometimes computation requires modeling dependencies between data.
  * eg modeling dependencies between similar users for the recommendation use case.
* Asynchronous Iterative Computation
  * In many cases, asynchronous procedures outperform synchronous ones.
  * eg linear systems, belief propagation, stochastic optimization etc.
* Dynamic Computation
  * Iterative computation converges asymmetrically.
  * Convergence can be accelerated by dynamic scheduling.
  * eg do not update parameters that have already converged.
* Serializability
  * Ensuring that all parallel executions have an equivalent serial execution is desirable for both correctness and faster convergence.

### GraphLab Abstraction

#### Data Graph

* Stores program state as a directed graph.
* **G = (V, E, D)** where D is the user-defined data (model parameters, algorithm state, statistical data etc).
* The graph data **D** is mutable but the structure of the graph **(V, E)** is immutable.

#### Update Function

* A stateless procedure that modifies the data within the scope of a vertex and schedules the execution of the *update* function on other vertices (a minimal sketch appears after the Distributed Data Graph subsection below).
* **Scope** of a vertex ($S_v$) - the data corresponding to the vertex, its edges and its adjacent vertices.
* **update: $f(v, S_v) \rightarrow (S_v, T)$** where T is the set of vertices where the *update* function is scheduled to be invoked.
* Scheduling of computation is decoupled from the movement of data and no message passing is required between vertices.

#### Execution Model

* Input to the model is G and T, the initial set of vertices to be updated.
* During each step, a vertex is extracted from T, updated, and a set of vertices is added to T (for future computation).
* Vertices in T can be executed in any order with the only constraint that all vertices be eventually executed.

#### Sync Operation

* The sync operation runs in the background to maintain global aggregates concurrently.
* These global values are read by the *update* function and written by the sync operation.

#### Consistency Models

* Full consistency
  * Full read/write access in the *scope*.
  * Scopes of concurrently updating vertices cannot overlap.
* Edge consistency
  * Read/write access on the vertex and the adjacent edges but only read access to adjacent vertices.
  * Slightly overlapping scopes.
* Vertex consistency
  * Write access to the vertex and read access to adjacent edges and vertices.
  * All vertices can run the update function simultaneously.

### Distributed Data Graph

* Two-phase partitioning process for load balancing the graph on an arbitrary cluster size.
* In the first phase, partition the graph into k parts (k >> number of machines).
  * Each part, called an **atom**, is a file of graph-generating commands.
  * An atom also stores information about **ghosts** (the set of vertices and edges adjacent to the partition boundary).
  * An atom index file contains the connectivity structure and file locations for the k atoms as a meta-graph.
* In the second phase, this meta-graph is partitioned over the physical machines.
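As an illustration of the update-function abstraction above (not the paper's code), here is a minimal, single-machine PageRank-style sketch: each call reads its scope, writes the vertex's data, and returns the neighbours to reschedule. The toy graph, damping factor, and tolerance are assumptions.

```python
# Minimal single-machine sketch of GraphLab-style update functions,
# illustrating update: f(v, S_v) -> (S_v, T) with a PageRank-like computation.
from collections import deque

# Toy graph: vertex -> (rank, in-neighbours, out-neighbours)
graph = {
    "a": {"rank": 1.0, "in": ["c"], "out": ["b", "c"]},
    "b": {"rank": 1.0, "in": ["a"], "out": ["c"]},
    "c": {"rank": 1.0, "in": ["a", "b"], "out": ["a"]},
}

def pagerank_update(v, graph, damping=0.85, tol=1e-3):
    """Update vertex v from its scope; return vertices to reschedule."""
    old = graph[v]["rank"]
    incoming = sum(graph[u]["rank"] / len(graph[u]["out"]) for u in graph[v]["in"])
    graph[v]["rank"] = (1 - damping) + damping * incoming
    # Only reschedule neighbours if this vertex changed significantly
    # (the "dynamic computation" idea from the paper).
    return graph[v]["out"] if abs(graph[v]["rank"] - old) > tol else []

# Execution model: pull vertices from T, update, push returned vertices back.
T = deque(graph)
while T:
    v = T.popleft()
    for u in pagerank_update(v, graph):
        if u not in T:
            T.append(u)

print({v: round(d["rank"], 3) for v, d in graph.items()})
```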
### Distributed GraphLab Engines

#### Chromatic Engine

* A vertex coloring (no adjacent vertices have the same color) is constructed to serialize the parallel execution of dependent tasks (in our case, vertices in the graph).
* For the edge consistency model, execute all vertices of the same color before moving on to the next color, and run the sync operation between color steps.
* Changes to ghost vertices and edges are communicated asynchronously as they are made.
* Vertex consistency is trivial - assign the same color to all the vertices.
* For full consistency, construct a second-order vertex coloring (no vertex shares the same color as any of its distance-two neighbors).
* A minimal sketch of color-by-color execution appears at the end of this summary.

#### Distributed Locking Engine

* Associate reader-writer locks with each vertex.
* Each machine can update only its local vertices.
* Optimisations
  * The ghosting system uses caching to eliminate waits on remote, unchanged data.
  * Lock requests and synchronization are pipelined to hide network latency.
    * Each machine maintains a pipeline of vertices for which locks have been requested but not granted.
    * A vertex is executed once lock acquisition and data synchronization are complete.
    * Non-blocking reader-writer locks, which work through callback functions, are used.

### Fault Tolerance

* Distributed checkpointing via two modes:
  * Synchronous checkpointing
    * Suspends computation to save all data modified since the last checkpoint.
  * Asynchronous checkpointing based on the Chandy-Lamport snapshot algorithm.
    * The snapshot step becomes an *update* function in the GraphLab abstraction.
    * Better than synchronous checkpointing.

### System Design

* One instance of GraphLab runs on each machine.
* These processes are symmetric and communicate via RPC.
* The first process additionally acts as the master and computes the placement of atoms based on the atom index.
* Each process maintains a local scheduler (for its vertices) and a cache to access remote data.
* A distributed consensus algorithm decides when all the schedulers are empty.

### Observations

* The biggest strength of the paper is its extensive experiments.
* GraphLab benefits from the use of background asynchronous communication and pipelined locking, but its communication layer is not as efficient as MPI's communication layer.
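As promised above, a minimal sketch of the chromatic engine's color-by-color schedule (the greedy coloring and toy graph are assumptions, not the paper's code); vertices of one color share no edge, so they can be updated in parallel under edge consistency:

```python
# Sketch of chromatic-engine scheduling: greedily color the graph, then
# process vertices color by color; same-colored vertices share no edge,
# so they could be updated in parallel under edge consistency.
adjacency = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

def greedy_coloring(adj):
    """Assign each vertex the smallest color unused by its neighbours."""
    colors = {}
    for v in adj:
        taken = {colors[u] for u in adj[v] if u in colors}
        colors[v] = next(c for c in range(len(adj)) if c not in taken)
    return colors

colors = greedy_coloring(adjacency)
for color in sorted(set(colors.values())):
    batch = [v for v, c in colors.items() if c == color]
    # In the distributed engine this batch would run in parallel,
    # followed by a sync barrier before the next color.
    print(f"color {color}: update {batch} in parallel, then sync")
```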
[link]
The [paper](http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf) presents Twitter's logging infrastructure, how it evolved from application-specific logging to a unified logging infrastructure, and how session sequences are used as a common-case optimization for a large class of queries.

## Messaging Infrastructure

Twitter uses **Scribe** as its messaging infrastructure. A Scribe daemon runs on every production server and sends log data to a cluster of dedicated aggregators in the same data center. Scribe itself uses **Zookeeper** to discover the hostname of the aggregator. Each aggregator registers itself with Zookeeper, and the Scribe daemon consults Zookeeper to find a live aggregator to which it can send the data. Colocated with the aggregators is the staging Hadoop cluster, which merges the per-category streams from all the server daemons and writes the compressed results to HDFS. These logs are then moved into the main Hadoop data warehouse and deposited in per-category, per-hour directories (eg /logs/category/YYYY/MM/DD/HH). Within each directory, the messages are bundled in a small number of large files and are partially ordered by time.

Twitter uses **Thrift** as its data serialization framework, as it supports nested structures and was already being used elsewhere within Twitter. A system called **Elephant Bird** is used to generate Hadoop record readers and writers for arbitrary Thrift messages. Production jobs are written in **Pig (Latin)** and scheduled using **Oink**.

## Application-Specific Logging

Initially, all applications defined their own custom formats for logging messages. While this made it easy to develop application logging, it had many downsides as well:

* Inconsistent naming conventions: eg uid vs userId vs user_Id.
* Inconsistent semantics associated with each category name, causing a resource discovery problem.
* Inconsistent formats of log messages.

All these issues make it difficult to reconstruct user session activity.

## Client Events

This is an effort within Twitter to develop a unified logging framework that gets rid of all the issues discussed previously. A hierarchical, 6-level schema is imposed on all the events (as described in the table below).

| Component | Description | Example |
|-----------|------------------------------------|----------------------------------------------|
| client | client application | web, iPhone, android |
| page | page or functional grouping | home, profile, who_to_follow |
| section | tab or stream on a page | home, mentions, retweets, searches, suggestions |
| component | component object or objects | search_box, tweet |
| element | UI element within the component | button, avatar |
| action | actual user or application action | impression, click, hover |

**Table 1: Hierarchical decomposition of client event names.**

For example, the event `web:home:mentions:stream:avatar:profile_click` is logged whenever there is a profile-image click on the avatar of a tweet in the mentions timeline for a user on twitter.com (read from right to left).

The alternative design was a tree-based model for logging client events. That model allowed an arbitrarily deep event namespace with as fine-grained logging as required, but the client events model was chosen to make top-level aggregate queries easier. A client event is a Thrift structure that contains the components given in the table below.
| Field | Description |
|-----------------|---------------------------------|
| event initiator | {client, server} × {user, app} |
| event_name | event name |
| user_id | user id |
| session_id | session id |
| ip | user’s IP address |
| timestamp | timestamp |
| event_details | event details |

**Table 2: Definition of a client event.**

The logging infrastructure is unified in two senses:

* All log messages share a common format with clear semantics.
* All log messages are stored in a single place.

## Session Sequences

A session sequence is a sequence of symbols *S = {s<sub>0</sub>, s<sub>1</sub>, s<sub>2</sub>, ..., s<sub>n</sub>}* such that each symbol is drawn from a finite alphabet *Σ*. A bijective mapping is defined between Σ and the universe of event names. Each symbol in Σ is represented by a valid Unicode code point (frequent events are assigned shorter code points), so each session sequence becomes a valid Unicode string (a minimal sketch of this encoding appears at the end of this summary). Once all logs have been imported to the main database, a histogram of event counts is created and used to map event names to Unicode code points. The counts and samples of each event type are stored in a known location in HDFS.

Session sequences are reconstructed from the raw client event logs via a *group-by* on *user_id* and *session_id*. Session sequences are materialized because it is difficult to work with raw client event logs for the following reasons:

* A lot of brute-force scans.
* Large group-by operations needed to reconstruct user sessions.

#### Alternate Designs Considered

* Reorganize complete Thrift messages by reconstructing user sessions - this solves the second problem but not the first.
* Use a columnar storage format - this addresses the first issue, but it only reduces the time taken by the mappers and not the number of mappers itself.

The materialized session sequences are much smaller than the raw client event logs (around 50 times smaller) and address both issues.

## Client Event Catalog

To enhance the accessibility of the client event logs, an automatically generated event catalog is used along with a browsing interface that allows users to browse, search, and access sample entries for the various client events. (These sample entries are the same entries that were mentioned in the previous section.) The catalog is rebuilt every day and is always up to date.

## Applications

Client event logs and session sequences are used in the following applications:

* Summary Statistics - session sequences are used to compute various statistics about sessions.
* Event Counting - used to understand how users take advantage of a particular feature.
* Funnel Analytics - used to analyze user attention in a multi-step process, such as the signup process.
* User Modeling - used to identify "interesting" user behavior. N-gram models (from the NLP domain) can be extended to measure how important temporal signals are by modeling user behavior on the basis of the last n actions. The paper also mentions the possibility of extracting "activity collocations" based on the notion of collocations.

## Possible Extensions

Session sequences are limited in the sense that they capture only the event name and exclude other details. The solution adopted by Twitter is to use a generic indexing infrastructure that integrates with Hadoop at the level of InputFormats. The indexes reside with the data, making it easier to reindex the data. An alternative would have been to use **Trojan layouts**, which embed indexes in the HDFS block header, but this means that indexing would require the data to be rewritten.
Another possible extension would be to leverage more analogies from the field of Natural Language Processing. This would include the use of automatic grammar induction techniques to learn a hierarchical decomposition of user activity. Another area of exploration is leveraging advanced visualization techniques for exploring sessions and mapping interesting behavioral patterns into distinct visual patterns that can be easily recognized.
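To make the session-sequence encoding concrete, here is a minimal sketch (not Twitter's code; the event names and session data are made up) that assigns code points by event frequency and turns each session into a Unicode string:

```python
from collections import Counter

# Toy client event log: (user_id, session_id, event_name)
events = [
    ("u1", "s1", "web:home:mentions:stream:avatar:profile_click"),
    ("u1", "s1", "web:home:home:stream:tweet:impression"),
    ("u2", "s2", "web:home:home:stream:tweet:impression"),
    ("u2", "s2", "web:home:home:stream:tweet:click"),
    ("u2", "s2", "web:home:home:stream:tweet:impression"),
]

# Map event names to code points, most frequent first (mirroring the
# histogram-based assignment described above); offset past ASCII controls.
histogram = Counter(name for _, _, name in events)
codepoint = {name: chr(0x100 + i)
             for i, (name, _) in enumerate(histogram.most_common())}

# Reconstruct session sequences via a group-by on (user_id, session_id).
sessions = {}
for user, session, name in events:
    sessions.setdefault((user, session), []).append(codepoint[name])

session_sequences = {k: "".join(v) for k, v in sessions.items()}
print(codepoint)
print(session_sequences)  # each session is now a compact Unicode string
```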
[link]
Twitter’s Real-Time Related Query Suggestion Architecture. The paper tells the story behind the architecture that supports Twitter’s real-time related query suggestions, why the architecture had to be designed twice, and what lessons can be learned from the exercise. It does not talk much about the algorithms; rather, it talks about the different design decisions that led to the current architecture — the focus is not on how things were done but why they were done a certain way.

Twitter has an interesting use case — search assistance — which boils down to things like a user searching for “Obama” and getting results for related queries like “White House” as well. Spelling correction is also a part of search assistance. The problem is quite well researched from the data-volume perspective, but in Twitter’s context, velocity mattered as much as volume. The results had to adapt to rapidly evolving global conversations in real time — where real time loosely translates to a target latency of 10 minutes.

The real-time sense is important in Twitter’s context, where “relevance” has a temporal aspect as well. For example, after the Nepal earthquake in 2015, the query “Nepal” led to results related to “earthquake” as well. Then, when the new constitution was passed in Nepal, the query “Nepal” led to results related to “constitution”. The time frame in which suggestions have maximum impact is very narrow: a suggestion made too early would seem irrelevant, while a suggestion made too late would seem obvious and hence less impactful. These fast-moving signals have to be mixed with slow-moving ones for insightful results. Sometimes the query volume is very low, in which case a longer observation period is needed before suggestions can be made. Twitter noticed that 17% of the top 1000 query terms churn over on an hourly basis, which means they are no longer in the top 1000 an hour later. Similarly, around 13% of the top 1000 query terms are churned out every day. This suggested that fine-grained tracking of search terms was needed.

Twitter started with a basic idea: if two queries are seen in the same context, then they are related. This leaves a large open design space. For example, the context can be defined by a user’s search session, a tweet, or both. Measures like log likelihood, the chi-square test, etc. can be used to quantify how often two queries appear together. To account for the temporal effect, counts are decayed over time. Finally, Twitter has to combine these factors, and some more, into a ranking mechanism. This paper does not focus on which algorithms were chosen for these tasks; it focuses on how an end-to-end system was created.

#### Hadoop Solution

Twitter has a powerful petabyte-scale Hadoop-based analytics platform. Both real-time and batch processes write data to the Hadoop Distributed File System (HDFS). These include bulk exports from databases, application logs, and many other sources. Contents are serialized using either Protocol Buffers or Thrift, and LZO-compressed. A workflow manager called Oink schedules recurring jobs and handles dataflow dependencies between jobs. For example, if job B requires data generated by job A, A will be scheduled first. Twitter wanted to take advantage of this stack, and the first version was deployed as a Pig script that aggregated user search sessions to compute term and co-occurrence statistics and ranked related queries on top of the existing stack.
While the results were pretty good, the latency was too high: results were not available for several hours.

#### Bottleneck #1: Log Imports

Twitter uses Scribe to aggregate streaming log data in an efficient manner. These logs are rich with user interactions and are used by the search assistant. A Scribe daemon runs on each production server, where it collects and sends local log data (consisting of a category and a message) to a cluster of aggregators which are co-located with a staging Hadoop cluster. This cluster merges the per-category streams from the server daemons and writes the results to the HDFS of the staging cluster. These logs are then transformed and moved to the main Hadoop data warehouse in hourly chunks. The log messages are put in per-category, per-hour directories and are bundled in a small number of large files. Only now can the search assistant start its computations.

The hierarchical aggregation is required to “roll up” data into a few large files, as HDFS is not good at handling large numbers of small files. As a result, there is a huge delay from when the logs are generated to when they are available for processing. Twitter estimated that they could bring the latency down to tens of minutes by re-engineering their stack, though even that would be too high.

#### Bottleneck #2: Hadoop

Hadoop is not meant for latency-sensitive jobs. For example, a large job could take tens of seconds just to start up — irrespective of the amount of data crunched. Moreover, the Hadoop cluster was a shared resource across Twitter. Using a scheduler (in this case, FairScheduler) is not the ideal solution, as the focus is on a predictable end-to-end latency bound and not on resource allocation. Lastly, job completion time depends on stragglers. In some scenarios, a simple hash partitioning scheme created chunks of “work” with varying sizes, which led to widely varying running times for different map-reduce jobs. For scripts that chain together Hadoop jobs, the slowest task becomes the bottleneck. Just as with log imports, Twitter estimated the best-case scenario for computing query suggestions to be of the order of ten minutes.

Starting with the Hadoop stack had many advantages: a working prototype was built quickly and ad hoc analysis could be done easily. This also helped them understand the query churn and make some important observations about factors to use in the search assistant. For example, Twitter discovered that only two sources of context — search sessions and tweets — were good enough for an initial implementation. But due to the high latency, Twitter had to restrict this solution to the experimental stage.

#### Current Architecture

Firehose is the streaming API that provides access to all tweets in real time, and the frontend, called Blender, brokers all requests and provides a streaming API for queries — also called the query hose. These two streams are used by EarlyBird, the inverted indexing engine, and by the search assistant engine. Client logs are no longer needed, as Blender has all the search sessions. The Twitter search assistant is an in-memory processing engine comprising two decoupled components:

1. Frontend nodes — lightweight in-memory caches which periodically read fresh results from HDFS. They are implemented as a Thrift service and can be scaled out to handle increased query load.
2. Backend nodes — these nodes perform the real computations. The backend processing engine is replicated but not sharded.
Every five minutes, computed results are persisted to HDFS, and every minute, the frontend caches poll a known HDFS location for updated results. Request routing to the replicas is handled by a ServerSet, which provides client-side load-balanced access to a replicated service, coordinated by ZooKeeper for automatic resource discovery and robust failover.

Each backend instance is a multi-threaded application that consists of:

1. Stats collector: reads the firehose and the query hose.
2. In-memory stores: hold the most up-to-date statistics.
3. Rankers: periodically execute one or more ranking algorithms using raw features from the in-memory stores.

There are three separate in-memory stores to keep track of relevant statistics:

1. Sessions store: keeps track of (anonymized) user sessions observed in the query hose and, for each session, the history of the queries issued as a linked list. Sessions older than a threshold are discarded. Metadata is also tracked separately.
2. Query statistics store: retains up-to-date statistics, like the session count, about individual queries. These also include a weighted count based on a custom scoring function; this function captures things like the association being stronger between two consecutively typed queries than between two consecutively clicked hashtags. These weights are periodically decayed to reflect decreasing importance over time. The store also keeps additional metadata about the query, such as its language.
3. Query cooccurrence statistics store: holds data about pairs of co-occurring queries. Weighting and decaying are applied as in the query statistics store.

**Query flow** — As a user query flows through the query hose, query statistics are updated in the query statistics store, the query is added to the sessions store, and some old queries may be removed. For each previous query in the session, a query cooccurrence is formed with the new query, and the statistics in the query cooccurrence statistics store are updated as well.

**Tweet flow** — As a tweet flows through the firehose, its n-grams are checked to determine whether they are query-like or not. All matching n-grams are processed just like the queries above, except that the “session” is the tweet itself.

**Decay/prune cycles** — Periodically, all weights are decayed, and queries or co-occurrences with scores below predefined thresholds are removed to control the overall memory footprint of the service. Even user sessions with no recent activity are pruned. (A minimal sketch of this bookkeeping appears at the end of this summary.)

**Ranking cycles** — Rankers are triggered periodically to generate suggestions for each query based on the various accumulated statistics. Top results are then persisted to HDFS.

#### Scalability

1. Since there is no sharding, each instance of the backend processing engine must consume the entire firehose and query hose to keep up with the incoming data.
2. The memory footprint for retaining the various statistics, without any pruning, is very large. But if the footprint is reduced, say by pruning, the quality and coverage of the results may suffer. Another approach could be to store less session history and decay the weights more aggressively, though that may again affect the quality of the results.

#### Lessons Learnt

Twitter managed to solve the problem of fast-moving big data, but their solution is far from ideal. It works well, but only for the scenario it is fine-tuned for. What is needed is a unified data platform to process both big and fast-moving data with varying latency requirements.
Twitter’s current implementation is an in-memory engine which mostly uses tweets and search sessions to build the context. Rich parameters like clicks, impressions, etc. are left out for now to keep the latency in check. Twitter described it as “a patchwork of different processing paradigms”. Though incomplete, it is still an important step toward unifying big-data and fast-data processing systems. A lot of systems exist which solve the problem in pieces, eg message queues like Kafka for moving data in real time and Facebook’s ptail for running Hadoop operations in real time, but there is no end-to-end general data platform which can adapt itself to perform analytics on both short-term and long-term data and combine their results within the latency bounds of different contexts.
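As promised above, a minimal sketch of the decayed co-occurrence counting and decay/prune cycle (not Twitter's code; the decay factor, threshold, and toy queries are assumptions):

```python
import itertools

class CooccurrenceStore:
    """Toy decayed co-occurrence counts for query pairs."""

    def __init__(self, decay=0.9, prune_threshold=0.05):
        self.decay = decay
        self.prune_threshold = prune_threshold
        self.weights = {}  # (query_a, query_b) -> decayed weight

    def observe_session(self, queries):
        """Count every pair of queries seen in the same session."""
        for a, b in itertools.combinations(sorted(set(queries)), 2):
            self.weights[(a, b)] = self.weights.get((a, b), 0.0) + 1.0

    def decay_and_prune(self):
        """Periodic cycle: decay all weights, drop pairs below the threshold."""
        self.weights = {
            pair: w * self.decay
            for pair, w in self.weights.items()
            if w * self.decay >= self.prune_threshold
        }

    def suggest(self, query, k=3):
        """Rank co-occurring queries for a given query by decayed weight."""
        related = [(b if a == query else a, w)
                   for (a, b), w in self.weights.items() if query in (a, b)]
        return sorted(related, key=lambda x: -x[1])[:k]

store = CooccurrenceStore()
store.observe_session(["nepal", "earthquake"])
store.observe_session(["nepal", "earthquake", "red cross"])
store.decay_and_prune()
print(store.suggest("nepal"))  # "earthquake" ranks above "red cross"
```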
[link]
This paper is about estimating the complexity of distributed branch-and-bound over graphical models. The paper proposes a distributed/parallel branch-and-bound algorithm and evaluates its efficiency for load balancing. The ability to search multiple subproblems of the same problem in parallel would speed up the search significantly; however, balancing the parallel search load is important for efficiency. The paper is a complete case study of learning how to parallelize AND/OR graph Branch-and-Bound search. The authors use a set of features collected from the graph (static) or from the search problem (dynamic) and evaluate each of three learning settings — per problem instance, per problem class, and across problem classes — using linear and non-linear regression methods (a hypothetical sketch of such a regression appears below). They extensively evaluate all the possible combinations and show the pros and cons of each type of learning.

Pros: The paper is well written, covers all the possibilities for learning, and has a good discussion that educates the reader. The evaluations are extensive and conclusive. The literature review is complete.

Cons: From a practicality point of view, the bottleneck problem is not discussed: when would parallelizing AOGBB be bad? Also, the motivation for choosing the features they used is not discussed — are they a standard set of features? (The good point is that they showed that some of the features were more important than others based on their experiments.)
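As a hypothetical illustration of the learning setup described above (the features and data are made up, and this is not the paper's model), a linear regression can map static/dynamic features of a subproblem to a predicted search cost used for load balancing:

```python
import numpy as np

# Hypothetical features for subproblems rooted at different nodes of the
# AND/OR search graph: [depth, log upper bound, #variables below, avg domain size]
X = np.array([
    [2, 5.1, 30, 3.0],
    [3, 4.2, 22, 2.5],
    [5, 3.7, 12, 2.2],
    [6, 2.9,  8, 2.0],
])
# Observed log search costs (eg log of expanded nodes) for those subproblems.
y = np.log(np.array([1.2e5, 4.0e4, 6.0e3, 1.5e3]))

# Fit a linear model (with bias) by least squares: the per-instance learning case.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predicted_cost(features):
    """Predicted search cost of a subproblem, used to balance load."""
    return float(np.exp(np.append(features, 1.0) @ coef))

# Assign the apparently hardest subproblems first (a simple balancing heuristic).
new_subproblems = [[2, 4.8, 28, 2.8], [6, 3.0, 9, 2.1]]
print(sorted(new_subproblems, key=predicted_cost, reverse=True))
```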