Welcome to ShortScience.org!
[link]
This summary builds extensively on my prior summary of SIRENs, so if you haven't read that summary or the underlying paper yet, I'd recommend doing that first! At a high level, the idea of SIRENs is to use a neural network to learn a compressed, continuous representation of an image: the network encodes a mapping from (x, y) to the pixel value at that location, and the image can be reconstructed (or, potentially, expanded in size) by sampling from that function across the full range of the image. To do this effectively, they use sinusoidal activation functions, which let them match not just the output of the neural network f(x, y) to the true image, but also the first and second derivatives of the network to the first and second derivatives of the true image, which provides a more robust training signal.

NeRF builds on this idea, but instead of learning a continuous representation of an image (a mapping from 2D position to RGB value), it learns a continuous representation of a scene, mapping from position (specified with three coordinates) and viewing direction (specified with two angles) to the RGB color at a given point in a 3D grid (or "voxel", analogous to "pixel"), as well as the *density* or opacity of that point. Why is this interesting? Because if you have a NeRF that has learned a good underlying function of a particular 3D scene, you can theoretically take samples of that scene from arbitrary angles, even angles not seen during training. It essentially functions as a usable 3D model of a scene, but one that, because it's stored in the weights of a neural network and specified as a continuous function, is far smaller than actually storing all the values of all the voxels in a 3D scene (the authors give an example of 5MB vs 15GB for a NeRF vs a full 3D model).
To get some intuition for this, consider that if you wanted to store the curve represented by a particular third-degree polynomial between 0 and 10,000, it would be much more space-efficient to simply store the 4 coefficients of that polynomial and sample from it at your desired granularity at will, rather than storing many empirically sampled points from along the curve.

https://i.imgur.com/0c33YqV.png

How is a NeRF model learned?

- The (x, y, z) position of each point is encoded as a combination of sine-wave, Fourier-style curves of increasingly higher frequency. This is similar to the positional encoding used by transformers. In practical terms, this means a location in space will be represented as a vector calculated as [some point on a low-frequency curve, some point on a slightly higher frequency curve, ..., some point on the highest-frequency curve]. This doesn't contain any more *information* than the (x, y, z) representation, but it does empirically seem to help training when you separate the frequencies like this.
- You take a dataset of images for which the viewing direction is known, and simulate sending a ray through the scene in that direction, hitting some line (or possibly tube?) of voxels on the way. You calculate the perceived color at that point, which is an integral over the color information and density/opacity returned by your model for each point along the ray. Intuitively, if you have a high opacity weight early on, that part of the object blocks any voxels further along the ray, whereas if the opacity weight is lower, more of the voxels behind will contribute to the overall perceived color. You then compare these predicted perceived colors to the actual colors captured by the 2D image, and train on the prediction error.
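To make the positional encoding concrete, here's a minimal numpy sketch (my own illustration; `num_freqs` and the exact frequency scaling are assumptions, not the paper's exact hyperparameters):

```python
import numpy as np

def positional_encoding(p, num_freqs=4):
    """Map each coordinate of p to sin/cos features at
    exponentially increasing frequencies (2^0 ... 2^(L-1))."""
    feats = []
    for k in range(num_freqs):
        freq = 2.0 ** k * np.pi
        feats.append(np.sin(freq * p))
        feats.append(np.cos(freq * p))
    return np.concatenate(feats)

# A 3D point becomes a 3 * 2 * num_freqs = 24 dimensional vector.
encoded = positional_encoding(np.array([0.1, 0.5, -0.3]))
print(encoded.shape)  # (24,)
```

The encoded vector carries no extra information beyond (x, y, z), but splitting it across frequencies gives the downstream MLP easier access to high-frequency detail.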
- (One note on sampling: the paper proposes a hierarchical sampling scheme to help with sampling efficiently along the ray, first taking a coarse sample, and then adding additional samples in regions of high predicted density.)
- At the end of training, you have a network that hopefully captures the information from *that particular scene*. A notable downside of this approach is that it's quite slow for any use case that requires training on many scenes, since each individual scene network takes about 1-2 days of GPU time to train.
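The perceived-color integral described above is approximated in practice by alpha-compositing discrete samples along the ray. A simplified numpy sketch of that compositing step (variable names and sample spacing are mine):

```python
import numpy as np

def composite_ray(colors, densities, delta):
    """Approximate the per-ray color integral by alpha compositing.
    colors: (N, 3) RGB at each sample; densities: (N,); delta: sample spacing."""
    alpha = 1.0 - np.exp(-densities * delta)  # chance the ray "stops" at each sample
    # Transmittance: probability the ray reaches sample i without being blocked.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = trans * alpha  # contribution of each sample to the final color
    return (weights[:, None] * colors).sum(axis=0)

# An opaque red sample early on blocks a blue sample behind it.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
densities = np.array([50.0, 50.0])
out = composite_ray(colors, densities, delta=1.0)
print(out)  # essentially pure red
```

This matches the intuition in the bullet above: a high density early in the ray drives transmittance to near zero for everything behind it.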
[link]
A never-ending learning problem has two components: a set of learning tasks and a set of coupling constraints. A learning task in this paradigm is the same as in any other paradigm: improving the system's performance, as measured by some metric P, over a task T given some experience E. Each coupling constraint can be thought of as a function, defined over two or more learning tasks, which specifies the degree of satisfaction of the constraint. Given such a learning problem, a never-ending learning agent A produces a sequence of solutions to the individual learning tasks such that, over time, the quality of the individual learning functions and the degree to which each coupling constraint is satisfied both increase. To take a simplified example, consider the problem of Google classifying emails in our inbox. Say we have two learning tasks going on: one that learns whether to put a mail in spam or not, and another whether to mark a mail important or not. An obvious constraint here would be that any mail which is marked as spam must not be marked important (though not the other way round). Think of coupling constraints as the set of rules that you do not want to see violated. What makes them important and powerful is that the learning agent can improve its learning of one function by successfully learning other functions. The paper also presents a case study of a program called the Never-Ending Language Learner (NELL). NELL implements some of the features of this new paradigm and has been in action 24x7 since January 2010. Every day it extracts beliefs from the web and updates its knowledge base by removing incorrect beliefs and adding the ones which, it believes, are correct. NELL started with an initial ontology of categories, some labelled examples, and relations, and has accumulated over 80 million interconnected beliefs in its Knowledge Base (KB) so far.
It is performing over 2500 learning tasks, which include:

- Category classification: e.g. Apple is a food and a company.
- Relationship classification: e.g. (Apple, Company, "IsA").
- Entity resolution: e.g. "NYC" and "Big Apple" can refer to the same entity.
- Inferring rules among belief triples.

For all these tasks, it uses a variety of classification methods. The coupling constraints include:

- Multi-view co-training coupling: different classifiers should give the same output on the same input.
- Subset/superset coupling: if a category A is a subset of category B, then each phrase belonging to A must belong to B.
- Multi-label mutual exclusion coupling: when two categories are declared to be mutually exclusive, no noun phrase should lie in both categories.
- Coupling relations to their argument types: in constraints like LivesIn(A, B), A should be a person and B should be a place.
- Horn clause coupling.

Each reading and inference module (based on the above classifications and constraints) sends proposed updates to the KB to a Knowledge Integrator (KI), which makes the final decision on all these updates. Once the updates are made, all the modules are re-trained on the updated KB. Due to the sheer size of the KB, it is not possible to consider each and every candidate belief, so the KI considers only high-confidence candidate beliefs and re-assesses confidence using a limited subgraph of consistency constraints and beliefs. This means many iterations are required for the effect of constraints to propagate throughout the graph. The paper mentions a more effective algorithm to achieve this propagation. To add new learning tasks and extend the ontology, NELL extracts sentences mentioning different categories, builds a context-by-context co-occurrence matrix, and then clusters the related contexts together. Each cluster then corresponds to a candidate relation. These candidates go through a trained classifier and manual filtering before they are added to the ontology.
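To make two of those coupling constraints concrete, here is a toy sketch of checking subset and mutual-exclusion couplings over a handful of beliefs (the data, category names, and function names are all hypothetical, not NELL's real interface):

```python
# Toy beliefs: noun phrase -> set of predicted categories.
beliefs = {
    "Apple": {"company", "food"},
    "NYC": {"city", "location"},
}

def violates_subset(cats, sub="city", sup="location"):
    # Subset/superset coupling: every "city" must also be a "location".
    return sub in cats and sup not in cats

def violates_mutex(cats, a="company", b="city"):
    # Mutual exclusion coupling: "company" and "city" are declared exclusive.
    return a in cats and b in cats

violations = {np_: violates_subset(c) or violates_mutex(c)
              for np_, c in beliefs.items()}
print(violations)  # {'Apple': False, 'NYC': False}
```

A real system would use such checks as training signal: a classifier that labels "NYC" a city but not a location gets penalized even without a labeled example for "location".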
An empirical analysis of the system's performance shows that:

- The system is improving its reading competence over time, as was expected and desired.
- Its KB is increasing, but at a decreasing rate, as it gets difficult to extract new knowledge from less frequently mentioned beliefs.

The paper mentions some places for improvement:

- Adding a self-reflection capability to NELL so that it can detect where it is doing well and where it is doing poorly, and allocate its resources more intelligently. This feature is a part of the paradigm itself.
- Broadening the scope of NELL by using other data sources as well, e.g. the Never-Ending Image Learner (NEIL) uses image data.
- Merging other ontologies like DBpedia to boost its own ontology.
- Using "micro-reading" methods to help NELL perform deep semantic analysis.

The never-ending learning paradigm raises two fundamental questions:

1. Under what conditions is an increasingly consistent learning agent also an increasingly correct agent? This is important because an autonomous agent can perceive consistency but not correctness.
2. Under what conditions is convergence guaranteed, in principle and in practice? The architecture may not have sufficient self-modification operations to ensure never-ending learning, or these operations may not be practical due to limits of computation and/or training experience.

### My thoughts

What makes never-ending learning different from, and in some cases more powerful than, conventional paradigms are the concepts of coupling constraints and never-ending learning itself. As a learning model, it seems closer to the human learning model. We try to learn and take actions by relating different scenarios. Our actions and decisions are constrained by both variables and our other actions and decisions. Constraint coupling seems to capture this requirement. Then there are scenarios where conventional machine learning approaches would fail. Learning is not always about throwing in more data or introducing new variables.
If the domain is evolving rapidly, there will always be newer data coming in and newer variables that are not accounted for. These are the kinds of scenarios where this paradigm can fill the gap. Another aspect is that all this work builds on top of existing work. All the algorithms used for the various classifications are existing ones. The paradigm does not suggest a new algorithm for any of the individual learning problems. Instead, it provides a mechanism where success in learning one function helps in learning others. The paper strongly puts forward the case for this new paradigm. There is a case study to evaluate the model practically: they start out with a small labeled dataset, the reported metrics behave as desired, and the web is the best choice for applying a never-ending learning algorithm, as the web is a never-ending, growing domain. One criticism is that the paper does not mention how resource-intensive NELL is, beyond describing the dataset. Even time-based metrics are missing. Not that I expect such a system to be frugal, but I would nonetheless be interested in knowing about their computing infrastructure and time-based metrics. There is still a lot to be explored about the effectiveness of this model. Two prime questions are already listed above. Beyond that, the model needs firm mathematical footing. It also needs to be put to the test in other domains. NEIL is one extension of this. I would be interested to see how this approach plays out in other domains and what kinds of ontologies are obtained, especially in the case of social networks, which are both data-rich and constantly evolving.
[link]
The paper introduces a new framework called Spark which focuses on use cases where MapReduce falls short. While a lot of applications fit MapReduce's acyclic data flow model, there are use cases requiring iterative jobs where MapReduce is not very efficient. Many machine learning algorithms fall into this category. Secondly, with MapReduce each query incurs a significant overhead because it is effectively a different job each time, so MapReduce is not the ideal solution for interactive analysis. This is the space Spark intends to fill. Spark is implemented in Scala and now supports APIs in Java, Python, and R.

#### Programming Model

The model supports RDDs, parallel operations, and two types of shared variables. A driver program implements the high-level control flow and launches different operations in parallel. Resilient Distributed Datasets (RDDs) are read-only collections of objects, partitioned across a set of machines in a cluster. A handle to an RDD contains enough information to compute the RDD from data in case of partition failure. RDDs can be constructed:

1. From a file in a Hadoop-supported file system.
2. By "parallelizing" a Scala collection.
3. By transforming an existing RDD using operations like flatMap, map, and filter.
4. By changing the persistence of an existing RDD.

flatMap, map, and filter are standard operations as supported by various functional programming languages. For example, map takes as input a list of items and returns a new list of items after applying a function to each item of the original list. RDDs are lazy and ephemeral: they are constructed on demand and discarded after use. The persistence can be changed to cache, which means they are still lazy but are kept in memory (or on disk if they cannot fit in memory), or to save, where they are saved to disk only. Some supported parallel operations (at the time the paper was written) include:

1. reduce: combines dataset elements using an associative function to produce a result at the driver program.
2. collect: sends all elements of the dataset to the driver program.
3. foreach: passes each element through a user-provided function.

Since the paper was authored, Spark has come a long way and supports many more transformations (sample, union, distinct, groupByKey, reduceByKey, sortByKey, join, cartesian, etc.), parallel operations (shuffle, count, first, take, takeSample, etc.), and persistence options (memory only, memory and disk, disk only, memory only serialized, etc.). Spark supports two kinds of shared variables:

- Broadcast variables: read-only variables that are distributed to worker nodes once and can be read multiple times. A use case would be training data, which can be sent to all the worker nodes once and used for learning different models, instead of sending the same data with each model.
- Accumulators: these variables are also shared with workers, the difference being that the driver program can only read them, and workers can perform only associative operations on them. A use case could be counting the total number of entries in a dataset: each worker fills up its count accumulator and sends it to the driver, which adds up all the received values.

#### Implementation

The core of Spark is the implementation of RDDs. Suppose we start by reading a file, then filter the lines to get lines with the word "ERROR" in them, then cache the results, and then count the number of such lines using the standard map-reduce trick. RDDs will be formed corresponding to each of these steps, and these RDDs will be stored as a linked list to capture the lineage of each RDD.

https://cdn-images-1.medium.com/max/800/1*6I7aiD2bPrsw32U76Q5q2g.png

Lineage chain for distributed dataset objects. Each RDD contains a pointer to its parent and information about how it was transformed.
This lineage information is sufficient to reconstruct any lost partition, and checkpointing of any kind is not required. There is no overhead if no node fails, and even if some nodes fail, only select RDDs need to be reconstructed. Internally, an RDD object implements a simple interface consisting of three operations:

1. getPartitions, which returns a list of partition IDs.
2. getIterator(partition), which iterates over a partition.
3. getPreferredLocations(partition), which is used for task scheduling to achieve data locality.

Spark is similar to MapReduce in that it sends computation to data instead of the other way round. This requires shipping closures to workers: closures both to define and to process a distributed dataset. This is easy given that Scala uses Java serialization. However, unlike MapReduce, operations are performed on RDDs that can persist across operations. Shared variables are implemented using classes with custom serialization formats. When a broadcast variable b is created with a value v, v is saved to a file in the shared file system, and the serialized form of b is a path to this file. When b's value is queried, Spark checks if v is in the local cache; if not, it is read from the file system. Each accumulator is given a unique ID upon creation, and its serialized form contains its ID and its "zero" value. On the workers, a separate copy of the accumulator is created for each thread and reset to the "zero" value. Once a task finishes, the updated value is sent to the driver program.

#### Future Work

The paper describes how an early-stage implementation performs on Logistic Regression, Alternating Least Squares, and interactive queries. The results largely outperform MapReduce, mainly because of caching the results of previous computations. This makes Spark a good alternative for use cases where the same data is read into memory again and again (iterative jobs fit this category). Spark has come a long way since the paper was written.
It now supports libraries for handling SQL-like queries (Spark SQL), streaming data (Spark Streaming), graphs (GraphX), and machine learning (MLlib), along with more transformations and parallel operations. I came across Spark while working at Adobe Analytics and have been reading about it to learn more. The cool thing about Spark is that it supports interactive analysis and has APIs in Python, R, and Java, making it easy to adopt. While I have not done much work with Spark yet, I am looking forward to building something on top of it.
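The lazy, lineage-based recomputation at the heart of RDDs can be illustrated with a toy Python class (a sketch of the idea only, not Spark's actual implementation or API):

```python
class ToyRDD:
    """Each node remembers its parent and the transformation that produced it,
    so a lost result can be recomputed by replaying the lineage chain."""
    def __init__(self, parent=None, transform=None, source=None):
        self.parent, self.transform, self.source = parent, transform, source

    def map(self, f):
        # No work happens here: we only record the transformation.
        return ToyRDD(parent=self, transform=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda data: [x for x in data if pred(x)])

    def collect(self):
        # Walk the lineage back to the source, then replay the transformations.
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.collect())

lines = ToyRDD(source=["ERROR disk", "ok", "ERROR net"])
errors = lines.filter(lambda l: "ERROR" in l).map(str.upper)
print(errors.collect())  # computed only now, on demand
```

Nothing is evaluated until `collect` is called, and any intermediate result can be rebuilt from its parent, which is exactly why Spark needs no checkpointing in the failure-free case.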
[link]
This is a nice paper on video captioning. The authors exploit the LSTM's ability to learn long-term dependencies to model the problem of translating a video sequence into a language sequence. The new thing in this paper is that they have two LSTM layers for modeling both the frames in videos and the words in sentences.
[link]
1. U-Net learns segmentation from images in an end-to-end fashion.
2. The challenges they address are:
* Very few annotated images (approx. 30 per application).
* Touching objects of the same class.
# How:
* The input image is fed into the network; the data is propagated through the network along all possible paths, and at the end a segmentation map comes out.
* In the U-Net architecture, each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.
https://i.imgur.com/Usxmv6r.png
* The contracting path repeatedly applies two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU), and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step, they double the number of feature channels.
* The contracting path (left side, top to bottom) increases the number of feature channels while reducing spatial resolution; the expansive path (right side, bottom to top) consists of a sequence of up-convolutions and concatenations with the corresponding high-resolution features from the contracting path.
* The network does not have any fully connected layers and only uses the valid part of each convolution, i.e., the segmentation map only contains the pixels for which the full context is available in the input image.
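Because the convolutions are unpadded, each 3x3 convolution shrinks the feature map by 2 pixels per dimension, and each pooling halves it. A small helper makes this size bookkeeping explicit (assuming the 572x572 input tile size shown in the paper's architecture figure):

```python
def down_step(size):
    """Two valid 3x3 convs (each subtracts 2 pixels), then a 2x2 max pool (halves)."""
    return (size - 4) // 2

sizes = []
size = 572  # input tile size from the paper's architecture figure
for _ in range(4):  # four downsampling steps in the contracting path
    size = down_step(size)
    sizes.append(size)
print(sizes)  # [284, 140, 68, 32]
```

These post-pooling sizes match the feature map sizes annotated in the figure, and explain why the output segmentation map is smaller than the input tile.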
## Challenges:
1. Overlap-tile strategy for seamless segmentation of arbitrarily large images:
* To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image.
* In the figure, segmentation of the yellow area uses input data from the blue area; the missing raw data is extrapolated by mirroring.
https://i.imgur.com/NUbBRUG.png
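This mirroring corresponds to reflect-padding the input tile; a quick numpy sketch (tile and pad sizes here are illustrative, not the paper's):

```python
import numpy as np

tile = np.arange(16, dtype=float).reshape(4, 4)
# Extrapolate the missing border context by mirroring the image content.
padded = np.pad(tile, pad_width=2, mode="reflect")
print(padded.shape)  # (8, 8)
# With "reflect", the new border rows are mirror images of interior rows:
# e.g. the top padded row equals row 2 of the original tile.
```

The network then predicts a segmentation only for the central (yellow) region, so stitching adjacent tiles produces no seams.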
2. Augment training data using deformation:
* They use excessive data augmentation by applying elastic deformations to the available training images.
* This allows the network to learn invariance to such deformations without needing to see these transformations in the annotated image corpus.
* Deformation is the most common variation in tissue, and realistic deformations can be simulated efficiently.
https://i.imgur.com/CyC8Hmd.png
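A crude stand-in for such deformations can be sketched in numpy: a smooth random displacement field applied with nearest-neighbour resampling (the paper, as I recall, uses random displacements on a coarse grid with bicubic interpolation; this toy version only captures the idea, and all names are mine):

```python
import numpy as np

def elastic_deform(img, alpha=2.0, seed=0):
    """Shift each pixel by a smooth random displacement field, then
    resample with nearest-neighbour lookup."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    # Coarse random displacements, upsampled by repetition so that
    # neighbouring pixels move together (smoothness).
    coarse = rng.uniform(-alpha, alpha, size=(2, h // 4, w // 4))
    dy = np.repeat(np.repeat(coarse[0], 4, 0), 4, 1)
    dx = np.repeat(np.repeat(coarse[1], 4, 0), 4, 1)
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + dy), 0, h - 1).astype(int)
    src_x = np.clip(np.round(xs + dx), 0, w - 1).astype(int)
    return img[src_y, src_x]

img = np.arange(64, dtype=float).reshape(8, 8)
warped = elastic_deform(img)
print(warped.shape)  # same shape, locally distorted content
```

Applying the same displacement field to the image and its mask yields a new, plausible training pair for free.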
3. Segmentation of touching objects of the same class:
* They propose the use of a weighted loss, where the separating background labels between touching cells obtain a large weight in the loss function.
* To ensure separation of touching objects, background is inserted between touching objects in the training segmentation masks, and the loss weight is computed individually for each pixel.
https://i.imgur.com/ds7psDB.png
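If I recall the paper's formulation correctly, the weight map adds to class-balancing weights a border term of the form w0 * exp(-(d1 + d2)^2 / (2 * sigma^2)), where d1 and d2 are the distances to the nearest and second-nearest cell. A numpy sketch of just that border term (the defaults below are the paper's reported values as best I remember them):

```python
import numpy as np

def border_weight(d1, d2, w0=10.0, sigma=5.0):
    """Extra loss weight for pixels squeezed between two cells:
    largest when both distances to the nearest cells are small."""
    return w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))

# A pixel right between two touching cells is weighted far more heavily
# than a pixel far from any cell border.
near = border_weight(1.0, 1.0)
far = border_weight(20.0, 20.0)
print(near, far)
```

This is what forces the network to learn the thin background separations between touching cells rather than merging them.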
4. Segmentation of neuronal structures in electron microscopy (EM):
* This has been an ongoing challenge since ISBI 2012; the dataset contains structures with low contrast, fuzzy membranes, and other cell components.
* The training data is a set of 30 images (512x512 pixels) from serial section transmission electron microscopy of the Drosophila first instar larva ventral nerve cord (VNC). Each image comes with a corresponding fully annotated ground truth segmentation map for cells (white) and membranes (black).
* An evaluation can be obtained by sending the predicted membrane probability map to the organizers. The evaluation is done by thresholding the map at 10 different levels and computation of the warping error, the Rand error and the pixel error.
### Results:
* The U-Net (averaged over 7 rotated versions of the input data) achieves, without any further pre- or post-processing, a warping error of 0.0003529, a Rand error of 0.0382, and a pixel error of 0.0611.
https://i.imgur.com/6BDrByI.png
* In the ISBI cell tracking challenge 2015, one of the datasets, phase contrast microscopy of cells, has strong shape variations, weak outer borders, strong irrelevant inner borders, and cytoplasm with the same structure as the background.
https://i.imgur.com/vDflYEH.png
* The first dataset, PHC-U373, contains glioblastoma-astrocytoma U373 cells on a polyacrylamide substrate recorded by phase contrast microscopy. It contains 35 partially annotated training images. Here they achieve an average IOU ("intersection over union") of 92%, significantly better than the second-best algorithm at 83%.
https://i.imgur.com/of4rAYP.png
* The second dataset, DIC-HeLa, contains HeLa cells on flat glass recorded by differential interference contrast (DIC) microscopy. It contains 20 partially annotated training images. Here they achieve an average IOU of 77.5%, significantly better than the second-best algorithm at 46%.
https://i.imgur.com/Y9wY6Lc.png