[link]
#### Goal

+ Predict 128 diagnoses for pediatric intensive care patients.

#### Dataset:

+ Children's Hospital LA.
+ An episode is a multivariate time series that describes the stay of one patient in the intensive care unit.

| Dataset property | Value |
|---|---|
| Number of episodes | 10,401 |
| Duration of episodes | From 12 hours to several months |
| Time series variables | Systolic blood pressure, diastolic blood pressure, peripheral capillary refill rate, end-tidal CO2, fraction of inspired O2, Glasgow coma scale, blood glucose, heart rate, pH, respiratory rate, blood O2 saturation, body temperature, urine output |

+ Resampling and missing values:
  + The irregularly sampled time series are resampled to an hourly rate, taking the mean of the measurements within each hour window.
  + Forward- and back-filling are used to fill gaps created by the resampling.
  + When a variable's time series is missing entirely: imputation with a clinically *normal* value defined by domain experts.
  + This paper is followed by [Modeling Missing Data in Clinical Time Series with RNNs](http://www.shortscience.org/paper?bibtexKey=journals/corr/LiptonKW16) from the same research group.
+ Labels:
  + Each episode is associated with 0 or more diagnoses (in-house taxonomy, ICD-9 based).
  + The dataset contains 429 diagnoses. The paper focuses on the 128 most frequent ones, i.e. those that appear 50 or more times in the dataset.

#### Architecture:

+ LSTM with target replication (architecture figure not reproduced here; see the loss sketch after this summary).
+ Loss function:
  + For the model with target replication, an output y is generated at every sequence step. The loss is then a convex combination of the final-step loss (log loss in the case of this paper) and the average of the losses over all steps, where T is the number of sequence steps and alpha is a hyperparameter:

$$\frac{\alpha}{T}\sum_{t=1}^{T}\text{loss}\big(\hat{y}^{(t)}, y\big) + (1-\alpha)\,\text{loss}\big(\hat{y}^{(T)}, y\big)$$

#### Experiments and Results:

**Methodology**:

+ Dataset split: 80% training, 10% validation, 10% test.
+ LSTM trained for 100 epochs via stochastic gradient descent with momentum.
+ L2 regularization: 1e-6, chosen on the validation set.
+ LSTM: 2 hidden layers with 64 or 128 cells (and 50% dropout).
+ Multiple combinations of target replication and auxiliary target variables (trained using the other 301 diagnoses and other clinical information as targets; inference is made only for the 128 major diagnoses).
+ Baselines for comparison:
  + Logistic regression, L2 regularized.
  + MLP with 3 hidden layers, ReLU, 50% dropout.
  + Baselines are tested both on the raw time series and on a feature-engineered version built by domain experts.

*Metrics*:

+ Micro AUC, micro F1: computed by summing the TPs, FPs, TNs and FNs over the entire dataset and all classes.
+ Macro AUC, macro F1: arithmetic mean of the per-class AUCs and F1 scores.
+ Precision at 10: fraction of correct diagnoses among the top 10 predictions of the model. The upper bound for precision at 10 is 0.2281, since the test set contains on average 2.281 diagnoses per patient.

*Results*: (results table not reproduced here)

*Results for selected diagnoses*: (figure not reproduced here)

#### Discussion:

+ Auxiliary outputs improve performance at the expense of increased training time: some of the remaining 301 labels are so imbalanced that the model can spend an entire epoch just learning that one of those target variables can take values other than 0.
+ Real-time predictions: the authors expect that the proposed approach could eventually be used to produce continuously updated, real-time alerts and diagnoses.
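The target-replication loss described above is simple enough to write down; below is a minimal PyTorch sketch of how it could be computed for this multi-label diagnosis setting. The function name, tensor shapes, use of `BCEWithLogitsLoss`, and the default `alpha` are my own illustrative assumptions rather than details from the paper's code.

```python
import torch
import torch.nn as nn

def target_replication_loss(step_logits, final_logits, targets, alpha=0.5):
    """Convex combination of the averaged per-step losses and the final-step loss.

    step_logits:  (batch, T, n_labels) logits emitted at every time step
    final_logits: (batch, n_labels)    logits at the final step T
    targets:      (batch, n_labels)    multi-label 0/1 diagnosis vector (float)
    alpha:        weight on the averaged per-step losses (hyperparameter)
    """
    bce = nn.BCEWithLogitsLoss()          # log loss, averaged over all elements
    T = step_logits.shape[1]
    # replicate the episode-level target at every step, then average the per-step losses
    step_loss = bce(step_logits, targets.unsqueeze(1).expand(-1, T, -1))
    final_loss = bce(final_logits, targets)
    return alpha * step_loss + (1.0 - alpha) * final_loss
```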
[link]
This paper describes an architecture designed for generating class predictions based on a set of features in situations where you may only have a few examples per class, or even where you see entirely new classes at test time. Some prior work has approached this problem in ridiculously complex fashion, up to and including training a network to predict the gradient outputs of a meta-network that it thinks would best optimize loss, given a new class. The method of Prototypical Networks prides itself on being much simpler, and more intuitive, so I hope I'll be able to convey that in this explanation.

In order to think about this problem properly, it makes sense to take a few steps back and think about some fundamental assumptions that underlie machine learning.

https://i.imgur.com/Q45w0QT.png

One very basic one is that you need some notion of similarity between observations in your training set and potential new observations in your test set in order to properly generalize. To put it very simplistically, if a test example is very similar to examples of class A that we saw in training, we might predict it to be of class A at testing. But what does it *mean* for two observations to be similar to one another?

If you're using a method like K Nearest Neighbors, you calculate a point's class identity based on the closest training-set observations to it in Euclidean space, and you assume that nearness in that space corresponds to likelihood of two data points having come from the same class. This is useful for the use case of having new classes show up after training, since, well, there isn't really a training period: the strategy for KNN is just carrying your whole training set around, and, whenever a new test point comes along, calculating its closest neighbors among those training-set points. If you see a new class in the wild, all you need to do is add the examples of that class to your group of training-set points, and then, after a few examples, if your assumptions hold, you'll be able to predict that class by (hopefully) finding those two or three points as neighbors.

But what if some dimensions of your feature space matter much more than others for differentiating between classes? In a simplistic example, you could have twenty features, but, unbeknownst to you, only one is actually useful for separating out your classes, and the other 19 are random. If you use the naive KNN assumption, you wouldn't expect to perform well here, because the distances in those 19 meaningless directions will spread out your points, due to randomness, more than the meaningful dimension spreads them out due to belonging to different classes.

And what if you want to be able to learn non-linear relationships between your features, which the composability of multi-layer neural networks lends itself well to? In cases like those, the features you were handed may be a woefully suboptimal metric space in which to calculate a kind of similarity that corresponds to differences in class identity, so you'll just have to strike out for the territories and create a metric space for yourself. That is, at a very high level, what this paper seeks to do: learn a transformation between input features and some vector space, such that distances in that vector space correspond as well as possible to probabilities of belonging to a given output class.
You may notice me using "vector space" and "embedding" more or less interchangeably; they refer to the same idea: the result of that learned transformation, which represents your input observations as dense vectors in some p-dimensional space, where p is a chosen hyperparameter.

What are the concrete learning steps this architecture goes through?

1. During each training episode, sample a subset of classes, and then divide the examples of those classes into training (support) examples and query examples.
2. Using a set of weights that are being learned by the network, map the input features of each training example into a vector space.
3. Once all training examples are mapped into the space, calculate a "mean vector" for class A by averaging all of the embeddings of training examples that belong to class A. This is the "prototype" for class A, and once we have it, we can forget the values of the embedded examples that were averaged to create it. This is a nice update on the KNN approach, since the number of parameters we need to carry around at evaluation time is only (num-dimensions) * (num-classes), rather than (num-dimensions) * (num-training-examples).
4. Then, for each query example, map it into the embedding space, and use a distance metric in that space to create a softmax over possible classes. (You can just think of a softmax as a network's predicted probabilities: a set of floats that add up to 1.)
5. Then, you can calculate the (cross-entropy) error between the true output and that softmax prediction vector in the same way as you would for any classification network.
6. Add up the prediction loss for all the query examples, and then backpropagate through the network to update your weights.

The overall effect of this process is to incentivize your network to learn, not necessarily a good prediction function, but a good metric space. The idea is that, if the metric space is good enough, and the classes are conceptually similar to each other (i.e. car vs chair, as opposed to car vs the-meaning-of-life), a space that does well at causing similar observed classes to be close to one another will do the same for classes not seen during training.

I admit to not being sufficiently familiar with the datasets used for testing to have a sense of how well this method compares to more fully supervised classification schemes; if anyone does, definitely let me know! But the paper claims to get state-of-the-art results compared to other approaches in this domain of few-shot learning (matching networks, and the aforementioned meta-learning).

One interesting note is that the authors found that squared Euclidean distance, when applied within the embedded space, worked meaningfully better than cosine distance (which is a more standard way of measuring distances between vectors, since it measures only angle, rather than magnitude). They suspect that this is because Euclidean distance, but not cosine distance, belongs to a category of divergence/distance metrics (called Bregman divergences) with a special set of properties such that the point closest on aggregate to all points in a cluster is the average of all those points. If you want to dive way deep into the minutia on this point, I found this blog post quite good: http://mark.reid.name/blog/meet-the-bregman-divergences.html
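To make steps 1-6 concrete, here is a minimal PyTorch sketch of a single Prototypical Networks episode; the function signature, tensor shapes, and the use of `torch.cdist` for the squared Euclidean distances are my own choices, not code from the paper.

```python
import torch
import torch.nn.functional as F

def prototypical_loss(embed, support_x, support_y, query_x, query_y, n_classes):
    """One training episode of a Prototypical Network.

    embed:     a network mapping raw inputs -> p-dimensional embeddings
    support_x: (n_support, ...) inputs used to build the class prototypes
    support_y: (n_support,)     integer class labels in [0, n_classes)
    query_x:   (n_query, ...)   inputs to classify against the prototypes
    query_y:   (n_query,)       integer class labels for the query examples
    """
    z_support = embed(support_x)                      # (n_support, p)
    z_query = embed(query_x)                          # (n_query, p)

    # prototype = mean embedding of each class's support examples
    prototypes = torch.stack(
        [z_support[support_y == c].mean(dim=0) for c in range(n_classes)]
    )                                                 # (n_classes, p)

    # squared Euclidean distance from each query to each prototype
    dists = torch.cdist(z_query, prototypes) ** 2     # (n_query, n_classes)

    # softmax over negative distances: closer prototypes get higher probability
    log_p = F.log_softmax(-dists, dim=1)
    return F.nll_loss(log_p, query_y)
```

Swapping the squared Euclidean distance here for cosine similarity is exactly the comparison the authors ran, and they found it worked meaningfully worse.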
[link]
The Transformer, a somewhat confusingly-named model structure that uses attention mechanisms to aggregate information for understanding or generating data, has been having a real moment in the last year or so, with GPT-2 being only the most well-publicized tip of that iceberg. It has lots of advantages: the obvious attraction of strong performance, as well as the ability to train in parallel across parts of a sequence, which RNNs can't do because of the need to build up and maintain state.

However, a problematic fact about the Transformer approach is how it scales to long input sequences. Because attention is based on performing pairwise queries between each point in the data sequence and each other point, to allow for aggregation of information from places throughout the sequence, it scales as O(N^2): every new element in the sequence needs to be queried by N other ones. This makes it resource-intensive to run Transformer models on long sequences.

The Sparse Transformer design proposed in this OpenAI paper tries to cut down on this resource cost by loosening the requirement that, in every attention operation, each element directly pulls information from every other element. In this new system, each point doesn't get information about every other point in a single operation; instead, chaining two such limited operations in a row provides that global visibility. This is done in one of two ways.

(1) The first, called the "strided" version, performs two operations in a row: one masked attention that only looks at the last k timesteps (for example, the last 7), and then a second masked attention that only looks at every kth timestep. So, at the end of the second operation, each point has pulled information from points at checkpoints 7, 14, 21, ... steps ago, and each of these has pulled information from the window between it and its preceding checkpoint, giving visibility into a full global receptive frame in the course of two operations.

(2) The second, called the "fixed" version, uses a similar sort of logic, but instead of having the "window accumulation points" be defined in reference to the point doing the querying, you instead have fixed accumulation points responsible for gathering information from the windows around them. So, using the example given in the paper, if you imagine a window of size 128 and 8 "accumulation points per window", then the points at indices 120-128 (say) would have visibility into points 0-128. That represents the first operation; in the second one, all other points in the sequence pull in information by querying the designated accumulation points of all the windows that aren't masked for them.

The paper argues that, between these two systems, the strided version should work best when the data has some inherent periodicity, but I don't know that I particularly follow that intuition. I have some sense that the important distinction is that in the strided case you have many points of accumulation, each with not much context, whereas in the fixed case you have very few accumulation points, each with a larger window, but I don't know exactly what performance differences I'd expect these mechanical differences to produce.
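To make the strided pattern concrete, here is a minimal numpy sketch of the two boolean attention masks it chains together; this is my own reconstruction from the description above (the function name and exact windowing convention are assumptions), not code from the paper.

```python
import numpy as np

def strided_attention_masks(seq_len, stride):
    """Boolean masks for the two chained attention steps of the 'strided'
    pattern: mask[i, j] == True means query position i may attend to j.
    Both steps are causal (j <= i)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i

    # step 1: attend to the previous `stride` positions (a local window)
    local = causal & (i - j < stride)

    # step 2: attend to the "checkpoint" positions, i.e. positions whose
    # offset from the query is a multiple of the stride
    strided = causal & ((i - j) % stride == 0)

    return local, strided
```

Chaining the two masks (local window first, then checkpoints) lets information flow between any pair of positions in two hops, while each individual step attends to far fewer than N positions; with the stride chosen around sqrt(N), the per-layer cost drops from O(N^2) toward roughly O(N*sqrt(N)).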
This whole project of reducing access to global information seems initially a little counterintuitive, since the whole point of a transformer design, in some sense, was its ability to gain global context in a single layer, as opposed to a convnet needing multiple layers to build up a receptive field, or an RNN needing to maintain state throughout the sequence. However, I think this paper makes the most sense as a way of interpolating the space between something like a CNN and a full attention design, for the sake of efficiency.

With a CNN, you have a fixed kernel, so as your sequence gets longer, you need to add more and more layers in order for any given point to be able to incorporate into its representation information from the complete other side of the sequence. With an RNN, as your sequence gets longer, you pay the cost of needing to backpropagate state farther. By contrast, even though the Sparse Transformer seems to be giving up its signal advantage, it's really just trading one constant number of steps to achieve global visibility (1) for another (2 in this paper, though conceptually it could be more), still in a way that's constant with respect to the length of the data. And, in exchange for this trade, they get very sparse, heavily masked operations, where many of the multiplications involved in these big query calculations can be ignored, making for faster computation.

On the datasets tried, the Sparse Transformer increased speed, and in fact, in I think all cases, increased performance. The performance gain by itself isn't that dramatic, but in the context of expecting, if anything, worse performance as a result of limiting the model structure, it's encouraging and interesting that performance either stays about the same or possibly improves.
[link]
At a high level, this paper is a massive (34 pages!) and highly-resourced study of many nuanced variations of language pretraining tasks, to see which of those variants produce models that transfer best to new tasks. As a result, it doesn't lend itself *that* well to being summarized into a central kernel of understanding. So, I'm going to do my best to pull out some high-level insights, and recommend you read the paper in more depth if you're working particularly in language pretraining and want the details.

The goals here are simple: create a standardized task structure and a big dataset, so that you can use the same architecture across a wide range of objectives and subsequent transfer tasks, and thus actually compare tasks on equal footing. To that end, the authors created a huge dataset by scraping internet text and filtering it according to a few common-sense criteria. This is an important and laudable task, but not one with a ton of conceptual nuance to it.

https://i.imgur.com/5z6bN8d.png

A more interesting structural choice was to adopt a unified text-to-text framework for all of the tasks they might want their pretrained model to transfer to. This means that the input to the model is always a sequence of tokens, and so is the output. If the task is translation, the input sequence might be "translate english to german: build a bed" and the desired output would be that sentence in German. This gets particularly interesting as a change when it comes to tasks where you're predicting relationships of words within sentences, and would typically have a categorical classification loss, which is changed here to predicting the word of the correct class. This restructuring doesn't seem to hurt performance, and has the nice side effect that you can directly use the same model as a transfer starting point for all tasks, without having to add additional layers. Some of the transfer tasks include: translation, sentiment analysis, summarization, grammatical checking of a sentence, and checking the logical relationship between claims.

All tested models followed a transformer (i.e. fully attentional) architecture. The authors tested performance along many different axes. A structural variation was the difference between an encoder-decoder architecture and a language model one.

https://i.imgur.com/x4AOkLz.png

In both cases, you take in text and predict text, but in an encoder-decoder, you have separate models that operate on the input and output, whereas in a language model, it's all seen as part of a single continuous sequence. They also tested variations in what pretraining objective is used. The most common is simple language modeling, where you predict words in a sentence given prior or surrounding ones, but, inspired by the success of BERT, they also tried a number of denoising objectives, where an original sentence was corrupted in some way (by dropping tokens and replacing them with either masks, nothing, or random tokens; dropping individual words vs contiguous spans of words) and the model then had to predict the actual original sentence (see the sketch after the takeaway list below).

https://i.imgur.com/b5Eowl0.png

Finally, they tested the effect of dataset size and number of training steps. Some interesting takeaways:

- In almost all tests, the encoder-decoder architecture, where you separately build representations of your input and output text, performs better than a language model structure.
  This is still generally (though not as consistently) true if you halve the number of parameters in the encoder-decoder, suggesting that there's some structural advantage there beyond just the additional parameter count.
- A denoising, BERT-style objective works consistently better than a language modeling one. Within the set of different kinds of corruption, none works obviously and consistently better across tasks, though some have a particular advantage on a given task, and some are faster to train with due to different lengths of output text.
- Unsurprisingly, more data and bigger models both lead to better performance. Somewhat interestingly, training with less data but the same number of training iterations (such that you see the same data multiple times) seems to be fine up to a point. This potentially gestures at an ability to train over a dataset a higher number of times without being as worried about overfitting.
- Also somewhat unsurprisingly, training on a dataset that filters out HTML, random lorem-ipsum web text, and bad words performs meaningfully better than training on one that doesn't.
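As an illustration of the denoising family of objectives mentioned above, here is a minimal Python sketch of one variant: corrupting contiguous spans of the input and replacing them with sentinel tokens, with the target being the dropped spans. The sentinel naming, the span-sampling rule, and the defaults are simplified assumptions of mine, not the paper's exact procedure.

```python
import random

def span_corrupt(tokens, span_len=3, corrupt_rate=0.15, seed=0):
    """Simplified span corruption: drop contiguous spans of tokens from the
    input, replace each span with a sentinel, and make the target the
    sequence of sentinels followed by the tokens they replaced.
    Assumes len(tokens) > span_len."""
    rng = random.Random(seed)
    n_spans = max(1, int(len(tokens) * corrupt_rate / span_len))
    starts = sorted(rng.sample(range(0, len(tokens) - span_len), n_spans))

    inputs, targets, cursor = [], [], 0
    for k, s in enumerate(starts):
        if s < cursor:                    # skip spans that overlap a previous one
            continue
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:s])   # keep the uncorrupted text
        inputs.append(sentinel)           # mark where a span was removed
        targets.append(sentinel)
        targets.extend(tokens[s:s + span_len])  # the model must reproduce the span
        cursor = s + span_len
    inputs.extend(tokens[cursor:])
    return inputs, targets

# Example (hypothetical sentence):
# span_corrupt("the quick brown fox jumps over the lazy dog".split())
```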
[link]
This paper presents a variety of issues related to the evaluation of image generative models. Specifically, the authors provide evidence that evaluations of generative models based on the popular Parzen windows estimator, or based on a visual fidelity (qualitative) measure, both present serious flaws.

The Parzen windows approach to generative modeling evaluation works by taking a finite set of samples generated from a given model and then using those as the centroids of a Parzen windows Gaussian mixture. The constructed Parzen windows mixture is then used to compute a log-likelihood score on a set of test examples.

Some of the key observations made in this paper are:

1. A simple, k-means based approach can obtain better Parzen windows performance than using the original training samples for a given dataset, even though those are samples from the true distribution!
2. Even for the fairly low-dimensional space of 6x6 image patches, a Parzen windows estimator would require an extremely large number of samples to come close to the true log-likelihood performance of a model.
3. Visual fidelity is a bad predictor of true log-likelihood performance, as it is possible to:
   + Obtain great visual fidelity and arbitrarily low log-likelihood, with a Parzen windows model made of Gaussians with very small variance.
   + Obtain bad visual fidelity and high log-likelihood, by taking a model with high log-likelihood and mixing it with a white noise model, putting as much as 99% of the mixing probability on the white noise model (which would thus produce bad samples 99% of the time, while lowering the log-likelihood by at most log(100), a negligible amount for high-dimensional images).
4. Measuring overfitting of a model by taking samples from the model and checking that their training-set nearest neighbors are different is ineffective, since it is actually trivial to generate samples that are each visually almost identical to a training example, yet each have a large Euclidean distance to their corresponding (visually similar) training example.
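For concreteness, here is a minimal numpy/scipy sketch of the Parzen windows evaluation being criticized; the isotropic Gaussian kernel and the bandwidth argument are generic choices of mine, not the exact setup from any particular implementation.

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(model_samples, test_points, sigma):
    """Average log-likelihood of test points under an isotropic-Gaussian
    Parzen window (kernel density) estimator centered on model samples.

    model_samples: (n, d) samples drawn from the generative model
    test_points:   (m, d) held-out test examples
    sigma:         kernel bandwidth (usually tuned on a validation set)
    """
    n, d = model_samples.shape
    # squared distances between every test point and every model sample: (m, n)
    diffs = test_points[:, None, :] - model_samples[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # log N(x; sample, sigma^2 I), combined over samples via logsumexp
    log_kernel = -sq_dists / (2 * sigma ** 2) \
                 - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    log_px = logsumexp(log_kernel, axis=1) - np.log(n)
    return log_px.mean()
```

Observation 2 above is essentially about how large `model_samples` has to be before this estimate stops being badly biased relative to the model's true log-likelihood, and observation 1 says that cleverly chosen centroids (e.g. k-means centers) can beat genuine samples from the true distribution under this score.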