[link]
The goal of one-shot learning tasks is to design a learning structure that can perform a new task (or, more canonically, add a new class to an existing task) using only a small number of examples of the new task or class. As an example: you'd want to be able to take one positive and one negative example of a given task and correctly classify subsequent points as either positive or negative. A common way of achieving this, and the way that this paper builds on, is to learn a parametrized function projecting both your labeled points (your "support set") and your unlabeled point (your "query") into an embedding space, and then assigning a class to your query according to how close it is to the support set points associated with each label. The hope is that, in the course of training on different but similar tasks, you've learned a metric space where nearby things tend to be of similar classes. This method is called a "matching network". This paper has the specific objective of using such one-shot methods for drug discovery, and evaluates on tasks drawn from that domain, but most of the mechanics of the paper can be understood without reference to molecular data in particular.

In the simplest version of such a network, the query and support set points are embedded unconditionally - meaning that the query would be embedded in the same way regardless of the values in the support set, and that each point in the support set would be embedded without knowledge of the others. However, given how little data we're giving our model to work with, it might be valuable to allow our query embedder (f(x)) and support set embedder (g(x)) to depend on the values within the support set. Prior work had achieved this by:

1) Creating initial f'(x) and g'(x) query and support embedders.
2) Concatenating the embedded support points g'(x) into a single sequence and running a bidirectional LSTM over it, which results in a representation g(x) of each input that incorporates information from g'(x_i) for all other x_i (albeit in a way that imposes an ordering that may not correspond to anything meaningful)
3) Calculating f(x) of your query point by using an attention mechanism to combine f'(x) with the contextualized embeddings g(x)

The authors of the current paper argue that this approach is suboptimal because of the artificially imposed ordering, and because it calculates g(x) prior to f(x) using asymmetrical model structures (though it's not super clear why this latter point is a problem). Instead, they propose a somewhat elaborate and difficult-to-follow attention-based mechanism. As best as I can understand, this is what they're suggesting:

https://i.imgur.com/4DLWh8H.png

1) Update the query embedding f(x) by calculating an attention distribution over the current embeddings of the support set points (here referred to as bolded <r>), pooling them down to a single aggregate embedding vector r, and then using an LSTM that takes in that aggregate vector and the prior update to generate a new update. This update, dz, is added to the existing query embedding estimate to get a new one.
2) Update the support set embeddings by iteratively calculating an attention mapping between the current support set embeddings and the original features g'(S), using that attention mapping to create a new <r>, which, similar to the above, is fed into an LSTM to calculate the next update.
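To make step (1) concrete, here is a minimal NumPy sketch of the attention-readout-and-refine loop as I understand it. This is only an illustration of the idea, not the paper's architecture: the shapes (`embed_dim`, `n_support`), the cosine-similarity attention, and especially the damped additive step standing in for the LSTM-generated update dz are my own simplifications.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_readout(query_emb, support_embs):
    """Attend from the query embedding over the support embeddings and
    pool them into a single aggregate vector r (cosine-style attention)."""
    q = query_emb / np.linalg.norm(query_emb)
    s = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    weights = softmax(s @ q)           # one attention weight per support point
    return weights @ support_embs      # aggregate readout vector r

# Toy refinement loop: repeatedly read from the support set and nudge the
# query embedding toward the readout. The paper uses an LSTM to produce the
# update dz; a plain damped step stands in for it here.
rng = np.random.default_rng(0)
embed_dim, n_support = 8, 5
f_x = rng.normal(size=embed_dim)               # initial query embedding f'(x)
g_S = rng.normal(size=(n_support, embed_dim))  # support embeddings g'(S)

for _ in range(3):
    r = attention_readout(f_x, g_S)
    dz = 0.1 * (r - f_x)    # stand-in for the LSTM-generated update
    f_x = f_x + dz
```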
Since the model is evaluated on molecular tasks, all of the embedding functions are structured as graph convolutions. Other than the obvious fact that attention is a great way of aggregating information in an order-independent way, the authors give disappointingly little justification of why they would expect their method to work meaningfully better than past approaches. Empirically, they do find that it performs slightly better than prior contextualized matching networks on held-out tasks of predicting toxicity and side effects with only a small number of examples from the held-out task. However, neither this paper's new method nor previous one-shot learning work is able to perform very well on the challenging MUV dataset, where held-out binding tasks involve molecules structurally dissimilar from those seen during training, suggesting that whatever generalization this method achieves doesn't quite rise to the task of making inferences based on molecules with different structures.
[link]
This is a paper released by the creators of the DeepChem library/framework, explaining the efforts they've put into facilitating straightforward and reproducible testing of new methods. They advocate for consistency between tests on three main axes:

1. On the most basic level, that methods evaluate on the same datasets
2. That they use canonical train/test splits
3. That they use canonical metrics

To that end, they've integrated a framework they call "MoleculeNet" into DeepChem, containing standardized interfaces to datasets, splits, and metrics.

**Datasets** MoleculeNet contains 17 different datasets, where "dataset" here just means a collection of data labeled for a certain task or set of tasks. The tasks fall into one of four groups:

- quantum mechanical prediction (atomization energy, spectra)
- prediction of properties of physical chemistry (solubility, lipophilicity)
- prediction of biophysical interactions like binding affinity
- prediction of human-level physiological properties (toxicity, side effects, whether a compound passes the blood-brain barrier)

An interesting thing to note here is that only some datasets contain 3D orientations of molecules, because spatial orientations are properties of *a given conformation* of a molecule, and while some output measures (like binding geometry) depend on 3D arrangement, others (like solubility) don't.

**Metrics** The metrics chosen were pretty straightforward - Root Mean Squared Error or Mean Absolute Error for continuous prediction tasks, and ROC-AUC or PRC-AUC for classification ones. The only notable nuance was that the paper argued for PRC-AUC as the standard metric for datasets with a low number of positives, since that metric is the strictest on false positives.

**Test/Train Split** Most of these were fairly normal - random split and time-based split - but I found the idea of a scaffold split (where you cluster molecules by structural similarity, and assign each cluster to either train or test) interesting. The idea here is that if molecules are similar enough to one another, seeing one of a pair during training might be comparable to seeing an actual shared example between training and test, and have the same propensity for overconfident results. (A rough sketch of this kind of split appears at the end of this summary.)

**Models** DeepChem has put together implementations of a number of standard machine learning methods (SVM, Random Forest, XGBoost, Logistic Regression) on molecular features, as well as a number of molecule-specific graph-structured methods. At a high level, these are:

https://i.imgur.com/x4yutlp.png

- Graph Convolutions, which update atom representations by combining transformations of the features of bonded neighbor atoms
- DAGs, which create an "atom-centric" graph for each atom in the molecule and "pull" information inwards from farther-away nodes (for the record, I don't fully follow how this one works, since I haven't read the underlying paper)
- Weave Model, which maintains both atom representations and pair representations between all pairs of atoms, not just ones bonded to one another, and updates each in a cross-cutting way: updating an atom representation from all of its pairs (as well as itself), and then updating a pair representation from the atoms in its pairing (as well as itself).
This has the benefit of making information from far-away atoms available immediately, rather than having to propagate through the graph, but is also more computationally taxing
- Message Passing Neural Network, which operates like Graph Convolutions except that the feature transform used to pull in information from neighboring atoms changes depending on the type of the bond between atoms
- Deep Tensor Neural Network - Instead of bonds, this approach represents atoms in 3D space, and pulls in information from other atoms based on spatial distance

**Results** As part of creating its benchmark, MoleculeNet also tested its implementations of its models on all its datasets. It's interesting the extent to which the results form a narrative, in terms of which tasks benefit most from flexible structure-based methods (like graph approaches) vs hand-engineered features.

https://i.imgur.com/dCAdJac.png

Predictions of quantum mechanical properties and properties of physical chemistry do consistently better with graph-based methods, potentially suggesting that the features we've thought to engineer aren't in line with useful features for those tasks. By contrast, on biophysical tasks, hand-engineered features combined with traditional machine learning mostly come out on top, a fact I found a bit surprising, given the extent to which I'd read about deep learning methods claiming strong results on prediction of things like binding affinity. This is a point I should do more work to resolve. And, when it came to physiological properties like toxicity and side effects, results are pretty mixed between graph-based and traditional methods.
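As referenced in the Test/Train Split section above, here is a minimal sketch of a scaffold-style split: group molecules by their Bemis-Murcko scaffold and assign whole groups to train or test, so near-duplicates never straddle the split. This assumes RDKit is available; the SMILES strings, the fill-largest-groups-first heuristic, and the split fraction are toy illustrations, not the exact logic DeepChem uses.

```python
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    # Group molecule indices by their Bemis-Murcko scaffold SMILES.
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol) if mol else ""
        groups[scaffold].append(i)

    # Fill the training set with the largest scaffold groups first;
    # whatever doesn't fit spills into the test set.
    train_cutoff = int((1 - test_fraction) * len(smiles_list))
    train_idx, test_idx = [], []
    for scaffold in sorted(groups, key=lambda s: len(groups[s]), reverse=True):
        if len(train_idx) + len(groups[scaffold]) <= train_cutoff:
            train_idx.extend(groups[scaffold])
        else:
            test_idx.extend(groups[scaffold])
    return train_idx, test_idx

# Toy usage: two benzene-scaffold molecules end up on the same side of the split.
smiles = ["CCO", "c1ccccc1O", "c1ccccc1N", "CCN", "C1CCCCC1"]
train, test = scaffold_split(smiles, test_fraction=0.4)
```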
[link]
If you read modern (that is, 2018-2020) papers using deep learning on molecular inputs, almost all of them use some variant of graph convolution. So, I decided to go back through the citation chain and read the earliest papers that thought to apply this technique to molecules, to get an idea of the lineage of the technique within this domain. This 2015 paper, by Duvenaud et al, is the earliest one I can find. It focuses the entire paper on comparing differentiable, message-passing networks to the state-of-the-art standard at the time, circular fingerprints (more on that in a bit). I really appreciated this approach, which, rather than trying to claim an unrealistic level of novelty, goes into detail on the prior approach and carves out specific areas of difference. At a high level, the authors' claim is: our model is, in its simplest case, a more flexible generalization of existing work. The unspoken corollary, which ended up being proven true, is that the flexibility of the neural network structure makes it easy to go beyond this initial level of simplicity.

Circular Fingerprinting (or, more properly, Extended-Connectivity Circular Fingerprints) is a fascinating algorithm that captures many of the elements of convolution: shared weights, a hierarchy of kernels that match patterns at different scales, and a clever way of aggregating information across an arbitrary number of input nodes. Mechanistically, Circular Fingerprints work by:

1) Taking each atom, and creating a concatenated vector of its basic features, along with the basic features of each atom it's bonded to (with the bonded neighbors ordered quasi-randomly)
2) Calculating next-level features by applying some number of hash functions (roughly equivalent to convolutional kernels) to the neighborhood feature vector at the lower level to produce an integer
3) For each feature, setting the value of the fingerprint vector to 1 at the index implied by the integer in step (2)
4) Iterating this process at progressively higher layers, using the hashed features from the prior layer as inputs to the next

The effect of this is to assign each index of the vector to a binary feature (modulo hash collisions), where that feature is activated if an exact match is found to a substructure centered on a given atom. Its main downsides are that (a) its "kernel" equivalents are fixed and not trainable, since they're just random hashes, and (b) its features represent *exact* matches to lower-level feature patterns, which means you can't have one feature activated to different degrees by variations on a pattern it's identifying.

https://i.imgur.com/V8FpfVE.png

Duvenaud et al present their alternative in terms of keeping a similar structure, but swapping out the fixed, binary components for trainable (because differentiable) and continuous ones. Instead of concatenating a random sorting of atom neighbors to enforce invariance to sorting, they simply sum feature vectors across neighbors, which is also an order-invariant operation. Instead of applying hash functions, they apply parametrized kernel functions, with the same parameters used across all aggregated neighborhood vectors. This will no longer look for exact matches, but will activate to the extent that the structure around an atom matches a kernel pattern. Then, these features are put through a softmax, which, instead of setting a single index of the vector to a hard 1, activates different feature indices in the final vector to differing degrees. The final fingerprint is simply the sum of these softmax feature activations across atoms.
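A minimal NumPy sketch of that neural-fingerprint idea as described above: sum an atom's features with its bonded neighbors', apply a shared learned transform, softmax into the fingerprint, and sum over atoms. The array shapes, the single-layer depth, and the dense adjacency matrix are my simplifications, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def neural_fingerprint(atom_feats, adjacency, W, H):
    """One layer of a Duvenaud-style neural fingerprint.

    atom_feats: (n_atoms, n_feats) per-atom feature vectors
    adjacency:  (n_atoms, n_atoms) binary bond matrix
    W: (n_feats, hidden) shared 'kernel' applied to every neighborhood sum
    H: (hidden, fp_len) maps hidden activations onto fingerprint indices
    """
    fingerprint = np.zeros(H.shape[1])
    for a in range(atom_feats.shape[0]):
        # Order-invariant replacement for ECFP's concatenation: sum the
        # atom's own features with those of its bonded neighbors.
        neighborhood = atom_feats[a] + adjacency[a] @ atom_feats
        hidden = np.tanh(neighborhood @ W)
        # Soft, differentiable stand-in for "set index i of the fingerprint to 1".
        fingerprint += softmax(hidden @ H)
    return fingerprint

# Toy example: 3 atoms in a chain (bonds 0-1 and 1-2), 4 features per atom.
rng = np.random.default_rng(0)
atom_feats = rng.normal(size=(3, 4))
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
W = rng.normal(size=(4, 8))
H = rng.normal(size=(8, 16))
print(neural_fingerprint(atom_feats, adjacency, W, H))
```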
The authors do a few tests to confirm their substitution is working well, including starting out with a random network (to better approximate the random hash functions), comparing distances between molecules according to either the circular or neural fingerprint (which had a high correlation), and confirming that the random-weight version performs similarly to circular fingerprints on a set of supervised learning tasks on molecules. When they trained the weights, rather than leaving them random, on three such supervised tasks, they found that their model was comparable or better than circular fingerprints on all three (to break that down: it was basically equivalent on one, and notably better on the other two, according to mean squared error).

This really is the simplest possible version of a message-passing or graph convolutional network (it doesn't use edge features, it doesn't calculate features of a neighbor-connection according to the features of each node, etc), but it's really satisfying to see it laid out as a next-step alternative that offered value just by stepping away from exact-match feature dynamics and fixed random hash functions, even without all the sophisticated additions that would later be added to such models.
[link]
This paper was published after the 2015 Duvenaud et al paper proposing a differentiable alternative to circular fingerprints of molecules: substituting out the exact-match random hash functions used to identify molecular structures for learned, convolution-esque kernels. As far as I can tell, the Duvenaud paper was the first to propose something we might today recognize as graph convolutions on atoms. I hoped this paper would build on that one, but it seems to be coming from a conceptually different direction, and it seems like it was more or less contemporaneous, for all that it was released later.

This paper introduces a structure that allows for more explicit message passing along bonds, by calculating atom features as a function of their incoming bonds, and then bond features as a function of their constituent atoms, and iterating this procedure, so information from an atom can be passed into a bond, then, on the next iteration, pulled in by another atom on the other end of that bond, and then pulled into that atom's bonds, and so forth. This has the effect of, similar to a convolutional or recurrent network, creating representations for each atom in the molecular graph that are informed by context elsewhere in the graph, to degrees that depend on distance from that atom. More specifically, it defines:

- A function mapping from a prior layer atom representation to a subsequent layer atom representation, taking into account only information from that atom (Atom to Atom)
- A function mapping from a prior layer bond representation (indexed by the two atoms on either side of the bond) to a subsequent layer bond representation, taking into account only information from that bond at the prior layer (Bond to Bond)
- A function creating a bond representation by applying a shared function to the atoms at either end of it, and then combining those representations with an aggregator function (Atoms to Bond)
- A function creating an atom representation by applying a shared function to all the bonds that atom is a part of, and then combining those results with an aggregator function (Bonds to Atom)

At the top of this set of layers, when each atom has had information diffused into it by other parts of the graph (to a degree depending on the network depth), the authors aggregate the per-atom representations into histograms (basically, instead of summing or max-pooling feature-wise, creating coarse distributions of each feature), and use those for supervised tasks.

One frustration I had with this paper is that it doesn't do a great job of highlighting its differences from and advantages over prior work; in particular, I think it doesn't do a very good job arguing that its performance is superior to the earlier Duvenaud work. That said, for all that the presentation wasn't ideal, the idea of message passing is an important one in graph convolutions, and it would end up becoming more standard in later works.
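A minimal NumPy sketch of one round of the four updates listed above. The linear-transform parameterization, the summation aggregator, and the dense bond tensor are my own simplifications of the general scheme, not the paper's exact functions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def message_passing_round(atoms, bonds, adjacency, params):
    """One round of alternating atom/bond updates.

    atoms:     (n, d) per-atom features
    bonds:     (n, n, d) per-bond features (only entries where adjacency == 1 matter)
    adjacency: (n, n) binary bond matrix
    params:    dict of weight matrices, one per update function
    """
    # Atom -> Atom and Bond -> Bond: per-element transforms, no mixing.
    atoms_self = relu(atoms @ params["atom_to_atom"])
    bonds_self = relu(bonds @ params["bond_to_bond"])

    # Atoms -> Bond: shared transform of each endpoint, aggregated by summation.
    atom_msg = relu(atoms @ params["atoms_to_bond"])
    bonds_from_atoms = atom_msg[:, None, :] + atom_msg[None, :, :]

    # Bonds -> Atom: shared transform of each incident bond, aggregated by summation.
    bond_msg = relu(bonds @ params["bonds_to_atom"])
    atoms_from_bonds = np.einsum("ij,ijd->id", adjacency, bond_msg)

    new_atoms = atoms_self + atoms_from_bonds
    new_bonds = (bonds_self + bonds_from_atoms) * adjacency[:, :, None]
    return new_atoms, new_bonds

# Toy example: 3 atoms in a chain, 4-dim features; iterate to diffuse information.
rng = np.random.default_rng(0)
n, d = 3, 4
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
atoms = rng.normal(size=(n, d))
bonds = rng.normal(size=(n, n, d)) * adjacency[:, :, None]
params = {k: rng.normal(size=(d, d)) for k in
          ["atom_to_atom", "bond_to_bond", "atoms_to_bond", "bonds_to_atom"]}
for _ in range(2):
    atoms, bonds = message_passing_round(atoms, bonds, adjacency, params)
```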
[link]
My objective in reading this paper was to gain another perspective on, and thus a more well-grounded view of, machine learning scoring functions for docking-based prediction of ligand/protein binding affinity. As quick background context, these models are useful because many therapeutic compounds act by binding to a target protein, and it can be valuable to prioritize doing wet lab testing on compounds that are predicted to have a stronger binding affinity. Docking systems work by predicting the pose in which a compound (or ligand) would bind to a protein, and then scoring prospective poses based on how likely such a pose would be to have high binding affinity. It's important to note that there are two predictive components in such a pipeline, and thus two sources of potential error: the search over possible binding poses, done by physics-based systems, and the scoring of the affinity of a given pose, assuming it were actually the correct one. The second kind of modeling, which this paper focuses on, therefore takes in features *of a particular binding pose*, which includes information like which atoms of the compound are near which atoms of the protein.

The actual neural network structure used here was admittedly a bit underwhelming (though, to be fair, many of the ideas it seems to be gesturing at wouldn't be properly formalized until Graph Convolutional Networks came around). I'll describe the network mechanically first, and then offer some commentary on the design choices.

https://i.imgur.com/w9wKS10.png

1. For each atom in the compound, a set of neighborhood features is defined. The neighborhood is based on two hyperparameters, one for "how many atoms from the protein should be included," and one for "how many atoms from the compound should be included". In both cases, you start by adding the closest atom from either the compound or protein, and as the hyperparameter values increase, you add in farther-away atoms. The neighborhood features here are (i) What are the types of the atoms? (ii) What are the partial charges of the atoms? (iii) How far are the atoms from the reference atom? (iv) What amino acid within the protein do the protein atoms come from?
2. All of these features are turned into embeddings. Yes, all of them, even the ones (distance and charge) that are continuous values. Coming from a machine learning perspective, this is... pretty weird as a design choice. The authors straight-up discretize the distance values, and then use those as discrete values for the purpose of looking up embeddings. (So, you'd have one embedding vector for a distance of 0.25-0.5, and a different one for 0.0-0.25, say.)
3. The embeddings are concatenated together into a single "atom neighborhood vector" based on a predetermined ordering of the neighbor atoms and their property vectors. We now have one atom neighborhood vector for each atom in the compound.
4. The authors then do what they call a convolution over the atom neighborhood vectors. But it doesn't act like a normal convolution in the sense of mixing information from nearby regions of atom space. It's basically a fully connected layer applied to each atom neighborhood vector separately, but with shared weights, so the same layer is applied to every neighborhood vector. They then do a feature-wise max pool across the layer-transformed versions of the neighborhood vectors, getting one vector for the full compound.
5. This single vector is then put into a softmax, which predicts whether this ligand (in this particular pose) will bind strongly to the protein.
Some thoughts on what's going on here. First, I really don't have a good explanation for why they'd have needed to embed a discretized version of the continuous variables, and since they don't do an ablation test of that design choice, it's hard to know if it mattered. Second, it's interesting to see, in their "convolution" (which I think is more accurately described as a Siamese network, since it's only convolution-like insofar as there are shared weights), the beginnings of the intuitions that would become Graph Convolutions. The authors knew that they needed methods to aggregate information from arbitrary numbers of atoms, and also that they should learn representations that have visibility onto neighborhoods of atoms, rather than single ones, but they do so in an entirely hand-engineered way: manually specifying a fixed neighborhood and pulling in information from all those neighbors equally, in a big concatenated vector. By contrast, when Graph Convolutions come along, they act by defining a "message-passing" function for features to aggregate across graph edges (here: molecular bonds, or binary indicators of being "near enough" to another atom), which similarly allows information to be combined across neighborhoods. And then the 'convolution' is basically just a simple aggregation: necessary because there's no canonical ordering of elements within a graph, so you need an order-agnostic aggregation like a sum or max pool.

The authors find that their method is able to improve on the hand-designed scoring functions within the docking programs. However, they also find (similar to another paper I read recently) that their model is able to do quite well without even considering structural relationships of the binding pose with the protein, which suggests the dataset (DUD - a dataset of 40 proteins with ~4K correctly binding ligands, and ~35K ligands paired with proteins they don't bind to) and the problem given to the model are too easy. It's also hard to tell how I should interpret AUCs within this problem - it's one thing to be better than an existing method, but how much value do you get from a given unit of AUC improvement, when it comes to actually meaningfully reducing wet lab time used on testing compounds? I don't know that there's much to take away from this paper in terms of useful techniques, but it is interesting to see the evolution of ideas that would later be more cleanly formalized in other works.
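For concreteness, here is a minimal NumPy sketch of the shared-weight "convolution" plus feature-wise max pool described in steps 4-5 above. The dimensions, the two-class softmax output, and the random inputs are placeholders of my own, not the paper's actual architecture details.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score_pose(neighborhood_vectors, W_shared, b_shared, W_out, b_out):
    """Shared fully connected layer over each atom-neighborhood vector,
    feature-wise max pool across atoms, then a softmax binding score."""
    hidden = np.tanh(neighborhood_vectors @ W_shared + b_shared)  # (n_atoms, hidden)
    pooled = hidden.max(axis=0)                                   # one vector per compound
    return softmax(pooled @ W_out + b_out)                        # [p(no bind), p(bind)]

# Toy example: 6 compound atoms, each with a 20-dim concatenated
# neighborhood-embedding vector.
rng = np.random.default_rng(0)
neighborhoods = rng.normal(size=(6, 20))
W_shared = rng.normal(size=(20, 16)) * 0.1
b_shared = np.zeros(16)
W_out = rng.normal(size=(16, 2)) * 0.1
b_out = np.zeros(2)
print(score_pose(neighborhoods, W_shared, b_shared, W_out, b_out))
```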