Welcome to ShortScience.org!
[link]
* Deep plain/ordinary networks usually perform better than shallow networks.
* However, when they get too deep their performance on the *training* set decreases. That should never happen and is a shortcoming of current optimizers.
* If the "good" insights of the early layers could be transferred through the network unaltered, while changing/improving the "bad" insights, that effect might disappear.
### What residual architectures are
* Residual architectures use identity functions to transfer results from previous layers unaltered.
* They change these previous results based on results from convolutional layers.
* So while a plain network might do something like `output = convolution(image)`, a residual network will do `output = image + convolution(image)`.
* If the convolution resorts to just doing nothing, that will make the result a lot worse in the plain network, but not alter it at all in the residual network.
* So in the residual network, the convolution can focus fully on learning what positive changes it has to perform, while in the plain network it *first* has to learn the identity function and then what positive changes it can perform.
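The contrast above can be sketched in a few lines (a toy illustration, not the paper's architecture; the `do_nothing` "layer" stands in for a convolution that has learned nothing):

```python
import numpy as np

def residual_block(x, f):
    # output = x + f(x): if f collapses to zero, the input passes through unchanged
    return x + f(x)

def plain_block(x, f):
    # output = f(x): if f collapses to zero, the input is destroyed
    return f(x)

x = np.array([1.0, 2.0, 3.0])
do_nothing = lambda v: np.zeros_like(v)  # a "layer" that has learned nothing

print(residual_block(x, do_nothing))  # input preserved
print(plain_block(x, do_nothing))     # input lost
```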
### How it works
* Residual architectures can be implemented in most frameworks. You only need something like a split layer and an element-wise addition.
* Use one branch with an identity function and one with 2 or more convolutions (1 is also possible, but seems to perform poorly). Merge them with the element-wise addition.
* Rough example block (for a 64x32x32 input):
https://i.imgur.com/NJVb9hj.png
* An example block when you have to change the dimensionality (e.g. here from 64x32x32 to 128x32x32):
https://i.imgur.com/9NXvTjI.png
* The authors seem to prefer using either two 3x3 convolutions or the chain of 1x1 then 3x3 then 1x1. They use the latter one for their very deep networks.
* The authors also tested:
* To use 1x1 convolutions instead of identity functions everywhere. Performed a bit better than using 1x1 only for dimensionality changes. However, it also increases computation and memory demands.
* To use zero-padding for dimensionality changes (no 1x1 convs, just fill the additional dimensions with zeros). Performed only a bit worse than 1x1 convs and a lot better than plain network architectures.
* Pooling can be used as in plain networks. No special architectures are necessary.
* Batch normalization can be used as usual (before nonlinearities).
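A minimal sketch of the split + element-wise addition wiring (hypothetical shapes; the "convolutions" here are plain 1x1 channel mixes standing in for the real 3x3 convolutions, and batch normalization is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in) -- pure channel mixing
    return np.einsum('oc,chw->ohw', w, x)

def residual_block(x, w1, w2, w_proj=None):
    f = np.maximum(conv1x1(x, w1), 0.0)  # first "conv" + ReLU
    f = conv1x1(f, w2)                   # second "conv"
    # shortcut branch: identity, or a 1x1 projection when channels change
    shortcut = x if w_proj is None else conv1x1(x, w_proj)
    return np.maximum(f + shortcut, 0.0)  # element-wise add, then ReLU

x = rng.normal(size=(64, 32, 32))
# same dimensionality (64x32x32 -> 64x32x32): identity shortcut
y = residual_block(x, rng.normal(size=(64, 64)), rng.normal(size=(64, 64)))
# dimensionality change (64x32x32 -> 128x32x32): projection shortcut
y2 = residual_block(x, rng.normal(size=(128, 64)), rng.normal(size=(128, 128)),
                    w_proj=rng.normal(size=(128, 64)))
print(y.shape, y2.shape)
```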
### Results
* Residual networks seem to perform generally better than similarly sized plain networks.
* They seem to be able to achieve similar results with less computation.
* They enable well-trainable very deep architectures with up to 1000 layers and more.
* The activations of the residual layers are low compared to plain networks. That indicates that the residual networks indeed only learn to make "good" changes and default to "if in doubt, change nothing".

*Examples of basic building blocks (other architectures are possible). The paper doesn't discuss the placement of the ReLU (after add instead of after the layer).*

*Activations of layers (after batch normalization, before nonlinearity) throughout the network for plain and residual nets. Residual networks have on average lower activations.*
-------------------------
### Rough chapter-wise notes
* (1) Introduction
* In classical architectures, adding more layers can cause the network to perform worse on the training set.
* That shouldn't be the case. (E.g. a shallower network could be trained and then get a few layers of identity functions on top of it to create a deep network.)
* To combat that problem, they stack residual layers.
* A residual layer is an identity function and can learn to add something on top of that.
* So if `x` is an input image and `f(x)` is a convolution, they do something like `x + f(x)` or even `x + f(f(x))`.
* The classical architecture would be more like `f(f(f(f(x))))`.
* Residual architectures can be easily implemented in existing frameworks using skip connections with identity functions (split + merge).
* Residual architectures outperformed all others in ILSVRC 2015 and COCO 2015.
* (3) Deep Residual Learning
* If some layers have to fit a function `H(x)` then they should also be able to fit `H(x) - x` (change between `x` and `H(x)`).
* The latter case might be easier to learn than the former one.
* The basic structure of a residual block is `y = x + F(x, W)`, where `x` is the input image, `y` is the output image (`x + change`) and `F(x, W)` is the residual subnetwork that estimates a good change of `x` (W are the subnetwork's weights).
* `x` and `F(x, W)` are added using element-wise addition.
* `x` and the output of `F(x, W)` must have equal dimensions (channels, height, width).
* If different dimensions are required (mainly change in number of channels) a linear projection `V` is applied to `x`: `y = F(x, W) + Vx`. They use a 1x1 convolution for `V` (without nonlinearity?).
* `F(x, W)` subnetworks can contain any number of layers. They suggest 2+ convolutions. Using only 1 layer seems to be useless.
* They run some tests on a network with 34 layers and compare to a 34 layer network without residual blocks and with VGG (19 layers).
* They say that their architecture requires only 18% of the FLOPs of VGG. (Though a lot of that probably comes from VGG's 2x4096 fully connected layers? They don't use any fully connected layers, only convolutions.)
* A critical part is the change in dimensionality (e.g. from 64 kernels to 128). They test (A) adding the new dimensions empty (padding), (B) using the mentioned linear projection with 1x1 convolutions and (C) using the same linear projection, but on all residual blocks (not only for dimensionality changes).
* (A) doesn't add parameters, (B) does (i.e. breaks the pattern of using identity functions).
* They use batch normalization before each nonlinearity.
* Optimizer is SGD.
* They don't use dropout.
* (4) Experiments
* When testing on ImageNet an 18 layer plain (i.e. not residual) network has lower training set error than a deep 34 layer plain network.
* They argue that this effect does probably not come from vanishing gradients, because they (a) checked the gradient norms and they looked healthy and (b) use batch normalization.
* They guess that deep plain networks might have exponentially low convergence rates.
* For the residual architectures it's the other way round. Stacking more layers improves the results.
* The residual networks also perform better (in error %) than plain networks with the same number of parameters and layers. (Both for training and validation set.)
* Regarding the previously mentioned handling of dimensionality changes:
* (A) Pad new dimensions: Performs worst. (Still far better than plain network though.)
* (B) Linear projections for dimensionality changes: Performs better than A.
* (C) Linear projections for all residual blocks: Performs better than B. (Authors think that's due to introducing new parameters.)
* They also test on very deep residual networks with 50 to 152 layers.
* For these deep networks their residual block has the form `1x1 conv -> 3x3 conv -> 1x1 conv` (i.e. dimensionality reduction, convolution, dimensionality increase).
* These deeper networks perform significantly better.
* In further tests on CIFAR-10 they can observe that the activations of the convolutions in residual networks are lower than in plain networks.
* So the residual networks default to doing nothing and only change (activate) when something needs to be changed.
* They test a network with 1202 layers. It is still easily optimizable, but overfits the training set.
* They also test on COCO and get significantly better results than a Faster-R-CNN+VGG implementation.
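Option (A) from the notes, zero-padding the shortcut on a channel increase, can be sketched as follows (toy shapes, channels-first layout assumed):

```python
import numpy as np

def pad_shortcut(x, c_out):
    # On a channel increase, keep the identity shortcut parameter-free by
    # filling the new feature maps with zeros instead of learning a 1x1 projection.
    c_in = x.shape[0]
    pad = np.zeros((c_out - c_in,) + x.shape[1:])
    return np.concatenate([x, pad], axis=0)

x = np.ones((64, 32, 32))
s = pad_shortcut(x, 128)   # 64x32x32 -> 128x32x32, no new parameters
print(s.shape)
```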
[link]
TLDR; The authors train a word-level NMT model where UNK tokens in both source and target sentences are replaced by character-level RNNs that produce word representations. The authors can thus train a fast word-based system that still generalizes and doesn't produce unknown words. The best system achieves a new state-of-the-art BLEU score of 19.9 on WMT'15 English-to-Czech translation.

#### Key Points

- Source sentence: The final hidden state of the character-RNN is used as the word representation.
- Source sentence: Character-RNNs are always initialized with a zero state to allow efficient pre-training.
- Target: Produce the word-level sentence, including UNK tokens, first, and then run the char-RNNs.
- Target: Two ways to initialize the char-RNN: with the same hidden state as the word-RNN (same-path), or with its own representation (separate-path).
- The authors find that the attention mechanism is critical for pure character-based NMT models.

#### Notes

- Given that the authors demonstrate the potential of character-based models, is the hybrid approach the right direction? If we had more compute power, would pure character-based models win?
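The source-side idea, using the final hidden state of a character-RNN as the word representation, might look roughly like this (a vanilla RNN with made-up dimensions instead of the paper's GRU; all names and shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 4-dim character embeddings, 8-dim hidden state.
chars = "abcdefghijklmnopqrstuvwxyz"
char_emb = {c: rng.normal(size=4) for c in chars}
W_h = rng.normal(size=(8, 8)) * 0.1
W_x = rng.normal(size=(8, 4)) * 0.1

def char_rnn_embed(word):
    h = np.zeros(8)  # zero initial state, as for source-side words in the paper
    for ch in word:
        h = np.tanh(W_h @ h + W_x @ char_emb[ch])
    return h  # final hidden state serves as the word representation

vec = char_rnn_embed("unknownword")  # any OOV word gets a representation
print(vec.shape)
```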
[link]
Amazon’s platform is built upon different techniques working together to provide a single powerful, highly available system. One of the core components powering this system is Dynamo. Many services in the Amazon ecosystem store and retrieve data by primary key. Take the example of the shopping cart service: customers should be able to view and update their cart at any time. For these services, sophisticated solutions like RDBMS, GFS, etc. are overkill, as they do not need a complex query and data management system. Instead, they need a service that only supports read and write operations on a (key, value) store where the value is a small object (less than 1 MB in size) uniquely identified by the key. The service should be scalable and highly available with a well-defined consistency window. This is what Dynamo is: a scalable, distributed, highly available key-value store that provides an “always-on” experience.

#### Design Considerations

Dynamo achieves high availability at the cost of weaker consistency. Changes propagate to the replicas in the background, and conflict resolution is done at read time to make sure none of the write operations can fail. Dynamo uses simple policies like “last write wins” for conflict resolution, though applications using Dynamo may override these with their own methods. E.g. the cart application may choose to add items across all versions to make sure none of the items is lost. A service could depend on multiple other services to compute its results. To guarantee that a service returns its results in a bounded time, each of its dependencies has to return results with even tighter bounds. As a result, clients enter into a contract with servers regarding service-related characteristics like expected request rate, expected latency, and so on. Such an agreement is called a Service Level Agreement (SLA) and must be met to ensure efficiency. SLAs apply in the context of Dynamo as well.
Dynamo supports incremental scaling, where the system is able to scale out one node at a time. Moreover, all nodes are symmetrical in the sense that they have the same set of responsibilities. Since Dynamo is used only by Amazon’s internal applications, there are no security-related requirements like authentication and authorization.

#### Architecture

Dynamo exposes two operations: get() and put(). get(key) returns the value or list of values, along with context objects, corresponding to the key. put(key, context, value) stores the value and the context corresponding to the key. Context objects are used for conflict resolution. To support incremental scaling, Dynamo uses consistent hashing for its partitioning scheme. In consistent hashing, the output range of a hash function is treated as a fixed circular space. Each node and data object is assigned a random value or position within this space. A data object is mapped to the first node placed clockwise from the position of the data object. Every data item is replicated at N hosts. So every time a data item is assigned to a node, it is replicated to the N-1 clockwise successor nodes as well. The list of nodes storing a data item is called its preference list. Generally the preference list contains more than N nodes to account for system and network failures. An example case is shown with N = 3: any key between A and B would be mapped to B (by the consistent hashing logic) and to C and D (by the replication logic). https://cdn-images-1.medium.com/max/800/1*66VMYcQfvG3Z2acQD7aeYQ.png Each time data is created or updated, a new version of the data is created, so for a given key several versions of the data (or value) can exist. For versioning, Dynamo uses vector clocks. A vector clock is a list of (node, counter) pairs. When a put operation reaches node X, the node uses the context from the put request to know which version it is updating.
If there is an entry corresponding to X in the vector clock, the counter is incremented; otherwise a new entry is created for node X with counter = 1. When retrieving the value corresponding to a key, the node resolves conflicts among all versions based on Dynamo’s logic or the client’s logic. A likely issue with this approach is that the vector clock list may grow very large. To mitigate this, Amazon keeps evicting pairs from the list, in ascending order of the time when each entry was created, until the size drops below a threshold. Amazon has not faced any issues related to loss of accuracy with this approach. They also observed that the percentage of data with at least 2 versions is about 0.06%. Dynamo uses a quorum system to maintain consistency. For a read (or write) operation to be successful, R (or W) replicas out of N must participate in the operation successfully, with the condition that R + W > N. If some of the first N replicas are not available, say due to network failure, the read and write operations are performed on the first N healthy nodes. E.g. if node A is down, node B can be included in its place for the quorum. In this case, B would keep track of the data it received on behalf of A, and when A comes back online, B would hand this data over to A. This way a sloppy quorum is achieved. It is possible that B itself becomes unavailable before it can return the data to A. In this case, anti-entropy protocols are used to keep replicas synchronized. In Dynamo, each node maintains a Merkle tree for each key range it hosts. A Merkle tree is a hash tree whose leaves are hash values of individual keys and whose parents are hash values of their children. Nodes A and B exchange the roots of the Merkle trees corresponding to the set of keys they both host. This allows branches to be checked for inconsistencies without having to traverse the entire tree: a branch is traversed only when the hash values at the top of the branch differ.
This way the amount of data to be transferred for synchronization is minimized. The nodes in a cluster communicate via a gossip-based protocol, in which each node contacts a random peer and the two nodes reconcile their persisted membership history. This ensures an eventually consistent membership view. Apart from this, some nodes are marked as seed nodes, which are known to all nodes, including ones that join later. Seed nodes ensure that logical partitions are not created within the network even when new nodes are added. Since consistent hashing is used, the overhead of key reallocation when adding a new node is quite low.

#### Routing

There are 2 modes of routing requests in Dynamo. In the first mode, servers route the request. The node fulfilling the request is called the coordinator. For a read request, any node can act as the coordinator. For a write request, the coordinator is one of the nodes in the key’s current preference list, so if the write request reaches a node that is not in the preference list, it routes the request to one of the nodes that is. In the alternate mode, the client downloads the current membership state from any Dynamo node and determines itself which node to send the write request to. This approach saves an extra hop within the server cluster, but it assumes the membership state is fresh.

#### Optimizations

Apart from the architecture described above, Dynamo uses optimizations like read-repair: during a quorum read, if a node returns a stale response, it is updated with the latest version of the data. Similarly, since writes usually follow reads, the coordinator for a write operation is the node that replied fastest to the previous read operation. This increases the chances of read-your-writes consistency. To further reduce latency, each node maintains an object buffer in its main memory where write operations are stored, to be written to disk by a separate thread.
Read operations also check the in-memory buffer before going to disk. There is an added risk of a node crashing before writing the objects from its buffer to disk. To mitigate this, one of the N replicas performs a durable write, i.e. the data is written to disk. Since the quorum requires only W responses, the latency of this one node does not affect performance. Amazon also experimented with different partitioning schemes to ensure uniform load distribution, and adopted the scheme in which the hash space is divided into Q equally sized partitions and the placement of partitions is decoupled from the partitioning scheme.

#### Lessons Learnt

Although Dynamo is primarily designed as a write-intensive data store, N, R and W provide ample control to modify its behavior for other scenarios as well. For example, setting R = 1 and W = N makes it a high-performance read engine; services maintaining product catalogs and promotional items can use this mode. Similarly, setting W = 1 means a write request is never rejected as long as at least one server is up, though this increases the risk of inconsistency. Given that Dynamo allows clients to override the conflict resolution methods, it becomes a general solution for many more scenarios than it was originally intended for. One limitation is the small size of data for which it is designed. The choice makes sense in the context of Amazon, but it would be interesting to see how storing larger values affects performance. The response time would obviously increase, as more data needs to be transferred and the in-memory buffers would hold fewer objects. But with caching and larger in-memory buffers, the response time might be brought down enough that Dynamo could be used with somewhat larger data objects as well.
Dynamo scales well to a few hundred nodes, but it will not scale equally well to tens of thousands of nodes because of the large overhead of maintaining and distributing the routing table, whose size increases with the number of nodes. Another problem Amazon did not have to face was a high conflict rate: they observed that around 99.94% of requests saw exactly one version. Had this number been higher, latency would have been higher as well. All in all, Dynamo is not a universal solution for a distributed key-value store, but it solves one problem and solves it very well.
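The partitioning and replication scheme described above can be sketched as follows (a simplified illustration, not Amazon's implementation; virtual nodes and the extra nodes kept in the preference list for failure handling are omitted):

```python
import hashlib
from bisect import bisect_right

def ring_pos(name, space=2**32):
    # Hash a node name or key onto the fixed circular space.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % space

class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        # Each node gets one position on the ring (real Dynamo uses virtual nodes).
        self.ring = sorted((ring_pos(node), node) for node in nodes)

    def preference_list(self, key):
        # A key maps to the first node clockwise from its position,
        # plus the next N-1 successors as replicas.
        positions = [pos for pos, _ in self.ring]
        i = bisect_right(positions, ring_pos(key))
        return [self.ring[(i + k) % len(self.ring)][1] for k in range(self.n)]

ring = Ring(["A", "B", "C", "D", "E"], n_replicas=3)
pl = ring.preference_list("cart:alice")
print(pl)  # 3 distinct nodes, clockwise from the key's position
```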
[link]
This paper shows how to train a character-level RNN to generate text using only the GAN objective (reinforcement learning and the maximum-likelihood objective are not used). The baseline WGAN is made up of:

* A recurrent **generator** that first embeds the previously emitted token and inputs this into a GRU, which outputs a state that is then transformed into a distribution over the character vocabulary (representing the model's belief about the next output token).
* A recurrent **discriminator** that embeds each input token and then feeds them into a GRU. A linear transformation is applied to the final hidden state to give a "score" to the input (a correctly-trained discriminator should give a high score to real sequences of text and a low score to fake ones).

The paper shows that if you try to train this baseline model to generate sequences of length 32, it just won't work (only gibberish is generated). In order to get the model to work, the baseline model is augmented in three ways:

1. **Curriculum Learning**: At first the generator has to generate sequences of length 1 and the discriminator only trains on real and generated sequences of length 1. After a while, the model moves on to sequences of length 2, then 3, and so on, until length 32 is reached.
2. **Teacher Helping**: In GANs the generator is usually too weak. To help it, this paper proposes a method in which at stage $i$ in the curriculum, when the generator should generate sequences of length $i$, it is fed a real sequence of length $i-1$ and asked to generate just 1 more character.
3. **Variable Lengths**: At each stage $i$ of the curriculum, sequences of length $k$ are generated and discriminated for every $1 \leq k \leq i$ in each batch (instead of only sequences of length exactly $i$).

[[code]](https://github.com/amirbar/rnn.wgan)
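The interaction of curriculum learning and variable lengths can be sketched schematically (no real training here; `train_step` is a hypothetical placeholder for one generator/discriminator update at a given sequence length):

```python
MAX_LEN = 32  # target sequence length from the paper

def curriculum(train_step):
    # Curriculum: advance from length-1 sequences up to MAX_LEN.
    for stage in range(1, MAX_LEN + 1):
        # Variable lengths: at stage i, also train on every length k <= i.
        for k in range(1, stage + 1):
            train_step(seq_len=k)

calls = []
curriculum(lambda seq_len: calls.append(seq_len))
print(len(calls))  # total number of train_step calls across all stages
```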
[link]
* Stochastic Depth (SD) is a method for residual networks, which randomly removes/deactivates residual blocks during training.
* As such, it is similar to dropout.
* While dropout removes neurons, SD removes blocks (roughly the layers of a residual network).
* One can argue that dropout randomly changes the width of layers, while SD randomly changes the depth of the network.
* One can argue that using dropout is similar to training an ensemble of networks with different layer widths, while using SD is similar to training an ensemble of networks with different depths.
* Using SD has the following advantages:
* It decreases the effects of vanishing gradients, because on average the network is shallower during training (per batch), thereby increasing the gradient that reaches the early blocks.
* It increases training speed, because on average less convolutions have to be applied (due to blocks being removed).
* It has a regularizing effect, because blocks cannot easily co-adapt any more. (Similar to dropout avoiding co-adaption of neurons.)
* If using an increasing removal probability for later blocks: It spends more training time on the early (and thus most important) blocks than on the later blocks.
### How
* Normal formula for a residual block (test and train):
* `output = ReLU(f(input) + identity(input))`
* `f(x)` is usually one or two convolutions.
* Formula with SD (during training):
* `output = ReLU(b * f(input) + identity(input))`
* `b` is either exactly `1` (block survived, i.e. is not removed) or exactly `0` (block was removed).
* `b` is sampled from a Bernoulli random variable with hyperparameter `p`.
* `p` is the survival probability of a block (i.e. chance to *not* be removed). (Note that this is the opposite of dropout, where higher values lead to more removal.)
* Formula with SD (during test):
* `output = ReLU(p * f(input) + input)`
* `p` is the average probability with which this residual block survives during training, i.e. the hyperparameter of the Bernoulli variable.
* The test formula has to be changed, because the network adapts during training to blocks being missing; activating all blocks at the same time can lead to overly strong signals. This is similar to dropout, where weights also have to be rescaled at test time.
* There are two simple schemas to set `p` per layer:
* Uniform schema: Every block gets the same `p` hyperparameter, i.e. the last block has the same chance of survival as the first block.
* Linear decay schema: Survival probability is higher for early layers and decreases towards the end.
* The formula is `p = 1 - (l/L)(1-q)`.
* `l`: Number of the block for which to set `p`.
* `L`: Total number of blocks.
* `q`: Desired survival probability of the last block (0.5 is a good value).
* For linear decay with `q=0.5` and `L` blocks, on average `(3/4)L` blocks will be trained per minibatch.
* For linear decay with `q=0.5` the average speedup will be about `1/4` (25%). If using `q=0.2` the speedup will be ~40%.
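The train/test rules and the linear decay schema above can be sketched as follows (toy 1-D "activations" and an arbitrary `f` standing in for a real convolution block):

```python
import numpy as np

rng = np.random.default_rng(0)

def sd_block(x, f, p, training):
    if training:
        b = rng.random() < p  # Bernoulli sample: keep (True) or drop (False) the block
        return np.maximum((f(x) if b else 0.0) + x, 0.0)
    # Test time: scale the residual branch by its survival probability.
    return np.maximum(p * f(x) + x, 0.0)

def survival_prob(l, L, q=0.5):
    # Linear decay schema: p = 1 - (l/L)(1-q)
    return 1.0 - (l / L) * (1.0 - q)

L = 54  # number of residual blocks
probs = [survival_prob(l, L) for l in range(1, L + 1)]
# Expected number of surviving blocks per batch = sum of probabilities,
# roughly (3/4) * L for q=0.5.
print(sum(probs))
```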
### Results
* 152 layer networks with SD outperform identical networks without SD on CIFAR-10, CIFAR-100 and SVHN.
* The improvement in test error is quite significant.
* SD seems to have a regularizing effect. Networks with SD are not overfitting where networks without SD already are.
* Even networks with >1000 layers are well trainable with SD.
* The gradients that reach the early blocks of the networks are consistently significantly higher with SD than without SD (i.e. less vanishing gradient).
* The linear decay schema consistently outperforms the uniform schema (in test error). The best value seems to be `q=0.5`, though values between 0.4 and 0.8 all seem to be good. For the uniform schema only 0.8 seems to be good.

*Performance on SVHN with 152 layer networks with SD (blue, bottom) and without SD (red, top).*

*Performance on CIFAR-10 with 1202 layer networks with SD (blue, bottom) and without SD (red, top).*

*Optimal choice of the survival probability `p_L` (in this summary `q`) for the last layer, for the uniform schema (same for all other layers) and the linear decay schema (decreasing towards `p_L`). Linear decay performs consistently better and allows for lower `p_L` values, leading to more speedup.*
--------------------
### Rough chapter-wise notes
* (1) Introduction
* Problems of deep networks:
* Vanishing Gradients: During backpropagation, gradients approach zero due to being repeatedly multiplied with small weights. Possible counter-measures: Careful initialization of weights, "hidden layer supervision" (?), batch normalization.
* Diminishing feature reuse: Equivalent problem to vanishing gradients, but during forward propagation. The results of early layers are repeatedly multiplied with later layers' (randomly initialized) weights. The total result then becomes meaningless noise and doesn't have a clear/strong gradient to fix it.
* Long training time: The time of each forward-backward increases linearly with layer depth. Current 152-layer networks can take weeks to train on ImageNet.
* I.e.: Shallow networks can be trained effectively and fast, but deep networks would be much more expressive.
* During testing we want deep networks, during training we want shallow networks.
* They randomly "drop out" (i.e. remove) complete layers during training (per minibatch), resulting in shallow networks.
* Result: Lower training time *and* lower test error.
* While dropout randomly removes width from the network, stochastic depth randomly removes depth from the networks.
* While dropout can be thought of as training an ensemble of networks with different widths, stochastic depth can be thought of as training an ensemble of networks with different depths.
* Stochastic depth acts as a regularizer, similar to dropout and batch normalization. It allows deeper networks without overfitting (because 1000 layers clearly wasn't enough!).
* (2) Background
* Some previous methods to train deep networks: Greedy layer-wise training, careful initializations, batch normalization, highway connections, residual connections.
* <Standard explanation of residual networks>
* <Standard explanation of dropout>
* Dropout loses effectiveness when combined with batch normalization. Seems to have basically no benefit any more for deep residual networks with batch normalization.
* (3) Deep Networks with Stochastic Depth
* They randomly skip entire layers during training.
* To do that, they use residual connections. They select random layers and use only the identity function for these layers (instead of the full residual block of identity + convolutions + add).
* ResNet architecture: They use standard residual connections. ReLU activations, 2 convolutional layers (conv->BN->ReLU->conv->BN->add->ReLU). They use <= 64 filters per conv layer.
* While the standard formula for residual connections is `output = ReLU(f(input) + identity(input))`, their formula is `output = ReLU(b * f(input) + identity(input))` with `b` being either 0 (inactive/removed layer) or 1 (active layer), i.e. a sample of a Bernoulli random variable.
* The probabilities of the Bernoulli random variables are now hyperparameters, similar to dropout.
* Note that the probability here means the probability of *survival*, i.e. high value = more survivors.
* The probabilities could be set uniformly, e.g. to 0.5 for each variable/layer.
* They can also be set with a linear decay, so that the first layer has a very high probability of survival, while the last layer has a very low probability of survival.
* Linear decay formula: `p = 1 - (l/L)(1-q)` where `l` is the current layer's number, `L` is the total number of layers, `p` is the survival probability of layer `l` and `q` is the desired survival probability of the last layer (e.g. 0.5).
* They argue that linear decay is better, as the early layers extract low-level features and are therefore more important.
* The expected number of surviving layers is simply the sum of the probabilities.
* For linear decay with `q=0.5` and `L=54` (i.e. 54 residual blocks = 110 total layers) the expected number of surviving blocks is roughly `(3/4)L = (3/4)54 = 40`, i.e. on average 14 residual blocks will be removed per training batch.
* With linear decay and `q=0.5` the expected speedup of training is about 25%. `q=0.2` leads to about 40% speedup (while in one test still achieving the test error of the same network without stochastic depth).
* Depending on the `q` setting, they observe significantly lower test errors. They argue that stochastic depth has a regularizing effect (training an ensemble of many networks with different depths).
* Similar to dropout, the forward pass rule during testing must be slightly changed, because the network was trained with missing blocks. The residual formula during test time becomes `output = ReLU(p * f(input) + input)` where `p` is the average probability with which this residual block survives during training.
* (4) Results
* Their model architecture:
* Three chains of 18 residual blocks each, so 3*18 = 54 blocks per model.
* Number of filters per conv. layer: 16 (first chain), 32 (second chain), 64 (third chain)
* Between each block they use average pooling. Then they zero-pad the new dimensions (e.g. from 16 to 32 at the end of the first chain).
* CIFAR-10:
* Trained with SGD (momentum=0.9, dampening=0, lr=0.1 after 1st epoch, 0.01 after epoch 250, 0.001 after epoch 375).
* Weight decay/L2 of 1e-4.
* Batch size 128.
* Augmentation: Horizontal flipping, crops (4px offset).
* They achieve 5.23% error (compared to 6.41% in the original paper about residual networks).
* CIFAR-100:
* Same settings as before.
* 24.58% error with stochastic depth, 27.22% without.
* SVHN:
* They use both the hard and easy sub-datasets of images.
* They preprocess to zero-mean, unit-variance.
* Batch size 128.
* Learning rate is 0.1 (start), 0.01 (after epoch 30), 0.001 (after epoch 35).
* 1.75% error with stochastic depth, 2.01% error without.
* Network without stochastic depth starts to overfit towards the end.
* Stochastic depth with linear decay and `q=0.5` gives ~25% speedup.
* 1202-layer CIFAR-10:
* They trained a 1202-layer deep network on CIFAR-10 (previous tests: 152 layers).
* Without stochastic depth: 6.72% test error.
* With stochastic depth: 4.91% test error.
* (5) Analytic experiments
* Vanishing Gradient:
* They analyzed the gradient that reaches the first layer.
* The gradient with stochastic depth is consistently higher (throughout the epochs) than without stochastic depth.
* The difference is very significant after decreasing the learning rate.
* Hyper-parameter sensitivity:
* They evaluated with test error for different choices of the survival probability `q`.
* Linear decay schema: Values between 0.4 and 0.8 perform best. 0.5 is suggested (nearly the best value, good speedup). Even 0.2 improves the test error (compared to no stochastic depth).
* Uniform schema: 0.8 performs best, other values mostly significantly worse.
* Linear decay performs consistently better than the uniform schema.