[link]
Narodytska and Kasiviswanathan propose a local search-based black-box adversarial attack against deep networks. In particular, they address the problem of k-misclassification, defined as follows: Definition (k-misclassification). A neural network k-misclassifies an image if the true label is not among the k likeliest labels. To this end, they propose a local search algorithm which, in each round, randomly perturbs individual pixels in a local search area around the last perturbation. If a perturbed image satisfies the k-misclassification condition, it is returned as an adversarial perturbation. While the approach is very simple, it is applicable to black-box models where gradients or internal representations are not accessible and only the final score/probability is available. Still, the approach seems to be quite inefficient, taking one or more seconds to generate an adversarial example. Unfortunately, the authors do not discuss qualitative results and do not give examples of multiple adversarial examples (except for the four in Figure 1). https://i.imgur.com/RAjYlaQ.png Figure 1: Examples of adversarial attacks. Top: original image, bottom: perturbed image.
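The search procedure described above can be sketched roughly as follows. This is a simplified illustration only: the function names, parameter values, and the pixel-selection heuristic are my own assumptions, not the paper's exact algorithm.

```python
import numpy as np

def local_search_attack(image, predict_topk, true_label, k=1,
                        perturb_value=1.0, n_pixels=5, radius=2,
                        max_rounds=50, rng=None):
    """Sketch of a local-search black-box attack (hypothetical helper
    names; the paper's scoring/selection heuristics are simplified away).
    predict_topk(image, k) is assumed to return the k likeliest labels."""
    rng = rng or np.random.default_rng(0)
    adv = image.copy()
    h, w = adv.shape[:2]
    # start the search around a random pixel
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    for _ in range(max_rounds):
        # perturb a few random pixels inside the local search area
        for _ in range(n_pixels):
            y = int(np.clip(cy + rng.integers(-radius, radius + 1), 0, h - 1))
            x = int(np.clip(cx + rng.integers(-radius, radius + 1), 0, w - 1))
            adv[y, x] = perturb_value
            cy, cx = y, x  # recentre the search area on the last perturbation
        # k-misclassification: the true label must leave the top-k labels
        if true_label not in predict_topk(adv, k):
            return adv
    return None  # attack failed within the round budget
```

Only the black-box interface `predict_topk` is needed, which is the point of the method: no gradients or internal representations are queried.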
[link]
### Contribution

The author conducts five experiments on EC2 to assess the impact of software-defined virtual networking (SDVN) with HTTP on composite container applications. Compared to previous container performance studies, it contributes new insight into the overlay networking aspect specifically for VM-hosted containers. Evidently, the SDVN causes a major performance loss, whereas the container itself as well as the encryption cause minor (but still not negligible) losses. The results indicate that further practical work on container networking tools and stacks is needed for performance-critical distributed applications.

### Strong points

The methodology of measuring the performance against a baseline performance result is appropriate. The author provides the benchmark tooling (ppbench) and reference results (in dockerised form) to enable recomputable research.

### Weak points

The title mentions microservices and the abstract promises design recommendations for microservice architectures. Yet the paper only discusses containers, which are a potential implementation technology but neither necessary for nor guaranteed to be microservices. Reducing the paper's scope to just containers would be fair.

The introduction contains an unnecessarily redundant mention of Kubernetes, CoreOS, Mesos and reference [9] around the column wrap. The notation of SDN vs. SDVN is inconsistent between text and images; since SDN is a wide area of research, the consistent use of SDVN is recommended.

Fig. 3b is not clearly labelled: for the resulting transfer losses, 100% means no loss, which is confusing. The y axis should presumably be inverted so that losses show highest for SDVN at about 70%. The performance breakdown around 300 kB messages in Fig. 2 is not sufficiently explained. Is it a repeating phenomenon which might be related to packet scheduling? The "just Docker" networking configuration is not explained: does it run in host or bridge mode? Which version of Docker was used? The size and time distribution of the 6 million HTTP requests should also be explained in greater detail to see how much randomness was involved.

### Further comments

The work assumes that containers are always hosted in virtual machines, while bare-metal container hosting in the form of CaaS becomes increasingly available (Triton, CoreOS OnMetal, etc.). The results by Felter et al. are mentioned but not put into perspective. A comparison of how the networking is affected by VM/BM hosting would be a welcome addition, although AWS would probably not be a likely environment due to ECS running atop EC2.
[link]
**Idea:** With the growing use of visual explanation systems for machine learning models, such as saliency maps, there needs to be a standardized method of verifying whether a saliency method correctly describes the underlying ML model.

**Solution:** In this paper two sanity checks are proposed to verify the accuracy and faithfulness of a saliency method:

* *Model parameter randomization test:* In this sanity check, the outputs of a saliency method on a trained model are compared to those of the same method on an untrained, randomly parameterized model. If these images are similar/identical, then the saliency method does not correctly describe the model. In the course of this experiment it is found that certain methods, such as Guided BackProp, are constant in their explanations despite alterations to the model.
* *Data randomization test:* This test explores the relationship of saliency methods to the data and their associated labels. Here, the labels of the training data are randomized, so there should be no definite pattern describing the model (since the model is as good as randomly guessing an output label). If there is a definite pattern, the saliency method is independent of the underlying model/training data labels. In this test as well, Guided BackProp did not fare well, implying this saliency method acts more like an edge detector than an ML explainer.

Thus this paper makes a valid argument toward having standardized tests that an interpretation method must satisfy to be deemed accurate or faithful.
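As a rough illustration of the first check, here is a minimal numpy sketch. It uses a plain linear "model" and gradient saliency as stand-ins; the paper evaluates real networks and many saliency methods, so everything here is my own simplification.

```python
import numpy as np

def saliency(W, x):
    """Gradient saliency for a linear 'model' with scores W @ x: the
    gradient of the top class score w.r.t. the input is just the
    corresponding row of W (a stand-in for a real saliency method)."""
    top = np.argmax(W @ x)
    return np.abs(W[top])

def parameter_randomization_test(W, x, rng):
    """Sketch of the model-parameter randomization check: compute the
    saliency map for the trained weights and for freshly randomized
    weights, then compare by rank correlation. High similarity flags a
    saliency method that does not actually depend on the learned model."""
    trained_map = saliency(W, x)
    W_random = rng.normal(size=W.shape)     # re-initialized 'model'
    random_map = saliency(W_random, x)
    # Spearman-style rank correlation between the two maps
    r1 = np.argsort(np.argsort(trained_map)).astype(float)
    r2 = np.argsort(np.argsort(random_map)).astype(float)
    return float(np.corrcoef(r1, r2)[0, 1])
```

A faithful method should give a correlation near zero here; a method that behaves like an edge detector would score high regardless of the weights.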
[link]
* The original R-CNN had three major disadvantages:
  1. Two-stage training pipeline: Instead of only training a CNN, one had to first train a CNN and then multiple SVMs.
2. Expensive training: Training was slow and required lots of disk space (feature vectors needed to be written to disk for all region proposals (2000 per image) before training the SVMs).
3. Slow test: Each region proposal had to be handled independently.
* Fast R-CNN is an improved version of R-CNN and tackles the mentioned problems.
* It no longer uses SVMs, only CNNs (single-stage).
* It does one single feature extraction per image instead of per region, making it much faster (9x faster at training, 213x faster at test).
* It is more accurate than R-CNN.
### How
* The basic architecture, training and testing methods are mostly copied from R-CNN.
* For each image at test time they do:
* They generate region proposals via selective search.
* They feed the image once through the convolutional layers of a pre-trained network, usually VGG16.
* For each region proposal they extract the respective region from the features generated by the network.
* The regions can have different sizes, but the following steps need fixed size vectors. So each region is downscaled via max-pooling so that it has a size of 7x7 (so apparently they ignore regions of sizes below 7x7...?).
* This is called Region of Interest Pooling (RoI-Pooling).
    * During the backward pass, partial derivatives are routed to the maximum value (as usual in max pooling). These derivative values are summed up over different regions (in the same image).
* They reshape the 7x7 regions to vectors of length `F*7*7`, where `F` was the number of filters in the last convolutional layer.
* They feed these vectors through another network which predicts:
1. The class of the region (including background class).
2. Top left x-coordinate, top left y-coordinate, log height and log width of the bounding box (i.e. it fine-tunes the region proposal's bounding box). These values are predicted once for every class (so `K*4` values).
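A minimal numpy sketch of the RoI max-pooling step described above (the names and the cell-boundary scheme are my own simplifications, not the paper's implementation):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    """Minimal RoI max-pooling sketch (single image, one RoI).
    feature_map: (F, H, W) conv features; roi: (y0, x0, y1, x1) in
    feature-map coordinates. The RoI is split into an out_size x out_size
    grid and max-pooled per cell, giving a fixed-size output regardless
    of the region's shape."""
    y0, x0, y1, x1 = roi
    region = feature_map[:, y0:y1, x0:x1]
    F, h, w = region.shape
    out = np.zeros((F, out_size, out_size))
    ys = np.linspace(0, h, out_size + 1).astype(int)  # grid cell boundaries
    xs = np.linspace(0, w, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            if cell.size:  # regions smaller than 7x7 leave some cells empty
                out[:, i, j] = cell.reshape(F, -1).max(axis=1)
    return out.reshape(-1)  # flatten to the F*7*7 vector fed to the FC layers
```

Note that with this grid scheme, regions smaller than 7x7 simply produce empty (zero) cells rather than being discarded.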
* Architecture as image:
* Sampling for training
* Efficiency
* If batch size is `B` it is inefficient to sample regions proposals from `B` images as each image will require a full forward pass through the base network (e.g. VGG16).
* It is much more efficient to use few images to share most of the computation between region proposals.
* They use two images per batch (each 64 region proposals) during training.
* This technique introduces correlations between examples in batches, but they did not observe any problems from that.
* They call this technique "hierarchical sampling" (first images, then region proposals).
* IoUs
* Positive examples for specific classes during training are region proposals that have an IoU with ground truth bounding boxes of `>=0.5`.
* Examples for background region proposals during training have IoUs with any ground truth box in the interval `(0.1, 0.5]`.
* Not picking IoUs below 0.1 is similar to hard negative mining.
* They use 25% positive examples, 75% negative/background examples per batch.
* They apply horizontal flipping as data augmentation, nothing else.
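The IoU-based labelling rule above can be illustrated with a small sketch (helper names are my own):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union for boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_proposal(proposal, gt_boxes):
    """Fast R-CNN sampling rule: best IoU with any ground-truth box
    >= 0.5 -> positive; in (0.1, 0.5] -> background; below 0.1 the
    proposal is not sampled at all (akin to hard negative mining)."""
    best = max(iou(proposal, gt) for gt in gt_boxes)
    if best >= 0.5:
        return "positive"
    if best > 0.1:
        return "background"
    return "discard"
```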
* Outputs
  * For their class predictions they use a simple softmax with negative log likelihood.
* For their bounding box regression they use a smooth L1 loss (similar to mean absolute error, but switches to mean squared error for very low values).
* Smooth L1 loss is less sensitive to outliers and less likely to suffer from exploding gradients.
* The smooth L1 loss is only active for positive examples (not background examples). (Not active means that it is zero.)
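The smooth L1 loss has a simple closed form; this is the standard formulation (the paper additionally weights it and sums it over the four box-regression targets):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss used for box regression: quadratic for |x| < 1
    (like squared error), linear beyond (like absolute error), which
    keeps gradients bounded for outliers."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)
```

The two branches meet at |x| = 1 with matching value and slope, so the loss is smooth at the switch-over point.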
* Training schedule
  * They use SGD.
* They train 30k batches with learning rate 0.001, then 0.0001 for another 10k batches. (On Pascal VOC, they use more batches on larger datasets.)
* They use twice the learning rate for the biases.
* They use momentum of 0.9.
* They use parameter decay of 0.0005.
* Truncated SVD
* The final network for class prediction and bounding box regression has to be applied to every region proposal.
* It contains one large fully connected hidden layer and one fully connected output layer (`K+1` classes plus `K*4` regression values).
* For 2000 proposals that becomes slow.
* So they compress the layers after training to less weights via truncated SVD.
  * A weight matrix W (`u x v`) is approximated via `W ≈ U Sigma V^T`, where:
* U (`u x t`) are the first `t` left-singular vectors of W.
* Sigma is a `t x t` diagonal matrix of the top `t` singular values.
* V (`v x t`) are the first `t` right-singular vectors of W.
* W is then replaced by two layers: One contains `Sigma V^T` as weights (no biases), the other contains `U` as weights (with original biases).
* Parameter count goes down to `t(u+v)` from `uv`.
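A small numpy sketch of this compression step (the function name is my own; Fast R-CNN applies this to the fully connected layers after training):

```python
import numpy as np

def compress_fc_layer(W, t):
    """Truncated-SVD compression of a fully connected weight matrix W
    (u x v) into two smaller layers: W ~ U_t Sigma_t V_t^T.
    Returns the two replacement weight matrices."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    layer1 = np.diag(s[:t]) @ Vt[:t]   # (t x v), applied first, no biases
    layer2 = U[:, :t]                  # (u x t), keeps the original biases
    return layer1, layer2
```

For small `t` the two layers hold `t(u+v)` parameters instead of `uv`, at the cost of an approximation error controlled by the discarded singular values.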
### Results
* They try three base models:
* AlexNet (Small, S)
* VGG-CNN-M-1024 (Medium, M)
* VGG16 (Large, L)
* On VGG16 and Pascal VOC 2007, compared to original R-CNN:
* Training time down to 9.5h from 84h (8.8x faster).
* Test rate *with SVD* (1024 singular values) improves from 47 seconds per image to 0.22 seconds per image (213x faster).
* Test rate *without SVD* improves similarly to 0.32 seconds per image.
* mAP improves from 66.0% to 66.6% (66.9% without SVD).
* Per class accuracy results:
* Fixing the weights of VGG16's convolutional layers and only fine-tuning the fully connected layers (those are applied to each region proposal), decreases the accuracy to 61.4%.
* This decrease in accuracy is most significant for the later convolutional layers, but marginal for the first layers.
  * Therefore they only train the convolutional layers starting with `conv3_1` (9 out of 13 layers), which speeds up training.
* Multi-task training
* Training models on classification and bounding box regression instead of only on classification improves the mAP (from 62.6% to 66.9%).
  * Doing this in one network instead of two separate models (one for classification, one for bounding box regression) increases mAP by roughly 2-3 percentage points.
* They did not find a significant benefit of training the model on multiple scales (e.g. same image sometimes at 400x400, sometimes at 600x600, sometimes at 800x800 etc.).
* Note that their raw CNN (everything before RoI-Pooling) is fully convolutional, so they can feed the images at any scale through the network.
* Increasing the amount of training data seemed to improve mAP a bit, but not as much as one might hope for.
* Using a softmax loss instead of an SVM seemed to marginally increase mAP (0-1 percentage points).
* Using more region proposals from selective search does not simply increase mAP. Instead it can lead to higher recall, but lower precision.
* Using densely sampled region proposals (as in sliding window) significantly reduces mAP (from 59.2% to 52.9%). If SVMs instead of softmaxes are used, the results are even worse (49.3%).
[link]
This work attempts to use meta-learning to learn an update rule for a reinforcement learning agent. In this context, "learning an update rule" means learning the parameters of an LSTM module that takes in information about the agent's recent reward and current model, and outputs two values - a scalar and a vector - that are used to update the agent's model. I'm not going to go too deep into meta-learning here, but, at a high level, meta-learning methods optimize parameters governing an agent's learning and, over the course of many training processes over many environments, optimize those parameters such that the reward over the full lifetime of training is higher.

To be more concrete, the agent in a given environment learns two things:

- A policy, that is, a distribution over predicted actions given a state.
- A "prediction vector". This fits in the conceptual slot where most RL algorithms would learn some kind of value or Q function, to predict how much future reward can be expected from a given state. However, in this context, this vector is *very explicitly* not a value function, but is just a vector that the agent-model generates and updates. The notion here is that maybe our human-designed construction of a value function isn't actually the best quantity for an agent to be predicting, and, if we meta-learn, we might find something more optimal.
I'm a little bit confused about the structure of this vector, but I think it's *intended* to be a categorical 1-of-m prediction. At each step, after acting in the environment, the agent passes to an LSTM:

- The reward at the step
- A binary indicating whether the trajectory is done
- The discount factor
- The probability of the action that was taken from state t
- The prediction vector evaluated at state t
- The prediction vector evaluated at state t+1

Given that as input (and given access to its past history from earlier in the training process), the LSTM predicts two things:

- A scalar, pi-hat
- A prediction vector, y-hat

These two quantities are used to update the existing policy and prediction model according to the rule below.

https://i.imgur.com/xx1W9SU.png

Conceptually, the scalar governs whether to increase or decrease the probability assigned to the taken action under the policy, and y-hat serves as a target for the prediction vector to be pulled towards. An important thing to note about the LSTM structure is that none of the quantities it takes as input are dependent on the action or observation space of the environment, so, once it is learned, it can (hopefully) generalize to new environments. Given this, the basic meta-learning objective falls out fairly easily - optimize the parameters of the LSTM to maximize lifetime reward, taken in expectation over training runs. However, things don't turn out to be quite that easy. The simplest version of this meta-learning objective is wildly unstable and difficult to optimize, and the authors had to add a number of training hacks in order to get something that would work. (It really is dramatic, by the way, how absolutely essential these are to training something that actually learns a prediction vector.)
These include:

- An entropy bonus, pushing the meta-learned parameters to learn policies and prediction vectors that have higher entropy (which is to say: are less deterministic)
- An L2 penalty on both pi-hat and y-hat
- A removal of the softmax that had originally been taken over the k-dimensional prediction vector categorical, switching that target from a KL divergence to a straight mean squared error loss. As far as I can tell, this makes the prediction vector no longer actually a 1-of-k categorical, but instead just a continuous vector, with each value between 0 and 1, which makes it make more sense to think of it as k separate binaries? This I was definitely confused about in the paper overall.

https://i.imgur.com/EL8R1yd.png

With the help of all of these regularizers, the authors were able to get something that trained, and that appeared to perform comparably to or better than A2C - the human-designed baseline - across the simple grid-worlds it was being trained in. However, the two most interesting aspects of the evaluation were:

1. The authors showed that, given the values of the prediction vector, you could predict the true value of a state quite well, suggesting that the vector captures most of the information about which states are high value. Beyond that, they found that the meta-learned vector could be used to predict values calculated with discount rates different than the one used in the meta-learned training, which the hand-engineered alternative, TD-lambda, wasn't able to do (it could only predict values well at the same discount rate used to calculate it). This suggests that the network really is learning some more robust notion of value that isn't tied to a specific discount rate.
2. They also found that they were able to deploy the LSTM update rule learned on grid worlds to Atari games, and have it perform reasonably well - beating A2C in a few cases, though certainly not all.
This is fairly impressive, since it's an example of a rule learned on a different, much simpler set of environments generalizing to more complex ones, and suggests that there's something intrinsic to reinforcement learning that it's capturing.
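For reference, the per-step input to the learned update rule, as listed in the summary above, depends only on scalars and the two prediction vectors - nothing tied to a particular action or observation space - which is exactly what makes the grid-world-to-Atari transfer possible. A tiny sketch (the function name and ordering are my own assumptions):

```python
import numpy as np

def lstm_input(reward, done, discount, action_prob, y_t, y_tp1):
    """Assemble the environment-agnostic per-step input to the
    meta-learned LSTM: reward, done flag, discount factor, probability
    of the taken action, and the prediction vector at states t and t+1."""
    return np.concatenate([
        [reward, float(done), discount, action_prob],
        y_t,    # prediction vector at state t
        y_tp1,  # prediction vector at state t+1
    ])
```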