Welcome to ShortScience.org!
[link]
The fundamental unit of Reinforcement Learning is the reward function, with a core assumption of the area being that actions induce rewards, with some actions being higher reward than others. But reward functions are just artificial objects we design to induce certain behaviors; the universe doesn’t hand out “true” rewards we can build off of. Inverse Reinforcement Learning as a field is rooted in the difficulty of designing reward functions, and aspires to infer rewards from observed human behavior instead of requiring a human to hard-code a reward function. The rough idea is that if we imagine a human is (even if they don’t know it) operating so as to optimize some set of rewards, we might be able to infer that set of underlying incentives from their actions, and, once we’ve extracted a reward function, use that to train new agents. This is a mathematically quite tricky problem, for the basic reason that a human’s actions are often consistent with a wide range of possible underlying “policy” parameters, and also that a given human policy could be optimal for a wide range of underlying reward functions.

This paper proposes using an adversarial frame on the problem, where you learn a reward function by trying to make reward higher for the human demonstrations you observe, relative to the actions the agent itself is taking. This has the effect of trying to learn an agent that can imitate human actions. However, it specifically designs its model structure to allow it to go beyond just imitation. The problem with learning a purely imitative policy is that it’s hard for the model to separate out which actions the human is taking because they are intrinsically high reward (like, perhaps, eating candy), versus actions which are only valuable in a particular environment (perhaps opening a drawer if you’re in a room where that’s where the candy is kept). If you didn’t realize that the real reward was contained in the candy, you might keep opening drawers, even if you’re in a room where the candy is lying out on the table.

In mathematical terms, separating out intrinsic vs instrumental (also known as "shaped") rewards is a matter of making sure to separate out the reward associated with a given state from the value of taking a given action at that state, because the value of your action is only borne out based on assumptions about how states transition between each other, which is a function of the specific state-to-state dynamics of the environment you’re in. The authors do this by defining a g(s) function and an h(s) function. They then define their overall reward of an action as (g(s) + h(s’)) - h(s), where s’ is the new state you end up in if you take an action.

https://i.imgur.com/3ENPFVk.png

This follows the natural form of a Bellman update, where your future value at time T should be equal to your future value at time T+1 plus the reward you achieve at time T.

https://i.imgur.com/Sd9qHCf.png

By adopting this structure, and learning a separate neural network to capture the h(s) function representing the value from here to the end, the authors make it the case that the g(s) function is a purer representation of the reward at a state, regardless of what we expect to happen in the future. This lets them use the learned reward to bootstrap good behavior in new environments, even in contexts where a learned value function would be invalid because of the assumptions of instrumental value. They compare their method to the baseline of GAIL, which is a purely imitation-learning approach, and show that theirs is more able to transfer to environments with similar states but different state-to-state dynamics.
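The decomposition into g and h can be illustrated with a small tabular toy example. This is only a sketch: the state names, the numbers in `g` and `h`, the function names, and the `gamma` parameter are all invented for illustration, and `gamma` defaults to 1 to match the undiscounted expression as written in this summary.

```python
# Toy, tabular illustration of f(s, s') = g(s) + gamma * h(s') - h(s).
# All states and numbers below are invented for this sketch.

g = {"hallway": 0.0, "drawer_open": 0.0, "candy": 1.0}  # state-only reward
h = {"hallway": 0.5, "drawer_open": 0.9, "candy": 0.0}  # value-like shaping term


def shaped_reward(s, s_next, gamma=1.0):
    """Combined training signal for one transition: g(s) + gamma*h(s') - h(s)."""
    return g[s] + gamma * h[s_next] - h[s]


def trajectory_return(states, gamma=1.0):
    """Sum of shaped rewards over consecutive state pairs of a trajectory."""
    return sum(shaped_reward(s, s2, gamma) for s, s2 in zip(states, states[1:]))


trajectory = ["hallway", "drawer_open", "candy"]
total = trajectory_return(trajectory)

# With gamma = 1 the h terms telescope: only h(last) - h(first) survives,
# so h absorbs the dynamics-dependent shaping while g carries the reward.
telescoped = (
    sum(g[s] for s in trajectory[:-1]) + h[trajectory[-1]] - h[trajectory[0]]
)
assert abs(total - telescoped) < 1e-9
```

Because the h contributions cancel along a trajectory, changing `h` reshapes per-step signals without changing which g-rewards are collected, which is the intuition behind treating g as the transferable part.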
[link]
Reinforcement learning is notoriously sample-inefficient, and one reason is that agents learn about the world entirely through experience, and it takes lots of experience to learn useful things. One solution you might imagine to this problem is the one humans by and large use when encountering new environments: instead of learning everything through first-person exploration, acquiring much of your knowledge by hearing or reading condensed descriptions of the world that can help you take more sensible actions within it. This paper and others like it have the goal of learning RL agents that can take in information about the world in the form of text, and use that information to solve a task. This paper is not the first to propose a solution in this general domain, but it claims to be unique by dint of having both the dynamics of the environment and the goal of the agent change on a per-environment basis, and be described in text.

The precise details of the architecture are very much the result of particular engineering done to solve this problem, and as such, it's a bit hard to abstract away generalizable principles that this paper showed, other than the proof-of-concept fact that tasks of the form they describe (where an agent has to learn which objects can kill which enemies, and pursue the goal of killing certain ones) can be solved. Arguably the most central design principle of the paper is aggressive and repeated use of different forms of conditioning architectures, to fully mix the information contained in the textual and visual data streams. This was done in two main ways:

- Multiple different attention summaries were created, using the document embedding as input, but with queries conditioned on different things (the task, the inventory, a summarized form of the visual features). This is a natural but clever extension of the fact that attention is an easy way to generate conditional aggregated versions of some input. https://i.imgur.com/xIsRu2M.png
- The architecture uses FiLM (Feature-wise Linear Modulation), which is essentially a many-generations-generalized version of conditional batch normalization in which the gamma and beta used to globally scale and shift a feature vector are learned, taking some other data as input. The canonical version of this would be taking in text input, summarizing it into a vector, and then using that vector as input to an MLP that generates gamma and beta parameters for all of the convolutional layers in a vision system. The interesting innovation of this paper is essentially to argue that this conditioning operation is quite neutral, and that there's no essential way in which the vision input is the "true" data and the text simply the auxiliary conditioning data: it's more accurate to say that each form of data should condition the processing of the other one. And so they use Bidirectional FiLM, which does just that, conditioning vision features on text summaries, but also conditioning text features on vision summaries. https://i.imgur.com/qFaH1k3.png

The model overall is composed of multiple layers that perform both this mixing FiLM operation and visually-conditioned attention. The authors did show, not super surprisingly, that these additional forms of conditioning added performance value to the model relative to the cases where they were ablated.
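The FiLM operation described above, a per-channel scale (called gamma here) and shift (called beta here) produced from some conditioning input, can be sketched in plain Python. This is a minimal illustration under invented assumptions, not the paper's architecture: the helper names, the identity weights, the mean-pooling summary, and the tiny feature shapes are all hypothetical, and a real model would learn the weights.

```python
# Minimal FiLM sketch: a conditioning vector produces a per-channel scale
# (gamma) and shift (beta) that modulate a feature map.
# Weights and shapes are invented for illustration only.


def linear(weights, bias, x):
    """y = W x + b for a list-of-lists weight matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply gamma[c] * value + beta[c] to every value in channel c."""
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [[g * v + b for v in channel] for channel, g, b in zip(features, gamma, beta)]


def mean_summary(features):
    """Summarize each channel by its mean, to serve as a conditioning vector."""
    return [sum(channel) / len(channel) for channel in features]


# Two "visual" channels with three positions, two "text" channels with two tokens.
vision = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
text = [[0.5, 1.5], [2.0, 2.0]]

identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weights
zeros = [0.0, 0.0]

# Bidirectional use: the text summary modulates vision features,
# and the vision summary modulates text features.
vision_out = film(vision, mean_summary(text), identity, zeros, identity, zeros)
text_out = film(text, mean_summary(vision), identity, zeros, identity, zeros)
```

Both directions here read from the original, unmodulated inputs; how the two streams are ordered and stacked across layers is a design choice this sketch does not try to reproduce.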
[link]
* The authors train a variant of AlexNet that has significantly fewer parameters than the original network, while keeping the network's accuracy stable.
* Advantages of this:
* More efficient distributed training, because fewer parameters have to be transferred.
* More efficient transfer via the internet, because the model's file size is smaller.
* Possibly less memory demand in production, because fewer parameters have to be kept in memory.
### How
* They define a Fire Module. A Fire Module consists of:
* Squeeze Module: A 1x1 convolution that reduces the number of channels (e.g. from 128x32x32 to 64x32x32).
* Expand Module: A 1x1 convolution and a 3x3 convolution, both applied to the output of the Squeeze Module. Their results are concatenated.
* Using many 1x1 convolutions is advantageous, because they need fewer parameters than 3x3s.
* They use ReLUs, only convolutions (no fully connected layers) and Dropout (50%, before the last convolution).
* They use late maxpooling. They argue that applying pooling late - rather than early - improves accuracy while not needing more parameters.
* They try residual connections:
* One network without any residual connections (performed the worst).
* One network with residual connections based on identity functions, but only between layers of same dimensionality (performed the best).
* One network with residual connections based on identity functions and other residual connections with 1x1 convs (where dimensionality changed) (performance between the other two).
* They use pruning from Deep Compression to reduce the parameters further. Pruning simply takes the 50% of a layer's parameters with the smallest absolute values and sets them to zero, which yields a sparse matrix.
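To make the parameter savings of a Fire Module concrete, here is a small counting sketch. It assumes the usual weight count for a conv layer (number of filters times input channels times kernel height times kernel width, biases ignored); the channel sizes and function names are hypothetical and chosen only for illustration.

```python
def conv_params(n_filters, in_channels, k_h, k_w):
    """Weights of a conv layer (biases ignored): N * C * kH * kW."""
    return n_filters * in_channels * k_h * k_w


def fire_module_params(in_channels, squeeze, expand_1x1, expand_3x3):
    """Squeeze 1x1 conv, then parallel 1x1 and 3x3 expand convs on its output."""
    return (
        conv_params(squeeze, in_channels, 1, 1)
        + conv_params(expand_1x1, squeeze, 1, 1)
        + conv_params(expand_3x3, squeeze, 3, 3)
    )


# Hypothetical sizes: 128 input channels squeezed to 16,
# then expanded to 64 + 64 = 128 output channels.
fire = fire_module_params(128, squeeze=16, expand_1x1=64, expand_3x3=64)

# A plain 3x3 conv with the same input and output width, for comparison.
plain = conv_params(128, 128, 3, 3)

print(fire, plain, plain / fire)  # 12288 147456 12.0
```

With these made-up sizes the Fire Module needs roughly a twelfth of the weights of the plain 3x3 layer; the exact ratio depends entirely on the chosen squeeze and expand widths.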
### Results
* 50x parameter reduction of AlexNet (1.2M parameters before pruning, 0.4M after pruning).
* 510x file size reduction of AlexNet (from 250 MB to 0.47 MB) when combined with Deep Compression.
* Top-1 accuracy remained stable.
* Pruning apparently can be used safely, even after the network parameters have already been reduced significantly.
* While pruning was generally safe, they found that two of their later layers reacted quite sensitively to it. Adding parameters to these (instead of removing them) actually significantly improved accuracy.
* Generally they found 1x1 convs to react more sensitively to pruning than 3x3s. Therefore they focused pruning on 3x3 convs.
* First pruning a network, then re-adding the pruned weights (initialized with 0s) and then retraining for some time significantly improved accuracy.
* The network was rather resilient to significant channel reduction in the Squeeze Modules. Reducing to 25-50% of the original channels (e.g. 128x32x32 to 64x32x32) seemed to be a good choice.
* The network was rather resilient to removing 3x3 convs and replacing them with 1x1 convs. A ratio of 2:1 to 1:1 (1x1 to 3x3) seemed to produce good results while mostly keeping the accuracy.
* Adding some residual connections between the Fire Modules improved the accuracy.
* Adding residual connections with identity functions *and also* residual connections with 1x1 convs (where dimensionality changed) improved the accuracy, but not as much as using *only* residual connections with identity functions (i.e. it's better to keep some modules without residual connections).
--------------------
### Rough chapter-wise notes
* (1) Introduction and Motivation
* Advantages from having fewer parameters:
* More efficient distributed training, because less data (parameters) has to be transferred.
* Less data to transfer to clients, e.g. when a model used by some app is updated.
* FPGAs often have hardly any memory, i.e. a model has to be small to be executed.
* Target here: Find a CNN architecture with fewer parameters than an existing one but comparable accuracy.
* (2) Related Work
* (2.1) Model Compression
* SVD-method: Just apply SVD to the parameters of an existing model.
* Network Pruning: Replace parameters below threshold with zeros (-> sparse matrix), then retrain a bit.
* Add quantization and Huffman encoding to network pruning = Deep Compression.
* (2.2) CNN Microarchitecture
* The term "CNN Microarchitecture" refers to the "organization and dimensions of the individual modules" (so an Inception module would have a complex CNN microarchitecture).
* (2.3) CNN Macroarchitecture
* CNN Macroarchitecture = "big picture" / organization of many modules in a network / general characteristics of the network, like depth
* Adding connections between modules can help (e.g. residual networks)
* (2.4) Neural Network Design Space Exploration
* Approaches for Design Space Exploration (DSE):
* Bayesian Optimization, Simulated Annealing, Randomized Search, Genetic Algorithms
* (3) SqueezeNet: preserving accuracy with few parameters
* (3.1) Architectural Design Strategies
* A conv layer with N filters applied to CxHxW input (e.g. 3x128x128 for a possible first layer) with kernel size kHxkW (e.g. 3x3) has `N*C*kH*kW` parameters.
* So one way to reduce the parameters is to decrease kH and kW, e.g. from 3x3 to 1x1 (reduces parameters by a factor of 9).
* A second way is to reduce the number of channels (C), e.g. by using 1x1 convs before the 3x3 ones.
* They think that accuracy can be improved by performing downsampling later in the network (if parameter count is kept constant).
* (3.2) The Fire Module
* The Fire Module has two components:
* Squeeze Module:
* One layer of 1x1 convs
* Expand Module:
* Concat the results of:
* One layer of 1x1 convs
* One layer of 3x3 convs
* The Squeeze Module decreases the number of input channels significantly.
* The Expand Module then increases the number of channels again.
* (3.3) The SqueezeNet architecture
* One standalone conv, then several fire modules, then a standalone conv, then global average pooling, then softmax.
* Three late max pooling layers.
* Gradual increase of filter numbers.
* (3.3.1) Other SqueezeNet details
* ReLU activations
* Dropout before the last conv layer.
* No linear layers.
* (4) Evaluation of SqueezeNet
* Results of competing methods:
* SVD: 5x compression, 56% top-1 accuracy
* Pruning: 9x compression, 57.2% top-1 accuracy
* Deep Compression: 35x compression, ~57% top-1 accuracy
* SqueezeNet: 50x compression, ~57% top-1 accuracy
* SqueezeNet combines low parameter counts with Deep Compression.
* The accuracy does not go down because of that, i.e. apparently Deep Compression can even be applied to small models without giving up on performance.
* (5) CNN Microarchitecture Design Space Exploration
* (5.1) CNN Microarchitecture metaparameters
* They define the metaparameters of the Fire Modules and test various values for them.
* (5.2) Squeeze Ratio
* In a Fire Module there is first a Squeeze Module and then an Expand Module. The Squeeze Module decreases the number of input channels to which 1x1 and 3x3 both are applied (at the same time).
* They analyzed how far you can go down with the Squeeze Module by training multiple networks and calculating the top-5 accuracy for each of them.
* The accuracy by Squeeze Ratio (percentage of input channels kept in 1x1 squeeze, i.e. 50% = reduced by half, e.g. from 128 to 64):
* 12%: ~80% top-5 accuracy
* 25%: ~82% top-5 accuracy
* 50%: ~85% top-5 accuracy
* 75%: ~86% top-5 accuracy
* 100%: ~86% top-5 accuracy
* (5.3) Trading off 1x1 and 3x3 filters
* Similar to the Squeeze Ratio, they analyze the optimal ratio of 1x1 filters to 3x3 filters.
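* The percentage is the share of 3x3 filters among the expand filters, e.g. 50% would mean that half of the expand filters in each Fire Module are 3x3 filters and the other half are 1x1 filters.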
* Results:
* 01%: ~76% top-5 accuracy
* 12%: ~80% top-5 accuracy
* 25%: ~82% top-5 accuracy
* 50%: ~85% top-5 accuracy
* 75%: ~85% top-5 accuracy
* 99%: ~85% top-5 accuracy
* (6) CNN Macroarchitecture Design Space Exploration
* They compare the following networks:
* (1) Without residual connections
* (2) With residual connections between modules of same dimensionality
* (3) With residual connections between all modules (except pooling layers) using 1x1 convs (instead of identity functions) where needed
* Adding residual connections (2) improved top-1 accuracy from 57.5% to 60.4% without any new parameters.
* Adding complex residual connections (3) worsened top-1 accuracy again to 58.8%, while adding new parameters.
* (7) Model Compression Design Space Exploration
* (7.1) Sensitivity Analysis: Where to Prune or Add parameters
* They went through all layers (including each one in the Fire Modules).
* In each layer they set the 50% smallest weights to zero (pruning) and measured the effect on the top-5 accuracy.
* It turns out that doing that has basically no influence on the top-5 accuracy in most layers.
* Two layers towards the end however had significant influence (accuracy went down by 5-10%).
* Adding parameters to these layers improved top-1 accuracy from 57.5% to 59.5%.
* Generally they found 1x1 layers to be more sensitive than 3x3 layers so they pruned them less aggressively.
* (7.2) Improving Accuracy by Densifying Sparse Models
* They found that first pruning a model and then retraining it again (initializing the pruned weights to 0) leads to higher accuracy.
* They could improve top-1 accuracy by 4.3% in this way.
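The pruning step from (7.1) can be sketched as magnitude pruning on a flat list of weights. This is an illustrative toy, not the Deep Compression implementation: the function name, the example weights, and the choice to break ties by index order are assumptions.

```python
def prune_smallest(weights, fraction=0.5):
    """Zero out the given fraction of weights with the smallest absolute value.

    Returns (pruned_weights, mask), where mask[i] is False for pruned entries.
    """
    n_prune = int(len(weights) * fraction)
    # Indices sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    mask = [i not in pruned_idx for i in range(len(weights))]
    pruned = [w if keep else 0.0 for w, keep in zip(weights, mask)]
    return pruned, mask


layer = [0.8, -0.05, 0.3, -0.9, 0.01, 0.2]  # made-up weights of one layer
sparse, mask = prune_smallest(layer)

print(sparse)  # [0.8, 0.0, 0.3, -0.9, 0.0, 0.0]
```

For sparse retraining, the mask would be used to keep pruned weights at zero. For the densifying step in (7.2), the mask is dropped and the zeroed weights are simply allowed to train again, starting from 0.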
[link]
## Introduction
* Introduces techniques to learn word vectors from large text datasets.
* Can be used to find similar words (semantically, syntactically, etc).
* [Link to the paper](http://arxiv.org/pdf/1301.3781.pdf)
* [Link to open source implementation](https://code.google.com/archive/p/word2vec/)
## Model Architecture
* Computational complexity is defined in terms of the number of parameters accessed during model training.
* Proportional to $E*T*Q$
* *E* - Number of training epochs
* *T* - Number of words in training set
* *Q* - depends on the model
### Feedforward Neural Net Language Model (NNLM)
* Probabilistic model with input, projection, hidden and output layer.
* Input layer encodes the N previous words using 1-of-V encoding (V is vocabulary size).
* Input layer projected to projection layer P with dimensionality *N\*D*
* Hidden layer (of size *H*) computes the probability distribution over all words.
* Complexity per training example $Q =N*D + N*D*H + H*V$
* Can reduce *Q* by using hierarchical softmax and Huffman binary tree (for storing vocabulary).
### Recurrent Neural Net Language Model (RNNLM)
* Similar to NNLM minus the projection layer.
* Complexity per training example $Q =H*H + H*V$
* Hierarchical softmax and Huffman tree can be used here as well.
## Log-Linear Models
* Nonlinear hidden layer causes most of the complexity.
* NNLMs can be successfully trained in two steps:
* Learn continuous word vectors using simple models.
* N-gram NNLM trained over the word vectors.
### Continuous Bag-of-Words Model
* Similar to feedforward NNLM.
* No nonlinear hidden layer.
* Projection layer shared for all words and order of words does not influence projection.
* Log-linear classifier uses a window of words to predict the middle word.
* $Q = N*D + D*\log_2V$
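To get a feel for how much the per-example complexity differs between the models introduced so far, the formulas for $Q$ can be evaluated directly. The sizes below (`N`, `D`, `H`, `V`) are hypothetical values picked only to show orders of magnitude, not settings from the paper.

```python
import math


def q_nnlm(N, D, H, V):
    """Feedforward NNLM: Q = N*D + N*D*H + H*V."""
    return N * D + N * D * H + H * V


def q_rnnlm(H, V):
    """RNNLM: Q = H*H + H*V."""
    return H * H + H * V


def q_cbow(N, D, V):
    """CBOW with hierarchical softmax: Q = N*D + D*log2(V)."""
    return N * D + D * math.log2(V)


# Hypothetical sizes, chosen for illustration only.
N, D, H, V = 10, 500, 500, 1_000_000

print(q_nnlm(N, D, H, V))   # 502505000
print(q_rnnlm(H, V))        # 500250000
print(round(q_cbow(N, D, V)))
```

With a full softmax the `H*V` term dominates both neural language models, which is why removing the hidden layer and using a hierarchical softmax makes such a large difference.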
### Continuous Skip-gram Model
* Similar to Continuous Bag-of-Words but uses the middle word of the window to predict the remaining words in the window.
* Distant words are given less weight by sampling fewer distant words.
* $Q = C*(D + D*\log_2 V)$ where *C* is the max distance of the word from the middle word.
* Given a *C* and the training data, a random *R* is chosen in the range *1 to C*.
* For each training word, *R* words from history (previous words) and *R* words from future (next words) are marked as target output and model is trained.
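The window sampling described above can be sketched as a generator of (center, context) training pairs. This is a simplified illustration: the function name, the example sentence, and the use of Python's `random.Random` are assumptions, and the actual training objective is omitted.

```python
import random


def skipgram_pairs(tokens, C, rng):
    """Build (center, context) pairs; each center word draws its own R in 1..C."""
    pairs = []
    for i, center in enumerate(tokens):
        R = rng.randint(1, C)  # inclusive on both ends
        # R words of history and R words of future, clipped at sentence edges.
        for j in range(max(0, i - R), min(len(tokens), i + R + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


rng = random.Random(0)
tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, C=2, rng=rng)
```

Because `R` is drawn uniformly from 1 to `C`, words at distance 1 are always included while words at distance `C` are included only part of the time, which is how distant words end up with less weight.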
## Results
* Skip-gram beats all other models for semantic accuracy tasks (eg - relating Athens with Greece).
* Continuous Bag-of-Words Model outperforms other models for syntactic accuracy tasks (eg great with greater) - with skip-gram just behind in performance.
* Skip-gram architecture combined with RNNLMs outperforms RNNLMs (and other models) for Microsoft Research Sentence Completion Challenge.
* Model can learn relationships like "Queen is to King as Woman is to Man". This allows algebraic operations like Vector("King") - Vector("Man") + Vector("Woman"), which yields a vector close to Vector("Queen").
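The vector arithmetic behind such analogies can be sketched with hand-made 2-D vectors. The vectors below are invented so that one axis loosely encodes royalty and the other gender; real word vectors are learned and have hundreds of dimensions, and the helper names are hypothetical.

```python
import math

# Invented toy embeddings: (royalty, gender). Not learned vectors.
vectors = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.5, 0.1),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the input words from the candidates is a common convention in analogy evaluation, since the nearest vector to the result is often one of the inputs themselves.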
[link]
The paper proposes a standardized benchmark for a number of safety-related problems, and provides an implementation that can be used by other researchers. The problems fall into two categories: specification and robustness. Specification refers to cases where it is difficult to specify a reward function that encodes our intentions. Robustness means that an agent's actions should be robust when facing various complexities of a real-world environment. Here is a list of problems:

1. Specification:
    1. Safe interruptibility: agents should neither seek nor avoid interruption.
    2. Avoiding side effects: agents should minimize effects unrelated to their main objective.
    3. Absent supervisor: agents should not behave differently depending on the presence of a supervisor.
    4. Reward gaming: agents should not try to exploit errors in the reward function.
2. Robustness:
    1. Self-modification: agents should behave well when the environment allows self-modification.
    2. Robustness to distributional shift: agents should behave robustly when test differs from train.
    3. Robustness to adversaries: agents should detect and adapt to adversarial intentions in the environment.
    4. Safe exploration: agents should behave safely during learning as well.

It is worth noting that problems 1.2, 1.4, 2.2, and 2.4 were already described in "Concrete Problems in AI Safety".

It is suggested that each of these problems be tackled in a "gridworld" environment: a 2D environment where the agent lives on a grid, and the only actions it has available are up/down/left/right movements. The benchmark consists of 10 environments, each corresponding to one of the 8 problems mentioned above. Each of the environments is an extremely simple instance of the problem, but nevertheless they are of interest as current SotA algorithms usually don't solve the posed task.

Specifically, the authors trained A2C and Rainbow with DQN update on each of the environments and showed that both algorithms fail on all of the specification problems, except for Rainbow on 1.1. This is expected, as neither of those algorithms is designed for cases where the reward function is misspecified. Both algorithms failed on 2.2--2.4, except for A2C on 2.3. On 2.1, the authors swapped A2C for Rainbow with Sarsa update and showed that Rainbow DQN failed while Rainbow Sarsa performed well.

Overall, this is a good groundwork paper with only a few questionable design decisions, such as the design of the actual reward in 1.2. It is unlikely to have impact similar to MNIST or ImageNet, but it should stimulate safety-related research.
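For readers unfamiliar with the setting, a gridworld of the kind described (a 2D grid with up/down/left/right moves) can be sketched in a few lines. This is not the benchmark's implementation: the layout, the symbols, the reward values, and the function names are all invented for illustration.

```python
# Minimal gridworld sketch. '#' is a wall, 'A' the agent start, 'G' the goal.
# Layout and reward values are made up; this is not the benchmark's code.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

GRID = [
    "#####",
    "#A..#",
    "#.#G#",
    "#####",
]


def find(symbol):
    """Return the (row, col) of the first occurrence of a symbol in the grid."""
    for r, row in enumerate(GRID):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(symbol)


def step(pos, action):
    """Move one cell unless blocked by a wall; return (new_pos, reward, done)."""
    dr, dc = MOVES[action]
    r, c = pos[0] + dr, pos[1] + dc
    if GRID[r][c] == "#":
        r, c = pos  # bumping into a wall leaves the agent in place
    done = GRID[r][c] == "G"
    reward = 50 if done else -1
    return (r, c), reward, done


pos = find("A")
for action in ["right", "right", "down"]:
    pos, reward, done = step(pos, action)
```

The safety problems in the list above are then posed by adding extra elements to such grids (for example interruption tiles or side-effect objects), which this sketch deliberately leaves out.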