Welcome to ShortScience.org!
[link]
* They use an implementation of Q-learning (a reinforcement learning technique) with CNNs to automatically play Atari games.
* The algorithm receives the raw pixels as its input and has to choose buttons to press as its output. No hand-engineered features are used. So the model "sees" the game and "uses" the controller, just like a human player would.
* The model achieves good results on various games, beating all previous techniques and sometimes even surpassing human players.
### How
* Deep Q Learning
* *This is yet another explanation of deep Q-learning; see also [this blog post](http://www.nervanasys.com/demystifying-deep-reinforcement-learning/) for a longer explanation.*
* While playing, sequences of the form (`state1`, `action1`, `reward`, `state2`) are generated.
* `state1` is the current game state. The agent only sees the pixels of that state. (Example: Screen shows enemy.)
* `action1` is an action that the agent chooses. (Example: Shoot!)
* `reward` is the direct reward received for picking `action1` in `state1`. (Example: +1 for a kill.)
* `state2` is the next game state, after the action was chosen in `state1`. (Example: Screen shows dead enemy.)
* One can pick actions at random for some time to generate lots of such tuples. That leads to a replay memory.
* Direct reward
* After playing randomly for some time, one can train a model to predict the direct reward given a screen (we don't want to use the whole state, just the pixels) and an action, i.e. `Q(screen, action) -> direct reward`.
* That function would need a forward pass for each possible action that we could take. So for e.g. 8 buttons that would be 8 forward passes. To make things more efficient, we can let the model directly predict the direct reward for each available action, e.g. for 3 buttons `Q(screen) -> (direct reward of action1, direct reward of action2, direct reward of action3)`.
* We can then sample examples from our replay memory. The input per example is the screen. The output is the reward as a tuple. E.g. if we picked button 1 of 3 in one example and received a reward of +1 then our output/label for that example would be `(1, 0, 0)`.
* We can then train the model by playing completely randomly for some time, then sample some batches and train using a mean squared error. Then play a bit less randomly, i.e. start to use the action which the network thinks would generate the highest reward. Then train again, and so on.
* Indirect reward
* Doing the previous steps, the model will learn to anticipate the *direct* reward correctly. However, we also want it to predict indirect rewards. Otherwise, the model e.g. would never learn to shoot rockets at enemies, because the reward from killing an enemy would come many frames later.
* To learn the indirect reward, one simply adds the value of the highest-valued action according to `Q(state2)` to the direct reward.
* I.e. if we have a tuple (`state1`, `action1`, `reward`, `state2`), we would not add (`state1`, `action1`, `reward`) to the replay memory, but instead (`state1`, `action1`, `reward + highestReward(Q(screen2))`), where `screen2` are the pixels of `state2` and `highestReward()` returns the value of the action with the highest predicted reward according to `Q()`.
* By training to predict `reward + highestReward(Q(screen2))` the network learns to anticipate the direct reward *and* the indirect reward. It takes a leap of faith to accept that this will ever converge to a good solution, but it does.
* We then add `gamma` to the equation: `reward + gamma*highestReward(Q(screen2))`. `gamma` may be set to 0.9. It is a discount factor that devalues future states, e.g. because the world is not deterministic and therefore we can't exactly predict what's going to happen. Note that Q automatically learns to compound the discount, e.g. `state3` is discounted by `gamma^2` when seen from `state1`. (A small sketch of this target construction follows below.)
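A minimal sketch of this target construction, under a few assumptions not in the summary: `q_net` is a hypothetical function mapping a batch of screens to a NumPy array of per-action value estimates, transitions carry an extra `done` flag, and untaken actions keep the network's own prediction so they contribute zero error (a common variant of the `(1, 0, 0)`-style labels above).

```python
import numpy as np

GAMMA = 0.9  # discount factor mentioned above

def build_targets(q_net, batch):
    """Turn sampled (screen1, action, reward, screen2, done) transitions
    into regression targets for Q, including the discounted indirect reward."""
    screens1 = np.stack([t[0] for t in batch])
    screens2 = np.stack([t[3] for t in batch])

    targets = q_net(screens1).copy()      # start from the current predictions
    next_q = q_net(screens2)              # Q(screen2) for every action
    for i, (_, action, reward, _, done) in enumerate(batch):
        best_future = 0.0 if done else next_q[i].max()   # highestReward(Q(screen2))
        targets[i, action] = reward + GAMMA * best_future
    return screens1, targets              # train q_net on these with a squared error
```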
* This paper
* They use the mentioned Deep Q Learning to train their model Q.
* They use a frame-skipping technique, i.e. they let the model choose an action only at every k-th (here: 4th) frame.
* Q is implemented via a neural net. It receives 84x84x4 grayscale pixels that show the game and projects them onto the expected rewards (Q-values) of 4 to 18 actions.
* The input is HxWx4 because they actually feed the last 4 frames into the network, instead of just 1 frame, so the network has information about how things are moving.
* The network architecture (sketched in code after this list) is:
* 84x84x4 (input)
* 16 convs, 8x8, stride 4, ReLU
* 32 convs, 4x4, stride 2, ReLU
* 256 fully connected neurons, ReLU
* <N_actions> fully connected neurons, linear
* They use a replay memory of 1 million frames.
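The paper predates today's deep learning frameworks; as an illustration only, a PyTorch sketch of the architecture listed above (layer sizes from the summary, everything else my choice) could look like:

```python
import torch.nn as nn

class DQN(nn.Module):
    """84x84x4 input -> two conv layers -> 256 hidden units -> one linear Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 16x20x20
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 16x20x20 -> 32x9x9
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),                               # linear output layer
        )

    def forward(self, x):          # x: (batch, 4, 84, 84) grayscale frames
        return self.net(x)

q_net = DQN(n_actions=18)          # games expose between 4 and 18 actions
```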
### Results
* They ran experiments on the Atari games Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest and Space Invaders.
* Same architecture and hyperparameters for all games.
* Rewards were based on score changes in the games, i.e. +1 whenever the score increased and -1 whenever it decreased.
* Optimizer: RMSProp, Batch Size: 32.
* Trained for 10 million examples/frames per game.
* They had no problems with instability and their average Q value per game increased smoothly.
* Their method beats all other state of the art methods.
* They managed to beat a human player in the games that did not require long-term strategies (the fewer frames a strategy has to span, the better).
* Video (starts at 46:05): https://youtu.be/dV80NAlEins?t=46m05s

*The original full algorithm, as shown in the paper.*
--------------------
### Rough chapter-wise notes
* (1) Introduction
* Problems when using neural nets in reinforcement learning (RL):
* The reward signal is often sparse, noisy and delayed.
* Supervised learning usually assumes that data samples are independent, while in RL they are correlated.
* Data distribution can change when the algorithm learns new behaviours.
* They use Q-learning with a CNN and stochastic gradient descent.
* They use an experience replay mechanism (i.e. memory) from which they can sample previous transitions (for training).
* They apply their method to Atari 2600 games in the Arcade Learning Environment (ALE).
* They use only the visible pixels as input to the network, i.e. no manual feature extraction.
* (2) Background
* blablabla, standard deep q learning explanation
* (3) Related Work
* TD-Gammon: "solved" backgammon. Worked similarly to Q-learning and used a multi-layer perceptron.
* Attempts to transfer the TD-Gammon approach to other games failed.
* Research was focused on linear function approximators as there were problems with non-linear ones diverging.
* Recently there has been renewed interest in using neural nets for reinforcement learning, including attempts to fix the divergence problems with gradient temporal-difference methods.
* NFQ is a very similar method (to the one in this paper), but worked on the whole batch instead of minibatches, making it slow. It also first applied dimensionality reduction via autoencoders on the images instead of training on them end-to-end.
* HyperNEAT was applied to Atari games and evolved a neural net for each game. The networks learned to exploit design flaws.
* (4) Deep Reinforcement Learning
* They want to connect a reinforcement learning algorithm with a deep neural network, e.g. to get rid of handcrafted features.
* The network is supposed to run on the raw RGB images.
* They use experience replay, i.e. they store transition tuples (pixels, chosen action, received reward, next pixels) in a memory and sample from it during training.
* They use Q-learning.
* They use an epsilon-greedy policy.
* Advantages from using experience replay instead of learning "live" during game playing:
* Experiences can be reused many times (more efficient).
* Samples are less correlated.
* The parameters learned from one batch have less influence on the distribution of the examples in the next batch.
* They save the last N experiences and sample uniformly from them during training.
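A minimal sketch of such a memory (capacity and batch size taken from the experiments section; class and method names are mine):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the last N experiences and samples uniformly from them."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # old experiences fall out automatically

    def store(self, screen, action, reward, next_screen, done):
        self.buffer.append((screen, action, reward, next_screen, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```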
* (4.1) Preprocessing and Model Architecture
* Raw Atari images are 210x160 pixels with 128 possible colors.
* They downsample them to 110x84 pixels and then crop the 84x84 playing area out of them.
* They also convert the images to grayscale.
* They use the last 4 frames as input and stack them.
* So their network input has shape 84x84x4.
* They use one output neuron per possible action. So they can compute the Q-value (expected reward) of each action with one forward pass.
* Architecture: 84x84x4 (input) => 16 8x8 convs, stride 4, ReLU => 32 4x4 convs stride 2 ReLU => fc 256, ReLU => fc N actions, linear
* 4 to 18 actions/outputs (depends on the game).
* Aside from the outputs, the architecture is the same for all games.
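A sketch of that preprocessing pipeline (OpenCV is used purely as an example library, and the exact crop offset is my guess; the paper only says the playing area is cropped):

```python
import numpy as np
import cv2  # OpenCV, an assumption; the paper does not prescribe a library

def preprocess(rgb_frame: np.ndarray) -> np.ndarray:
    """210x160 RGB Atari frame -> 84x84 grayscale crop of the playing area."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)   # 210x160
    small = cv2.resize(gray, (84, 110))                  # cv2 takes (width, height)
    return small[18:102, :]                              # 84x84; offset is illustrative

def network_input(last_four_frames) -> np.ndarray:
    """Stack the last 4 preprocessed frames into the 84x84x4 input."""
    return np.stack(last_four_frames, axis=-1)
```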
* (5) Experiments
* Games that they played: Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest, Space Invaders
* They use the same architecture and hyperparameters for all games.
* They give a reward of +1 whenever the in-game score increases and -1 whenever it decreases.
* They use RMSProp.
* Mini batch size was 32.
* They train for 10 million frames/examples.
* They initialize epsilon (in their epsilon greedy strategy) to 1.0 and decrease it linearly to 0.1 at one million frames.
* They let the agent decide upon an action at every 4th in-game frame (3rd in space invaders).
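A sketch of that epsilon-greedy policy with linear annealing (function names are mine):

```python
import random

def epsilon_at(frame: int, start=1.0, end=0.1, anneal_frames=1_000_000) -> float:
    """Linearly anneal epsilon from 1.0 to 0.1 over the first million frames."""
    if frame >= anneal_frames:
        return end
    return start + (end - start) * frame / anneal_frames

def pick_action(q_values, frame: int, n_actions: int) -> int:
    """Random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon_at(frame):
        return random.randrange(n_actions)
    return int(q_values.argmax())   # q_values: per-action estimates from the network
```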
* (5.1) Training and stability
* They plot the average reward and Q-value per N games to evaluate the agent's training progress.
* The average reward increases in a noisy way.
* The average Q value increases smoothly.
* They did not experience any divergence issues during their training.
* (5.2) Visualizating the Value Function
* The agent learns to predict the value function accurately, even for rather long sequences (here: ~25 frames).
* (5.3) Main Evaluation
* They compare to three other methods that use hand-engineered features and/or use the pixel data combined with significant prior knowledge.
* They mostly outperform the other methods.
* They managed to beat a human player in three games. The ones where the human won seemed to require strategies that stretched over longer time frames.
[link]
In this article, the authors provide a framework for training two translation models on large amounts of readily available monolingual data. Traditional machine translation models require large parallel corpora to reach good quality, and such corpora are expensive to acquire, while the massive monolingual data remains underused: monolingual corpora are typically only used to pretrain the NMT decoder RNN or to augment the initial parallel corpus with self-generated translations. The authors instead embed the translation task in a reinforcement learning framework in which two agents act as native speakers of two different languages, know little about each other, and learn to translate by trying to communicate. **The two speakers**, `A` and `B`, each know their own language well; this is simulated by a well-trained language model for each of `A` and `B`. Speaker `A` tries to tell a sentence $x$ to `B` by translating it into $y$ in `B`'s language. Since they do not know each other's language, `B` is uncertain about what `A` truly means by saying $y$, but `B` can evaluate how sensible $y$ sounds from his own understanding. Next, `B` informs `A` of this sensibility score and tries to recover what `A` truly meant in `A`'s language, producing $x'$; similarly, `A` can evaluate how close $x'$ is to his original meaning. In general, the original idea that `A` tried to convey is passed through a noisy channel to `B`, and then back to `A` through another noisy channel: the former noisy channel is the `A-B` translation model and the latter the `B-A` translation model. Think about how the first American learned Chinese in history; the principle in this work is intuitively similar.
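A sketch of one such communication round (all names and the reward weighting here are hypothetical stand-ins for the idea described above, not the authors' code):

```python
from typing import Callable

def communication_round(
    x: str,
    translate_ab: Callable[[str], str],                   # A-B translation model (samples y from x)
    lm_b_logprob: Callable[[str], float],                 # B's language model: how sensible does y sound?
    backtranslate_logprob: Callable[[str, str], float],   # B-A model: log P(x | y)
    alpha: float = 0.5,                                   # illustrative trade-off, not from the summary
) -> tuple[str, float]:
    y = translate_ab(x)                            # A says x, translated into B's language
    r_sense = lm_b_logprob(y)                      # B's sensibility score for y
    r_reconstruct = backtranslate_logprob(x, y)    # how well A's meaning is recovered
    reward = alpha * r_sense + (1 - alpha) * r_reconstruct
    return y, reward                               # the reward drives updates of both translation models
```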
[link]
Wang et al. discuss an alternative definition of adversarial examples, taking into account an oracle classifier. Adversarial perturbations are usually constrained in their norm (e.g., $L_\infty$ norm for images); however, the main goal of this constraint is to ensure label invariance: if the image didn't change notably, the label didn't change either. As an alternative formulation, the authors consider an oracle for the task, e.g., humans for image classification. An adversarial example is then defined as a slightly perturbed input whose predicted label changes but whose true label (i.e., the oracle's label) does not. Additionally, the perturbation can be constrained in some norm; specifically, it can be constrained on the true manifold of the data, as represented by the oracle classifier. Based on this notion of adversarial examples, Wang et al. argue that deep neural networks are not robust because they utilize over-complete feature representations. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
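In symbols (the notation here is mine, not necessarily the paper's): given a classifier $f$, an oracle $\mathcal{O}$ and a perturbation budget $\epsilon$, a perturbed input $x' = x + \delta$ is adversarial if
$$
f(x') \neq f(x), \qquad \mathcal{O}(x') = \mathcal{O}(x), \qquad \|\delta\| \leq \epsilon,
$$
where the norm may additionally be measured along the data manifold represented by the oracle.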
[link]
**TL;DR**: Rearranging the terms in Maximum Mean Discrepancy yields a much better loss function for the discriminator of Generative Adversarial Nets.
**Keywords**: Generative adversarial nets, Maximum Mean Discrepancy, spectral normalization, convolutional neural networks, Gaussian kernel, local stability.
**Summary**
Generative adversarial nets (GANs) are widely used to learn the data sampling process and are notoriously difficult to train. The training of GANs may be improved from three aspects: loss function, network architecture, and training process.
This study focuses on a loss function called the Maximum Mean Discrepancy (MMD), defined as:
$$
MMD^2(P_X,P_G)=\mathbb{E}_{P_X}k_{D}(x,x')+\mathbb{E}_{P_G}k_{D}(y,y')-2\mathbb{E}_{P_X,P_G}k_{D}(x,y)
$$
where $G,D$ are the generator and discriminator networks, $x,x'$ are real samples, $y,y'$ are generated samples, and $k_D=k\circ D$ is a learned kernel that calculates the similarity between two samples. Overall, MMD calculates the distance between the real and the generated sample distributions. Thus, traditionally, the generator is trained to minimize $L_G=MMD^2(P_X,P_G)$, while the discriminator minimizes $L_D=-MMD^2(P_X,P_G)$.
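A minimal PyTorch sketch of this quantity under a single-bandwidth Gaussian kernel (the biased estimator and the fixed bandwidth are simplifications of mine; the paper's kernel and estimator differ in the details):

```python
import torch

def gaussian_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Pairwise k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)) for inputs of shape (batch, features)."""
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(dx: torch.Tensor, dy: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate between discriminator outputs of real (dx) and generated (dy) samples."""
    k_xx = gaussian_kernel(dx, dx, sigma).mean()
    k_yy = gaussian_kernel(dy, dy, sigma).mean()
    k_xy = gaussian_kernel(dx, dy, sigma).mean()
    return k_xx + k_yy - 2 * k_xy
```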
This study makes three contributions:
- It argues that $L_D$ encourages the discriminator to ignore the fine details in real data. By minimizing $L_D$, $D$ attempts to maximize $\mathbb{E}_{P_X}k_{D}(x,x')$, the similarity between the scores of real samples. Thus, $D$ has to focus on common features shared by real samples rather than on the fine details that separate them, which may slow down training. Instead, a repulsive loss is proposed, at no additional computational cost over MMD (see the sketch after this list):
$$
L_D^{rep}=\mathbb{E}_{P_X}k_{D}(x,x')-\mathbb{E}_{P_G}k_{D}(y,y')
$$
- Inspired by the hinge loss, this study proposes a bounded Gaussian kernel for the discriminator to facilitate stable training of MMD-GAN.
- The spectral normalization method divides the weight matrix at each layer by its spectral norm to enforce that each layer is Lipschitz continuous. This study proposes a simple method to calculate the spectral norm of a convolutional kernel.
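Contrasting the original and repulsive discriminator objectives in the same style (reusing `gaussian_kernel` from the sketch above; this only illustrates the term rearrangement, it is not the authors' implementation):

```python
def discriminator_losses(dx, dy, sigma: float = 1.0):
    """dx, dy: discriminator outputs for real and generated samples, shape (batch, features)."""
    k_xx = gaussian_kernel(dx, dx, sigma).mean()
    k_yy = gaussian_kernel(dy, dy, sigma).mean()
    k_xy = gaussian_kernel(dx, dy, sigma).mean()

    l_d_original  = -(k_xx + k_yy - 2 * k_xy)   # L_D = -MMD^2 (attractive)
    l_d_repulsive = k_xx - k_yy                 # L_D^{rep}: same terms, rearranged
    l_g           = k_xx + k_yy - 2 * k_xy      # generator still minimizes MMD^2
    return l_d_original, l_d_repulsive, l_g
```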
The results show the effectiveness of the proposed methods on the CIFAR-10, STL-10, CelebA and LSUN-bedroom datasets. In the Appendix, we prove that MMD-GAN training using the gradient method is locally exponentially stable (a property that the Wasserstein loss does not have), and show that the repulsive loss works well with gradient penalty.
The paper has been accepted at ICLR 2019 ([OpenReview link](https://openreview.net/forum?id=HygjqjR9Km)). The code is available at [GitHub link](https://github.com/richardwth/MMD-GAN).
[link]
Offline reinforcement learning is a potentially high-value capability for the machine learning community to learn to do well, because there are many applications where it would be useful to learn a policy for responding to a dynamic environment, but where it would be too unsafe or expensive to learn in an on-policy or online way, continually evaluating our actions in the environment to test their value. In such settings, we'd like to be able to take a batch of existing data, collected from a human demonstrator or from some other algorithm, and learn a policy from those pre-collected transitions, without being able to query the environment further by taking arbitrary actions.

There are two broad strategies for learning a policy from pre-collected transitions. One is to simply mimic the action policy used by the demonstrator, predicting the action the demonstrator would take in a given state, without making use of reward data at all. This is Behavioral Cloning, and it has the advantage of being somewhat more conservative (in terms of not experimenting with possibly-unsafe or low-reward actions the demonstrator never took), but that is also a disadvantage, because it's not possible to get higher reward than the demonstrator did if you're simply copying their behavior. Another approach is to learn a Q function, estimating the value of a given action in a given state, using the reward data from the pre-collected transitions. This can also have downsides, mostly in the direction of overconfidence. Q-value temporal difference learning uses the current reward plus the max Q value over possible next actions as the target for the current-state Q estimate. This tends to lead to overestimates, because regression-to-the-mean effects mean that the highest Q estimates are disproportionately likely to be noisy (possibly because they correspond to an action with little data in the demonstrator dataset). In on-policy Q learning, this is less problematic, because the agent can take the action associated with its noisily inaccurate estimate and, as a result, get more data for that action and a less noisy estimate in the future. But in a fully offline setting, all our learning is completed before we actually start taking actions with our policy, so taking high-uncertainty actions isn't a valuable source of new information, just risky.

The approach suggested by this DeepMind paper, Critic Regularized Regression (CRR), is essentially a synthesis of these two approaches. The method learns a Q function as normal, using temporal difference methods. The distinction comes from how a policy is obtained given the learned Q function. Rather than simply taking the action your Q estimate says is highest-value at a particular point, CRR optimizes a policy according to the formula shown below. The f() function is a stand-in for various potential functions, all of which are monotonic with respect to the Q function, meaning they increase when the Q function does.

https://i.imgur.com/jGmhYdd.png

This basically amounts to a form of behavioral cloning loss (the part that maximizes the probability under your policy of the actions sampled from the demonstrator dataset), but weighted or, as the paper terms it, filtered, by the learned Q function. The higher the estimated Q value for a transition, the more weight is placed on that transition from the demo dataset having high probability under your policy.
Rather than trying to mimic all of the actions of the demonstrator, the policy preferentially tries to mimic the demonstrator actions that it estimates were particularly high-quality. Different f() functions lead to different kinds of filtration. The `binary` version is an indicator function for whether the advantage of an action (the Q value for that action at that state minus some reference value for that state, i.e. how much better the action is than the alternatives there) is greater than zero. Another, `exp`, uses exponential weightings that do a softer up- or down-weighting of transitions based on advantage, rather than the sharp binary cut at zero advantage. The authors demonstrate that, on multiple environments from three different environment suites, CRR outperforms other off-policy baselines, whether purer behavioral cloning or purer RL, and in many cases does so quite dramatically. They find that the sharper binary weighting scheme does better on simpler tasks, where the trade-off of fewer but higher-quality samples to learn from pays off. However, on more complex tasks, the policy benefits from the exp weighting, which still uses and learns from more samples (albeit at lower weights); this introduces some mimicking of lower-quality transitions, but in exchange for a larger effective dataset size to learn from.
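A sketch of the filtered behavioral-cloning objective described above (tensor names, the advantage baseline and the clipping value are my choices, not necessarily the paper's):

```python
import torch

def crr_policy_loss(log_probs, q_sa, state_values, mode="exp", beta=1.0):
    """Weighted behavioral cloning.
    log_probs:    log pi(a|s) for the demonstrator's actions
    q_sa:         learned Q(s, a) for those same state-action pairs
    state_values: a per-state reference value, e.g. the mean Q over sampled actions
    """
    advantage = q_sa - state_values
    if mode == "binary":
        weights = (advantage > 0).float()                       # keep only better-than-average actions
    else:  # "exp"
        weights = torch.exp(advantage / beta).clamp(max=20.0)   # soft up-/down-weighting
    return -(weights.detach() * log_probs).mean()               # maximize weighted log-likelihood
```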