## General Framework

Really **similar to DAgger** (see [summary](https://www.shortscience.org/paper?bibtexKey=journals/corr/1011.0686&a=muntermulehitch)) but considers **cost-sensitive classification** ("some mistakes are worse than others": you should be more careful in imitating a particular expert action if failing to do so incurs a large cost-to-go). By doing so they improve DAgger's bound of $\epsilon_{class}uT$, where $u$ is the difference in cost-to-go (between always following the expert and making one error then following the expert), to $\epsilon_{class}T$, where $\epsilon_{class}$ is the error due to the lack of expressiveness of the policy class. In brief, by accounting for the effect of a mistake on the cost-to-go, they remove the cost-to-go contribution to the bound (the difference in performance between the learned policy and the expert policy) and thus obtain a tighter bound.

The paper uses the word "regret" for two distinct concepts, which is confusing: one refers to the no-regret online learning meta-approach to IL (as in DAgger), while the other comes from the fact that Aggrevate aims at minimizing the cost-to-go difference with the expert (the sum of the costs I incurred because I did not behave like the expert = *regret*), whereas DAgger aims at minimizing the error rate w.r.t. the expert. Additionally, the paper extends the view of Imitation Learning as an online learning procedure to Reinforcement Learning.

## Assumptions

**Interactive**: you can re-query the expert and thus reach $\epsilon T$ bounds instead of the $\epsilon T^2$ of non-interactive methods (Behavioral Cloning), which suffer from compounding errors. One also needs a **reward/cost** that **cannot** be defined relative to the expert (no 0-1 loss w.r.t. the expert, for example), since the cost-to-go is computed under the expert policy (it would always yield 0 cost).

## Other methods

**SEARN**: also reasons about the **cost-to-go, but under the current policy** instead of the expert's (although in practice you can use the expert's, in which case it becomes really similar to Aggrevate). SEARN uses **stochastic policies** and can be seen as an Aggrevate variant where stochastic mixing is used to solve the online learning problem instead of **Regularized-Follow-The-Leader (R-FTL)**.

## Aggrevate - IL with cost-to-go

Pretty much like DAgger, but one has to use a no-regret online learning algorithm to do **cost-sensitive** instead of regular classification. In the paper, they use the R-FTL algorithm and train the policy on all previous iterations. Indeed, using R-FTL with a strongly convex loss (like the squared error) and a stable batch learner (like stochastic gradient descent) ensures the no-regret property. In practice (to deal with infinite policy classes and knowing the cost of only a few actions per state), they reduce cost-sensitive classification to an **argmax regression problem**: they train a model to match the cost given state-action (and time, if we want non-stationary policies) using the collected datapoints and the (strongly convex) squared error loss. Then, they argmin this model to find the action that minimizes the cost-to-go (cost-sensitive classification). This is close to what we do in **Q-learning** (DQN or DDPG): fit a critic (Q-values) with the TD-error (instead of the expert's full-rollout cost-to-go), then argmax your critic to get your policy. Similarly to DQN, the way you choose which actions to explore when estimating the cost-to-go matters (in this paper they use uniform exploration).
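A minimal sketch of this reduction, under assumptions not in the paper (a toy `env.reset`/`env.step` API, discrete actions, flat feature-vector states, scikit-learn's `Ridge` as the squared-loss regressor, and uniform sampling of the switching time), might look as follows; it is an illustration of the idea, not the paper's implementation.

```python
# Rough sketch of the Aggrevate reduction: regress the expert's cost-to-go
# from (state, action) pairs, then act greedily (argmin) on the fit.
import random
import numpy as np
from sklearn.linear_model import Ridge

def rollout_cost_to_go(env, state, first_action, rollout_policy, horizon):
    """Execute first_action, then follow rollout_policy; return the summed cost."""
    total, action = 0.0, first_action
    for _ in range(horizon):
        state, cost, done = env.step(state, action)   # assumed env API
        total += cost
        if done:
            break
        action = rollout_policy(state)
    return total

def aggrevate(env, expert, n_iterations, n_rollouts, horizon, n_actions):
    data = []                                  # aggregated (state, action, cost-to-go)
    regressor = Ridge()
    policy = expert                            # first iteration rolls in with the expert
    for _ in range(n_iterations):
        for _ in range(n_rollouts):
            t = random.randrange(horizon)      # uniformly random switching time
            state, done = env.reset(), False
            for _ in range(t):                 # roll in with the current policy
                state, _, done = env.step(state, policy(state))
                if done:
                    break
            if done:
                continue
            a = random.randrange(n_actions)    # uniform exploration of the tried action
            q = rollout_cost_to_go(env, state, a, expert, horizon - t)  # expert cost-to-go
            data.append((np.asarray(state, dtype=float), a, q))
        # R-FTL flavour: refit the (strongly convex) squared loss on ALL data so far.
        X = np.array([np.append(s, a) for s, a, _ in data])
        y = np.array([q for _, _, q in data])
        regressor.fit(X, y)
        def policy(state, reg=regressor):      # cost-sensitive classification = argmin
            feats = np.array([np.append(np.asarray(state, dtype=float), a)
                              for a in range(n_actions)])
            return int(np.argmin(reg.predict(feats)))
    return policy
```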
**Limitations**: If the policy class is not expressive enough to match the expert's performance, this algorithm may fail to learn a reasonable policy. Example: the task is to go from point A to point B; there exists a narrow shortcut and a safer but longer road. The expert can handle both roads, so it prefers the shortcut. Even if the learned policy class can handle the safer road, the learner will keep trying to use the narrow one and fail to reach the goal. This is because all the costs-to-go are computed under the expert's policy, thus ignoring the fact that they cannot be achieved by any policy of the learned policy class.

## RL via No-Regret Policy Iteration -- NRPI

NRPI no longer requires an expert policy, only a **state exploration distribution**. NRPI can also be preferred when no policy in the policy class can match the expert's, since it allows for more exploration by considering the **cost-to-go of the current policy**. Here, the equivalent argmax regression problem is really similar to Q-learning (where we use sampled costs-to-go from rollouts instead of Bellman errors), but **the costs-to-go** in the aggregated dataset correspond to **outdated policies!** (in contrast, DQN's data consists of rewards instead of costs-to-go). Yet, since R-FTL is a no-regret online learning method, the learned policy performs well under all the costs-to-go of previous iterations, and the policies as well as the costs-to-go converge. The performance of NRPI is strongly limited by the quality of the exploration distribution. Yet, if the exploration distribution is optimal, then NRPI is also optimal (the bound $T\epsilon_{regret} \rightarrow 0$ with enough online iterations). This may be a promising method for non-interactive, state-only IL (if you have access to a reward). A rough sketch of one NRPI iteration is given at the end of this summary.

## General limitations

Both methods are much less sample efficient than DAgger, as they require costs-to-go: one full rollout for a single data point.

## Broad contribution

Seeing iterative learning methods such as Q-learning in the light of online learning methods is insightful and yields better bounds and a better understanding of why some methods might work. It is a good tool for analyzing the dynamics that interleave learning and execution (optimizing and collecting data) for the purpose of generalization. For example, the bound for NRPI can seem quite counter-intuitive to someone familiar with the on-policy/off-policy distinction: NRPI optimizes a policy w.r.t. **costs-to-go of other policies**, yet R-FTL tells us that it converges towards what we want. Additionally, it may give a practical advantage in terms of stability, as the policy is optimized with larger batches and thus has to be good across many states and many cost-to-go formulations.
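Relating to the NRPI section above, here is a rough sketch of a single NRPI iteration. It reuses `rollout_cost_to_go` from the Aggrevate sketch; `sample_state` (a draw from the chosen state exploration distribution) and the rest of the interface are assumptions made for illustration, not the paper's code.

```python
# One NRPI iteration: costs-to-go are measured under the CURRENT policy,
# starting from states drawn from the exploration distribution.
import random
import numpy as np

def nrpi_iteration(env, sample_state, policy, regressor, data,
                   n_rollouts, horizon, n_actions):
    for _ in range(n_rollouts):
        state = sample_state()                 # state exploration distribution
        a = random.randrange(n_actions)        # explored action
        # Cost-to-go of the current policy (no expert involved) from (state, a).
        q = rollout_cost_to_go(env, state, a, policy, horizon)
        data.append((np.asarray(state, dtype=float), a, q))
    # R-FTL: refit on the aggregate dataset, even though the older costs-to-go
    # were measured under outdated policies (the no-regret property makes this sound).
    X = np.array([np.append(s, a) for s, a, _ in data])
    y = np.array([q for _, _, q in data])
    regressor.fit(X, y)
    def new_policy(state, reg=regressor):      # greedy w.r.t. predicted cost-to-go
        feats = np.array([np.append(np.asarray(state, dtype=float), a)
                          for a in range(n_actions)])
        return int(np.argmin(reg.predict(feats)))
    return new_policy, regressor, data
```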
## General Framework

The imitation learning problem is here cast as a classification problem: label each state with the corresponding expert action. With this view, structured prediction (predict the next label knowing your previous predictions) can be seen as a degenerate IL problem. They make the **reduction assumption** that you can make the probability of a mistake $\epsilon$ as small as desired on the **training distribution** (expert or mixture). They also assume that the difference in cost-to-go (Q-value) between the expert action followed by the expert policy and any other action followed by the expert policy is small (which does not hold if one mistake makes you fall off a cliff, for example!).

## Problem definition

Three problems are highlighted in this paper: **1 - interaction between policy and distribution**, **2 - violation of the i.i.d. assumption**, and **3 - distribution shift and the resulting compounding error of BC**. The paper implies that **1** implies **2**, which implies **3**, but the proposed algorithm (DAgger) eventually only addresses **3**.

1. **Interaction between policy and distribution**: the fact that the learned policy appears in the expectation (through the state-action distribution) as well as in the loss makes the **objective non-convex**, even for convex loss functions. **Yet this also implies non-i.i.d. samples!** Indeed, even if one could sample directly from the state-action distribution (e.g., having its analytical form or an infinite experience replay buffer) and thus draw i.i.d. samples, the dependency would occur across optimization steps: if I draw a sample and use it to update my policy, I also update the distribution from which I will draw my next sample, so my next sample depends on my previous sample (since it conditioned my policy update).
2. **Violation of the i.i.d. assumption**: we just discussed that 1. implies non-i.i.d. samples across updates (batches), yet there is another form of dependency (inside a batch this time): in practice we collect samples using rollouts, and within a trajectory $s'$ depends on $(s,a)$, so there is a dependency because the **samples were collected sequentially**.
3. **Distribution shift and compounding error of Behavioral Cloning (supervised learning)**: during training you assume i.i.d. data and do not model that the next state you will have to classify (act upon) is influenced by your previous action. But at execution time samples are not i.i.d.: if you make a mistake, you may reach a state you have never seen (distribution shift) and you cannot recover (compounding error). If the classifier's probability of making a mistake is $\epsilon$ and the episode length is $T$, then BC can make as many as $\epsilon T^2$ mistakes: in the worst case, a mistake at time $t$ leaves you unable to recover, so you keep making mistakes for the remaining $T-t$ steps, and summing over $t$ gives $O(\epsilon T^2)$ (cf. Levine's class example with the funambulist). **This is what this paper (and the papers it compares to) addresses: they propose a worst-case bound linear in both** $\epsilon$ and $T$.

**Additional assumptions compared to BC**: during training you are allowed to query the expert as much as you want.

## Other methods

* **Forward Training**: trains a **non-stationary policy** (one policy for each time step).
Each policy is trained to match the expert on the distribution of states encountered at that time step when doing rollouts with the previously learned policies (there is no distribution shift since each policy is trained on the actual distribution of states it will see). **Problem**: you have to train $T$ different policies, and you cannot stop until you have trained all of them.
* **Stochastic Mixing Iterative Learning (SMILe)**: iteratively trains a **stochastic mixture of policies**. Each new policy is trained to match the expert on the state distribution of the current mixture, therefore iteratively accounting for the distribution drift; eventually, the distribution should converge. This yields near-linear regret in $T$ and $\epsilon$. **Problem**: at execution time, we sample a policy from the mixture, which can perform poorly on the given state, especially if it corresponds to a different distribution (an early iteration, for example).

## DAgger -- Dataset Aggregation

**Algo**

$\beta_i$ controls the proportion of visited states that are collected under the expert rather than the learned policy. This may help in the early iterations, where the policy is bad and may spend a lot of time collecting datapoints in irrelevant states. In practice, using $\beta_i=I(i=0)$ works well. Intuitively, by training on the transitions generated by its own policy, DAgger reduces the distribution shift. A rough sketch of the loop is given at the end of this summary.

**Link to No-Regret Online Learning**

DAgger can be seen as a *Follow-the-Leader* (FTL) algorithm where at iteration $n$ we pick the best policy in hindsight, i.e. with respect to all the transitions seen so far (in previous iterations). The use of $\beta_i$ is an "extra trick" that does not exist as such in FTL algorithms.

*Regret*: the difference between what I did and the best thing I could have done. Each iteration of DAgger can be seen as a step of online learning (treat a mini-batch of trajectories under a single policy as a single online-learning example), where the online algorithm used is FTL but could actually be any no-regret algorithm. Indeed, using this no-regret view, the authors show that using a no-regret algorithm (regret decaying in $1/N$) as a meta-algorithm for IL (just like FTL for DAgger) yields, for $N$ large enough (on the order of $T$), a worst-case performance linear in $\epsilon$, $u$ and $T$, where it is **assumed** that we can reach an error probability of $\epsilon$ on any training distribution and a difference in cost-to-go (Q-value) of at most $u$. In other words, if we can guarantee a classification error rate of $\epsilon$ **on the training distribution** and the expert is able to recover from a mistake ($u$ is bounded), then the use of a **no-regret online learning algorithm** yields a policy that makes at most (proportional to) $\epsilon T u$ errors **on its own state distribution (test distribution)**.

**Requirements**
* An error rate of $\epsilon$ on the training distribution.
* A strongly convex loss function between expert and learner labels (the 0-1 loss is not, but the binary cross-entropy usually used for classification is), so that FTL is indeed a no-regret algorithm. If the loss is not strongly convex, FTL has no guarantees and we must use another no-regret method.
* The difference in cost-to-go (Q-value) between the expert action followed by the expert policy and any other action followed by the expert policy must be small (the expert must be able to recover: this does not account for falling off a cliff, for example!).
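Here is a rough sketch of the DAgger loop described above. The `env`/`expert` interface, the discrete action space, the flat feature-vector states, and the use of scikit-learn's `LogisticRegression` as the classifier are assumptions made for illustration; only the dataset aggregation and the $\beta_i=I(i=0)$ mixing follow the summary.

```python
# Rough DAgger sketch: roll out a mixture of expert and learner, always label
# visited states with the expert's action, and refit on the aggregated dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

def dagger(env, expert, n_iterations, n_rollouts, horizon):
    states, labels = [], []                      # aggregated dataset D
    learner = LogisticRegression(max_iter=1000)
    for i in range(n_iterations):
        beta = 1.0 if i == 0 else 0.0            # beta_i = I(i = 0)
        for _ in range(n_rollouts):
            s = env.reset()                      # assumed env API
            for _ in range(horizon):
                # Mix expert and learner to choose the action actually executed...
                if np.random.rand() < beta:
                    a_exec = expert(s)
                else:
                    a_exec = int(learner.predict(np.asarray(s, dtype=float)[None])[0])
                # ...but always record the EXPERT's label for the visited state.
                states.append(np.asarray(s, dtype=float))
                labels.append(expert(s))
                s, done = env.step(s, a_exec)    # assumed env API
                if done:
                    break
        # Follow-the-Leader: refit the classifier on the whole aggregated dataset.
        learner.fit(np.array(states), np.array(labels))
    return learner
```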
## Summary

The broad goal of this paper is to understand how a neural network learns the underlying distribution of the input data and which properties of the network describe its generalization power. Previous literature tries to use statistical measures like Rademacher complexity, uniform stability and VC dimension to explain the generalization error of the model. These measures explain generalization in terms of the number of parameters in the model along with the applied regularization. The experiments performed in [Section 2] of the paper show that the learning capacity of a CNN cannot be sufficiently explained by traditional statistical learning theory. Even the effect of different regularization strategies in CNNs is shown to be potentially unrelated to the generalization error, which contradicts the theory behind VC dimension. The experiments show that the model is able to fit random labels and inputs with different amounts of Gaussian noise. When the authors gradually increase the noise in the inputs, the generalization error gradually increases while the training error is still able to reach zero. The authors conclude that big networks are able to completely memorize the dataset. A minimal sketch of the randomization test is given after the references below.

## Personal Thoughts

1) Firstly, we need a new theory to explain why and how a CNN memorizes the inputs and generalizes to new data. Since the paper shows that regularization doesn't have much effect on generalization for big networks, maybe the network is actually memorizing the whole input space. But the memorization is very strategic, in the sense that only the inputs where no simple underlying features can be found (e.g., noise) are completely memorized, unlike inputs with a stronger signal where patterns can be found. This may explain the discrepancy in the number of training steps between 'true labels' and noisy inputs in [Figure 1 a.]. My very general understanding of the Information Bottleneck hypothesis [4] is that networks compress noisy input data as much as possible while preserving important information, and that a network takes more time to compress noise than strong signals in images. This may give some intuition about the learning process taking place.

2) A CNN is highly non-linear, has millions of parameters and a very complex loss landscape. There might be multiple minima, and we need a theory to explain which of these minima gives the best generalization. Unfortunately, the workings of SGD are still a black box and very difficult to characterize. There are many interesting phenomena, like adversarial attacks, the effect of the optimizer on the weights found (Im et al., 2016) and the actual understanding of non-linearity in CNNs (Goodfellow & Vinyals, 2015), that all point to gaps in our overall understanding of very high-dimensional manifolds. This requires rigorous experimentation to independently study the effect of the network architecture, the optimizer and the actual input (Keskar et al., 2017) on generalization.

## References

1. Im, Daniel Jiwoong, et al. "An empirical analysis of the optimization of deep network loss surfaces." arXiv preprint (2016).
2. Goodfellow, Ian J., and Oriol Vinyals. "Qualitatively characterizing neural network optimization problems." CoRR abs/1412.6544 (2015).
3. Keskar, Nitish Shirish, et al. "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." arXiv abs/1609.04836 (2017).
4. Information Bottleneck (video): https://www.youtube.com/watch?v=XL07WEc2TRI
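As a complement to the summary above, here is a minimal sketch of the label-randomization experiment on CIFAR-10. The small over-parameterized network, the optimizer settings and the epoch count are arbitrary choices for illustration, not the paper's setup.

```python
# Label-randomization sketch: a network trained on uniformly random labels can
# still drive the training error towards zero, i.e. it memorizes the data.
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader

train_set = torchvision.datasets.CIFAR10(
    root="data", train=True, download=True,
    transform=torchvision.transforms.ToTensor())
# Replace every label with a random one: anything the network fits afterwards
# can only come from memorization, not from real input-label structure.
train_set.targets = torch.randint(0, 10, (len(train_set.targets),)).tolist()
loader = DataLoader(train_set, batch_size=128, shuffle=True)

model = nn.Sequential(                      # deliberately over-parameterized
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):                    # training error slowly approaches zero
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
```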
Is the market efficient? This is perhaps the most prevalent question in all of finance. While this paper does not aim to answer that question, it does frame it in an information-theoretic context. Specifically, Maymin shows that at least the weak form of the efficient market hypothesis (EMH) holds if and only if P = NP.

First, he defines what an efficient market means: "The weakest form of the EMH states that future prices cannot be predicted by analyzing prices from the past. Therefore, technical analysis cannot work in the long run, though of course any strategy could make money randomly."

Consider $n$ past historical price changes, each either UP or DOWN, and trading strategies that can be neutral, long, or short the market. Checking whether there exists a strategy that makes money in a statistically significant way requires checking all $3^n$ possible strategies, while verifying that a given strategy is profitable beyond random chance can be done with a linear $O(n)$ pass over the historical data. Thus the problem of finding a profitable strategy is in NP. If P = NP, then a profitable strategy can be computed efficiently, in polynomial time; we can then trade based on it, which makes the market efficient. Conversely, if the market is efficient, it is impossible for a trader to predict future prices from historical data, as the current price has all publicly available information "priced in"; a future price would be a random fluctuation around the current price. A brute-force version of this strategy search is sketched at the end of this summary.

Does an efficient market imply P = NP? Take an example 3-SAT instance: $$(a \lor b \lor \neg c) \land (a \lor \neg b \lor d)$$ The NP-complete problem 3-SAT can be encoded into the market using order-cancels-order (OCO) orders[^1], where each variable is a security, positive literals are buy orders and negated literals are sell orders. Place these two OCOs in the market: $$\text{OCO (buy A, buy B, or sell C)}$$ $$\text{OCO (buy A, sell B, or buy D)}$$ After some constant time, any outstanding order is cancelled and all positions are liquidated. If the market is efficient, then there exists a way to execute these two OCO orders such that an overall profit is guaranteed. In other words, an efficient market allows us to solve an arbitrary 3-SAT problem in polynomial time.

### **Takeaway**

The author links market efficiency with computational complexity.

[^1]: An order-cancels-order is a type of order on two different securities that automatically cancels the other order when one is filled. There is no chance that both orders can be filled.
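As a toy illustration of the counting argument above (reading a "strategy" as a fixed position per period, which matches the summary's $3^n$ count but not the paper's formal statistical test), a brute-force search might look like this:

```python
# Each candidate strategy assigns a position in {-1, 0, +1} (short, neutral, long)
# to each of the n periods, so there are 3**n candidates; scoring any single one
# against the UP/DOWN history is a single O(n) pass. The paper's significance
# test is replaced here by a plain profit threshold.
from itertools import product

def profit(strategy, price_changes):
    """O(n) verification: profit from holding strategy[t] during price change t."""
    return sum(pos * change for pos, change in zip(strategy, price_changes))

def profitable_strategies(price_changes, threshold=0):
    """Exhaustive O(3^n) search over all candidate position sequences."""
    n = len(price_changes)
    return [s for s in product((-1, 0, 1), repeat=n)
            if profit(s, price_changes) > threshold]

changes = [+1, -1, -1, +1, +1]                 # toy history: UP, DOWN, DOWN, UP, UP
winners = profitable_strategies(changes, threshold=3)
print(len(winners), winners[0])                # how many candidates beat the threshold
```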
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make it easier to run on small devices. It has been observed (by these same authors, in fact) that pruned networks can achieve performance comparable to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there is some small effective subnetwork you can find if you sample enough subnetworks.

Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights and then continue training the remaining weights from the values they had at pruning time, keeping the final learning rate of the network constant. The authors find that:

1. Weight rewinding, where you rewind weights to *near* their starting value and then retrain using the learning rates of early in training, outperforms fine-tuning from the weights you had when you pruned; but, also,
2. Learning rate rewinding, where you keep the weights as they are but rewind the learning rates to what they were early in training, is actually the most effective option for a given amount of training time/search cost.

To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the pruned weights. So what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all; that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper.
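A rough sketch of the three retraining schemes, in plain PyTorch, might look as follows. The `train_one_epoch` callback, the saved `early_state` checkpoint (weights from epoch $T-t$), and `full_schedule` (the original per-epoch learning rates) are placeholders and assumptions; this illustrates the idea, not the authors' code.

```python
# Contrast fine-tuning, weight rewinding, and learning-rate rewinding after
# magnitude pruning; pruned weights are pinned at zero during retraining.
import copy
import torch

def magnitude_prune(model, sparsity=0.8):
    """Zero out the smallest-magnitude weights and return binary masks."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:                                   # prune weight tensors only
            k = max(1, int(sparsity * p.numel()))
            threshold = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > threshold).float()
            p.data *= masks[name]
    return masks

def retrain(model, masks, lr_schedule, train_one_epoch):
    """Retrain the pruned network, keeping pruned weights at zero."""
    def apply_masks():
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.data *= masks[name]
    for lr in lr_schedule:
        apply_masks()
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        train_one_epoch(model, opt)
    apply_masks()

def compare_retraining(model, early_state, full_schedule, train_one_epoch, t):
    masks = magnitude_prune(model)
    # 1) Fine-tuning: keep final weights, train t more epochs at the final (small) LR.
    finetune = copy.deepcopy(model)
    retrain(finetune, masks, [full_schedule[-1]] * t, train_one_epoch)
    # 2) Weight rewinding: restore the epoch-(T-t) weights AND replay the original
    #    schedule from that epoch onward (for large t this means the early, higher LRs).
    weight_rw = copy.deepcopy(model)
    weight_rw.load_state_dict(early_state)
    retrain(weight_rw, masks, full_schedule[-t:], train_one_epoch)
    # 3) Learning-rate rewinding: keep the final weights but replay the same
    #    rewound learning-rate schedule.
    lr_rw = copy.deepcopy(model)
    retrain(lr_rw, masks, full_schedule[-t:], train_one_epoch)
    return finetune, weight_rw, lr_rw
```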