Summary by Brady Neal

Main Results (tl;dr)

Deep Linear Networks

  1. Loss function is non-convex and non-concave
  2. Every local minimum is a global minimum
  3. Shallow neural networks don't have bad saddle points
  4. Deep neural networks do have bad saddle points (see the toy check after this list)
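
To make results 3 and 4 concrete, here is a minimal numerical toy check (my own example, not from the paper): with scalar weights, the shallow linear loss $(1 - w_2 w_1)^2$ has a saddle at the origin whose Hessian has a negative eigenvalue (escapable using curvature), while the deep linear loss $(1 - w_3 w_2 w_1)^2$ has a saddle at the origin whose Hessian is identically zero, i.e. a bad saddle.

```python
# Toy check (not from the paper): compare the Hessian at the origin for a
# shallow vs. a deep scalar linear network with squared-error loss.
import numpy as np

def num_hessian(f, w, eps=1e-4):
    """Central-difference estimate of the Hessian of f at the point w."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * eps ** 2)
    return H

f_shallow = lambda w: (1.0 - w[1] * w[0]) ** 2           # one hidden layer
f_deep    = lambda w: (1.0 - w[2] * w[1] * w[0]) ** 2    # two hidden layers

print(np.linalg.eigvalsh(num_hessian(f_shallow, np.zeros(2))))  # ~ [-2, 2]: escapable saddle
print(np.linalg.eigvalsh(num_hessian(f_deep,    np.zeros(3))))  # ~ [0, 0, 0]: bad saddle
```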

Deep ReLU Networks

  • Same results as above by reduction to deep linear networks under strong simplifying assumptions
  • Strong assumptions:
    • The probability that a path through the ReLU network is active is the same for every path.
    • The activations of the network are independent of the input data and the weights.

Highlighted Takeaways

  • Depth doesn't create non-global minima, but depth does create bad saddle points.
  • This paper moves deep linear networks closer to being a good model for deep ReLU networks by discarding 5 of the 7 previously used assumptions. This gives more "support" for the conjecture that deep ReLU networks don't have bad local minima.
  • Deep linear networks don't have bad local minima, so if deep ReLU networks do have bad local minima, it's purely because of the introduction of nonlinear activations. This highlights the importance of the activation function used.
  • Shallow linear networks don't have bad saddle points while deep linear networks do, indicating that the saddle point problem is introduced with depth beyond the first hidden layer.

Bad saddle point : saddle point whose Hessian has no negative eigenvalues, so curvature (second-order) information gives no descent direction; e.g. the origin of $f(x, y) = x^3 + y^2$

Shallow neural network : single hidden layer

Deep neural network : more than one hidden layer

Bad local minima : local minima that aren't global minima

Position in Research Landscape

More Details

Deep Linear Networks

  • Main result is Result 2, which proves the conjecture from 1989: every local minimum is a global minimum.
  • The strong assumptions are not needed here; they appear only in the ReLU analysis
  • Assumptions (realistic and practically easy to satisfy; see the check sketched after this list):
    • $XX^T$ and $XY^T$ are full rank
    • $d_y \leq d_x$ (the output dimension is no larger than the input dimension)
    • $\Sigma = YX^T(XX^T)^{-1}XY^T$ has $d_y$ distinct eigenvalues
    • The results are specific to the squared-error loss function
  • Essentially gives a comprehensive understanding of the loss surface of deep linear networks
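
These conditions hold for generic data (random matrices satisfy them with probability 1). Below is a minimal sketch of how one might verify them for a concrete dataset; the code and names (`check_assumptions`, the $d_x \times m$ matrix `X`, the $d_y \times m$ matrix `Y`) are my own illustration, not from the paper.

```python
# Sketch (mine, not the paper's): check the deep-linear-network assumptions on given data.
import numpy as np

def check_assumptions(X, Y):
    """X: d_x x m data matrix, Y: d_y x m target matrix (squared-error setting)."""
    d_x, d_y = X.shape[0], Y.shape[0]
    XXt, XYt = X @ X.T, X @ Y.T
    full_rank = (np.linalg.matrix_rank(XXt) == d_x and
                 np.linalg.matrix_rank(XYt) == min(d_x, d_y))
    Sigma = Y @ X.T @ np.linalg.inv(XXt) @ X @ Y.T
    eigvals = np.linalg.eigvalsh(Sigma)                  # Sigma is symmetric PSD
    distinct = bool(np.all(np.diff(np.sort(eigvals)) > 1e-8))
    return {"d_y <= d_x": d_y <= d_x,
            "XX^T and XY^T full rank": bool(full_rank),
            "Sigma has d_y distinct eigenvalues": distinct}

# Generic (random) data satisfies the assumptions:
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((5, 100)), rng.standard_normal((3, 100))
print(check_assumptions(X, Y))
```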

Deep ReLU Networks

  • Specific to the ReLU activation; makes strong use of its properties
  • Choromanska et al. (2015) relate the loss function to the Hamiltonian of the spherical spin-glass model, using 3 reshaping assumptions. This allows them to apply existing random matrix theory results. This paper drops those reshaping assumptions by performing a completely different analysis.
  • Because Choromanska et al. (2015) used random matrix theory, they analyzed a random Hessian, which meant they needed 2 distributional assumptions. This paper also drops those 2 assumptions and analyzes a deterministic Hessian.
  • Remaining Unrealistic Assumptions (these enable the reduction sketched after this list):
    • The probability that a path through the ReLU network is active is the same for every path.
    • The activations of the network are independent of the input data and the weights.
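
Under these two assumptions, the expected output of the ReLU network collapses to a (scaled) deep linear network, which is what lets the linear-network results transfer. A rough sketch, with notation that is mine rather than exactly the paper's: write the network output as a sum over the $\Psi$ input-to-output paths, where $[Z]_p \in \{0,1\}$ indicates whether path $p$ is active and $w_p^{(k)}$ is the layer-$k$ weight on path $p$:

$$\hat{Y}(X, W) \;=\; \sum_{p=1}^{\Psi} [X]_p \, [Z]_p \prod_{k=1}^{H+1} w_p^{(k)},
\qquad
\mathbb{E}_Z\big[\hat{Y}(X, W)\big] \;=\; \rho \sum_{p=1}^{\Psi} [X]_p \prod_{k=1}^{H+1} w_p^{(k)}.$$

The second equality uses $P([Z]_p = 1) = \rho$ for every path (first assumption) and the independence of $[Z]_p$ from $X$ and $W$ (second assumption); the right-hand side is $\rho$ times the output of a deep linear network.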