Definitions, Properties, and Theorems about Coherence in "On the Generalization Mystery in Deep Learning"
- 1. Related work
- 2. Future work
- 3. Regularization techniques
- 4. Coherence
- 5. Shortcomings
- 6. Generalization Theorem of Coherence
- 6.1. Generalization Gap
- 6.2. Stability
- 6.3. Stability equals generalization
- 6.4. Smoothness Assumptions
- 6.5. Stability of (Stochastic) Gradient Descent
- 6.6. 1-Step Expansion
- 6.7. Unrolling the Recursion
- 6.8. Bound the difference between examples by the coherence as measured by α
- 6.9. Stability Theorem
- 6.10. Generalization Theorem
- 6.11. Corollary: fixed step sizes (learning rates)
- 6.12. Corollary: step sizes decay linearly
1. Related work
1.1. Neural networks can memorize random datasets
Understanding Deep Learning Requires Rethinking Generalization (Zhang et al., 2017)
Typical neural networks trained with stochastic gradient descent can easily memorize a random dataset of the same size as the (real) dataset they were designed for.
1.2. Implicit bias
Implicit bias: from among all the models that fit the training set perfectly in an over-parameterized setting, how does gradient descent find one that generalizes well to unseen data, when such a model exists?
1.3. Uniform convergence may be unable to explain generalization in deep learning.
Uniform Convergence May Be Unable to Explain Generalization in Deep Learning (Nagarajan and Kolter, 2019)
1.4. What makes a pattern “simple” and why are such patterns reliably learned early
A Closer Look at Memorization in Deep Networks
In a study of memorization in shallow fully-connected networks and small convolutional networks on MNIST and CIFAR-10, Arpit et al. (2017) discovered that for real datasets, starting from different random initializations, many examples are consistently classified correctly or incorrectly after one epoch of training. They call these easy and hard examples respectively. They hypothesize that this variability of difficulty in real data "is because the easier examples are explained by some simple patterns, which are reliably learned within the first epoch of training."
1.5. Comments
First, the difficulty of an example is not simply a property of that example (whether it has simple patterns or not), but depends on the relationship of that example to others in the training set (what it shares with other examples).
Second, the dynamics of training, including initialization, can determine the difficulty of examples. Consequently, this view can accommodate the observed phenomenon of adversarial initialization, where examples that are easy to learn with random initialization become significantly harder to learn with a different, specially constructed initialization, and the generalization performance of the network suffers. Any notion of simplicity of patterns intrinsic to an example alone cannot explain adversarial initialization, since the dataset remains the same, and therefore, so do the patterns in the data.
Examples learned late by gradient descent (that is, the hard examples), by themselves, are insufficient to define the decision boundary to the same extent that early (or easy) examples are.
2. Future work
- Better metrics for coherence and tighter bounds
- An explanation for the phenomenon of deep double descent
- Non-vacuous bounds on the generalization gap
- Generalization and width.
- Practical use for new algorithms
3. Regularization techniques
- Robust estimation techniques for the mean gradient
- A coherence-based view of L2 regularization
- Early stopping
4. Coherence
4.1. Boundedness and Scale Invariance
For $g$ and $g'$ drawn independently from a distribution $D$ over gradients, coherence is measured by
$$\alpha(D) \;\equiv\; \frac{\mathbb{E}_{g, g' \sim D}[g \cdot g']}{\mathbb{E}_{g \sim D}[\|g\|^2]}.$$
Then $0 \le \alpha(D) \le 1$, where $\alpha(D) = 0$ iff $\mathbb{E}[g] = 0$ and $\alpha(D) = 1$ iff all the vectors are equal (almost surely).
Furthermore, for non-zero $c$, we have,
$$\alpha(cD) = \alpha(D),$$
where $cD$ denotes the distribution of the random variable $cg$ where $g$ is drawn from $D$.
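The definition and its basic properties are easy to verify numerically. The sketch below is our own illustrative code (the helper `alpha` is not from the paper); it computes $\alpha$ exactly for a finite distribution and checks boundedness and scale invariance:

```python
import numpy as np

def alpha(vecs, probs=None):
    """Coherence alpha(D) = E[g.g'] / E[||g||^2] for a finite distribution.

    For independent g, g' ~ D, E[g.g'] = ||E[g]||^2, so alpha can be
    computed exactly from the support `vecs` and probabilities `probs`.
    """
    vecs = np.asarray(vecs, dtype=float)
    n = len(vecs)
    probs = np.full(n, 1.0 / n) if probs is None else np.asarray(probs, dtype=float)
    mean = probs @ vecs                              # E[g]
    num = mean @ mean                                # E[g.g'] = ||E[g]||^2
    den = probs @ np.einsum('ij,ij->i', vecs, vecs)  # E[||g||^2]
    return num / den

# 0 <= alpha <= 1; alpha = 1 iff all vectors are equal,
# alpha = 0 iff the mean gradient vanishes.
identical = [[1.0, 2.0], [1.0, 2.0]]
opposed   = [[1.0, 0.0], [-1.0, 0.0]]
print(alpha(identical))   # 1.0
print(alpha(opposed))     # 0.0
# Scale invariance: alpha(c*D) = alpha(D) for c != 0.
mixed = [[1.0, 0.0], [0.0, 1.0]]
print(np.isclose(alpha(mixed), alpha(3.0 * np.array(mixed))))  # True
```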
4.2. Stylized mini-batching
Let $g_1, \dots, g_b$ be i.i.d. variables drawn from $D$. Let $D_b$ denote the distribution of the random variable $\frac{1}{b}\sum_{i=1}^{b} g_i$ (the mini-batch mean). We have,
$$\alpha(D_b) = \frac{b\,\alpha(D)}{1 + (b-1)\,\alpha(D)}.$$
Furthermore, $\alpha(D_b) \ge \alpha(D)$, with equality iff $\alpha(D) = 0$ or $\alpha(D) = 1$.
When $\alpha(D) \approx 0$ but non-zero (i.e., we have high gradient diversity), creating mini-batches of size $b$ increases coherence almost $b$ times. But when $\alpha(D) \approx 1$ (i.e., low diversity), there is not much point in creating mini-batches since there is little room for improvement.
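As a quick numerical sanity check (our own illustrative code; the helpers `alpha` and `alpha_minibatch` are not from the paper), the mini-batch distribution can be built exactly for a small uniform support and compared against the closed form $b\alpha/(1+(b-1)\alpha)$:

```python
import itertools
import numpy as np

def alpha(vecs):
    """alpha(D) = ||E[g]||^2 / E[||g||^2] for a uniform finite distribution."""
    vecs = np.asarray(vecs, dtype=float)
    mean = vecs.mean(axis=0)
    return (mean @ mean) / np.mean(np.einsum('ij,ij->i', vecs, vecs))

def alpha_minibatch(vecs, b):
    """alpha of the mean of b i.i.d. draws, built exactly by enumerating
    all b-tuples from the (uniform) support."""
    vecs = np.asarray(vecs, dtype=float)
    batches = [np.mean([vecs[i] for i in idx], axis=0)
               for idx in itertools.product(range(len(vecs)), repeat=b)]
    return alpha(np.array(batches))

vecs = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
a = alpha(vecs)
for b in (2, 3):
    predicted = b * a / (1 + (b - 1) * a)
    assert np.isclose(alpha_minibatch(vecs, b), predicted)
print("alpha(D) =", a, "and alpha(D_b) matches b*a/(1+(b-1)*a) for b = 2, 3")
```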
4.3. Effect of zero vectors
If $D'$ denotes the distribution where with probability $p$ we pick a vector from $D$ and with probability $1-p$ we pick the zero vector, then
$$\alpha(D') = p\,\alpha(D).$$
4.4. Combining orthogonal distributions
If $D''$ denotes the distribution where with probability $p$ we pick a vector from $D_1$ and with probability $1-p$ we pick a vector from $D_2$, and all elements in the support of $D_1$ are orthogonal to those in the support of $D_2$, then we have
$$\alpha(D'') = \frac{p^2\,\alpha(D_1)\,E_1 + (1-p)^2\,\alpha(D_2)\,E_2}{p\,E_1 + (1-p)\,E_2},$$
where $E_i = \mathbb{E}_{g \sim D_i}[\|g\|^2]$.
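Both mixture properties can be checked exactly on small finite distributions. The sketch below is our own illustration (the names `alpha`, `D`, `D1`, `D2` are ours) and assumes the dilution formula $\alpha(D') = p\,\alpha(D)$ and the orthogonal-mixture formula stated above:

```python
import numpy as np

def alpha(vecs, probs):
    """alpha(D) = ||E[g]||^2 / E[||g||^2] for a finite distribution."""
    vecs = np.asarray(vecs, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mean = probs @ vecs
    return (mean @ mean) / (probs @ np.einsum('ij,ij->i', vecs, vecs))

# Base distribution D: uniform over two vectors.
D = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
aD = alpha(D, [0.5, 0.5])

# Effect of zero vectors: with probability p pick from D, else the zero vector.
p = 0.3
Dp = np.vstack([D, np.zeros(3)])
wp = [0.5 * p, 0.5 * p, 1 - p]
assert np.isclose(alpha(Dp, wp), p * aD)   # alpha(D') = p * alpha(D)

# Orthogonal mixture: D1 lives in span(e1, e2), D2 in span(e3).
D1 = D
D2 = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, -1.0]])
a1, a2 = alpha(D1, [0.5, 0.5]), alpha(D2, [0.5, 0.5])
E1 = 0.5 * (D1[0] @ D1[0]) + 0.5 * (D1[1] @ D1[1])   # E[||g||^2] under D1
E2 = 0.5 * (D2[0] @ D2[0]) + 0.5 * (D2[1] @ D2[1])   # E[||g||^2] under D2
mix = np.vstack([D1, D2])
wm = [0.5 * p, 0.5 * p, 0.5 * (1 - p), 0.5 * (1 - p)]
pred = (p**2 * a1 * E1 + (1 - p)**2 * a2 * E2) / (p * E1 + (1 - p) * E2)
assert np.isclose(alpha(mix, wm), pred)
print("zero-vector and orthogonal-mixture identities verified")
```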
4.5. Bound on Expected Difference
For $g$ and $g'$ drawn independently from $D$, expanding $\|g - g'\|^2$ and applying the definition of $\alpha$ gives
$$\mathbb{E}[\|g - g'\|^2] = 2\,(1 - \alpha(D))\,\mathbb{E}[\|g\|^2],$$
so high coherence forces the per-example gradients to be close to each other in expectation.
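The identity $\mathbb{E}\|g - g'\|^2 = 2(1-\alpha)\,\mathbb{E}\|g\|^2$, which follows directly from the definition of $\alpha$, can be verified by enumerating ordered pairs from a finite support (our own illustrative sketch, not code from the paper):

```python
import itertools
import numpy as np

def alpha(vecs):
    """alpha(D) = ||E[g]||^2 / E[||g||^2] for a uniform finite distribution."""
    vecs = np.asarray(vecs, dtype=float)
    mean = vecs.mean(axis=0)
    return (mean @ mean) / np.mean(np.einsum('ij,ij->i', vecs, vecs))

G = np.array([[1.0, 2.0], [3.0, -1.0], [0.0, 1.0]])
# E[||g - g'||^2] over independent draws, by enumerating all ordered pairs.
diff = np.mean([np.sum((G[i] - G[j])**2)
                for i, j in itertools.product(range(len(G)), repeat=2)])
expected = 2 * (1 - alpha(G)) * np.mean(np.einsum('ij,ij->i', G, G))
assert np.isclose(diff, expected)
print("E||g - g'||^2 = 2(1 - alpha) E||g||^2 verified")
```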
4.6. Decomposition
Let $S$ be the support of the distribution $D$. Furthermore, let
$$S \subseteq V_1 \oplus V_2 \oplus \cdots \oplus V_k,$$
where the subspaces $V_i$ are orthogonal to each other, that is, the sum is the orthogonal direct sum of the $V_i$. Projection onto $V_i$ induces a distribution $D_i$ on each $V_i$. Suppose each $\alpha(D_i)$ exists. Then,
$$\alpha(D) = \sum_{i=1}^{k} f_i\,\alpha_i,$$
where $\alpha_i = \alpha(D_i)$ and $E_i = \mathbb{E}_{g \sim D_i}[\|g\|^2]$ and $f_i = E_i / \sum_{j=1}^{k} E_j$.
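The decomposition can be checked numerically by splitting coordinates into orthogonal blocks. This is our own illustrative sketch, assuming $V_1 = \mathrm{span}(e_1, e_2)$ and $V_2 = \mathrm{span}(e_3, e_4)$:

```python
import numpy as np

def alpha(vecs):
    """alpha(D) = ||E[g]||^2 / E[||g||^2] for a uniform finite distribution."""
    vecs = np.asarray(vecs, dtype=float)
    mean = vecs.mean(axis=0)
    return (mean @ mean) / np.mean(np.einsum('ij,ij->i', vecs, vecs))

# Gradients in R^4; take V1 = span(e1, e2) and V2 = span(e3, e4).
G = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 2.0],
              [1.0, 1.0, 0.0, 1.0]])
blocks = [G[:, :2], G[:, 2:]]          # projected distributions D_i on each V_i
E = [np.mean(np.einsum('ij,ij->i', B, B)) for B in blocks]   # E_i
f = np.array(E) / sum(E)               # fraction of squared norm in each V_i
decomposed = sum(fi * alpha(B) for fi, B in zip(f, blocks))
assert np.isclose(alpha(G), decomposed)   # alpha(D) = sum_i f_i alpha(D_i)
print("decomposition identity verified: alpha =", alpha(G))
```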
5. Shortcomings
As a measure of coherence, $\alpha_m/\alpha^{\perp}_m$ over the whole network, being an average, is a blunt instrument; a finer-grained analysis, for example on a per-layer basis, is sometimes necessary.
6. Generalization Theorem of Coherence
6.1. Generalization Gap
The generalization gap of a model $w$ trained on a sample $S = (z_1, \dots, z_m)$ is the difference between its expected and empirical losses:
$$\epsilon_{\text{gen}} \;\equiv\; \mathbb{E}_{z \sim \mathcal{D}}[\ell(w; z)] - \frac{1}{m} \sum_{i=1}^{m} \ell(w; z_i).$$
6.2. Stability
Following Hardt et al. [2016], a randomized algorithm $A$ is $\epsilon$-uniformly stable if for all datasets $S$ and $S'$ that differ in at most one example,
$$\sup_{z}\; \mathbb{E}_A\big[\ell(A(S); z) - \ell(A(S'); z)\big] \le \epsilon.$$
6.3. Stability equals generalization
If $A$ is $\epsilon$-uniformly stable, then the expected generalization gap of $A$ is at most $\epsilon$.
6.4. Smoothness Assumptions
- We assume that $\ell(\cdot\,; z)$ is $L$-Lipschitz and differentiable for every $z$, that is,
$$|\ell(w; z) - \ell(w'; z)| \le L\,\|w - w'\|$$
and, equivalently,
$$\|\nabla \ell(w; z)\| \le L.$$
- We also assume that $\ell(\cdot\,; z)$ is $\beta$-smooth for every $z$. That is, we assume,
$$\|\nabla \ell(w; z) - \nabla \ell(w'; z)\| \le \beta\,\|w - w'\|.$$
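To make these assumptions concrete, here is a small numerical check (our own illustration, not from the paper) that the binary logistic loss on an input with $\|x\| = 1$ satisfies them with $L = 1$ and $\beta = 1/4$:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w, x, y):
    """Gradient of the logistic loss l(w; (x, y)) = log(1 + exp(-y * w.x)),
    a standard example of a convex, L-Lipschitz, beta-smooth loss."""
    m = y * (w @ x)
    s = 1.0 / (1.0 + np.exp(m))   # sigmoid(-m)
    return -y * s * x

x, y = np.array([0.6, -0.8]), 1.0              # ||x|| = 1
L = np.linalg.norm(x)                          # Lipschitz constant: ||x|| = 1
beta = np.linalg.norm(x)**2 / 4                # smoothness: ||x||^2 / 4 = 0.25
for _ in range(1000):
    w1, w2 = rng.normal(size=2), rng.normal(size=2)
    g1, g2 = loss_grad(w1, x, y), loss_grad(w2, x, y)
    assert np.linalg.norm(g1) <= L + 1e-9                            # Lipschitz
    assert np.linalg.norm(g1 - g2) <= beta * np.linalg.norm(w1 - w2) + 1e-9
print("logistic loss satisfies the (L, beta) assumptions with L = 1, beta = 0.25")
```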
6.5. Stability of (Stochastic) Gradient Descent
Let $\chi_{t,i}$ be an indicator variable that is 1 if the $i$th training example is selected in the mini-batch (of size $b$) used at time step $t \in [T]$, and 0 otherwise. Therefore, the SGD update is
$$w_{t+1} = w_t - \frac{\eta_t}{b} \sum_{i=1}^{m} \chi_{t,i}\, \nabla \ell(w_t; z_i).$$
Let $S$ and $S'$ be training sets that differ in a single example, let $w_t$ and $w_t'$ be the corresponding iterates of coupled runs of SGD (using the same mini-batch selections), and let $\delta_t = \|w_t - w_t'\|$; the stability argument proceeds by bounding $\mathbb{E}[\delta_T]$.
6.6. 1-Step Expansion
For $t \in [T]$, a single step of the coupled runs satisfies a recursion of the form
$$\mathbb{E}[\delta_{t+1}] \le (1 + \eta_t \beta)\,\mathbb{E}[\delta_t] + \frac{\eta_t}{b}\,\mathbb{E}\big[\|\nabla \ell(w_t; z_i) - \nabla \ell(w_t; z_i')\|\big],$$
where $z_i$ and $z_i'$ are the examples in which $S$ and $S'$ differ (the $\beta$-smoothness assumption controls the shared examples; the exact constants follow the paper's derivation).
6.7. Unrolling the Recursion
6.8. Bound the difference between examples by the coherence as measured by α
By the bound on expected difference (Section 4.5), the gradient difference on the differing examples satisfies
$$\mathbb{E}\big[\|\nabla \ell(w; z) - \nabla \ell(w; z')\|\big] \le \sqrt{2\,(1 - \alpha)\,\mathbb{E}\big[\|\nabla \ell(w; z)\|^2\big]},$$
so higher coherence yields a tighter stability bound.
6.9. Stability Theorem
6.10. Generalization Theorem
6.11. Corollary: fixed step sizes (learning rates)
If the step sizes are fixed, that is, $\eta_t = \eta$ for all $t \in [T]$, then the generalization bound specializes by substituting this constant rate.
6.12. Corollary: step sizes decay linearly
If we assume, as in Hardt et al. [2016], that step sizes decay linearly, that is, $\eta_t \le c/t$ for some constant $c > 0$, then the generalization bound specializes by substituting this decaying rate.