Definitions, Properties, and Theorems about Coherence in "On the Generalization Mystery in Deep Learning"
- 1. Related work
- 2. Future work
- 3. Regularization techniques
- 4. Coherence
- 5. Shortcomings
- 6. Generalization Theorem of Coherence
- 6.1. Generalization Gap
- 6.2. Stability
- 6.3. Stability equals generalization
- 6.4. Smoothness Assumptions
- 6.5. Stability of (Stochastic) Gradient Descent
- 6.6. 1-Step Expansion
- 6.7. Unrolling the Recursion
- 6.8. Bound the difference between examples by the coherence as measured by α
- 6.9. Stability Theorem
- 6.10. Generalization Theorem
- 6.11. Corollary: fixed step sizes (learning rates)
- 6.12. Corollary: step sizes decay linearly
1. Related work
1.1. Neural networks can memorize random datasets
Understanding Deep Learning Requires Rethinking Generalization (Zhang et al., 2017)
Typical neural networks trained with stochastic gradient descent can easily memorize a random dataset of the same size as the (real) dataset they were designed for.
1.2. Implicit bias
Implicit bias: from among all the models that fit the training set perfectly in an over-parameterized setting, how does gradient descent find one that generalizes well to unseen data, when such a model exists?
1.3. Uniform convergence may be unable to explain generalization in deep learning.
Uniform Convergence May Be Unable to Explain Generalization in Deep Learning (Nagarajan and Kolter, 2019)
1.4. What makes a pattern “simple” and why are such patterns reliably learned early
A Closer Look at Memorization in Deep Networks
In a study of memorization in shallow fully-connected networks and small convolutional networks on MNIST and CIFAR-10, Arpit et al. (2017) discovered that for real datasets, starting from different random initializations, many examples are consistently classified correctly or incorrectly after one epoch of training. They call these easy and hard examples respectively. They hypothesize that this variability of difficulty in real data "is because the easier examples are explained by some simple patterns, which are reliably learned within the first epoch of training."
1.5. Comments
First, the difficulty of an example is not simply a property of that example (whether it has simple patterns or not), but depends on the relationship of that example to others in the training set (what it shares with other examples).
Second, the dynamics of training, including initialization, can determine the difficulty of examples. Consequently, this view can accommodate the observed phenomenon of adversarial initialization, where examples that are easy to learn with random initialization become significantly harder to learn with a different, specially constructed initialization, and the generalization performance of the network suffers. Any notion of simplicity of patterns intrinsic to an example alone cannot explain adversarial initialization, since the dataset remains the same, and therefore, so do the patterns in the data.
Examples learned late by gradient descent (that is, the hard examples), by themselves, are insufficient to define the decision boundary to the same extent that early (or easy) examples are.
2. Future work
- Better metrics for coherence and tighter bounds
- An explanation for the phenomenon of deep double descent
- Non-vacuous bounds on the generalization gap
- Generalization and width.
- Practical use for new algorithms
3. Regularization techniques
- Robust estimation techniques for the mean gradient
- A coherence-based view of L2 regularization
- Early stopping
4. Coherence
4.1. Boundedness and Scale Invariance
For $g$ and $g'$ drawn independently from a distribution $D$ over gradients, coherence is measured by
$$\alpha(D) \;\equiv\; \frac{\mathbb{E}_{g, g' \sim D}[g \cdot g']}{\mathbb{E}_{g \sim D}[\|g\|^2]}.$$
Then $0 \le \alpha(D) \le 1$, where $\alpha(D) = 0$ iff $\mathbb{E}[g] = 0$ and $\alpha(D) = 1$ iff all the vectors are equal (almost surely).
Furthermore, for non-zero $c$, we have,
$$\alpha(cD) = \alpha(D),$$
where $cD$ denotes the distribution of the random variable $cg$ where $g$ is drawn from $D$.
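The definition and its basic properties are easy to verify numerically. The sketch below is our own illustrative code (the helper `alpha` is not from the paper); it computes $\alpha$ exactly for a finite distribution and checks boundedness and scale invariance:

```python
import numpy as np

def alpha(vecs, probs=None):
    """Coherence alpha(D) = E[g.g'] / E[||g||^2] for a finite distribution.

    For independent g, g' ~ D, E[g.g'] = ||E[g]||^2, so alpha can be
    computed exactly from the support `vecs` and probabilities `probs`.
    """
    vecs = np.asarray(vecs, dtype=float)
    n = len(vecs)
    probs = np.full(n, 1.0 / n) if probs is None else np.asarray(probs, dtype=float)
    mean = probs @ vecs                              # E[g]
    num = mean @ mean                                # E[g.g'] = ||E[g]||^2
    den = probs @ np.einsum('ij,ij->i', vecs, vecs)  # E[||g||^2]
    return num / den

# 0 <= alpha <= 1; alpha = 1 iff all vectors are equal,
# alpha = 0 iff the mean gradient vanishes.
identical = [[1.0, 2.0], [1.0, 2.0]]
opposed   = [[1.0, 0.0], [-1.0, 0.0]]
print(alpha(identical))   # 1.0
print(alpha(opposed))     # 0.0
# Scale invariance: alpha(c*D) = alpha(D) for c != 0.
mixed = [[1.0, 0.0], [0.0, 1.0]]
print(np.isclose(alpha(mixed), alpha(3.0 * np.array(mixed))))  # True
```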
4.2. Stylized mini-batching
Let $g_1, \dots, g_b$ be i.i.d. variables drawn from $D$. Let $D_b$ denote the distribution of the random variable $\frac{1}{b}\sum_{i=1}^{b} g_i$ (the mini-batch mean). We have,
$$\alpha(D_b) = \frac{b\,\alpha(D)}{1 + (b-1)\,\alpha(D)}.$$
Furthermore, $\alpha(D_b) \ge \alpha(D)$, with equality iff $\alpha(D) = 0$ or $\alpha(D) = 1$.
When $\alpha(D) \approx 0$ but non-zero (i.e., we have high gradient diversity), creating mini-batches of size $b$ increases coherence almost $b$ times. But when $\alpha(D) \approx 1$ (i.e., low diversity), there is not much point in creating mini-batches since there is little room for improvement.
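As a quick numerical sanity check (our own illustrative code; the helpers `alpha` and `alpha_minibatch` are not from the paper), the mini-batch distribution can be built exactly for a small uniform support and compared against the closed form $b\alpha/(1+(b-1)\alpha)$:

```python
import itertools
import numpy as np

def alpha(vecs):
    """alpha(D) = ||E[g]||^2 / E[||g||^2] for a uniform finite distribution."""
    vecs = np.asarray(vecs, dtype=float)
    mean = vecs.mean(axis=0)
    return (mean @ mean) / np.mean(np.einsum('ij,ij->i', vecs, vecs))

def alpha_minibatch(vecs, b):
    """alpha of the mean of b i.i.d. draws, built exactly by enumerating
    all b-tuples from the (uniform) support."""
    vecs = np.asarray(vecs, dtype=float)
    batches = [np.mean([vecs[i] for i in idx], axis=0)
               for idx in itertools.product(range(len(vecs)), repeat=b)]
    return alpha(np.array(batches))

vecs = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
a = alpha(vecs)
for b in (2, 3):
    predicted = b * a / (1 + (b - 1) * a)
    assert np.isclose(alpha_minibatch(vecs, b), predicted)
print("alpha(D) =", a, "and alpha(D_b) matches b*a/(1+(b-1)*a) for b = 2, 3")
```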
4.3. Effect of zero vectors
If $D'$ denotes the distribution where with probability $p$ we pick a vector from $D$ and with probability $1-p$ we pick the zero vector, then
$$\alpha(D') = p\,\alpha(D).$$
4.4. Combining orthogonal distributions
If $D''$ denotes the distribution where with probability $p$ we pick a vector from $D_1$ and with probability $1-p$ we pick a vector from $D_2$, and all elements in the support of $D_1$ are orthogonal to those in the support of $D_2$, then we have
$$\alpha(D'') = \frac{p^2\,\alpha(D_1)\,E_1 + (1-p)^2\,\alpha(D_2)\,E_2}{p\,E_1 + (1-p)\,E_2},$$
where $E_i = \mathbb{E}_{g \sim D_i}[\|g\|^2]$.
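Both mixture properties can be checked exactly on small finite distributions. The sketch below is our own illustration (the names `alpha`, `D`, `D1`, `D2` are ours) and assumes the dilution formula $\alpha(D') = p\,\alpha(D)$ and the orthogonal-mixture formula stated above:

```python
import numpy as np

def alpha(vecs, probs):
    """alpha(D) = ||E[g]||^2 / E[||g||^2] for a finite distribution."""
    vecs = np.asarray(vecs, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mean = probs @ vecs
    return (mean @ mean) / (probs @ np.einsum('ij,ij->i', vecs, vecs))

# Base distribution D: uniform over two vectors.
D = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
aD = alpha(D, [0.5, 0.5])

# Effect of zero vectors: with probability p pick from D, else the zero vector.
p = 0.3
Dp = np.vstack([D, np.zeros(3)])
wp = [0.5 * p, 0.5 * p, 1 - p]
assert np.isclose(alpha(Dp, wp), p * aD)   # alpha(D') = p * alpha(D)

# Orthogonal mixture: D1 lives in span(e1, e2), D2 in span(e3).
D1 = D
D2 = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, -1.0]])
a1, a2 = alpha(D1, [0.5, 0.5]), alpha(D2, [0.5, 0.5])
E1 = 0.5 * (D1[0] @ D1[0]) + 0.5 * (D1[1] @ D1[1])   # E[||g||^2] under D1
E2 = 0.5 * (D2[0] @ D2[0]) + 0.5 * (D2[1] @ D2[1])   # E[||g||^2] under D2
mix = np.vstack([D1, D2])
wm = [0.5 * p, 0.5 * p, 0.5 * (1 - p), 0.5 * (1 - p)]
pred = (p**2 * a1 * E1 + (1 - p)**2 * a2 * E2) / (p * E1 + (1 - p) * E2)
assert np.isclose(alpha(mix, wm), pred)
print("zero-vector and orthogonal-mixture identities verified")
```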
4.5. Bound on Expected Difference
For $g$ and $g'$ drawn independently from $D$, expanding $\|g - g'\|^2$ and applying the definition of $\alpha$ gives
$$\mathbb{E}[\|g - g'\|^2] = 2\,(1 - \alpha(D))\,\mathbb{E}[\|g\|^2],$$
so high coherence forces the per-example gradients to be close to each other in expectation.
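The identity $\mathbb{E}\|g - g'\|^2 = 2(1-\alpha)\,\mathbb{E}\|g\|^2$, which follows directly from the definition of $\alpha$, can be verified by enumerating ordered pairs from a finite support (our own illustrative sketch, not code from the paper):

```python
import itertools
import numpy as np

def alpha(vecs):
    """alpha(D) = ||E[g]||^2 / E[||g||^2] for a uniform finite distribution."""
    vecs = np.asarray(vecs, dtype=float)
    mean = vecs.mean(axis=0)
    return (mean @ mean) / np.mean(np.einsum('ij,ij->i', vecs, vecs))

G = np.array([[1.0, 2.0], [3.0, -1.0], [0.0, 1.0]])
# E[||g - g'||^2] over independent draws, by enumerating all ordered pairs.
diff = np.mean([np.sum((G[i] - G[j])**2)
                for i, j in itertools.product(range(len(G)), repeat=2)])
expected = 2 * (1 - alpha(G)) * np.mean(np.einsum('ij,ij->i', G, G))
assert np.isclose(diff, expected)
print("E||g - g'||^2 = 2(1 - alpha) E||g||^2 verified")
```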
4.6. Decomposition
Let $S$ be the support of the distribution $D$. Furthermore, let
$$S \subseteq V_1 \oplus V_2 \oplus \cdots \oplus V_k,$$
where the subspaces $V_i$ are orthogonal to each other, that is, the sum is the orthogonal direct sum of the $V_i$. Projection onto $V_i$ induces a distribution $D_i$ on each $V_i$. Suppose each $\alpha(D_i)$ exists. Then,
$$\alpha(D) = \sum_{i=1}^{k} f_i\,\alpha_i,$$
where $\alpha_i = \alpha(D_i)$ and $E_i = \mathbb{E}_{g \sim D_i}[\|g\|^2]$ and $f_i = E_i / \sum_{j=1}^{k} E_j$.
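The decomposition can be checked numerically by splitting coordinates into orthogonal blocks. This is our own illustrative sketch, assuming $V_1 = \mathrm{span}(e_1, e_2)$ and $V_2 = \mathrm{span}(e_3, e_4)$:

```python
import numpy as np

def alpha(vecs):
    """alpha(D) = ||E[g]||^2 / E[||g||^2] for a uniform finite distribution."""
    vecs = np.asarray(vecs, dtype=float)
    mean = vecs.mean(axis=0)
    return (mean @ mean) / np.mean(np.einsum('ij,ij->i', vecs, vecs))

# Gradients in R^4; take V1 = span(e1, e2) and V2 = span(e3, e4).
G = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 2.0],
              [1.0, 1.0, 0.0, 1.0]])
blocks = [G[:, :2], G[:, 2:]]          # projected distributions D_i on each V_i
E = [np.mean(np.einsum('ij,ij->i', B, B)) for B in blocks]   # E_i
f = np.array(E) / sum(E)               # fraction of squared norm in each V_i
decomposed = sum(fi * alpha(B) for fi, B in zip(f, blocks))
assert np.isclose(alpha(G), decomposed)   # alpha(D) = sum_i f_i alpha(D_i)
print("decomposition identity verified: alpha =", alpha(G))
```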
5. Shortcomings
As a measure of coherence, $\alpha_m/\alpha^{\perp}_m$ over the whole network, being an average, is a blunt instrument; a finer-grained analysis, for example on a per-layer basis, is sometimes necessary.
6. Generalization Theorem of Coherence
6.1. Generalization Gap
The generalization gap of a model $w$ trained on a sample $S = (z_1, \dots, z_m)$ is the difference between its expected and empirical losses:
$$\epsilon_{\text{gen}} \;\equiv\; \mathbb{E}_{z \sim \mathcal{D}}[\ell(w; z)] - \frac{1}{m} \sum_{i=1}^{m} \ell(w; z_i).$$
6.2. Stability
Following Hardt et al. [2016], a randomized algorithm $A$ is $\epsilon$-uniformly stable if for all datasets $S$ and $S'$ that differ in at most one example,
$$\sup_{z}\; \mathbb{E}_A\big[\ell(A(S); z) - \ell(A(S'); z)\big] \le \epsilon.$$
6.3. Stability equals generalization
If $A$ is $\epsilon$-uniformly stable, then the expected generalization gap of $A$ is at most $\epsilon$.
6.4. Smoothness Assumptions
- We assume that $\ell(\cdot\,; z)$ is $L$-Lipschitz and differentiable for every $z$, that is,
$$|\ell(w; z) - \ell(w'; z)| \le L\,\|w - w'\|$$
and, equivalently,
$$\|\nabla \ell(w; z)\| \le L.$$
- We also assume that $\ell(\cdot\,; z)$ is $\beta$-smooth for every $z$. That is, we assume,
$$\|\nabla \ell(w; z) - \nabla \ell(w'; z)\| \le \beta\,\|w - w'\|.$$
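To make these assumptions concrete, here is a small numerical check (our own illustration, not from the paper) that the binary logistic loss on an input with $\|x\| = 1$ satisfies them with $L = 1$ and $\beta = 1/4$:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w, x, y):
    """Gradient of the logistic loss l(w; (x, y)) = log(1 + exp(-y * w.x)),
    a standard example of a convex, L-Lipschitz, beta-smooth loss."""
    m = y * (w @ x)
    s = 1.0 / (1.0 + np.exp(m))   # sigmoid(-m)
    return -y * s * x

x, y = np.array([0.6, -0.8]), 1.0              # ||x|| = 1
L = np.linalg.norm(x)                          # Lipschitz constant: ||x|| = 1
beta = np.linalg.norm(x)**2 / 4                # smoothness: ||x||^2 / 4 = 0.25
for _ in range(1000):
    w1, w2 = rng.normal(size=2), rng.normal(size=2)
    g1, g2 = loss_grad(w1, x, y), loss_grad(w2, x, y)
    assert np.linalg.norm(g1) <= L + 1e-9                            # Lipschitz
    assert np.linalg.norm(g1 - g2) <= beta * np.linalg.norm(w1 - w2) + 1e-9
print("logistic loss satisfies the (L, beta) assumptions with L = 1, beta = 0.25")
```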
6.5. Stability of (Stochastic) Gradient Descent
Let $\chi_{t,i}$ be an indicator variable that is 1 if the $i$th training example is selected in the mini-batch (of size $b$) used at time step $t \in [T]$, and 0 otherwise. Therefore, the SGD update is
$$w_{t+1} = w_t - \frac{\eta_t}{b} \sum_{i=1}^{m} \chi_{t,i}\, \nabla \ell(w_t; z_i).$$
Let $S$ and $S'$ be training sets that differ in a single example, let $w_t$ and $w_t'$ be the corresponding iterates of coupled runs of SGD (using the same mini-batch selections), and let $\delta_t = \|w_t - w_t'\|$; the stability argument proceeds by bounding $\mathbb{E}[\delta_T]$.
6.6. 1-Step Expansion
For $t \in [T]$, a single step of the coupled runs satisfies a recursion of the form
$$\mathbb{E}[\delta_{t+1}] \le (1 + \eta_t \beta)\,\mathbb{E}[\delta_t] + \frac{\eta_t}{b}\,\mathbb{E}\big[\|\nabla \ell(w_t; z_i) - \nabla \ell(w_t; z_i')\|\big],$$
where $z_i$ and $z_i'$ are the examples in which $S$ and $S'$ differ (the $\beta$-smoothness assumption controls the shared examples; the exact constants follow the paper's derivation).
6.7. Unrolling the Recursion
6.8. Bound the difference between examples by the coherence as measured by α
By the bound on expected difference (Section 4.5), the gradient difference on the differing examples satisfies
$$\mathbb{E}\big[\|\nabla \ell(w; z) - \nabla \ell(w; z')\|\big] \le \sqrt{2\,(1 - \alpha)\,\mathbb{E}\big[\|\nabla \ell(w; z)\|^2\big]},$$
so higher coherence yields a tighter stability bound.
6.9. Stability Theorem
6.10. Generalization Theorem
6.11. Corollary: fixed step sizes (learning rates)
If the step sizes are fixed, that is, $\eta_t = \eta$ for all $t \in [T]$, then the generalization bound specializes by substituting this constant rate.
6.12. Corollary: step sizes decay linearly
If we assume, as in Hardt et al. [2016], that step sizes decay linearly, that is, $\eta_t \le c/t$ for some constant $c > 0$, then the generalization bound specializes by substituting this decaying rate.