model compression
quantization
- Quantization in PyTorch
PTQ and QAT
- PTQ(Post Training Quantization)
- QAT(Quantization Aware Training)
  - Fake quantization in QAT
paper(OneBit: Towards Extremely Low-bit Large Language Models)
References

model compression

Four common approaches to model compression (making models lightweight)

pruning
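Set some elements of the weight matrix W to 0, choosing them so that the effect on the output is small.
If the zeros are placed by whole rows, columns, or blocks, this is called structured pruning; otherwise it is unstructured pruning.
During pruning, the remaining weights can be adjusted to compensate for the error that pruning introduces.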
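The two pruning styles can be sketched in a few lines. This is only an illustration: using the absolute value as the importance criterion, and the row-wise L1 norm for the structured case, are assumptions not taken from the notes above, and the function names and the example matrix are made up.

```python
def prune_unstructured(W, k):
    """Zero the k entries of W with the smallest absolute value."""
    positions = sorted((abs(v), i, j) for i, row in enumerate(W) for j, v in enumerate(row))
    pruned = [row[:] for row in W]          # copy so the input stays untouched
    for _, i, j in positions[:k]:
        pruned[i][j] = 0.0
    return pruned


def prune_rows(W, n_rows):
    """Structured variant: zero the n_rows rows with the smallest L1 norm."""
    order = sorted(range(len(W)), key=lambda i: sum(abs(v) for v in W[i]))
    drop = set(order[:n_rows])
    return [[0.0] * len(row) if i in drop else row[:] for i, row in enumerate(W)]


W = [[0.9, -0.01, 0.4],
     [0.05, 0.3, -0.03],
     [-0.7, 0.6, 0.02]]

unstructured = prune_unstructured(W, k=3)   # zeros scattered across rows
structured = prune_rows(W, n_rows=1)        # one whole row removed
print(unstructured)
print(structured)
```

Both calls zero three entries, but only the structured result leaves a pattern (an all-zero row) that can be dropped from the matrix entirely. The compensation step mentioned above is not shown.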
knowledge distillation
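Train a smaller model, using a large model (the teacher model) to guide the small one (the student model). The student must not only fit the data but also approximate the teacher's output distribution. This can be done through the definition of the loss function, for example $l = l_{1} + α l_{2}$, with one term for fitting the data and the other for matching the teacher.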
matrix decomposition
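Factor a large weight matrix into several smaller (low-rank) matrices, and use the low-rank product to approximate the original weight matrix. This can greatly reduce the amount of computation in the factored model.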
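As a rough worked example of the saving, assume a dense m x n matrix is replaced by the product of an m x r and an r x n factor. The sizes below are illustrative and do not come from any particular model.

```python
def dense_cost(m, n):
    """Parameters (and multiply-adds per input vector) of a dense m x n matrix."""
    return m * n


def low_rank_cost(m, n, r):
    """Same count after factoring into an m x r and an r x n matrix."""
    return r * (m + n)


m, n, r = 4096, 4096, 64          # hypothetical layer size and rank
full = dense_cost(m, n)
factored = low_rank_cost(m, n, r)
ratio = full / factored
print(full, factored, ratio)
```

The saving only materializes when r is much smaller than both m and n; at r = min(m, n) the factored form is larger than the original.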
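quantization
Convert operations on real numbers into operations on integers, which speeds up computation and reduces storage requirements.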

quantization

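Quantization converts floating-point storage and arithmetic into integer storage and arithmetic.
Computation on $[a, b]$ is mapped to $[c, d]$, where $[a, b]$ is a continuous interval and $[c, d]$ is a discrete range of integers.
Let $x_{I}$ be the quantized value of $x_{R}$, and let $x_{Q}$ be the value obtained by dequantizing $x_{I}$.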

$x_{I} = \mathrm{round}\left((x_{R} - a) \frac{d - c}{b - a} + c\right)$
$x_{Q} = (x_{I} - c) \frac{b - a}{d - c} + a$

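Denote $\frac{b - a}{d - c}$ by $s$.

$x_{I} = \mathrm{round}\left(\frac{x_{R} - a}{s} + c\right) \quad (1)$

$x_{Q} = (x_{I} - c) s + a \quad (2)$

Substituting $x_{I}$ into $x_{Q}$ (the integer $c$ can be pulled out of the rounding and cancels):

$x_{Q} = \mathrm{round}\left(\frac{x_{R} - a}{s}\right) s + a \quad (3)$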

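If $x_{R}$ is not in $[a, b]$, then $x_{I}$ will very likely fall outside $[c, d]$ as well.
If $x_{I}$ falls outside $[c, d]$, take the nearer boundary value.

$x_{I} = \mathrm{clip}\left(\mathrm{round}\left(\frac{x_{R} - a}{s} + c\right), c, d\right) \quad (4)$

$x_{Q} = \left(\mathrm{clip}\left(\mathrm{round}\left(\frac{x_{R} - a}{s} + c\right), c, d\right) - c\right) s + a \quad (5)$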

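Apart from the clip in (4), equations (1), (2), (3), and (4) do not depend on the intervals $[a, b]$ and $[c, d]$ themselves. They can be viewed as a linear transformation of the whole number line, where $s$ is the scaling factor and $a$ and $c$ determine the translation.

Assume all values are in the normal range and none falls outside $[c, d]$, so the clip can be ignored (that is, use $x_{Q}$ from (3)).
$x_{Q} = x_{R} + ϵ$

$ϵ = x_{Q} - x_{R} = \mathrm{round}\left(\frac{x_{R} - a}{s}\right) s + a - x_{R} = \left(\frac{x_{R} - a}{s} + δ\right) s + a - x_{R} = sδ$, where the rounding error $δ \sim U(-0.5, 0.5)$ is modeled as uniformly distributed.

Therefore $ϵ \sim U (- 0.5 s, 0.5 s)$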
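The error model can be checked empirically. The sketch below assumes inputs drawn uniformly from $[a, b] = [-1, 1]$ and an 8-bit target range; both are illustrative choices. It compares the sample statistics of $ϵ$ with those of $U(-0.5s, 0.5s)$, whose variance is $s^{2}/12$. Python's built-in `round` rounds ties to even, which does not matter for continuous inputs.

```python
import random


def quant_dequant(x, a, s):
    """Equation (3): quantize then dequantize, without clipping."""
    return round((x - a) / s) * s + a


random.seed(0)
a, b = -1.0, 1.0          # real interval [a, b] (illustrative)
c, d = 0, 255             # integer interval [c, d] (illustrative)
s = (b - a) / (d - c)

samples = [random.uniform(a, b) for _ in range(100_000)]
errors = [quant_dequant(x, a, s) - x for x in samples]

max_abs = max(abs(e) for e in errors)
mean = sum(errors) / len(errors)
var = sum((e - mean) ** 2 for e in errors) / len(errors)

print("max |eps| <= s/2:", max_abs <= 0.5 * s + 1e-12)
print("sample variance:", var, "s^2/12:", s * s / 12)
```

The uniform model is an approximation. It holds well when the inputs are spread over many quantization steps, as they are here.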

Quantization in PyTorch

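1. Determine the quantization parameters scale and zero_point from the distribution (or the extrema) of the weights and data

For example, 8-bit symmetric and asymmetric quantization:

If symmetric:
$s = 2 \max(|x_{min}|, x_{max}) / (Q_{max} - Q_{min})$
$z = \begin{cases} 0 & \text{if dtype is qint8} \\ 128 & \text{otherwise} \end{cases}$

Otherwise:
$s = (x_{max} - x_{min}) / (Q_{max} - Q_{min})$
$z = Q_{min} - \mathrm{round}(x_{min} / s)$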

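The maximum and minimum used here are not necessarily the "true" maximum and minimum.

There are several ways to determine the extrema:

The true extrema (running minimum and maximum over the data seen so far):
$x_{min} = \begin{cases} \min(X) & \text{if } x_{min} = \text{None} \\ \min(x_{min}, \min(X)) & \text{otherwise} \end{cases}$
$x_{max} = \begin{cases} \max(X) & \text{if } x_{max} = \text{None} \\ \max(x_{max}, \max(X)) & \text{otherwise} \end{cases}$
A moving average of the extrema:
$x_{min} = \begin{cases} \min(X) & \text{if } x_{min} = \text{None} \\ (1 - c) x_{min} + c \min(X) & \text{otherwise} \end{cases}$
$x_{max} = \begin{cases} \max(X) & \text{if } x_{max} = \text{None} \\ (1 - c) x_{max} + c \max(X) & \text{otherwise} \end{cases}$
Another approach determines the extrema from the distribution of the data. (I have not yet looked into how exactly this is computed.)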
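The formulas of step 1 translate almost line by line into code. The classes and the function below are a plain-Python sketch of those formulas, not PyTorch's own observer classes. The names `RunningMinMax`, `MovingAverageMinMax`, and `qparams`, the `signed` flag standing in for the dtype check, and the example batches are all made up for illustration.

```python
class RunningMinMax:
    """Tracks the true minimum and maximum over all batches seen so far."""

    def __init__(self):
        self.x_min = None
        self.x_max = None

    def observe(self, batch):
        lo, hi = min(batch), max(batch)
        self.x_min = lo if self.x_min is None else min(self.x_min, lo)
        self.x_max = hi if self.x_max is None else max(self.x_max, hi)


class MovingAverageMinMax(RunningMinMax):
    """Tracks an exponential moving average of the per-batch extrema."""

    def __init__(self, c=0.01):
        super().__init__()
        self.c = c                      # averaging constant from the formula

    def observe(self, batch):
        lo, hi = min(batch), max(batch)
        self.x_min = lo if self.x_min is None else (1 - self.c) * self.x_min + self.c * lo
        self.x_max = hi if self.x_max is None else (1 - self.c) * self.x_max + self.c * hi


def qparams(x_min, x_max, q_min, q_max, symmetric, signed=True):
    """Scale s and zero point z following the symmetric / asymmetric formulas."""
    if symmetric:
        s = 2 * max(abs(x_min), x_max) / (q_max - q_min)
        z = 0 if signed else 128
    else:
        s = (x_max - x_min) / (q_max - q_min)
        z = q_min - round(x_min / s)
    return s, z


batches = [[-1.0, 0.5, 2.0], [-3.0, 1.0, 1.5]]
running, averaged = RunningMinMax(), MovingAverageMinMax(c=0.5)
for batch in batches:
    running.observe(batch)
    averaged.observe(batch)

s_asym, z_asym = qparams(running.x_min, running.x_max, 0, 255, symmetric=False)
s_sym, z_sym = qparams(running.x_min, running.x_max, -128, 127, symmetric=True)
print(running.x_min, running.x_max, averaged.x_min, averaged.x_max)
print(s_asym, z_asym, s_sym, z_sym)
```

The moving average reacts less to a single extreme batch than the running extrema do, which is why the two observers report different ranges for the same data.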

2. Quantize

$x_{I} = \mathrm{round}\left(\frac{x_{R}}{s}\right) + z$

$x_{Q} = (x_{I} - z) s$

$x_{Q} = \mathrm{round}\left(\frac{x_{R}}{s}\right) s$

With this scheme, the integer $z$ corresponds to the real value 0.

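Quantization error in PyTorch

The quantization error is the same as the one derived earlier.

$ϵ = x_{Q} - x_{R} = \mathrm{round}\left(\frac{x_{R}}{s}\right) s - x_{R} = \left(\frac{x_{R}}{s} + δ\right) s - x_{R} = sδ$, where the rounding error $δ \sim U(-0.5, 0.5)$ is modeled as uniformly distributed.

Therefore $ϵ \sim U (- 0.5 s, 0.5 s)$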
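A small sketch of the scale / zero-point form, assuming an unsigned 8-bit range and parameters chosen for the real range $[-3, 2]$. The clip is added here so that out-of-range inputs saturate; it is not part of the formulas in step 2. The function names are made up.

```python
def quantize(x, s, z, q_min=0, q_max=255):
    """x_I = clip(round(x / s) + z, q_min, q_max)."""
    return max(q_min, min(q_max, round(x / s) + z))


def dequantize(x_i, s, z):
    """x_Q = (x_I - z) * s."""
    return (x_i - z) * s


s, z = 5 / 255, 153        # illustrative parameters for the range [-3, 2]

zero_code = quantize(0.0, s, z)             # real 0 maps to the integer z
zero_back = dequantize(zero_code, s, z)     # and back to exactly 0.0

in_range = [-3.0, -1.234, 0.77, 2.0]
round_trip_errors = [dequantize(quantize(x, s, z), s, z) - x for x in in_range]

saturated_high = quantize(10.0, s, z)       # clipped to q_max
saturated_low = quantize(-10.0, s, z)       # clipped to q_min
print(zero_code, zero_back, round_trip_errors, saturated_high, saturated_low)
```

Mapping real 0 to an exact integer matters in practice because zero padding and ReLU outputs then incur no quantization error.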

PTQ and QAT

In both PTQ and QAT, the actual quantization step is carried out after training has finished.

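PTQ(Post Training Quantization)

Quantize the model directly.

Depending on whether the quantization parameters of the activations are fixed ahead of time, PTQ is split into PTDQ and PTSQ.

PTDQ (D for Dynamic): only the weights are quantized ahead of time, because the range of the activations (and hence their scale and zero_point) is unknown. At inference time, once an activation has been computed, its quantization parameters are determined and the activation is then quantized. (My understanding here may not be entirely accurate.)
PTSQ (S for Static): both the weights and the activations are quantized. The activation range is not known exactly, but it can be estimated from available data.

Before the model is quantized, Observers are inserted at the nodes that need to be quantized. They collect statistics and compute the extrema, which determine the quantization parameters.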
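The difference between the two variants comes down to when the activation range is measured. The following is a schematic in plain Python, not PyTorch's API: the helper names, the calibration data, and the test activation are hypothetical, and only asymmetric unsigned 8-bit quantization is shown.

```python
def affine_qparams(x_min, x_max, q_min=0, q_max=255):
    s = (x_max - x_min) / (q_max - q_min)
    z = q_min - round(x_min / s)
    return s, z


def quantize(xs, s, z, q_min=0, q_max=255):
    return [max(q_min, min(q_max, round(x / s) + z)) for x in xs]


def dequantize(qs, s, z):
    return [(q - z) * s for q in qs]


# Static (PTSQ): the range is estimated once, from calibration data.
calibration = [[-1.0, 0.2, 0.8], [-0.5, 4.0, 0.3]]
static_params = affine_qparams(min(map(min, calibration)), max(map(max, calibration)))


def static_quantize(activation):
    return quantize(activation, *static_params)


# Dynamic (PTDQ): the range is measured on each activation at inference time.
def dynamic_quantize(activation):
    params = affine_qparams(min(activation), max(activation))
    return quantize(activation, *params), params


activation = [-2.0, 0.0, 6.0]      # wider than anything seen during calibration
static_q = static_quantize(activation)
dynamic_q, dynamic_params = dynamic_quantize(activation)

static_back = dequantize(static_q, *static_params)
dynamic_back = dequantize(dynamic_q, *dynamic_params)
print(static_q, static_back)
print(dynamic_q, dynamic_back)
```

With the static parameters the out-of-range values saturate at the calibrated bounds, while the dynamic parameters follow the actual activation at the price of computing the range on every call.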

QAT(Quantization Aware Training)

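1. Obtain a trained model.
2. Insert fake quantization nodes and fine-tune on some dataset. (The model determines the quantization parameters during fine-tuning.)
3. Quantize the model using the parameters obtained in step 2.

QAT generally uses static quantization only.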

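Fake quantization in QAT

Fake quantization = quantization + dequantization. With fake quantization, the model's parameters are all still floating-point numbers.

$x_{Q} = \left(\mathrm{clip}\left(\mathrm{round}\left(\frac{x_{R} - a}{s} + c\right), c, d\right) - c\right) s + a$

During fine-tuning the model simulates quantization. The fake quantization nodes also act as Observers.

Backpropagation through the fake quantization nodes uses the Straight-Through Estimator (STE).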

Quantization is discrete-valued, and thus the derivative is 0 almost everywhere.

$\frac{\partial Q(W)}{\partial W} = 0$

The neural network will learn nothing since gradients become 0 and the weights won’t get updated.

$g_{W} = \frac{\partial L}{\partial W} = \frac{\partial L}{\partial Q(W)} \cdot \frac{\partial Q(W)}{\partial W} = 0$

Straight-Through Estimator (STE) simply passes the gradients through the quantization as if it had been the identity function.

$g_{W} = \frac{\partial L}{\partial W} = \frac{\partial L}{\partial Q(W)}$

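In other words, a "fake quantization" node is inserted after every node that is going to be quantized. In the forward pass, the fake quantization node simulates the quantization error, so the model adapts to that error during fine-tuning. In the backward pass, the gradient with respect to the fake quantization node's output is computed first, and STE then uses [the gradient at the fake quantization node] as [the gradient of the node that is going to be quantized].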
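A toy example makes the effect of STE concrete. It assumes a single scalar weight, a squared-error loss, and a fake quantizer on a grid of step 0.1 without clipping; all names and numbers are illustrative. The finite-difference gradient through the quantizer is zero, while the STE gradient still moves the weight.

```python
def fake_quant(w, s=0.1):
    """Quantize then dequantize on a grid of step s (no clipping)."""
    return round(w / s) * s


def loss(w, x, y, s=0.1):
    return (fake_quant(w, s) * x - y) ** 2


def true_grad(w, x, y, s=0.1, h=1e-6):
    """Finite-difference gradient; fake_quant is piecewise constant, so this is 0 almost everywhere."""
    return (loss(w + h, x, y, s) - loss(w - h, x, y, s)) / (2 * h)


def ste_grad(w, x, y, s=0.1):
    """STE: treat d fake_quant / d w as 1, so dL/dw is taken to be dL/dQ(w)."""
    return 2 * (fake_quant(w, s) * x - y) * x


x, y = 1.0, 0.7           # toy data: the best weight on the grid is 0.7
w = 0.22                  # latent floating-point weight
initial_loss = loss(w, x, y)
stuck_grad = true_grad(w, x, y)

for _ in range(50):       # plain gradient descent, learning rate 0.1
    w -= 0.1 * ste_grad(w, x, y)

final_loss = loss(w, x, y)
print(stuck_grad, initial_loss, final_loss, fake_quant(w))
```

The latent weight stays in floating point throughout, and only its quantized value enters the forward pass, which is the same split that fake quantization nodes create during QAT.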

paper(OneBit: Towards Extremely Low-bit Large Language Models)

rank-1 approximation

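Singular value decomposition
$A = U Σ V^{T}$
Take the largest singular value ( $σ_{1}$ ) together with its left and right singular vectors
$\hat{A} = u_{1} σ_{1} v_{1}^{T}$

(Section 3.5 of "Foundations of Data Science" proves that this is the best rank-1 approximation in the Frobenius norm; I have not yet read the proof.)

$W = W_{sign} ⊙ W_{value}$ (the Hadamard product of the sign and the absolute value)

$W_{value} \approx a b^{T}$ (that is, a rank-1 decomposition of the matrix of absolute values)

$W \approx W_{sign} ⊙ (a b^{T})$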
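The decomposition can be tried on a small matrix. The sketch below finds the top singular triple with power iteration, which is an implementation choice of this example rather than anything prescribed by the paper. It also assumes a nonzero matrix, a starting vector that is not orthogonal to $v_{1}$, and the convention that the sign of 0 is +1. The example $W$ is contrived so that $|W|$ is exactly rank 1, which makes the reconstruction exact even though $W$ itself has rank 2.

```python
import math


def top_singular(A, iters=200):
    """Power iteration on A^T A; returns (sigma_1, u_1, v_1) for a list-of-lists matrix."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        AtAv = [sum(A[i][j] * Av[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(t * t for t in AtAv))
        v = [t / norm for t in AtAv]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(t * t for t in Av))
    return sigma, [t / sigma for t in Av], v


def svid(W):
    """Split W into a sign matrix and a rank-1 factorization a b^T of |W|."""
    sign = [[1.0 if w >= 0 else -1.0 for w in row] for row in W]
    abs_W = [[abs(w) for w in row] for row in W]
    sigma, u, v = top_singular(abs_W)
    root = math.sqrt(sigma)                  # split sigma_1 evenly between a and b
    return sign, [root * t for t in u], [root * t for t in v]


W = [[0.5, -1.0],
     [2.0, 4.0]]
sign, a, b = svid(W)
approx = [[sign[i][j] * a[i] * b[j] for j in range(2)] for i in range(2)]
print(approx)
```

For a general matrix the reconstruction is only approximate; the exact match here comes from the contrived choice of $W$.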

proposition 1

$X W^{T} \approx [(X ⊙ b^{T}) W_{sign}^{T}] ⊙ a^{T}$
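The purpose of this proposition is to arrive at the desired network structure. The $b$ and $a$ here are the $g$ and $h$ of the model architecture. The operation $X ⊙ b^{T}$ broadcasts $b^{T}$ (a row vector, so broadcasting means repeating $b^{T}$ row after row until it has the same shape as $X$) and then takes the Hadamard product with $X$.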

proof

$w_{ij} \approx s_{ij} \cdot a_{i} b_{j}$ , where $s_{ij}$ is the element of $W_{sign}$ . Hence we have

$(X W^{T})_{ij} = \sum_{k} x_{ik} w_{kj}^{T} = \sum_{k} x_{ik} w_{jk} \approx \sum_{k} x_{ik} s_{jk} a_{j} b_{k} = \sum_{k} x_{ik} b_{k} s_{jk} a_{j} = \sum_{k} (X ⊙ b^{T})_{ik} s_{kj}^{T} a_{j} = [(X ⊙ b^{T}) W_{sign}^{T}]_{ij} a_{j} = \{[(X ⊙ b^{T}) W_{sign}^{T}] ⊙ a^{T}\}_{ij}$.
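The identity behind the proof can be checked numerically. When $W$ is replaced by $W_{sign} ⊙ (a b^{T})$, the two sides agree exactly; the approximation only enters through $W \approx W_{sign} ⊙ (a b^{T})$. The matrices below are arbitrary illustrative values.

```python
X = [[1.0, -2.0, 0.5],
     [0.0, 3.0, 1.0]]             # 2 inputs with 3 features each
sign = [[1, -1, 1],
        [-1, -1, 1]]              # W_sign: 2 outputs x 3 inputs
a = [0.5, 2.0]                    # one scale per output
b = [1.0, 0.25, 4.0]              # one scale per input feature

n, m, d = len(X), len(a), len(b)

# Left side: X times the transpose of W_sign ⊙ (a b^T).
W_approx = [[sign[j][k] * a[j] * b[k] for k in range(d)] for j in range(m)]
lhs = [[sum(X[i][k] * W_approx[j][k] for k in range(d)) for j in range(m)] for i in range(n)]

# Right side: scale the inputs by b, multiply by the sign matrix, scale the outputs by a.
Xb = [[X[i][k] * b[k] for k in range(d)] for i in range(n)]
rhs = [[sum(Xb[i][k] * sign[j][k] for k in range(d)) * a[j] for j in range(m)] for i in range(n)]

print(lhs)
print(rhs)
```

The right-hand side only multiplies by the ±1 matrix plus two element-wise scalings, which is the structure the proposition is meant to justify.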

lemma 1

Let $σ_{i} (W)$ denote the $i$ -th biggest singular value of matrix $W$ . The following inequality holds:

$σ_{1}(|W|) \geq σ_{1}(W)$.

Proof

According to the definition of the induced norm, we have

$σ_{1}(W) = ∥ W ∥_{2} = \max_{x, ∥ x ∥_{2} = 1} ∥ W x ∥_{2}, \quad σ_{1}(|W|) = ∥ |W| ∥_{2} = \max_{y, ∥ y ∥_{2} = 1} ∥ |W| y ∥_{2}$.

Note that for all $x$ with $∥ x ∥_{2} = 1$ we have

$∥ |W| |x| ∥_{2}^{2} = \sum_{i} \left(\sum_{j} |w_{ij}| |x_{j}|\right)^{2} \geq \sum_{i} \left|\sum_{j} w_{ij} x_{j}\right|^{2} = \sum_{i} \left(\sum_{j} w_{ij} x_{j}\right)^{2} = ∥ W x ∥_{2}^{2}$.

Since we can always take $y$ to be the element-wise absolute value of $x$ (which also has unit norm), it follows that:

$\max_{y, ∥ y ∥_{2} = 1} ∥ |W| y ∥_{2} \geq \max_{x, ∥ x ∥_{2} = 1} ∥ W x ∥_{2}$.

This lemma is proved.
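For 2 x 2 matrices the lemma can be checked directly, because the largest singular value has a closed form: the eigenvalues of $W^{T} W$ follow from its trace and determinant. The sketch below uses that closed form on random matrices; the restriction to 2 x 2, the random seed, and the sample count are choices made for this example only.

```python
import math
import random


def sigma1_2x2(W):
    """Largest singular value of a 2x2 matrix via the eigenvalues of W^T W."""
    (p, q), (r, t) = W
    trace = p * p + q * q + r * r + t * t        # trace of W^T W = ||W||_F^2
    det = (p * t - q * r) ** 2                   # det of W^T W = det(W)^2
    disc = max(trace * trace - 4 * det, 0.0)     # guard against tiny negative round-off
    return math.sqrt((trace + math.sqrt(disc)) / 2)


random.seed(0)
worst_gap = float("inf")
for _ in range(1000):
    W = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    abs_W = [[abs(w) for w in row] for row in W]
    worst_gap = min(worst_gap, sigma1_2x2(abs_W) - sigma1_2x2(W))

example = [[1.0, -1.0], [1.0, 1.0]]
s_example = sigma1_2x2(example)                                   # sqrt(2)
s_example_abs = sigma1_2x2([[abs(w) for w in row] for row in example])   # 2
print(worst_gap, s_example, s_example_abs)
```

In the 2 x 2 case the lemma reduces to $(|ad| - |bc|)^{2} \leq (ad - bc)^{2}$, since both matrices share the same trace term.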

proposition 2

Given matrices $W$ and $∣ W ∣$ , $W = W_{sign} ⊙ ∣ W ∣$ . We decompose these matrices in the way $W = a b^{T} + E_{1}$ and $∣ W ∣ = \tilde{a} \tilde{b}^{T} + E_{2}$ , where $E_{i}$ denotes the error matrices. In terms of the Frobenius-norm, the SVID is closer to the original matrix $W$ :

$∥ W - W_{sign} ⊙ \tilde{a} \tilde{b}^{T} ∥_{F}^{2} \leq ∥ W - a b^{T} ∥_{F}^{2}$.

The main point of this proposition is that, in the Frobenius norm, taking absolute values first and then decomposing gives a smaller error than decomposing directly.

proof

Here we consider SVD to prove it. For SVD, the squared Frobenius norm of the error matrix $E$ in the rank-1 approximation is the sum of the squares of all singular values except for the largest one. We have
$E = A - \hat{A} = A - u_{1} σ_{1} v_{1}^{T}$

$∥ E ∥_{F}^{2} = \sum_{i, j} (A_{ij} - σ_{1} u_{1 i} v_{1 j})^{2}$

$∥ E ∥_{F}^{2} = \sum_{i, j} A_{ij}^{2} - 2 σ_{1} \sum_{i, j} A_{ij} u_{1 i} v_{1 j} + σ_{1}^{2} \sum_{i, j} u_{1 i}^{2} v_{1 j}^{2}$

Here $u_{1}$ and $v_{1}$ are singular vectors from $U$ and $V$, so $σ_{1} u_{1} = A v_{1}$, and both are unit vectors. Hence $\sum_{i, j} u_{1 i}^{2} v_{1 j}^{2} = ∥ u_{1} ∥^{2} ∥ v_{1} ∥^{2} = 1$, so

$∥ E ∥_{F}^{2} = \sum_{i, j} A_{ij}^{2} - 2 σ_{1} \sum_{i, j} A_{ij} u_{1 i} v_{1 j} + σ_{1}^{2}$

Next, apply the trace trick:

$\sum_{i, j} A_{ij} u_{1 i} v_{1 j} = tr (A^{T} u_{1} v_{1}^{T}) = tr (v_{1}^{T} A^{T} u_{1}) = tr (σ_{1} u_{1}^{T} u_{1}) = σ_{1} tr (u_{1}^{T} u_{1}) = σ_{1} (u_{1}^{T} u_{1}) = σ_{1}$

Therefore

$∥ E ∥_{F}^{2} = \sum_{i, j} A_{ij}^{2} - 2 σ_{1}^{2} + σ_{1}^{2}$
$∥ E ∥_{F}^{2} = \sum_{i, j} A_{ij}^{2} - σ_{1}^{2}$

The squared Frobenius norm of a matrix equals the sum of the squares of all its singular values, that is,

$∥ A ∥_{F}^{2} = σ_{1}^{2} + σ_{2}^{2} + \dots + σ_{n}^{2}$

so the squared Frobenius norm of the error matrix equals the sum of the squares of all singular values except the largest one.

$∥ E_{1} ∥_{F}^{2} = \sum_{i = 2}^{n} σ_{i}^{2}(W), \quad ∥ E_{2} ∥_{F}^{2} = \sum_{i = 2}^{n} σ_{i}^{2}(|W|)$.

Based on $∥ W ∥_{F}^{2} = ∥∣ W ∣ ∥_{F}^{2}$ , we have

$\sum_{i = 1}^{n} σ_{i}^{2}(W) = \sum_{i = 1}^{n} σ_{i}^{2}(|W|)$.

According to Lemma 1, we can conclude

$∥ E_{2} ∥_{F}^{2} \leq ∥ E_{1} ∥_{F}^{2}$.

From the equation in this proposition, we can formulate

$W_{sign} ⊙ |W| = W_{sign} ⊙ \tilde{a} \tilde{b}^{T} + W_{sign} ⊙ E_{2}$.

Hence we have

$W - W_{sign} ⊙ \tilde{a} \tilde{b}^{T} = W_{sign} ⊙ E_{2}$.

Therefore

$∥ W_{sign} ⊙ E_{2} ∥_{F}^{2} = \sum_{i, j} s_{ij}^{2} e_{ij}^{2} = \sum_{i, j} e_{ij}^{2} = ∥ E_{2} ∥_{F}^{2} \leq ∥ E_{1} ∥_{F}^{2}$,

where $s_{ij} = \pm 1$ is the element of $W_{sign}$ . Hence the inequality in this proposition is proved.
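The proposition can be illustrated on a small matrix by building both rank-1 approximations and comparing their squared Frobenius errors. As before, power iteration is an implementation choice of this sketch and assumes the starting vector is not orthogonal to $v_{1}$. The example matrix is arbitrary, chosen so that $|W|$ is not exactly rank 1.

```python
import math


def rank1(A, iters=200):
    """Rank-1 approximation sigma_1 u_1 v_1^T of A, found by power iteration on A^T A."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        AtAv = [sum(A[i][j] * Av[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(t * t for t in AtAv))
        v = [t / norm for t in AtAv]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]   # equals sigma_1 * u_1
    return [[Av[i] * v[j] for j in range(n)] for i in range(m)]


def frob_sq(A, B):
    """Squared Frobenius norm of A - B."""
    return sum((x - y) ** 2 for ra, rb in zip(A, B) for x, y in zip(ra, rb))


W = [[1.0, 2.0],
     [3.0, -4.0]]
sign = [[1.0 if w >= 0 else -1.0 for w in row] for row in W]
abs_W = [[abs(w) for w in row] for row in W]

direct = rank1(W)                                          # a b^T fitted to W itself
abs_fit = rank1(abs_W)                                     # rank-1 fit of |W|
svid = [[s * t for s, t in zip(rs, rt)] for rs, rt in zip(sign, abs_fit)]

err_direct = frob_sq(W, direct)
err_svid = frob_sq(W, svid)
print(err_direct, err_svid)
```

For this matrix both errors equal the squared second singular value of $W$ and $|W|$ respectively, namely $15 - \sqrt{125}$ and $15 - \sqrt{221}$, so the sign-aware decomposition is closer by a wide margin.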

Knowledge distillation

Here $Y$ uses the expression obtained from Proposition 1.
$W_{\pm 1} = Sign (W), Y = [(X ⊙ g) W_{\pm 1}^{T}] ⊙ h, Z = LayerNorm (Y),$

$L_{CE} = - \frac{1}{n_{s}} \sum_{i = 1}^{n_{s}} \sum_{c} P_{c}^{T}(o_{i}) \log P_{c}^{S}(o_{i})$,

where $c$ denotes the number of classes and $n_{s}$ denotes the number of training samples in the current batch. $T$ and $S$ are the teacher model and student model, respectively. The error of hidden states is defined as

$L_{MSE} = \sum_{i = 1}^{n_{s}} \sum_{j = 1}^{n_{l}} \left∥ \frac{q_{i, j}^{T}}{∥ q_{i, j}^{T} ∥_{2}} - \frac{q_{i, j}^{S}}{∥ q_{i, j}^{S} ∥_{2}} \right∥_{2}^{2}$,

where $n_{l}$ denotes the number of layers and $q$ denotes the hidden state. Hence the final objective function can be formulated as

$L_{KD} = L_{CE} + α L_{MSE}$,

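$L_{CE}$ pulls the student model's outputs toward those of the teacher model.
$L_{MSE}$ pulls the student model's internal hidden states toward those of the teacher model.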
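The two loss terms can be written out in a few lines of plain Python. This is a sketch of the formulas above, not the paper's training code. It assumes $P^{T}$ and $P^{S}$ are softmax distributions over logits, stores hidden states as `hidden[sample][layer]`, and uses made-up example values and $α = 1$.

```python
import math


def softmax(logits):
    top = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_loss(teacher_logits, student_logits):
    """L_CE: cross-entropy of the student under the teacher distribution, averaged over samples."""
    total = 0.0
    for t, s in zip(teacher_logits, student_logits):
        total -= sum(pt * math.log(ps) for pt, ps in zip(softmax(t), softmax(s)))
    return total / len(teacher_logits)


def normalize(q):
    norm = math.sqrt(sum(v * v for v in q))
    return [v / norm for v in q]


def mse_loss(teacher_hidden, student_hidden):
    """L_MSE: squared distance between L2-normalized hidden states, summed over samples and layers."""
    total = 0.0
    for t_layers, s_layers in zip(teacher_hidden, student_hidden):
        for q_t, q_s in zip(t_layers, s_layers):
            total += sum((x - y) ** 2 for x, y in zip(normalize(q_t), normalize(q_s)))
    return total


def kd_loss(teacher_logits, student_logits, teacher_hidden, student_hidden, alpha):
    return ce_loss(teacher_logits, student_logits) + alpha * mse_loss(teacher_hidden, student_hidden)


teacher_logits = [[2.0, 0.5, -1.0]]
student_logits = [[1.0, 0.0, 0.0]]
teacher_hidden = [[[1.0, 2.0], [0.0, 3.0]]]        # 1 sample, 2 layers
student_hidden = [[[2.0, 4.0], [1.0, 3.0]]]        # layer 1 is a scaled copy of the teacher's

ce = ce_loss(teacher_logits, student_logits)
ce_floor = ce_loss(teacher_logits, teacher_logits)  # entropy of the teacher, the minimum of L_CE
mse = mse_loss(teacher_hidden, student_hidden)
total = kd_loss(teacher_logits, student_logits, teacher_hidden, student_hidden, alpha=1.0)
print(ce, ce_floor, mse, total)
```

Because the hidden states are normalized before comparison, the first layer contributes nothing to $L_{MSE}$ even though the student's vector is twice as long; only the direction of the hidden state is matched.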

References

https://leimao.github.io/article/Neural-Networks-Quantization/#Quantization
https://zhuanlan.zhihu.com/p/548174416
https://www.dropbox.com/scl/fi/1mo0umu0qtq7uxap2l5m3/lec06.pdf?rlkey=bdl2mgusgajddjuvjxb0fot36&dl=0
https://zhuanlan.zhihu.com/p/645259854
Paper: OneBit: Towards Extremely Low-bit Large Language Models
Paper: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Paper: A Survey on Model Compression for Large Language Models
