Problem definition

$\mu(F(m, r))=m+r-1$

$\begin{aligned} \mu(F(m \times n, r \times s)) &=\mu(F(m, r)) \mu(F(n, s)) \\ &=(m+r-1)(n+s-1) \end{aligned}$
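To make the two counts above concrete, here is a small sketch that evaluates them and compares against direct computation. The function names are made up for illustration, and the direct count assumes one multiplication per filter tap per output.

```python
def mults_1d(m, r):
    """Minimal multiplication count mu(F(m, r)) = m + r - 1."""
    return m + r - 1


def mults_2d(m, n, r, s):
    """Nested 2D count mu(F(m x n, r x s)) = (m + r - 1)(n + s - 1)."""
    return mults_1d(m, r) * mults_1d(n, s)


def direct_mults_2d(m, n, r, s):
    """Direct computation: each of the m*n outputs needs r*s multiplications."""
    return m * n * r * s


for m in (2, 4, 6):
    fast = mults_2d(m, m, 3, 3)
    slow = direct_mults_2d(m, m, 3, 3)
    print(f"F({m}x{m}, 3x3): {fast} vs {slow} multiplications, ratio {slow / fast:.2f}")
```

For F(2x2, 3x3) this gives 16 multiplications instead of 36, a 2.25x reduction in multiplications only; additions and transform costs are not counted here.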

An example: F(2, 3)

$F(2, 3) = \left[ \begin{array}{lll}{d_{0}} & {d_{1}} & {d_{2}} \\ {d_{1}} & {d_{2}} & {d_{3}}\end{array}\right] \left[ \begin{array}{l}{g_{0}} \\ {g_{1}} \\ {g_{2}}\end{array}\right]=\left[ \begin{array}{c}{r_0} \\ {r_1}\end{array}\right]$

$\begin{array}{l}{r_{0}=\left(d_{0} \cdot g_{0}\right)+\left(d_{1} \cdot g_{1}\right)+\left(d_{2} \cdot g_{2}\right)} \\ {r_{1}=\left(d_{1} \cdot g_{0}\right)+\left(d_{2} \cdot g_{1}\right)+\left(d_{3} \cdot g_{2}\right)}\end{array}$

$F(2,3)=\left[ \begin{array}{lll}{d_{0}} & {d_{1}} & {d_{2}} \\ {d_{1}} & {d_{2}} & {d_{3}}\end{array}\right] \left[ \begin{array}{l}{g_{0}} \\ {g_{1}} \\ {g_{2}}\end{array}\right]=\left[ \begin{array}{c}{m_{1}+m_{2}+m_{3}} \\ {m_{2}-m_{3}-m_{4}}\end{array}\right]$

$\begin{array}{ll}{m_{1}=\left(d_{0}-d_{2}\right) g_{0}} & {m_{2}=\left(d_{1}+d_{2}\right) \frac{g_{0}+g_{1}+g_{2}}{2}} \\ {m_{4}=\left(d_{1}-d_{3}\right) g_{2}} & {m_{3}=\left(d_{2}-d_{1}\right) \frac{g_{0}-g_{1}+g_{2}}{2}}\end{array}$

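• On the input signal $$d$$: 4 additions (subtractions)
• On the kernel $$g$$: 3 additions (the intermediate result $$g_0+g_2$$ can be reused), 2 multiplications (divisions by 2)
• On the output: 4 multiplications to form $$m_1 \dots m_4$$, then 4 additions (subtractions) to combine them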
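The formulas for $m_1 \dots m_4$ can be checked numerically. The sketch below is an illustration rather than a reference implementation: the function names are invented here, and `fractions.Fraction` is used only so that the comparison with the direct result is exact.

```python
from fractions import Fraction


def direct_f23(d, g):
    """Direct F(2, 3): two outputs, three multiplications each (6 in total)."""
    r0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    r1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return [r0, r1]


def winograd_f23(d, g):
    """F(2, 3) using the four products m1..m4 (4 multiplications)."""
    # Kernel side: g0 + g2 is computed once and reused.
    s = g[0] + g[2]
    u1 = (s + g[1]) / 2
    u2 = (s - g[1]) / 2
    # Four multiplications; each data factor costs one addition or subtraction.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * g[2]
    # Output side: four additions/subtractions.
    return [m1 + m2 + m3, m2 - m3 - m4]


d = [Fraction(x) for x in (1, 2, 3, 4)]
g = [Fraction(x) for x in (5, 6, 7)]
print(direct_f23(d, g) == winograd_f23(d, g))  # True
```

In a network the kernel-side values `u1` and `u2` depend only on the weights, so they can be precomputed once and reused for every input tile.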

$Y=A^{T}\left[(G g) \odot\left(B^{T} d\right)\right]$

$B^{T}=\left[ \begin{array}{cccc}{1} & {0} & {-1} & {0} \\ {0} & {1} & {1} & {0} \\ {0} & {-1} & {1} & {0} \\ {0} & {1} & {0} & {-1}\end{array}\right]$

$G=\left[ \begin{array}{ccc}{1} & {0} & {0} \\ {\frac{1}{2}} & {\frac{1}{2}} & {\frac{1}{2}} \\ {\frac{1}{2}} & {-\frac{1}{2}} & {\frac{1}{2}} \\ {0} & {0} & {1}\end{array}\right]$

$A^{T}=\left[ \begin{array}{llll}{1} & {1} & {1} & {0} \\ {0} & {1} & {-1} & {-1}\end{array}\right]$

$g=\left[ \begin{array}{lll}{g_{0}} & {g_{1}} & {g_{2}}\end{array}\right]^{T}$

$d=\left[ \begin{array}{llll}{d_{0}} & {d_{1}} & {d_{2}} & {d_{3}}\end{array}\right]^{T}$

• $$g$$: the convolution kernel
• $$d$$: the input signal
• $$G$$: filter transform matrix, of size $$(m+r-1)\times r$$
• $$B^T$$: input transform matrix, of size $$(m+r-1)\times (m+r-1)$$
• $$A^T$$: output transform matrix, of size $$m \times (m+r-1)$$

• Input transform
• Filter transform
• Output transform
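The matrix form $Y=A^{T}[(G g) \odot (B^{T} d)]$ with the three transforms can be sketched as follows, using the F(2, 3) matrices exactly as given above. The helper names `matvec` and `winograd_f23_matrix` are made up for this illustration, and exact rationals are used so the result can be compared without rounding.

```python
from fractions import Fraction

H = Fraction(1, 2)
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]  # input transform
G = [[1, 0, 0], [H, H, H], [H, -H, H], [0, 0, 1]]                 # filter transform
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]                               # output transform


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def winograd_f23_matrix(d, g):
    U = matvec(G, g)                    # filter transform: 4 values
    V = matvec(BT, d)                   # input transform: 4 values
    M = [u * v for u, v in zip(U, V)]   # element-wise product: 4 multiplications
    return matvec(AT, M)                # output transform: 2 outputs


d = [1, 2, 3, 4]
g = [5, 6, 7]
direct = [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
          d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]
print(winograd_f23_matrix(d, g) == direct)  # True
```

The element-wise product `M` holds exactly the four products $m_1 \dots m_4$ from the F(2, 3) example.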

1D to 2D: F(2, 3) to F(2x2, 3x3)

A minimal 1D algorithm F(m, r) is nested with itself to obtain a minimal 2D algorithm, F(m×m, r×r).

$Y=A^{T}\left[\left[G g G^{T}\right] \odot\left[B^{T} d B\right]\right] A$

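$$D_0 = [k_0, k_1, k_2, k_3]^T$$ is row 0 of the input window, and $$D_1 \ D_2 \ D_3$$ are rows 1, 2 and 3; $$W_0=[w_0, w_1, w_2]^T$$ is row 0 of the kernel, and likewise for $$W_1 \ W_2$$. $$K_i$$ is the $$2 \times 3$$ matrix built from $$D_i$$ in the same way the data matrix in the F(2, 3) example is built from $$d$$, so each product $$K_i W_j$$ is a 1D F(2, 3).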

$\begin{aligned} \left[ \begin{array}{c}{r_0} \\ {r_1} \\ {r_2} \\ {r_3}\end{array}\right] &= \left[ \begin{array}{c}{R_0} \\ {R_1}\end{array}\right] = \left[ \begin{array}{c}{K_0 W_0 + K_1 W_1 + K_2 W_2} \\ {K_1 W_0 + K_2 W_1 + K_3 W_2} \end{array} \right] \\ &= \left[ \begin{array}{c} {A^{T}\left[(G W_0) \odot\left(B^{T} D_0 \right)\right] + A^{T}\left[(G W_1) \odot\left(B^{T} D_1 \right)\right] + A^{T}\left[(G W_2) \odot\left(B^{T} D_2 \right)\right]} \\ {A^{T}\left[(G W_0) \odot\left(B^{T} D_1 \right)\right] + A^{T}\left[(G W_1) \odot\left(B^{T} D_2 \right)\right] + A^{T}\left[(G W_2) \odot\left(B^{T} D_3 \right)\right]} \end{array} \right] \\ \\ &=A^{T}\left[\left[G [W_0 \ W_1 \ W_2 ] G^{T}\right] \odot\left[B^{T} [D_0 \ D_1 \ D_2 \ D_3] B\right]\right]A \\ \\ &=A^{T}\left[\left[G g G^{T}\right] \odot\left[B^{T} d B\right]\right] A \end{aligned}$
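The final 2D formula can be checked against a direct sliding-window computation. This is a minimal sketch assuming a single 4x4 input tile `d` and a 3x3 kernel `g` stored as lists of rows; `matmul`, `transpose` and the two function names are helpers invented for the example.

```python
from fractions import Fraction

H = Fraction(1, 2)
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [H, H, H], [H, -H, H], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(X, Y):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def winograd_f2x2_3x3(d, g):
    U = matmul(matmul(G, g), transpose(G))     # G g G^T, 4x4
    V = matmul(matmul(BT, d), transpose(BT))   # B^T d B, 4x4
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]  # 16 multiplications
    return matmul(matmul(AT, M), transpose(AT))  # A^T M A, 2x2


def direct_f2x2_3x3(d, g):
    """Sliding-window sum of products: 4 outputs, 9 multiplications each."""
    return [[sum(d[i + a][j + b] * g[a][b] for a in range(3) for b in range(3))
             for j in range(2)] for i in range(2)]


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
g = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(winograd_f2x2_3x3(d, g) == direct_f2x2_3x3(d, g))
```

The element-wise product uses 16 multiplications, matching $(m+r-1)(n+s-1)$ for $m=n=2$, $r=s=3$.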

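• Everything above deals with a single small image tile, but in a convolutional neural network the feature map can be large. Do we really need to implement $$F(224, 3)$$? No: the feature map is split into small overlapping tiles (adjacent tiles share $$r-1$$ rows or columns), and each tile is handled by the small algorithm in the following steps.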

• Input transform
• Filter transform
• Batched-GEMM（批量矩阵乘法）
• Output transform
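The four steps above can be sketched for a multi-channel input as follows. This is an illustration under several assumptions that are not in the original text: F(2x2, 3x3) tiles, no padding, output height and width both even, and plain nested lists instead of a tensor library. The names `x`, `w`, `sandwich`, `winograd_conv` and `direct_conv` are hypothetical.

```python
from fractions import Fraction

H = Fraction(1, 2)
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [H, H, H], [H, -H, H], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def sandwich(L, X):
    """Return L X L^T."""
    return matmul(matmul(L, X), [list(col) for col in zip(*L)])


def winograd_conv(x, w):
    """x: [C][H][W] input, w: [K][C][3][3] filters -> [K][H-2][W-2] output.

    Assumes H-2 and W-2 are even, so 4x4 tiles with stride 2 cover the input.
    """
    C, Hh, Ww, K = len(x), len(x[0]), len(x[0][0]), len(w)
    tiles = [(i, j) for i in range(0, Hh - 2, 2) for j in range(0, Ww - 2, 2)]
    # Filter transform: one 4x4 matrix per (filter, channel).
    U = [[sandwich(G, w[k][c]) for c in range(C)] for k in range(K)]
    # Input transform: one 4x4 matrix per (channel, tile).
    V = [[sandwich(BT, [row[j:j + 4] for row in x[c][i:i + 4]])
          for (i, j) in tiles] for c in range(C)]
    # Batched GEMM: 16 independent (K x C) by (C x tiles) products,
    # one per position (a, b) of the transformed 4x4 tile.
    M = {}
    for a in range(4):
        for b in range(4):
            Uab = [[U[k][c][a][b] for c in range(C)] for k in range(K)]
            Vab = [[V[c][p][a][b] for p in range(len(tiles))] for c in range(C)]
            M[a, b] = matmul(Uab, Vab)
    # Output transform: gather each 4x4 result and reduce it to a 2x2 output block.
    y = [[[0] * (Ww - 2) for _ in range(Hh - 2)] for _ in range(K)]
    for k in range(K):
        for p, (i, j) in enumerate(tiles):
            out = sandwich(AT, [[M[a, b][k][p] for b in range(4)] for a in range(4)])
            for di in range(2):
                for dj in range(2):
                    y[k][i + di][j + dj] = out[di][dj]
    return y


def direct_conv(x, w):
    C, Hh, Ww = len(x), len(x[0]), len(x[0][0])
    return [[[sum(x[c][i + a][j + b] * w[k][c][a][b]
                  for c in range(C) for a in range(3) for b in range(3))
              for j in range(Ww - 2)] for i in range(Hh - 2)] for k in range(len(w))]


x = [[[((c + 1) * (i * 6 + j)) % 7 for j in range(6)] for i in range(6)] for c in range(2)]
w = [[[[k + c + a - b for b in range(3)] for a in range(3)] for c in range(2)] for k in range(2)]
print(winograd_conv(x, w) == direct_conv(x, w))
```

Summing over input channels happens inside the matrix products, before the output transform, which is why the channel reduction can be expressed as a batch of GEMMs. A real implementation would precompute `U` once per layer and use an optimized GEMM rather than Python loops.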

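Summary

• The Winograd algorithm gains speed by reducing the number of multiplications, but the number of additions increases accordingly, and the transforms need extra computation as well as storage for the transform matrices. As the kernel and tile sizes grow, the cost of the additions, the transforms and the storage has to be taken into account. The larger the tile, the larger the transform matrices and the greater the loss of numerical precision, so Winograd is generally only suitable for small kernels and tiles (for large kernels, FFT can be used for acceleration). Small kernels dominate in today's popular networks; typical implementations include $$F(6\times 6, 3\times 3)$$, $$F(4\times 4, 3\times 3)$$ and $$F(2\times 2, 3\times 3)$$. See the source code of NCNN, FeatherCNN, ARM-ComputeLibrary and others.
• Compared with im2col+GEMM+col2im, Winograd partitions the input into larger tiles; in terms of how the input is partitioned, $$F(1\times 1, r\times r)$$ is the same as im2col.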