Moore's Law can't continue forever

"Moore's Law can't continue forever … We have another 10 to 20 years before we reach a fundamental limit."

Gordon Moore, Intel Co-Founder (2005)

Gordon Moore not only predicted the amazing growth of transistors per chip in 1965, but the opening chapter quote shows that he also predicted its demise 50 years later. As evidence, Figure 7.1 shows that even the company he founded - which for decades proudly used Moore's Law as a guideline for capital investment - is slowing its development of new semiconductor processes. During the semiconductor boom, architects rode Moore's Law to create novel mechanisms that could turn the cornucopia of transistors into higher performance. The resources for a five-stage pipeline, 32-bit RISC processor - which needed as little as 25,000 transistors in the 1980s - grew by a factor of 100,000, enabling features that accelerated general-purpose code on general-purpose processors, as earlier chapters document.

The Tensor Processing Unit (TPU) is Google's first custom ASIC DSA for WSCs. Its domain is the inference phase of DNNs, and it is programmed using the TensorFlow framework, which was designed for DNNs. The first TPU was deployed in Google data centers in 2015. The heart of the TPU is a Matrix Multiply Unit containing 65,536 (256×256) 8-bit ALUs, plus a large software-managed on-chip memory. The TPU's single-threaded, deterministic execution model is a good match to the 99th-percentile response-time requirement of the typical DNN inference application.
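To make the domain concrete: DNN inference is dominated by matrix multiplies, each followed by a nonlinear function, which is why a matrix unit plus activation hardware covers so much of the workload. The sketch below is plain NumPy, not TensorFlow or TPU code; the layer sizes, weights, and function names are made up for illustration.

```python
import numpy as np

# Illustrative only: a two-layer fully connected inference step, showing
# how DNN inference reduces to matrix multiplies plus a nonlinearity.

def relu(x):
    # The kind of nonlinear function the Activation hardware computes.
    return np.maximum(x, 0)

def infer(x, w1, w2):
    h = relu(x @ w1)     # layer 1: matrix multiply, then nonlinearity
    return relu(h @ w2)  # layer 2: matrix multiply, then nonlinearity

rng = np.random.default_rng(0)
x  = rng.standard_normal((1, 256))    # one input vector
w1 = rng.standard_normal((256, 256))  # weight matrices: the kind of data
w2 = rng.standard_normal((256, 256))  # the Weight Memory would hold
y = infer(x, w1, w2)
```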

The Matrix Multiply Unit is the heart of the TPU. It contains 256×256 ALUs that can perform 8-bit multiply-and-adds on signed or unsigned integers. The 16-bit products are collected in the 4 MiB of 32-bit Accumulators below the matrix unit. When using a mix of 8-bit weights and 16-bit activations (or vice versa), the Matrix Unit computes at half speed, and it computes at a quarter of full speed when both are 16 bits. It reads and writes 256 values per clock cycle and can perform either a matrix multiply or a convolution. The nonlinear functions are calculated by the Activation hardware.
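The datapath's width rules can be mimicked numerically. The sketch below is our illustration, not TPU hardware or software: 8-bit operands yield products that fit in 16 bits, and those products are summed into 32-bit accumulators; the tile size matches the 256×256 array.

```python
import numpy as np

# A minimal numerical sketch (not Google's code) of the Matrix Multiply
# Unit's arithmetic: 8-bit operands, widened products, sums collected in
# 32-bit accumulators. The tile size is the TPU's 256x256 array.
TILE = 256

def tile_matmul_int8(activations, weights, accumulators):
    """Multiply a 256x256 tile of int8 activations by int8 weights,
    adding the int32 partial sums into the accumulators."""
    assert activations.dtype == np.int8 and weights.dtype == np.int8
    # Widen before multiplying so products and sums cannot overflow,
    # then add into the 32-bit accumulator array.
    accumulators += activations.astype(np.int32) @ weights.astype(np.int32)
    return accumulators

# Usage: one tile of random 8-bit data.
rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(TILE, TILE), dtype=np.int8)
w = rng.integers(-128, 128, size=(TILE, TILE), dtype=np.int8)
acc = np.zeros((TILE, TILE), dtype=np.int32)
tile_matmul_int8(a, w, acc)
```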

The weights for the matrix unit are staged through an on-chip Weight FIFO that reads from an off-chip 8 GiB DRAM called Weight Memory (for inference, weights are read-only; 8 GiB supports many simultaneously active models). The intermediate results are held in the 24 MiB on-chip Unified Buffer, which can serve as inputs to the Matrix Multiply Unit. A programmable DMA controller transfers data to or from CPU Host memory and the Unified Buffer.
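The staging path can be pictured as a small software model. The toy sketch below mimics the flow the paragraph describes: weight tiles are prefetched from a simulated off-chip Weight Memory into a FIFO, activations come from the Unified Buffer, and products are summed into accumulators. The class and function names here are ours, purely illustrative, and nothing below models real DMA or DRAM timing.

```python
from collections import deque
import numpy as np

TILE = 256

class WeightFIFO:
    """Toy model: stages weight tiles read from a simulated off-chip DRAM."""
    def __init__(self, weight_memory, depth=4):
        self.dram = weight_memory        # list of int8 weight tiles
        self.fifo = deque(maxlen=depth)  # on-chip staging FIFO

    def prefetch(self, tile_index):
        # In hardware this would be a DRAM read; here it is a list lookup.
        self.fifo.append(self.dram[tile_index])

    def pop(self):
        return self.fifo.popleft()

def run_layer(unified_buffer, weight_fifo, n_tiles):
    """Multiply activation tiles from the (simulated) Unified Buffer by
    staged weight tiles, summing int32 partial results."""
    acc = np.zeros((TILE, TILE), dtype=np.int32)
    for i in range(n_tiles):
        weight_fifo.prefetch(i)
        w = weight_fifo.pop()
        acc += unified_buffer[i].astype(np.int32) @ w.astype(np.int32)
    return acc

# Usage: three weight tiles in "DRAM", three activation tiles in the buffer.
rng = np.random.default_rng(0)
dram = [rng.integers(-128, 128, (TILE, TILE), dtype=np.int8) for _ in range(3)]
ubuf = [rng.integers(-128, 128, (TILE, TILE), dtype=np.int8) for _ in range(3)]
out = run_layer(ubuf, WeightFIFO(dram), n_tiles=3)
```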

CET-6/postgraduate exam vocabulary: amaze, transistor, guideline, invest, accelerate, domain, infer, deploy, data, execute, vice, compute, hardware, simultaneous, intermediate, unify, buffer
