ASICs, the tide drops to

With the development of machine learning, edge computing, and autonomous driving, a flood of data processing tasks has placed high demands on the computing efficiency, computing power, and energy efficiency of chips. In this context, ASICs have attracted more and more attention.

In March 1981, Sinclair launched the ZX81, an 8-bit personal computer built around a Z80 processor; the custom logic chip that accompanied it (a Ferranti uncommitted logic array, or ULA) is often regarded as one of the earliest ASIC prototypes. An ASIC (Application-Specific Integrated Circuit) is a chip designed and manufactured from the ground up to meet a user’s needs for a specific electronic system. ASICs are widely used in intelligent terminals such as artificial intelligence equipment, cryptocurrency mining equipment, consumables printing equipment, and military defense equipment. At the hardware level, ASIC chips are made from basic silicon together with materials such as gallium phosphide, gallium arsenide, and gallium nitride. At the physical structure level, ASIC chip modules usually include 32-bit microprocessors, memory blocks, network circuits, and so on.


Different ASIC chips

ASIC chips can be divided into TPU, DPU, and NPU chips according to their end function. The TPU (Tensor Processing Unit) is dedicated to machine learning. For example, in May 2016 Google announced a programmable AI accelerator for the TensorFlow platform, whose internal instruction set can still run when the TensorFlow program changes or the algorithm is updated. The DPU (Data Processing Unit) provides a processing engine for computing scenarios such as data centers. The NPU (Neural Processing Unit) simulates human neurons and synapses at the circuit level and processes large numbers of these electronic neurons and synapses directly with a deep learning instruction set.

There are two approaches to ASIC design: full custom and semi-custom. Full-custom design completes the entire integrated circuit design process independently, at a huge cost in labor and time. Although it is more flexible and delivers better performance than semi-custom design, its development efficiency is much lower.

As the design of functional blocks and cell libraries has matured, semi-custom ASIC design has gradually replaced full-custom methods. Designers can work directly with standard logic cells from pre-built cell libraries, or with gate arrays, and now rarely design a complete circuit with a full-custom approach. Standard-cell-based design and gate-array-based design are the two main methods used in semi-custom ASIC design today.

The standard-cell-based method selects standard logic cells directly from the cell library, such as small- and medium-scale integrated circuit cells and gate-level, behavior-level, or even system-level circuit modules. These cells are designed before they are used in an ASIC, have been verified against strict design rules, and are highly reliable. Semi-custom designers can take them straight from the cell library for system design, which makes them easy to use.
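As a rough illustration of the standard-cell approach, the Python sketch below builds a half adder purely from cells picked out of a small library and adds up their area. The library contents, the cell names (INV, NAND2, AND2, XOR2), and the area values are hypothetical and only stand in for a real, pre-verified cell library.

```python
# Hypothetical standard-cell library: name -> (logic function, relative area).
# Names and area values are illustrative assumptions, not from a real process kit.
CELL_LIBRARY = {
    "INV":   (lambda a: 1 - a, 1.0),
    "NAND2": (lambda a, b: 1 - (a & b), 1.5),
    "AND2":  (lambda a, b: a & b, 2.0),
    "XOR2":  (lambda a, b: a ^ b, 3.0),
}

# A netlist is a list of cell instances: (cell type, input nets, output net).
HALF_ADDER = [
    ("XOR2", ("a", "b"), "sum"),
    ("AND2", ("a", "b"), "carry"),
]


def simulate(netlist, inputs):
    """Evaluate a netlist whose instances are listed in topological order."""
    nets = dict(inputs)
    for cell, in_nets, out_net in netlist:
        func, _area = CELL_LIBRARY[cell]
        nets[out_net] = func(*(nets[n] for n in in_nets))
    return nets


def total_area(netlist):
    """Sum the relative area of every cell instance in the netlist."""
    return sum(CELL_LIBRARY[cell][1] for cell, _, _ in netlist)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            out = simulate(HALF_ADDER, {"a": a, "b": b})
            print(a, b, "->", out["sum"], out["carry"])
    print("relative area:", total_area(HALF_ADDER))
```

The designer never describes a transistor here: the design is only a list of library cells and the nets that connect them, which is what makes the approach quick to use.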

In the gate-array-based method, the transistor array is prefabricated, and the designer customizes only the masks that define the metal interconnect layers; the design is completed through the interconnections those masks define. Because the customization is carried out in the masks, this kind of gate array is called a masked gate array (MGA). A gate array library keeps the same logic cell layout and customizes only the metal interconnect lines on top of it.
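The gate-array idea can be sketched in a few lines of Python: the base array is a fixed set of identical gates, and only the wiring changes from one design to another. In this hypothetical model the base array is four NAND2 gates (g0 to g3), and each wiring table plays the role of a customized metal mask. All names are illustrative assumptions.

```python
# Hypothetical base array: four identical, prefabricated NAND2 gates.
BASE_ARRAY = ("g0", "g1", "g2", "g3")


def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 1 - (a & b)


def evaluate(wiring, inputs, output_net):
    """Evaluate a design on the base array.

    `wiring` maps each used gate to the two nets feeding it; the output net
    of a gate carries the gate's own name. Gates are listed in evaluation order.
    """
    nets = dict(inputs)
    for gate, (in_a, in_b) in wiring.items():
        if gate not in BASE_ARRAY:
            raise ValueError(f"{gate} is not part of the prefabricated array")
        nets[gate] = nand(nets[in_a], nets[in_b])
    return nets[output_net]


# Two different "metal masks" on the same base array.
AND_WIRING = {
    "g0": ("a", "b"),
    "g1": ("g0", "g0"),   # NAND with both inputs tied acts as an inverter
}
XOR_WIRING = {
    "g0": ("a", "b"),
    "g1": ("a", "g0"),
    "g2": ("b", "g0"),
    "g3": ("g1", "g2"),
}

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            pins = {"a": a, "b": b}
            print(a, b,
                  "AND:", evaluate(AND_WIRING, pins, "g1"),
                  "XOR:", evaluate(XOR_WIRING, pins, "g3"))
```

Both designs use the same gates; only the connection table differs, which mirrors how a gate array keeps the cell layout fixed and customizes the interconnect.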

ASIC design flows from top to bottom: standard-cell-based ASICs usually adopt a “Top-Down” design approach, and the basic design flow is shown in the figure.


Comparison of ASIC and CPU, FPGA, etc.

CPU: Designed for low latency, the CPU has strong single-task logic processing capability, but its capacity is limited when it faces large amounts of data within a fixed power budget. The central processing unit (CPU) needs strong general computing power to process different types of data, plus the logic to handle branches and jumps, all of which makes its internal structure extremely complex. Deep learning models must be trained on large amounts of data to achieve the desired effect. The recent explosion of data satisfies the need for training data, but running the algorithms also requires a processor with extremely high computing speed. Today’s mainstream CPU architectures, including x86 and ARM, often need hundreds or even thousands of instructions to process a single neuron. Deep learning needs relatively few distinct instructions but operates on massive amounts of data, so for this workload the CPU structure is very clumsy. Under current power limits, moreover, CPU frequency cannot simply be raised to execute instructions faster, and this contradiction is becoming more and more irreconcilable.
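To make the instruction-count argument concrete, the Python sketch below evaluates a single artificial neuron the way a scalar processor would, one multiply and one add at a time, and counts the arithmetic operations. The 256-input size and the ReLU activation are assumptions chosen for illustration; a real instruction count would also include loads, stores, and loop overhead.

```python
def scalar_neuron(weights, inputs, bias=0.0):
    """Compute relu(w . x + b) one element at a time, counting arithmetic ops."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    acc = bias
    ops = 0
    for w, x in zip(weights, inputs):
        acc += w * x   # one multiply and one add per input
        ops += 2
    return max(acc, 0.0), ops


if __name__ == "__main__":
    n = 256  # assumed number of inputs to the neuron
    weights = [0.5] * n
    inputs = [1.0] * n
    value, ops = scalar_neuron(weights, inputs)
    print(f"output={value}, scalar arithmetic operations={ops}")
```

Even this small neuron needs several hundred sequential arithmetic operations on a scalar core, while the pattern itself (multiply, then accumulate) never changes. That regularity is what dedicated hardware can exploit by running many multiply-accumulate units in parallel.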

GPU: With a more mature ecosystem, the GPU was the first to benefit from the explosion of artificial intelligence. A GPU is similar to a CPU, except that it is a microprocessor specialized for image computing, designed to perform the complex mathematical and geometric calculations needed for graphics rendering. In some workloads, such as floating-point and parallel computation, a GPU can deliver dozens or even hundreds of times the performance of a CPU. But it has three limitations: 1. Its parallel computing advantages cannot always be fully exploited in practice. 2. Its hardware structure is fixed and cannot be reprogrammed at the hardware level. 3. It is far less energy efficient than ASICs and FPGAs when running deep learning algorithms.

FPGA: A blank slate for AI, with medium energy efficiency, high flexibility, and high cost. An FPGA (Field-Programmable Gate Array) can be reprogrammed repeatedly by users to suit their own needs. Compared with GPUs and CPUs, it offers high performance, low energy consumption, and hardware programmability. It also has three limitations: 1. The computing power of its basic units is limited. 2. Its speed and power consumption still need to improve. 3. It is relatively expensive.

ASIC: Designed for a specific purpose. Unlike the flexible GPU and FPGA, a custom ASIC cannot be changed once it is manufactured, so its high initial cost and long development cycle raise the barrier to entry. At present, most participants are giants that have their own AI algorithms and are strong in chip research and development; Google’s TPU is one example. ASIC chips have the following advantages: 1. Size: ASIC designs make full use of every computing unit and avoid redundant ones, which helps reduce chip area. 2. Energy consumption: the energy consumed per unit of computing power is lower for ASICs than for CPUs, GPUs, and FPGAs, which suits the energy constraints of new smart home appliances. 3. Integration: because of the custom design, the system, circuit, and process of an ASIC are highly integrated, which helps customers obtain high-performance integrated circuits. For example, the first-generation TPU is reported to deliver 14 to 16 times the performance of a traditional GPU, and an NPU 118 times. Cambricon has released an instruction set for external applications, and ASICs will be the core of future AI chips.


What is the future of ASICs?

ASIC chips and their supporting products have begun to form an application model in the downstream smart home appliance market, which offers broad market space. Driven by the Internet of Things trend, home appliance manufacturers such as Midea, Gree, Haier, and Hisense have successively rolled out a range of smart home appliances. By embedding ASIC chips, these manufacturers can earn higher profits and contribute to the construction of smart cities.

Google’s Tensor Processing Unit (TPU) uses an optimized algorithm architecture that sits between a CPU and a fully custom ASIC, and it serves both desktop and embedded computing equipment. The TPU architecture is more tolerant of errors and faults, and its hardware is simpler than that of a general-purpose CPU. With the same number of transistors, an ASIC built on the TPU architecture can complete more computation. Compared with CPUs and GPUs of the same class, this type of ASIC can improve computing performance by 15 to 30 times and energy efficiency by 30 to 80 times. In addition, Cisco has launched a firewall-specific ASIC whose algorithms use network acceleration protocols, and Qualcomm has launched a baseband-specific ASIC that uses communication protocols, the Fourier transform, and other optimized algorithms. Autonomous driving computing systems are still changing and evolving rapidly, and may enter a stage of algorithm stability within five years. Experts have pointed out that ASICs optimized for fixed algorithms will become the mainstream core module of autonomous driving computing systems.

Because the ASIC architecture is closer to the underlying algorithm and greatly reduces redundant transistors and wiring in the physical structure, ASIC chips outperform traditional chips in computing throughput, latency, power consumption, and other parameters. At this stage, the core chip of autonomous driving systems has shifted from GPU to FPGA and is gradually transitioning to ASIC. Compared with FPGAs, the ASIC architecture allows the computing efficiency and computing power of the autonomous driving system to be customized, and once an ASIC reaches mass production its average cost falls below that of an FPGA. Under the same process conditions, an ASIC computes about five times as fast as an FPGA, or more.
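The mass-production argument is a simple break-even calculation. The sketch below compares the average cost per chip of an ASIC, which carries a large one-time design and mask cost but a low unit cost, with that of an FPGA, which has no one-time cost but a higher unit cost. All dollar figures are hypothetical placeholders, not data from this article.

```python
import math


def average_cost(one_time_cost, unit_cost, volume):
    """Average cost per chip when a one-time cost is spread over `volume` units."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return one_time_cost / volume + unit_cost


def break_even_volume(asic_one_time, asic_unit, fpga_unit):
    """Smallest volume at which the ASIC's average cost drops below the FPGA's."""
    if fpga_unit <= asic_unit:
        raise ValueError("the ASIC only breaks even if its unit cost is lower")
    return math.floor(asic_one_time / (fpga_unit - asic_unit)) + 1


if __name__ == "__main__":
    ASIC_ONE_TIME = 2_000_000  # hypothetical design and mask cost (USD)
    ASIC_UNIT = 10             # hypothetical per-chip cost (USD)
    FPGA_UNIT = 50             # hypothetical per-chip cost (USD)

    volume = break_even_volume(ASIC_ONE_TIME, ASIC_UNIT, FPGA_UNIT)
    print("break-even volume:", volume)
    for v in (10_000, volume, 1_000_000):
        print(v, "units -> ASIC average cost:",
              round(average_cost(ASIC_ONE_TIME, ASIC_UNIT, v), 2))
```

Below the break-even volume the one-time cost dominates and the FPGA is cheaper per chip; above it the ASIC's lower unit cost wins, which is the sense in which average cost falls at mass-production scale.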


Development status at home and abroad

ASIC chips, including categories such as the DPU and NPU, are gaining attention across the chip industry. The DPU mainly accelerates network, storage, and security tasks and is designed to meet dedicated computing needs on the network side, especially in scenarios with large numbers of servers and strict requirements on data transmission rates. Specifically, the DPU can take over data processing tasks that the CPU handles poorly, such as network protocol processing, data encryption and decryption, and data compression, and can manage, expand, and schedule various resources separately. In the first half of 2020, NVIDIA acquired the Israeli network chip company Mellanox Technologies for US$6.9 billion. In the same year it launched the BlueField-2 DPU, which it defined as the “third main chip” after the CPU and GPU, officially opening the prelude to the DPU’s rapid development.

Google recently presented its new-generation tensor processor cluster, TPU v4, at the I/O 2022 event. CEO Sundar Pichai said that the new computing cluster, called a Pod, contains 4,096 v4 chips and can provide more than 1 exaflops of floating-point performance. Pichai also said that Google will deploy 8 TPU v4 clusters in its data center in Oklahoma, for a total performance of about 9 exaflops.
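The figures quoted above can be cross-checked with a little arithmetic. The Python sketch below uses only the numbers from the text (8 clusters, 4,096 chips per cluster, about 9 exaflops in total) and derives the implied per-cluster and per-chip performance. The results are back-of-the-envelope estimates, not official specifications.

```python
EXA = 10**18
TERA = 10**12

PODS = 8                 # clusters planned for the Oklahoma data center
CHIPS_PER_POD = 4096     # v4 chips per cluster
TOTAL_FLOPS = 9 * EXA    # "about 9 exaflops" across all clusters


def implied_per_pod(total_flops, pods):
    """Total performance divided evenly across clusters."""
    return total_flops / pods


def implied_per_chip(total_flops, pods, chips_per_pod):
    """Total performance divided evenly across every chip."""
    return total_flops / (pods * chips_per_pod)


if __name__ == "__main__":
    per_pod = implied_per_pod(TOTAL_FLOPS, PODS)
    per_chip = implied_per_chip(TOTAL_FLOPS, PODS, CHIPS_PER_POD)
    print(f"per cluster: {per_pod / EXA:.3f} exaflops")
    print(f"per chip:    {per_chip / TERA:.1f} teraflops")
```

The implied value of roughly 1.1 exaflops per cluster is consistent with the statement that a single Pod provides more than 1 exaflops.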

In August this year, two products, the Intel Agilex FPGA and the Stratix 10 NX FPGA, were deployed at the China Innovation Center. The Intel Agilex FPGA integrates Intel SuperFin process technology, chiplets, 3D packaging, and more. Compared with the previous generation, it makes significant progress in production, process, packaging, and interconnect, and it can be widely used in 5G and artificial intelligence scenarios, offering agility and flexibility for a data-centric world. Compared with Intel Stratix 10 FPGAs, Intel Agilex FPGAs provide 45% higher performance and 40% lower power consumption.

China is also making efforts in the ASIC market. Alibaba officially released the new Hanguang 800 AI chip. The performance breakthrough of the Pingtouge Hanguang 800 stems from the collaborative innovation of software and hardware. At the hardware level, it adopts a self-developed chip architecture and effectively removes performance bottlenecks through inference acceleration and other technologies. At the software level, visual algorithms are deeply optimized for computing and storage density, so that large network models can be computed on a single NPU.

Zhongke Yushu has designed the industry’s first DPU chip and smart network card series with integrated network and database acceleration. Its founding team comes from a scientific research institute; the company is developing its third-generation DPU chip, the K2 Pro, and is committed to domestic substitution for DPU chips. OPPO has released its self-developed imaging NPU chip, “Mariana MariSilicon X”.

Cambricon has produced NPU chips in the DianNao series. On August 18, 2021, Baidu launched its first self-developed 7nm chip, the “Kunlun 2nd Generation AI Chip”, at the Baidu World Conference. The performance, versatility, and ease of use of Kunlun Core 2 are significantly better than those of the first generation. The chip adopts a world-leading 7nm process and the self-developed second-generation XPU architecture, which improves performance by 2 to 3 times over the first generation. Its integer (INT8) computing power reaches 256 TeraOPS, its half-precision (FP16) computing power is 128 TeraFLOPS, and its maximum power consumption is only 120W.
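Energy efficiency is often summarized as throughput per watt. Using only the figures quoted above (256 TeraOPS at INT8, 128 TeraFLOPS at FP16, 120W maximum power), a short Python sketch gives the implied peak efficiency. These are peak, back-of-the-envelope values; sustained efficiency in real workloads will differ.

```python
def per_watt(throughput_tera, power_watts):
    """Peak throughput per watt, in tera-operations per second per watt."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return throughput_tera / power_watts


if __name__ == "__main__":
    MAX_POWER_W = 120   # maximum power consumption quoted in the text
    INT8_TERAOPS = 256  # INT8 computing power quoted in the text
    FP16_TERAFLOPS = 128  # FP16 computing power quoted in the text

    print(f"INT8: {per_watt(INT8_TERAOPS, MAX_POWER_W):.2f} TeraOPS per watt")
    print(f"FP16: {per_watt(FP16_TERAFLOPS, MAX_POWER_W):.2f} TeraFLOPS per watt")
```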

ASICs are now widely used in deep learning, data centers, edge computing, and other fields, and they are developing rapidly.