A Survey on High-Throughput Non-Binary LDPC Decoders: ASIC, FPGA and GPU Architectures
Authors: Oscar Ferraz, Srinivasan Subramaniyan, Ramesh Chinthalaa, João Andrade, Joseph R Cavallaro, Soumitra K Nandy, Vitor Silva, Xinmiao Zhang, Madhura Purnaprajna, Gabriel Falcao
Abstract: Non-binary low-density parity-check (NB-LDPC) codes show higher error-correcting performance than binary codes when the codeword length is moderate and/or the channel has bursts of errors. The need for high-speed decoders for future digital communications led to the investigation of optimized NB-LDPC decoding algorithms and efficient implementations that target high throughput and low energy consumption. We carried out a comprehensive survey of existing NB-LDPC decoding hardware that targets the optimization of these parameters. Even though existing NB-LDPC decoders are optimized with respect to computational complexity and memory requirements, they still lag behind their binary counterparts in terms of throughput, power and area optimization. This study contributes to an overall understanding of the state of the art in ASIC-, FPGA- and GPU-based systems, and highlights the current challenges that still have to be overcome on the path to more efficient NB-LDPC decoder architectures.
Authors: Sagar Eetha, PK Sruthi, Vibha Pant, Sai Vikram, Mihir Mody, Madhura Purnaprajna
Abstract: Convolutional Neural Networks (CNNs) are popular in Advanced Driver Assistance Systems (ADAS) for camera perception. The versatility of the algorithm makes it applicable to multiple tasks such as object detection, lane detection and semantic segmentation. For image processing to be viable in driver assistance systems, the throughput requirement is on the order of a few tens of TeraMACs per second (TMACs). In addition, high accuracy in image detection and recognition cannot be sacrificed to meet the need for high throughput.
In this paper, we present TileNET, a novel tiled architecture for ternary-weighted CNNs. TileNET is modular and scalable across variations in network organization and device configurations. Two modes of the implementation are presented, viz., systolic and streaming. A high-level estimation technique has been developed that facilitates fast performance evaluation through design space exploration among a range of target devices and varying CNN models.
Performance has been verified through area and throughput estimation for Xilinx Virtex, Artix, Kintex and Zynq devices. TileNET implemented on Virtex-7 (XC7VX1140T) results in a throughput of about 16 Tera-operations per second (TOPs) for LeNet, AlexNet, ResNet-50 and VGG-16. In addition, the 45nm standard cell implementation of TileNET shows a throughput of about 30 TOPs.
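As an illustration of why ternary weights suit hardware, the sketch below shows a dot product over weights restricted to {-1, 0, +1}, which needs only additions and subtractions. It is a minimal sketch and not TileNET's implementation: the function names `ternarize` and `ternary_dot` and the fixed-threshold ternarization rule are assumptions made for this example.

```python
def ternarize(weights, threshold):
    """Map real-valued weights to {-1, 0, +1}.

    Assumption for illustration: a fixed magnitude threshold decides
    which weights become zero; the abstract does not specify the rule.
    """
    return [0 if abs(w) < threshold else (1 if w > 0 else -1) for w in weights]


def ternary_dot(activations, ternary_weights):
    """Dot product with ternary weights: no multiplications are needed."""
    acc = 0
    for x, w in zip(activations, ternary_weights):
        if w == 1:
            acc += x      # +1 weight: add the activation
        elif w == -1:
            acc -= x      # -1 weight: subtract the activation
        # 0 weight: the activation is skipped entirely
    return acc


if __name__ == "__main__":
    tw = ternarize([0.8, -0.05, -0.6, 0.1], threshold=0.3)
    print(tw, ternary_dot([3, 5, 2, 7], tw))
```

Because each multiply-accumulate collapses to an add, a subtract or a skip, the arithmetic unit per weight is much smaller than a full multiplier, which is the general motivation for ternary-weighted accelerators.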
Authors: Remya Ramakrishnan, Aditya KV Dev, AS Darshik, Renuka Chinchwadkar, Madhura Purnaprajna
Abstract: Convolutional Neural Networks (CNNs) are known for their high performance despite their huge memory requirements and computational complexity. A wide range of compression techniques that reduce the number of parameters, and hence the computational and memory complexity, have been explored recently. In this paper, we analyse three widely used categories of techniques, viz. quantization, pruning and tensor decomposition, to make a cross-platform performance comparison on CPU, GPU and FPGA. These techniques are not mutually exclusive and hence can be combined to get better compression and a better speed-up on devices. Our focus is to highlight the contrasting impact of optimization techniques on devices and performance objectives. We observe a speed-up of 3.8 to 15.6× on CPU, 3.4 to 7.2× on GPU and 10.5 to 29.4× on FPGA across the models and compression techniques under consideration. We also achieved a compression of 93 to 97% across models with acceptable accuracy. Blended techniques have shown a better speed-up on FPGA than on CPU and GPU, as caching effects, memory accesses and compiler optimizations slow down inference on these general-purpose machines.
Authors: Srinivasan Subramaniyan, Oscar Ferraz, MR Ashuthosh, Santosh Krishna, Guohui Wang, Joseph R Cavallaro, Vitor Silva, Gabriel Falcao, Madhura Purnaprajna
Abstract: Signal processing hardware designers of Low-Density Parity-Check (LDPC) decoders used in modern optical communications are confronted with the need to perform multi-parametric design space exploration, targeting very high throughput (hundreds of Mbit/s) and low-power systems. This work addresses the needs of current designers of dedicated GF(2^m) NB-LDPC decoders, who need robust approaches for dealing with the ever-increasing demand for higher BER performance. These constraints put tremendous pressure on the on-chip design of irregular data structures and on the micro-circuit implementation that supports the complex Galois field mathematics and the communication of hundreds of check nodes with hundreds of variable node processors. We have developed kernels targeting GPU and FPGA (HLS and its equivalent RTL) descriptions of this class of complex circuits for comparing area, frequency of operation, latency, parallelism and throughput. Exploiting techniques such as custom bit-widths, pipelining, loop unrolling, array partitioning and the replication of compute units results in considerably faster design cycles and demands less non-recurring engineering effort. We report a throughput of 800 Mbps for the FPGA case.
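To give a sense of the Galois field arithmetic mentioned above, the following minimal sketch multiplies two elements of GF(2^m) by shift-and-add with polynomial reduction. It is not taken from the paper: the function names, the default m = 4 and the reduction polynomial x^4 + x + 1 are assumptions chosen for illustration.

```python
def gf_add(a, b):
    """Addition in GF(2^m) is a bitwise XOR of the coefficient vectors."""
    return a ^ b


def gf_mul(a, b, m=4, poly=0b10011):
    """Multiply a and b in GF(2^m).

    Elements are integers whose bits are polynomial coefficients.
    Assumed defaults: m = 4 and the irreducible polynomial x^4 + x + 1.
    """
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a          # add the current shifted copy of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if a & (1 << m):
            a ^= poly            # reduce modulo the field polynomial
    return result


if __name__ == "__main__":
    # x * x^3 = x^4, which reduces to x + 1 under x^4 + x + 1
    print(gf_mul(0b0010, 0b1000))
```

Each loop iteration uses only shifts and XORs, which is why such multipliers map naturally to fixed-width hardware datapaths; a hardware decoder would typically unroll the loop or use lookup tables instead.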
Authors: Anu George, Madhura Purnaprajna and Prashanth Athri
Abstract: Adaptive sampling molecular dynamics based on Markov State Models (MSMs) uses short parallel MD simulations to accelerate simulations, and is proven to identify hidden conformers. The accuracy of an MSM's predictions depends on the features extracted from the simulated data used to construct it. The identification of the most important features in the trajectories of the simulated system has a considerable effect on the results. Methods. In this study, we use a combination of Laplacian scoring and genetic algorithms to obtain an optimized feature subset for the construction of the MSM. The approach is validated on simulations of three protein folding complexes and two protein-ligand binding complexes. Results. Our experiments show that this approach produces better results when the number of samples is significantly smaller than the number of features extracted. We also observed that this method mitigates the overfitting that occurs due to the high dimensionality of large biosystems with shorter simulation times.
Authors: Oscar Ferraz, Srinivasan Subramaniyan, Guohui Wang, Joseph R Cavallaro, Gabriel Falcao, Madhura Purnaprajna
Abstract: It is commonly perceived that an HLS specification targeted for FPGAs cannot provide throughput on par with equivalent RTL descriptions. In this work we developed a complex design of a non-binary LDPC decoder which, although hard to generalise, shows that HLS provides sufficient architectural refinement options. These options make it possible to attain performance above that of CPU- and GPU-based decoders, and HLS excels at providing a faster design cycle when compared to RTL development.