Pushing the Limits of Energy Efficiency for Non-Binary LDPC Decoders on GPUs and FPGAs


Authors: Srinivasan Subramaniyan, Oscar Ferraz, MR Ashuthosh, Santosh Krishna, Guohui Wang, Joseph R Cavallaro, Vitor Silva, Gabriel Falcao, Madhura Purnaprajna

Description: Signal processing hardware designers of Low-Density Parity-Check (LDPC) decoders used in modern optical communications are confronted with the need to perform multi-parametric design space exploration, targeting very high throughput (hundreds of Mbit/s) and low-power systems. This work addresses the needs of current designers of dedicated GF(2^m) NB-LDPC decoders that necessitate robust approaches for dealing with the ever-increasing demand for higher BER performance. The constraints pose tremendous pressure on the on-chip design of irregular data structures and micro-circuit implementation for supporting the complex Galois field mathematics and communications of hundreds of check nodes with hundreds of variable node processors.


Laplacian score and genetic algorithm based automatic feature selection for Markov State Models in adaptive sampling based molecular dynamics

Authors: Anu George, Madhura Purnaprajna, Prashanth Athri

Description: Adaptive sampling molecular dynamics based on Markov State Models uses short parallel MD simulations to accelerate simulations, and is proven to identify hidden conformers. The accuracy of the model's predictions depends on the features extracted from the simulated data used to construct it. The identification of the most important features in the trajectories of the simulated system has a considerable effect on the results.


Gbit/s Non-Binary LDPC Decoders: High-Throughput using High-Level Specifications

Authors: Oscar Ferraz, Srinivasan Subramaniyan, Guohui Wang, Joseph R Cavallaro, Gabriel Falcao, Madhura Purnaprajna

Description: It is commonly perceived that an HLS specification targeted for FPGAs cannot provide throughput performance on par with equivalent RTL descriptions. In this work we developed a complex design of a non-binary LDPC decoder which, although hard to generalise, shows that HLS provides sufficient architectural refinement options. These options allow attaining performance above that of CPU- and GPU-based implementations and excel at providing a faster design cycle when compared to RTL development.


Recurrent neural networks: An embedded computing perspective

Authors: Nesma M Rezk, Madhura Purnaprajna, Tomas Nordström, Zain Ul-Abdin

Description: Recurrent Neural Networks (RNNs) are a class of machine learning algorithms used for applications with time-series and sequential data. Recently, there has been a strong interest in executing RNNs on embedded devices. However, difficulties have arisen because RNNs require high computational capability and a large memory space. In this paper, we review existing implementations of RNN models on embedded platforms and discuss the methods adopted to overcome the limitations of embedded systems. We define the objectives of mapping RNN algorithms on embedded platforms and the challenges facing their realization. Then, we explain the components of RNN models from an implementation perspective. We also discuss the optimizations applied to RNNs to run efficiently on embedded platforms.


Accelerated CNN Co-Inference through data partitioning on heterogeneous devices

Authors: K Vanishree, Anu George, Srivatsav Gunisetty, Srinivasan Subramanian, Shravan Kashyap, Madhura Purnaprajna

Description: In Convolutional Neural Networks (CNN), low inference time per batch is crucial for real-time applications. To improve the inference time, we present a method (CoIn) that benefits from the use of multiple devices that execute simultaneously. Our method achieves the goal of low inference time by partitioning images of a batch on diverse micro-architectures. The strategy for partitioning is based on offline profiling on the target devices. We have validated our partitioning technique on CPUs, GPUs and FPGAs, including memory-constrained devices, in which case a re-partitioning technique is applied. Average speedups of 1.39x and 1.5x are seen with CPU-GPU and CPU-GPU-FPGA co-execution respectively. Compared with the state-of-the-art approach, CoIn has an average speedup of 1.62x across all networks.
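
The partitioning idea described above can be illustrated with a minimal sketch. It assumes a simple proportional rule: each device receives a share of the batch proportional to its profiled throughput. The function name, device names and throughput figures are hypothetical and are not taken from the paper, and the re-partitioning step for memory-constrained devices is not modelled.

```python
def partition_batch(batch_size, throughput):
    """Split batch_size images across devices in proportion to profiled throughput.

    throughput maps a device name to images/second measured offline
    (hypothetical values, used only for illustration).
    """
    total = sum(throughput.values())
    # Ideal fractional share for each device.
    ideal = {dev: batch_size * t / total for dev, t in throughput.items()}
    # Start from the integer part of each share.
    shares = {dev: int(x) for dev, x in ideal.items()}
    # Give the leftover images to the devices with the largest fractional remainders.
    leftover = batch_size - sum(shares.values())
    by_remainder = sorted(ideal, key=lambda dev: ideal[dev] - shares[dev], reverse=True)
    for dev in by_remainder[:leftover]:
        shares[dev] += 1
    return shares


# Hypothetical offline profile: the GPU is 4x and the FPGA 3x faster than the CPU.
profiled = {"cpu": 50.0, "gpu": 200.0, "fpga": 150.0}
print(partition_batch(64, profiled))
```

The largest-remainder step ensures that the integer shares always sum to the batch size, so no image is dropped or assigned twice.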


Performance Estimation on Heterogeneous Systems: Making the most of Static Analysis

Authors: K Vanishree, Madhura Purnaprajna

Description: Heterogeneous Computing Systems (HCS) comprising accelerators such as GPUs, FPGAs and DSPs are extensively used in the parallel computing domain. The diversity in their micro-architectures makes them suitable for the various parallel scientific applications. Most of the existing systems that address data distribution in HCS depend heavily on the target architecture, which limits the design space exploration to a known device micro-architecture. In contrast, this work uses static code analysis to develop a target-independent performance model to suggest the suitability of a data-parallel regular application to CPU or GPU in a heterogeneous node. This model uses information available at compile time to estimate the performance and the objective is to statically obtain relative performance.


CPU Performance Modeling through Analysis of Primitive Operations

Authors: K Vanishree, Madhura Purnaprajna

Description: Modern multi-core processors are complex because of their complicated memory hierarchies, superscalar issue of instructions, pipeline architecture, out-of-order execution and speculative execution due to branches in the program code. These CPU features help improve application performance. These processors have to be modelled to arrive at the trade-offs of design decisions such as power, time, throughput and latency. Modeling these complex micro-architectures is a very challenging task. In this work, we present a simple CPU modeling technique for data-parallel applications based on minimal offline profiling information and detailed static code analysis.


k-core: Hardware Accelerator for k-mer Generation and Counting used in Computational Genomics

Authors: Simmi M Bose, Varsha S Lalapura, S Saravanan, Madhura Purnaprajna

Description: In computational genomics, the term k-mer typically refers to all the possible subsequences of length k from a single read obtained through DNA sequencing. In genome assembly, computing the frequency of k-mers takes the most compute time. k-mer counting is considered one of the important analyses and the first step in sequencing experiments. Here, we present an FPGA-based fast k-mer generator and counter, k-core, to generate unique k-mers and count their frequency of occurrence. The IP core is parameterizable in terms of the line length (l) and k-mer size (k) and is implemented on an XCVU095 FPGA.
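
As a software reference for what k-mer generation and counting computes, a minimal sketch is given here. It is a plain sliding-window count over a single read and does not reflect the FPGA architecture of k-core; the function name and the example read are hypothetical.

```python
from collections import Counter


def count_kmers(read, k):
    """Count every length-k substring (k-mer) of a single read."""
    if k <= 0 or k > len(read):
        return Counter()
    # Slide a window of width k across the read, one base at a time.
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


# Hypothetical read of length l = 8 with k = 3 yields l - k + 1 = 6 windows.
counts = count_kmers("ACGTACGT", 3)
for kmer, freq in sorted(counts.items()):
    print(kmer, freq)
```

The keys of the returned counter are the unique k-mers and the values are their frequencies of occurrence, which are the two outputs the description attributes to the core.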


Performance modeling for data distribution in heterogeneous computing systems: work in progress

Authors: Madhura Purnaprajna

Description: Balanced data distribution among devices in a Heterogeneous Computing System (HCS) is key to improved application performance. This work presents a model that estimates the data distribution ratio for CPU-GPU co-execution of a data-parallel application in an HCS node. This estimation is based on relative application performance on the devices. To find the relative performance, we build individual performance models of each device (CPU and GPU) using offline profile information. Our CPU and GPU models have an average performance estimation error of 8.64% and 9.03% respectively with respect to the measured performance, across a set of data-parallel benchmarks. With CPU-GPU co-execution, an average performance improvement of 37.3% is seen across benchmarks.
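
One simple way to turn relative performance into a distribution ratio is sketched here. It assumes that execution time scales linearly with the amount of data and that both devices should finish at the same moment; this is an illustrative rule, not necessarily the model used in the paper, and the function name and timings are hypothetical.

```python
def split_ratio(t_cpu, t_gpu):
    """Return (cpu_fraction, gpu_fraction) so that both devices finish together.

    t_cpu and t_gpu are the estimated times for each device to process the
    whole input on its own (for example, from a per-device performance model).
    """
    if t_cpu <= 0 or t_gpu <= 0:
        raise ValueError("execution times must be positive")
    # Relative speed is the inverse of the standalone execution time.
    speed_cpu, speed_gpu = 1.0 / t_cpu, 1.0 / t_gpu
    cpu_fraction = speed_cpu / (speed_cpu + speed_gpu)
    return cpu_fraction, 1.0 - cpu_fraction


# Hypothetical estimates: the CPU needs 8 s and the GPU 2 s for the full input.
cpu_share, gpu_share = split_ratio(t_cpu=8.0, t_gpu=2.0)
# Under the linear assumption, each device is then busy for the same time.
print(cpu_share, gpu_share, cpu_share * 8.0, gpu_share * 2.0)
```

With these numbers the CPU receives about one fifth of the data and both devices are busy for roughly 1.6 s, compared with 2 s for the GPU alone.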


Streaming Tiles: Flexible Implementation of Convolution Neural Networks Inference on Manycore Architectures

Authors: Nesma M Rezk, Madhura Purnaprajna, Zain Ul-Abdin

Description: Convolution neural networks (CNN) are extensively used for deep learning applications such as image recognition and computer vision. The convolution module of these networks is highly compute-intensive. Having an efficient implementation of the convolution module enables realizing the inference part of the neural network on embedded platforms. Low precision parameters require less memory, less computation time, and less power consumption while achieving high classification accuracy. Furthermore, streaming the data over parallelized processing units saves a considerable amount of memory, which is a key concern in memory-constrained embedded platforms. In this paper, we explore the design space for streamed CNN on the Epiphany manycore architecture using varying precisions for weights (ranging from binary to 32-bit).

© 2021 CHIPS | PES University | Electronic City Campus, Bengaluru