
Research.
The success of a computing innovation is largely defined by its efficacy across a broad range of applications. Given the vast diversity in application characteristics, next-generation target architectures can be expected to be heterogeneous, comprising computing blocks of variable granularity and varied communication mechanisms. Capturing this heterogeneity within an application is an area of research with a profound impact.
Objectives
- Heterogeneous computing using CPUs, GPUs & FPGAs
- Automating application-to-architecture mapping
Faster access to innovative solutions and their implementations leads to new scientific advancements. Faster computing capabilities not only accelerate existing applications but also make feasible novel solutions that were previously out of reach on account of high design time. In addition to speeding up applications, the proposed research plan also aims to reduce the energy consumption and improve the resource efficiency of computing systems. Reduced energy consumption has a direct impact on the cost and portability of computing systems.
Current Projects.

Customizable, domain optimized RISCV-based FPGA Overlays
This project has been funded by the Semiconductor Research Corporation.
With the deceleration of Moore’s law and Dennard scaling, general-purpose compute architectures will need to be complemented by domain-specific acceleration to achieve significant performance improvements. Deploying several different hardened application-specific accelerators (such as the Google TPU) presents datacenter-scale provisioning and orchestration challenges.
Other, more general-purpose programmable compute engines such as CPUs and GPUs are not specialized for the workload at hand and are therefore inherently inefficient. FPGAs are a potent, flexible acceleration architecture that is intrinsically capable of adapting to very different workloads. However, leveraging FPGAs as accelerators comes with its own challenges: it requires specialized skills and hinders programmer productivity.

Hardware Accelerator for Sparse Dense Matrix Multiplication
Matrix multiplication has gained importance due to its wide usage in deep neural networks. Sparsity in the matrices requires special consideration to avoid redundant computations and memory accesses, and it becomes relevant in the choice of compression format for storage and memory access. In addition to the compression format, the choice of algorithm also influences the performance of the matrix multiplier. The interplay of algorithm and compression format results in significant variations in several performance parameters such as execution time, memory, and total energy consumed.
Across these choices, our custom hardware accelerator for sparse-dense matrix multiplication shows a difference in speedup of 2X and a difference in energy consumption of about 1.8X.
We show that an intelligent choice of algorithm and compression format, based on the variations in sparsity, matrix dimensions, and device specifications, is necessary for performance acceleration. Our exploration tool for identifying the right mix-and-match is available on GitHub.
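To make the role of the compression format concrete, here is a minimal software sketch of a sparse-dense multiplication using compressed sparse row (CSR) storage. CSR is only one common format and is chosen here as an assumption for illustration; the function names are hypothetical and this is not the project's accelerator or exploration tool. The point it illustrates is that only non-zero entries are stored and multiplied, so both computation and memory traffic scale with the number of non-zeros.

```python
def dense_to_csr(matrix):
    """Compress a dense row-major matrix into CSR arrays.

    Returns (values, col_idx, row_ptr): the non-zero values, their column
    indices, and for each row the offset of its first entry in `values`.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_times_dense(values, col_idx, row_ptr, dense):
    """Multiply a CSR matrix by a dense row-major matrix."""
    n_rows = len(row_ptr) - 1
    n_cols = len(dense[0])
    out = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Visit only the non-zeros stored for row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = values[k]
            b_row = dense[col_idx[k]]
            for j in range(n_cols):
                out[i][j] += a * b_row[j]
    return out


sparse = [[0, 2, 0],
          [0, 0, 0],
          [1, 0, 3]]
dense = [[1, 2],
         [3, 4],
         [5, 6]]

values, col_idx, row_ptr = dense_to_csr(sparse)
result = csr_times_dense(values, col_idx, row_ptr, dense)
print(result)  # [[6, 8], [0, 0], [16, 20]]
```

In this example the all-zero second row costs no multiplications at all. A column-oriented format would favor a different loop order, which is the kind of algorithm and format interplay described above.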

Muscle Strain Monitoring using FSRs
In the recent past, people have adapted to a sedentary lifestyle with work-from-home arrangements in place. After a long period of isolation, people have realized the importance of exercise and fitness. As a result, enthusiasts develop injuries from using incorrect postures and techniques in their workout routines. They risk permanent muscle damage, which needs to be prevented. Any delay in care could seriously harm the physical well-being of the user.
We propose to use a piezoresistive sensor because of its reliability, low cost, and robustness. Our device would be portable and operate on low power. We propose to use actor-critic-based reinforcement learning (RL) algorithms to provide suggestions to the user and to predict serious injuries before they occur. Our device would also support inter-device communication for synchronized data collection and analysis. We can also provide supportive visual interaction for posture correction.
We note the benefits and impact of this product for daily targeted exercises. Because the device would be portable, robust, and non-invasive, no medical professional is required to install or use it at home. It can be used to improve the quality of diagnosis and the quality of life of chronically ill and remote patients.

Acceleration of 3D Thermal Model using FPGAs
This project looks at parallelising the 3D model of the thermal solver using high-level synthesis on FPGAs. It is carried out in collaboration with Sankhyasutra Labs.

Sub-Threshold Standard Cell Design
In the present era of high-density and high-speed nanoelectronics, power consumption has been one of the most concerning factors. Hence there is a rapidly growing demand for ultra-low power devices and advanced energy-saving methods for digital integrated circuits. The need for low-power circuits has up to now been limited to a small number of products, but this situation has changed drastically in the last few years, primarily because of the growing need for portability in computing and telecommunication products.
We further reduce the energy consumption of designs in the commercial UMC 28 nm High-Performance Compact CMOS process technology by down-scaling the supply voltage. We verify an 8-bit ALU that consumes 41.9 µW with X1 standard cells, as opposed to 1.29 mW with regular cells. We also verify a 2-stage RISC-V-v2 processor, with and without branch predictors, at 5 MHz, 10 MHz, and 20 MHz. We reduce power consumption by up to 30.78 times with the ALU and achieve a better quality of results when using the optimization algorithms.
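The reported ALU reduction factor can be checked with a short calculation from the two power figures quoted above. This is only a sanity check of the arithmetic; the variable names are illustrative.

```python
# Power figures for the 8-bit ALU, converted to watts.
regular_cells_w = 1.29e-3  # 1.29 mW with regular cells
x1_cells_w = 41.9e-6       # 41.9 uW with X1 standard cells

# Ratio of regular-cell power to X1-cell power.
reduction = regular_cells_w / x1_cells_w
print(f"{reduction:.1f}x")  # 30.8x
```

The ratio of roughly 30.8 agrees with the stated reduction of up to 30.78 times.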
We automate significant parts of the logic gate design process, enabling the rapid adoption of new processes or alternative designs.