Projects.
CHIPS is always looking for bright and motivated students with a strong interest in computer architecture, FPGA/VLSI design, parallel processing and compilers.
Ongoing Projects
Customizable, domain-optimized RISC-V-based FPGA Overlays
With the slowdown of Moore’s law and Dennard scaling, general-purpose compute architectures will need to be complemented by domain-specific acceleration for significant performance improvement. Deploying several different hardened application-specific accelerators (such as the Google TPU) presents datacenter-scale provisioning and orchestration challenges.
Other, more general-purpose programmable compute engines such as CPUs and GPUs are not specialized for the workload at hand and are therefore inherently inefficient. FPGAs are a potent, flexible acceleration architecture that is intrinsically capable of adapting to very different workloads. However, leveraging FPGAs as accelerators comes with its own challenges, which require specialized skills and hinder programmer productivity.
Team members: Shreenithi Iyer, Hrishikesh Nair, Aditya Jain and Ashuthosh M. R.
Sub-Threshold Standard Cell Design
In the present era of high-density and high-speed nanoelectronics, power consumption has become one of the most pressing concerns. Hence, there is a rapidly growing demand for ultra-low-power devices and advanced energy-saving methods for digital integrated circuits. The need for low-power circuits was until recently limited to a small number of products, but this situation has changed drastically in the last few years, primarily because of the growing need for portability in computing and telecommunication products.
We further reduce the energy consumption of the commercial UMC 28 nm High-Performance Compact CMOS Process Technology by down-scaling the supply voltage. We verify an 8-bit ALU consuming 41.9 µW with X1 standard cells, as opposed to 1.29 mW with regular cells. We verify a 2-stage RISC-V-v2 processor with and without branch predictors at 5 MHz, 10 MHz, and 20 MHz. We reduce the power consumption of the ALU by up to 30.78 times and achieve a better quality of results when using the optimization algorithms.
We automate significant parts of the logic gate design process, enabling the rapid adoption of new processes or alternative designs.
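As a quick sanity check on the ALU figures quoted above, the reduction factor is simply the ratio of the two power numbers. The sketch below uses only the two values stated in the text; the function and constant names are illustrative.

```python
# Power figures taken from the text above; names are illustrative only.
REGULAR_CELL_POWER_W = 1.29e-3   # 8-bit ALU with regular cells (1.29 mW)
X1_CELL_POWER_W = 41.9e-6        # 8-bit ALU with X1 standard cells (41.9 uW)


def reduction_factor(baseline_w, reduced_w):
    """Return how many times lower the reduced power is than the baseline."""
    return baseline_w / reduced_w


factor = reduction_factor(REGULAR_CELL_POWER_W, X1_CELL_POWER_W)
print(f"ALU power reduction: about {factor:.1f}x")
```

The ratio comes out at roughly 30.8, in line with the reduction reported for the ALU.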
Team members: Karthik & Vinay
Hardware Accelerator for Sparse Dense Matrix Multiplication
Matrix multiplication has gained importance due to its wide usage in Deep Neural Networks. Sparsity in the matrices requires special consideration to avoid redundant computations and memory accesses, and it directly affects the choice of compression format for storage and memory access. In addition to the compression format, the choice of algorithm also influences the performance of the matrix multiplier. The interplay of algorithm and compression format results in significant variations in several performance parameters such as execution time, memory, and total energy consumed.
Our custom hardware accelerator for sparse-dense matrix multiplication shows a 2X difference in speedup and a difference of about 1.8X in energy consumption across these choices.
We show that an intelligent choice of algorithm and compression format based on the variations in sparsity, matrix dimensions, and device specifications is necessary for performance acceleration. Our exploration tool for identifying the right mix-and-match is available on GitHub.
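To make the role of the compression format concrete, here is a minimal software sketch of sparse-dense multiplication using the Compressed Sparse Row (CSR) format. CSR and the row-wise loop order are assumptions chosen for illustration; the text does not say which formats or algorithms the accelerator uses, and the function names are hypothetical.

```python
# Illustrative sketch only: CSR is one of several possible compression formats,
# and the row-wise algorithm below is an assumption, not the project's design.

def dense_to_csr(matrix):
    """Compress a dense row-major matrix into CSR (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends in the values array.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_times_dense(values, col_idx, row_ptr, dense):
    """Multiply a CSR matrix by a dense matrix, touching only nonzeros."""
    n_rows = len(row_ptr) - 1
    n_cols = len(dense[0])
    out = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = values[k]
            b_row = dense[col_idx[k]]  # only rows of B matching nonzeros are read
            for j in range(n_cols):
                out[i][j] += a * b_row[j]
    return out


sparse = [[0, 2, 0], [0, 0, 0], [1, 0, 3]]
dense = [[1, 2], [3, 4], [5, 6]]
values, col_idx, row_ptr = dense_to_csr(sparse)
result = csr_times_dense(values, col_idx, row_ptr, dense)
print(result)
```

Only three of the nine entries of the sparse operand are stored and multiplied here. A different format or loop order would change the memory access pattern, which is the kind of trade-off the exploration tool is meant to evaluate.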
Team Members: Ashuthosh, Santosh, Srinivasan and Vishvas
Muscle Strain Monitoring using FSRs
In the recent past, people have adapted to a sedentary lifestyle as working from home became common. After a long period of isolation, many now realize the importance of exercise and fitness. As a result, enthusiasts develop injuries from using incorrect postures and techniques in their workout routines. They risk permanent muscle damage, which needs to be prevented. Any delay in care could be fatal for the physiological well-being of the user in question.
We propose to use a piezoresistive sensor due to its reliability, low cost, and robust nature. Our device would be portable and operate on low power. We propose to use actor-critic-based reinforcement learning (RL) algorithms to provide suggestions to the user and predict fatalities before they occur. Our device would also support inter-device communication for synchronized data collection and analysis. We can also provide supportive visual interaction for correction.
We note the benefits and impact of this product for daily targeted exercises. As this device would be portable and robust using non-invasive techniques, there is no requirement for a medical professional to install/use this device at home. This can be used to improve the quality of diagnosis and life of chronically ill and distant patients.
Team Members: Kavita, Dhushyanth and Karthik
Past Projects
Accelerating Molecular Dynamics
Within this project, our objective is to identify the drawbacks in current FPGA architectures, since they have been mainly targeted at DSP applications and algorithms. Some questions that we want to answer are: (1) What are the features that FPGAs lack that could simplify/accelerate drug discovery applications? (2) How can FPGAs help accelerate molecular dynamics? (3) Could partial reconfiguration be a boon for computational chemists? (4) Could we re-architect the FPGA specifically for drug discovery applications?
Circuit-level exploration of FPGA architectures
Transistor sizing in FPGAs is a complex optimisation problem. Studies in this direction have explored the impact of sizing closely coupled lookup tables (LUTs) in identical tile-based FPGAs. Our focus is to analyse the impact of application-level variabilities introduced through the configuration data or input changes. In this paper, our objective is to: (1) understand the impact of application-level data on transistor sizing in pass-transistor-based LUTs and (2) suggest an alternative LUT implementation that guarantees constant response time.
Hardware accelerators for Deep Learning
Convolutional Neural Networks (CNNs) are becoming increasingly popular in advanced driver assistance systems (ADAS) and automated driving (AD) for camera perception, enabling multiple applications such as object detection, lane detection, and semantic segmentation. The ever-increasing need for multiple high-resolution cameras around the car necessitates a very high throughput, on the order of a few tens of TeraMACs per second (TMACS), along with high detection accuracy. This project will suggest a scalable architecture exceeding a few hundred GOPs.
Accelerating Genome Sequence Analysis
In computational genomics, the term k-mer typically refers to all the possible subsequences of length k from a single read obtained through DNA sequencing. In genome assembly, generating the frequency of k-mers takes the highest compute time. k-mer counting is considered one of the most important analyses and the first step in sequencing experiments. Here, we explore an FPGA-based fast k-mer generator and counter, k-core, to generate unique k-mers and count their frequency of occurrence.
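For readers new to the term, the following minimal software sketch shows what k-mer generation and counting means for a single read. It is a plain reference version for illustration, not the k-core FPGA design; the function name and the example read are hypothetical.

```python
from collections import Counter


def count_kmers(read, k):
    """Return a Counter mapping each k-mer in `read` to its frequency."""
    if k <= 0 or k > len(read):
        return Counter()
    # Slide a window of length k across the read, one base at a time.
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


counts = count_kmers("ACGTACGT", 3)
for kmer, freq in sorted(counts.items()):
    print(kmer, freq)
```

A read of length n yields n - k + 1 overlapping k-mers, so the work grows with both the number of reads and the read length, which is why this step tends to dominate compute time.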
Graph Algorithms
We considered a few popular graph algorithms: PageRank, Single Source Shortest Path (SSSP), Breadth-First Search (BFS), and Depth-First Search (DFS). We employed high-level synthesis and its optimization methodologies to design an FPGA accelerator for each algorithm. Due to resource constraints on the device, we adopted algorithm-specific graph partitioning schemes to process large graphs. Using the GAP Benchmark Suite running on a CPU as the baseline for evaluating the performance of our design, we obtained a speedup of 5x for BFS and 20x for SSSP.
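As a point of reference for one of the algorithms listed above, here is a minimal software sketch of BFS over an adjacency list. It is a textbook queue-based version written for illustration and is not the HLS accelerator or its partitioning scheme; the function name and example graph are hypothetical.

```python
from collections import deque


def bfs_levels(adjacency, source):
    """Return the hop distance from `source` to every reachable vertex."""
    depth = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adjacency.get(u, []):
            if v not in depth:  # first visit fixes the shortest hop count
                depth[v] = depth[u] + 1
                frontier.append(v)
    return depth


graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
levels = bfs_levels(graph, 0)
print(levels)
```

The neighbour lookups are irregular and data dependent, which is one reason large graphs are typically partitioned before being processed on a device with limited on-chip memory.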
Source code and reports are here
Accelerator for Genomics
Genome sequencing is increasingly used in healthcare and research to identify genetic variations. The genome of an organism consists of a few million to billions of base pairs. Oxford Nanopore sequencing works by monitoring changes in electrical current, and the resulting signal is basecalled using a neural network to produce the DNA sequence. Deep learning operations are computationally intensive because they involve multiplying tensors (multi-dimensional matrices). We use methods like pruning and quantization to lower the amount of computation and reduce the size of the model.
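The two model-compression methods named above can be sketched on a small list of weights. This example assumes magnitude-based pruning and symmetric 8-bit quantization, which are common variants; the text does not specify which schemes the project uses, and all names and values are hypothetical.

```python
# Illustrative sketch: magnitude pruning and symmetric int8 quantization are
# assumed variants, not necessarily the ones used in the project.

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first n_prune get zeroed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric linear quantization to signed 8-bit integers."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


weights = [0.5, -0.02, 1.27, 0.01, -0.8, 0.03]
pruned = prune_by_magnitude(weights, 0.5)
q, scale = quantize_int8(pruned)
print(pruned)
print(q, scale)
```

Pruning removes multiplications by turning weights into zeros, while quantization shrinks each remaining weight from a floating-point value to a single byte plus one shared scale factor.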