The efficacy of a computing innovation over a broad range of applications largely defines its success. Given the vast diversity in application characteristics, next-generation target architectures can be expected to be heterogeneous, comprising computing blocks of variable granularity and varied communication mechanisms. Capturing this heterogeneity within an application is a research area with profound impact.
Heterogeneous computing using CPUs, GPUs & FPGAs
Automating application-to-architecture mapping
Faster access to innovative solutions and their implementations leads to new scientific advancements. Faster computing capabilities not only accelerate existing applications but also make feasible novel solutions that were previously impractical on account of high design time. In addition to speeding up applications, the proposed research plan aims to reduce energy consumption and improve the resource efficiency of computing systems. Reduced energy consumption has a direct impact on the cost and portability of computing systems.
Domain Specific Many-Core
With the deceleration of Moore’s law and Dennard scaling, general-purpose compute architectures will need to be complemented by domain-specific acceleration to achieve significant performance improvements. Deploying several different hardened application-specific accelerators (such as the Google TPU) presents datacenter-scale provisioning and orchestration challenges.
Other, more general-purpose programmable compute engines such as CPUs and GPUs are not specialized for the workload at hand and are therefore inherently inefficient. FPGAs are a potent, flexible acceleration architecture that is intrinsically capable of adapting to very different workloads. However, leveraging FPGAs as accelerators comes with its own challenges, which require specialized skills and hinder programmer productivity.
Sub-Threshold Standard Cell Design
In the present era of high-density and high-speed nanoelectronics, power consumption has become one of the primary concerns. Hence there is a rapidly growing demand for ultra-low-power devices and advanced energy-saving methods for digital integrated circuits. The need for low-power circuits was long limited to a small number of products, but this situation has changed drastically in the last few years, mainly because of the growing need for portability in computer and telecommunication products.
In this work, we discuss the motivation, trends, and challenges of operating digital circuits in the sub-threshold regime using UMC’s commercial 0.18 µm and 28 nm high-performance compact high-K bulk CMOS processes. We synthesize a 32-bit, six-stage-pipeline RV32IMFC Chromite RISC-V core, as well as the PULP and PicoRV32 RISC-V cores. We further reduce the energy consumption of the commercial UMC 55 nm High-Performance Compact CMOS process technology by down-scaling the supply voltage.
We automate significant parts of the logic gate design process, enabling the rapid adoption of new processes or alternative designs.
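As a rough illustration of why supply-voltage down-scaling saves energy, the sketch below uses the standard first-order model in which dynamic switching energy per operation scales with the square of the supply voltage (E_dyn = alpha * C * Vdd^2). The voltages are hypothetical and are not taken from this work. The model also deliberately ignores leakage energy, which becomes increasingly important in the sub-threshold regime as circuit delay grows.

```python
def dynamic_energy_ratio(v_scaled: float, v_nominal: float) -> float:
    """Ratio of dynamic switching energy per operation at v_scaled vs. v_nominal.

    First-order model: E_dyn = alpha * C * Vdd^2. The activity factor (alpha)
    and switched capacitance (C) cancel in the ratio. Leakage is NOT modeled.
    """
    if v_scaled <= 0 or v_nominal <= 0:
        raise ValueError("supply voltages must be positive")
    return (v_scaled / v_nominal) ** 2


# Hypothetical voltages, chosen only for illustration.
ratio = dynamic_energy_ratio(v_scaled=0.4, v_nominal=1.2)
print(f"dynamic energy per operation: {ratio:.3f}x nominal")
```

Under these assumed voltages, the model gives roughly a ninefold reduction in dynamic energy per operation. A real design would also have to account for leakage and the resulting minimum-energy operating point.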
In-Memory Computation (IMC) using Hybrid Memory Cube (HMC)
In IMC, computations are performed directly within the memory modules rather than by moving data back and forth between the memory and the host (CPU). IMC is often associated with the concept of Processing-in-Memory (PIM), where memory modules contain processing units that perform simple operations such as arithmetic and logical functions. We compare the performance of in-memory computation on the HMC with traditional CPU-based execution and PIM-based execution.
Instruction Profiler - RISC-V Profiling for identification of custom instruction-set extensions
RISC-V is rapidly establishing itself in almost all domains, owing to its ability to improve the performance and efficiency of a program. In processor design, the instruction set architecture is decided mainly by the requirements of the application. An extendable ISA lets the user make application-specific adjustments for improved performance, at the cost of additional area and a more complex design. Adding a custom instruction also lets one decide whether to implement it in additional hardware or to use existing hardware with embedded software. To enable and disable the instructions, we used Codasip Studio 9.2.0 under the instruction-accurate (IA) model with O3 optimization for ease of integration.
After comparing the cycle counts and area utilization reported by Codasip Studio for the Embench benchmarks considered, we conclude that the instructions can be implemented fully in hardware, fully in software, or as a hybrid model. A maximum cycle improvement of 42.131% and a maximum area reduction of 1.198% were observed when using the optimal solution for the Embench benchmarks.
PSIMD Extension of RISC-V
This project extends the PSIMD extension of RISC-V to 16-bit and 8-bit float data types. Currently, the PSIMD extension is defined only for 8-bit, 16-bit, 32-bit, and 64-bit integers. PSIMD for integers works well for DSP, audio, and video applications but does not scale well to ML applications in edge analytics, which perform well with 16-bit floats. In PSIMD, the operands reside in the same integer register file, i.e., an RV64 core can hold four 16-bit operands or eight 8-bit operands in one register. This project proposes a PSIMD sub-extension for the Bfloat16 and DLFloat16 data types, in which up to four parallel DLFloat16 or Bfloat16 operations can be performed.
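The idea of holding four 16-bit float operands in one 64-bit integer register can be sketched in software as follows. This is a behavioral illustration only, not the proposed instruction encoding or hardware. The function names (`pack`, `unpack`, `padd_bf16`) are hypothetical, the lane ordering (lane 0 in the least-significant bits) is an assumption, and the float32-to-Bfloat16 conversion uses simple truncation rather than any rounding mode the actual extension might specify.

```python
import struct

LANES = 4           # four 16-bit lanes in one 64-bit (RV64) register
LANE_BITS = 16
LANE_MASK = 0xFFFF


def f32_to_bf16(x: float) -> int:
    """Bfloat16 bits = upper 16 bits of the float32 encoding (truncation, assumed)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_to_f32(b: int) -> float:
    """Widen Bfloat16 bits back to a float by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & LANE_MASK) << 16))
    return x


def pack(values):
    """Pack four floats into one 64-bit integer; lane 0 is least significant (assumed)."""
    if len(values) != LANES:
        raise ValueError(f"expected {LANES} values")
    reg = 0
    for i, v in enumerate(values):
        reg |= f32_to_bf16(v) << (i * LANE_BITS)
    return reg


def unpack(reg: int):
    """Unpack a 64-bit integer into four floats, one per 16-bit lane."""
    return [bf16_to_f32((reg >> (i * LANE_BITS)) & LANE_MASK) for i in range(LANES)]


def padd_bf16(rs1: int, rs2: int) -> int:
    """Hypothetical packed add: four independent Bfloat16 additions in one operation."""
    return pack([a + b for a, b in zip(unpack(rs1), unpack(rs2))])


rs1 = pack([1.0, 2.5, -3.0, 0.5])
rs2 = pack([1.0, 0.5, 3.0, 0.25])
result = unpack(padd_bf16(rs1, rs2))
print(result)
```

The example values are all exactly representable in Bfloat16, so the truncating conversion loses no precision here. With arbitrary inputs, each lane would carry only about 8 bits of significand precision.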
Physical Hardening of RISC-V
The Azurite core generator at InCore currently offers a wide variety of configurable features. In this project, we take a few standard configurations (architectural and micro-architectural) of Azurite and perform synthesis using OpenLane and commercial tools. This will provide an initial estimate of power, performance, and area (PPA) and indicate opportunities to enhance the core for performance, area, and/or power.
The outputs also include a complete automated flow (using OpenLane and commercial flows) that can be incorporated into a CI pipeline, thereby enabling quick comparison with older versions of the core.
A separate aspect of this exercise is building a script that allows a generated Azurite instance to be treated as a Vivado board-design IP. This will allow customers to readily use this IP to build their own custom SoCs with third-party IPs.
RAP: RISC-V Application Profiler
The goal is to build a Python-based tool that parses an execution log and captures interesting insights about the application or test that was run on the RISC-V instance (simulator, RTL, emulator, etc.). While the majority of the insights would be ISA related, it may also be possible to capture micro-architectural insights such as cache misses, branch mispredictions, and instruction latencies.
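A minimal sketch of the log-parsing idea is shown below. It is not the actual RAP implementation. The log layout (a Spike-style `core N: 0x<pc> (0x<encoding>) <mnemonic> <operands>` line) is assumed for illustration, the sample lines are made up, and the names `LINE_RE` and `profile_log` are hypothetical. The only insight it extracts is an instruction-mnemonic histogram, one of the simpler ISA-related metrics.

```python
import re
from collections import Counter

# Assumed line layout: "core   0: 0x<pc> (0x<encoding>) <mnemonic> <operands>"
LINE_RE = re.compile(
    r"^core\s+\d+:\s+0x(?P<pc>[0-9a-fA-F]+)\s+"
    r"\(0x(?P<enc>[0-9a-fA-F]+)\)\s+(?P<mnemonic>\S+)"
)


def profile_log(lines):
    """Count how often each instruction mnemonic appears in an execution log."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # silently skip lines that do not look like instruction commits
            counts[match.group("mnemonic")] += 1
    return counts


# Made-up sample log, for illustration only.
SAMPLE_LOG = """\
core   0: 0x0000000080000000 (0x00000297) auipc   t0, 0x0
core   0: 0x0000000080000004 (0x02028593) addi    a1, t0, 32
core   0: 0x0000000080000008 (0xf1402573) csrr    a0, mhartid
core   0: 0x000000008000000c (0x00a50533) add     a0, a0, a0
core   0: 0x0000000080000010 (0x00150513) addi    a0, a0, 1
""".splitlines()

histogram = profile_log(SAMPLE_LOG)
for mnemonic, count in histogram.most_common():
    print(f"{mnemonic:8s} {count}")
```

Keeping the line pattern in a single regular expression means that supporting a different simulator or RTL log format would mostly involve swapping that pattern. Micro-architectural insights such as cache misses would need additional fields in the log, which this sketch does not assume.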