A modular trans-precise kernel characterization for deep learning

Deep neural networks are powerful methods used in a variety of large-scale real-world problems such as image classification, object detection, natural language processing, and human action recognition. Although state-of-the-art results on easier tasks exceed human accuracy, these methods still pose several challenges, and overcoming them would make the approaches even more attractive economically. For example, traditional models built to run on GPUs require expensive infrastructure, are slow to execute on traditional hardware, or might consume too much power. Specific FPGA implementations have demonstrated success by improving power and energy figures compared to GPU implementations.

However, case-specific implementations lack modular methodologies that can be extended to novel deep learning models. The rapid development of frameworks, models, and precision options challenges the adaptability of kernel accelerators, since adapting to a new requirement incurs significant engineering costs. Programmable accelerators offer a promising alternative by allowing reconfiguration of a virtual architecture overlaid on top of the physical FPGA configurable fabric. Within this project, we aim to follow a modular approach that includes the characterization of kernels and the implementation of efficient kernels in overlay architectures (e.g. VTA, https://docs.tvm.ai/vta/). These components are designed and implemented in conjunction with methodologies that allow them to be integrated into new models. We expect that reusable components will help improve the energy efficiency of the next generation of deep learning models.

Deep learning models are known to be inherently error-resilient. Hence, they are excellent use cases for reducing intermediate precision levels while still meeting strict quality constraints. However, to benefit from reduced number representations, the underlying hardware needs to be designed to account for those opportunities. We have already performed extensive numerical emulation studies that demonstrate the favorable numerical behavior of a wide set of well-established deep learning models used for image classification.
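To make the idea of numerical emulation concrete, the following is a minimal sketch of how a reduced-precision dense layer could be emulated in software before any hardware is built. It is not the emulation framework used in the studies mentioned above; the function names (quantize, dense), the signed fixed-point format, and the example values are assumptions chosen purely for illustration.

```python
def quantize(x, total_bits, frac_bits):
    """Round x onto a signed fixed-point grid and saturate on overflow.

    Assumed format: total_bits including the sign bit, frac_bits fractional bits.
    Python's round() rounds halves to the nearest even integer.
    """
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, round(x * scale)))
    return q / scale


def dense(weights, inputs, bias, total_bits=None, frac_bits=None):
    """Dense layer y = W x + b.

    With total_bits=None the layer runs in plain floating point (reference).
    Otherwise weights, inputs, bias, and outputs are quantized; the
    accumulation itself is kept in floating point for simplicity.
    """
    def q(v):
        return v if total_bits is None else quantize(v, total_bits, frac_bits)

    outputs = []
    for row, b in zip(weights, bias):
        acc = sum(q(w) * q(x) for w, x in zip(row, inputs)) + q(b)
        outputs.append(q(acc))
    return outputs


# Illustrative values only, not taken from any real model.
weights = [[0.5, -0.25, 0.125], [1.5, 0.75, -0.5]]
inputs = [0.3, -0.7, 0.9]
bias = [0.1, -0.2]

reference = dense(weights, inputs, bias)
for total_bits, frac_bits in [(16, 12), (8, 5), (4, 2)]:
    approx = dense(weights, inputs, bias, total_bits, frac_bits)
    max_err = max(abs(a - r) for a, r in zip(approx, reference))
    print(f"{total_bits} bits ({frac_bits} fractional): max abs error = {max_err:.6f}")
```

Comparing the quantized outputs against the floating-point reference gives a first impression of how much precision a kernel can give up before the quality constraint is violated.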

Goals:

Implement common kernels on FPGA such as dense layers, convolutional layers, …
Implement reduced precision variants of each kernel (templated)
Perform an extensive design space exploration by instantiating exhaustive configurations for each kernel
Map numerical behavior of deep learning models into the new design space to understand the overall trade-off between quality and performance
Work on a unified methodology that provides a workflow to generate FPGA designs of arbitrary deep learning models
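The design space exploration and the quality/performance trade-off named in the goals above could be prototyped along the following lines. This is only a sketch under stated assumptions: the explored parameters (total and fractional bit widths), the use of the bit width as a cost proxy, and all function names are hypothetical. A real exploration would instead use resource and latency figures from synthesis reports and model-level accuracy.

```python
def quantize(x, total_bits, frac_bits):
    """Signed fixed-point rounding with saturation (software emulation only)."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, round(x * scale))) / scale


def dot_error(total_bits, frac_bits, weights, inputs):
    """Absolute error of a quantized dot product against the float result."""
    exact = sum(w * x for w, x in zip(weights, inputs))
    approx = sum(
        quantize(w, total_bits, frac_bits) * quantize(x, total_bits, frac_bits)
        for w, x in zip(weights, inputs)
    )
    return abs(quantize(approx, total_bits, frac_bits) - exact)


def explore(total_bits_options, weights, inputs):
    """Instantiate every (total_bits, frac_bits) pair and record cost and error."""
    results = []
    for total_bits in total_bits_options:
        for frac_bits in range(total_bits):  # keep at least the sign bit
            results.append({
                "total_bits": total_bits,
                "frac_bits": frac_bits,
                # Assumption: bit width as a crude stand-in for hardware cost.
                "cost": total_bits,
                "error": dot_error(total_bits, frac_bits, weights, inputs),
            })
    return results


def pareto_front(points):
    """Keep the points that no other point beats in both cost and error."""
    front = []
    for p in points:
        dominated = any(
            q["cost"] <= p["cost"] and q["error"] <= p["error"]
            and (q["cost"] < p["cost"] or q["error"] < p["error"])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Illustrative operands only, not taken from any real model.
results = explore([4, 8, 16], [0.5, -0.25, 0.125], [0.3, -0.7, 0.9])
for config in pareto_front(results):
    print(config)
```

The exhaustive enumeration mirrors the "instantiate exhaustive configurations" goal, while the Pareto filter keeps only those configurations that are worth considering when trading quality against cost.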

Recommended experience:

Xilinx Vivado Design Suite
Xilinx Vivado HLS (C/C++)
Verilog/VHDL for RTL implementations
Python

The candidate should be motivated to learn and extend their knowledge in the domain of deep learning.

References:

Constrained deep neural network architecture search for IoT devices accounting hardware calibration (https://arxiv.org/abs/1909.10818)
A System-Level Transprecision FPGA Accelerator for BLSTM Using On-chip Memory Reshaping (https://ieeexplore.ieee.org/abstract/document/8742271)
Exemplary background work: https://github.com/oprecomp/HLS_BLSTM
Deep learning overlays for transprecision (TP) characterization in this project (others can be proposed):
TVM’s Versatile Tensor Accelerator (VTA): https://docs.tvm.ai/vta/
Xilinx Deep Learning Processor Unit (DPU): https://www.xilinx.com/products/intellectual-property/dpu.html

For additional information, you can contact: