Accelerated scientific computing on FPGAs (Calcul scientifique accéléré sur FPGAs)
from Monday 4 July 2022 (11:30)
to Thursday 7 July 2022 (19:30)
Monday 4 July 2022
11:30 - 13:45
Welcome - lunch
13:45 - 14:00
Welcome
Matthieu Haefele (Laboratoire de Mathématique et de ses Applications de Pau)
Nicolas Gac (L2S - Université Paris Saclay)
14:00 - 15:00
Introduction to FPGA computation and architecture
Florent De Dinechin (INSA Lyon)
On the one hand, an FPGA is a multiprocessor on a chip with up to several million elementary processors and a cumulative internal bandwidth of several Tbit/s. On the other hand, these elementary processors operate on single bits, and their frequency is far below what you get in a conventional CPU. Worse, compiling a program for such a chip may take several days. All things considered, are FPGAs any good at scientific computing? The answer, of course, is "it depends", and this talk will attempt to refine this statement. After a perfectly balanced presentation of FPGA architectures and programming models, it will provide a serenely partial and biased review of FPGA success stories in scientific computing.
15:00 - 15:30
Neural Network Accelerator Co-Design with FINN
Thomas Preusser (AMD)
High-throughput and low-latency edge applications need co-designed solutions to meet their performance requirements. Quantized Neural Networks (QNNs) combined with custom FPGA dataflow implementations offer a good balance of performance and flexibility, but building such implementations by hand is difficult and time-consuming. In this presentation, we will introduce FINN, a framework for building fast and flexible FPGA accelerators around a heterogeneous streaming architecture. It is an open-source experimental framework by Xilinx Research Labs that helps the broader community explore deep neural network (DNN) inference on FPGAs. It specifically targets QNNs, with emphasis on generating dataflow-style architectures customized for each network. It is not intended to be a generic DNN accelerator like xDNN, but rather a tool for exploring the design space of DNN inference accelerators on FPGAs. The key components are Brevitas for training quantized neural networks, the FINN compiler, and the finn-hlslib Vivado HLS library of FPGA components for QNNs.
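In FINN's flow, quantization is handled by Brevitas during training; as a rough, hand-written illustration of the underlying idea only, here is a minimal symmetric uniform quantizer (the function names and the 4-bit default are our own, not from the framework):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Symmetric uniform quantization of a weight to a `bits`-bit signed
// integer; `scale` maps the float range onto the integer grid.
inline int8_t quantize_weight(float w, float scale, int bits = 4) {
    const int qmax = (1 << (bits - 1)) - 1;   // e.g. +7 for 4 bits
    const int qmin = -qmax - 1;               // e.g. -8 for 4 bits
    int q = static_cast<int>(std::lround(w / scale));
    return static_cast<int8_t>(std::clamp(q, qmin, qmax));
}

// Map the integer back to float to estimate the quantization error.
inline float dequantize(int8_t q, float scale) { return q * scale; }
```

The reduced bit width is what lets a dataflow accelerator replace wide multipliers with very small arithmetic units, one layer at a time.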
15:30 - 16:00
Break
16:00 - 16:30
Co-design for the SKA project
Shan Mignot (Laboratoire Lagrange)
Radio telescopes cover wide frequency bands and make extensive use of interferometry. This leads to the production of large volumes of data and calls for considerable processing. Because storing the raw data is materially impossible, the SKA project has decided to incorporate the processing facilities into the telescopes. Two supercomputers, one for each telescope, are hence envisaged to ingest an expected representative flow of 0.77 TB/s and carry out preliminary data reduction tasks, both to reduce the volume of data and to yield science products. Pre-construction work has led to a concept based on a homogeneous set of nodes processing data on-the-fly or on-demand within a few days. This distinction stems from the need to provide operational feedback and to average the computing load, which peaks at an estimated 125 PFlops but averages out to 10 PFlops. Generic COTS systems have hitherto been considered to maximise versatility and refrain from specialising software development. However, significant risks have been identified concerning procurement and operating costs. I will present the ongoing co-design exercise that aims to mitigate them. In this frame, with the advent of high-level synthesis, FPGAs, with their higher resource utilisation and lower operating frequencies, could become an option, notably for on-the-fly tasks, in-network processing, or as accelerators for selected calculations, should the risk/benefit ratio prove favourable.
16:30 - 17:00
FPGA acceleration of 3D CT reconstruction using OpenCL and oneAPI tools
Daouda Diakite (L2S - Université Paris Saclay)
Many-core processors such as GPUs are currently the preferred technological target for accelerating HPC applications. However, architectures designed on FPGAs can be interesting alternatives to GPUs because they potentially consume less power and have become accessible thanks to the new high-level synthesis (HLS) tools provided by leading manufacturers such as Intel and Xilinx. Still, exploiting the full potential of FPGAs via HLS tools requires a deep knowledge of their architecture and a significant effort to match the application to the underlying architecture. In this talk, I will present the principles of HLS tools as well as a methodology for FPGA acceleration through Intel's OpenCL and oneAPI tools. The 3D back-projection operator, present in iterative tomographic reconstruction algorithms, is considered as a use case for this methodology.
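The talk's use case is the 3D back-projection operator; as a hypothetical, much-simplified 2D sketch of what that operator computes (parallel-beam geometry, nearest-neighbour detector lookup — the real 3D operator adds geometry, interpolation, and the HLS memory optimizations discussed in the talk):

```cpp
#include <cmath>
#include <vector>

// Voxel-driven 2D back-projection: each pixel of the n x n image
// accumulates, over all angles, the detector sample its projection
// falls on. `sino` is the sinogram, laid out angle-major.
std::vector<float> backproject(const std::vector<float>& sino,
                               int n_angles, int n_det, int n) {
    std::vector<float> img(n * n, 0.0f);
    const float c  = (n - 1) / 2.0f;       // image center
    const float dc = (n_det - 1) / 2.0f;   // detector center
    for (int a = 0; a < n_angles; ++a) {
        const float th = a * 3.14159265f / n_angles;
        const float ct = std::cos(th), st = std::sin(th);
        for (int y = 0; y < n; ++y)
            for (int x = 0; x < n; ++x) {
                // signed distance of the pixel from the central ray
                const float t = (x - c) * ct + (y - c) * st;
                const int d = static_cast<int>(std::lround(t + dc));
                if (d >= 0 && d < n_det)
                    img[y * n + x] += sino[a * n_det + d];
            }
    }
    return img;
}
```

The triple loop over angles and pixels is exactly the kind of regular, accumulation-heavy kernel that HLS tools can pipeline deeply, which is why it is a natural FPGA candidate.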
17:00 - 17:30
Matrix-free conjugate gradient with Maxeler Data Flow Engine technology
Charles Prouveur (CEA)
In this presentation, the implementation of a mini-app extracted from a production code in materials science (Metalwalls) using Maxeler technology will be explained, after which a chip-to-chip comparison between a CPU, a GPU and an FPGA, as well as a scalability study on multiple FPGAs, will be presented. The core algorithm is a matrix-free conjugate gradient that computes the total electrostatic energy through an Ewald summation at each iteration. The FPGA implementation, using a 40-bit floating-point representation, outperforms the CPU implementation both in computing power and in energy usage, resulting in an energy efficiency more than 14 times better. Compared to a GPU of the same generation, the FPGA reaches 60% of the GPU performance while its performance per watt is still better by a factor of 3. Thanks to its low average power usage, the FPGA bests both the fully loaded CPU and the GPU in terms of conjugate gradient iterations per second and per watt.
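"Matrix-free" means the operator is only available as a function applying A·x, never as a stored matrix — the structure the talk maps onto the Data Flow Engine. A minimal generic sketch of that algorithm for a symmetric positive-definite operator (names are ours, not from Metalwalls):

```cpp
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using MatVec = std::function<Vec(const Vec&)>;

// Matrix-free conjugate gradient solving A x = b, with A given only
// as a matrix-vector product callback.
Vec conjugate_gradient(const MatVec& A, const Vec& b,
                       int max_iter = 100, double tol = 1e-10) {
    auto dot = [](const Vec& u, const Vec& v) {
        double s = 0.0;
        for (size_t i = 0; i < u.size(); ++i) s += u[i] * v[i];
        return s;
    };
    Vec x(b.size(), 0.0), r = b, p = r;      // x0 = 0, r0 = b
    double rr = dot(r, r);
    for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
        Vec Ap = A(p);
        double alpha = rr / dot(p, Ap);      // step length
        for (size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(r, r);
        for (size_t i = 0; i < p.size(); ++i)
            p[i] = r[i] + (rr_new / rr) * p[i];  // new search direction
        rr = rr_new;
    }
    return x;
}
```

In the talk's setting, `A(p)` hides the Ewald summation, which dominates the cost of each iteration and is what the FPGA dataflow pipeline accelerates.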
Tuesday 5 July 2022
09:00 - 10:00
AMD-Xilinx System on Chip (SoC) FPGA: an introduction + demo
Olivier Régnault (AMD-Xilinx dedicated FAE & Product Line Manager)
Olivier Regnault is a senior expert in FPGA & System on Chip (SoC) architectures and works as a Field Application Engineer & Product Line Manager for the European semiconductor distributor Avnet Silica. The talk will start with an introduction to the AMD-Xilinx SoC architecture, with a focus on the Zynq UltraScale+ family. A development demonstration with Vivado and Vitis will follow and, to conclude, a presentation of the new Versal architecture for accelerated computing platforms.
10:00 - 10:30
Break
10:30 - 12:30
Introduction to AMD-Xilinx Vitis HLS
William Duluc (MVD Training)
The presentation will give an overview of the Vitis HLS tool, which can translate C/C++ code into an RTL description in order to implement the function on an FPGA architecture. The second part of the presentation will detail some of the optimization techniques used in the hands-on lab.
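As a flavour of the kind of C++ that Vitis HLS turns into RTL, here is a small, hypothetical 4-tap moving-average filter; the `#pragma HLS` directives (simply ignored by a regular compiler) are the standard way to request loop pipelining and register-mapped arrays, though the exact pragmas used in the lab may differ:

```cpp
// 4-tap moving-average FIR over a stream of n samples.
constexpr int TAPS = 4;

void fir_stream(const float* in, float* out, int n) {
    float shift[TAPS] = {0.0f, 0.0f, 0.0f, 0.0f};  // delay line
#pragma HLS ARRAY_PARTITION variable=shift complete
    for (int j = 0; j < n; ++j) {
#pragma HLS PIPELINE II=1
        float acc = 0.0f;
        for (int i = TAPS - 1; i > 0; --i) {   // shift and accumulate
            shift[i] = shift[i - 1];
            acc += shift[i];
        }
        shift[0] = in[j];
        acc += in[j];
        out[j] = acc / TAPS;                   // moving average
    }
}
```

`PIPELINE II=1` asks for one new sample per clock cycle, and `ARRAY_PARTITION` keeps the delay line in registers so all taps are read in parallel — two of the optimizations typically covered in such a lab.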
12:30 - 14:00
Lunch
14:00 - 15:30
Hands-on AMD-Xilinx Vitis HLS
William Duluc
15:30 - 16:00
Break
16:00 - 17:30
Hands-on AMD-Xilinx Vitis HLS
William Duluc
Wednesday 6 July 2022
09:00 - 09:30
Dataflow code generation for FPGA
Mickaël Dardaillon (INSA Rennes / IETR)
High-level synthesis tools for FPGAs such as Vitis HLS simplify the development of accelerated applications using high-level C language and combining pre-existing kernels. However, the connections of dataflow buffers between these kernels still need to be specified and optimized manually by the developer. In this presentation, we introduce a new method and an associated tool to generate HLS code from a dataflow graph and automatically compute the buffer sizes that reach the highest throughput.
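The talk's tool sizes the FIFO buffers between kernels automatically; as a toy version of the arithmetic involved (not the talk's actual method), consider a single synchronous-dataflow edge where the producer writes `p` tokens per firing and the consumer reads `c`. A classical bound on the buffer size admitting a valid periodic schedule is p + c - gcd(p, c), and the balance equations fix how many tokens move per period:

```cpp
#include <numeric>

// Classical buffer-size bound for one SDF edge with production rate p
// and consumption rate c.
int min_buffer_tokens(int p, int c) {
    return p + c - std::gcd(p, c);
}

// Per period, the producer fires c/gcd times and the consumer p/gcd
// times, so both sides move lcm(p, c) tokens (the balance equation).
int tokens_per_period(int p, int c) {
    return std::lcm(p, c);
}
```

Real graphs need these constraints solved jointly over all edges, with throughput as the objective — which is what motivates doing it in a tool rather than by hand.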
09:30 - 10:00
Work on the accelerated calculation of electron repulsion integrals on FPGAs using oneAPI
Xin Wu (Paderborn Center for Parallel Computing, Paderborn University)
The calculation of electron repulsion integrals (ERIs) is a major bottleneck in quantum chemistry applications. In this work, the accelerated calculation of ERIs is developed on Intel Stratix 10 GX 2800 FPGAs using oneAPI as the high-level synthesis (HLS) tool. To maximize performance, the arrays for intermediate results are carefully designed to take advantage of the FPGA local memory for parallel data accesses. Via template arguments, multiple kernel variants for the different angular momenta of the inputs are generated, which allows fully unrolling inner loops with recursive dependencies. Our FPGA kernels for ERIs of high angular momenta outperform the libint library on a two-socket CPU compute node by about 4x. A performance model is established to explain the measured FPGA performance.
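The template-argument trick works because a compile-time trip count lets the compiler fully unroll a loop even when each iteration depends on the previous one. A toy analogue (ours, far simpler than the ERI recurrences): a recurrence x_i = a·x_{i-1} + i evaluated over a length fixed as a template parameter:

```cpp
// One variant is instantiated per compile-time length L, mirroring how
// the ERI kernels get one variant per angular momentum class.
template <int L>
double recur(double a) {
    double x = 0.0;
    // Trip count L is known at compile time: an HLS compiler can fully
    // unroll this loop into L chained operations, despite the
    // loop-carried dependency on x.
    for (int i = 1; i <= L; ++i)
        x = a * x + i;
    return x;
}
```

Each instantiation — `recur<1>`, `recur<2>`, `recur<3>`, … — becomes its own fixed-depth datapath, which is exactly what makes the recursive inner loops unrollable.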
10:00 - 10:30
Break
10:30 - 12:30
Introduction to Intel oneAPI
Maurizio Paolini (Intel)
The presentation will provide an overview of the oneAPI initiative and Intel products for heterogeneous computing. It will then cover the conceptual differences between coding for FPGAs and coding for CPUs/GPUs and describe the development flow specific to FPGA platforms. In the second part of the presentation, some basic oneAPI design techniques for FPGA to be used in the hands-on lab will be detailed.
12:30 - 14:00
Lunch
14:00 - 15:30
Hands-on Intel oneAPI
Maurizio Paolini (Intel)
15:30 - 16:00
Break
16:00 - 17:30
Hands-on Intel oneAPI
Maurizio Paolini (Intel)
Thursday 7 July 2022
09:00 - 09:30
Welcome coffee
09:30 - 09:45
Welcome talk
09:45 - 10:30
ESSPER: FPGA Cluster for Research on Reconfigurable HPC with Supercomputer Fugaku
Kentaro Sano (RIKEN)
At RIKEN Center for Computational Science (R-CCS), we have been developing an experimental FPGA cluster named ESSPER (Elastic and Scalable System for high-PErformance Reconfigurable computing), a research platform for reconfigurable HPC. ESSPER is composed of sixteen Intel Stratix 10 SX FPGAs connected to each other by a dedicated 100 Gbps inter-FPGA network. We have developed our own shell (SoC) and its software APIs for the FPGAs, supporting inter-FPGA communication. The FPGA host servers are connected to a 100 Gbps InfiniBand switch, which allows distant servers to access the FPGAs remotely through a software-bridged version of Intel's OPAE FPGA driver, called R-OPAE. Through this InfiniBand network and R-OPAE, ESSPER is connected to the world's fastest supercomputer, Fugaku, deployed at RIKEN, so that from Fugaku we can remotely program bitstreams onto the FPGAs and offload tasks to them. In this talk, I will introduce ESSPER's concept, its hardware and software stack, its programming environment, and the applications under development, as well as our future prospects for reconfigurable HPC.
10:30 - 11:15
Reduced-Precision Acceleration of Radio-Astronomical Imaging on Xilinx FPGAs
Stefano Corda (École Polytechnique Fédérale de Lausanne)
Modern radio telescopes such as the Square Kilometre Array (SKA) produce large volumes of data that need to be processed to obtain high-resolution sky images. This is a complex task that requires computing systems that provide both high performance and high energy efficiency. Hardware accelerators such as GPUs (Graphics Processing Units) and FPGAs (Field Programmable Gate Arrays) can provide these two features and are thus an appealing option for this application. Most HPC (High-Performance Computing) systems operate in double precision (64-bit) or in single precision (32-bit), and radio-astronomical imaging is no exception. With reduced precision computing, smaller data types (e.g., 16-bit) aim at improving energy efficiency and throughput performance in noise-tolerant applications. We demonstrate that reduced precision can also be used to produce high-quality sky images. To this end, we analyze the gridding component (Image-Domain Gridding) of the widely-used WSClean imaging application. Gridding is typically one of the most time-consuming steps in the imaging process and, therefore, an excellent candidate for acceleration. We identify the minimum required exponent and mantissa bits for a custom floating-point data type. Then, we propose the first custom floating-point accelerator on a Xilinx Alveo U50 FPGA using High-Level Synthesis. Our reduced-precision implementation improves the throughput and energy efficiency by respectively 1.84x and 2.03x compared to the single-precision floating-point baseline on the same FPGA. Our solution is also 2.12x faster and 3.46x more energy-efficient than an Intel i9 9900k CPU (Central Processing Unit) and manages to keep up in throughput with an AMD RX 550 GPU.
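Identifying the minimum exponent and mantissa widths, as the talk does for gridding, is often explored first in software by emulating the reduced format. A simple sketch of such an emulation — keeping only `mbits` mantissa bits and leaving the exponent range untouched (names and approach are ours, not the paper's):

```cpp
#include <cmath>

// Round x to a floating-point value with only `mbits` fractional
// mantissa bits, emulating a narrower custom float format.
double round_to_mantissa(double x, int mbits) {
    if (x == 0.0) return 0.0;
    int e;
    double m = std::frexp(x, &e);        // x = m * 2^e, 0.5 <= |m| < 1
    double scale = std::ldexp(1.0, mbits);
    m = std::round(m * scale) / scale;   // keep mbits fractional bits
    return std::ldexp(m, e);             // reassemble the value
}
```

Running the gridding pipeline through such a rounding shim for a sweep of widths is one way to locate the smallest format that still yields acceptable image quality, before committing it to hardware.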
11:15 - 11:45
Break
11:45 - 12:30
Heterogeneous Embedded Multicore Design Graduate Education at ENSTA Paris: 5 Years of Feedback
Omar Hammami (ENSTA Paris)
In this talk we will present five years of feedback on training graduate-level students at ENSTA Paris, the oldest engineering school in France, on heterogeneous embedded multicore design with the Xilinx Zynq SoC. As part of the ROB 307 MPSOC (Multiprocessor System on Chip) course, students are required to design a heterogeneous embedded multicore combining a dual-core hard IP (ARM9), 4 MicroBlaze soft cores, 2 hardware accelerators (neural network, vision, image processing) and an AXI NoC (Network on Chip) on a single Zynq XC7Z020 chip using a ZedBoard. Students are expected to validate their design through actual execution on the ZedBoard, with all IPs running concurrently. This project has been running for the past 5 years, and we will share our experience from this training.
12:30 - 14:00
Lunch
14:00 - 15:00
Compiling circuits with polyhedra
Christophe Alias (INRIA LIP)
Hardware accelerators are unavoidable to improve the performance of computers within a bounded energy budget. In particular, FPGAs allow building dedicated circuits from a gate-level description, enabling a very advanced level of optimization. High-level synthesis (HLS) tools let the programmer target FPGAs without the constraints linked to hardware, compiling a C specification into a circuit. Code optimizations in these tools remain rudimentary (loop unrolling, pipelining, etc.) and are most often the responsibility of the programmer. The polyhedral model, born from research on systolic circuits, offers a powerful tool to optimize compute kernels for HPC. In this seminar, I will show a few interconnections between HLS and the polyhedral model, either as a preprocessing (source-to-source) step or as a synthesis tool (optimizing the circuit using a dataflow intermediate representation). In particular, I will present a dataflow formalism that allows reasoning geometrically about circuit synthesis.
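Polyhedral tools reason about loop nests as sets of integer points and emit transformed schedules automatically. As a hand-written example of one such schedule transformation — loop tiling, with an arbitrary tile size of 32 — applied to a simple dependence-free 2D nest:

```cpp
#include <algorithm>
#include <vector>

// Tiled traversal of an n x n iteration domain: the two outer "tile"
// loops enumerate blocks, the two inner "point" loops enumerate the
// iterations inside each block. Same computation, reordered for
// locality — the kind of rewrite a polyhedral compiler derives itself.
void saxpy2d_tiled(std::vector<float>& y, const std::vector<float>& x,
                   float a, int n) {
    const int B = 32;                                 // tile size
    for (int ii = 0; ii < n; ii += B)                 // tile loops
        for (int jj = 0; jj < n; jj += B)
            for (int i = ii; i < std::min(ii + B, n); ++i)   // point loops
                for (int j = jj; j < std::min(jj + B, n); ++j)
                    y[i * n + j] += a * x[i * n + j];
}
```

The point of the model is that such reorderings, and their legality with respect to data dependences, can be computed on the polyhedra themselves rather than written by hand as here.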
15:00 - 15:45
Toward Speculative Loop Pipelining for High-Level Synthesis
Steven Derrien (IRISA, Université de Rennes 1)
Loop pipelining (LP) is a key optimization in modern high-level synthesis (HLS) tools for synthesizing efficient hardware datapaths. Existing techniques for automatic LP are limited by static analyses that cannot precisely analyze loops with data-dependent control flow and/or memory accesses. We propose a technique for speculative LP that handles both control-flow and memory speculation in a unified manner. Our approach is entirely expressed at the source level, allowing seamless integration into development flows using HLS. Our evaluation shows significant improvements in throughput over standard loop pipelining techniques.
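The difficulty is loops whose exit depends on computed data, such as `while (x > 1) x = step(x);`. A toy software model of the execute-then-rollback idea — not the talk's actual transformation — speculatively runs a fixed window of W iterations (which a pipeline can overlap), then rolls back to a snapshot and re-executes precisely if the exit condition was crossed inside the window:

```cpp
// Data-dependent step function (Collatz), used as the loop body.
static inline long step(long x) { return x % 2 ? 3 * x + 1 : x / 2; }

// Reference: plain sequential loop with a data-dependent exit.
int collatz_steps(long x) {
    int c = 0;
    while (x > 1) { x = step(x); ++c; }
    return c;
}

// Speculative version: commit W iterations at a time; on
// misspeculation (exit crossed inside the window), restore the
// snapshot and finish sequentially.
int collatz_steps_speculative(long x0) {
    const int W = 4;                     // speculation window
    long x = x0;
    int count = 0;
    while (x > 1) {
        long snap = x;                   // snapshot for rollback
        bool misspeculated = false;
        for (int i = 0; i < W; ++i) {    // run the window blindly
            x = step(x);
            if (x <= 1 && i != W - 1) misspeculated = true;
        }
        if (misspeculated || x <= 1) {
            x = snap;                    // roll back, re-execute exactly
            while (x > 1) { x = step(x); ++count; }
        } else {
            count += W;                  // speculation succeeded: commit
        }
    }
    return count;
}
```

In hardware the payoff is that the W in-flight iterations keep the pipeline full; the rollback path only fires on the rare iterations where the speculation was wrong.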
15:45 - 16:15
Break
16:15 - 16:45
Using Unified Shared Memory and External Function Interface with oneAPI
Suleyman Demirsoy (Intel)
The Unified Shared Memory (USM) abstraction offers significant ease of use and, in some cases, performance benefits when critical functions are offloaded to an accelerator such as an FPGA. Some of these critical functions would also benefit from lower-level customization that is possible at the RTL level but not easy to capture in oneAPI code. In this talk, we will look more closely into both topics as a follow-up to the main oneAPI introduction presented earlier in the conference.
16:45 - 17:45
Round table