# MLIR Compiler Infrastructure Beyond Machine Learning Alex Zinenko (Brium Inc.) April 28, 2025 ### compiler /kəmˈpʌɪlə/ noun a program that converts instructions into a machine-code or lower-level form so that they can be read and executed by a computer. "conversion would require more than just running it through a different compiler" **LLVM** ### What Is MLIR #### MLIR /əm əl vı aːr/ acronym Multi-Level Intermediate Representation. A unifying software framework for compiler development. "everybody thought ML for MLIR stood for Machine Learning" ``` %results:2 = "d.operation"(%arg0, %arg1) ({ Regions belong to Ops and can have multiple blocks. Region ^block(%argument: !d.type): Block %value = "nested.operation"() ({ // Ops can contain nested regions. Region "d.op"() : () -> () }) : () -> (!d.other_type) "consume.value"(%value) : (!d.other_type) -> () ^other block: "d.terminator"() [^block(%argument : !d.type)] : () }) : () -> (!d.type, !d.other type) ``` ## Little Builtin, Everything Customizable #### No fixed set of: - Operations - Attributes - Types ### Bring your own anything: - As long as you define and verify semantics - Group into "dialects" ``` #include <math.h> define double @foo(double noundef %0) { %2 = tail call double @llvm.exp.f64(double %0) %3 = tail call double @llvm.log.f64(double %2) return log(exp(x)); } return return define double @foo(double noundef %0) { %2 = tail call double @llvm.log.f64(double %2) ret double %3 } ``` ``` #include <math.h> define double @foo(double noundef %0) { %2 = tail call double @llvm.exp.f64(double %0) %3 = tail call double @llvm.log.f64(double %2) return log(exp(x)); } return return define double @foo(double noundef %0) { %2 = tail call double @llvm.log.f64(double %2) ret double %3 } ``` $log e^x = x ???$ ``` #include <math.h> define double @foo(double noundef %0) { ret double %0 } return log(exp(x)); } ``` clang -S -emit-llvm -ffast-math -03 LLVM IR has intrinsics for e<sup>x</sup> and log, but not for expm1 and log1p ``` #include <math.h> double foo(double x) { return log1p(expm1(x)); } log (1 + x) ``` ``` call expm1 call log1p return ``` ``` define double @foo(double noundef %0) { %2 = tail call fast double @expm1(double %0) %3 = tail call fast double @log1p(double %2) ret double %3 } ``` $$log (1 + e^{x}-1) = log e^{x} = x ???$$ clang -S -emit-llvm -ffast-math -03 48 dialects "upstream" (TOSA) Dialect Transform Dialect 48 dialects 48 dialects 48 dialects 48 dialects 48 dialects "upstream" (TOSA) Dialect Transform Dialect 48 dialects 'acc' Dialect 'affine' Dialect 'arith' Dialect 'amdgpu' Dialect 'arm\_sve' Dialect 'ArmSME' Dialect 'bufferization' Dialect 'complex' Dialect 'diti' Dialect 'irdl' Dialect "llum" Dialect 'mesh' Dialect 'mni' Dialect 'nvgpu' Dialect 'omp' Dialect 'pdl\_interp' Dialect 'nohromial' Dialect 'guant' Dialect 'scf' Dialect 'shape' Dialec 'univ' Dialect 'vector' Dialect 'xegpu' Dialect **Builtin Dialect** 'x86vector' Dialect OpInterface definitions Tensor Operator Set Architecture (TOSA) Dielect Transform Dialect 'sparse tensor' Dialect 'tensor' Dialec 48 dialects "upstream" 100s "downstream" #### Users of MLIR In alphabetical order below Accera is a compiler that enables you to experiment with loop optimizations without hand-writing Assembly code. With Accera, these problems and impediments can be addressed in an optimized way. It is available as a Python library and supports cross-I Beaver is an MUR frontend in Elixir Google for compling programs that features, Beaver provides a simple, computations to be performed direct. Nod Distributed Runtime: Asynchronous fine-grained op-level a dynamic quantum program: a dynamic quantum program: specifically focused on games and re See also this paper: Compiling ONNX Neural Network Models Using MLIB. Concrete is an open-source framey makes writing FHE programs easy! Technology Enzyme: General Auts Institution, provide support functions segregations segregations segregation of the Enzyme (specifically EnzymeMUR) MLIR-AIE: Toolchain fo Substrait MLIR Firefly: A new compile MLIR-DaCe: Data-Cen converter, quantization, ...). A compiler for Erlang to native representations. By bridging these is territir is a compiler project almed at defining MLR dialects to abstract compute on Tenstorrent AI An Fringer juntime, implement centric optimizations in optimization and access to a fact that the primary motivator for Friendy's cidelect in MLR to compiler infrastructure and target TTNN. The primary motivator for Friendy's cidelect in MLR to compete the MLR of the MLR compiler infrastructure and target TTNN. The primary motivator for Friendy's cidelect in MLR to compete the MLR of the MLR compiler infrastructure and target TTNN. The primary motivator for Friendy's cidelect in MLR to compete the MLR of the MLR compiler infrastructure and target TTNN. The primary motivator for Friendy's cidelect in MLR to observe the MLR of the MLR compiler infrastructure and target TTNN. The primary motivator for Friendy's cidelect in MLR to observe the MLR of the MLR compiler infrastructure and target TTNN. The primary motivator for Friendy's cidelect in MLR to observe the MLR of the MLR compiler infrastructure and target TTNN. The primary motivator for Friendy's cidelect in MLR to observe the MLR of the MLR compiler infrastructure and target TTNN. The primary motivator for Friendy's cidelect in MLR to observe the MLR of The Torchecosystem. compiler is modeled using MLIR. morphic Encryption Intermediate Representation) is an MLIR-based toolchain developed by Feature, Sever provides a simple, computations to be performed orest of the several provides a simple, computations to be performed orest of the several provides a simple, computation of Multi-Balding upon the foundation Multi-Bal **ONNX-MLIR** verification domain, and has been a "IEEE" Catalysts Service and integration into the Pytho and integration must be performed united. See also ones with the Light Catalyst also comes with the Light The Kokkos C++ Performance Portal: A community-driven, open source M. compiler ecosystem, using the best of XLA & MLR. backed system that is constantly. If the Koskos C++ remnance route. A numerary unexp. soon. Given and QPUs. Concrete: TFHE Com There is current unit ongoing to con to MLR to support titled and distribut equivalent Lingo DB: Revolutioniz Revolutio Polygeist: C/C++ frontend and optimizations for MLIR Polygeist is a CIC++ frontend for MLIR which preserves high-level structure from programs such as parallelism. Intellige writing Price (project week) which were projected as a CV-A+ recovered the Week which preserves in projects so stroke the project as a CV-A+ recovered the Week which preserves in projects as writing a company of the project as a contract of MLR, as well as various ranking/lowering utilities, capability ensures user privacy and unprecedented flexibility and extends. See both the polyherial proper (SV-A) projects (SW-A) which were projected as a contract extends to projected and projected as a contract extends to projected and projected as a contract extends to the CVA Projected and Proj Collecting values services and indicates any another MARCO: Modelica Adv RISE DSP-MLIR is a framework designed. MRCO is a prototype compiler fort RISE is a spiritual successor to the LIII project. "a high-level functional data parallel language with a system of and rewrite patterns that detect DS simulation of large-scale models. The "enterthe rules which encode algorithmic and hardware-specific optimisation choices." and returns parties on the Outcome of State (State State Sta Types implement or inheret general MLR-AIE is a toolchain providing low Substrait MLR is an input/output dialect for Substrait, the cross-language serialization format of database provide efficient forward and revery provided to target the AllEngian portic project which uses Enzyme to diffe ShimDMA blocks. Backend code gen level abstractions enabling higher-lev MLIR is used as a Graph Transformation framework and the foundation for building many tools DKIA, TFLite Firefly is not only a compiler, but a 1 MLIR-DaCe is a project aiming to bric Tenstorrent MLIR Compiler TRT: TensorFlow Runtime With service and the patterns as well. by 9 MiR-EmitC provides a way to trans; tran Flang translate Kersa and Tensor Flow mode Torch-MLIR Flang is a ground-up implementatic The latter is used to generate calls to The Torch-MLIR project aims to provide first class compiler support from the PyTorch ecosystem to the MLIR friton is a language and compiler for writing highly efficient custom Deep-Learning primitives. The aim Mojo is a new programming language trition is to provide an open-source environment to write fast code at higher productivity than CUDA, but also best of Python syntax with systems; with higher flexibility than other existing CSLs. aims to be a strict superset of Pythor immediately for long-tail ecosystem ( VAST: C/C++ frontend for MLIR VAST is a library for program analysis and instrumentation of C/C++ and related languages. VAST provides a foundation for customizable program representation for a broad spectrum of analyses. Using the MLIR infrastructure, VAST provides a toolset to represent C/C++ program at various stages of the compilation and to transform the representation to the best-fit program abstrac Project Verona is a research programming language to explore the concept of concurrent ownership. They are https://mlir.llvm.org/users/ ### How to Handle Generality: Traits ``` Transformations reason about traits, not individual operations void transformation(Operation *op) { if (op->hasTrait<Associative>()) { Value operand = op->getOperand(0); op->setOperand(0, op->getOperand(1)); op->setOperand(1, operand); } } ``` ### How to Handle Generality: Interfaces ``` Ifaces: [ConditionallyAssociative, ... bool AddFOp::isAssociative() "override" { Conditionally Commucative] return getFastMathAttr().isAllowReassoc(); Traits: [Associative, Commutative, Pure, ... SameOperandAndResultType] %3 = arith.addf %1, %2 %4 = matrix.multiply %5, %6 Traits: [Associative, Pure, ... LoopDecomposable] Ifaces: [AntiCommutative] _ → Value MatrixMultiplyOp::buildInverse(OpBuilder &b, ...) "override" { return b.create<MatrixInverseOp>(...).getResult(); ``` # Why Bother? # Why Bother? Climate Model Domain Compilers ### Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation TOBIAS GYSI and CHRISTOPH MÜLLER, ETH Zurich, Switzerland OLEKSANDR ZINENKO, Google, France STEPHAN HERHUT, Google, Germany EDDIE DAVIS, TOBIAS WICKY, and OLIVER FUHRER, Vulcan Inc, USA TORSTEN HOFFLER. ETH ZURICh. Switzerland TOBIAS GROSSER, University of Edinburgh, UK Most compilers have a single core intermediate representation (IR) (e.g., LLVM) sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level-by-level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific diadates based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM's extensible MLIR compiler infrastructure suffice to underform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure. CCS Concepts: • Software and its engineering $\rightarrow$ Compilers; Domain specific languages; • Applied computing $\rightarrow$ Earth and atmospheric sciences; Additional Key Words and Phrases: Weather and climate, stencil computations, intermediate representations Tobias Gysi also with Google. Tobias Grosser also with ETH Zurich. The work done at ETH Zurich has received funding from the European Research Council (ERC) under the European Union's Horizon 2029 programme (grant agreement DAPP, No. 67880), the swiss National Science Foundation under the Ambizione programme (grant PZ00PZ168016), and ARM Holdings ple and Xilinx Inc in the context of Folly Labs. Authors' addresses: T. Gysi, C. Müller, and T. Horleft, ETH Zurich, Switzerland, emails gysit@google.com, Authors' addresses: T. Oysi, C. Müller, and T. Hoefler, ETH Zurich, Switzerland, emails: gyst@google.com, christoph.medig: https://discrete.htmlepiinfeth.ptm./ O.Zinenko, Google, France, email: meinko@google.com; S. Herbutt, Google, Germany; email: herbutt@google.com; E. Davis, T. Wicky, and O. Fubrer, Vulcan Inc. USA; emails: (Eddiel), TobiasW, OliverFj@ vulcan.com; T. Gerseer, University to Edinburght, USA; email: tobias grosser@ed.ac.uk. ### (cc) ① This work is licensed under a Creative Commons Attribution International 4.0 License © 2021 Copyright held by the owner/author(s) 1544-3566/2021/09-ART51 \$15.00 https://doi.org/10.1145/3469030 ACM Transactions on Architecture and Code Optimization, Vol. 18, No. 4, Article 51. Publication date: September 2021. https://dl.acm.org/doi/pdf/10.1145/3469030 ### Why Bother? Climate Model Domain Compilers ### Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation TOBIAS GYSI and CHRISTOPH MÜLLER, ETH Zurich, Switzerland OLEKSANDR ZINENKO, Google, France STEPHAN HERHUT, Google, Germany EDDIE DAVIS, TOBIAS WICKY, and OLIVER FUHRER, Vulcan Inc, USA TORSTEN HOEFLER, ETH Zurich, Switzerland TOBIAS GROSSER, University of Edinburgh, UK Most compilers have a single core intermediate representation (IR) (e.g., LLVM) sometimes complemented with vaguody defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level-by-level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific diadest based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM's extensible MLIR compiler infrastructure suffice to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure. CCS Concepts: • Software and its engineering $\rightarrow$ Compilers; Domain specific languages; • Applied computing $\rightarrow$ Earth and atmospheric sciences; Additional Key Words and Phrases: Weather and climate, stencil computations, intermediate representations Tobias Gysi also with Google. Tobias Grosser also with ETH Zurich. The work done at ETH Zurich has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 programme (grant agreement DAPP, No. 678880), the Swiss National Science Foundation under the Ambitisone programme (grant p2020/1869016) and Rad Mildolings to lear Oxilization of the Contest of Polly Lorent Council Contest of Polly Lorent Council Authori addresses T. Gyst, C. Müller, and T. Hoefler, ETH Zurich, Switzerland; emails: gysi@google.com; Christoph.mueller, bard@google.com; C. O. Zimenko. Google, France; email: simenbo@google.com; S. Herhutt, Google, Ceremany; email: herhut@google.com; E. Davis, T. Wicky, and O. Fuberz, Vulcan Inc. USA; emails: @ddleD, TobiasW, OliverPj@vulcan.com; T. Grosser, University of Ediburdyp, UK; email: lobas; goosser@ed.a.cu.k ### (cc) ① This work is licensed under a Creative Commons Attribution International 4.0 License © 2021 Copyright held by the owner/author(s) 1544-3566/2021/09-ART51 \$15.00 https://doi.org/10.1145/3469030 $ACM\ Transactions\ on\ Architecture\ and\ Code\ Optimization,\ Vol.\ 18,\ No.\ 4,\ Article\ 51.\ Publication\ date:\ September\ 2021.$ https://dl.acm.org/doi/pdf/10.1145/3469030 ## Why Bother? Climate Model Domain Compilers ### Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation TOBIAS GYSI and CHRISTOPH MÜLLER, ETH Zurich, Switzerland OLEKSANDR ZINENKO, Google, France STEPHAN HERHUT, Google, Germany EDDIE DAVIS, TOBIAS WICKY, and OLIVER FUHRER, Vulcan Inc, USA TORSTEN HOEFLER, ETH Zurich, Switzerland TOBIAS GROSSER, University of Bidinburgh, UK Most compilers have a single core intermediate representation (JR) (e.g., LLVM) sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (JRs), lowers programs level-by-level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and deeign stencil- and GPU-specific diadest based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM's extensible MLIR compiler infrastructure sufflee to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure. CCS Concepts: • Software and its engineering $\rightarrow$ Compilers; Domain specific languages; • Applied computing $\rightarrow$ Earth and atmospheric sciences; Additional Key Words and Phrases: Weather and climate, stencil computations, intermediate representations Tobias Gysi also with Google. Tobias Grosser also with ETH Zurich. The work done at ETH Zurich has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 programme (grant agreement DAPP, No. 678880), the Swiss National Science Foundation under the Ambistone programme (grant PZ00P2168016), and ARM Holdings ple and Xilinx line in the context of Polly Labs. Authori addresses T. Oysi, C. Müller, and T. Hoefler, ETH Zurich, Switzerland, emails gysit@google.com/ (christoph.mueller, thor)@inidefluch.co. Z. Damelo, Google, France, email: Emico@google.com, S. Herbutt, Google, Germany, email: herbut@google.com, E. Davis, T. Wicky, and O. Fubrer, Vulcan Inc, USA; emails [EddieD, TobiasW, OliverFj@ vulcan.com. T. Grosser, University of Edilburgh, UK; email: tobias grosser/edd ac uk. ### © **①** This work is licensed under a Creative Commons Attribution International 4.0 License © 2021 Copyright held by the owner/author(s) 1544-3566/2021/09-ART51 \$15.00 https://doi.org/10.1145/3469030 ACM Transactions on Architecture and Code Optimization, Vol. 18, No. 4, Article 51. Publication date: September 2021. Over STELLA (COSMO) Over Dawn (FV3) On V100-SXM2 https://dl.acm.org/doi/pdf/10.1145/3469030 # Why Bother? CFD Domain Optimization ### Code Generation for In-Place Stencils Mohamed Essadki ONERA Chatillon France mohamed.essadki@onera.f zinenko@google.con Oleksandr Zinenko Paris, France Bertrand Michel ONERA Chatillon France bertrand.michel@onera.fr Nicolae Vasilache Zürich, Switzerland Bruno Maugars Chatillon France bruno.maugars@onera.fr Albert Cohen Google Paris, France albertcohen@google.con ### Numerical simulation often resorts to iterative in place sten- cile such as the Gauss-Saidel or Succession Operalayation (SOR) methods. Writing high performance implementations of such stencils requires significant effort and time; it also involves non-local transformations beyond the stencil kernel itself. While automated code generation is a mature technolone for image processing stencils, convolutions and out-ofplace iterative stencils (such as the Jacobi method), the optimization of in-place stencils requires manual craftsmanship. Building on recent advances in tensor compiler construction, we propose the first domain-specific code generator for iterative in-place stencils. Starting from a generic tensor compiler implemented in the MLIR framework, tensor abstractions are incrementally refined and lowered down to parallel, tiled, fused and vectorized code. We used our generator to implement a realistic, implicit solver for structured meshes, and demonstrate results competitive with an industrial computational fluid dynamics framework. We also compare with stand-alone stencil kernels for dense tensors. ### CCS Concepts: • Software and its engineering → Compilers; • Theory of computation → Parallel computing models; • Applied computing → Physical sciences and en- Keywords: Computational Fluid Dynamics. Implicit Methods, Gauss-Seidel, SOR, Iterative In-place Stencils, Domain-Specific Code Generation, MLIR, Vectorization, Tiling. ACM Reference Format: ### Mohamed Essadki. Bertrand Michel. Bruno Maurars. Oleksandr Zinenko, Nicolas Vasilache, and Albert Cohen. 2023. Code Generation for In-Place Stencils. In Proceedings of the 21st ACM/IEEE Interna- iii. 2023 Commisht held by the owner(author(s) https://doi.org/10.1145/3579990.3580036 tional Symposium on Code Generation and Optimization (CGO '23), Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are net made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for thirdparty components of this work must be honored. For all other uses, contact CGO '23. February 25 - March 1, 2023, Montréal, OC, Canada February 25 - March 1, 2023, Montréal, QC, Canada. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3579990.3580006 We are interested in the parallelization and optimization of Computational Fluid Dynamics (CFD) applications, and more specifically implicit finite-volume numerical methods to solve differential equations. This consists in discretizing the space domain into small cells representing the conservative fields of the simulation (mass density, momentum energy, etc., where the volume value of each field is averaged over a given cell). Then, at every step of the simulation a solver based on the implicit method (for faster convergence and scalability) proceeds by rewriting the differential equations in the form of a large and sparse linear system $A \cdot x = b$ $$A \cdot x = b$$ (1) where A is a square matrix of size $m \times m$ , x and b are two - column vectors of the same size m, and v contains the numer ical solution of the physical fields. An implicit CFD solver can typically be split in two main phases: 1. first compute the vector b, iterating over the faces of - the cells to compute a numerical flux [34, 36] which can be considered as a function of the two solutions in adjacent cells separated by a common face; - 2. then, rather than explicitly updating the fields in the cell, solve the linear system using an iterative method like Successive Overrelaxation (SOR), a variant of the Gauss-Seidel method [12, 42]. It remains an open problem to design and implement a domain-specific code generator for implicit finite-volume solvers using state-of-the-art methods like SOR. Unlike the Jacobi iterative method and all stencil codes occurring in image processing and neural networks, SOR is an in-place stencil computation, carrying an internal data dependence over the space domain. Because of these internal dependences, parallelization and vectorization of in-place stencils require a wavefront schedule. This incurs additional control flow and indexing overheads and higher complexity in model ing locality-enhancing transformations such as tiling (cache blocking) and fusion. It is important to ontimize such inplace stencils, since in typical scenarios Gauss-Seidel and # Why Bother? CFD Domain Optimization ### Code Generation for In-Place Stencils Bertrand Michel Mohamed Essadki ONERA Chatillon France mohamed.essadki@onera.f Oleksandr Zinenko Paris, France zinenko@google.com ONERA Chatillon France bertrand.michel@onera.fr Nicolae Vasilache Zürich, Switzerland Bruno Maugars Chatillon France bruno.maugars@onera.fr Albert Cohen Google Paris, France albertcohen@google.com Numerical simulation often resorts to iterative in-place stencile such as the Gauss-Saidel or Succession Operalayation (SOR) methods. Writing high performance implementations of such stencils requires significant effort and time; it also involves non-local transformations beyond the stencil kernel itself. While automated code generation is a mature technolone for image processing stencils, convolutions and out-ofplace iterative stencils (such as the Jacobi method), the optimization of in-place stencils requires manual craftsmanshin Building on recent advances in tensor compiler construction, we propose the first domain-specific code generator for iterative in-place stencils. Starting from a generic tensor compiler implemented in the MLIR framework, tensor abstractions are incrementally refined and lowered down to parallel, tiled, fused and vectorized code. We used our generator to imple ment a realistic, implicit solver for structured meshes, and demonstrate results competitive with an industrial computational fluid dynamics framework. We also compare with stand-alone stencil kernels for dense tensors ### CCS Concepts: • Software and its engineering → Compilers; • Theory of computation → Parallel computing models; • Applied computing → Physical sciences and en- Keywords: Computational Fluid Dynamics. Implicit Methods. Gauss-Seidel. SOR. Iterative In-place Stencils. Domain-Specific Code Generation, MLIR, Vectorization, Tiling. Mohamed Essadki. Bertrand Michel. Bruno Maurars. Oleksandr Zi- nenko. Nicolas Vasilache, and Albert Cohen. 2023. Code Generation for In-Place Stencils. In Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization (CGO '23), Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are net made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for thirdty components of this work must be honored. For all other uses, contact CGO '23. February 25 - March 1, 2023, Montréal, OC, Canada iii. 2023 Commisht held by the owner(author(s) https://doi.org/10.1145/3579990.3580036 February 25 - March 1, 2023, Montréal, QC, Canada. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3579990.3580006 We are interested in the parallelization and optimization of Computational Fluid Dynamics (CFD) applications, and more specifically implicit finite-volume numerical methods to solve differential equations. This consists in discretizing the space domain into small cells representing the conservative fields of the simulation (mass density, momentum energy, etc., where the volume value of each field is aver aged over a given cell). Then, at every step of the simulation a solver based on the implicit method (for faster convergence and scalability) proceeds by rewriting the differential equations in the form of a large and sparse linear system $$A \cdot x = b$$ - where A is a square matrix of size $m \times m$ , x and b are two column vectors of the same size m, and v contains the numer ical solution of the physical fields. An implicit CFD solver can typically be split in two main phases: 1. first compute the vector b, iterating over the faces of - the cells to compute a numerical flux [34, 36] which can be considered as a function of the two solutions in adjacent cells separated by a common face; - 2. then, rather than explicitly updating the fields in the cell, solve the linear system using an iterative method like Successive Overrelaxation (SOR), a variant of the Gauss-Seidel method [12, 42]. It remains an open problem to design and implement a domain-specific code generator for implicit finite-volume solvers using state-of-the-art methods like SOR. Unlike the Jacobi iterative method and all stencil codes occurring in image processing and neural networks, SOR is an in-place stencil computation, carrying an internal data dependence over the space domain. Because of these internal dependences, parallelization and vectorization of in-place stencils require a wavefront schedule. This incurs additional control flow and indexing overheads and higher complexity in model ing locality-enhancing transformations such as tiling (cache blocking) and fusion. It is important to ontimize such inplace stencils, since in typical scenarios Gauss-Seidel and Speedup over Pluto compiler on 2x Intel Xeon 6152 CPUs (NUMA) https://research.google/pubs/code-generation-for-data-dependent-stencils/ # Why Bother? CFD Domain Optimization ### Code Generation for In-Place Stencils Bertrand Michel Mohamed Essadki ONERA Chatillon France mohamed.essadki@onera.f Paris, France Oleksandr Zinenko zinenko@google.com ONERA Chatillon France bertrand.michel@onera.fr Nicolae Vasilache Zürich, Switzerland Bruno Maugars Chatillon France bruno.maugars@onera.fr Albert Cohen Google Paris, France albertcohen@google.com Numerical simulation often resorts to iterative in-place stencile such as the Gauss-Saidel or Succession Operalayation (SOR) methods. Writing high performance implementations of such stencils requires significant effort and time; it also involves non-local transformations beyond the stencil kernel itself. While automated code generation is a mature technolone for image processing stencils, convolutions and out-ofplace iterative stencils (such as the Jacobi method), the optimization of in-place stencils requires manual craftsmanship. Building on recent advances in tensor compiler construction, we propose the first domain-specific code generator for iterative in-place stencils. Starting from a generic tensor compiler implemented in the MLIR framework, tensor abstractions are incrementally refined and lowered down to parallel, tiled, fused and vectorized code. We used our generator to implement a realistic, implicit solver for structured meshes, and demonstrate results competitive with an industrial computational fluid dynamics framework. We also compare with stand-alone stencil kernels for dense tensors. ### CCS Concepts: • Software and its engineering → Compilers; • Theory of computation → Parallel computing models; • Applied computing → Physical sciences and en- Keywords: Computational Fluid Dynamics. Implicit Methods, Gauss-Seidel, SOR, Iterative In-place Stencils, Domain-Specific Code Generation, MLIR, Vectorization, Tiling. ACM Reference Format Mohamed Essadki. Bertrand Michel. Bruno Maurars. Oleksandr Zi- nenko. Nicolas Vasilache, and Albert Cohen. 2023. Code Generation for In-Place Stencils. In Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization (CGO '23), Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are net made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for thirdurty components of this work must be honored. For all other uses, contact CGO '23. February 25 - March 1, 2023, Montréal, OC, Canada iii. 2023 Commisht held by the owner(author(s) https://doi.org/10.1145/3579990.3580036 February 25 - March 1, 2023, Montréal, QC, Canada. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3579990.3580006 We are interested in the parallelization and optimization of Computational Fluid Dynamics (CFD) applications, and more specifically implicit finite-volume numerical methods to solve differential equations. This consists in discretizing the space domain into small cells representing the conservative fields of the simulation (mass density, momentum energy, etc., where the volume value of each field is aver aged over a given cell). Then, at every step of the simulation a solver based on the implicit method (for faster convergence and scalability) proceeds by rewriting the differential equations in the form of a large and sparse linear system $$A \cdot x = b$$ - where A is a square matrix of size $m \times m$ , x and b are two column vectors of the same size m, and v contains the numer ical solution of the physical fields. An implicit CFD solver can typically be split in two main phases: - 1. first compute the vector b, iterating over the faces of the cells to compute a numerical flux [34, 36] which can be considered as a function of the two solutions in adjacent cells separated by a common face; - 2. then, rather than explicitly updating the fields in the cell, solve the linear system using an iterative method like Successive Overrelaxation (SOR), a variant of the Gauss-Seidel method [12, 42]. It remains an open problem to design and implement a domain-specific code generator for implicit finite-volume solvers using state-of-the-art methods like SOR. Unlike the Jacobi iterative method and all stencil codes occurring in image processing and neural networks, SOR is an in-place stencil computation, carrying an internal data dependence over the space domain. Because of these internal dependences, parallelization and vectorization of in-place stencils require a wavefront schedule. This incurs additional control flow and indexing overheads and higher complexity in model ing locality-enhancing transformations such as tiling (cache blocking) and fusion. It is important to ontimize such inplace stencils, since in typical scenarios Gauss-Seidel and Comparable to elsA (ONERA) Keeps using memory instead of MPI https://research.google/pubs/code-generation-for-data-dependent-stencils/ ``` * textbook material ``` ``` for i in range(N): for j in range(M): for k in range(P): C[i][j] += A[i][k] + B[k][j] ``` $\{(i,j,k): i,j,k \in \mathbb{Z}, \ 0 \leq i < N \ \land \ 0 \leq j < M \ \land \ 0 \leq k < P\}$ \* textbook material ``` for i in range(N): for j in range(M): for k in range(P): C[i][j] += A[i][k] + B[k][j] ``` $\{(i,j,k): i,j,k \in \mathbb{Z}, \ 0 \leq i < N \ \land \ 0 \leq j < M \ \land \ 0 \leq k < P\}$ ### Polygeist: Raising C to Polyhedral MLIR William S. Moses Cambridge, MA, USA wmoses@mit.edu Eindhoven, The Netherlands Lebelini@tue.nl Ruizhe Zhao\* Imperial College London London, UK ruizhe.zhao15@imperial.ac.uk zinenko@google.com Oleksandr Zinenko Google Inc. Paris, France Abstract-We present Polygeist, a new compilation flow that connects the MLIR compiler infrastructure to cutting edge polyhedral optimization tools. It consists of a C and C++ frontend capable of converting a broad range of existing codes into MLIR suitable for polyhedral transformation and a bi-directional conversion between MLIR and OpenScop exchange format. The Polygisit/MLIR intermediate representation featuring high-level (affine) loop constructs and n-D arrays embedded into a single static assignment (SSA) substrate enables an unprecedented static assignment (SSA) substrate enables an unprecedented combination of SSA-based and polyhedral optimizations. We illustrate this by proposing and implementing two extra transfor-mations: statement splitting and reduction parallelization. Our evaluation demonstrates that Polyerist outperforms on average both an LLVM IR-level optimizer (Polly) and a source-to-source state-of-the-art polyhedral compiler (Pluto) when exercised on the Polybench/C benchmark suite in sequential (2.53x vs 1.41x, 2.34x) and parallel mode (9.47x vs 3.26x, 7.54x) thanks to the ### I. INTRODUCTION Improving the efficiency of computation has always been one of the prime goals of computing. Program performance can be improved significantly by reaping the benefits of parallelism temporal and spatial locality and other performance sources. Relevant program transformations are particularly tedious and challenging when targeting modern multicore CPUs and GPUs with deep memory hierarchies and parallelism, and are often performed automatically by optimizing compilers. The polyhedral model enables precise analyses and a relatively easy specification of transformations (loop restructuring, automatic rarallelization, etc.) that take advantage of hardware performance sources. As a result, there is growing evidence that the polyhedral model is one of the best frameworks for not backed by memory storage. efficient transformation of compute-intensive programs [I], [2] [3] and for programming accelerator architectures [4] [5], [6]. Consequently, the compiler community has focused on building tools that identify and optimize parts of the fine-grained control of compiler optimization provided by lowprogram that can be represented within the polyhedral model level IRs. It builds on the recent MLIR compiler infrastructure tools tend to fall into two categorie ders the tools' ability to perform analyses and transformations and transform SCoPs in compiler intermediate representations level constructs such as loops, parallel and reduction par (IRs). While this offers seamless integration with rest of the terms; low-level representations fully covering LLVM IR [16]: compiler, the lack of high-level structure and information hin- rig. 1. The Polygon computation now common of 4 stages, the interests traverses Clang AST to emit MLIR SCF dialect (Section III-A), which is raised to the Affine dialect and pre-optimized (Section III-B). The IR is then processed by a polyhedral scheduler (Sections III-C III-D) before post- imperfectly or at a significant cost [9]. Moreover, common compiler optimizations such as LICM may interfere with the process [10]. Finally, low-level IRs often lack constructs for, e.g., parallelism or reductions, produced by the transformation which makes the flow more complex Source-to-source compilers such as Pluto [11], POCC [12] and PPCG [5] operate directly on C or C++ code. While this lack of enabling optimizations such as those converting hazardous memory loads into single-assignment virtual registers. Furthermore, the transformation results must be expressed in C. which is known to be complex [13]. [14] and is also missing constructs for, e.g., reduction loops or register values This paper proposes and evaluates the benefits of a polyhe dral compilation flow. Polygeist (Figure II), that can leverage both the high-level structure available in source code and the amonly referred to as static-control parts, or SCoP's). Such that allows the interplay of multiple abstraction levels within the same representation, during the same transformations [15] Compiler-based tools like Polly [7] and Graphite [8] detect Intermixable MLIR abstractions, or dialects, include high memory accesses annotated with affine expressions. Moreover, This structure needs to be recovered from optimized IR, often by combining the best of source-level and IR-level tools in an end-to-end polyhedral flow. Polygeist preserves high-level information and leverages them to perform new or improved Connecting C and C++, and any MLIR loops, to pre-existing Polyhedral optimization tools ### Polygeist: Raising C to Polyhedral MLIR Cambridge, MA, USA wmoses@mit.edu Findhoven. The Netherlands Lebelini@tue.nl Ruizhe Zhao\* Imperial College London London, UK ruizhe.zhao15@imperial.ac.uk zinenko@google.com Oleksandr Zinenko Google Inc. Paris, France Abstract-We present Polygeist, a new compilation flow that connects the MLIR compiler infrastructure to cutting edge polyhedral optimization tools. It consists of a C and C++ frontend capable of converting a broad range of existing codes into MLIR suitable for polyhedral transformation and a bi-directional MLIR uitable for polyherlar transformation and a bi-direction and a bi-direction allowers on between MLIR and OpenScyo exchange format. The Polypeit/MLIR intermediate representation featuring high-level (affine) loop constructs and a-D arrays embedded into a single static assignment (SSA) substrate enables an unprecedual combination of SSA-based and polyhedral optimizations. We illustrate this by proposing and implementing two extra transformations: statement splitting and reduction parallelization. Our matter of the proposing and implementing two extra transformations: statement splitting and reduction parallelization. Our evaluation demonstrates that Polyeeist outperforms on average both an LLVM IR-level optimizer (Polly) and a source-to-source state-of-the-art polyhedral compiler (Pluto) when exercised on the Polybench/C benchmark suite in sequential (2.53x vs 1.41x, 2.34x) and parallel mode (9.47x vs 3.26x, 7.54x) thanks to the ### I. INTRODUCTION one of the prime goals of computing. Program performance e.g., parallelism or reductions, produced by the transformation, can be improved significantly by reaping the benefits of parallelism temporal and spatial locality and other performance sources. Relevant program transformations are particularly tedious and challenging when targeting modern multicore CPUs and GPUs with deep memory hierarchies and parallelism, and code, the effectiveness of such tools is often reduced by the are often performed automatically by optimizing compilers. The polyhedral model enables precise analyses and a relatively easy specification of transformations (loop restructuring, automatic rarallelization, etc.) that take advantage of hardware performance sources. As a result, there is growing evidence that the polyhedral model is one of the best frameworks for not backed by memory storage. efficient transformation of compute-intensive programs [I], [2] [3] and for programming accelerator architectures [4] [5]. [6]. Consequently, the compiler community has focused on building tools that identify and optimize parts of the fine-grained control of compiler optimization provided by lowprogram that can be represented within the polyhedral model level IRs. It builds on the recent MLIR compiler infrastructure tools tend to fall into two categories mmonly referred to as static-control parts, or SCoP's). Such that allows the interplay of multiple abstraction levels within Compiler-based tools like Polly [7] and Graphite [8] detect Intermixable MLIR abstractions, or dialects, include highand transform SCoPs in compiler intermediate representations level constructs such as loops, parallel and reduction par (IRs). While this offers seamless integration with rest of the terms; low-level representations fully covering LLVM IR [16]: compiler, the lack of high-level structure and information hinders the tools' ability to perform analyses and transformations. This structure needs to be recovered from optimized IR, often by combining the best of source-level and IR-level tools in Fig. 1. The Polygeist compilation flow consists of 4 stages. The freetend terretors Claug AST to emit MLR SCP dislect (Section [BFA]), which is raticed to the Allfite dislect and pre-optimized (Section [BFA]) The Ris then precuously a polyhedral scheduler (Sections [BFC-BELD]) before post-optimization and parallelization (Section [BFC-BELD]). It is translated to LLVM IR for further optimization and binary generation by LLVM. imperfectly or at a significant cost [9]. Moreover, common compiler optimizations such as LICM may interfere with the Improving the efficiency of computation has always been process [10]. Finally, low-level IRs often lack constructs for, which makes the flow more complex Source-to-source compilers such as Pluto [11], POCC [12] and PPCG [5] operate directly on C or C++ code. While this lack of enabling optimizations such as those converting hazardous memory loads into single-assignment virtual registers. Furthermore, the transformation results must be expressed in C. which is known to be complex [13]. [14] and is also missing constructs for, e.g., reduction loops or register values This paper proposes and evaluates the benefits of a polyhe dral compilation flow. Polygeist (Figure II), that can leverage both the high-level structure available in source code and the the same representation, during the same transformations [15] an end-to-end polyhedral flow, Polygeist preserves high-level information and leverages them to perform new or improved Faster than pre-existing Polyhedral optimization tools thanks to higher-level abstraction ### Why Bother? Raise the Abstraction Level ### Progressive Raising in Multi-level IR Andi Drebes TU Eindhoven Inria and École Normale Supérieure Findhoven. The Netherlands Paris, France Lebelini@tne nl andi@nrogrammierforen de Nicolas Vasilache Google Zurich, Switzerland ntv@google.com Tobias Grosser University of Edinburgh Edinburgh, UK tobias.grosser@ed.ac.uk Google Google Paris, France Paris, France zinenko@google.com albertcohen@google.com Henk Corporaal Albert Coher TU Eindhoven Eindhoven, The Netherlands h.corporaal@tue.nl Abstract-Multi-level intermediate representations (IR) show great promise for lowering the design costs for domain-specific compilers by providing a reusable, extensible, and non-onin support the progressive lowering of high-level representations to low-level IR, they do not raise in the opposite dire entry point into the compilation pipeline defines the highest level of abstraction for all subsequent transformations, limiting the set of applicable optimizations, in particular for general-purpose languages that are not semantically rich enough to model the required abstractions. We propose Progressive Raising, a complementary approach to We propose Progration Entities, a complicatority appearab to the higher level abstraction to beverage domains for both severage domains for low-level representations. We further introduct Radii. Fig. 1: Multi-Level Tactions e district waypoows for preserview radius, for low-level representations. We further introduct Radii. Fig. 1: Multi-Level Taction is formed to the progressive radius; from affine loop nests specified in a higher-abstraction levels to enable effective domain specific general-purpose languages to higher-of barrac articles of the progration programment progra Our raising paths leverage subsequent high-level domain-specific transformations with significant performance improvements. Index Terms—MLIR, progressive raising, multi-level interme- ### I INTRODUCTION the ongoing trend for beterogeneous systems has made it milation framework with interoperable representations breaking difficult for general-purpose compilers to generate efficient the isolation between domains and enabling comprehensive code automatically [1]. One of the main issues is the mismatch optimizations. During compilation, the source program's highbetween the low level of abstraction at which general-purpose level representation is progressively optimized and transformed compilers operate and the various high-level abstractions for to lower-level abstractions, until reaching a low-level, general computation required by today's applications [2]. Although purpose representation for code generation [7]. high-level programming languages allow for the specification Multi-level frameworks solve many issues of DSLs, but [5]. However, such languages commit to a limited set of isolated low level, thus precluding most, if not all, domain-specific abstractions and domain-specific optimizations, resulting in noor interoperability limited reusability of software components, and few opportunities for inter-domain optimizations [6] Multi-level intermediate representations explicitly allow for The increasing complexity of hardware resulting from the co-existence of multiple abstractions within the same com- of high-level operations, this information is often not captured the optimizations in progressive lowering compilation scheme by the low-level intermediate representation (IR) of general-crucially rely on the adequate initial representation of the source purpose compilers or lost early in the compilation process program. If the initial representation is below the required level of abstraction for a given optimization, the optimization simply Domain-specific languages (DSLs) and compilers attempt fails to apply. However, providing an adequate high-level input to capture and explicitly preserve high-level information, representation may not always be possible General numbers throughout the compilation process and have been employed languages not being semantically rich enough to preserve the successfully to generate efficient code for modern hardware [4], right level of information enter the lowering pipeline at a very ## Why Bother? Raise the Abstraction Level TU Eindhoven Inria and École Normale Supérieure Findhoven. The Netherlands Paris, France Lebelini@tne nl andi@nrogrammierforen de Nicolas Vasilache Google Zurich, Switzerland ntv@google.com Tobias Grosser University of Edinburgh Edinburgh, UK tobias.grosser@ed.ac.uk Paris, France Google Paris, France zinenko@google.com albertcohen@google.com Henk Corporaal TU Eindhoven Eindhoven, The Netherlands h.corporaal@tue.nl Abstract-Multi-level intermediate representations (IR) show great promise for lowering the design costs for domain-specific omnilers by providing a reusable, extensible, and non-onin support the progressive lowering of high-level representations to We propose Progressive Raising, a complementary approach to We propose Progressive Railing, a complementary approach to the progressive Investigation in multi-level IR had reads from lower to the progressive Investigation in the Investig ceneral-nursose language to high-level linear algebra operations general-purpose language to ingristere interal argeria operations Our raising paths leverage subsequent high-level domain-specific transformations with significant performance improvements. Index Terms-MLIR, progressive raising, multi-level interme ### I INTRODUCTION the oneoing trend for beterogeneous systems has made it relation framework with interogerable representations breaking difficult for general-purpose compilers to generate efficient the isolation between domains and enabling comprehensive code automatically [1]. One of the main issues is the mismatch optimizations. During compilation, the source program's highbetween the low level of abstraction at which general-purpose level representation is progressively optimized and transformed compilers operate and the various high-level abstractions for to lower-level abstractions, until reaching a low-level, general computation required by today's applications [2]. Although purpose representation for code generation [7]. high-level programming languages allow for the specification Multi-level frameworks solve many issues of DSLs, but to capture and explicitly preserve high-level information, representation may not always be possible General numbers successfully to generate efficient code for modern hardware [4], right level of information enter the lowering pipeline at a very [5]. However, such languages commit to a limited set of isolated low level, thus precluding most, if not all, domain-specific empilation via progressive lowering. noor interoperability limited reusability of software components, and few opportunities for inter-domain optimizations [6] Multi-level intermediate representations explicitly allow for The increasing complexity of hardware resulting from the co-existence of multiple abstractions within the same com- of high-level operations, this information is often not captured the optimizations in progressive lowering compilation scheme by the low-level intermediate representation (IR) of general-crucially rely on the adequate initial representation of the source purpose compilers or lost early in the compilation process program. If the initial representation is below the required level of abstraction for a given optimization, the optimization simply Domain-specific languages (DSLs) and compilers attempt fails to apply. However, providing an adequate high-level input throughout the compilation process and have been employed languages not being semantically rich enough to preserve the Can target BLAS, which is (obviously) faster than any automatically compiled code. On Intel i9-9900K. # Why Bother? Modernize and Port Legacy Code Tokyo Isnan endo@is.titech.ac.ip Ivan R. Ivanov Tokyo Institute of Technology RIKEN R-CCS Kobe, Japan ivanov i as@m.titech ac in RIKEN R-CCS Kobe, Japan iens.domke@riken.ip zinenko@google.com Toshio Endo Tokyo Institute of Technology William S. Moses University of Illinois Urbana-Champaign Google DeepMind Illinois, United States erators like GPUs require significant architecture-specific tuning written in CUDA. tensor cores, etc. of females, assumed to high surround to help assumed hel the need for performance periability across different GPUs, and instructions. For example, curiler versions of programmable expectability moderates by a supposed programs and a periability and the across a relative to in midd. Even when the programs can be sensibled vectored on a different architecture, it mays uffer a performance penalty due to it not being steed appropriate, to the available lardware resources such as fast. See a hardware unit while modern GPUs use "full warpe" of a party of the programs and the programs are also as a second of the programs and the programs are also as a full warper of a hardware unit while modern GPUs use "full warpe" of 22 and allows up to 2008 threads per hardware unit Similar to the programs are also as a program and the programs are also as a program and the programs are also as a full program and the programs are also as a full program and the programs are also as a full program and the programs are also as a full program and the programs are also as a full program and the programs are also as a full program and threads per hardware unit. the amount of memory and register resources it requires. By threads per hardware unit. Even when GPU kernels written in CUDA appear to run to also target AMD GPUs by performing automatife translation on newer NVIDIA GPUs, they may often fail to reach similar AMD GPUs executing the same CUDA program. Accelerators like GPUs remain the hardware target of choice for performance-critical software. Achieving high performance of a different vendor, let alone the often non-trivial engineering on these accelerators requires programmers to effectively effort of portine itself. leverage a peculiar programming model, most often exposed as In this paper, we propose a compiler-based mechanism to C++ language extensions such as CUDA for NVIDIA GPUs "resize" GPU programs to a particular architecture. Taking and ROCm for AMD. While the community has developed existing CUDA code, our compiler can control the granularity alternative methods to portably program GPUs, including: a of the program including the amount of work performed by high-level block programming knows in 11th- 42), may happen of C+-Occode one GFM 25. Pulmy-style abstractions unappen of C+-Occode one GFM 25. Pulmy-style abstraction with varying degree of automated scheduling in JAC [3]. Te [4]. "In spire of various alternatives, like ECCn and SYCL [6]. the CUDA with varying degree of automated scheduling in JAC [3]. Te [4]. "In spire of various alternatives, like ECCn and SYCL [6]. the CUDA with Yard [7] and Yard [7] are supported by the CUDA with Yard [7]. The CUDA with Yard [7] are supported by the Symposium scanner of the CUDA with Yard [7] are supported by the CUDA with Yard [7]. "In spire of various alternatives, like ECCn and SYCL [6]. the CUDA with Yard [7] are supported by the 979-8-3503-9509-9/24 © 2024 IEEE Abstract—In order to come close to peak performance, accel- programs, including these very portability frameworks, remain Oleksandr Zinenko Google DeepMind While the CUDA programming model and syntax have nemory and registers, let alone not using newer advanced changes can be observed in the amount of available low-latency memory and registers. This trend is even more important when We propose a new approach to improving performance of degacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and from CUDA and simultaneously adjust the program granularity utilization as the kernels are incorrectly sized for the target to at the size of target GPUs. Combined with autotiming assisted by the platform-specific complete, or appeared theometries 27% genomas specing on the Rodinis benchmark size over baseline CUM implementation of the programming model by writing CUDA programs that for the programming model by writing CUDA programs with this electricity as the conversal and and the sidence and the programs with this electricity as not accorded of the programs with this electricity as not accorded of the programs with this electricity as not accorded of the programs with this electricity as not accorded of the programs with this electricity as not accorded of the programs with this electricity as not accorded of the programs with this electricity as not accorded of the programs with this electricity as not accorded of the programs with this electricity as not accorded to the programs with the electricity as not accorded to the programs with the electricity as not accorded to the programs with the electricity as not accorded to the programs with the electricity as not accorded to the programs with the electricity as not accorded to the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of the programs and the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of the programs with the electricity as a correction of amount of allocated "shared" memory between several threads in a group or the amount of registers used (which is proportional to the number of threads). Both of these characteristics have a dramatic impact on the overall performance. These sizing problems are often amplified when porting a program to a GPU ## Why Bother? Modernize and Port Legacy Code Ivan R. Ivanov Tokyo Institute of Technology RIKEN R-CCS Kobe, Japan ivanov i as@m.titech ac in RIKEN R-CCS Kobe, Japan iens.domke@riken.ip Toshio Endo Tokyo Institute of Technology Tokyo Isnan endo@is.titech.ac.ir William S. Moses University of Illinois Urbana-Champaign Google DeepMind Illinois, United States Aborder—In order to come done in posk performance, nector programs, including fluor very portability frameworks, remain that underscand the suitability of about entersy, particulation, increase cross, str., including fluor very portability frameworks, and the underscand the suitability of about entersy, particulation, increase cross, str., including fluor control and spart to the control and strength of the control and strength of the control and strength of the control and strength or strengt features of the architecture. We propose a new approach to improving performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and threads per hardware unit. the amount of memory and register resources it requires. By threads per hardware unit. Even when GPU kernels written in CUDA appear to run to also target AMD GPUs by performing automatife translation on newer NVIDIA GPUs, they may often fail to reach similar from CUDA and simultaneously adjust the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates 27% geomean speedup on the Rodinia benchmark suite over baseline CUDA implementa-tion as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program. ### I. INTRODUCTION Accelerators like GPUs remain the hardware target of choice for performance-critical software. Achieving high performance of a different vendor, let alone the often non-trivial engineering on these accelerators requires programmers to effectively effort of portine itself. C++ language extensions such as CUDA for NVIDIA GPUs "resize" GPU programs to a particular architecture. Taking and ROCm for AMD. While the community has developed existing CUDA code, our compiler can control the granularity alternative methods to portably program GPUs, including: a of the program including the amount of work performed by mapping of C++ code onto GPUs [2], NumPy-style abstractions with varying degree of automated scheduling in JAX [3], TC [4], Abstract—In order to come close to peak performance, accel- programs, including these very portability frameworks, remain Oleksandr Zinenko Google DeepMind zinenko@google.com memory and registers, let alone not using newer advanced changes can be observed in the amount of available low-latency memory and registers. This trend is even more important when considering GPUs of a different vendor, like AMD, which operate in "wavefronts" of 64 threads and allow up to 4096 > utilization as the kernels are incorrectly sized for the target use of the programming model by writing CUDA programs that adapt to different numbers of concurrent threads. But even programs with this flexibility do not permit control of the amount of allocated "shared" memory between several threads in a group or the amount of registers used (which is proportional to the number of threads). Both of these characteristics have a dramatic impact on the overall performance. These sizing problems are often amplified when porting a program to a GPU 979-8-3503-9509-9/24 © 2024 IEEE Older CUDA code is made faster (better shmem use) And also runs on AMD transparently! ### How to Start Using It https://mlir.llvm.org/getting\_started/ # How to Start Using It https://www.youtube.com/watch?v=IXAp6ZAWyBY ### Little Builtin, Everything Customizable Define your abstraction Reuse existing when possible ### No fixed set of: - Operations - Attributes - Types ### Bring your own anything: - As long as you define and verify semantics - Group into "dialects" ### How to Start Using It ### https://mlir.llvm.org/ https://llvm.discourse.group/c/mlir/31 https://discord.gg/xS7Z362 Open Meeting: Thursdays, 18:00 CEST