Preamble
Introduction to PyFR v2.1
It is a pleasure to be here with all of you. I would like to express my gratitude for the invitation and extend my warmest birthday wishes to ZJ Anthony and Chong Yang.
Today, I am pleased to discuss the latest advancements in our high-order flow solver, PyFR, within the context of emerging trends in computational fluid dynamics (CFD) and their industrial applications. Our focus is on making scale-resolving simulations applicable to industrial settings.
We achieve this by integrating advancements in numerical methods, many of which have been contributed by colleagues in this room, and by enhancing the efficiency of our implementations.
As a point of motivation, it is important to recognize that CFD serves as the foundation for several high-tech industries. While CFD has experienced significant success across these sectors, it is essential to acknowledge that it is not without its limitations.
Motivation
This slide, originally presented by Murray Cross from Airbus, illustrates the flight envelope of a commercial airliner. The x-axis represents equivalent airspeed, while the y-axis depicts the load on the airframe in G. The blue region indicates the flight envelope within which the aircraft must be validated for safety.
The red dot, positioned at a load of 1G and a speed of VC (the cruise speed), represents the region where current-generation CFD is applied effectively in a predictive context. Although this region is critical, since it encompasses the conditions under which aircraft typically operate, the entire flight envelope, including off-design conditions, is equally important.
The inability to simulate these off-design conditions effectively often leads engineers to overengineer the aircraft, resulting in increased complexity and added weight. This excess weight is carried even during normal operations at cruise speed. Therefore, a significant objective of my work is to enhance the fidelity of CFD simulations in industry.
- 1. Murray Cross, Airbus, Technology Product Leader, Future Simulations (2012)
To illustrate the relationship between fidelity and computational cost, we can categorize simulation methods. In the upper left quadrant, we find RANS methods, which can perform a 3D aerofoil simulation in under 10 seconds, as demonstrated by Antony. Conversely, at the opposite end of the spectrum, DNS methods offer high fidelity but are prohibitively expensive for practical geometries and realistic Reynolds numbers.
My focus is on large eddy simulations, or more accurately, under-resolved DNS simulations. The objective is to reduce their computational cost by advancing numerical methods and their implementation. We utilize PyFR, a code I developed during my PhD at Imperial College in 2012. This project began with the challenge to create a 3D high-order Navier-Stokes solver that operates in parallel on GPUs, all within a three-month timeframe.
PyFR is unique in that it is written in Python, a language that has become the standard in the machine learning community over the past decade. The design allows users to interact with an accessible Python layer that manages configuration files, mesh reading, and other routine tasks.
The appeal of high-order schemes, such as flux reconstruction methods, spectral difference schemes, and discontinuous spectral element methods, lies in the fact that most of their operations can be expressed as matrix multiplications plus point-wise nonlinear kernels. Matrix multiplications are extremely well optimized on modern hardware, whereas the point-wise nonlinear kernels, which encapsulate the physics of the simulation, have no off-the-shelf implementation. We write these kernels in a domain-specific language similar to C, enabling us to translate our physics, including flux functions, Riemann solvers, and boundary conditions, into efficient C code that runs on the hardware used for the simulation.
This approach provides insulation from the specific vendor languages of different devices, allowing our code to have a longer lifespan than any single computer architecture. Notably, we generate this code at runtime, tailoring it to the specific parameters of the simulation, such as the equation of state and boundary conditions. This process may take 5 to 10 seconds, but it significantly enhances performance for simulations that could run for hours, days, or even months.
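To make the idea concrete, here is a minimal sketch of runtime kernel generation, assuming a plain string template and a hypothetical point-wise pressure kernel; PyFR's actual DSL, kernel set, and backend compilers differ in detail:

```python
from string import Template

# Hypothetical point-wise kernel template; the real DSL is richer than this
KERNEL_TMPL = Template(r'''
void pressure_kernel(int n, const double *rho, const double *rhou,
                     const double *E, double *p)
{
    const double gamma = ${gamma};
    for (int i = 0; i < n; i++)
        p[i] = (gamma - 1.0)*(E[i] - 0.5*rhou[i]*rhou[i]/rho[i]);
}
''')

def generate_kernel(cfg):
    # Substitute run-time configuration (here just the ratio of specific
    # heats) into the template, yielding C source tailored to this run
    return KERNEL_TMPL.substitute(gamma=cfg['gamma'])

src = generate_kernel({'gamma': 1.4})
# The generated source would then be handed to the appropriate runtime
# compiler for the target backend (e.g. NVRTC for CUDA or hipRTC for HIP).
print(src)
```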
Over the past 12 years, PyFR has evolved to include support for AMD GPUs, which are now prevalent in many large DOE clusters, as well as Intel GPUs, though with limited success. More recently, we have added support for Apple GPUs, making it feasible to conduct unsteady 3D simulations on consumer laptops, albeit with some restrictions.
PyFR Architecture and Development
To enhance performance, we focused on optimizing matrix multiplications, particularly for tensor product operators. We explored the potential to leverage the tensor product structure of certain elements, which results in predominantly sparse operator matrices. This led to the development of several libraries to facilitate these optimizations.
Additionally, we introduced support for in-situ postprocessing, a crucial feature for large simulations that require quicker turnaround times. This development addresses the need for efficient postprocessing, eliminating the reliance on lengthy MATLAB sessions for analysis.
By late 2013, PyFR had made significant strides, featuring compressible Euler and Navier-Stokes solvers capable of running on unstructured grids, albeit limited to hexahedral elements in three dimensions. The implementation was based on explicit Runge-Kutta methods, lacking stabilization and shock capturing techniques.
We implemented a NaN checker, which has proven invaluable, as we frequently encounter NaN values. Additionally, we developed file writing capabilities and a progress bar to indicate simulation duration, particularly useful for identifying issues such as an incorrect CFL condition that could lead to extended simulation times.
While the initial version of the code was limited by industrial standards, it successfully incorporated several key concepts. Notably, the use of a high-level programming language and a domain-specific language were not widely accepted a decade ago. Today, however, these concepts are foundational in major machine learning frameworks, reflecting the progress we anticipated.
We also recognized the potential of GPUs, believing they would become a lasting asset rather than a passing trend. Although we established a solid foundation, the functionality remained constrained due to our limited understanding of numerical methods at the time. Continued method development is essential for advancing our capabilities, and many individuals in this room have contributed significantly to these advancements, which we have integrated into our code.
In version 2.1, we have introduced several significant enhancements, highlighted in bold, since the release of our first version. One major advancement is the addition of support for incompressible flow, enabling us to explore various interesting applications. We have also added support for previously missing major element types, including prisms, pyramids, and tetrahedra.
Our adaptive Runge-Kutta schemes, particularly the explicit ones, have been improved for usability. The time step is now determined automatically by a PID controller, eliminating the need for user input. Additionally, we have incorporated support for implicit SDIRK schemes, although achieving convergence, especially on GPUs, remains a challenging area of research.
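As a rough illustration of how such a controller works, here is a generic PID-style step-size update driven by the embedded error estimate of an adaptive Runge-Kutta scheme; the gains, clamping limits, and exact formula are illustrative and not PyFR's implementation:

```python
def pid_step_controller(dt, err, err_prev1, err_prev2, order,
                        safety=0.9, kp=0.38, ki=0.18, kd=0.01,
                        min_fac=0.3, max_fac=2.5):
    """Suggest the next time step from the embedded error estimates of the
    last three steps. The err* arguments are error norms already normalised
    by the tolerance, so 1.0 means 'exactly on tolerance'. Gains and limits
    are illustrative, not PyFR's values."""
    k = order + 1

    # Proportional, integral, and derivative action on the logarithm of the
    # normalised error; large errors shrink the step, small ones grow it
    fac = (safety
           * err ** (-ki / k)
           * (err_prev1 / err) ** (kp / k)
           * (err_prev1**2 / (err * err_prev2)) ** (kd / k))

    # Clamp the change so the step size cannot jump wildly between steps
    fac = min(max_fac, max(min_fac, fac))
    return dt * fac
```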
The most notable progress has been in stabilization, enhancing the robustness of our schemes, and in shock capturing. We have also developed numerous plugins that significantly enhance the code's industrial applicability, allowing for the extraction of physical results. These plugins include features for integrating quantities on surfaces, efficient turbulence generation for simulations with specified turbulence intensity at inflows, and the Ffowcs Williams-Hawkings method for aeroacoustic studies.
Furthermore, we have implemented time averaging, point sampling, and expression integration. I would like to emphasize two key features that have greatly advanced PyFR: entropy filtering, developed by my student Tarik, and in situ visualization. These features have been instrumental in expanding the range of applications we can effectively address.
While we have achieved many successes, we acknowledge that there is room for improvement. To enhance our internals, we have migrated to a structure in which all operations within a time step are organized as a task graph, resulting in improved performance.
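To illustrate the concept, the following sketch records kernels as nodes of a dependency graph that is built once and then replayed every step; the class and kernel names are illustrative, and a real backend would additionally map independent branches onto separate streams or queues:

```python
from graphlib import TopologicalSorter

class TaskGraph:
    """Toy task graph: kernels are callables, edges are data dependencies."""
    def __init__(self):
        self._deps = {}                      # kernel -> kernels it depends on

    def add(self, kernel, deps=()):
        self._deps[kernel] = set(deps)
        return kernel

    def run(self):
        # Execute in a dependency-respecting order; a real backend would also
        # overlap independent kernels and synchronise them with events
        for kernel in TopologicalSorter(self._deps).static_order():
            kernel()

g = TaskGraph()
interp  = g.add(lambda: print('interpolate solution to flux points'))
riemann = g.add(lambda: print('common interface fluxes'), deps=[interp])
flux    = g.add(lambda: print('element-local transformed fluxes'), deps=[interp])
rhs     = g.add(lambda: print('divergence and RHS evaluation'),
                deps=[riemann, flux])
g.run()   # replayed once per Runge-Kutta stage
```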
Furthermore, we have increased our use of auto-tuning, allowing us to evaluate multiple methods for the same operation at startup to determine the best approach for specific cases. This strategy has led to considerable improvements in both performance and scalability.
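In spirit, the auto-tuner does something like the following at start-up: time each candidate implementation of an operation on representative data and keep the fastest. The function and candidate names here are illustrative rather than PyFR's API:

```python
import time

def autotune(candidates, *args, repeats=10):
    """Return the fastest of several interchangeable implementations, timed
    on representative arguments. A sketch of the idea, not PyFR's machinery."""
    best, best_time = None, float('inf')
    for impl in candidates:
        impl(*args)                                  # warm-up call
        start = time.perf_counter()
        for _ in range(repeats):
            impl(*args)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best, best_time = impl, elapsed
    return best

# Usage sketch: choose once at start-up, then call `fastest` in the time loop
# fastest = autotune([dense_gemm, sparse_gemm, tiled_gemm], A, x)
```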
Performance Evaluation
In this presentation, I will focus on performance metrics using a tetrahedral grid. Historically, I have presented data based on hexahedral grids, which often yield more favorable results. However, for a more industrially relevant perspective, I will examine tetrahedral grids due to their inherent complexity and unstructured nature.
We will examine degree seven solution polynomials with explicit time stepping for the Navier-Stokes equations, on a mesh of 1.2 million cubic elements. Our performance evaluation will take place on two high-performance computing systems: Frontier, currently ranked as the second-fastest computer in the Top 500, and Alps, which is among the top ten. Frontier is equipped with AMD MI250X GPUs, while Alps features the latest NVIDIA GH200 GPUs.
To measure performance, we will use giga-degrees of freedom per second (GDOF/s) as our primary metric. For instance, if your simulation has 10^9 degrees of freedom and requires 100,000 RK4 time steps to achieve reliable statistics, the runtime can be approximated as follows:
Runtime = (10^9 DOF x 100,000 time steps x 4 stages per RK4 step) / performance (in DOF/s).
This gives a reliable estimate of how long such a simulation would take, for example on Frontier with its AMD GPUs.
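As a worked example of this estimate, using the Frontier baseline of 26 GDOF/s quoted below and interpreting that rate as degrees of freedom processed per second per Runge-Kutta stage:

```python
# All inputs are the figures quoted in the talk; nothing here is measured
# independently, and the per-stage interpretation of GDOF/s is an assumption.
dof    = 1e9       # degrees of freedom
steps  = 100_000   # RK4 time steps needed for converged statistics
stages = 4         # stages per RK4 step
rate   = 26e9      # sustained throughput: 26 GDOF/s on 16 ranks of Frontier

runtime_s = dof * steps * stages / rate
print(f"Estimated runtime: {runtime_s / 3600:.1f} hours")   # about 4.3 hours
```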
We will begin by utilizing 16 ranks, which corresponds to 8 physical GPUs, since each MI250X presents itself as two devices. From the code's perspective, we are therefore operating with 16 ranks, achieving a baseline performance of 26 GDOF/s on these 8 GPUs. As we scale our resources, ideal performance would follow the black line, while the actual scaling is illustrated by the blue dots.
The dotted line indicates a reference point of 80% parallel efficiency, which is a good balance between time to solution and efficient use of computational resources. Notably, even when increasing from 16 ranks to 2,048 ranks, we maintain good strong scaling, with efficiency remaining close to this reference at 1,024 ranks.
Transitioning now to the NVIDIA system, Alps, our baseline performance at 16 ranks improves to 96 GDOF/s, a significantly faster starting point. Our scaling efficiency also improves, with approximately 80% parallel efficiency maintained even at 1,024 ranks. This level of scalability matters in practice: as Peter showed in his presentation yesterday, it enables simulations such as low-pressure turbines (LPTs) with around 3 minutes of wall-clock time per blow-through.
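For reference, parallel efficiency here is the fraction of ideal speed-up retained relative to the 16-rank baseline; a small helper makes the definition explicit. The 1,024-rank throughput below is a placeholder chosen purely to show the arithmetic, since the talk only quotes the efficiency:

```python
def parallel_efficiency(perf, ranks, perf_base=26.0, ranks_base=16):
    # perf and perf_base in GDOF/s; ideal scaling multiplies throughput by
    # the same factor as the rank count
    return (perf / perf_base) / (ranks / ranks_base)

# Hypothetical: if 1,024 ranks sustained 1,331 GDOF/s against the 26 GDOF/s
# baseline, the parallel efficiency would be about 80%
print(parallel_efficiency(1331, 1024))   # ~0.80
```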
Shock Capturing Techniques
Next, I will discuss the entropy filtering technique developed by my group, particularly by my graduate student Tarik, who contributed many innovative ideas. This technique offers an efficient and robust solution for handling shocks, a significant challenge for high-order methods, especially beyond P2 and in real industrial applications. It is essential that the approach works across all major element types, on curved grids, and does not significantly degrade code performance.
The principle of entropy filtering is to apply modal filtering to troubled elements. We identify elements exhibiting negative pressure, negative density, or a violation of a minimum entropy condition, and then apply only as much filtering as is needed to restore these properties. This reduces to a root bracketing problem solved on a per-element basis, and the strategy minimizes unwanted dissipation where the solution is already well behaved.
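Schematically, the per-element root bracketing can be pictured as a bisection on the filter strength; the helper functions below are placeholders, and the real method operates on the element's modal coefficients and checks the constraints at a set of points within the element:

```python
def minimal_filter_strength(element, constraints_ok, apply_filter,
                            zeta_max=1.0, tol=1e-4):
    """Find (approximately) the smallest filter strength zeta for which the
    filtered element satisfies positivity of density and pressure and the
    minimum-entropy bound. Purely illustrative of the bracketing idea."""
    if constraints_ok(element):
        return 0.0                        # untroubled element: leave it alone

    lo, hi = 0.0, zeta_max                # lo violates, hi assumed to satisfy
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if constraints_ok(apply_filter(element, mid)):
            hi = mid                      # strong enough; try a weaker filter
        else:
            lo = mid                      # still violating; filter harder
    return hi
```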
For example, we tested this technique on a VFE-2 delta wing at a high angle of attack of 20.5 degrees, with a Reynolds number of 3 million based on a unit root chord. This configuration results in a local Mach number of 2 around the primary vortex, generating a distinct shock.
The simulation was run at P3 with entropy filtering on a quadratically curved mesh consisting of approximately 4.1 million elements.
On the left side of the slide, we present the Cp distribution compared to the experimental data. On the right side, we again display Cp alongside the vortical structures, which are colored by the local Mach number.
This method has proven effective, demonstrating success not only within our group but also across the broader research community. Many researchers have achieved significant results using the entropy filtering approach, which has greatly enhanced the capabilities of high-order methods at scale.
In the remaining time, I would like to address the importance of post-processing, particularly in the context of large-scale and industrial simulations. It is essential to recognize that a simulation code encompasses more than just the solver and its numerical methods; all components must scale effectively. This includes the mesh, the solver, and the post-processing, which is often overlooked.
In-Situ Visualisation
Traditionally, flow visualization involves running the flow solver, which periodically writes output files to disk. These files are then converted into a visualizable format, typically VTK, and opened in an application such as ParaView for post-processing.
After post-processing, an image is finally generated, with the rendering once again performed on a GPU. This workflow involves a round trip through disk storage, which is slow, and it also consumes a considerable amount of disk space, particularly when frequent visualizations are required.
The optimal scenario is to enable PyFR to generate visualizations and output images in real-time during execution. This approach allows us to create animations that illustrate the flow behavior and simulation evolution without unnecessary steps or wasted computational resources.
To achieve this, we utilize a library called Ascent, developed at Lawrence Livermore National Laboratory. Ascent facilitates in situ visualization and post-processing, providing an efficient means to visualize data dynamically as the simulation progresses.
One significant advantage of this approach, compared to previous methods of in situ visualization, is its ease of integration. Users need to focus on just four essential functions. Additionally, it supports parallel rendering, making it suitable for applications that utilize thousands of GPUs.
With this approach, the post-processing itself is parallelized and so does not become a bottleneck, even when utilizing thousands of GPUs. The rendering specification is defined using a standard declarative format such as JSON.
For example, you can define a velocity field with variables U, V, and W. You can then calculate the magnitude of this field and specify a slice of the velocity field by providing the point locations and the normal for the slicing plane.
Next, you can render this slice by indicating the desired color range, color scheme, camera position, and the filename for the output image. This process is straightforward and allows for multiple images and scenes to be rendered simultaneously, facilitating comprehensive insights.
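The declaration described above has roughly the following shape, written here as a Python dictionary mirroring the JSON that would be passed to the library; the field names, value ranges, and camera settings are illustrative examples rather than a verbatim PyFR or Ascent configuration:

```python
# Illustrative only: a pipeline that slices the velocity-magnitude field and
# a scene that renders the slice to an image file each time it is executed.
actions = [
    {"action": "add_pipelines",
     "pipelines": {
         "slice_pl": {
             "f1": {"type": "slice",
                    "params": {"point":  {"x": 0.0, "y": 0.0, "z": 0.0},
                               "normal": {"x": 0.0, "y": 0.0, "z": 1.0}}}}}},
    {"action": "add_scenes",
     "scenes": {
         "s1": {"plots": {"p1": {"type": "pseudocolor",
                                 "field": "velocity_magnitude",
                                 "pipeline": "slice_pl",
                                 "min_value": 0.0,
                                 "max_value": 2.0}},
                "renders": {"r1": {"image_prefix": "slice_%04d",
                                   "camera": {"azimuth": 30,
                                              "elevation": 20}}}}}},
]
```

In use, a description of this kind is handed to the in situ library once the solver has published its mesh and field data, and an image is produced every time the actions are executed.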
This method is exemplified by a standard test case included with PyFR.
This slide features one of the triangular aerofoils discussed by Peter. The code generates visualizations automatically during execution with minimal additional effort and only a slight decrease in performance. This capability not only enables large-scale simulations but also simplifies routine simulations, such as design optimization.
Conclusions
In conclusion, I have presented the second major release of PyFR, highlighting several new features, including support for new hardware, robust shock capturing, and in situ visualization.
I would also like to acknowledge the support of the AFOSR through a Young Investigator Award. Thank you.