On-the-fly clustering for exascale molecular dynamics simulations. - presented by Dr Alizée Dubois and Thierry Carrard

On-the-fly clustering for exascale molecular dynamics simulations.

Alizée Dubois and Thierry Carrard


Preamble

No slides

Introduction And Motivation

Welcome to the next installment of our Computer Physics Communications Seminar Series. Today, we are pleased to have Alizée Dubois and Thierry Carrard from CEA in France. They will discuss their recent paper on "On-the-Fly Clustering for Exascale Molecular Dynamics Simulations."

Without further delay, I will now turn the presentation over to our speakers.

Good afternoon, everyone. It is a pleasure to be here, and we appreciate the opportunity to present our work on clustering methods for exascale molecular dynamics simulations. We are part of a laboratory located in Bruyères-le-Châtel, south of Paris, where we focus on studying matter under extreme conditions.

No summary

In extreme conditions, we aim to understand how the physical properties of matter are altered under shock conditions. This includes examining the equation of state, which describes pressure as a function of density, and investigating phase transition mechanisms, such as transitions between solid, liquid, and plasma states. We also analyze how shocks can induce damage and how the constitutive behavior of solid materials is affected.

Extreme conditions involve very high pressure applied over a very short time scale or at a very high loading rate. To achieve these conditions, we utilize various loading devices. For instance, we employ diamond anvil cells for generating super high pressures in the hundreds of gigapascals, albeit at relatively low loading rates. For higher loading rates, we utilize gun techniques.


Impact Loading And Damage

A pressurized chamber can deliver high-pressure gas onto a bullet, which then impacts the target and generates a shock wave. Additionally, a laser can be used to focus on the metal, creating a plasma that ejects material and produces shockwaves in the sample.

A shock wave is a compressive pressure that propagates through a medium. It can reflect off free surfaces, leading to tension loading. However, materials do not sustain tension loading for extended periods; instead, they may nucleate bubbles and generate ejecta, as demonstrated in the accompanying video.

In this seminar, we aim to analyze phenomena such as bubble generation, ejecta formation, and void creation.

We aim to understand the implications of voids and bubbles at both the atomic and grain scales. The slide presents a monocrystal of Tantalum, along with an image of Tantalum grains at a higher magnification, and the surface of a sample impacted by a laser, which illustrates the void generation.

To analyze these phenomena, we utilize several computational tools. For the atomic scale, we employ Molecular Dynamics simulations. At the intermediate scale, we use continuum mechanics codes, and for the structural scale, we apply hydrodynamics codes. This presentation will primarily focus on an example using our Molecular Dynamics code.

No summary
References
  • 1.
    C. Denoual and A. Vattré (2016) A phase field approach with a reaction pathways-based potential to model reconstructive martensitic transformations with a large number of variants. Journal of the Mechanics and Physics of Solids

ExaNBody Framework

Hello everyone. I am Thierry, responsible for overseeing our computer science initiatives, particularly in simulation tools. Our journey began with the development of our Molecular Dynamics code, ExaSTAMP, with a strong emphasis on high-performance computing facilities. Over the years, we have refined this tool and introduced numerous massively parallel algorithms, including GPU ports for most of them. This advancement has sparked interest in optimizing other simulation tools to leverage similar parallel algorithms tailored for specific simulations.

For example, we have created a specialized N-body simulation platform called ExaNBody. This platform serves as a foundation for any N-body based simulation code. Currently, we have three simulation codes built on ExaNBody that model various systems, including small grains of matter, rigid bodies like sand grains or powders, and complex rigid bodies that incorporate contact dynamics, friction, and collision detection.

In addition to our work with Molecular Dynamics, we are also exploring hydrodynamics through particle-based simulations using Smooth Particle Hydrodynamics. This approach is particularly advantageous for scenarios where traditional mesh-based methods are insufficient.

References
  • 1.
    T. Carrard et al. (2024) ExaNBody: A HPC Framework for N-Body Applications. Lecture Notes in Computer Science



We have developed a comprehensive ecosystem for our simulation code software tailored for high-performance computing facilities. This ecosystem consists of three primary components.

First, we utilize industry-standard low-level libraries, including MPI, OpenMP, C++, YAML for user inputs, and CUDA and HIP for GPU acceleration. These libraries form the foundational layer of our software.

Building on this foundation, we have created three proprietary software layers. The first is Onika Brick, which serves as the interface with the hardware and supports a component-based programming model. The second is exaNBody, which focuses on N-body parallel algorithms. Finally, our three simulation code applications are built on top of these layers.

The accompanying slide illustrates two of our simulation codes. At the top, there is an example of a block of matter subjected to shock from a piston, resulting in rapid movement and the ejection of micro jets of matter through molecular dynamics. The bottom section depicts a rotating drum containing millions of small grains of material. As the drum rotates, the grains mix and move dynamically. Now, let us focus on the molecular dynamics aspect of our work.

References
  • 1.
    T. Carrard et al. (2024) ExaNBody: A HPC Framework for N-Body Applications. Lecture Notes in Computer Science

EXASTAMP Framework

Molecular Dynamics is, in principle, an unbounded N-body problem, but it has strong properties that can be exploited. It employs a potential function that describes the interactions between particles, indicating whether they repel or attract each other: when particles are in close proximity, they repel each other, while at greater distances, they exhibit attractive forces.

A key feature of this potential function is its rapid convergence towards zero. To effectively manage these interactions, Molecular Dynamics utilizes a cutoff approach, limiting the consideration of particle interactions to a defined maximum radius. This ensures that, at any given moment, only particles within a specified distance from a central particle are taken into account.
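As a rough illustration of this cutoff idea, here is a minimal Python sketch (not the actual ExaNBody/ExaSTAMP code): a Lennard-Jones-style pair potential is truncated at a radius r_cut, so only pairs closer than the cutoff contribute. All names and parameter values are purely illustrative.

```python
import numpy as np

def lj_energy_with_cutoff(positions, epsilon=1.0, sigma=1.0, r_cut=2.5):
    """Total pair energy, counting only pairs closer than r_cut.

    Illustrative O(N^2) sketch; production MD codes use cell lists and
    neighbour lists so each particle only sees nearby candidates.
    """
    n = len(positions)
    energy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(positions[i] - positions[j])
            if r >= r_cut:          # potential treated as zero beyond the cutoff
                continue
            sr6 = (sigma / r) ** 6
            energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return energy

# Tiny usage example with random coordinates in a 10x10x10 box.
rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 10.0, size=(100, 3))
print(lj_energy_with_cutoff(pts))
```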


In the context of parallelization, we utilize a domain decomposition strategy. Each computation node has a shared memory horizon, and each MPI process operates within a designated bluish area, surrounded by a ghost region. This ghost area contains copies of particles from neighboring MPI processes, facilitating the exchange of information necessary to update properties across processes.

Now, we will apply the molecular dynamics code to perform physics simulations. For instance, we can observe a video illustrating an impact on Tantalum, showcasing the impact event, the reflection of the waves, and the nucleation of voids.

References
  • 1.
    T. Carrard et al. (2024) ExaNBody: A HPC Framework for N-Body Applications. Lecture Notes in Computer Science
  • 2.
    R. Prat et al. (2020) AMR-based molecular dynamics for non-uniform, highly dynamic particle simulations. Computer Physics Communications

The simulation involves 83 million atoms, and our objective is to analyze the resulting data effectively. We aim to detect voids, as well as determine their positions, growth velocities, and volume ratios. However, existing literature offers limited methodologies for this analysis.

One notable tool in the field is OVITO, which provides extensive post-processing capabilities. Despite its strengths, it is only parallelized with OpenMP, which constrains it to a single shared-memory node. As a result, when processing tens of millions of atoms, the tool becomes inefficient. Additionally, since OVITO is a post-processing tool, it requires recording vast amounts of data to achieve high temporal resolution, making it unsuitable for our specific analysis needs.

References
  • 1.
    R. Ravelo et al. (2013) Shock-induced plasticity in tantalum single crystals: Interatomic potentials and large-scale molecular-dynamics simulations. Physical Review B

Problem & Proposed Approach

No summary
References
  • 1.
    A. Stukowski (2009) Visualization and analysis of atomistic simulation data with OVITO–the Open Visualization Tool. Modelling and Simulation in Materials Science and Engineering

We propose a new on-the-fly approach: an algorithm that detects particle aggregates or, more broadly, regions of interest within molecular dynamics simulations. This algorithm should be versatile, capable of identifying any binarizable zone of interest. It must also be robust and scalable, functioning effectively with billions of atoms. Additionally, it should be simple and low-cost relative to the overall simulation expense. This will enable physicists to obtain accurate measurements, including quantification with error bars, and to qualify the data. To achieve these objectives, we have selected connected component analysis as our methodology.

Methodology & Technical Bottlenecks



Connected components analysis can be categorized into two primary branches: those developed by the image processing community and those from the graph processing community.

The image processing community primarily utilizes shared-memory algorithms that operate with a finite number of passes, which limits their scalability to a single node. To employ these algorithms at scale, they therefore have to be integrated into a distributed-memory scheme.

In contrast, the graph processing community has developed algorithms that are fully parallelized and typically involve many iterative processes. However, these algorithms require adaptation for application to 3D images.

The emphasis on images arises from the need to work with atomic data, which consists of moving atoms. However, our focus will shift from atomic data to projecting these quantities onto a regular grid. This involves subdividing the spatial domain into cells, allowing us to define subcells or box cells for our analysis. Each atomic quantity will be projected onto these boxes, resulting in a 3D voxelized image. Our initial approach will be to implement the methodologies from the image processing community.


This work is based on the PhD thesis of Killian, who conducted the foundational research. I am here to present his findings and provide an overview of the work he accomplished.


The algorithm begins with a projection step, transitioning from atomic data to a cell-based representation. Following this, we will perform connected component labeling and analysis. This process involves moving from atom-based quantities to binarized data and then to aggregation.

The projection steps are designed to enhance the distribution of the quantities. Key parameters for the projection include the atomic projection splat, which serves as a smoothing parameter for the fields, and the size of the simulation subcell, which dictates our discretization approach.

After projecting the atomic quantities (such as density, velocity, or any mechanical field computed per atom), we obtain a discretized representation of these fields. At this stage, we apply a binarization threshold. For instance, when considering density, we have various threshold options: a stringent criterion may require selected voxels to be entirely empty, while a more lenient approach allows for some flexibility. The choice of threshold will influence the detection of void quantities, particularly in circular structures.
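To make the projection and thresholding steps concrete, here is a hedged sketch (not the authors' implementation): atoms are deposited onto a regular grid of cells, the per-cell density is smoothed by a kernel that plays the role of the projection splat, and a density threshold binarizes the image into candidate void voxels. The function and parameter names are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def binarize_density(positions, box, cell_size, splat_sigma=1.0, void_threshold=0.2):
    """Project atoms onto a regular grid and binarize the result.

    positions : (N, 3) atomic coordinates inside `box` (three box lengths).
    Returns a boolean 3D image where True marks candidate void voxels.
    """
    shape = tuple(int(np.ceil(b / cell_size)) for b in box)
    # Per-cell atom counts, proportional to the local density.
    idx = np.minimum((positions / cell_size).astype(int), np.array(shape) - 1)
    density = np.zeros(shape)
    np.add.at(density, tuple(idx.T), 1.0)
    # The smoothing kernel stands in for the projection "splat".
    density = gaussian_filter(density, sigma=splat_sigma)
    # A factor of 0 keeps only fully empty voxels; a looser factor is more permissive.
    return density <= void_threshold * density.mean()

box = np.array([100.0, 100.0, 100.0])
atoms = np.random.default_rng(1).uniform(0.0, 100.0, size=(50_000, 3))
void_mask = binarize_density(atoms, box, cell_size=2.0)
print(void_mask.shape, void_mask.sum(), "void voxels")
```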

Once the projection is complete, we will have a visualized image annotated with unique ID labels for each cell, ranging from 0 to 5. These IDs correspond to the cells' positions within the global simulation, providing a convenient reference that will be utilized in subsequent analyses.

In this process, we begin by assigning unique identifiers to each voxel in the simulation. As we traverse through the voxels, we label the first voxel as 1. The second voxel, which is connected to the first, receives the same label. This continues sequentially until we reach the sixth voxel, which is isolated. At this point, we define a new connected component and increment the label for the next set of connected voxels.

As we progress to the next row, we find that the sixth voxel is connected to one of the previously labeled voxels. We can then assign it the same label as the first connected component. This process allows us to establish a global graph of connected components. However, not all voxels are yet connected, necessitating a final pass to consolidate the information and assign labels based on the minimum identifiers.
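The procedure just described is essentially the classical two-pass connected component labeling with an equivalence (union-find) table. A minimal 2D sketch, written independently of the authors' code, could look as follows:

```python
import numpy as np

def find(parent, x):
    """Follow equivalence links until the root (minimum representative) is reached."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def two_pass_ccl(binary):
    """Two-pass connected component labeling on a 2D boolean image, 4-connectivity."""
    labels = np.zeros(binary.shape, dtype=int)
    parent = {}          # label -> parent label (equivalence table)
    next_label = 1
    # First pass: provisional labels, recording equivalences along the way.
    for i in range(binary.shape[0]):
        for j in range(binary.shape[1]):
            if not binary[i, j]:
                continue
            up = labels[i - 1, j] if i > 0 else 0
            left = labels[i, j - 1] if j > 0 else 0
            neighbours = [l for l in (up, left) if l > 0]
            if not neighbours:                  # isolated so far: new component
                parent[next_label] = next_label
                labels[i, j] = next_label
                next_label += 1
            else:
                m = min(neighbours)
                labels[i, j] = m
                for l in neighbours:            # remember that these labels touch
                    ra, rb = find(parent, l), find(parent, m)
                    parent[max(ra, rb)] = min(ra, rb)
    # Second pass: replace every label by the minimum label of its component.
    for i in range(binary.shape[0]):
        for j in range(binary.shape[1]):
            if labels[i, j]:
                labels[i, j] = find(parent, labels[i, j])
    return labels

img = np.array([[1, 1, 0, 0, 1],
                [0, 1, 0, 1, 1],
                [1, 0, 0, 0, 0]], dtype=bool)
print(two_pass_ccl(img))
```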

The objective is to achieve domain labeling, which is necessary for parallel processing between threads and MPI (Message Passing Interface). The domain labeling between threads is relatively straightforward, but we will focus on the more complex MPI domain labeling next.



In this process, we utilize a master MPI process to facilitate the stitching of the data. Each MPI process sends messages to the master, enabling it to establish the global connections.

Consider a scenario with two neighboring MPI processes; the local graphs constructed by one of them are displayed on the right, where label 36, for instance, is connected to other labels. Each MPI process performs local connected component labeling (CCL) at the boundaries of its region, using the concept of ghost layers.

At the boundary of each MPI process there is a ghost layer, made up of the first layer of voxels just outside the MPI domain. This allows a local 2D CCL to be executed across these two layers of voxels. The process then sends a message to the master, reporting the IDs of the connected voxels outside its spatial zone together with those inside it. If label 36 is connected to label 25, it communicates this as "36-25"; and since label 25 is itself connected to label 3, it reports the root of the connected component.

The same applies to label 39, which connects to 25, and to the subsequent connection between 25 and 43. In this way the boundary connections are aggregated. Each MPI process sends its local boundary CCL information to the master, which consolidates this data and transmits a unified message back to each MPI process.
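Assuming each rank has already produced its boundary equivalence pairs, such as (36, 25) or (25, 3), the stitching on the master rank can be sketched with mpi4py as below; the pair values and the per-rank logic are illustrative only.

```python
from mpi4py import MPI

def merge_pairs(pairs):
    """Union-find over (label_a, label_b) pairs; returns label -> minimum root."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in pairs:
        ra, rb = find(a), find(b)
        parent[max(ra, rb)] = min(ra, rb)
    return {label: find(label) for label in parent}

comm = MPI.COMM_WORLD
# Hypothetical boundary pairs produced by the local 2D CCL on the ghost layer.
local_pairs = [(36, 25), (39, 25)] if comm.rank == 1 else [(25, 3), (25, 43)]

all_pairs = comm.gather(local_pairs, root=0)           # every rank -> master
if comm.rank == 0:
    global_map = merge_pairs([p for sub in all_pairs for p in sub])
else:
    global_map = None
global_map = comm.bcast(global_map, root=0)            # master -> every rank
# Each rank can now relabel its local voxels: label -> global_map.get(label, label).
```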


In summary, the algorithm involves multiple steps where all workers are engaged in processing. The left side of the slide outlines these steps, while the right side presents a graph indicating the number of active workers and their respective timeframes. Initially, all workers are operational; however, a bottleneck may occur during the information reduction phase at the boundaries, potentially leading to traffic congestion in the MPI system. It is crucial to assess whether this presents a problem for the scalability of our algorithm.

To evaluate the performance of our algorithm, we conducted a simulation involving 81.9 million atoms. The accompanying graph illustrates the work ID, representing the number of cores, while the color-coded activity on the cores is displayed over time. The lower graph integrates this information, showing the number of active workers over time.

We observed a bottleneck caused by the information reduction at the master MPI, which leads to a temporary slowdown. Although this is a drawback, it does not significantly impact the overall analysis time, indicating that it is manageable.

Next, we aim to determine the optimal MPI/thread partitioning of a node, which should maximize the efficiency of the Molecular Dynamics analysis. Our computational nodes consist of 64 cores, and we can explore various decompositions, such as 1 MPI process with 128 threads or 2 MPI processes. However, increasing the number of MPI processes leads to a higher volume of message exchanges, while increasing the number of threads can result in longer memory access times, making it less optimal.

Ultimately, we seek the best-performing configuration, which we identified as 4 MPI processes with 32 threads each, maximizing overall efficiency.


In terms of strong scaling, we maintain a constant load of 81.9 million atoms, which is relatively manageable for our system.

Results

We evaluate the reduction in computation time as we increase the number of cores. The dotted black line represents the ideal scenario. Although we do not perfectly match this ideal, our performance improves with a higher load of atoms.


For the weak scaling results, we maintain a constant load per core: as we increase the number of cores, we also increase the number of atoms, targeting approximately 160,000 atoms per core. Although the scalability of our analysis is not ideal, the scalability of the molecular dynamics itself is nearly perfect, indicating that communication is not a significant issue.

In our analysis, while the performance is not as strong, the bottleneck we identified (the reduction of the boundary information onto a single MPI process) remains manageable. Our primary objective is that the analysis does not excessively increase the computation time: the time required for the analysis must not exceed that of a single molecular dynamics step. The accompanying slide presents the ratio of analysis time to one step of dynamics.


The light blue points represent strong scaling data at low load, while the pink points correspond to high loads. The remaining colors reflect weak scaling, where the load increases with the number of cores. When the partitioning is adequately optimized, the analysis time approaches 5% of the molecular dynamics time, which is satisfactory. Moreover, this analysis is not run after every step but rather every 10 or 100 steps, making its impact negligible compared to the overall cost of the molecular dynamics.

Analysis

In this section, we present an example comparing a Tantalum structure of 1 micron by 1 micron to one measuring 150 nanometers. This represents a significant advancement in our capabilities. We successfully performed connected components analysis and detected all relevant features, which was a highly satisfactory outcome. Additionally, we will discuss the image processing techniques employed in this analysis.

The image processing method for developing connected components has been established. We will now transition to a new implementation focused on graph processing.

As previously mentioned, the algorithm developed during the PhD is efficient and well-optimized. However, it has a limitation: it consolidates information into a central MPI node. To address this, we aimed to create an alternative algorithm for two main reasons. First, we sought a simpler algorithm that could be adapted for GPU implementation in the near future. Second, for large simulations, we wanted to ensure robustness, avoiding scenarios where all information is gathered on a single node.

The alternative implementation we have recently finalized maintains the same starting point as before. Regarding the local MPI process, it operates similarly to the previous algorithm.


The key difference lies in the function used to associate a unique voxel ID with the coordinates of each cell. This function is crucial to the algorithm's effectiveness: it must be deterministic, invariant to the number of MPI processes, and exhibit locality, which is why we employ a space-filling-curve-based function (a Z-order curve).

Additionally, it is essential to know in advance the minimum and maximum possible values of this function, prior to propagating the labels. These properties are significant, and their importance will become clear shortly. We begin with unique IDs assigned to each voxel and then propagate the minimum label ID within each connected component.

References
  • 1.
    https://en.wikipedia.org/wiki/Z-order_curve#/media/File:Lebesgue-3d-step3.png
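A common realization of such a deterministic, locality-preserving voxel ID is the Morton (Z-order) encoding referenced above. The sketch below interleaves the bits of the three integer cell coordinates; it is one plausible implementation, not necessarily the authors' exact choice.

```python
def part_bits(v, bits=21):
    """Spread the bits of v so that two zero bits separate consecutive bits."""
    out = 0
    for i in range(bits):
        out |= ((v >> i) & 1) << (3 * i)
    return out

def morton_id(ix, iy, iz, bits=21):
    """Z-order (Morton) ID of the cell with integer coordinates (ix, iy, iz).

    Deterministic, independent of the MPI layout, and locality-preserving:
    cells that are close in space tend to get close IDs. The maximum value,
    needed before label propagation, is simply morton_id(n-1, n-1, n-1)
    for an n^3 grid with n <= 2**bits.
    """
    return part_bits(ix, bits) | (part_bits(iy, bits) << 1) | (part_bits(iz, bits) << 2)

# The eight cells of a 2x2x2 block get the eight consecutive IDs 0..7.
print([morton_id(x, y, z) for z in (0, 1) for y in (0, 1) for x in (0, 1)])
```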


Rather than relaying information to a central MPI process, we leverage the existing ghost-layer communication algorithm from standard Molecular Dynamics simulations. This approach updates the values in the ghost area based on computations from neighboring MPI processes.


Once updated information is received from neighboring processes, we may encounter additional labels or lower minimum IDs that require further propagation. We alternate between local steps, where the minimum local ID is propagated within the MPI process, and steps that update the ghost areas. This process may reveal lower label IDs in the surrounding area that need to be propagated again.

It is important to note that more iterations may be necessary, especially if a connected component spans multiple subdomains. As propagation occurs, each MPI process updates its ghost areas, and these updates are communicated to neighboring processes. This iterative process continues until a global stable state is achieved, which is identified when no changes are observed in the state across all MPI processes during the local minimum propagation step. At this point, further iterations are unnecessary.
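The alternating local-propagation and ghost-update loop might be sketched as follows, assuming for simplicity a 1D split of the voxel image along its first axis, one ghost plane on each side, and the availability of mpi4py and scipy; none of these choices are taken from the talk.

```python
import numpy as np
from mpi4py import MPI
from scipy import ndimage

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size
NONE = np.iinfo(np.int64).max          # sentinel for voxels outside the mask

# Hypothetical local slab of the binarized image: 16 interior planes plus one
# ghost plane on each side along axis 0 (the ghosts are filled by the exchanges).
rng = np.random.default_rng(rank)
mask = np.zeros((18, 16, 16), dtype=bool)
mask[1:-1] = rng.random((16, 16, 16)) < 0.3

# Initial labels: a globally unique voxel ID wherever the mask is set.
offset = rank * mask.size
lab = np.where(mask, offset + np.arange(mask.size).reshape(mask.shape), NONE)

left, right = (rank - 1) % size, (rank + 1) % size   # periodic 1D decomposition
while True:
    before = lab[1:-1].copy()
    # Local step: inside every local connected component (ghost planes included),
    # propagate the minimum label currently present in that component.
    comp, ncomp = ndimage.label(mask)
    if ncomp:
        mins = ndimage.minimum(lab, comp, index=np.arange(1, ncomp + 1))
        lab[mask] = np.asarray(mins, dtype=np.int64)[comp[mask] - 1]
    # Ghost update: each boundary plane is copied into the neighbour's ghost plane.
    lab[-1] = comm.sendrecv(lab[1], dest=left, source=right)
    lab[0] = comm.sendrecv(lab[-2], dest=right, source=left)
    mask[-1] = comm.sendrecv(mask[1], dest=left, source=right)
    mask[0] = comm.sendrecv(mask[-2], dest=right, source=left)
    # Globally stable when no rank changed anything during its local step.
    changed = int(np.any(lab[1:-1] != before))
    if comm.allreduce(changed) == 0:
        break

print(rank, "distinct local labels:", np.unique(lab[1:-1][mask[1:-1]]).size)
```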

In this state, each MPI subdomain holds parts of one or more connected components. We then need to aggregate metrics for every connected component, such as the number of voxels and other relevant data. However, we aim to avoid centralizing all of this information on a single MPI process. Instead, we assign each connected component to a specific process, designated as its owner, responsible for aggregating all values associated with that component.

To facilitate this, we introduce a function G. This function determines the MPI rank of the process responsible for aggregating metrics for each connected component label, based on the voxel ID and the total number of MPI processes.
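A simple choice for such a function G, given purely to illustrate the idea, maps the component's root label onto a rank, for example by splitting the known ID range evenly (which is where knowing the maximum possible ID in advance pays off) or by a plain modulo:

```python
def owner_rank(root_label, max_label, n_procs):
    """MPI rank that owns the connected component whose root label is `root_label`.

    Deterministic and computable by every rank without any communication.
    `max_label` is the maximum possible voxel ID, known in advance thanks to
    the properties of the ID function discussed earlier.
    """
    # Even split of the [0, max_label] ID range across the ranks.
    return min(root_label * n_procs // (max_label + 1), n_procs - 1)

def owner_rank_mod(root_label, n_procs):
    """Simpler alternative: plain modulo; deterministic but ignores locality."""
    return root_label % n_procs
```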


Each MPI process can deterministically identify the owner of any connected component ID. Consequently, if a process holds part of a connected component within its local subdomain, it knows exactly how much information it must send to each other MPI process; this is represented in a send count array. Similarly, a receive count array tells it how much information must be received from each other MPI process. We then use the MPI all-to-all communication pattern, allowing every MPI process to exchange data with the others, which makes the data transfer efficient after these steps.

Each MPI process is responsible for maintaining the complete information for every connected component it owns. Subsequently, all MPI processes can write the connected component properties in parallel to a result file. It is important to note that throughout the algorithm, there is no step that requires gathering information from all connected components.
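Building on the owner function, the exchange of per-component partial metrics can be sketched with mpi4py's object-based all-to-all, which handles the counts implicitly (the buffer-based MPI_Alltoallv variant would make the send and receive count arrays explicit). The labels and metric values below are hypothetical.

```python
from collections import defaultdict
from mpi4py import MPI

comm = MPI.COMM_WORLD
n = comm.size

def owner_rank_mod(root_label, n_procs):
    """Toy owner function (see the previous sketch): label -> owning rank."""
    return root_label % n_procs

# Hypothetical partial metrics measured on this rank's subdomain:
# component root label -> number of local voxels belonging to it.
local_counts = {3: 120 + comm.rank, 43 + comm.rank: 7}

# Bucket the partial metrics by owner rank, then exchange everything at once.
outbox = [[] for _ in range(n)]
for label, voxels in local_counts.items():
    outbox[owner_rank_mod(label, n)].append((label, voxels))
inbox = comm.alltoall(outbox)          # rank i receives outbox[i] from every rank

# Each owner sums the contributions for the components it is responsible for,
# so the full per-component information is never gathered on a single rank.
totals = defaultdict(int)
for contributions in inbox:
    for label, voxels in contributions:
        totals[label] += voxels
print(comm.rank, dict(totals))
```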

This slide presents a detailed overview of the test case used in our analysis. In the previous scenario, studied with the algorithm developed by Killian, the voids formed numerous small connected components sparsely but fairly uniformly distributed across the domain, a situation in which that algorithm scales excellently.

In the present case, however, we observe one very large connected component that extends widely across the domain, which is closer to a worst case. This occurs because numerous filaments link groups of atoms together, resulting in a more complex structure within the simulation.


This slide illustrates the distribution of connected components when using a static and deterministic approach, where components are assigned to MPI processes based solely on their IDs. As shown, the distribution is not ideal; several MPI processes, particularly those associated with higher ranks, own few or no connected components.

Despite this, the overall repartition is adequate for our algorithm's scalability, particularly from a memory perspective. This demonstrates that the algorithm can effectively handle both weak and strong scaling scenarios.


Regarding weak scaling, our algorithm is slightly less scalable than the first one. This is partly because we focused on a robust and portable solution with a future GPU implementation in mind. Despite this, our performance remains competitive with the initial implementation from the PhD work.

In terms of strong scaling results, we observe commendable scalability, although the presence of a large connected component necessitates more iterations as the number of MPI processes increases. Nonetheless, we still achieve satisfactory scaling properties.


We achieved scalability with up to 4,400 cores. Notably, the memory footprint of this algorithm remains nearly constant across different MPI processes, regardless of the number of processes utilized.

Summary


In summary, we have developed two algorithms for connected component labeling. The first, inspired by image-processing techniques, is our first implementation. It is scalable and has been successfully extended to MPI-level scalability; however, optimizing this code requires significant complexity.

The second algorithm is an iterative and alternative approach that is fully scalable, even in worst-case scenarios. From a memory perspective, it maintains scalability and robustness, producing deterministic connected component labeling regardless of the number of MPI processes. This code is straightforward, consisting of only a few lines, and is expected to be portable to GPU architectures in the near future.

Now, let's explore the applications of these algorithms. The first example illustrates the distribution of mass and velocity of the aggregates resulting from liquid impacts on a surface. The second example focuses on spall fracture, where the blue regions represent voids. This allows us to assess whether the simulation is sufficiently large to capture the microscopic properties of small fractures. The accompanying diagram depicts the correlation between the reflection of the compressive wave, the tension-driven inflation of voids, the velocity at the rear of the sample, and the number of void nuclei.

This slide presents the temporal resolution of void nuclei, measured every 10 units. This data allows us to analyze the nucleation properties of voids and their evolution over time, which is valuable for microscopic modeling. Additionally, we can assess and quantify the distribution of void volumes.

The data presented here demonstrates the stability of void nucleation across various sample cross sections. We tested cross sections of 10 by 10 nanometers, 30 by 130 nanometers, 100 by 100 nanometers, and 1 micron by 1 micron. The results showed that all cross sections yielded similar outcomes. While a 10 nanometer cross section might seem sufficient for studying the phenomena, the 1 by 1 micron simulation was crucial for validating our findings.

Additionally, when examining the global distribution of void volumes, the limited data from the smaller cross sections would not allow us to differentiate between a log-normal and other distribution forms. In this instance, the larger simulation size enabled us to reject the hypothesis of a log-normal distribution, indicating a more complex form.

Furthermore, the right side of the slide illustrates the evolution of the number of voids and their volume distribution over time. For a macroscopic model, it is essential to consider this evolution in volume distribution. Overall, the analysis conducted using these algorithms provides valuable insights into the dynamics of void nucleation and growth.


Moving forward, we must focus on quantifying the uncertainties in our measurements. This will allow us to address various physical questions about how local information can enhance our microscopic models. Thank you for your attention, and I welcome any questions.
