Improving the performance of HiRep lattice simulations software by exploiting the CPU hardware architecture details and algorithm characteristics
In the scientific exploration of Quantum Chromodynamics (QCD), the theory governing the strong interaction among quarks and gluons, large-scale numerical simulations are performed using the framework of lattice gauge theories. Lattice Gauge Theory (LGT) simulations involve the formulation of gauge field theories on a space-time lattice.
HiRep is a simulation suite for running lattice simulations on high-performance computing platforms. It is designed to be flexible enough to study a wide range of strongly interacting systems, particularly those pertinent to novel physics investigations at CERN’s Large Hadron Collider (LHC). Improving the execution time of HiRep is a non-trivial task, yet even marginal improvements can have a significant impact on paving the way to new discoveries in particle physics.
A detailed study, analysis, and profiling of the HiRep application revealed that the implementation of the Dirac operator is one of the most computationally intensive routines and the main performance bottleneck. Consequently, this routine was optimized for CPU-based distributed-memory hardware platforms. The main performance inefficiencies include communication overhead due to extensive data exchanges between MPI processes, workload imbalances in OpenMP regions, inefficient data reuse of lattice sites, and ineffective auto-vectorization.
To address these inefficiencies, both algorithmic and hardware-dependent optimization strategies are employed. These strategies include efficient hybrid parallelization (using both the MPI and OpenMP parallel programming frameworks), improved OpenMP parallelism through loop collapsing, optimization of memory access patterns, and vectorization (using both AVX2 and the Clang compiler’s vector intrinsics).
In experiments on two distinct High-Performance Computing (HPC) platforms, the proposed optimizations improve the performance of HiRep, achieving an overall speedup of up to 1.80× compared to the baseline MPI version.
Keywords: Lattice simulation, Dirac operator, Performance optimization, Hybrid programming, Memory access patterns, Vectorization
Rahman, Md Shidur
55f3c1b5-efaf-42bc-aa97-80e496193b81
15 November 2024
Kelefouras, Vasilios
32729c6b-90d3-4539-a083-d5ce6fbb75d9
Rago, Antonio
5ae8e38a-7699-4ef6-9a58-099d18a88c32
Rahman, Md Shidur (2024) Improving the performance of HiRep lattice simulations software by exploiting the CPU hardware architecture details and algorithm characteristics. University of Plymouth, Doctoral Thesis, 104pp.
Record type: Thesis (Doctoral)
This record has no associated files available for download.
More information
Published date: 15 November 2024
Identifiers
Local EPrints ID: 497478
URI: http://eprints.soton.ac.uk/id/eprint/497478
PURE UUID: 7075ff09-3c12-4b36-bf39-f8b4b12df3b9
Catalogue record
Date deposited: 23 Jan 2025 17:48
Last modified: 23 Jan 2025 17:48
Contributors
Author: Md Shidur Rahman
Thesis advisor: Vasilios Kelefouras
Thesis advisor: Antonio Rago