Fast Hybrid (Shared/Distributed Memory) SPH Code for Astrophysics

Welcome to SWIFT

Welcome to SWIFT, a joint project of the Institute for Computational Cosmology (ICC) and the Institute of Advanced Research Computing (IARC) at the University of Durham.

The ICC is a world-leading research institute in the field of cosmology, focusing on the simulation of the formation of structures in the Universe and the evolution of galaxies. In collaboration with the IARC, the ICC started the development of the SPH With Inter-dependent Fine-grained Tasking (SWIFT) code to provide astrophysicists with a state-of-the-art framework for particle-based simulations. The long-term goal of the project is to create a single framework that allows astrophysicists to run simulations efficiently on all types of architectures, ranging from desktop machines to the largest supercomputers.

The code is entirely open source and uses task-based parallelism to distribute the work across the different computing units of modern clusters. The library used to this end, itself also open source, is named QuickSched and provides users with an alternative to standard parallelisation strategies.

The collaboration between IARC and the ICC that is behind this project is partially supported by Intel through the establishment of an Intel Parallel Computing Centre (IPCC) at the University of Durham.

Movies & Animations

Simulation results from standard hydrodynamical tests can be found here:

Those two examples have been run using the code with the "Gadget-2 SPH" hydrodynamics switched on. The initial conditions for these cases can be found alongside the source code.

Detailed description

The aim of the code is to tackle the challenge of running particle simulations with a very large dynamic range - arising for example in problems of compressible hydrodynamics or galaxy formation - efficiently on modern computer architectures. Such architectures combine many levels of parallelism, using shared-memory nodes with many cores, some of which may additionally have an accelerator. An example density field from a galaxy formation simulation is shown below: the density in the hot regions (gas in haloes of galaxies) is many orders of magnitude higher than in the dark regions (voids), and consequently the time-steps over which the particles march in time as the system evolves also differ by many orders of magnitude.


Figure 1: A visual impression of the virtual universe from the EAGLE project, run with a heavily modified version of the Gadget code. Cosmic gas is coloured according to temperature, from cold (dark) to very hot (red). Such simulations often take months to run on thousands of cores. Speeding up such calculations by an order of magnitude would represent a step-change in the way cosmologists can understand how galaxies form.

The main bottleneck of such simulations is load imbalance, arising when calculations on one core depend on those performed on another core. Such interdependency severely limits strong-scaling behaviour, yet good scaling is a vital requirement as computers become ever more parallel. SWIFT also tackles the issue of how to distribute work if not all cores are equal, as is the case when nodes contain accelerators. Finally, the speed with which cores do work is often limited by the rate at which data is fed to them: cache efficiency of the code is crucial.
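
As a rough illustration of why load imbalance hurts, the following minimal Python sketch compares a fixed, up-front split of the work with a scheme in which each core picks up the next task as soon as it is free. It is not taken from SWIFT: the function names and the task costs are hypothetical, chosen only to mimic a few expensive dense regions next to many cheap void regions, and dependencies between tasks are ignored.

```python
import heapq


def static_makespan(costs, n_cores):
    """Give each core one contiguous chunk of tasks, fixed in advance.

    The run finishes only when the most heavily loaded core finishes.
    """
    chunk = -(-len(costs) // n_cores)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))


def dynamic_makespan(costs, n_cores):
    """Let whichever core becomes free first take the next task in the list."""
    free_at = [0.0] * n_cores  # min-heap of times at which each core is free
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# Hypothetical workload: 4 expensive tasks followed by 60 cheap ones.
costs = [100.0] * 4 + [1.0] * 60
n_cores = 4
ideal = sum(costs) / n_cores  # perfectly balanced run time

for label, makespan in [("static", static_makespan(costs, n_cores)),
                        ("dynamic", dynamic_makespan(costs, n_cores))]:
    print(f"{label:8s} time = {makespan:6.1f}  efficiency = {ideal / makespan:.2f}")
```

With these made-up costs the static split takes 412 time units because one core receives all four expensive tasks, while the dynamic scheme reaches the balanced time of 115 units. Finer-grained tasks give the dynamic scheme more freedom to even out the load.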

The main design specifications of SWIFT are:

  • Task-based parallelism to exploit shared-memory parallelism. This provides fine-grained load balancing enabling strong scaling, combined with mixing communication and computation, both on each node's cores as well as on external devices.
  • SIMD vectorization and mixed-precision computation, using a gather-scatter paradigm and single-precision values where higher accuracy is unwarranted. This is supported by the underlying algorithms, which attempt to maximize data locality so that vectorization is possible at all and cache throughput is maximized.
  • Hybrid shared/distributed memory parallelism, using the task-based scheme. Parts of the computation are scheduled only once the asynchronous transfers of the required data have completed. Communication latencies are thus hidden by computation, providing for strong scaling across multi-core nodes.
  • Graph-based domain decomposition, which uses information from the task graph to decompose the simulation domain such that the work, rather than just the data as in schemes based on space-filling curves, is equally distributed amongst all nodes.
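
The idea behind task-based parallelism with dependencies can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the scheduler used by SWIFT or its tasking library: the function `run_task_graph` and the task names `density_A`, `density_B` and `force_AB` are hypothetical, and a standard-library thread pool stands in for the real runtime. A task is handed to a worker only once everything it depends on has finished, which is the same rule that lets computation wait for asynchronous data transfers without blocking other work.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task_graph(tasks, deps, n_workers=4):
    """Run `tasks` (name -> callable) while respecting `deps`
    (name -> set of names that must finish first).

    Returns the task names in the order in which they were seen to complete.
    """
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    running = {}  # future -> task name
    done_order = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while remaining or running:
            # Submit every task whose prerequisites have all completed.
            ready = [name for name, waiting in remaining.items() if not waiting]
            for name in ready:
                del remaining[name]
                running[pool.submit(tasks[name])] = name
            if not running:
                raise ValueError("unsatisfiable dependencies (cycle or unknown task)")
            finished, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in finished:
                name = running.pop(future)
                future.result()  # re-raise any exception from the task
                done_order.append(name)
                # Unlock tasks that were waiting on this one.
                for waiting in remaining.values():
                    waiting.discard(name)
    return done_order


# Hypothetical example: a pair interaction needs both cells' densities first.
log = []
tasks = {
    "density_A": lambda: log.append("density_A"),
    "density_B": lambda: log.append("density_B"),
    "force_AB": lambda: log.append("force_AB"),
}
deps = {"force_AB": {"density_A", "density_B"}}
order = run_task_graph(tasks, deps)
print(order)
```

The two density tasks may complete in either order, but `force_AB` always runs last because it is submitted only after both of its prerequisites have finished.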

Figure 2: Strong scaling test of SWIFT and Gadget-2 on a cosmological problem with 51×10^6 particles. The left panel shows the speed-up from one to 1024 cores (linear scale). Perfect scaling is indicated by the dotted line. Gadget-2 stops scaling when more than 400 cores are used, while SWIFT still speeds up. The numbers indicate the wallclock time per time step for both codes. SWIFT is 40x faster than Gadget-2 on that problem. On one core, SWIFT is 7x faster than Gadget-2. The right panel shows the corresponding parallel efficiency. SWIFT presents an efficiency of more than 80% up to 256 cores and of 60% at 1024 cores.
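
For readers unfamiliar with the two quantities plotted in Figure 2, the short Python sketch below shows how speed-up and parallel efficiency are usually defined and how they relate. The function names are hypothetical, and the sketch assumes that the efficiencies quoted in the caption are measured relative to a single core, which the caption suggests but does not state explicitly.

```python
def speedup(t_ref, t_n):
    """Speed-up: reference wallclock time divided by the time on n cores."""
    return t_ref / t_n


def parallel_efficiency(t_ref, t_n, n_cores, n_ref=1):
    """Fraction of the ideal speed-up that is actually achieved."""
    return speedup(t_ref, t_n) * n_ref / n_cores


def implied_speedup(efficiency, n_cores, n_ref=1):
    """Invert the definition: the speed-up implied by a given efficiency."""
    return efficiency * n_cores / n_ref


# Efficiencies quoted in the caption, assumed to be relative to one core.
for n_cores, eff in [(256, 0.80), (1024, 0.60)]:
    print(f"{n_cores:5d} cores at {eff:.0%} efficiency -> "
          f"speed-up of about {implied_speedup(eff, n_cores):.0f}x")
```

Under that assumption, 60% efficiency at 1024 cores corresponds to a speed-up of roughly 614x over one core, and 80% at 256 cores to roughly 205x.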

A technical presentation of the SWIFT code, given at the conference on Exascale computing held in Ascona in 2013, can be found here.