Improving CUDA Performance#
One can regard the action of a CUDA kernel on a mesh as the distribution of elementary tasks (one mesh cell = one elementary task) to the CUDA cores of the GPU. The CUDA cores are grouped into streaming multiprocessors (SMP) on board the GPU. For instance, a Kepler K20 GPU has 13 SMP with 192 cores each (for single precision data), hence 2496 CUDA cores in total. In a similar manner, splitting the whole task into threads that each perform an elementary task on a CUDA core obeys a two-level hierarchy: the global mesh must be split into logical blocks, and the blocks are then split into threads. The user has to choose the size of the blocks in X, Y and Z.

A given block runs on a single SMP. If you choose blocks that are too small, the SMPs are underused and performance is degraded. If you choose blocks that are too large, the small amount of memory within each SMP (48 KB) is saturated and the extra data is stored in the device's global memory (the "Video RAM"), with a dramatic performance penalty. Other considerations also affect the choice of the CUDA blocks (for instance memory alignment), but in short there is an optimal block size that maximizes performance (a minimal sketch of this block/thread decomposition is given after the list below). This size depends on:
- the GPU,
- the kernel itself (the optimal size is not the same for all the kernels of FARGO3D).
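The following minimal CUDA sketch illustrates this decomposition; it is generic code, not taken from FARGO3D, and all names and sizes in it are illustrative. Each thread handles one mesh cell, and the grid of blocks, whose dimensions in X, Y and Z are the parameter to tune, covers the whole mesh:

/* Generic CUDA sketch (not FARGO3D code): each thread performs the
   elementary task for one mesh cell; the mesh is covered by a grid of
   blocks whose dimensions in X, Y and Z are the tunable choice. */
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale_field(double *q, double factor, int nx, int ny, int nz)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;   /* cell index in X */
  int j = blockIdx.y * blockDim.y + threadIdx.y;   /* cell index in Y */
  int k = blockIdx.z * blockDim.z + threadIdx.z;   /* cell index in Z */
  if (i < nx && j < ny && k < nz)                  /* guard against padding threads */
    q[((size_t)k*ny + j)*nx + i] *= factor;        /* one elementary task per cell */
}

int main(void)
{
  const int nx = 128, ny = 128, nz = 32;
  double *q;
  cudaMalloc((void**)&q, (size_t)nx*ny*nz*sizeof(double));
  cudaMemset(q, 0, (size_t)nx*ny*nz*sizeof(double));

  dim3 block(16, 8, 1);                   /* block size in X, Y, Z: the quantity being tuned */
  dim3 grid((nx + block.x - 1)/block.x,   /* enough blocks to cover the whole mesh */
            (ny + block.y - 1)/block.y,
            (nz + block.z - 1)/block.z);
  scale_field<<<grid, block>>>(q, 2.0, nx, ny, nz);
  cudaDeviceSynchronize();
  printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFree(q);
  return 0;
}

Changing the three numbers in block (and recomputing grid accordingly) is exactly the kind of tuning discussed in the rest of this section.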
By default, the block sizes used for kernel execution are the numbers provided in the .opt file; these are "reasonable" numbers, but they are the same for all kernels (hence they cannot be optimal for every kernel).
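For reference, these defaults are typically set in the setup's .opt file as preprocessor definitions along the following lines (the values shown here are only illustrative, not a recommendation):

# Default CUDA block size in X, Y and Z (illustrative values)
FARGO_OPT += -DBLOCK_X=16
FARGO_OPT += -DBLOCK_Y=16
FARGO_OPT += -DBLOCK_Z=1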
A makefile rule, combined with Python scripting, has been developed to perform a systematic test of the performance of each kernel, individually, as a function of the size of the CUDA blocks.
Warning
make blocks does not work when the .opt option -DMPIIO is enabled.
At compilation time, a file called setup.blocks (where setup is the name of your setup) is looked for in the corresponding setups/setup directory, in order to provide c2cuda.py with the best block size for each kernel. You could write this file by hand, but in practice it is automatically generated by the makefile when you execute the rule called "blocks":
make blocks setup=SETUP
It is necessary to use "setup" in lower case in order to avoid a clash with the SETUP variable. Example:
make blocks setup=fargo
You will see lines similar to:
CompPresIso 64 8 1 appended
CompPresAd was skipped.
compute_slopes was skipped.
compute_star was skipped.
compute_emf was skipped.
update_magnetic was skipped.
substep1_x 16 8 1 appended
substep1_y 32 4 1 appended
substep1_z was skipped.
substep2_a 64 8 1 appended
...
A file called fargo.blocks is created inside setups/fargo and filled with this information, which represents the best block size for each kernel. The skipped functions are the ones that are not used in this particular setup.
The process generally takes a few minutes. At the end, you have a .blocks file similar to:
CompPresIso 64 8 1
substep1_x 16 8 1
substep1_y 32 4 1
substep2_a 64 8 1
...
Now, each time you compile the code, this file is read by the c2cuda.py script. In the best cases, the performance increases by 10 to 20%. The largest gain is obtained for massive 3D MHD problems.
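For instance, assuming the usual GPU build options for the fargo setup, a subsequent compilation such as the following will pick up the tuned block sizes automatically:

make SETUP=fargo GPU=1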
Note: The .blocks file can be kept for future use if you want to save time. In theory, the .blocks file is hardware dependent, so be careful if you share the same file across multiple platforms.
MPICUDA#
The considerations about GPU Direct and the improvement of MPI communications between GPUs are discussed in the section CUDA aware MPI implementations.