GPU vs CPU Benchmarking
Some acceleration factors
At the present time, FARGO3D has been run on a limited number of platforms, so we can quote only a limited number of acceleration factors between CPU and GPU.
The FARGO_SPEEDUP macrocommand
We have developed a macrocommand named FARGO_SPEEDUP. You can see its source in the file src/define.h, near line 475. This macrocommand is meant to give the speedup factor of a given CUDA kernel with respect to its CPU counterpart. Its use is very simple. Suppose that we want to know the GPU vs CPU speedup ratio of the function SubStep1_x()
, for the setup mri.
First, we need to identify where this function is invoked. It is called near line 73 in the file src/algogas.c:
#ifdef X
FARGO_SAFE(SubStep1_x(dt));
#endif
Note that the invocation is wrapped in a FARGO_SAFE macrocommand, the definition of which is… empty (see file src/define.h near line 409). All the substeps of FARGO3D are wrapped similarly in this macrocommand. In normal use, it does not do anything. However, it may be redefined (see the alternate definitions commented out near line 409 in src/define.h), so as to provide useful debugging diagnostics.
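The idea of a wrapper that is normally transparent but can be switched to a diagnostic form can be sketched as follows. This is only an illustration: DEMO_SAFE, DEMO_DIAGNOSTICS, DemoSubStep and demo_run are hypothetical names, and the diagnostic variant shown here is an assumption, not one of the alternate definitions found in src/define.h.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical switch: compile with -DDEMO_DIAGNOSTICS to get the verbose form. */
#ifdef DEMO_DIAGNOSTICS
/* Diagnostic form: report where the wrapped call sits, then perform it. */
#define DEMO_SAFE(call)                                        \
  do {                                                         \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, #call); \
    call;                                                      \
  } while (0)
#else
/* Normal form: the wrapper adds nothing to the wrapped call. */
#define DEMO_SAFE(call) call
#endif

/* Hypothetical stand-in for a substep; it only counts its invocations. */
int demo_counter = 0;
void DemoSubStep(double dt) {
  assert(dt > 0.0);
  demo_counter++;
}

/* Invokes the wrapped substep once and returns the invocation count. */
int demo_run(void) {
  DEMO_SAFE(DemoSubStep(0.1));
  return demo_counter;
}
```

Because every call goes through the same wrapper, changing a single definition changes the behaviour at every call site at once.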
To get an automatic evaluation of the speedup factor, we simply change our wrapper from FARGO_SAFE to FARGO_SPEEDUP. Note that this new macrocommand will slightly manipulate the function name (it will invoke SubStep1_x_cpu() and then SubStep1_x_gpu()). Since the C preprocessor is unable to manipulate strings, we need to help it identify where the sub-string of arguments begins, by inserting a comma:
#ifdef X
FARGO_SPEEDUP(SubStep1_x,(dt)); // <= Note the comma before the '('
#endif
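The role of the comma can be illustrated with a minimal sketch of token pasting. The names CALL_BOTH, Demo_cpu, Demo_gpu and demo_call_both are hypothetical and chosen for this example only; the sketch assumes that the suffixes are appended with the ## operator, which is the standard way to build identifiers in the C preprocessor, and it is not the actual definition of FARGO_SPEEDUP.

```c
#include <assert.h>

/* Hypothetical stand-ins for the CPU and GPU versions of one routine. */
int calls_cpu = 0;
int calls_gpu = 0;
void Demo_cpu(double dt) { (void)dt; calls_cpu++; }
void Demo_gpu(double dt) { (void)dt; calls_gpu++; }

/* The function name and the parenthesized argument list arrive as two
   separate macro arguments, so ## can paste a suffix onto the name alone. */
#define CALL_BOTH(name, args) \
  do {                        \
    name##_cpu args;          \
    name##_gpu args;          \
  } while (0)

/* Calls both versions once and returns the total number of calls made. */
int demo_call_both(void) {
  CALL_BOTH(Demo, (0.1)); /* expands to Demo_cpu (0.1); Demo_gpu (0.1); */
  assert(calls_cpu == calls_gpu);
  return calls_cpu + calls_gpu;
}
```

Without the comma, the macro would receive Demo(0.1) as a single argument and would have no way to insert the suffix between the name and the opening parenthesis.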
We now build the code for the target setup, with the PROFILING and GPU options enabled:
make SETUP=mri GPU=1 PROFILING=1
and we run it:
./fargo3d in/mri.par
You should see an output such as:
Wall clock time elapsed during MPI Communications : 0.030 s
OUTPUTS 0 at Physical Time t = 0.000000 OK
TotalMass = 0.0271300282
******
Check point created
******
******
Check point restored
*******
GPU/CPU speedup in SubStep1_x: 22.775
CPU time : 91.1 ms
GPU time : 4 ms
We see that the function is timed in both its CPU and GPU versions (this test was run on an Intel(R) Core(TM) i7 950 at 3.07 GHz and on a Tesla C2050 card). Note also that execution continues after the evaluation, so that an evaluation of the speedup of our target function is provided periodically. It is interesting to see how the macrocommand is expanded by the preprocessor:
{
  SynchronizeHD ();
  SaveState ();
  InitSpecificTime (&t_speedup_cpu, "");
  for (t_speedup_count=0; t_speedup_count < 200; t_speedup_count++) {
    SubStep1_x_cpu (dt);
  }
  time_speedup_cpu = GiveSpecificTime (t_speedup_cpu);
  SynchronizeHD ();
  RestoreState ();
  InitSpecificTime (&t_speedup_gpu, "");
  for (t_speedup_count=0; t_speedup_count < 2000; t_speedup_count++) {
    SubStep1_x_gpu (dt);
  }
  time_speedup_gpu = GiveSpecificTime (t_speedup_gpu);
  printf ("GPU/CPU speedup in %s: %g\n", "SubStep1_x", time_speedup_cpu/time_speedup_gpu*10.0);
  printf ("CPU time : %g ms\n", 1e3*time_speedup_cpu/200.0);
  printf ("GPU time : %g ms\n", 1e3*time_speedup_gpu/2000.0);
};
(Indentation has been added for legibility.)
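We see that the function is first executed 200 times on the CPU, then 2000 times on the GPU. The time of a single call on the CPU and on the GPU is inferred from each total, and the speedup is the ratio of the two. The factor 10.0 in the first printf compensates for the GPU loop running ten times more iterations than the CPU loop.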
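The arithmetic of the three printf statements can be checked with a short sketch. The helper names per_call_ms and speedup, and the constants N_CPU and N_GPU, are hypothetical; only the iteration counts (200 and 2000) come from the expansion above.

```c
#include <assert.h>

#define N_CPU 200.0  /* CPU iterations, as in the expansion above */
#define N_GPU 2000.0 /* GPU iterations, as in the expansion above */

/* Time of one call in milliseconds, from a total time in seconds. */
double per_call_ms(double total_s, double n_calls) {
  assert(n_calls > 0.0);
  return 1e3 * total_s / n_calls;
}

/* Speedup = (CPU time per call) / (GPU time per call).
   N_GPU / N_CPU is the factor 10.0 of the first printf. */
double speedup(double total_cpu_s, double total_gpu_s) {
  assert(total_gpu_s > 0.0);
  return total_cpu_s / total_gpu_s * (N_GPU / N_CPU);
}
```

As a consistency check, the sample output quoted earlier (91.1 ms per CPU call, 4 ms per GPU call) corresponds to totals of 18.22 s and 8 s. These totals are back-computed from the per-call figures, not measured. With them, speedup(18.22, 8.0) gives approximately 22.775, in agreement with the printed speedup.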
Note that we have developed another useful macrocommand in the same spirit, called FARGO_DEBUG, which is meant to automatically compare the result of the CPU version of one routine with the result of its GPU counterpart. It is presented in section Using FARGO_DEBUG.