Benchmark results of BSPonMPI 0.2 and a comparison
Currently the two leading BSPlib implementations are PUB and the Oxford BSP Toolset. The purpose and functionality of BSPonMPI differ from those of PUB and the Oxford BSP Toolset, just as PUB and Oxford differ from each other. Therefore you should choose the BSPlib implementation that fits your needs best. To get a feeling for which implementation that would be, I ran some benchmarks.
The benchmark program is heavily inspired by the benchmark program written by Bisseling, which measures the worst-case g (throughput) and l (latency). Bisseling's benchmark program can be found in the BSPedupack package, which is available on his website. The source code of the benchmark programs is available here and here.
The benchmark program measures the latency and throughput using two different methods:
- Each superstep, send one message to each processor and let the message size vary (a sketch of this method follows the list)
- Each superstep, send a varying number of messages to each processor, balancing the communication
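The following is a minimal sketch of the first method, assuming only the standard BSPlib primitives (bsp_begin, bsp_push_reg, bsp_put, bsp_sync, bsp_time). The buffer size, the powers-of-two step in h, and the helper time_put_superstep are illustrative assumptions, not the actual code of the benchmark program.

    #include <stdio.h>
    #include <stdlib.h>
    #include <bsp.h>

    #define MAX_DOUBLES 4096   /* illustrative maximum message size */

    /* Time one superstep in which this processor sends h doubles to every
       other processor with bsp_put(). Returns the elapsed time in seconds. */
    static double time_put_superstep(int h, double *src, double *dst)
    {
        int p = bsp_nprocs(), s = bsp_pid(), q;
        double t0, t1;

        bsp_sync();                     /* make sure everybody starts together */
        t0 = bsp_time();
        for (q = 0; q < p; q++) {
            if (q == s)
                continue;
            bsp_put(q, src, dst, s * h * (int)sizeof(double),
                    h * (int)sizeof(double));
        }
        bsp_sync();                     /* communication completes here */
        t1 = bsp_time();
        return t1 - t0;
    }

    int main(int argc, char **argv)
    {
        bsp_begin(bsp_nprocs());
        {
            int p = bsp_nprocs(), h;
            double *src = calloc(MAX_DOUBLES, sizeof(double));
            double *dst = calloc((size_t)p * MAX_DOUBLES, sizeof(double));

            bsp_push_reg(dst, p * MAX_DOUBLES * (int)sizeof(double));
            bsp_sync();                 /* registration takes effect */

            /* Vary the message size; plotting or fitting the times against h
               yields estimates of the throughput g and the latency l. */
            for (h = 1; h <= MAX_DOUBLES; h *= 2) {
                double t = time_put_superstep(h, src, dst);
                if (bsp_pid() == 0)
                    printf("h = %4d doubles per destination: %g s\n", h, t);
            }

            bsp_pop_reg(dst);
            free(src);
            free(dst);
        }
        bsp_end();
        return 0;
    }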
During the tests, Oxford's implementation proved to be the most challenging competitor, but only when it uses bsp_put(). Other situations are not of interest, because there g differs by one or more orders of magnitude in favor of BSPonMPI; see the examples below.
Benchmark on 2 processors of a 4-way AMD Opteron system using the bsp_put() communication primitive. BSPonMPI 0.2 was compared against PUB version 8 (shmem release).
Benchmark on 2 processors of `Aster', an SGI Altix 3000, using the bsp_send() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of MPI.
Benchmark on 2 processors of `Teras', an SGI Origin 3800, using the bsp_get() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of the native SGI_SHMEM device.
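For readers less familiar with BSPlib: the three communication primitives named in these benchmarks move data in different ways. bsp_put() writes into a registered remote memory area, bsp_get() reads from one, and bsp_send() queues a tagged message that the receiver drains with bsp_move() after the next synchronisation. A minimal sketch of all three, assuming a simple nearest-neighbour pattern; the variable names and the pattern itself are illustrative only.

    #include <stdio.h>
    #include <bsp.h>

    int main(int argc, char **argv)
    {
        bsp_begin(bsp_nprocs());
        {
            int p = bsp_nprocs(), s = bsp_pid();
            int right = (s + 1) % p;               /* neighbour to talk to */
            double mine = (double)s;
            double put_box = -1.0, get_box = -1.0, recv = -1.0;
            int tag = s, tagsize = (int)sizeof(int);

            bsp_set_tagsize(&tagsize);             /* tags are a single int here */
            bsp_push_reg(&mine, (int)sizeof(double));
            bsp_push_reg(&put_box, (int)sizeof(double));
            bsp_sync();

            /* bsp_put(): write my value into the neighbour's put_box. */
            bsp_put(right, &mine, &put_box, 0, (int)sizeof(double));

            /* bsp_get(): read the neighbour's value into my get_box. */
            bsp_get(right, &mine, 0, &get_box, (int)sizeof(double));

            /* bsp_send(): queue a tagged message for the neighbour. */
            bsp_send(right, &tag, &mine, (int)sizeof(double));

            bsp_sync();                            /* all three complete here */

            bsp_move(&recv, (int)sizeof(double));  /* drain the queued message */
            printf("pid %d: put_box=%g get_box=%g recv=%g\n",
                   s, put_box, get_box, recv);

            bsp_pop_reg(&put_box);
            bsp_pop_reg(&mine);
        }
        bsp_end();
        return 0;
    }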
We now only consider Oxford BSP Toolset v1.4 using the bsp_put() primitive. The benchmarks were performed on the two Dutch national supercomputers at SARA: Teras and Aster. Teras is an SGI Origin 3800 on which Oxford BSP Toolset v1.4 using native shared memory calls is installed. Aster is an SGI Altix 3000 which has an MPI version of Oxford BSP Toolset v1.4. Below, the g and l measured on Aster and Teras are shown.
Teras
The latency on a varying number of processors of `Teras', an SGI Origin 3800, using the bsp_put() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of the native SGI_SHMEM device.
The throughput on a varying number of processors of `Teras', an SGI Origin 3800, using the bsp_put() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of the native SGI_SHMEM device. Note that a y-value in the graph corresponds to the number of seconds it takes for a single double (64-bit floating point value) to be delivered, i.e. higher is worse.
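To relate these plots back to the BSP cost model: the communication time of a superstep in which every processor sends and receives at most h doubles is modelled as h*g + l seconds, so the per-double time plotted above behaves like g + l/h and approaches g for large h, while l dominates for small h. A minimal sketch of these two quantities; the function names are illustrative and the values of g and l must come from an actual benchmark run.

    /* Predicted communication time of a superstep with h-relation h,
       given the measured throughput g (s/double) and latency l (s). */
    double superstep_time(double g, double l, int h)
    {
        return (double)h * g + l;
    }

    /* Per-double delivery time as plotted in the throughput graphs;
       it approaches g as h grows and is dominated by l for small h. */
    double per_double_time(double g, double l, int h)
    {
        return g + l / (double)h;
    }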
Aster
The latency on a varying number of processors of `Aster', an SGI Altix 3000, using the bsp_put() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of MPI.
The throughput on a varying number of processors of `Aster', an SGI Altix 3000, using the bsp_put() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of MPI. Note that a y-value in the graph corresponds to the number of seconds it takes for a single double (64-bit floating point value) to be delivered, i.e. higher is worse.
Measuring the throughput by varying the message size only succeeded on Teras; on Aster there was no clear linear correspondence between message size and communication duration.
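In Bisseling's approach, g and l are obtained from a least-squares fit of the measured superstep times against the linear model time(h) = g*h + l; without a clear linear correspondence such a fit is not meaningful. A minimal sketch of such a straight-line fit, assuming arrays of h-values and measured times; the function name and arguments are illustrative, not the actual benchmark code.

    #include <stddef.h>

    /* Ordinary least-squares fit of t = g*h + l through n measured points.
       h[i] is the number of doubles communicated in superstep i and t[i]
       the measured time in seconds; the estimates are written to *g and *l. */
    void fit_g_and_l(const double *h, const double *t, size_t n,
                     double *g, double *l)
    {
        double sh = 0.0, st = 0.0, shh = 0.0, sht = 0.0;
        size_t i;

        for (i = 0; i < n; i++) {
            sh  += h[i];
            st  += t[i];
            shh += h[i] * h[i];
            sht += h[i] * t[i];
        }
        /* Closed-form solution of the normal equations for a straight line. */
        *g = ((double)n * sht - sh * st) / ((double)n * shh - sh * sh);
        *l = (st - *g * sh) / (double)n;
    }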
The throughput on a varying number of processors of `Teras', an SGI Origin 3800, using the bsp_put() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of the native SGI_SHMEM device.
Conclusion
- The performance of a BSP program using BSPonMPI is more predictable: the l and g of BSPonMPI lie close together when comparing the various communication primitives.
- The performance of BSPonMPI is good. It is faster than PUB and most of the time faster than the Oxford BSP Toolset. However, some applications built on PUB may beat BSPonMPI on machines with many processors, because PUB has the ability to subdivide the machine into small groups of processors, which results in lower latencies.