Benchmark results of BSPonMPI 0.2 and a comparison
Currently the two leading BSPlib implementations are PUB and the Oxford BSP Toolset. The purpose and functionality of BSPonMPI differ from those of PUB and the Oxford BSP Toolset, just as PUB and Oxford differ from each other. Therefore you should choose the BSPlib implementation that fits your needs best. To get a feeling for which implementation that would be, I ran some benchmarks.
The benchmark program is heavily inspired by the benchmark program written by Bisseling, which measures the worst-case g (throughput) and l (latency). Bisseling's benchmark program can be found in the BSPedupack package, which is available on his website. The source code of the benchmark programs is available here and here.
The benchmark program measures the latency and throughput using two different methods:
- Each superstep, send one message to each processor and let the message size vary (a sketch of this method follows the list)
- Each superstep, send a varying number of messages to each processor, balancing the communication
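The following is a minimal sketch of the first method, assuming only the standard BSPlib primitives (bsp_begin, bsp_push_reg, bsp_put, bsp_sync, bsp_time). The buffer size, the powers-of-two step in h, and the helper time_put_superstep are illustrative assumptions, not the actual code of the benchmark program.

    #include <stdio.h>
    #include <stdlib.h>
    #include <bsp.h>

    #define MAX_DOUBLES 4096   /* illustrative maximum message size */

    /* Time one superstep in which this processor sends h doubles to every
       other processor with bsp_put(). Returns the elapsed time in seconds. */
    static double time_put_superstep(int h, double *src, double *dst)
    {
        int p = bsp_nprocs(), s = bsp_pid(), q;
        double t0, t1;

        bsp_sync();                     /* make sure everybody starts together */
        t0 = bsp_time();
        for (q = 0; q < p; q++) {
            if (q == s)
                continue;
            bsp_put(q, src, dst, s * h * (int)sizeof(double),
                    h * (int)sizeof(double));
        }
        bsp_sync();                     /* communication completes here */
        t1 = bsp_time();
        return t1 - t0;
    }

    int main(int argc, char **argv)
    {
        bsp_begin(bsp_nprocs());
        {
            int p = bsp_nprocs(), h;
            double *src = calloc(MAX_DOUBLES, sizeof(double));
            double *dst = calloc((size_t)p * MAX_DOUBLES, sizeof(double));

            bsp_push_reg(dst, p * MAX_DOUBLES * (int)sizeof(double));
            bsp_sync();                 /* registration takes effect */

            /* Vary the message size; plotting or fitting the times against h
               yields estimates of the throughput g and the latency l. */
            for (h = 1; h <= MAX_DOUBLES; h *= 2) {
                double t = time_put_superstep(h, src, dst);
                if (bsp_pid() == 0)
                    printf("h = %4d doubles per destination: %g s\n", h, t);
            }

            bsp_pop_reg(dst);
            free(src);
            free(dst);
        }
        bsp_end();
        return 0;
    }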
During the tests, Oxford's implementation proved to be the most challenging competitor, but only when it uses bsp_put(). Other situations are not of interest, because there g differs by one or more orders of magnitude in favor of BSPonMPI; see the examples below.
Benchmark on 2 processors of a 4-way AMD Opteron system using the bsp_put() communication primitive. BSPonMPI 0.2 was compared against PUB version 8 (shmem release).
Benchmark on 2 processors of `Aster', an SGI Altix 3000, using the bsp_send() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of MPI.
Benchmark on 2 processors of `Teras', an SGI Origin 3800, using the bsp_get() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of the native SGI_SHMEM device.
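For readers less familiar with BSPlib: the three communication primitives named in these benchmarks move data in different ways. bsp_put() writes into a registered remote memory area, bsp_get() reads from one, and bsp_send() queues a tagged message that the receiver drains with bsp_move() after the next synchronisation. A minimal sketch of all three, assuming a simple nearest-neighbour pattern; the variable names and the pattern itself are illustrative only.

    #include <stdio.h>
    #include <bsp.h>

    int main(int argc, char **argv)
    {
        bsp_begin(bsp_nprocs());
        {
            int p = bsp_nprocs(), s = bsp_pid();
            int right = (s + 1) % p;               /* neighbour to talk to */
            double mine = (double)s;
            double put_box = -1.0, get_box = -1.0, recv = -1.0;
            int tag = s, tagsize = (int)sizeof(int);

            bsp_set_tagsize(&tagsize);             /* tags are a single int here */
            bsp_push_reg(&mine, (int)sizeof(double));
            bsp_push_reg(&put_box, (int)sizeof(double));
            bsp_sync();

            /* bsp_put(): write my value into the neighbour's put_box. */
            bsp_put(right, &mine, &put_box, 0, (int)sizeof(double));

            /* bsp_get(): read the neighbour's value into my get_box. */
            bsp_get(right, &mine, 0, &get_box, (int)sizeof(double));

            /* bsp_send(): queue a tagged message for the neighbour. */
            bsp_send(right, &tag, &mine, (int)sizeof(double));

            bsp_sync();                            /* all three complete here */

            bsp_move(&recv, (int)sizeof(double));  /* drain the queued message */
            printf("pid %d: put_box=%g get_box=%g recv=%g\n",
                   s, put_box, get_box, recv);

            bsp_pop_reg(&put_box);
            bsp_pop_reg(&mine);
        }
        bsp_end();
        return 0;
    }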
We now only consider Oxford BSP Toolset v1.4 using the bsp_put() primitive. The benchmarks were performed on the two Dutch national supercomputers at SARA: Teras and Aster. Teras is an SGI Origin 3800 on which Oxford BSP Toolset v1.4 using native shared memory calls is installed. Aster is an SGI Altix 3000 which has an MPI version of Oxford BSP Toolset v1.4. Below, the g and l measured on Aster and Teras are shown.
Teras
The latency on a varying number of processors of `Teras', an SGI Origin 3800, using the bsp_put() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of the native SGI_SHMEM device.
The throughput on a varying number of processors of `Teras', an SGI Origin 3800, using the bsp_put() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of the native SGI_SHMEM device. Note that a y-value in the graph corresponds to the number of seconds it takes for a single double (64-bit floating point value) to be delivered, i.e. higher is worse.
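To relate these plots back to the BSP cost model: the communication time of a superstep in which every processor sends and receives at most h doubles is modelled as h*g + l seconds, so the per-double time plotted above behaves like g + l/h and approaches g for large h, while l dominates for small h. A minimal sketch of these two quantities; the function names are illustrative and the values of g and l must come from an actual benchmark run.

    /* Predicted communication time of a superstep with h-relation h,
       given the measured throughput g (s/double) and latency l (s). */
    double superstep_time(double g, double l, int h)
    {
        return (double)h * g + l;
    }

    /* Per-double delivery time as plotted in the throughput graphs;
       it approaches g as h grows and is dominated by l for small h. */
    double per_double_time(double g, double l, int h)
    {
        return g + l / (double)h;
    }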
Aster
The latency on a varying number of processors of `Aster', an SGI Altix 3000, using the bsp_put() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of MPI.
The throughput on a varying number of processors of `Aster', an SGI Altix 3000, using the bsp_put() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of MPI. Note that a y-value in the graph corresponds to the number of seconds it takes for a single double (64-bit floating point value) to be delivered, i.e. higher is worse.
Measuring the throughput by varying the message size only succeeded on Teras; on Aster there was no clear linear correspondence between message size and communication duration.
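In Bisseling's approach, g and l are obtained from a least-squares fit of the measured superstep times against the linear model time(h) = g*h + l; without a clear linear correspondence such a fit is not meaningful. A minimal sketch of such a straight-line fit, assuming arrays of h-values and measured times; the function name and arguments are illustrative, not the actual benchmark code.

    #include <stddef.h>

    /* Ordinary least-squares fit of t = g*h + l through n measured points.
       h[i] is the number of doubles communicated in superstep i and t[i]
       the measured time in seconds; the estimates are written to *g and *l. */
    void fit_g_and_l(const double *h, const double *t, size_t n,
                     double *g, double *l)
    {
        double sh = 0.0, st = 0.0, shh = 0.0, sht = 0.0;
        size_t i;

        for (i = 0; i < n; i++) {
            sh  += h[i];
            st  += t[i];
            shh += h[i] * h[i];
            sht += h[i] * t[i];
        }
        /* Closed-form solution of the normal equations for a straight line. */
        *g = ((double)n * sht - sh * st) / ((double)n * shh - sh * sh);
        *l = (st - *g * sh) / (double)n;
    }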
The throughput on a varying number of processors of `Teras', an SGI Origin 3800, using the bsp_put() communication primitive. BSPonMPI 0.2 was compared against Oxford BSP Toolset version 1.4 on top of the native SGI_SHMEM device.
Conclusion
- The performance of a BSP program using BSPonMPI is more predictable: the l and g of BSPonMPI lie close together when comparing the various communication primitives.
- The performance of BSPonMPI is good. It is faster than PUB and most of the time faster than the Oxford BSP Toolset. However, some applications built on PUB may beat BSPonMPI on machines with many processors, because PUB has the ability to subdivide the machine into small groups of processors, which results in lower latencies.