Benchmark results BSPonMPI 0.2 and a comparison

Currently the two leading BSPlib implementations are PUB and the Oxford BSP Toolset. BSPonMPI differs in purpose and functionality from both, just as PUB and Oxford differ from each other. You should therefore choose the BSPlib implementation that best fits your needs. To get a feeling for which implementation that would be, I ran some benchmarks.

The benchmark program is heavily inspired by the benchmark program written by Bisseling, which measures the worst-case g (throughput) and l (latency). Bisseling's benchmark program can be found in the BSPedupack package, which is available on his website. The source code of the benchmark programs is available here and here.

The benchmark program measures the latency and throughput using two different methods.

  1. Each superstep, send one message to each processor, and vary the message size.
  2. Each superstep, send a varying number of messages to each processor, and balance the communication.
Method 2 measures a worst-case scenario, as it relies heavily on the message-combining functionality of the BSPlib implementation. When determining the running time of a BSP program, one should take these g and l as parameters. The results of method 1 are of interest when you want to predict what happens when you send more data in each message.

During the tests, Oxford's implementation proved to be the most challenging competitor, but only when it uses bsp_put() exclusively. Other situations are not of interest, because there g differs by one or more orders of magnitude in favour of BSPonMPI. See below for examples.

Benchmark on 2 processors of a 4-way AMD Opteron system using the bsp_put() communication primitive. PUB version 8 was used, compiled as shmem-release.


Benchmark on 2 processors of the Dutch national supercomputer Aster. This computer is an SGI Altix 3000. Oxford BSP Toolset version 1.4 on top of MPI was used here.


Benchmark on 2 processors of the Dutch national supercomputer Teras. This computer is an SGI Origin 3800. Oxford BSP Toolset version 1.4 on top of the native SGI_SHMEM device was used here.

Note: this difference between the various communication primitives is undesirable. It makes predicting the running time of a BSP program complicated, because you have to deal with different values of g and l. In fact, according to the BSP paradigm, when calculating the running time of a BSP program you have to pick the slowest l and g.

We now only consider the Oxford BSP Toolset v1.4 using the bsp_put() primitive. The benchmarks were performed on the two Dutch national supercomputers at SARA: Teras and Aster. Teras is an SGI Origin 3800 on which Oxford BSP Toolset v1.4 using native shared memory calls is installed. Aster is an SGI Altix 3000 which has an MPI version of Oxford BSP Toolset v1.4. Below, the g and l are shown for Teras and Aster.

Teras

The latency depends on the number of processors


The throughput depends on the number of processors. Note that a y-value in the graph corresponds to the number of seconds it takes for a single double (64-bit floating-point value) to be delivered, i.e., higher is worse.

The latency increases much faster than the throughput decreases. Also note that the scales of the y-axes differ by a factor of 1000. One may conclude that latency becomes the most important factor as the number of processors grows. BSPonMPI is faster when the number of processors is 16 or more.

Aster

The latency depends on the number of processors


The throughput depends on the number of processors. Note that a y-value in the graph corresponds to the number of seconds it takes for a single double (64-bit floating-point value) to be delivered, i.e., higher is worse.

Now the latency increases much more slowly relative to the throughput. Although the time scales differ by a factor of 1000, the throughput becomes the dominant factor as the number of processors increases. From 16 processors onwards, BSPonMPI is quicker when performing more than 50 puts.

Measuring the throughput by varying the message size succeeded only on Teras. On Aster there was no clear linear correspondence between message size and communication duration.

The throughput depends on the message size

Conclusion

  • The performance of a BSP program using BSPonMPI is more predictable: the l and g of BSPonMPI lie close together when comparing the various communication primitives.
  • The performance of BSPonMPI is good. It is faster than PUB and most of the time faster than the Oxford BSP Toolset. However, some applications built on PUB may defeat BSPonMPI on machines with many processors, because PUB has the ability to subdivide the machine into smaller groups of processors, which results in lower latencies.

Last changed on June 29th, 2006