BSPonMPI

0.1

BSPonMPI is an implementation of the BSPlib standard on top of MPI 1.1

Introduction

You should read this document if you want to makes changes to BSPonMPI or if you want to write your own BSPlib implementation and you need some inspiration. If you just want to use BSPonMPI, you should read the README file (included with the BSPonMPI package) first. In all cases I assume that the reader is familiar with the BSPlib standard

This document contains

If you want to change something in BSPonMPI, you should read the Architecture description and the detailed descriptions of the code which will be affected by your changes. If you just need inspiration, you only need to read the Architecture description

Architecture

Requirements

The mission is to make a BSPlib implementation which will always be faster than the Oxford BSP Toolset on top of MPI and runs on all MPI enabled computer platforms. Until now this performance mission has not been achieved. BSPonMPI is still a bit slower when executing bsp_put() but it is faster on executing bsp_get() and bsp_send().

General Idea

The BSPlib standard is designed to make programming using the Bulk Synchronous Parallel (BSP) model easier. The BSP model prescribes that you cut up your program in computation and communication supersteps. BSPlib helps you by providing a function bsp_sync() to execute an entire communication superstep. Communication is gathered during a computation superstep using three functions: bsp_put(), bsp_get() and bsp_send(). To put it very simple: A BSPlib implementation is a

Actually it is not that simple: A bsp_get() needs some action to be taken on the remote processor and expects some data to be returned. This very simple model can be repaired by seeing that there are not only data deliveries (bsp_put() and bsp_send()) but also data requests (bsp_get()). Therefore we may conclude that any BSPlib implementation consists of two buffers: one request buffer and one delivery buffer. Note that I ignore the two unbuffered communication routines bsp_hpput() and bsp_hpget(). These are currently implemented as their buffered counterparts.

Let me now restate what a BSPlib implementation is:

Design Choices

The idea is to use the bulk communication procedures of MPI. The two most BSP like MPI functions are: MPI_Alltoall() and MPI_Alltoallv(). Nowadays computer manufacturers implement efficient MPI 1.1 libraries. Any MPI enabled machine has these functions and it is very probable that they are optimal. You may ask yourself why I don't use the DRMA procedures of MPI-2 . They may be very fast on DRMA enabled machines. However MPI-2 is not yet widely available and will therefore limit the use of this library.

In order to effectively use these communication procedures we need that the surrounding code is very fast and portable. The choices I made are

  1. Use ANSI C 99 as programming language in combination with the GNU autotools.
  2. Use an object oriented programming style.
  3. Collect all communication and transmit them using the least possible calls to MPI_Alltoallv.
  4. Use arrays to implement communication buffers. They can be used in a number of contexts, e.g. as a queue and as parameter of the MPI_Alltoallv() function.
  5. Allocate memory for these buffers only once (at the start of the program) and double the allocated memory if the buffer proves to be too small. Using this strategy the number of calls to malloc() remains very small.
  6. Minimise the use of branch statements (if, switch). I discovered this too late and I expect that more performance can be gained by rethinking the main data structure
  7. Try to make the optimising as easy as possible for a C compiler, but only use ANSI C constructions. Examples: restrict, inline, static, memory alignment, rather dereference a pointer than use memcpy(), etc...

Note:
When writing this documentation I discovered a lot of illogical constructions. This is caused by using trial and error design of this library. Each time when I discovered a way to improve performance I had to weigh the implementation time cost against performance improvement. Therefore sometimes the overall architecture does not comply to some design choices, some of which I made during experimenting and tweaking of the library.

Implementation

Object Oriented Programming in ANSI C

Though I used ANSI C as programming language, I tried to program in an object oriented fashion. To translate classes and inheritance into C constructs, I used struct's, union's and enum's, e.g.: the C++ code
  class A { int x, y;};
  class B: A { int a, b;};
  class C: A { int c, d;};

  void foo()
  {
    A a; a.x = 0; 
    B b; b.a = 1; b.x = 0;
    C c; c.d = 2; c.y = 1;
  }  
translates into something like
  typedef enum { B_type, C_type } class_type;
  typedef struct { int a, b;} B_info;
  typedef struct { int c, d;} C_info;
  typedef union { B_info b; C_info c;} class_info;
  typedef struct { int x, y; class_info info; class_type type;} A;

  void foo()
  {
    A a; a.x = 0;
    A b; b.type = B_type; b.b.a=1; b.x = 0;
    A c; c.type = C_type; c.c.d=2; c.y = 1;
  }  
Member functions are translated into className_functionName( classref *, ...). For example
   class A 
   { 
     int x;
     int foo() const
     {
       return x;
     };
   };  
translates into
   typedef struct {int x;} A;
   int a_foo( const A * restrict a)
   {
     return a->x;
   }

Communication Buffer

Because BSPlib is essentially a communication buffer, the performance of the library heavily depends on its data structure. As primitive communication buffer I designed ExpandableTable. This is essentially a one-dimensional array which is subdivided in equally sized blocks. I take as many blocks as there are processors and assign each block to a processor. When you look at these blocks as columns, you will get a table where each column corresponds to a processor. Putting data in a column, marks that data as 'received from' or 'to be sent to' the corresponding processor. This way, the array can be passed as a parameter of the MPI_Alltoallv() function.
exptable.png

Data layout of the ExpandableTable data structure

Each column is again subdivided in slots. Accessing arbitrary bytes may be very expensive on some architectures or impossible. Therefore I try to use aligned addresses by subdividing each column in slots. Usually I take slot_size equal to the size of a struct which I use as basic array element. The size of the entire array is equal to nprocs x rows x slot_size, where rows is the number of slots in a column. When data is added to a column the variable used_slot_count is incremented with the number of slots occupied by the new data. Additionally the variable count is incremented with 1. An example is shown below
exptable-ex.png

Example data layout of an ExpandableTable data structure

When one tries to add data to a column but there is not enough space available, new space is allocated. The new space is three times the size of the already allocated space. Again this array is subdivided in equally sized blocks. The old data is copied to the new locations and the new data is added. Note that the size of an ExpandableTable can only increase. This way only a very limited number of calls to malloc() are necessary. ExpandableTable and its member functions are declared in bsp_exptable.c and bsp_exptable.h

This data structure serves as a building block for the two communication buffers: RequestTable and DeliveryTable. They both have slightly different needs. RequestTable only has to handle data requests from other processors. Each data request is of a fixed size. Therefore implementing RequestTable will be rather straightforward. For details I refer to bsp_reqtable.h and bsp_reqtable.c. On the other hand DeliveryTable has to handle data deliveries which may differ in size. The implementation of DeliveryTable is a bit different; I use this buffer not only for communication of bsp_put(), bsp_send() and the data delivery part of a bsp_get(), but also for actions which have to be carried out during a bsp_sync(), e.g.: bsp_set_tagsize(), etc... for details see bsp_delivtable.h and bsp_delivtable.c

BSPlib also provides a way to address remote memory locations and a queue of received messages. These two can also be modelled in a fixed size element array and a variable size element array, respectively. Therefore two abstract classes are introduced: FixedElSizeTable and VarElSizeTable. We get the following UML class diagram

inline_dotgraph_1
A small UML legenda is shown below.
legenda.png

Legenda of UML class diagram and UML Sequence Diagram

A Sequence Diagram

Until now the explanation may still be a bit abstract. Below the source code and its sequence diagram of a simple BSP program are shown. It shows how processors and objects collaborate over time.
 #include <bsp.h>
 #include <stdio.h>
 
 void spmd_part()
 {
   int a=1, b=2, c=3;
   bsp_begin(2);
   bsp_push_reg(&a, sizeof(int));
   bsp_sync();

   if (bsp_pid() == 0)
     {
       bsp_put(1, &b, &a, 0, sizeof(int));
       bsp_get(1, &a, 0, &c, sizeof(int));
       bsp_send(1, NULL, "some text", 10);
     }
   bsp_sync();

   bsp_pop_reg(&a);
   bsp_sync();

   bsp_end();
 }
 
 int main(int argc, char *argv)
 {
   bsp_init(&spmd_part, argc, argv);

   printf("First perform some sequential code\n");

   spmd_part();

   printf("Finally perform some sequential code\n");
   return 0;
 }
sequence.png

Sequence diagram of the example above

Room For Improvement

There are still things which are not being taken care of in an optimal way. Ideas to improve performance are:
  1. Merge the two MPI_Alltoall()'s of RequestTable and DeliveryTable which exchange buffer sizes. Or make it an exception to exchange buffer sizes: use MPI_Allgather() to determine whether buffer sizes need to be communicated. This function may be less expensive.
  2. subdivide every column in ExpandableTable, where each part only holds elements of one kind, e.g.: every column has a send, put, other (=settag, pushreg, popreg, etc...) block. This avoids branches in tight loops (in deliveryTable_execute() )
  3. Align every data element
  4. Implement the MessageQueue s.t. copying memory from DeliveryTable becomes superfluous.
  5. When expanding a buffer, do it only for a specific processor / buffer. This increases performance on non full h-relation
  6. Implement the memoryRegister_find such that it is O(1) in stead of O(n) (e.g. hash on least significant bits)

License

BSPonMPI. This is an implementation of the BSPlib standard on top of MPI. Copyright © 2006 Wijnand J. Suijlen

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA


Generated on Sat Apr 8 12:56:50 2006 for BSPonMPI by  doxygen 1.4.6