Paraskevas Evripidou, Bill Farquhar, Neophytos Neophytou

We propose the design and development of an implicit fault-tolerance and recovery scheme for the Message Passing Interface (MPI). The scheme consists of a detection mechanism that identifies process failures and a recovery mechanism that restores the computation after a failure. Two cases are considered. The first deals with failures in slave processes. During program execution, the master process distributes the data to the slave processes, and a copy is sent to an "Observer" that coordinates the whole procedure. If a process A waits for data from another process B for a certain time interval without receiving any, it assumes that B has failed and informs the Observer. The Observer spawns a new process C to finish B's task and notifies the remaining processes that C has been assigned B's responsibilities. In the second case, a process A waits for data from the master and the time interval elapses without any data arriving. A then informs the Observer that the master has failed to deliver. Since the Observer possesses a copy of all the data handled by the master, it takes over the master's tasks. The fault-tolerant algorithm is implicit: it mainly involves changes to the communication calls rather than to the application code itself. Consider, for example, the MPI_Send routine, int MPI_Send(...). A new routine, int MPI_Send_ft(..., int Observer), replaces MPI_Send and specifies that the same message is also sent to the Observer. In both MPI communication modes, blocking and non-blocking, the message-passing routines have to be modified to include the Observer and its functions.
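The Observer's bookkeeping can be illustrated with a minimal sketch in plain C. It is not the proposed implementation: the MPI calls are replaced by in-memory stand-ins so the logic can be followed without an MPI library, and every name in it (Observer, send_ft, timed_out, observer_report_failure, messages_to_replay, MASTER_RANK, OBSERVER_ID, the buffer sizes) is hypothetical. It assumes that logical rank 0 is the master and that a replacement process simply receives the next unused id.

```c
#include <assert.h>
#include <string.h>

#define MAX_PROCS   8
#define MAX_LOG     16
#define MSG_LEN     32
#define MASTER_RANK 0      /* assumption: logical rank 0 is the master */
#define OBSERVER_ID (-1)   /* assumption: id used for the Observer itself */

/* Copy of one message as kept by the Observer (hypothetical layout). */
typedef struct {
    int  dest;             /* logical rank the message was addressed to */
    char data[MSG_LEN];
} LoggedMsg;

typedef struct {
    int       physical[MAX_PROCS]; /* logical rank -> process now doing its job */
    int       next_spawn_id;       /* id handed to the next replacement process */
    LoggedMsg log[MAX_LOG];        /* copies of all messages sent so far */
    int       log_len;
} Observer;

void observer_init(Observer *o, int nprocs) {
    assert(nprocs > 0 && nprocs <= MAX_PROCS);
    for (int i = 0; i < MAX_PROCS; i++) o->physical[i] = i;
    o->next_spawn_id = nprocs;
    o->log_len = 0;
}

/* Stand-in for MPI_Send_ft: besides delivering the message to dest, the
 * sender hands a copy to the Observer. In an MPI program this would be a
 * second MPI_Send addressed to the Observer's rank; here only the
 * Observer's copy is modelled. Returns 0 on success, -1 if the log is full. */
int send_ft(Observer *o, int dest, const char *data) {
    if (o->log_len >= MAX_LOG) return -1;
    LoggedMsg *m = &o->log[o->log_len++];
    m->dest = dest;
    strncpy(m->data, data, MSG_LEN - 1);
    m->data[MSG_LEN - 1] = '\0';
    return 0;
}

/* Detection rule: a receiver that has waited at least `limit` seconds
 * without data assumes that the sender has failed. */
int timed_out(double waited, double limit) {
    return waited >= limit;
}

/* Recovery: called when some process reports that `failed_rank` is silent.
 * A failed slave is replaced by a newly spawned process (simulated by a
 * fresh id; a real version might use MPI_Comm_spawn). A failed master is
 * replaced by the Observer itself. Returns the id now serving that rank. */
int observer_report_failure(Observer *o, int failed_rank) {
    assert(failed_rank >= 0 && failed_rank < MAX_PROCS);
    if (failed_rank == MASTER_RANK)
        o->physical[failed_rank] = OBSERVER_ID;
    else
        o->physical[failed_rank] = o->next_spawn_id++;
    return o->physical[failed_rank];
}

/* Number of logged messages that must be resent to the replacement. */
int messages_to_replay(const Observer *o, int rank) {
    int n = 0;
    for (int i = 0; i < o->log_len; i++)
        if (o->log[i].dest == rank) n++;
    return n;
}
```

In this sketch, a receiver that finds timed_out(...) true for a peer would call observer_report_failure with that peer's logical rank. The remaining processes keep addressing the same logical rank and look up its current owner in the physical table, which corresponds to the Observer notifying them that C has taken over B's responsibilities.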