Developer Guide

Introduction

MPI is a standard in HPC community which allows a simple use of clusters. Nowadays, there are several implementation (OpenMPI, BullxMPI, MPT, IntelMPI, MPC, …) each of which involves a specific ABI (Application Binary Interface) for an application compiled with a specific MPI implementation. With wi4mpi, an application compiled with an alpha MPI implementation can be run under a beta MPI implementation without any compilation protocol and any concern about the ABI (Preload version). WI4MPI can also be seen as a dedicated MPI implementation; This time, the application is compiled against the wi4mpi library (libmpi.so) with the dedicated wrapper (mpicc,mpif90…) meant for that purpose, and can be run under any other MPI implementation (Interface Version).

Library

How it works

Before performing any translation we need to distinguish the application side from the runtime side. To do that, any MPI object from the application side are prefixed by A_ and those from the runtime side are prefixed by R_. To perform a translation, all original MPI calls from the application are intercepted by WI4MPI and replaced by the same call prefixed by A_. For example, with an OpenMPI —> IntelMPI conversion.

digraph G { // Nodes node [shape=box]; Application [label="Application: MPI_Init (OpenMPI)"]; Runtime [label="Runtime: MPI_Init (IntelMPI)"]; node [shape=ellipse, style="filled", fillcolor=orange]; Translation; node [shape=box, style="rounded,filled", fillcolor=lightblue]; WI4MPI [label="WI4MPI: A_MPI_Init"]; // Links Application -> WI4MPI [label="Phase 1"]; Translation -> WI4MPI [label="Phase 2", dir=both]; WI4MPI -> Runtime [label="Phase 3"]; }

Fig. 1 How it works

Implementation

Library settings

The library is set during its loading time, when the program start. All the runtime MPI routines are saved to function pointers using a dlsym call to be called later on, all the tables are created and set with MPI constant objects, and the spinlocks are initialized. To do so, we use the following syntax:

void **attribute** ((constructor)) wrapper_init {
  void(*) lib_handle = dlopen(getenv("WI4MPI_RUN_MPI_C_LIB"), RTLD_NOW);
  LOCAL_MPI_Function = dlsym(lib_handle, "PMPI_Function")
  ....
}

The library contains three constructors:

  • wrapper_init in test_generation_wrapper.c (API C) (preload and interface)

  • wrapper_init_f in wrapper.c (API Fortran) (preload and interface)

  • wrapper_init_c2ff2c in c2f_f2c.c (API c2f/f2c)

Symbol overload

The mpi calls are intercepted thanks to the following rerouting:

  • #define A_MPI_Send PMPI_Send

  • #pragma weak MPI_Send=PMPI_Send

(See interface_test.c in src/interface/gen/interface_test.c and src/interface/gen/interface_fort.c) This syntax is also present but hidden in test_generation_wrapper.c (src/{preload,interface}/gen/) within an asm code chooser for the next reason.

The MPI-IO implementation (ROMIO) present within most MPI implementations triggers some calls to the MPI user interface that WI4MPI intercepted using the symbols overload protocol. This implies that during the runtime (phase 3 and above), some MPI calls are made triggering WI4MPI to re-intercept them and crash the application. The crash is the result of WI4MPI trying to convert arguments from the runtime version to the runtime version. An example is illustrated below.

digraph G { // Nodes node [shape=box]; Application [label="Application: MPI_File_open (OpenMPI)"]; Runtime [label="Runtime: MPI_File_open (IntelMPI)"]; node [shape=box, style=filled, fillcolor=red]; Crash; node [shape=ellipse, style="filled", fillcolor=orange]; Translation; Translation_phase_3 [label="Translation"]; node [shape=box, style="rounded,filled", fillcolor=lightblue]; WI4MPI [label="WI4MPI: A_MPI_File_open"]; WI4MPI_phase_3 [label="WI4MPI: A_MPI_Allreduce but with runtime arguments\n instead of application arguments (R_ instead of A_)"]; // Links Application -> WI4MPI [label="Phase 1"]; Translation -> WI4MPI [label="Phase 2", dir=both]; WI4MPI -> Runtime [label="Phase 3"]; Runtime -> WI4MPI_phase_3; WI4MPI_phase_3 -> Translation_phase_3; Translation_phase_3 -> Crash; }

Fig. 2 Example: symbol overload

To overcome this issue, we used an assembly code router.

Code chooser assembly

The ASM code chooser does the simple following things:

If we already are in the wrapper:

  • The arguments are passed without any translation protocol to the underlying MPI runtime call (LOCAL_MPI_function)

Otherwise:

  • The arguments are translated and passed to the underlying MPI runtime call (LOCAL_MPI_function)

To know which state the process is, we check the value of the in_w variable:

  • in_w=1 : in the wrapper

  • in_w=0 : in the application

Since the implementation of MPI objects is developer dependent, some of them may vary in size. To make sure that there is no side effect, the code chooser analyzes the stack itself.

ASM Code chooser implementation (generated for each function):

.global PMPI_Function                   # Define global PMPI_Function symbol
.weak MPI_Function                      # Define a weak MPI_Function symbol
.set MPI_function,PMPI_Function         # Set contents of MPI_function to PMPI_Function
.extern in_w
.extern A_MPI_Function
.extern R_MPI_Function
.type PMPI_Function,@function           # Set PMPI_Function type to function
.text
PMPI_Function:
push %rbp
mov %rsp, %rbp
; ------------- Put arguments on stack for safekeeping
sub $0x20, %rsp
mov %rdi, -0x8(%rbp)
mov %rsi, -0x10(%rbp)
mov %rdx, -0x18(%rbp)
mov %rcx, -0x20(%rbp)
; ------------- Access thread-local variable in_w
.byte 0x66
leaq in_w@tlsgd(%rip), %rdi             # Load address of in_w into %rdi
.value 0x6666
rex64
call __tls_get_addr@PLT                 # Get contents of address in %rdi into %rax
; ------------- Put arguments back where we found them
mov -0x8(%rbp), %rdi
mov -0x10(%rbp), %rsi
mov -0x18(%rbp), %rdx
mov -0x20(%rbp), %rcx
leave                                   # Set %rsp to %rbp, then pop top of stack into %rbp
; ------------ Jump to the target function
cmpl $0x0, 0x0(%rax)
jne inwrap_MPI_Function
jmp (*)A_MPI_Function@GOTPCREL(%rip)    # If not in wrapper call application method
inwrap_MPI_Function:
jmp (*)R_MPI_Function@GOTPCREL(%rip)    # If in wrapper call run method
; ------------ Calculate symbol size
.size PMPI_Function,.-PMPI_Function     # Declares symbol size to be the size of the above
digraph G { // Nodes node [shape=box]; Application [label="Application: MPI_File_open (OpenMPI)"]; Runtime [label=" MPI_File_open (IntelMPI)"]; No_translation [label="R_MPI_Allreduce: No Translation"]; node [shape=ellipse, style="filled", fillcolor=orange]; Translation [label="A_MPI_File_open: Translation"]; node [shape=box, style="rounded,filled", fillcolor=lightblue]; WI4MPI [label="WI4MPI: PMPI_File_open\n Testing in_w: in_w=0"]; WI4MPI_phase_3 [label="WI4MPI: PMPI_Allreduce\n Testing in_w: in_w=1"]; // Links Application -> WI4MPI [label="Phase 1"]; Translation -> WI4MPI [label="Phase 2", dir=both]; WI4MPI -> Runtime [label="Phase 3"]; Runtime -> WI4MPI_phase_3; WI4MPI_phase_3 -> No_translation; }

Fig. 3 ASM Code chooser

A_MPI_Function

All translations are executed thanks to some mappers defined within mappers.h using an underlying hash table mechanism named uthash (https://troydhanson.github.io/uthash/) The mappers (see example below) always have the same syntax :

mapper_name_a2r(&buf, &buf_tmp);
mapper_name_r2a(&buf, &buf_tmp);

In case of an a2r translation, buf_tmp represent the translation of buf and vice versa for an r2a translation.

Example:

A_MPI_Send(void *buf, int count, A_MPI_Datatype datatype, int dest, int tag,
           A_MPI_Comm comm) {
  void *buf_tmp;
  const_buffer_conv_a2r(&buf, &buf_tmp); // mapper
  R_MPI_Datatype datatype_tmp;
  datatype_conv_a2r(&datatype, &datatype_tmp); // mapper
  int dest_tmp;
  dest_conv_a2r(&dest, &dest_tmp); // mapper
  int tag_tmp;
  tag_conv_a2r(&tag, &tag_tmp); // mapper
  R_MPI_Comm comm_tmp;
  comm_conv_a2r(&comm, &comm_tmp); // mapper
  int ret_tmp = LOCAL_MPI_Send(buf_tmp, count, datatype_tmp, dest_tmp, tag_tmp,
                               comm_tmp); // Runtime MPI_Send call
  return error_code_conv_r2a(ret_tmp);
}

R_MPI_Function

In R_MPI_Function, the arguments are directly passed to the MPI runtime call

int R_MPI_Send(void *buf, int count, R_MPI_Datatype datatype, int dest, int tag,
               R_MPI_Comm comm) {

  int ret_tmp = LOCAL_MPI_Send(buf, count, datatype, dest, tag, comm);

  return ret_tmp;
}

Hash table

The underlying hash table mechanism presented earlier is contained in engine.*, engine_fn.* and utash.h. For each MPI objects, two tables are created. One for the constants, and one for the MPI_Type created by the application.

The different types being:

  • MPI_Comm

  • MPI_Datatype

  • MPI_Errhandler

  • MPI_Group

  • MPI_Op

  • MPI_Request (Split en 2 tables, in order to dissociate blocking requests from asynchronous requests)

  • MPI_File

The table within engine_fn.* contains the following translation:

  • MPI_Handler_function

  • MPI_Comm_copy_attr_function

  • MPI_Comm_delete_function

  • MPI_Type_delete_function

  • MPI_Comm_errhandler_function

  • MPI_File_errhandler_function

Thread safety

To make WI4MPI usable in a multithread environment, the in_w (see above) variable is TLS protected.

  • __thread int in_w=0; (test_wrapper_generation.c:118)

  • extern __thread int in_w; (wrapper.c:7)

  • extern __thread int in_w; (c2f_f2c.c:6 || c2f_f2c.c:1149)

The table are spinlock protected. (cf :thread_safety.h):

  • #define lock_dest(a) pthread_spin_destroy(a)

  • #define lock_init(a) pthread_spin_init(a, PTHREAD_PROCESS_PRIVATE)

  • #define lock(a) pthread_spin_lock(a)

  • #define unlock(a) pthread_spin_unlock(a)

  • typedef pthread_spinlock_t (*)table_lock_t

Interface

The interface version of WI4MPI propose the promise as the preload version (one compilation, several run over different MPI implementation), but this time WI4MPI had to be seen as a fully MPI Library. All the previously section are still relevant for the interface, the only things that changed is the new level name INTERFACE (see the schema below). This level has to be considered as a “libmpi.so” which is linked to the user application.

digraph G { rankdir=LR; // Nodes node [shape=box]; lib_ompi [label="Lib_OMPI"]; lib_impi [label="Lib_IMPI"]; openmpi [label="OpenMPI"]; intelmpi [label="IntelMPI"]; node [shape=box, style="rounded,filled", fillcolor=lightblue]; Interface [label="INTERFACE\n libmpi.so"]; // Links Interface -> lib_ompi [label="dlopen"]; lib_ompi -> openmpi [label="dlopen"]; Interface -> lib_impi [label="dlopen"]; lib_impi -> intelmpi [label="dlopen"]; }

Fig. 4 Interface

The files interface_test.c and interface_fort.c, deal with the overload symbol mechanism see earlier for respectively the C and Fortran API, then according the conversion a dlopen is made to the appropriate library (WI4MPI_WRAPPER_LIB) responsible for the translation (ASM code chooser + A_MPI_Function + R_MPI_Function).

MPI_Init example

int MPI_Init(int *argc, char ***argv);
#define MPI_Init PMPI_Init
#pragma weak MPI_Init = PMPI_Init
int (*INTERFACE_LOCAL_MPI_Init)(int *, char ***);

int PMPI_Init(int *argc, char ***argv) {
  int ret_tmp = INTERFACE_LOCAL_MPI_Init(argc, argv);
  return ret_tmp;
}
__attribute__((constructor)) void wrapper_interface(void) {
  void *interface_handle =
      dlopen(getenv("WI4MPI_WRAPPER_LIB"), RTLD_NOW | RTLD_GLOBAL);
  if (!interface_handle) {
    printf("no true IC lib defined\nerror :%s\n", dlerror());
    exit(1);
  }
  INTERFACE_LOCAL_MPI_Init = dlsym(interface_handle, "CCMPI_MPI_Init");
}

Static mode

The static mode builds an executable with every targets translation. To avoid conflicts, symbols are renamed as follow: INTERF2_{TARGET}_{Symbol_name}. No more dlopen is needed (cf. Interface), functions pointer are chosen by 2 variables: WI4MPI_STATIC_TARGET_TYPE_F and WI4MPI_STATIC_TARGET_TYPE. Static sections are controlled by directives: #if(n)def WI4MPI_STATIC / #endif

Common files for both version of WI4MPI:

  • func_char_fort.*:

    Contain all Fortran MPI functions that deal with some character arguments. Since in Fortran a character argument always reference is len (character(len=*) :: dark_side) and since the len argument is not the same size according to the compiler (Intel/GNU < 8 or GNU >= 8) used, WI4MPI had to implement both.

    Example:

#ifdef IFORT_CALL
       void  A_f_MPI_Get_processor_name(char * name,int * resultlen,int * ret,int namelen) // The character length is of type int
#elif GFORT_CALL
       void  A_f_MPI_Get_processor_name(char * name,int * resultlen,int * ret,size_t namelen) // The character length is of type size_t
#endif
  • manual_wrapper.h: Contain some manual mappers for Fortran translation

  • mappers.h: Contain the a2r/r2a mappers for C translation

  • engine.*, engine_fn.*, uthash.h: Contain all table routines

  • thread_safety.h: Contain the spinlock protection

Preload files:

  • bin/{wi4mpi,mpirun}: see User_Guide

  • etc/wi4mpi.cfg: see User_Guide

  • gen:

    • c2f_f2c.c:

    • lib_empty.c: Empty file to create empty Libraries made to replace the one from MPI use for the compilation

    • test_generation_wrapper.c: contain all C MPI function within WI4MPI which deal with the translation

    • wrapper.c: contain all the Fortran MPI function within WI4MPI which deal with the translation

  • header:

    • INTEL_INTEL: app_mpi.h app_mpio.h run_mpi.h run_mpio.h wrapper_f.h

    • INTEL_OMPI: app_mpi.h app_mpio.h run_mpi.h wrapper_f.h

    • OMPI_INTEL: app_mpi.h run_mpi.h run_mpio.h wrapper_f.h

    • OMPI_OMPI: app_mpi.h run_mpi.h wrapper_f.h

Interface files:

  • gen:

    • c2f_f2c.c:

    • test_generation_wrapper.c: Same as the preload version

    • wrapper.c: Same as the preload version

    • interface_fort.c: Contain the overload symbol mechanism for Fortran MPI Function

    • interface_test.c: Contain the overload symbol mechanism for C MPI Function and rerouting to CodeChooser

  • header:

    • OMPI_INTEL: app_mpi.h run_mpi.h run_mpio.h wrapper_f.h

    • OMPI_OMPI: app_mpi.h run_mpi.h wrapper_f.h

  • interface_utils:

    • bin: Contain all mpi wrapper for compilation

    • include: Contain all include exposed to users

  • manual:

    • dlsym_global.c : Get runtime MPI constants

  • module: Contain all elements to create a descent module

Get involved in WI4MPI

Generator Guide is prerequisites to this part

Expand MPI cover of WI4MPI

On the generator side

  • Add the function name to the func_list_....txt files

  • Add the function description in the dictionary functions.json

  • Add the new mappers (if needed) to convert the arguments in the dictionary mappers.json

  • Get involved in the generator code if some special case have to be handled

  • Generate the new Fortran header for both interface and preload version

On the library side

  • Code the new mappers in mappers.h, engine*

  • Update app_mpi.h app_mpio.h run_mpi.h run_mpio.h for all conversion of both version

  • Update headers within src/interface/interface_utils/include

  • Make sure to respect the MPI norm

Expand WI4MPI conversion capability

  • In mappers.h, you have to make sure that the status mapper translate the MPI_Status.count in the right way since its implementation is developer dependent.

  • Generate the associated app_mpi.h and run_mpi.h to new conversion