Developer Guide
Introduction
MPI is a standard in the HPC community that simplifies the use of clusters. There are now several implementations (OpenMPI, BullxMPI, MPT, IntelMPI, MPC, …), each with its own ABI (Application Binary Interface), so an application compiled with one MPI implementation is tied to the ABI of that implementation. With WI4MPI, an application compiled with one MPI implementation (alpha) can be run under another MPI implementation (beta) without recompilation and without any concern about the ABI (Preload version). WI4MPI can also be seen as a dedicated MPI implementation: in this case, the application is compiled against the WI4MPI library (libmpi.so) with the dedicated wrappers (mpicc, mpif90, …) and can then be run under any other MPI implementation (Interface version).
Library
How it works
Before performing any translation we need to distinguish the application
side from the runtime side. To do that, every MPI object from the
application side is prefixed by A_ and every one from the runtime side
is prefixed by R_. To perform a translation, all original MPI calls from
the application are intercepted by WI4MPI and replaced by the same call
prefixed by A_. Fig. 1 shows an example with an OpenMPI -> IntelMPI
conversion.
Fig. 1 How it works
Implementation
Library settings
The library is set up at load time, when the program starts. All the
runtime MPI routines are saved to function pointers using dlsym calls so
that they can be called later on, all the tables are created and filled
with the MPI constant objects, and the spinlocks are initialized. To do
so, we use the following syntax:
void __attribute__((constructor)) wrapper_init(void) {
    void *lib_handle = dlopen(getenv("WI4MPI_RUN_MPI_C_LIB"), RTLD_NOW);
    LOCAL_MPI_Function = dlsym(lib_handle, "PMPI_Function");
    ....
}
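As a minimal, self-contained sketch of this load-time pattern (not WI4MPI code): a constructor runs before main and stores a routine in a function pointer that is called later. The names demo_init, demo_len_impl, LOCAL_demo_len and demo_call are hypothetical, and the pointer is assigned directly here, where the real library obtains it from dlsym.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a runtime routine that would be found with dlsym. */
static size_t demo_len_impl(const char *s) { return strlen(s); }

/* Function pointer filled in at load time and used by later calls. */
static size_t (*LOCAL_demo_len)(const char *) = NULL;

/* Set to 1 once the constructor has run. */
static int demo_initialized = 0;

/* GCC/Clang extension: executed when the library or program is loaded,
   before main starts. */
__attribute__((constructor)) static void demo_init(void) {
    LOCAL_demo_len = demo_len_impl; /* assumption: real code calls dlsym here */
    demo_initialized = 1;
}

/* Later call through the saved pointer; returns 0 if setup did not happen. */
size_t demo_call(const char *s) {
    return LOCAL_demo_len ? LOCAL_demo_len(s) : 0;
}
```

Because the constructor runs before main, callers of demo_call never need to trigger the setup themselves.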
The library contains three constructors:
wrapper_init in test_generation_wrapper.c (API C) (preload and interface)
wrapper_init_f in wrapper.c (API Fortran) (preload and interface)
wrapper_init_c2ff2c in c2f_f2c.c (API c2f/f2c)
Symbol overload
The MPI calls are intercepted thanks to the following rerouting:
#define A_MPI_Send PMPI_Send
#pragma weak MPI_Send=PMPI_Send
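The weak-alias part of this rerouting can be tried in isolation with the sketch below. It assumes GCC or Clang on an ELF platform, and the names demo_send and pdemo_send are hypothetical: they play the roles of MPI_Send and PMPI_Send.

```c
/* Hypothetical names: demo_send plays the role of MPI_Send,
   pdemo_send the role of PMPI_Send. */
int demo_send(int count);
int pdemo_send(int count);

/* Make demo_send a weak alias of pdemo_send. */
#pragma weak demo_send = pdemo_send

/* The strong definition: a call to demo_send lands here unless another
   object file provides its own strong demo_send. */
int pdemo_send(int count) {
    return count * 2; /* arbitrary behaviour, only there to observe the call */
}
```

Both names resolve to the same code, so a library that defines the P-prefixed symbol also answers calls made through the unprefixed one.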
(See src/interface/gen/interface_test.c and
src/interface/gen/interface_fort.c.) This syntax is also present, but
hidden, in test_generation_wrapper.c (src/{preload,interface}/gen/)
inside an ASM code chooser, for the following reason.
The MPI-IO implementation (ROMIO) shipped with most MPI implementations itself calls the user-facing MPI interface, which WI4MPI has intercepted through the symbol overload mechanism. As a result, MPI calls made from inside the runtime (phase 3 and above) are intercepted by WI4MPI a second time, which crashes the application: WI4MPI tries to translate arguments that are already in their runtime form as if they came from the application. An example is illustrated below.
Fig. 2 Example: symbol overload
To overcome this issue, we used an assembly code router.
Code chooser assembly
The ASM code chooser does the following:
If we are already in the wrapper:
    The arguments are passed without any translation to the underlying MPI runtime call (LOCAL_MPI_function)
Otherwise:
    The arguments are translated and passed to the underlying MPI runtime call (LOCAL_MPI_function)
To know which state the process is in, we check the value of the in_w variable:
in_w=1: in the wrapper
in_w=0: in the application
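The decision above can be written in C as a rough sketch. This is not the generated code: runtime_call, handle_a2r, A_demo, R_demo and demo_entry are hypothetical names, handles are plain integers, and setting in_w inside A_demo is an assumption about where the flag is toggled.

```c
/* Per-thread flag: 1 while executing inside the wrapper, 0 in the application. */
__thread int in_w = 0;

/* Hypothetical runtime routine: only understands runtime-side handles. */
static int runtime_call(int r_handle) { return r_handle; }

/* Hypothetical translation of a handle from application side to runtime side. */
static int handle_a2r(int a_handle) { return a_handle + 1000; }

/* Path taken when already in the wrapper: no translation. */
static int R_demo(int handle) { return runtime_call(handle); }

/* Path taken from the application: translate, then call the runtime. */
static int A_demo(int handle) {
    in_w = 1;                                   /* entering the wrapper */
    int ret = runtime_call(handle_a2r(handle));
    in_w = 0;                                   /* back in the application */
    return ret;
}

/* C equivalent of the chooser: pick the path from the value of in_w. */
int demo_entry(int handle) {
    return in_w ? R_demo(handle) : A_demo(handle);
}
```

A call that arrives while in_w is 1 skips handle_a2r, which is what keeps a handle that is already in runtime form from being translated a second time.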
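Since the implementation of MPI objects is implementation dependent, some of them may vary in size. To make sure that there is no side effect, the code chooser handles the stack itself: it saves the argument registers before reading in_w and restores them before jumping to the target function.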
ASM Code chooser implementation (generated for each function):
.global PMPI_Function # Define global PMPI_Function symbol
.weak MPI_Function # Define a weak MPI_Function symbol
.set MPI_Function,PMPI_Function # Make MPI_Function an alias of PMPI_Function
.extern in_w
.extern A_MPI_Function
.extern R_MPI_Function
.type PMPI_Function,@function # Set PMPI_Function type to function
.text
PMPI_Function:
push %rbp
mov %rsp, %rbp
# ------------- Put arguments on stack for safekeeping
sub $0x20, %rsp
mov %rdi, -0x8(%rbp)
mov %rsi, -0x10(%rbp)
mov %rdx, -0x18(%rbp)
mov %rcx, -0x20(%rbp)
# ------------- Access thread-local variable in_w
.byte 0x66
leaq in_w@tlsgd(%rip), %rdi # Point %rdi at the TLS descriptor of in_w
.value 0x6666
rex64
call __tls_get_addr@PLT # Return the address of in_w in %rax
# ------------- Put arguments back where we found them
mov -0x8(%rbp), %rdi
mov -0x10(%rbp), %rsi
mov -0x18(%rbp), %rdx
mov -0x20(%rbp), %rcx
leave # Set %rsp to %rbp, then pop top of stack into %rbp
# ------------ Jump to the target function
cmpl $0x0, 0x0(%rax)
jne inwrap_MPI_Function
jmp *A_MPI_Function@GOTPCREL(%rip) # If not in wrapper call application method
inwrap_MPI_Function:
jmp *R_MPI_Function@GOTPCREL(%rip) # If in wrapper call run method
# ------------ Calculate symbol size
.size PMPI_Function,.-PMPI_Function # Declares symbol size to be the size of the above
Fig. 3 ASM Code chooser
A_MPI_Function
All translations are performed by mappers defined in mappers.h, which
rely on an underlying hash table mechanism named uthash
(https://troydhanson.github.io/uthash/). The mappers (see the example
below) always have the same syntax:
mapper_name_a2r(&buf, &buf_tmp);
mapper_name_r2a(&buf, &buf_tmp);
In an a2r translation, buf_tmp represents the translation of buf; in an
r2a translation, buf represents the translation of buf_tmp.
Example:
int A_MPI_Send(void *buf, int count, A_MPI_Datatype datatype, int dest, int tag,
A_MPI_Comm comm) {
void *buf_tmp;
const_buffer_conv_a2r(&buf, &buf_tmp); // mapper
R_MPI_Datatype datatype_tmp;
datatype_conv_a2r(&datatype, &datatype_tmp); // mapper
int dest_tmp;
dest_conv_a2r(&dest, &dest_tmp); // mapper
int tag_tmp;
tag_conv_a2r(&tag, &tag_tmp); // mapper
R_MPI_Comm comm_tmp;
comm_conv_a2r(&comm, &comm_tmp); // mapper
int ret_tmp = LOCAL_MPI_Send(buf_tmp, count, datatype_tmp, dest_tmp, tag_tmp,
comm_tmp); // Runtime MPI_Send call
return error_code_conv_r2a(ret_tmp);
}
R_MPI_Function
In R_MPI_Function, the arguments are passed directly to the MPI runtime
call:
int R_MPI_Send(void *buf, int count, R_MPI_Datatype datatype, int dest, int tag,
R_MPI_Comm comm) {
int ret_tmp = LOCAL_MPI_Send(buf, count, datatype, dest, tag, comm);
return ret_tmp;
}
Hash table
The underlying hash table mechanism presented earlier is contained in
engine.*, engine_fn.* and uthash.h. For each type of MPI object, two
tables are created: one for the constants, and one for the objects of
that type created by the application.
The different types are:
MPI_Comm
MPI_Datatype
MPI_Errhandler
MPI_Group
MPI_Op
MPI_Request (split into 2 tables, in order to dissociate blocking requests from asynchronous requests)
MPI_File
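The two-table idea can be sketched as follows. This is only an illustration under stated assumptions: plain arrays with a linear search stand in for uthash, the handle types and values are invented, and looking in the constants table first is an assumption, not a documented rule.

```c
#include <stddef.h>

typedef int  A_handle;  /* hypothetical application-side handle */
typedef long R_handle;  /* hypothetical runtime-side handle */

typedef struct { A_handle a; R_handle r; } pair_t;

/* Table 1: predefined constants, known when the library is loaded. */
static const pair_t const_table[] = { {1, 0x44000000L}, {2, 0x44000001L} };

/* Table 2: objects created by the application while it runs. */
#define DYN_MAX 64
static pair_t dyn_table[DYN_MAX];
static size_t dyn_count = 0;

/* Register a new application/runtime pair; 0 on success, -1 if full. */
int table_add(A_handle a, R_handle r) {
    if (dyn_count == DYN_MAX) return -1;
    dyn_table[dyn_count].a = a;
    dyn_table[dyn_count].r = r;
    dyn_count++;
    return 0;
}

/* a2r lookup: constants first, then created objects; 0 when found. */
int table_a2r(A_handle a, R_handle *r) {
    for (size_t i = 0; i < sizeof const_table / sizeof const_table[0]; i++)
        if (const_table[i].a == a) { *r = const_table[i].r; return 0; }
    for (size_t i = 0; i < dyn_count; i++)
        if (dyn_table[i].a == a) { *r = dyn_table[i].r; return 0; }
    return -1;
}
```

Keeping constants apart means the table of created objects can grow and shrink without ever touching the predefined entries.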
The table within engine_fn.* contains the following translations:
MPI_Handler_function
MPI_Comm_copy_attr_function
MPI_Comm_delete_function
MPI_Type_delete_function
MPI_Comm_errhandler_function
MPI_File_errhandler_function
Thread safety
To make WI4MPI usable in a multithreaded environment, the in_w variable (see above) is thread-local (TLS):
__thread int in_w=0; (test_wrapper_generation.c:118)
extern __thread int in_w; (wrapper.c:7)
extern __thread int in_w; (c2f_f2c.c:6 || c2f_f2c.c:1149)
The tables are protected by spinlocks (cf. thread_safety.h):
#define lock_dest(a) pthread_spin_destroy(a)
#define lock_init(a) pthread_spin_init(a, PTHREAD_PROCESS_PRIVATE)
#define lock(a) pthread_spin_lock(a)
#define unlock(a) pthread_spin_unlock(a)
typedef pthread_spinlock_t *table_lock_t;
Interface
The interface version of WI4MPI offers the same promise as the preload version (one compilation, several runs over different MPI implementations), but this time WI4MPI has to be seen as a full MPI library. All the previous sections are still relevant for the interface; the only change is the new level named INTERFACE (see the schema below). This level has to be considered as a “libmpi.so” which is linked to the user application.
Fig. 4 Interface
The files interface_test.c and interface_fort.c handle the symbol
overload mechanism seen earlier, for the C and Fortran APIs
respectively. Then, depending on the conversion, a dlopen is made on the
appropriate library (WI4MPI_WRAPPER_LIB) responsible for the translation
(ASM code chooser + A_MPI_Function + R_MPI_Function).
MPI_Init example
int MPI_Init(int *argc, char ***argv);
#define MPI_Init PMPI_Init
#pragma weak MPI_Init = PMPI_Init
int (*INTERFACE_LOCAL_MPI_Init)(int *, char ***);
int PMPI_Init(int *argc, char ***argv) {
int ret_tmp = INTERFACE_LOCAL_MPI_Init(argc, argv);
return ret_tmp;
}
__attribute__((constructor)) void wrapper_interface(void) {
void *interface_handle =
dlopen(getenv("WI4MPI_WRAPPER_LIB"), RTLD_NOW | RTLD_GLOBAL);
if (!interface_handle) {
printf("no true IC lib defined\nerror :%s\n", dlerror());
exit(1);
}
INTERFACE_LOCAL_MPI_Init = dlsym(interface_handle, "CCMPI_MPI_Init");
}
Static mode
The static mode builds an executable containing every target
translation. To avoid conflicts, symbols are renamed as follows:
INTERF2_{TARGET}_{Symbol_name}. dlopen is no longer needed (cf.
Interface); function pointers are chosen through 2 variables:
WI4MPI_STATIC_TARGET_TYPE_F and WI4MPI_STATIC_TARGET_TYPE. Static
sections are controlled by the directives #if(n)def WI4MPI_STATIC / #endif.
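The selection of a function pointer among statically linked translations might look like the sketch below. It is an illustration only: the two functions are stand-ins that take no arguments, select_barrier is a hypothetical helper, and the target names "OMPI" and "INTEL" are assumed values, since the values WI4MPI accepts are not listed here.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins for two statically linked translations of the same call,
   renamed following the INTERF2_{TARGET}_{Symbol_name} pattern. */
static int INTERF2_OMPI_MPI_Barrier(void)  { return 1; }
static int INTERF2_INTEL_MPI_Barrier(void) { return 2; }

typedef int (*barrier_fn)(void);

/* Choose the function pointer from the target name.
   Assumption: the selector is a target name such as "OMPI" or "INTEL". */
barrier_fn select_barrier(const char *target) {
    if (target == NULL) return NULL;
    if (strcmp(target, "OMPI") == 0)  return INTERF2_OMPI_MPI_Barrier;
    if (strcmp(target, "INTEL") == 0) return INTERF2_INTEL_MPI_Barrier;
    return NULL; /* unknown target */
}
```

In this sketch the caller would pass the value of WI4MPI_STATIC_TARGET_TYPE once at startup and keep the returned pointer for all later calls.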
Common files for both versions of WI4MPI:
func_char_fort.*: Contain all Fortran MPI functions that take character arguments. In Fortran, a character argument always carries its length (character(len=*) :: dark_side), and the size of this length argument depends on the compiler used (Intel/GNU < 8 or GNU >= 8), so WI4MPI had to implement both.
Example:
#ifdef IFORT_CALL
void A_f_MPI_Get_processor_name(char * name,int * resultlen,int * ret,int namelen) // The character length is of type int
#elif GFORT_CALL
void A_f_MPI_Get_processor_name(char * name,int * resultlen,int * ret,size_t namelen) // The character length is of type size_t
#endif
manual_wrapper.h: Contain some manual mappers for Fortran translation
mappers.h: Contain the a2r/r2a mappers for C translation
engine.*, engine_fn.*, uthash.h: Contain all table routines
thread_safety.h: Contain the spinlock protection
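For the hidden character length handled in func_char_fort.*, a helper of the following kind shows why the type of the length argument matters. It is a sketch, not WI4MPI code: fort_len_t and c_to_fortran_string are hypothetical names, and the typedef follows the IFORT_CALL / GFORT_CALL branches of the example above.

```c
#include <stddef.h>
#include <string.h>

/* Pick the hidden-length type like the two prototypes shown above. */
#ifdef GFORT_CALL
typedef size_t fort_len_t;   /* GFORT_CALL branch: size_t */
#else
typedef int fort_len_t;      /* IFORT_CALL branch: int */
#endif

/* Copy a C string into a Fortran character buffer of namelen bytes:
   no terminating NUL, remaining bytes padded with blanks.
   Returns the number of characters copied from src. */
fort_len_t c_to_fortran_string(const char *src, char *dst, fort_len_t namelen) {
    size_t cap = (size_t)namelen;
    size_t n = strlen(src);
    if (n > cap) n = cap;            /* truncate if the buffer is too small */
    memcpy(dst, src, n);
    memset(dst + n, ' ', cap - n);   /* Fortran strings are blank padded */
    return (fort_len_t)n;
}
```

Since the length is passed by value after the visible arguments, a mismatch between int and size_t changes the call signature, which is why both variants are compiled.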
Preload files:
bin/{wi4mpi,mpirun}: see User_Guide
etc/wi4mpi.cfg: see User_Guide
gen:
c2f_f2c.c:
lib_empty.c: Empty file used to create empty libraries that replace the ones from the MPI used for the compilation
test_generation_wrapper.c: Contain all the C MPI functions within WI4MPI which deal with the translation
wrapper.c: Contain all the Fortran MPI functions within WI4MPI which deal with the translation
header:
INTEL_INTEL: app_mpi.h app_mpio.h run_mpi.h run_mpio.h wrapper_f.h
INTEL_OMPI: app_mpi.h app_mpio.h run_mpi.h wrapper_f.h
OMPI_INTEL: app_mpi.h run_mpi.h run_mpio.h wrapper_f.h
OMPI_OMPI: app_mpi.h run_mpi.h wrapper_f.h
Interface files:
gen:
c2f_f2c.c:
test_generation_wrapper.c: Same as the preload version
wrapper.c: Same as the preload version
interface_fort.c: Contain the overload symbol mechanism for Fortran MPI functions
interface_test.c: Contain the overload symbol mechanism for C MPI functions and the rerouting to the code chooser
header:
OMPI_INTEL: app_mpi.h run_mpi.h run_mpio.h wrapper_f.h
OMPI_OMPI: app_mpi.h run_mpi.h wrapper_f.h
interface_utils:
bin: Contain all MPI wrappers for compilation
include: Contain all includes exposed to users
manual:
dlsym_global.c: Get runtime MPI constants
module: Contain all elements to create a decent module
Get involved in WI4MPI
The Generator Guide is a prerequisite for this part.
Expand MPI coverage of WI4MPI
On the generator side
Add the function name to the func_list_....txt files
Add the function description in the dictionary functions.json
Add the new mappers (if needed) to convert the arguments in the dictionary mappers.json
Get involved in the generator code if some special cases have to be handled
Generate the new Fortran header for both the interface and preload versions
On the library side
Code the new mappers in mappers.h, engine*
Update app_mpi.h app_mpio.h run_mpi.h run_mpio.h for all conversions of both versions
Update headers within src/interface/interface_utils/include
Make sure to respect the MPI standard
Expand WI4MPI conversion capability
In mappers.h, you have to make sure that the status mapper translates MPI_Status.count in the right way, since its implementation is implementation dependent.
Generate the associated app_mpi.h and run_mpi.h for the new conversion
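To illustrate why the count needs care in the status mapper, the sketch below converts between two invented layouts. Neither struct matches a real MPI implementation: A_demo_status, R_demo_status and demo_status_r2a are hypothetical, and the split 64-bit count is only one example of how a runtime might store it.

```c
#include <stdint.h>

/* Hypothetical application-side layout: count stored as a plain int. */
typedef struct {
    int A_SOURCE, A_TAG, A_ERROR;
    int count;
} A_demo_status;

/* Hypothetical runtime-side layout: count split into two 32-bit halves. */
typedef struct {
    uint32_t count_lo, count_hi;
    int R_SOURCE, R_TAG, R_ERROR;
} R_demo_status;

/* r2a status mapper: copy the public fields and rebuild the count.
   Returns 0 on success, -1 if the count does not fit the application side. */
int demo_status_r2a(const R_demo_status *r, A_demo_status *a) {
    uint64_t full = ((uint64_t)r->count_hi << 32) | r->count_lo;
    if (full > (uint64_t)INT32_MAX) return -1;  /* would overflow int */
    a->A_SOURCE = r->R_SOURCE;
    a->A_TAG    = r->R_TAG;
    a->A_ERROR  = r->R_ERROR;
    a->count    = (int)full;
    return 0;
}
```

A field-by-field copy like this, rather than a memcpy of the whole structure, is what keeps the mapper correct when the two sides order or size their fields differently.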