6.4.3. Group Construction
For purposes of constructing PMIx groups, PMIx defines two classes of group members:
leaders have some global view of the group at time of construction. This might consist of knowing the number of leaders in the group, or knowing the process IDs of all group leaders. All leaders must, of course, know the group ID they are attempting to construct.
members know only that they are to participate in a given group ID, but have no other knowledge of the group. For example, a member may not know how many processes will be in the group or any of their process IDs. The only requirement for membership is that the process knows the group ID to which it is to belong.
Within that context, PMIx supports three methods for constructing PMIx groups. The collective method is considered the more traditional form of the operation but requires all group leaders to know the process ID of all other leaders prior to calling the API.
In contrast, the bootstrap method is a somewhat more dynamic form of the operation that assumes each leader only knows the number of group leaders, but does not know their process IDs. This is commonly the case when two or more collections of processes wish to join together (e.g., in an MPI connect/accept operation), but only the “root” processes know of each other. In such cases, each root process typically knows the process ID of all processes in its collection, but only the root process ID in the other collection(s).
In either of these two methods, additional group members can be specified by any leader via the PMIX_GROUP_ADD_MEMBERS attribute. The PMIx server library and host are jointly responsible for aggregating the additional group members specified across leaders. Processes that are on the “additional member” list must call PMIx_Group_construct with NULL in the procs argument; this indicates that the process is to be added to the group when the group construct operation has completed. Note that the group construct operation cannot complete until all “leaders” and all “add members” have called PMIx_Group_construct. This is required so that any group and/or endpoint information provided by the added members can be included in the returned pmix_info_t array.
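The aggregation of “add member” lists across leaders can be pictured with a minimal sketch. It is not taken from the PMIx library: proc_id_t is a hypothetical stand-in for a PMIx process ID, merge_add_members is an illustrative helper, and dropping duplicates named by more than one leader is an assumed policy rather than documented behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a PMIx process ID (namespace plus rank). */
typedef struct {
    char nspace[64];
    uint32_t rank;
} proc_id_t;

static int same_proc(const proc_id_t *a, const proc_id_t *b)
{
    return a->rank == b->rank && strcmp(a->nspace, b->nspace) == 0;
}

/* Merge one leader's "add members" list into the aggregate list,
 * skipping entries another leader already named (assumed policy).
 * Returns the new number of entries in the aggregate list. */
static size_t merge_add_members(proc_id_t *agg, size_t nagg, size_t cap,
                                const proc_id_t *list, size_t nlist)
{
    assert(nagg <= cap);
    for (size_t i = 0; i < nlist; i++) {
        int found = 0;
        for (size_t j = 0; j < nagg && !found; j++) {
            found = same_proc(&agg[j], &list[i]);
        }
        if (!found) {
            assert(nagg < cap);   /* sketch: the caller sizes the buffer */
            agg[nagg++] = list[i];
        }
    }
    return nagg;
}
```

Calling the helper once per leader contribution yields the full set of processes that must still call the construct API before the operation can complete.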
Finally, the invite method represents the most dynamic form of group construction as it is executed in an ad hoc manner that revolves around a single leader that asynchronously invites processes to join a group.
Each of these methods is explained further below. The PMIx library currently conducts Continuous Integration (CI) tests for all three methods on each pull request (PR) submitted to the library.
6.4.3.1. Collective Method
All leaders know the process IDs of all other leaders, and thus each calls PMIx_Group_construct with the array of all leader process IDs. Every leader must pass this array when calling the API. However, the array does not need to be ordered; i.e., the order of IDs in the array does not need to be the same across leaders.
An example of the collective method can be seen in the group.c example taken from the PMIx library.
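Because the leader arrays need not be ordered, any code that checks whether two leaders named the same group must compare them as sets. The following is a minimal sketch of such a check, assuming a hypothetical proc_id_t stand-in for a PMIx process ID and assuming no array lists the same process twice; it is illustrative only and not how the PMIx library is known to implement the comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a PMIx process ID (namespace plus rank). */
typedef struct {
    char nspace[64];
    uint32_t rank;
} proc_id_t;

/* Order by namespace first, then by rank. */
static int cmp_proc(const void *a, const void *b)
{
    const proc_id_t *pa = a;
    const proc_id_t *pb = b;
    int c = strcmp(pa->nspace, pb->nspace);
    if (c != 0) {
        return c;
    }
    return (pa->rank > pb->rank) - (pa->rank < pb->rank);
}

/* Returns 1 if both arrays name the same leaders regardless of order.
 * Both arrays are sorted in place as a side effect. */
static int same_leader_set(proc_id_t *x, size_t nx, proc_id_t *y, size_t ny)
{
    assert(x != NULL && y != NULL);
    if (nx != ny) {
        return 0;
    }
    qsort(x, nx, sizeof *x, cmp_proc);
    qsort(y, ny, sizeof *y, cmp_proc);
    for (size_t i = 0; i < nx; i++) {
        if (cmp_proc(&x[i], &y[i]) != 0) {
            return 0;
        }
    }
    return 1;
}
```

The entries are compared field by field rather than with memcmp, since the bytes after the terminating NUL in the namespace buffer are not guaranteed to match.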
6.4.3.1.1. Library responsibilities
The PMIx client library will collect all information posted by the client on a PMIX_REMOTE scope and include it in the participation message sent to its local server. Only “remote” data is included as all local procs already have access to any “local” data that has been posted. The message also includes any pmix_info_t values that were provided by the caller, including any PMIX_GROUP_INFO contributions.
The local server will aggregate participation from clients operating as leaders before passing the request up to the host. If a timeout value was provided by one or more clients, then the server must monitor the request for timeout until it is passed up to the host. This is to protect against the case where multiple local clients are participating, and one of those clients fails to call the construct API (thus preventing the server from passing the request to the host) within the specified timeout. The server no longer has responsibility for detecting a timeout condition once it determines the operation is locally complete and has passed it to the host.
The local server will also immediately pass to the host a participation request from a client operating as a “member” - i.e., a process that is being included in the construct operation via an “add member” specification. These participants are not counted against local completion of the operation.
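The leader-side aggregation and timeout monitoring described above can be modeled as a small state machine. This is a sketch under stated assumptions, not the server's actual code: local_tracker_t and its functions are hypothetical names, time is a plain count of seconds, and keeping the earliest deadline when several clients supply a timeout is an assumed policy.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TRK_WAITING, TRK_PASSED_TO_HOST, TRK_TIMED_OUT } trk_state_t;

/* Hypothetical per-operation tracker kept by the local server. */
typedef struct {
    size_t nexpected;   /* local leaders named in the procs array */
    size_t narrived;    /* local leaders that have called construct */
    long deadline;      /* absolute time in seconds, 0 means no timeout */
    trk_state_t state;
} local_tracker_t;

static void tracker_init(local_tracker_t *t, size_t nexpected)
{
    t->nexpected = nexpected;
    t->narrived = 0;
    t->deadline = 0;
    t->state = TRK_WAITING;
}

/* Record a local leader calling construct at time 'now'.
 * A timeout of 0 means this client did not request one. */
static trk_state_t tracker_arrive(local_tracker_t *t, long now, long timeout)
{
    if (t->state != TRK_WAITING) {
        return t->state;
    }
    assert(t->narrived < t->nexpected);
    if (timeout > 0) {
        long d = now + timeout;
        /* assumed policy: honor the earliest requested deadline */
        if (t->deadline == 0 || d < t->deadline) {
            t->deadline = d;
        }
    }
    t->narrived++;
    if (t->narrived == t->nexpected) {
        /* locally complete: the request goes up to the host and the
         * server stops watching for a timeout */
        t->state = TRK_PASSED_TO_HOST;
    }
    return t->state;
}

/* Periodic check; only a still-waiting operation can time out locally. */
static trk_state_t tracker_tick(local_tracker_t *t, long now)
{
    if (t->state == TRK_WAITING && t->deadline != 0 && now >= t->deadline) {
        t->state = TRK_TIMED_OUT;
    }
    return t->state;
}
```

Participation by “add members” is deliberately absent from the tracker, since those requests go straight to the host and do not count toward local completion.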
Upon notification of operation completion, the server will locally store all information collected from the participants and then notify each participating client (leaders as well as members). Notification will include an array of the resulting membership IDs, plus any assigned context ID and contributed group info.
Upon concluding the operation, each client shall have access to all job-level information for namespaces that had a process participating in the group, plus all information provided by the individual participants (e.g., posted “remote” data).
6.4.3.1.2. Host responsibilities
The host is responsible for performing a collective allgather operation across participants, returning all provided information to the involved PMIx servers. This includes a complete membership list, including the process ID of all leaders and added members. The host must ensure that all “members” have participated prior to declaring the operation to be complete.
Note that the group construct operation requires that each participant have access to the job-level data of each namespace with a participating process in the group. Some hosts (e.g., PRRTE) automatically register each namespace with all PMIx servers, thus ensuring the job-level information is always available. Other hosts may need to assemble and register the participating namespaces separately.
6.4.3.2. Bootstrap Method
Bootstrap is used when the processes leading a group construct operation do not know the identity of all other processes that will be participating, but at least know how many leaders will be involved. Leaders provide only their own process ID in the procs parameter to the PMIx_Group_construct API, and are required to include the PMIX_GROUP_BOOTSTRAP attribute in their array of pmix_info_t directives, with the value in that attribute set to equal the number of leaders in the group construct operation. They may also provide the PMIX_GROUP_ADD_MEMBERS attribute with an array of process IDs that are to belong to the final group. Each of those processes will also call the group construct API, but with NULL in the procs argument to indicate they are joining as “add members” and not leaders.
An example of the bootstrap method can be seen in the group_bootstrap.c example taken from the PMIx library.
6.4.3.2.1. Library responsibilities
The PMIx client library will collect all information posted by the client on a PMIX_REMOTE scope and include it in the participation message sent to its local server. Only “remote” data is included as all local procs have access to any “local” data that has been posted. The message also includes any pmix_info_t values that were provided by the caller, including any PMIX_GROUP_INFO contributions.
Upon receiving the participation from a local client (whether “leader” or “member”), the PMIx server will immediately pass the request up to the host. The server bears no responsibility for aggregating the local participants or for monitoring timeout conditions.
Upon notification of operation completion, the server will locally store all information collected from the participants and then notify each participating client (leaders as well as members). Notification will include an array of the resulting membership IDs, plus any assigned context ID and contributed group info.
Upon concluding the operation, each client shall have access to all job-level information for namespaces that had a process participating in the group, plus all information provided by the individual participants (e.g., posted “remote” data).
6.4.3.2.2. Host responsibilities
The host is responsible for performing a collective allgather operation across participants, returning all provided information to the involved PMIx servers. This includes a complete membership list, including the process ID of all leaders and added members. The host in this scenario must look for and count participation by the specified number of leaders, plus each individual member, before declaring the operation to be complete.
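The completion rule for the bootstrap case can be sketched as follows. The model is hypothetical and not drawn from any host implementation: bootstrap_op_t and its functions are illustrative names, processes are reduced to bare integer IDs for brevity, and the sketch assumes members may arrive before the leader that names them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MEMBERS 64

/* Hypothetical host-side record of one bootstrap construct operation. */
typedef struct {
    uint32_t nleaders_expected;   /* value carried by the bootstrap attribute */
    uint32_t nleaders_arrived;
    uint32_t named[MAX_MEMBERS];  /* add members named by any leader */
    size_t nnamed;
    uint32_t joined[MAX_MEMBERS]; /* add members that have called construct */
    size_t njoined;
} bootstrap_op_t;

static int contains(const uint32_t *set, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++) {
        if (set[i] == id) {
            return 1;
        }
    }
    return 0;
}

static void add_unique(uint32_t *set, size_t *n, uint32_t id)
{
    if (contains(set, *n, id)) {
        return;
    }
    assert(*n < MAX_MEMBERS);
    set[(*n)++] = id;
}

/* A leader participates, stating the leader count and any add members. */
static void leader_arrived(bootstrap_op_t *op, uint32_t nleaders,
                           const uint32_t *add, size_t nadd)
{
    op->nleaders_expected = nleaders;
    op->nleaders_arrived++;
    for (size_t i = 0; i < nadd; i++) {
        add_unique(op->named, &op->nnamed, add[i]);
    }
}

/* An add member participates (it called construct with no procs array). */
static void member_arrived(bootstrap_op_t *op, uint32_t id)
{
    add_unique(op->joined, &op->njoined, id);
}

/* Complete once every leader and every named add member has arrived. */
static int op_complete(const bootstrap_op_t *op)
{
    if (op->nleaders_expected == 0 ||
        op->nleaders_arrived < op->nleaders_expected) {
        return 0;
    }
    for (size_t i = 0; i < op->nnamed; i++) {
        if (!contains(op->joined, op->njoined, op->named[i])) {
            return 0;
        }
    }
    return 1;
}
```

The list of named members is only final once the last leader has arrived, which is why the leader count is checked before the member lists are compared.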
Note that the group construct operation requires that each participant have access to the job-level data of each namespace with a participating process in the group. Some hosts (e.g., PRRTE) automatically register each namespace with all PMIx servers, thus ensuring the job-level information is always available. Other hosts may need to assemble and register the participating namespaces separately.
6.4.3.3. Invite Method
In contrast to the prior methods, the “invite” method does not involve any collective operation. Instead, it relies solely on the event notification subsystem (e.g., PMIx_Notify_event) for its underlying execution.
Applications wishing to use this method must first register handlers for two events (see the asyncgroup.c example from the PMIx library for details):
the PMIX_GROUP_INVITED event. This will be triggered in a process when someone calls PMIx_Group_invite and includes that process in the array of desired members. The handler must retrieve the specified PMIX_GROUP_ID from the provided array of attributes, and then call PMIx_Group_join_nb to indicate the desired response (e.g., a value of PMIX_GROUP_ACCEPT to join the group).
Note: The handler cannot call the blocking form of the “group join” API as it is invoked from inside the PMIx library’s progress thread. Doing so will cause a thread deadlock condition.
the PMIX_GROUP_COMPLETE event, which will be triggered once the construct operation has completed. This can be used to receive the final group membership, along with any provided group info or other data.
The construct procedure is initiated by a single “leader” that calls the PMIx_Group_invite API, providing (among other optional things) an array of process IDs that it wishes to have join the group. Prior to doing so, the leader may choose to register an event handler for the PMIX_GROUP_INVITE_FAILED event. This will allow the library to notify the process should any of the specified prospective members reject the invitation, or terminate prior to responding to it. Upon receiving a “failed” event, the leader can optionally replace the rejecting process with another, can terminate the group construct operation, or can ignore the failure (thereby accepting a reduced final group membership).
The leader will return from the PMIx_Group_invite function once all specified members have responded to the invitation. In addition, the leader will (since it is a member of the group) receive the PMIX_GROUP_COMPLETE event specifying the status return of the operation (PMIX_SUCCESS to indicate that the group was successfully constructed, or else an appropriate error value) and, if successful, containing the resulting information.
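The leader-side bookkeeping implied by this flow can be sketched as a simple response table. The model is hypothetical and does not reflect the PMIx library's internals: invite_op_t and its functions are illustrative names, processes are reduced to integer IDs, a process that terminates before responding is recorded the same way as one that declines, and the leader itself is not listed among the invitees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INVITEES 32

typedef enum { RESP_PENDING, RESP_ACCEPTED, RESP_DECLINED } resp_t;

/* Hypothetical leader-side table of invitations and their responses. */
typedef struct {
    uint32_t id[MAX_INVITEES];
    resp_t resp[MAX_INVITEES];
    size_t n;
} invite_op_t;

/* Issue an invitation; also used to invite a replacement after a failure. */
static void invite(invite_op_t *op, uint32_t id)
{
    assert(op->n < MAX_INVITEES);
    op->id[op->n] = id;
    op->resp[op->n] = RESP_PENDING;
    op->n++;
}

/* Record a response; termination before responding counts as declined. */
static void respond(invite_op_t *op, uint32_t id, resp_t r)
{
    for (size_t i = 0; i < op->n; i++) {
        if (op->id[i] == id && op->resp[i] == RESP_PENDING) {
            op->resp[i] = r;
            return;
        }
    }
}

/* The invite call can return only when no invitation is still pending. */
static int all_responded(const invite_op_t *op)
{
    for (size_t i = 0; i < op->n; i++) {
        if (op->resp[i] == RESP_PENDING) {
            return 0;
        }
    }
    return 1;
}

/* Collect the invitees that accepted; returns how many were written. */
static size_t final_members(const invite_op_t *op, uint32_t *out)
{
    size_t k = 0;
    for (size_t i = 0; i < op->n; i++) {
        if (op->resp[i] == RESP_ACCEPTED) {
            out[k++] = op->id[i];
        }
    }
    return k;
}
```

In this model, ignoring a failure means leaving the declined entry as it is, while replacing a process means issuing a new invitation, which keeps the operation open until the replacement responds.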
6.4.3.3.1. Library responsibilities
The PMIx client library is solely responsible for executing the underlying handshakes to support this method of group construction. The library will collect all information posted by each participant on a PMIX_REMOTE scope and include it in the invitation event (if the process is the leader) or in the “join” event (if a participating member). Only “remote” data is included as all local procs have access to any “local” data that has been posted. The event also includes any pmix_info_t values that were provided by the caller, including any PMIX_GROUP_INFO contributions.
The client library tracks all contributions and assembles the final “complete” event payload, which includes a complete membership array plus all provided information. In this case, the provided information (e.g., that posted by each participant) will not be stored on the server, but instead delivered to each individual participant’s data store for subsequent retrieval via PMIx_Get.
The PMIx server library’s sole responsibility is transferring the events generated by its local clients to/from the host.
6.4.3.3.2. Host responsibilities
The host is responsible solely for propagating event notifications across participating processes.