MPIRUN(1) LAM COMMANDS MPIRUN(1)
mpirun - Run MPI programs on LAM nodes.
mpirun [-fhvO] [-c <#> | -np <#>] [-D | -wd <dir>] [-ger | -nger]
[-sigs | -nsigs] [-ssi <key> <value>] [-nw | -w] [-nx] [-pty |
-npty] [-s <node>] [-t | -toff | -ton] [-tv] [-x
VAR1[=VALUE1][,VAR2[=VALUE2],...]] [[-p <prefix_str>] [-sa |
-sf]] [<where>] <program> [-- <args>]
mpirun [-fhvO] [-D | -wd <dir>] [-ger | -nger] [-sigs | -nsigs] [-ssi
<key> <value>] [-nw | -w] [-nx] [-pty | -npty] [-t | -toff |
-ton] [-tv] [-x VAR1[=VALUE1][,VAR2[=VALUE2],...]] <schema>
Note: The -c2c and -lamd options are now obsolete. Use -ssi instead.
See the "SSI" section, below.
There are two forms of the mpirun command -- one for programs (i.e.,
SPMD-style applications), and one for application schemas (see app-
schema(5)). Both forms of mpirun use the following options by default:
-nger -w. These may each be overridden by their counterpart options,
Additionally, mpirun will send the name of the directory where it was
invoked on the local node to each of the remote nodes, and attempt to
change to that directory. See the "Current Working Directory" section,
-c <#> Synonym for -np (see below).
-D Use the executable program location as the current working
directory for created processes. The current working direc-
tory of the created processes will be set before the user’s
program is invoked. This option is mutually exclusive with
-f Do not configure standard I/O file descriptors - use de-
-h Print useful information on this command.
-ger Enable GER (Guaranteed Envelope Resources) communication pro-
tocol and error reporting. See MPI(7) for a description of
GER. This option is mutually exclusive with -nger.
-nger Disable GER (Guaranteed Envelope Resources). This option is
mutually exclusive with -ger.
-nsigs Do not have LAM catch signals in the user application. This
is the default, and is mutually exclusive with -sigs.
-np <#> Run this many copies of the program on the given nodes. This
option indicates that the specified file is an executable
program and not an application schema. If no nodes are spec-
ified, all LAM nodes are considered for scheduling; LAM will
schedule the programs in a round-robin fashion, "wrapping
around" (and scheduling multiple copies on a single node) if
-npty Disable pseudo-tty support. Unless you are having problems
with pseudo-tty support, you probably do not need this option.
Mutually exclusive with -pty.
-nw Do not wait for all processes to complete before exiting
mpirun. This option is mutually exclusive with -w.
-nx Do not automatically export LAM_MPI_*, LAM_IMPI_*, or IMPI_*
environment variables to the remote nodes.
-O Multicomputer is homogeneous. Do no data conversion when
passing messages. THIS FLAG IS NOW OBSOLETE.
-pty Enable pseudo-tty support. Among other things, this enables
line-buffered output (which is probably what you want). This
is the default. Mutually exclusive with -npty.
-s <node> Load the program from this node. This option is not valid on
the command line if an application schema is specified.
-sigs Have LAM catch signals in the user process. This option is
mutually exclusive with -nsigs.
-ssi <key> <value>
Send arguments to various SSI modules. See the "SSI" sec-
-t, -ton Enable execution trace generation for all processes. Trace
generation will proceed with no further action. These op-
tions are mutually exclusive with -toff.
-toff Enable execution trace generation for all processes. Trace
generation for message passing traffic will begin after pro-
cesses collectively call MPIL_Trace_on(2). Note that trace
generation for datatypes and communicators will proceed re-
gardless of whether trace generation is enabled for messages
or not. This option is mutually exclusive with -t and -ton.
-tv Launch processes under the TotalView Debugger.
-v Be verbose; report on important steps as they are done.
-w Wait for all applications to exit before mpirun exits.
-wd <dir> Change to the directory <dir> before the user’s program exe-
cutes. Note that if the -wd option appears both on the com-
mand line and in an application schema, the schema will take
precedence over the command line. This option is mutually
exclusive with -D.
-x Export the specified environment variables to the remote
nodes before executing the program. Existing environment
variables can be specified (see the Examples section, below),
or new variable names specified with corresponding values.
The parser for the -x option is not very sophisticated; it
does not even understand quoted values. Users are advised to
set variables in the environment, and then use -x to export
(not define) them.
-sa Display the exit status of all MPI processes irrespective of
whether they fail or run successfully.
-sf Display the exit status of all processes only if one of them
fails.
-p <prefix_str>
Prefixes each process status line displayed by [-sa] and
[-sf] by the <prefix_str>.
<where> A set of node and/or CPU identifiers indicating where to
start <program>. See bhost(5) for a description of the node
and CPU identifiers. mpirun will schedule adjoining ranks in
MPI_COMM_WORLD on the same node when CPU identifiers are
used. For example, if LAM was booted with a CPU count of 4
on n0 and a CPU count of 2 on n1 and <where> is C, ranks 0
through 3 will be placed on n0, and ranks 4 and 5 will be
placed on n1.
<args> Pass these runtime arguments to every new process. These
must always be the last arguments to mpirun. This option is
not valid on the command line if an application schema is
One invocation of mpirun starts an MPI application running under LAM.
If the application is simply SPMD, the application can be specified on
the mpirun command line. If the application is MIMD, comprising multi-
ple programs, an application schema is required in a separate file.
See appschema(5) for a description of the application schema syntax,
but it essentially contains multiple mpirun command lines, less the
command name itself. The ability to specify different options for dif-
ferent instantiations of a program is another reason to use an applica-
As described above, mpirun can specify arbitrary locations in the cur-
rent LAM universe. Locations can be specified either by CPU or by node
(noted by the "<where>" in the SYNTAX section, above). Note that LAM
does not bind processes to CPUs -- specifying a location "by CPU" is
really a convenience mechanism for SMPs that ultimately maps down to a
Note that LAM effectively numbers MPI_COMM_WORLD ranks from left-to-
right in the <where>, regardless of which nomenclature is used. This
can be important because typical MPI programs tend to communicate more
with their immediate neighbors (i.e., myrank +/- X) than distant neigh-
bors. When neighbors end up on the same node, the shmem RPIs can be
used for communication rather than the network RPIs, which can result
in faster MPI performance.
Specifying locations by node will launch one copy of an executable per
specified node. Using a capital "N" tells LAM to use all available
nodes that were lambooted (see lamboot(1)). Ranges of specific nodes
can also be specified in the form "nR[,R]*", where R specifies either a
single node number or a valid range of node numbers in the range of [0,
num_nodes). For example:
mpirun N a.out
Runs one copy of the executable a.out on all available nodes in
the LAM universe. MPI_COMM_WORLD rank 0 will be on n0, rank 1 will
be on n1, etc.
mpirun n0-3 a.out
Runs one copy of the executable a.out on nodes 0 through 3.
MPI_COMM_WORLD rank 0 will be on n0, rank 1 will be on n1, etc.
mpirun n0-3,8-11,15 a.out
Runs one copy of the executable a.out on nodes 0 through 3, 8
through 11, and 15. MPI_COMM_WORLD ranks will be ordered as fol-
lows: (0, n0), (1, n1), (2, n2), (3, n3), (4, n8), (5, n9), (6,
n10), (7, n11), (8, n15).
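The by-node range notation can be expanded mechanically. The following Python sketch is not part of LAM; the function name expand_where_nodes is hypothetical, and it assumes only the "nR[,R]*" form described above (it does not handle "N" and does not check node numbers against the size of the universe). The position of each node in the returned list is the MPI_COMM_WORLD rank, because ranks are numbered left to right in <where>.

```python
def expand_where_nodes(where):
    """Expand a by-node spec such as "n0-3,8-11,15" into a list of node numbers.

    The list index is the MPI_COMM_WORLD rank of the copy started there.
    """
    if not where.startswith("n"):
        raise ValueError("expected a by-node spec starting with 'n'")
    nodes = []
    for part in where[1:].split(","):
        if "-" in part:
            # A range such as "8-11" is inclusive at both ends.
            first, last = part.split("-")
            nodes.extend(range(int(first), int(last) + 1))
        else:
            nodes.append(int(part))
    return nodes


for rank, node in enumerate(expand_where_nodes("n0-3,8-11,15")):
    print(f"rank {rank} -> n{node}")
```

A node may appear more than once in the list, in which case several ranks are placed on it.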
Specifying by CPU is the preferred method of launching MPI jobs. The
intent is that the boot schema used with lamboot(1) will indicate how
many CPUs are available on each node, and then a single, simple mpirun
command can be used to launch across all of them. As noted above,
specifying CPUs does not actually bind processes to CPUs -- it is only
a convenience mechanism for launching on SMPs. Otherwise, the by-CPU
notation is the same as the by-node notation, except that "C" and "c"
are used instead of "N" and "n".
Assume in the following example that the LAM universe consists of four
4-way SMPs. So c0-3 are on n0, c4-7 are on n1, c8-11 are on n2, and
c12-15 are on n3.
mpirun C a.out
Runs one copy of the executable a.out on all available CPUs in
the LAM universe. This is typically the simplest (and preferred)
method of launching all MPI jobs (even if it resolves to one pro-
cess per node). MPI_COMM_WORLD ranks 0-3 will be on n0, ranks 4-7
will be on n1, ranks 8-11 will be on n2, and ranks 12-15 will be on
mpirun c0-3 a.out
Runs one copy of the executable a.out on CPUs 0 through 3. All
four ranks of MPI_COMM_WORLD will be on n0.
mpirun c0-3,8-11,15 a.out
Runs one copy of the executable a.out on CPUs 0 through 3, 8
through 11, and 15. MPI_COMM_WORLD ranks 0-3 will be on n0, 4-7
will be on n2, and 8 will be on n3.
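The by-CPU placements in these examples can be checked with a small model. This is a Python sketch under stated assumptions, not LAM's scheduler: the function names are hypothetical, CPUs are assumed to be numbered consecutively node by node in boot order, and only "C" and the "cR[,R]*" form are handled.

```python
def cpu_to_node_table(cpus_per_node):
    """Return a list whose index is a CPU number and whose value is its node."""
    table = []
    for node, count in enumerate(cpus_per_node):
        table.extend([node] * count)
    return table


def place_by_cpu(where, cpus_per_node):
    """Return the node of each MPI_COMM_WORLD rank for a by-CPU spec."""
    table = cpu_to_node_table(cpus_per_node)
    if where == "C":
        # "C" means every available CPU, in order.
        return table
    cpus = []
    for part in where[1:].split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return [table[cpu] for cpu in cpus]


universe = [4, 4, 4, 4]  # four 4-way SMPs, as in the example above
print(place_by_cpu("c0-3,8-11,15", universe))
```

With the universe above, ranks 0-3 land on n0, ranks 4-7 on n2, and rank 8 on n3, matching the last example.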
The reason that the by-CPU nomenclature is preferred over the by-node
nomenclature is best shown through example. Consider trying to run the
first CPU example (with the same MPI_COMM_WORLD mapping) with the by-
node nomenclature -- run one copy of a.out for every available CPU, and
maximize the number of local neighbors to potentially maximize MPI per-
formance. One solution would be to use the following command:
mpirun n0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 a.out
This works, but is definitely klunky to type. It is typically easier
to use the by-CPU notation. One might think that the following is
mpirun N -np 16 a.out
This is not equivalent because the MPI_COMM_WORLD rank mappings will be
assigned by node rather than by CPU. Hence rank 0 will be on n0, rank
1 will be on n1, etc. Note that the following, however, is equivalent,
because LAM interprets lack of a <where> as "C":
mpirun -np 16 a.out
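The difference between the two -np forms can be illustrated with a short Python model. It is a sketch only: the function names are hypothetical, and it assumes that "N -np" assigns ranks round-robin over nodes while a missing <where> assigns them over CPUs (wrapping around if more copies than CPUs are requested).

```python
def ranks_by_node(num_procs, num_nodes):
    """Model of "mpirun N -np <#>": round-robin over nodes."""
    return [rank % num_nodes for rank in range(num_procs)]


def ranks_by_cpu(num_procs, cpus_per_node):
    """Model of "mpirun -np <#>" with no <where>: round-robin over CPUs."""
    table = [node for node, count in enumerate(cpus_per_node)
             for _ in range(count)]
    return [table[rank % len(table)] for rank in range(num_procs)]


universe = [4, 4, 4, 4]  # four 4-way SMPs
print("N -np 16:", ranks_by_node(16, len(universe)))
print("  -np 16:", ranks_by_cpu(16, universe))
```

In the first mapping, neighboring ranks always sit on different nodes; in the second, each group of four neighboring ranks shares a node.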
However, a "C" can tend to be more convenient, especially for batch-
queuing scripts because the exact number of processes may vary between
queue submissions. Since the batch system will determine the final
number of CPUs available, having a generic script that effectively says
"run on everything you gave me" may lead to more portable / re-usable
Finally, it should be noted that specifying multiple <where> clauses
is perfectly acceptable. As such, mixing of the by-node and by-CPU
syntax is also valid, albeit typically not useful. For example:
mpirun C N a.out
However, in some cases, specifying multiple <where> clauses can be use-
ful. Consider a parallel application where MPI_COMM_WORLD rank 0 will
be a "manager" and therefore consume very few CPU cycles because it is
usually waiting for "worker" processes to return results. Hence, it is
probably desirable to run one "worker" process on all available CPUs,
and run one extra process that will be the "manager":
mpirun c0 C manager-worker-program
Application Schema or Executable Program?
To distinguish the two different forms, mpirun looks on the command
line for <where> or the -c option. If neither is specified, then the
file named on the command line is assumed to be an application schema.
If either one or both are specified, then the file is assumed to be an
executable program. If <where> and -c both are specified, then copies
of the program are started on the specified nodes/CPUs according to an
internal LAM scheduling policy. Specifying just one node effectively
forces LAM to run all copies of the program in one place. If -c is
given, but not <where>, then all available CPUs on all LAM nodes are
used. If <where> is given, but not -c, then one copy of the program is
run on each node.
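The decision rules in this section can be summarized as a small table. The Python sketch below only restates the paragraph above; the function name is hypothetical, and has_c stands for the presence of -c or its synonym -np.

```python
def interpret_command_line(has_where, has_c):
    """Return (kind of file, placement rule) for the file named on the command line."""
    if not has_where and not has_c:
        return ("application schema", "placement comes from the schema")
    if has_where and has_c:
        return ("executable program",
                "copies scheduled on the specified nodes/CPUs")
    if has_c:
        return ("executable program",
                "copies scheduled on all available CPUs")
    return ("executable program", "one copy per specified location")


for where in (False, True):
    for c in (False, True):
        print(where, c, interpret_command_line(where, c))
```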
By default, LAM searches for executable programs on the target node
where a particular instantiation will run. If the file system is not
shared, the target nodes are homogeneous, and the program is frequently
recompiled, it can be convenient to have LAM transfer the program from
a source node (usually the local node) to each target node. The -s op-
tion specifies this behavior and identifies the single source node.
LAM looks for an executable program by searching the directories in the
user’s PATH environment variable as defined on the source node(s).
This behavior is consistent with logging into the source node and exe-
cuting the program from the shell. On remote nodes, the "." path is
the home directory.
LAM looks for an application schema in three directories: the local di-
rectory, the value of the LAMAPPLDIR environment variable, and lamin-
stalldir/boot, where "laminstalldir" is the directory where LAM/MPI was
LAM directs UNIX standard input to /dev/null on all remote nodes. On
the local node that invoked mpirun, standard input is inherited from
mpirun. The default is what used to be the -w option to prevent con-
flicting access to the terminal.
LAM directs UNIX standard output and error to the LAM daemon on all re-
mote nodes. LAM ships all captured output/error to the node that in-
voked mpirun and prints it on the standard output/error of mpirun. Lo-
cal processes inherit the standard output/error of mpirun and transfer
to it directly.
Thus it is possible to redirect standard I/O for LAM applications by
using the typical shell redirection procedure on mpirun.
% mpirun C my_app < my_input > my_output
Note that in this example only the local node (i.e., the node where
mpirun was invoked from) will receive the stream from my_input on
stdin. The stdin on all the other nodes will be tied to /dev/null.
However, the stdout from all nodes will be collected into the my_output
The -f option avoids all the setup required to support standard I/O de-
scribed above. Remote processes are completely directed to /dev/null
and local processes inherit file descriptors from lamboot(1).
The -pty option enables pseudo-tty support for process output (it is
also enabled by default). This allows, among other things, for line
buffered output from remote nodes (which is probably what you want).
This option can be disabled with the -npty switch.
Process Termination / Signal Handling
During the run of an MPI application, if any rank dies abnormally (ei-
ther exiting before invoking MPI_FINALIZE, or dying as the result of a
signal), mpirun will print out an error message and kill the rest of
the MPI application.
By default, LAM/MPI only installs a signal handler for one signal in
user programs (SIGUSR2 by default, but this can be overridden when LAM
is configured and built). Therefore, it is safe for users to install
their own signal handlers in LAM/MPI programs (LAM notices death-by-
signal cases by examining the process’ return status provided by the
User signal handlers should probably avoid trying to clean up MPI state
-- LAM is neither thread-safe nor async-signal-safe. For example, if a
seg fault occurs in MPI_SEND (perhaps because a bad buffer was passed
in) and a user signal handler is invoked, if this user handler attempts
to invoke MPI_FINALIZE, Bad Things could happen since LAM/MPI was
already "in" MPI when the error occurred. Since mpirun will notice
that the process died due to a signal, cleaning up MPI state is
probably not necessary; it is safest for the user handler to clean up
only non-MPI state.
If the -sigs option is used with mpirun, LAM/MPI will install several
signal handlers locally on each rank to catch signals, print out error
messages, and kill the rest of the MPI application. This is somewhat
redundant behavior since this is now all handled by mpirun, but it
has been left for backwards compatibility.
Process Exit Statuses
The -sa, -sf, and -p parameters can be used to display the exit
statuses of the individual MPI processes as they terminate. -sa forces
the exit statuses to be displayed for all processes; -sf only displays
the exit statuses if at least one process terminates either by a
signal or a non-zero exit status (note that exiting before invoking
MPI_FINALIZE will cause a non-zero exit status).
The status of each process is printed out, one per line, in the
following format:
prefix_string node pid killed status
If killed is 1, then status is the signal number. If killed is 0, then
status is the exit status of the process.
The default prefix_string is "mpirun:", but the -p option can be used
to override this string.
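Status lines in this format are easy to post-process. The Python sketch below is an illustration only: the names are hypothetical, the sample lines are made up, and it assumes the fields are separated by whitespace with node, pid, killed, and status each being a single token. Splitting from the right keeps a custom prefix intact even if it contains spaces.

```python
from collections import namedtuple

ProcessStatus = namedtuple("ProcessStatus", "prefix node pid killed status")


def parse_status_line(line):
    """Parse one "prefix_string node pid killed status" line."""
    # rsplit from the right so the prefix may itself contain whitespace.
    prefix, node, pid, killed, status = line.rsplit(None, 4)
    return ProcessStatus(prefix, node, int(pid), killed == "1", int(status))


def describe(ps):
    """Turn a parsed line into a readable sentence."""
    if ps.killed:
        return f"pid {ps.pid} on {ps.node} was killed by signal {ps.status}"
    return f"pid {ps.pid} on {ps.node} exited with status {ps.status}"


# Made-up sample lines using the default prefix.
for sample in ("mpirun: n0 1234 0 0", "mpirun: n1 5678 1 11"):
    print(describe(parse_status_line(sample)))
```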
Current Working Directory
The default behavior of mpirun has changed with respect to the directo-
ry that processes will be started in.
The -wd option to mpirun allows the user to change to an arbitrary di-
rectory before their program is invoked. It can also be used in appli-
cation schema files to specify working directories on specific nodes
and/or for specific applications.
If the -wd option appears both in a schema file and on the command
line, the schema file directory will override the command line value.
The -D option will change the current working directory to the directo-
ry where the executable resides. It cannot be used in application
schema files. -wd is mutually exclusive with -D.
If neither -wd nor -D is specified, the local node will send the name
of the directory where mpirun was invoked to each of the remote nodes.
The remote nodes will then try to change to that directory. If they
fail (e.g., if the directory does not exist on that node), they will
start from the user’s home directory.
All directory changing occurs before the user’s program is invoked; it
does not wait until MPI_INIT is called.
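The precedence among these rules can be written down as a short function. This Python sketch only restates the rules in this section; the function and parameter names are hypothetical, and the fallback to the home directory is modeled only for the default case, since that is the only case the text describes.

```python
import posixpath


def choose_working_dir(program_path, invoked_from, home,
                       cmdline_wd=None, schema_wd=None, use_D=False,
                       dir_exists=lambda path: True):
    """Return the directory a process would start in under the rules above."""
    if use_D and (cmdline_wd or schema_wd):
        raise ValueError("-wd is mutually exclusive with -D")
    if schema_wd:
        # A -wd in the schema overrides a -wd on the command line.
        return schema_wd
    if cmdline_wd:
        return cmdline_wd
    if use_D:
        # -D: the directory where the executable resides.
        return posixpath.dirname(program_path)
    # Default: the directory mpirun was invoked from, else the home directory.
    return invoked_from if dir_exists(invoked_from) else home


print(choose_working_dir("/opt/app/bin/prog", "/scratch/run1", "/home/user",
                         use_D=True))
```

The dir_exists parameter stands in for the check a remote node would make; by default it pretends every directory exists.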
Processes in the MPI application inherit their environment from the LAM
daemon upon the node on which they are running. The environment of a
LAM daemon is fixed upon booting of the LAM with lamboot(1) and is typ-
ically inherited from the user’s shell. On the origin node, this will
be the shell from which lamboot(1) was invoked; on remote nodes, the
exact environment is determined by the boot SSI module used by
lamboot(1). The rsh boot module, for example, uses either rsh or ssh to
launch the LAM daemon on remote nodes, and typically executes one or
more of the user’s shell-setup files before launching the LAM daemon.
When running dynamically linked applications which require the LD_LI-
BRARY_PATH environment variable to be set, care must be taken to ensure
that it is correctly set when booting the LAM.
Exported Environment Variables
All environment variables that are named in the form LAM_MPI_*,
LAM_IMPI_*, or IMPI_* will automatically be exported to new processes
on the local and remote nodes. This exporting may be inhibited with
the -nx option.
Additionally, the -x option to mpirun can be used to export specific
environment variables to the new processes. While the syntax of the -x
option allows the definition of new variables, note that the parser for
this option is currently not very sophisticated - it does not even un-
derstand quoted values. Users are advised to set variables in the en-
vironment and use -x to export them; not to define them.
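The export rules can be modeled as a filter over the environment. The Python sketch below is illustrative only: the names are hypothetical, it handles -x only for variables that already exist (the recommended usage), and it assumes that -nx suppresses the automatic exports but not variables named explicitly with -x.

```python
AUTO_EXPORT_PREFIXES = ("LAM_MPI_", "LAM_IMPI_", "IMPI_")


def exported_variables(environ, x_names=(), nx=False):
    """Return the variables that would be exported to new processes."""
    exported = {}
    if not nx:
        # Automatic export of LAM_MPI_*, LAM_IMPI_*, and IMPI_* variables.
        for name, value in environ.items():
            if name.startswith(AUTO_EXPORT_PREFIXES):
                exported[name] = value
    for name in x_names:
        # -x VAR exports an existing variable by name.
        if name in environ:
            exported[name] = environ[name]
    return exported


env = {"LAM_MPI_SSI_rpi": "tcp", "DISPLAY": ":0", "EDITOR": "vi"}
print(exported_variables(env, x_names=["DISPLAY"]))
```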
Two switches control trace generation from processes running under LAM
and both must be in the on position for traces to actually be generat-
ed. The first switch is controlled by mpirun and the second switch is
initially set by mpirun but can be toggled at runtime with
MPIL_Trace_on(2) and MPIL_Trace_off(2). The -t (-ton is equivalent)
and -toff options all turn on the first switch. Otherwise the first
switch is off and calls to MPIL_Trace_on(2) in the application program
are ineffective. The -t option also turns on the second switch. The
-toff option turns off the second switch. See MPIL_Trace_on(2) and
lamtrace(1) for more details.
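The interaction of the two switches can be captured in a tiny state model. This Python sketch is an illustration, not LAM code: the class and method names are hypothetical, it models message-passing traces only, and it assumes the second switch starts off when no trace option is given (which has no visible effect, since the first switch is off in that case).

```python
class TraceSwitches:
    """Model of the two trace switches described above."""

    def __init__(self, option=None):
        # option is "-t", "-ton", "-toff", or None (no trace option given).
        self.first = option in ("-t", "-ton", "-toff")
        self.second = option in ("-t", "-ton")

    def trace_on(self):
        """Model of a collective call to MPIL_Trace_on(2)."""
        self.second = True

    def trace_off(self):
        """Model of a call to MPIL_Trace_off(2)."""
        self.second = False

    @property
    def tracing_messages(self):
        # Traces are generated only when both switches are on.
        return self.first and self.second


run = TraceSwitches("-toff")
print(run.tracing_messages)  # False until the program turns tracing on
run.trace_on()
print(run.tracing_messages)  # True
```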
MPI Data Conversion
LAM’s MPI library converts MPI messages from local representation to
LAM representation upon sending them and then back to local
representation upon receiving them. In the case of a LAM consisting of a
homogeneous network of machines where the local representation differs
from the LAM representation this can result in unnecessary conversions.
The -O switch used to be necessary to indicate to LAM whether the
multicomputer was homogeneous or not. LAM now automatically determines
whether a given MPI job is homogeneous or not. The -O flag will
silently be accepted for backwards compatibility, but it is ignored.
SSI (System Services Interface)
The -ssi switch allows the passing of parameters to various SSI mod-
ules. LAM’s SSI modules are described in detail in lamssi(7). SSI
modules have direct impact on MPI programs because they allow tunable
parameters to be set at run time (such as which RPI communication de-
vice driver to use, what parameters to pass to that RPI, etc.).
The -ssi switch takes two arguments: <key> and <value>. The <key> ar-
gument generally specifies which SSI module will receive the value.
For example, the <key> "rpi" is used to select which RPI to be used for
transporting MPI messages. The <value> argument is the value that is
passed. For example:
mpirun -ssi rpi lamd N foo
Tells LAM to use the "lamd" RPI and to run a single copy of "foo"
on every node.
mpirun -ssi rpi tcp N foo
Tells LAM to use the "tcp" RPI.
mpirun -ssi rpi sysv N foo
Tells LAM to use the "sysv" RPI.
And so on. LAM’s RPI SSI modules are described in lamssi_rpi(7).
The -ssi switch can be used multiple times to specify different <key>
and/or <value> arguments. If the same <key> is specified more than
once, the <value>s are concatenated with a comma (",") separating them.
Note that the -ssi switch is simply a shortcut for setting environment
variables. The same effect may be accomplished by setting the
corresponding environment variables before running mpirun. The form of
the environment variables that LAM sets is: LAM_MPI_SSI_<key>=<value>.
Note that the -ssi switch overrides any previously set environment
variables. Also note that unknown <key> arguments are still set as
environment variables -- they are not checked (by mpirun) for correctness.
Illegal or incorrect <value> arguments may or may not be reported -- it
depends on the specific SSI module.
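The translation from -ssi arguments to environment variables can be sketched as follows. This Python model is an assumption-laden illustration: the function name is hypothetical, and it implements only what is stated above (the LAM_MPI_SSI_<key> naming, comma concatenation for repeated keys, no validation of keys, and override of previously set variables).

```python
def ssi_environment(pairs):
    """Build LAM_MPI_SSI_* variables from a list of (key, value) pairs."""
    env = {}
    for key, value in pairs:
        name = f"LAM_MPI_SSI_{key}"
        if name in env:
            # A repeated key concatenates its values with a comma.
            env[name] = env[name] + "," + value
        else:
            env[name] = value
    return env


# Values from -ssi replace any previously set variables of the same name.
previous = {"LAM_MPI_SSI_rpi": "lamd", "HOME": "/home/user"}
merged = {**previous, **ssi_environment([("rpi", "tcp")])}
print(merged["LAM_MPI_SSI_rpi"])
```

Under this model, "mpirun -ssi rpi tcp N foo" corresponds to running "mpirun N foo" with LAM_MPI_SSI_rpi set to "tcp" in the environment.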
The -ssi switch obsoletes the old -c2c and -lamd switches. These
switches used to be relevant because LAM could only have two RPIs
available at a time: the lamd RPI and one of the C2C RPIs. This is no
longer true -- all RPIs are now available and choosable at run-time.
Selecting the lamd RPI is shown in the examples above. The -c2c switch
has no direct translation since "C2C" used to refer to all other RPIs
that were not the lamd RPI. As such, -ssi rpi <value> must be used to
select the specific desired RPI (whether it is "lamd" or one of the
Guaranteed Envelope Resources
By default, LAM will guarantee a minimum amount of message envelope
buffering to each MPI process pair and will impede or report an error
to a process that attempts to overflow this system resource. This ro-
bustness and debugging feature is implemented in a machine specific
manner when direct communication is used. For normal LAM communication
via the LAM daemon, a protocol is used. The -nger option disables GER
and the measures taken to support it. The minimum GER is configured by
the system administrator when LAM is installed. See MPI(7) for more
mpirun N prog1
Load and execute prog1 on all nodes. Search the user’s $PATH for
the executable file on each node.
mpirun -c 8 prog1
Run 8 copies of prog1 wherever LAM wants to run them.
mpirun n8-10 -v -nw -s n3 prog1 -q
Load and execute prog1 on nodes 8, 9, and 10. Search for prog1 on
node 3 and transfer it to the three target nodes. Report as each
process is created. Give "-q" as a command line argument to each new
process. Do not wait for the processes to complete before exiting
mpirun -v myapp
Parse the application schema, myapp, and start all processes speci-
fied in it. Report as each process is created.
mpirun -npty -wd /work/output -x DISPLAY C my_application
Start one copy of "my_application" on each available CPU. The num-
ber of available CPUs on each node was previously specified when
LAM was booted with lamboot(1). As noted above, mpirun will schedule
adjoining ranks in MPI_COMM_WORLD on the same node where possible.
For example, if n0 has a CPU count of 8, and n1 has a CPU
count of 4, mpirun will place MPI_COMM_WORLD ranks 0 through 7 on
n0, and 8 through 11 on n1. This tends to maximize on-node commu-
nication for many parallel applications; when used in conjunction
with the multi-protocol network/shared memory RPIs in LAM (see the
RELEASE_NOTES and INSTALL files with the LAM distribution), overall
communication performance can be quite good. Also disable pseudo-
tty support, change directory to /work/output, and export the DIS-
PLAY variable to the new processes (perhaps my_application will in-
voke an X application such as xv to display output).
mpirun: Exec format error
A non-ASCII character was detected in the application schema. This
is usually a command line usage error where mpirun is expecting an
application schema and an executable file was given.
mpirun: syntax error in application schema, line XXX
The application schema cannot be parsed because of a usage or syn-
tax error on the given line in the file.
<filename>: No such file or directory
This error can occur in two cases. Either the named file cannot be
located or it has been found but the user does not have sufficient
permissions to execute the program or read the application schema.
mpirun returns 0 if all ranks started by mpirun exit after calling
MPI_FINALIZE. A non-zero value is returned if an internal error oc-
curred in mpirun, or one or more ranks exited before calling MPI_FINAL-
IZE. If an internal error occurred in mpirun, the corresponding error
code is returned. In the event that one or more ranks exit before
calling MPI_FINALIZE, mpirun returns the return value of the first
rank that it notices died before calling MPI_FINALIZE.
Note that, in general, this will be the first rank that died but is not
guaranteed to be so.
However, note that if the -nw switch is used, the return value from
mpirun does not indicate the exit status of the ranks.
bhost(5), lamexec(1), lamssi(7), lamssi_rpi(7), lamtrace(1), loadgo(1),
MPIL_Trace_on(2), mpimsg(1), mpitask(1)
LAM 7.1.1 September, 2004 MPIRUN(1)