aprun(1)                                               Last changed: 12-06-2011

NAME
aprun - Launches an application

SYNOPSIS
aprun [-a arch] [-b] [-B] [-cc cpu_list | keyword]
      [-cp cpu_placement_file_name] [-d depth] [-D value] [-L node_list]
      [-m size[h|hs]] [-n pes] [-N pes_per_node] [-F access_mode]
      [-p protection_domain_identifier] [-q] [-r cores]
      [-S pes_per_numa_node] [-sl list_of_numa_nodes]
      [-sn numa_nodes_per_node] [-ss] [-T] [-t sec]
      executable [arguments_for_executable]

IMPLEMENTATION
Cray Linux Environment (CLE)

DESCRIPTION
To run an application on compute nodes, use the Application Level Placement Scheduler (ALPS) aprun command. The aprun command specifies application resource requirements, requests application placement, and initiates application launch.

The aprun utility provides user identity and environment information as part of the application launch so that your login node session can be replicated for the application on the assigned set of compute nodes. This information includes the aprun current working directory, which must be accessible from the compute nodes. Before running aprun, ensure that your working directory is on a file system accessible from the compute nodes. This will likely be a Lustre-mounted directory, such as /lus/nid00007/user1/.

Do not suspend aprun; it is the local representative of the application that is running on compute nodes. If aprun is suspended, the application cannot communicate with ALPS (for example, to notify aprun that the application has completed).

Cray system compute node cores are paired with memory in NUMA (Non-Uniform Memory Access) nodes. A local NUMA node access is a memory access within the same NUMA node; a remote NUMA node access is a memory access between separate NUMA nodes of a Cray compute node. Remote NUMA node accesses have higher latency as a result of this configuration. Cray XE5 and Cray XK6 compute nodes have two NUMA nodes, while Cray XE6 compute nodes have four NUMA nodes.

MPMD Mode
You can use aprun to launch applications in Multiple Program, Multiple Data (MPMD) mode. The command format is:

aprun -n pes [other_aprun_options] executable1 [arguments_for_executable1] :
      -n pes [other_aprun_options] executable2 [arguments_for_executable2] :
      -n pes [other_aprun_options] executable3 [arguments_for_executable3] : ...

such as:

% aprun -n 12 ./app1 : -n 8 -d 2 ./app2 : -n 32 -N 2 ./app3

A space is required before and after each colon. On compute nodes, other_aprun_options can be -a, -cc, -cp, -d, -L, -n, -N, -S, -sl, -sn, and -ss. If you specify the -m option, it must appear in the first executable segment, and its value is used for all subsequent executables. If you specify -m more than once while launching multiple applications in MPMD mode, aprun returns an error. System commands are not supported in MPMD mode and return an error when used. For more information about MPMD mode, see Workload Management and Application Placement for the Cray Linux Environment.

Compute Nodes and NUMA
The terms CPU, integer core, and core are synonymous: each represents a single block of processor logic that serves as an execution engine. A core currently has a one-to-one relationship to processing elements. The term processor refers to the hardware package that goes in a single socket.

For eight-core Cray XE6 processors, NUMA nodes 0 through 3 have four cores each (logical CPUs 0-3, 4-7, 8-11, and 12-15, respectively). For 12-core Cray XE6 processors, NUMA nodes 0 through 3 have six cores each (logical CPUs 0-5, 6-11, 12-17, and 18-23, respectively). For quad-core Cray XE5 processors, NUMA node 0 has four cores (logical CPUs 0-3), and NUMA node 1 has four cores (logical CPUs 4-7). For six-core Cray XE5 processors, NUMA node 0 has six cores (logical CPUs 0-5), and NUMA node 1 has six cores (logical CPUs 6-11).

Two types of operations, remote-NUMA-node memory accesses and process migration, can reduce performance. The aprun command provides memory affinity and CPU affinity options that allow you to control these operations. For more information, see the Memory Affinity and CPU Affinity NOTES sections.

Note: Having a compute node reserved for your job does not guarantee that you can use all of its NUMA nodes. You have to request sufficient resources through qsub -l resource options and aprun placement options (-n, -N, -d, and/or -m) to be able to use all NUMA nodes. See the aprun option descriptions and the EXAMPLES section for more information.
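For example, on a 24-core Cray XE6 compute node (four NUMA nodes of six cores each), the following sketch reserves a full node and spreads eight PEs across all four of its NUMA nodes. The program name ./app, the node size, and the interactive batch session are assumptions for illustration:

% qsub -I -lmppwidth=24
% aprun -n 8 -S 2 ./app

Because -S 2 places two PEs per NUMA node, all four NUMA nodes of the assigned node are used. With the defaults, the same eight PEs would be bound to CPUs 0-7 and confined to the first two NUMA nodes.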
aprun Options
The aprun command accepts the following options:

-b
Bypasses the transfer of the application executable to compute nodes. By default, the executable is transferred to the compute nodes as part of the aprun process of launching an application. You would likely use the -b option only if the executable to be launched is part of a file system accessible from the compute nodes. For more information, see the EXAMPLES section.

-B
Tells ALPS to reuse the width, depth, nppn, and memory requests specified with the corresponding batch reservation. This option obviates the need to specify the aprun options -n, -d, -N, and -m; aprun exits with an error if any of them is specified together with -B.
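For example, a minimal sketch of letting aprun inherit its placement from the batch reservation (the resource values and the program name ./app are illustrative assumptions):

% qsub -I -lmppwidth=16 -lmppnppn=8 -lmppdepth=2
% aprun -B ./app

Here aprun derives placement equivalent to -n 16, -N 8, and -d 2 from the reservation, so none of those options (nor -m) may appear on the command line.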
-cc cpu_list | keyword
Binds processing elements (PEs) to CPUs. CLE does not migrate processes that are bound to a CPU. This option applies to all multicore compute nodes. The cpu_list is not used for placement decisions; it is used only by CLE during application execution. For further information about binding (CPU affinity), see the CPU Affinity NOTES section.

The cpu_list is a comma-separated or hyphen-separated list of logical CPU numbers and/or ranges. As PEs are created, they are bound to the CPUs in cpu_list in order: the first PE created is bound to the first CPU in cpu_list, the second PE created is bound to the second CPU, and so on. If more PEs are created than there are entries in cpu_list, binding wraps around and starts again with the first CPU in cpu_list. The cpu_list can also contain an x, which indicates that the application-created process at that location in the fork sequence should not be bound to a CPU.

If a PE creates any threads or child processes, those threads or processes are bound to CPUs from the cpu_list in the same manner as PEs. If multiple PEs are created on a compute node, you may optionally specify a cpu_list for each PE. Multiple cpu_lists are separated by colons (:). This gives you control over placement for PEs that might otherwise conflict with other PEs that are simultaneously creating child processes and threads of their own.

% aprun -n 2 -d 3 -cc 0,1,2:4,5,6 ./a.out

The example above contains two cpu_lists. The first (0,1,2) is applied to the first PE created and any threads or child processes that result. The second (4,5,6) is applied to the second PE created and any threads or child processes that result.

Out-of-range cpu_list values are ignored unless all CPU values are out of range, in which case an error message is issued. If you want to bind PEs starting with the highest CPU on a compute node and work down from there, you might use this -cc option:

% aprun -n 8 -cc 7-0 ./a.out

See Example 4: Binding PEs to CPUs (-cc cpu_list options).

The following keyword values can be used:

· The cpu keyword (the default) binds each PE to a CPU within the assigned NUMA node. You do not have to indicate a specific CPU. If you specify a depth per PE (aprun -d depth), the PEs are constrained to CPUs with a distance of depth between them, so each PE's threads can be constrained to the CPUs closest to the PE's CPU. The -cc cpu option is the typical use case for an MPI application.

Note: If you oversubscribe CPUs for an OpenMP application, Cray recommends that you not use the -cc cpu default. Try the -cc none and -cc numa_node options and compare results to determine which option produces the better performance.

· The numa_node keyword constrains a PE to the CPUs within the assigned NUMA node. CLE can migrate a PE among the CPUs in the assigned NUMA node but not off the assigned NUMA node. For example, if PE 2 is assigned to NUMA node 0, CLE can migrate PE 2 among CPUs 0-3 but not among CPUs 4-7. If PEs create threads, the threads are constrained to the same NUMA-node CPUs as the PEs, with one exception: if depth is greater than the number of CPUs per NUMA node, once the number of threads created by a PE exceeds the number of CPUs per NUMA node, the remaining threads are constrained to CPUs within the next NUMA node on the compute node. For example, if depth is 5, threads 0-3 are constrained to CPUs 0-3 and thread 4 is constrained to CPUs 4-7.

· The none keyword allows PE migration within the assigned NUMA nodes.

For more information about the -cc keywords, see Example 5: Binding PEs to CPUs (-cc keyword options).

-cp cpu_placement_file_name
(Deferred implementation) Provides the name of a CPU binding placement file. This option applies to all multicore compute nodes. The file must be located on a file system accessible from the compute nodes. The CPU placement file provides more extensive CPU binding instructions than the -cc options.

-D value
The -D option value is an integer bitmask that controls debug verbosity, where:

· A value of 1 provides a small level of debug messages
· A value of 2 provides a medium level of debug messages
· A value of 4 provides a high level of debug messages

Because this option is a bitmask, value can combine any or all of the above levels; valid values are 0 through 7. For example, -D 3 provides all small and medium level debug messages.
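As a quick illustration (the program name ./a.out is a placeholder), the following sketch combines the small (1) and medium (2) debug levels by passing their sum:

% aprun -D 3 -n 4 ./a.out

To include the high level as well, pass -D 7 (1+2+4). Do not combine -D with -q; as noted in the -q description, aprun terminates the application if both are specified.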
-d depth
Specifies the number of CPUs for each PE and its threads. ALPS allocates the number of CPUs equal to depth times pes. The -cc cpu_list option can restrict the placement of threads, resulting in more than one thread per CPU. The default depth is 1.

For OpenMP applications, use both the OMP_NUM_THREADS environment variable to specify the number of threads and the aprun -d option to specify the number of CPUs hosting the threads. ALPS creates -n pes instances of the executable, and the executable spawns OMP_NUM_THREADS-1 additional threads per PE.

Note: For a PathScale OpenMP program, set the PSC_OMP_AFFINITY environment variable to FALSE.

For Cray systems, compute nodes must have at least depth CPUs. For Cray XE5 systems, depth cannot exceed 12; for Cray XE6 compute blades, depth cannot exceed 32. See Example 3: OpenMP threads (-d option).

-L node_list
Specifies the candidate nodes to constrain application placement. The syntax allows a comma-separated list of nodes (such as -L 32,33,40), a range of nodes (such as -L 41-87), or a combination of both formats. Node values can be expressed in decimal, octal (preceded by 0), or hexadecimal (preceded by 0x). The first number in a range must be less than the second number (8-6, for example, is invalid), but the nodes in a list can be in any order. See Example 12: Using node lists (-L option).

This option is used for applications launched interactively; use the qsub -lmppnodes=\"node_list\" option for batch and interactive batch jobs. If the placement node list contains fewer nodes than the number required, a fatal error is produced. If resources are not currently available, aprun continues to retry. A common source of node lists is the cnselect command. See the cnselect(1) man page for details.

-m size[h|hs]
Specifies the per-PE required Resident Set Size (RSS) memory size in megabytes. K, M, and G suffixes (case insensitive) are supported (16M = 16m = 16 megabytes, for example). If you do not include the -m option, the default amount of memory available to each PE equals the minimum value of (compute node memory size) / (number of CPUs), calculated for each compute node. For example, given Cray XE5 compute nodes with 32 GB of memory and 8 CPUs, the default per-PE memory size is 32 GB / 8 CPUs = 4 GB. See Example 10: Memory per PE (-m option).

If you want huge pages allocated for a Cray XE application, use the h or hs suffix. The default huge page size for Cray XE systems is 2 MB. On Cray XE systems, additional sizes are available: 128KB, 512KB, 8MB, 16MB, and 64MB. The -m option is not required on Cray XE systems because the kernel allows the dynamic creation of huge pages. However, when memory requirements are known, it is advisable to specify this option and preallocate an appropriate number of huge pages to reduce operating system overhead. See Hugepages for Cray Systems.

-m sizeh
Requests memory size to be allocated to each PE, where memory is preferentially allocated out of the huge page pool. All nodes use as much huge page memory as they are able to allocate and 4 KB pages thereafter. See the NOTES section and Example 11: Using huge pages (-m h and hs suffixes).

-m sizehs
Requests memory size to be allocated to each PE, where memory is allocated out of the huge page pool. If the request cannot be satisfied, an error message is issued and the application launch is terminated. See Example 11: Using huge pages (-m h and hs suffixes).

Note: To use huge pages, you must first load the huge pages library during the linking phase, such as:

% cc -c my_hugepages_app.c
% cc -o my_hugepages_app my_hugepages_app.o -lhugetlbfs

Then set the huge pages environment variable:

% setenv HUGETLB_MORECORE yes

or

% export HUGETLB_MORECORE=yes

-n pes
Specifies the number of processing elements (PEs) needed for your application. A PE is an instance of an ALPS-launched executable. The number of PEs can be expressed in decimal, octal, or hexadecimal form. If pes has a leading 0, it is interpreted as octal (-n 16 specifies 16 PEs, but -n 016 is interpreted as 14 PEs). If pes has a leading 0x, it is interpreted as hexadecimal (-n 16 specifies 16 PEs, but -n 0x16 is interpreted as 22 PEs). The default is 1. See Example 1: PE placement (-n option).

-N pes_per_node
Specifies the number of PEs to place per node. You can use this option to reduce the number of PEs per node, thereby making more resources available per PE. For Cray systems, the default is the number of CPUs available on a node, and the maximum pes_per_node is 24.

-F exclusive|share
exclusive mode provides a program with exclusive access to all the processing and memory resources on a node. Using this option along with the -cc option binds processes to the CPUs listed in the affinity string. share mode restricts the application-specific cpuset contents to only the application-reserved cores and memory on NUMA node boundaries, meaning the application does not have access to cores and memory on other NUMA nodes on that compute node.

The exclusive option does not need to be specified because exclusive access mode is enabled by default. However, if nodeShare is set to share in /etc/alps.conf, you must use -F exclusive to override the policy set in that file. You can check the value of nodeShare by executing apstat -svv | grep access.
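For example, you might check the site-wide access mode and then explicitly request exclusive mode for a run. This is a minimal sketch; the PE count and the program name ./app are placeholders:

% apstat -svv | grep access
% aprun -F exclusive -n 16 ./app

If apstat reports that nodeShare is set to share, the -F exclusive option overrides that policy for the launched application.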
-p protection_domain_identifier
Requests use of a protection domain, using the specified user-preallocated protection domain identifier. You cannot use this option with protection domains already allocated by system services. Any cooperating set of applications must specify the same aprun -p option to have access to the shared protection domain. aprun returns an error if the protection domain identifier is not recognized or if the user is not the owner of the specified protection domain identifier.

-q
Specifies quiet mode and suppresses all aprun-generated non-fatal messages. Do not use this option with the -D (debug) option; aprun terminates the application if both options are specified. Even with the -q option, aprun writes its help message and any fatal ALPS message when exiting. Normally, this option should not be used.

-r cores
Enables core specialization on Cray compute nodes, where the number of cores specified is the number of system services cores per node for the application.

-S pes_per_numa_node
Specifies the number of PEs to allocate per NUMA node. This option applies to both Cray XE5 and Cray XE6 compute nodes. You can use this option to reduce the number of PEs per NUMA node, thereby making more resources available per PE. The pes_per_numa_node value can be 1-6. For eight-core Cray XE5 nodes, the default is 4. For 12-core Cray XE5 and 24-core Cray XE6 nodes, the default is 6. A zero value is not allowed and is a fatal error. For more information, see the Memory Affinity NOTES section and Example 6: Optimizing NUMA-node memory references (-S option).

-sl list_of_numa_nodes
Specifies the NUMA node or nodes (comma-separated or hyphen-separated) to use for application placement. A space is required between -sl and list_of_numa_nodes. This option applies to Cray XE5 and Cray XE6 compute nodes. The list_of_numa_nodes value can be a list such as -sl 0,1 on Cray XE5 compute nodes or -sl 0,1,2,3 on Cray XE6 compute nodes, or a range such as -sl 0-1. The default is no placement constraints. You can use this option to find out if restricting your PEs to one NUMA node per node affects performance. List NUMA nodes in ascending order; -sl 1-0 and -sl 1,0 are invalid. For more information, see the Memory Affinity NOTES section and Example 7: Optimizing NUMA-node memory references (-sl option).

-sn numa_nodes_per_node
Specifies the number of NUMA nodes per node to be allocated. A space is required between -sn and numa_nodes_per_node. This option applies to Cray XE5 and Cray XE6 compute nodes. The numa_nodes_per_node value can be 1 or 2 on Cray XE5 compute nodes, or 1, 2, 3, or 4 on Cray XE6 compute nodes. The default is no placement constraints. You can use this option to find out if restricting your PEs to one NUMA node per node affects performance. A zero value is not allowed and is a fatal error. For more information, see the Memory Affinity NOTES section and Example 8: Optimizing NUMA node-memory references (-sn option).

-ss
Specifies strict memory containment per NUMA node. This option applies to Cray XE5 and Cray XE6 compute nodes. When -ss is specified, a PE can allocate only the memory local to its assigned NUMA node. The default is to allow remote-NUMA-node memory allocation to all assigned NUMA nodes. You can use this option to find out if restricting each PE's memory access to local-NUMA-node memory affects performance. For more information, see the Memory Affinity NOTES section.

-T
Synchronizes the application's stdout and stderr to prevent interleaving of its output.

-t sec
Specifies the per-PE CPU time limit in seconds. The sec time limit is constrained by your CPU time limit on the login node. For example, if your time limit on the login node is 3600 seconds but you specify a -t value of 5000, your application is constrained to 3600 seconds per PE. If your time limit on the login node is unlimited, the sec value is used (or, if not specified, the time per PE is unlimited). You can determine your CPU time limit by using the limit command (csh) or the ulimit -a command (bash).
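For example, you might check your current login-node limits and then launch with a per-PE limit of 300 CPU seconds. This is a sketch; the PE count and the program name ./app are placeholders:

% limit
% aprun -t 300 -n 8 ./app

Under bash, use ulimit -a instead of limit. If your login-node CPU time limit is lower than the requested value, the lower limit applies.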
:
Separates the names of executables and their associated options for Multiple Program, Multiple Data (MPMD) mode. A space is required before and after the colon.

NOTES

Standard I/O
When an application has been launched on compute nodes, aprun forwards stdin only to PE 0 of the application. All of the other application PEs have stdin set to /dev/null. An application's stdout and stderr messages are sent from the compute nodes back to aprun for display.

Signal Processing
The aprun command forwards the following signals to an application:

· SIGHUP
· SIGINT
· SIGQUIT
· SIGTERM
· SIGABRT
· SIGUSR1
· SIGUSR2
· SIGURG
· SIGWINCH
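For example, because aprun forwards SIGUSR1, you can deliver that signal to a running application by signaling the aprun process itself. This is a sketch; the PID 12345 stands in for the actual aprun process ID (find it with ps, for example), and ./app is a placeholder:

% aprun -n 16 ./app &
% kill -USR1 12345

aprun forwards the signal to the application on the compute nodes.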
User Environment Variables
The following environment variables modify the behavior of aprun:

APRUN_DEFAULT_MEMORY
Specifies the default per-PE memory size. An explicit aprun -m value overrides this setting.

APRUN_XFER_LIMITS
Sets the rlimit() transfer limits for aprun. If this is set to a non-zero string, aprun transfers the {get,set}rlimit() limits to apinit, which uses those limits on the compute nodes. If it is not set or is set to 0, no limits are transferred other than RLIMIT_CORE, RLIMIT_CPU, and possibly RLIMIT_RSS.

APRUN_SYNC_TTY
Sets synchronous tty for stdout and stderr output. Any non-zero value enables synchronous tty output. An explicit aprun -T value overrides this value.

PGAS_ERROR_FILE
Redirects error messages issued by the PGAS library (libpgas) to the standard output stream when set to stdout. The default is stderr.

Output Environment Variables
ALPS passes values to the following application environment variables:

ALPS_APP_DEPTH
Reflects the aprun -d value as determined by apshepherd. The default is 1. The value can differ between compute nodes or sets of compute nodes when executing an MPMD job; in that case, an instance of apshepherd determines the appropriate value locally for each executable.

Memory Affinity
Cray XE5 compute blades use dual-socket quad-core or dual-socket six-core compute nodes. Cray XE6 compute blades use dual-socket twelve-core or eight-core compute nodes. Cray XK6 compute blades use a single G34-socket host processor accompanied by a guest GPU in each compute node; the NUMA designation applies only to the host. Because these designs allow more tasks to run simultaneously, they can increase overall performance. However, remote-NUMA-node memory references, such as a process running on NUMA node 0 accessing NUMA node 1 memory, can adversely affect performance. To give you run time controls that can optimize memory references, Cray provides the following aprun memory affinity options:

· -S pes_per_numa_node
· -sl list_of_numa_nodes
· -sn numa_nodes_per_node
· -ss

Hugepages for Cray Systems
When memory usage, specifically memory which is mapped through the high speed network, exceeds 2 GB on a single node, an application should be linked with the libhugetlbfs library to use the larger address range available with huge pages. At run time, set HUGETLB_ELFMAP=W to map static data to huge pages and set HUGETLB_MORECORE=yes to map the private heap to huge pages. See intro_hugepages(1) for more information.

CPU Affinity
CPU affinity options enable you to bind a PE or thread to a particular CPU or a subset of CPUs on a node. These options apply to all Cray multicore compute nodes.

The compute node kernel can dynamically distribute work by allowing PEs and threads to migrate from one CPU to another within a node. In some cases, moving PEs or threads from CPU to CPU increases cache and translation lookaside buffer (TLB) misses and therefore reduces performance. Also, there may be cases where an application runs faster by avoiding or targeting a particular CPU. Cray systems support the following aprun CPU affinity options: -cc cpu_list | keyword.

Note: On Cray compute nodes, your application can access only the resources you request on the aprun or qsub command (or default values). Your application does not have automatic access to all of a compute node's resources. For example, if you request four or fewer CPUs per dual-socket, quad-core compute node and you are not using the aprun -m option, your application can access only the CPUs and memory of a single NUMA node per node. If you include CPU affinity options that reference the other NUMA node's resources, the kernel either ignores those options or terminates the application. For more information, see Example 4: Binding PEs to CPUs (-cc cpu_list options) and Workload Management and Application Placement for the Cray Linux Environment.

Core Specialization
When you use the -r option, cores are assigned to system services associated with your application. Using this option may improve the performance of your application. The width parameter of the batch reservation (for example, mppwidth) may be affected. To help you calculate the appropriate width when using core specialization, you can use apcount. For more information, see the apcount(1) man page.

Resolving "Claim exceeds reservation's node-count" Errors
If your aprun command requests more nodes than were reserved by the qsub command, ALPS displays the "Claim exceeds reservation's node-count" error message. For batch jobs, the number of nodes reserved is set when the qsub command is successfully processed. If you subsequently request additional nodes through aprun affinity options, apsched issues the error message and aprun exits.

For example, on a Cray system, the following qsub command reserves two nodes (290 and 294):

% qsub -I -lmppwidth=4 -lmppnppn=2
% aprun -n 4 -N 2 ./xthi | sort
Application 225100 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00290. (core affinity = 0)
Hello from rank 1, thread 0, on nid00290. (core affinity = 1)
Hello from rank 2, thread 0, on nid00294. (core affinity = 0)
Hello from rank 3, thread 0, on nid00294. (core affinity = 1)

In contrast, the following aprun command fails because the -S 1 option constrains placement to one PE per NUMA node; two additional nodes would be required:

% aprun -n 4 -N 2 -S 1 ./xthi | sort
Claim exceeds reservation's CPUs

ERRORS
If all application processes exit normally, aprun exits with zero. If there is an internal aprun error or a fatal message is received from ALPS on a compute node, aprun exits with 1. Otherwise, the aprun exit code is 128 plus the termination signal number of an application process that was abnormally terminated, or the aprun exit code is the exit code of an application process that exited abnormally. The ordering of exit signals and exit codes is arbitrary; aprun retains and displays only four application signals and exit codes.
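For example, you can inspect the aprun exit code after a run to distinguish normal completion from a failure (the program name ./app is a placeholder):

% aprun -n 8 ./app
% echo $status

Under bash, use echo $? instead. A value of 0 indicates that all application processes exited normally, 1 indicates an internal aprun error or a fatal ALPS message, and 128 plus a signal number indicates that an application process was terminated by that signal.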
LIMITATIONS
Cray systems currently do not support running more than one application on a compute node.

EXAMPLES

Example 1: PE placement (-n option)

ALPS uses the smallest number of nodes available to fulfill the -n requirements. For example, the command:

% aprun -n 32 ./a.out

places 32 PEs on:

· Cray XE5 dual-socket, quad-core processors on 4 nodes.
· Cray XE5 dual-socket, six-core processors on 3 nodes.
· Cray XE6 dual-socket, eight-core processors on 2 nodes.
· Cray XE6 dual-socket, 12-core processors on 2 nodes.
· Cray XE6 dual-socket, 16-core processors on 1 node.

Note: Cray XK6 nodes are populated with single-socket host processors. There is still a one-to-one relationship between PEs and host processor cores. The above aprun command would place 32 PEs on:

· Cray XK6 single-socket, eight-core processors on 4 nodes.
· Cray XK6 single-socket, 12-core processors on 3 nodes.
· Cray XK6 single-socket, 16-core processors on 2 nodes.

The following example runs 12 PEs on three quad-core compute nodes (nodes 28-30):

% cnselect coremask.eq.15
28-95,128-207
% qsub -I -lmppwidth=12 -lmppnodes=\"28-95,128-207\"
% aprun -n 12 ./xthi | sort
Application 1071056 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 0, thread 1, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 1, thread 1, on nid00028. (core affinity = 1)
Hello from rank 10, thread 0, on nid00030. (core affinity = 2)
Hello from rank 10, thread 1, on nid00030. (core affinity = 2)
Hello from rank 11, thread 0, on nid00030. (core affinity = 3)
Hello from rank 11, thread 1, on nid00030. (core affinity = 3)
Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
Hello from rank 2, thread 1, on nid00028. (core affinity = 2)
Hello from rank 3, thread 0, on nid00028. (core affinity = 3)
Hello from rank 3, thread 1, on nid00028. (core affinity = 3)
Hello from rank 4, thread 0, on nid00029. (core affinity = 0)
Hello from rank 4, thread 1, on nid00029. (core affinity = 0)
Hello from rank 5, thread 0, on nid00029. (core affinity = 1)
Hello from rank 5, thread 1, on nid00029. (core affinity = 1)
Hello from rank 6, thread 0, on nid00029. (core affinity = 2)
Hello from rank 6, thread 1, on nid00029. (core affinity = 2)
Hello from rank 7, thread 0, on nid00029. (core affinity = 3)
Hello from rank 7, thread 1, on nid00029. (core affinity = 3)
Hello from rank 8, thread 0, on nid00030. (core affinity = 0)
Hello from rank 8, thread 1, on nid00030. (core affinity = 0)
Hello from rank 9, thread 0, on nid00030. (core affinity = 1)
Hello from rank 9, thread 1, on nid00030. (core affinity = 1)

The following example runs 12 PEs on one dual-socket, six-core compute node:

% cnselect coremask.eq.4095
168-171, 172-175, 176-179
% qsub -I -lmppwidth=12 -lmppnodes=\"168-171\"
% aprun -n 12 ./xthi | sort
Application 225101 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00170. (core affinity = 0)
Hello from rank 10, thread 0, on nid00170. (core affinity = 10)
Hello from rank 11, thread 0, on nid00170. (core affinity = 11)
Hello from rank 1, thread 0, on nid00170. (core affinity = 1)
Hello from rank 2, thread 0, on nid00170. (core affinity = 2)
Hello from rank 3, thread 0, on nid00170. (core affinity = 3)
Hello from rank 4, thread 0, on nid00170. (core affinity = 4)
Hello from rank 5, thread 0, on nid00170. (core affinity = 5)
Hello from rank 6, thread 0, on nid00170. (core affinity = 6)
Hello from rank 7, thread 0, on nid00170. (core affinity = 7)
Hello from rank 8, thread 0, on nid00170. (core affinity = 8)
Hello from rank 9, thread 0, on nid00170. (core affinity = 9)

Example 2: PEs per node (-N option)

If you want more compute node resources available for each PE, you can use the -N option. For example, the following command used on a quad-core system runs all PEs on one compute node:

% cnselect coremask.eq.15
25-88
% qsub -I -lmppwidth=4 -lmppnodes=\"25-88\"
% aprun -n 4 ./xthi | sort
Application 225102 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
Hello from rank 3, thread 0, on nid00028. (core affinity = 3)

In contrast, the following commands restrict placement to 1 PE per node:

% qsub -I -lmppwidth=4 -lmppnppn=1 -lmppnodes=\"25-88\"
% aprun -n 4 -N 1 ./xthi | sort
Application 225103 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00029. (core affinity = 0)
Hello from rank 2, thread 0, on nid00030. (core affinity = 0)
Hello from rank 3, thread 0, on nid00031. (core affinity = 0)

Example 3: OpenMP threads (-d option)

For OpenMP applications, use the OMP_NUM_THREADS environment variable to specify the number of OpenMP threads and the -d option to specify the depth (number of CPUs) to be reserved for each PE and its threads.

Note: If you are using a PathScale compiler, set the PSC_OMP_AFFINITY environment variable to FALSE before compiling:

% setenv PSC_OMP_AFFINITY FALSE

or:

% export PSC_OMP_AFFINITY=FALSE

ALPS creates -n pes instances of the executable, and the executable spawns OMP_NUM_THREADS-1 additional threads per PE.

For example, if we use dual-socket, quad-core compute nodes, set OMP_NUM_THREADS to 4, request four PEs, and use the default depth (-d 1), then each PE spawns three additional threads:

% cnselect coremask.eq.255
28-95
% qsub -I -lmppwidth=4 -lmppnodes=\"28-95\"
% setenv OMP_NUM_THREADS 4
% aprun -n 4 ./xthi | sort
Application 1304346 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00092. (core affinity = 0)
Hello from rank 0, thread 1, on nid00092. (core affinity = 0)
Hello from rank 0, thread 2, on nid00092. (core affinity = 0)
Hello from rank 0, thread 3, on nid00092. (core affinity = 0)
Hello from rank 1, thread 0, on nid00092. (core affinity = 1)
Hello from rank 1, thread 1, on nid00092. (core affinity = 1)
Hello from rank 1, thread 2, on nid00092. (core affinity = 1)
Hello from rank 1, thread 3, on nid00092. (core affinity = 1)
Hello from rank 2, thread 0, on nid00092. (core affinity = 2)
Hello from rank 2, thread 1, on nid00092. (core affinity = 2)
Hello from rank 2, thread 2, on nid00092. (core affinity = 2)
Hello from rank 2, thread 3, on nid00092. (core affinity = 2)
Hello from rank 3, thread 0, on nid00092. (core affinity = 3)
Hello from rank 3, thread 1, on nid00092. (core affinity = 3)
Hello from rank 3, thread 2, on nid00092. (core affinity = 3)
Hello from rank 3, thread 3, on nid00092. (core affinity = 3)

Because we used the default depth, each PE (rank) and its threads execute on one CPU of a single compute node. By setting the depth to 4, each PE and its threads run on separate CPUs:

% cnselect coremask.eq.255
28-95
% qsub -I -lmppwidth=4 -lmppdepth=4 -lmppnodes=\"28-95\"
% setenv OMP_NUM_THREADS 4
% aprun -n 4 -d 4 ./xthi | sort
Application 225105 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 0, thread 1, on nid00028. (core affinity = 1)
Hello from rank 0, thread 2, on nid00028. (core affinity = 2)
Hello from rank 0, thread 3, on nid00028. (core affinity = 3)
Hello from rank 1, thread 0, on nid00028. (core affinity = 4)
Hello from rank 1, thread 1, on nid00028. (core affinity = 5)
Hello from rank 1, thread 2, on nid00028. (core affinity = 6)
Hello from rank 1, thread 3, on nid00028. (core affinity = 7)
Hello from rank 2, thread 0, on nid00029. (core affinity = 0)
Hello from rank 2, thread 1, on nid00029. (core affinity = 1)
Hello from rank 2, thread 2, on nid00029. (core affinity = 2)
Hello from rank 2, thread 3, on nid00029. (core affinity = 3)
Hello from rank 3, thread 0, on nid00029. (core affinity = 4)
Hello from rank 3, thread 1, on nid00029. (core affinity = 5)
Hello from rank 3, thread 2, on nid00029. (core affinity = 6)
Hello from rank 3, thread 3, on nid00029. (core affinity = 7)

If you want all of a compute node's cores and memory available for one PE and its threads, use -n 1 and -d depth. In the following example, one PE and its threads run on cores 0-11 of a 12-core Cray XE5 compute node:

% setenv OMP_NUM_THREADS 12
% aprun -n 1 -d 12 ./xthi | sort
Application 286315 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00514. (core affinity = 0)
Hello from rank 0, thread 10, on nid00514. (core affinity = 10)
Hello from rank 0, thread 11, on nid00514. (core affinity = 11)
Hello from rank 0, thread 1, on nid00514. (core affinity = 1)
Hello from rank 0, thread 2, on nid00514. (core affinity = 2)
Hello from rank 0, thread 3, on nid00514. (core affinity = 3)
Hello from rank 0, thread 4, on nid00514. (core affinity = 4)
Hello from rank 0, thread 5, on nid00514. (core affinity = 5)
Hello from rank 0, thread 6, on nid00514. (core affinity = 6)
Hello from rank 0, thread 7, on nid00514. (core affinity = 7)
Hello from rank 0, thread 8, on nid00514. (core affinity = 8)
Hello from rank 0, thread 9, on nid00514. (core affinity = 9)
Example 4: Binding PEs to CPUs (-cc cpu_list options)

This example uses the -cc option to bind the PEs to CPUs 0-2:

% aprun -n 6 -cc 0-2 ./xthi | sort
Application 225107 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
Hello from rank 3, thread 0, on nid00028. (core affinity = 0)
Hello from rank 4, thread 0, on nid00028. (core affinity = 1)
Hello from rank 5, thread 0, on nid00028. (core affinity = 2)

Normally, if the -d option and the OMP_NUM_THREADS values are equal, each PE and its threads run on separate CPUs. However, the -cc cpu_list option can restrict the dynamic placement of PEs and threads:

% setenv OMP_NUM_THREADS 5
% aprun -n 4 -d 4 -cc 2,4 ./xthi | sort
Application 225108 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 2)
Hello from rank 0, thread 1, on nid00028. (core affinity = 2)
Hello from rank 0, thread 2, on nid00028. (core affinity = 4)
Hello from rank 0, thread 3, on nid00028. (core affinity = 2)
Hello from rank 0, thread 4, on nid00028. (core affinity = 2)
Hello from rank 1, thread 0, on nid00028. (core affinity = 4)
Hello from rank 1, thread 1, on nid00028. (core affinity = 4)
Hello from rank 1, thread 2, on nid00028. (core affinity = 4)
Hello from rank 1, thread 3, on nid00028. (core affinity = 2)
Hello from rank 1, thread 4, on nid00028. (core affinity = 4)
Hello from rank 2, thread 0, on nid00029. (core affinity = 2)
Hello from rank 2, thread 1, on nid00029. (core affinity = 4)
Hello from rank 2, thread 2, on nid00029. (core affinity = 4)
Hello from rank 2, thread 3, on nid00029. (core affinity = 2)
Hello from rank 2, thread 4, on nid00029. (core affinity = 4)
Hello from rank 3, thread 0, on nid00029. (core affinity = 4)
Hello from rank 3, thread 1, on nid00029. (core affinity = 2)
Hello from rank 3, thread 2, on nid00029. (core affinity = 2)
Hello from rank 3, thread 3, on nid00029. (core affinity = 2)
Hello from rank 3, thread 4, on nid00029. (core affinity = 4)

If depth is greater than the number of CPUs per NUMA node, once the number of threads created by the PE exceeds the number of CPUs per NUMA node, the remaining threads are constrained to CPUs within the next NUMA node on the compute node. In the following example, all threads are placed on NUMA node 0 except thread 6, which is placed on NUMA node 1:

% setenv OMP_NUM_THREADS 7
% aprun -n 2 -d 7 ./xthi | sort
Application 286320 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00514. (core affinity = 0)
Hello from rank 0, thread 1, on nid00514. (core affinity = 1)
Hello from rank 0, thread 2, on nid00514. (core affinity = 2)
Hello from rank 0, thread 3, on nid00514. (core affinity = 3)
Hello from rank 0, thread 4, on nid00514. (core affinity = 4)
Hello from rank 0, thread 5, on nid00514. (core affinity = 5)
Hello from rank 0, thread 6, on nid00514. (core affinity = 6)
Hello from rank 1, thread 0, on nid00262. (core affinity = 0)
Hello from rank 1, thread 1, on nid00262. (core affinity = 1)
Hello from rank 1, thread 2, on nid00262. (core affinity = 2)
Hello from rank 1, thread 3, on nid00262. (core affinity = 3)
Hello from rank 1, thread 4, on nid00262. (core affinity = 4)
Hello from rank 1, thread 5, on nid00262. (core affinity = 5)
Hello from rank 1, thread 6, on nid00262. (core affinity = 6)

Example 5: Binding PEs to CPUs (-cc keyword options)

By default, each PE is bound to a CPU (-cc cpu). For a Cray XE5 application, each PE runs on a separate CPU of NUMA nodes 0 and 1. In the following example, each PE is bound to a CPU of a 12-core Cray XE5 compute node:

% aprun -n 12 -cc cpu ./xthi | sort
Application 286323 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00514. (core affinity = 0)
Hello from rank 10, thread 0, on nid00514. (core affinity = 10)
Hello from rank 11, thread 0, on nid00514. (core affinity = 11)
Hello from rank 1, thread 0, on nid00514. (core affinity = 1)
Hello from rank 2, thread 0, on nid00514. (core affinity = 2)
Hello from rank 3, thread 0, on nid00514. (core affinity = 3)
Hello from rank 4, thread 0, on nid00514. (core affinity = 4)
Hello from rank 5, thread 0, on nid00514. (core affinity = 5)
Hello from rank 6, thread 0, on nid00514. (core affinity = 6)
Hello from rank 7, thread 0, on nid00514. (core affinity = 7)
Hello from rank 8, thread 0, on nid00514. (core affinity = 8)
Hello from rank 9, thread 0, on nid00514. (core affinity = 9)

In the following example, each PE is bound to the CPUs within a NUMA node; CLE can migrate PEs among the CPUs in the assigned NUMA node but not off the assigned NUMA node:

% cnselect coremask.eq.255
28-95
% qsub -I -lmppwidth=8 -lmppnodes=\"28-95\"
% aprun -n 8 -cc numa_node ./xthi | sort
Application 225113 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0-3)
Hello from rank 1, thread 0, on nid00028. (core affinity = 0-3)
Hello from rank 2, thread 0, on nid00028. (core affinity = 0-3)
Hello from rank 3, thread 0, on nid00028. (core affinity = 0-3)
Hello from rank 4, thread 0, on nid00028. (core affinity = 4-7)
Hello from rank 5, thread 0, on nid00028. (core affinity = 4-7)
Hello from rank 6, thread 0, on nid00028. (core affinity = 4-7)
Hello from rank 7, thread 0, on nid00028. (core affinity = 4-7)

The following command specifies no binding; CLE can migrate threads among all the CPUs of node 28:

% aprun -n 8 -cc none ./xthi | sort
Application 225116 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 1, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 2, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 3, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 4, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 5, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 6, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 7, thread 0, on nid00028. (core affinity = 0-7)

In the following example, multiple cpu_lists are specified. Each PE is bound to the first CPU on a NUMA node, each PE creates one thread, and all odd-numbered CPUs are skipped:

% aprun -n 4 -cc 0,2:4,6:8,10:12,14 ./xthi
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 0, thread 1, on nid00028. (core affinity = 2)
Hello from rank 1, thread 0, on nid00028. (core affinity = 4)
Hello from rank 1, thread 1, on nid00028. (core affinity = 6)
Hello from rank 2, thread 0, on nid00028. (core affinity = 8)
Hello from rank 2, thread 1, on nid00028. (core affinity = 10)
Hello from rank 3, thread 0, on nid00028. (core affinity = 12)
Hello from rank 3, thread 1, on nid00028. (core affinity = 14)
Example 6: Optimizing NUMA-node memory references (-S option)

This example uses the -S option to restrict placement of PEs to one per NUMA node. Two compute nodes are required, with one PE on NUMA node 0 and one PE on NUMA node 1 of each node:

% aprun -n 4 -S 1 ./xthi | sort
Application 225117 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00043. (core affinity = 0)
Hello from rank 1, thread 0, on nid00043. (core affinity = 4)
Hello from rank 2, thread 0, on nid00044. (core affinity = 0)
Hello from rank 3, thread 0, on nid00044. (core affinity = 4)

Example 7: Optimizing NUMA-node memory references (-sl option)

This example runs all PEs on NUMA node 0; the PEs cannot allocate remote NUMA node memory:

% aprun -n 8 -sl 0 ./xthi | sort
Application 225118 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
Hello from rank 3, thread 0, on nid00028. (core affinity = 3)
Hello from rank 4, thread 0, on nid00029. (core affinity = 0)
Hello from rank 5, thread 0, on nid00029. (core affinity = 1)
Hello from rank 6, thread 0, on nid00029. (core affinity = 2)
Hello from rank 7, thread 0, on nid00029. (core affinity = 3)

This example runs all PEs on NUMA node 1:

% aprun -n 8 -sl 1 ./xthi | sort
Application 225119 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 4)
Hello from rank 1, thread 0, on nid00028. (core affinity = 5)
Hello from rank 2, thread 0, on nid00028. (core affinity = 6)
Hello from rank 3, thread 0, on nid00028. (core affinity = 7)
Hello from rank 4, thread 0, on nid00029. (core affinity = 4)
Hello from rank 5, thread 0, on nid00029. (core affinity = 5)
Hello from rank 6, thread 0, on nid00029. (core affinity = 6)
Hello from rank 7, thread 0, on nid00029. (core affinity = 7)

Example 8: Optimizing NUMA node-memory references (-sn option)

This example runs four PEs on NUMA node 0 of node 28 and four PEs on NUMA node 0 of node 29:

% aprun -n 8 -sn 1 ./xthi | sort
Application 225120 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
Hello from rank 3, thread 0, on nid00028. (core affinity = 3)
Hello from rank 4, thread 0, on nid00029. (core affinity = 0)
Hello from rank 5, thread 0, on nid00029. (core affinity = 1)
Hello from rank 6, thread 0, on nid00029. (core affinity = 2)
Hello from rank 7, thread 0, on nid00029. (core affinity = 3)

Example 9: Optimizing NUMA-node memory references (-ss option)

When the -ss option is used, a PE can allocate only the memory local to its assigned NUMA node. The default is to allow remote-NUMA-node memory allocation to all assigned NUMA nodes; for example, by default any PE running on NUMA node 0 can allocate NUMA node 1 memory. This example runs PEs 0-3 on NUMA node 0 and PEs 4-7 on NUMA node 1. PEs 0-3 cannot allocate NUMA node 1 memory, and PEs 4-7 cannot allocate NUMA node 0 memory.

% aprun -n 8 -ss ./xthi | sort
Application 225121 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
Hello from rank 3, thread 0, on nid00028. (core affinity = 3)
Hello from rank 4, thread 0, on nid00028. (core affinity = 4)
Hello from rank 5, thread 0, on nid00028. (core affinity = 5)
Hello from rank 6, thread 0, on nid00028. (core affinity = 6)
Hello from rank 7, thread 0, on nid00028. (core affinity = 7)

Example 10: Memory per PE (-m option)

The -m option can affect application placement. This example runs all PEs on node 43. The amount of memory available per PE is 4000 MB:

% aprun -n 8 -m4000m ./xthi | sort
Application 225122 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00043. (core affinity = 0)
Hello from rank 1, thread 0, on nid00043. (core affinity = 1)
Hello from rank 2, thread 0, on nid00043. (core affinity = 2)
Hello from rank 3, thread 0, on nid00043. (core affinity = 3)
Hello from rank 4, thread 0, on nid00043. (core affinity = 4)
Hello from rank 5, thread 0, on nid00043. (core affinity = 5)
Hello from rank 6, thread 0, on nid00043. (core affinity = 6)
Hello from rank 7, thread 0, on nid00043. (core affinity = 7)

In this example, node 43 does not have enough memory to fulfill the request for 4001 MB per PE. PE 7 runs on node 44:

% aprun -n 8 -m4001 ./xthi | sort
Application 225123 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00043. (core affinity = 0)
Hello from rank 1, thread 0, on nid00043. (core affinity = 1)
Hello from rank 2, thread 0, on nid00043. (core affinity = 2)
Hello from rank 3, thread 0, on nid00043. (core affinity = 3)
Hello from rank 4, thread 0, on nid00043. (core affinity = 4)
Hello from rank 5, thread 0, on nid00043. (core affinity = 5)
Hello from rank 6, thread 0, on nid00043. (core affinity = 6)
Hello from rank 7, thread 0, on nid00044. (core affinity = 0)

Example 11: Using huge pages (-m h and hs suffixes)

This example requests 4000 MB of huge pages per PE:

% cc -o xthi xthi.c -lhugetlbfs
% HUGETLB_MORECORE=yes aprun -n 8 -m4000h ./xthi | sort
Application 225124 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00043. (core affinity = 0)
Hello from rank 1, thread 0, on nid00043. (core affinity = 1)
Hello from rank 2, thread 0, on nid00043. (core affinity = 2)
Hello from rank 3, thread 0, on nid00043. (core affinity = 3)
Hello from rank 4, thread 0, on nid00043. (core affinity = 4)
Hello from rank 5, thread 0, on nid00043. (core affinity = 5)
Hello from rank 6, thread 0, on nid00043. (core affinity = 6)
Hello from rank 7, thread 0, on nid00043. (core affinity = 7)

The following example requests 4000 MB of huge pages per PE and also specifies that a huge page size of 16 MB is to be used:

% HUGETLB_DEFAULT_PAGE_SIZE=16m aprun -n 8 -m4000h ./xthi | sort
Application 225124 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00043. (core affinity = 0)
Hello from rank 1, thread 0, on nid00043. (core affinity = 1)
Hello from rank 2, thread 0, on nid00043. (core affinity = 2)
Hello from rank 3, thread 0, on nid00043. (core affinity = 3)
Hello from rank 4, thread 0, on nid00043. (core affinity = 4)
Hello from rank 5, thread 0, on nid00043. (core affinity = 5)
Hello from rank 6, thread 0, on nid00043. (core affinity = 6)
Hello from rank 7, thread 0, on nid00043. (core affinity = 7)
The following example terminates because the required 4000 MB of huge pages per PE are not available:

% aprun -n 8 -m4000hs ./xthi | sort
[NID 00043] 2009-04-09 07:58:28 Apid 379231: unable to acquire enough huge memory: desired 32000M, actual 31498M

Example 12: Using node lists (-L option)

You can specify candidate node lists through the aprun -L option for applications launched interactively and through the qsub -lmppnodes option for batch and interactive batch jobs. For an application launched interactively, use the cnselect command to get a list of all Cray XE5 compute nodes, then use the aprun -L option to specify the candidate list:

% cnselect coremask.eq.255
28-95
% aprun -n 4 -N 2 -L 28-95 ./xthi | sort
Application 225127 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 2, thread 0, on nid00029. (core affinity = 0)
Hello from rank 3, thread 0, on nid00029. (core affinity = 1)

Example 13: Bypassing binary transfer (-b option)

This aprun command runs the compute node grep command to find references to MemTotal in the compute node file /proc/meminfo:

% aprun -b /bin/ash -c "cat /proc/meminfo | grep MemTotal"
MemTotal:     32909204 kB

For further information about the commands you can use with the aprun -b option, see Workload Management and Application Placement for the Cray Linux Environment.

SEE ALSO
intro_alps(1), apkill(1), apstat(1), cnselect(1), qsub(1)

CC(1), cc(1), ftn(1)

Workload Management and Application Placement for the Cray Linux Environment

Cray Application Developer's Environment User's Guide