Example Configurations
Globus Compute runs on a wide range of systems, but arriving at a configuration that satisfies both the constraints of the underlying system and the requirements of the site administrator often takes trial and error. Below are example user endpoint configuration templates that are known to work on some well-known systems. Use them as a starting point.
If you would like to add your system to this list, please submit a PR with your configuration template and a brief description of the system to the Globus Compute repository.
Note
All configuration examples below must be customized for the user’s allocation, Python environment, file system, etc.
Anvil (RCAC, Purdue)
The following snippet shows an example configuration for executing remotely on
Anvil, a supercomputer at Purdue University’s Rosen Center for Advanced
Computing (RCAC). The configuration assumes the user is running on a login
node, uses the SlurmProvider to interface with the scheduler, and uses the
SrunLauncher to launch workers.
user_config_template.yaml.j2

amqp_port: 443
engine:
  type: GlobusComputeEngine
  max_workers_per_node: 2
  address:
    type: address_by_interface
    ifname: ib0
  provider:
    type: SlurmProvider
    partition: debug
    account: {{ ACCOUNT }}
    launcher:
      type: SrunLauncher
    # string to prepend to #SBATCH blocks in the submit
    # script to the scheduler
    # e.g., "#SBATCH --constraint=knl,quad,cache"
    scheduler_options: {{ OPTIONS }}
    # Command to be run before starting a worker
    # e.g., "module load anaconda; source activate gce_env"
    worker_init: {{ COMMAND }}
    init_blocks: 1
    max_blocks: 1
    min_blocks: 0
    walltime: 00:05:00
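The `{{ ... }}` placeholders in these templates are Jinja variables that are filled in from user-supplied values. The sketch below is not how Globus Compute renders templates; it is a standard-library stand-in that mimics simple variable substitution, so you can see what a rendered fragment might look like. The function name `render_placeholders` and the sample values are hypothetical.

```python
import re

# Matches simple Jinja-style placeholders such as "{{ ACCOUNT }}".
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_placeholders(template: str, values: dict[str, str]) -> str:
    """Substitute {{ NAME }} placeholders; raise if a value is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for {name}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


fragment = "account: {{ ACCOUNT }}\nworker_init: {{ COMMAND }}"
rendered = render_placeholders(
    fragment,
    {
        "ACCOUNT": "abc123",  # hypothetical allocation name
        "COMMAND": '"module load anaconda; source activate gce_env"',
    },
)
print(rendered)
```

Raising on a missing value, rather than leaving the placeholder in place, makes an incomplete set of values fail early instead of producing a configuration the scheduler rejects later.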
Delta (NCSA)
The following snippet shows an example configuration for executing remotely on
Delta, a supercomputer at the National Center for Supercomputing Applications.
The configuration assumes the user is running on a login node, uses the
SlurmProvider to interface with the scheduler, and uses the SrunLauncher
to launch workers.
user_config_template.yaml.j2

amqp_port: 443
engine:
  type: GlobusComputeEngine
  max_workers_per_node: 2
  address:
    type: address_by_interface
    ifname: eth6.560
  provider:
    type: SlurmProvider
    partition: cpu
    account: {{ ACCOUNT_NAME }}
    launcher:
      type: SrunLauncher
    # Command to be run before starting a worker
    # e.g., "module load anaconda3; source activate gce_env"
    worker_init: {{ COMMAND }}
    init_blocks: 1
    min_blocks: 0
    max_blocks: 1
    walltime: 00:30:00
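The `ifname` value differs from system to system (`ib0`, `eth6.560`, `hsn0`, and so on in the examples on this page). One way to see which interface names exist on a login node is Python's standard `socket.if_nameindex()`. The helper name `list_interface_names` is illustrative, and which of the listed interfaces is reachable from the compute nodes is something only the site documentation or administrators can confirm.

```python
import socket


def list_interface_names() -> list[str]:
    """Return the names of the network interfaces visible on this host."""
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    # Print one candidate value for `ifname` per line.
    for name in list_interface_names():
        print(name)
```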
Expanse (SDSC)
The following snippet shows an example configuration for executing remotely on
Expanse, a supercomputer at the San Diego Supercomputer Center. The
configuration assumes the user is running on a login node, uses the
SlurmProvider to interface with the scheduler, and uses the
SrunLauncher to launch workers.
user_config_template.yaml.j2

engine:
  type: GlobusComputeEngine
  max_workers_per_node: 2
  worker_debug: False
  address:
    type: address_by_interface
    ifname: ib0
  provider:
    type: SlurmProvider
    partition: compute
    account: {{ ACCOUNT }}
    launcher:
      type: SrunLauncher
    # string to prepend to #SBATCH blocks in the submit
    # script to the scheduler
    # e.g., "#SBATCH --constraint=knl,quad,cache"
    scheduler_options: {{ OPTIONS }}
    # Command to be run before starting a worker
    # e.g., "module load anaconda3; source activate gce_env"
    worker_init: {{ COMMAND }}
    init_blocks: 0
    min_blocks: 0
    max_blocks: 1
    walltime: 00:05:00
The GlobusMPIEngine adds support for running MPI applications. The following
snippet shows an example configuration for Expanse that uses the
SlurmProvider to provision batch jobs each with 4 nodes, which can be
dynamically partitioned to launch MPI functions with srun.
user_config_template.yaml.j2

engine:
  type: GlobusMPIEngine
  mpi_launcher: srun
  address:
    type: address_by_interface
    ifname: ib0
  provider:
    type: SlurmProvider
    partition: compute
    account: {{ ACCOUNT }}
    launcher:
      type: SimpleLauncher
    # string to prepend to #SBATCH blocks in the submit
    # script to the scheduler
    # e.g., "#SBATCH --constraint=knl,quad,cache"
    scheduler_options: {{ OPTIONS }}
    # Command to be run before starting a worker
    # e.g., "module load anaconda3; source activate gce_env"
    worker_init: {{ COMMAND }}
    nodes_per_block: 4
    init_blocks: 0
    min_blocks: 0
    max_blocks: 1
    walltime: 00:05:00
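As a rough illustration of dynamic partitioning: with `nodes_per_block: 4`, a block has room for more than one MPI function at a time as long as their node requests fit. The sketch below only does that arithmetic. `max_concurrent_mpi_tasks` is a hypothetical helper, not part of Globus Compute, and it assumes every task asks for the same number of nodes.

```python
def max_concurrent_mpi_tasks(nodes_per_block: int, nodes_per_task: int) -> int:
    """How many equally sized MPI tasks fit in one block at the same time."""
    if nodes_per_task < 1:
        raise ValueError("each task needs at least one node")
    if nodes_per_task > nodes_per_block:
        raise ValueError("task requests more nodes than the block provides")
    return nodes_per_block // nodes_per_task


# With the 4-node block from the configuration above:
print(max_concurrent_mpi_tasks(4, 2))  # two 2-node tasks side by side
print(max_concurrent_mpi_tasks(4, 3))  # one 3-node task; one node stays idle
```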
UChicago AI Cluster
The following snippet shows an example configuration for the University of
Chicago’s AI Cluster. The configuration assumes the user is running on a login
node and uses the SlurmProvider to interface with the scheduler and launch
onto the GPUs.
Link to docs.
user_config_template.yaml.j2

engine:
  type: GlobusComputeEngine
  label: fe.cs.uchicago
  worker_debug: False
  address:
    type: address_by_interface
    ifname: ens2f1
  provider:
    type: SlurmProvider
    partition: general
    # This is a hack. We use hostname ; to terminate the srun command, and
    # start our own.
    launcher:
      type: SrunLauncher
      overrides: >
        hostname; srun --ntasks={{ TOTAL_WORKERS }}
        --ntasks-per-node={{ WORKERS_PER_NODE }}
        --gpus-per-task=rtx2080ti:{{ GPUS_PER_WORKER }}
        --gpu-bind=map_gpu:{{ GPU_MAP }} \
    # To request a single gpu, use the following:
    # hostname; srun --ntasks=1
    # --ntasks-per-node=1
    # --gres=gpu:1 \
    # Scale between 0-1 blocks with 1 node per block
    nodes_per_block: 1
    init_blocks: 0
    min_blocks: 0
    max_blocks: 1
    # Hold blocks for 30 minutes
    walltime: 00:30:00
Here is some Python that demonstrates how to compute the variables in the YAML example above:
# With the values below: 2 workers per node, each bound to 2 GPUs
# Modify before use
NODES_PER_JOB = 2
GPUS_PER_NODE = 4
GPUS_PER_WORKER = 2
# DO NOT MODIFY
TOTAL_WORKERS = (NODES_PER_JOB * GPUS_PER_NODE) // GPUS_PER_WORKER
WORKERS_PER_NODE = GPUS_PER_NODE // GPUS_PER_WORKER
GPU_MAP = ",".join(str(x) for x in range(1, TOTAL_WORKERS + 1))
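The same arithmetic can be wrapped in a function that also rejects GPU counts that do not divide evenly. This is a sketch: `worker_layout` is a hypothetical name, and it covers only the two worker counts, not the GPU map.

```python
def worker_layout(
    nodes_per_job: int, gpus_per_node: int, gpus_per_worker: int
) -> tuple[int, int]:
    """Return (TOTAL_WORKERS, WORKERS_PER_NODE) for the template above."""
    if gpus_per_worker < 1 or gpus_per_node % gpus_per_worker != 0:
        raise ValueError("gpus_per_node must be a multiple of gpus_per_worker")
    workers_per_node = gpus_per_node // gpus_per_worker
    return nodes_per_job * workers_per_node, workers_per_node


total_workers, workers_per_node = worker_layout(
    nodes_per_job=2, gpus_per_node=4, gpus_per_worker=2
)
print(total_workers, workers_per_node)  # 4 2
```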
Midway (RCC, UChicago)
The Midway cluster is a campus cluster hosted by the Research Computing Center
at the University of Chicago. The snippet below shows an example configuration
for executing remotely on Midway. The configuration assumes the user is running
on a login node, uses the SlurmProvider to interface with the scheduler,
and uses the SrunLauncher to launch workers.
user_config_template.yaml.j2

engine:
  type: GlobusComputeEngine
  label: Midway@RCC.UChicago
  max_workers_per_node: 2
  address:
    type: address_by_interface
    ifname: bond0
  provider:
    type: SlurmProvider
    launcher:
      type: SrunLauncher
    # e.g., pi-compute
    account: {{ ACCOUNT }}
    # e.g., caslake
    partition: {{ PARTITION }}
    # string to prepend to #SBATCH blocks in the submit
    # script to the scheduler
    # e.g., "#SBATCH --gres=gpu:4"
    scheduler_options: {{ OPTIONS }}
    # Command to be run before starting a worker
    # e.g., "module load Anaconda; source activate compute-env"
    worker_init: {{ COMMAND }}
    # Scale between 0-1 blocks with 2 nodes per block
    nodes_per_block: 2
    init_blocks: 0
    min_blocks: 0
    max_blocks: 1
    # Hold blocks for 30 minutes
    walltime: 00:30:00
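The scaling parameters combine multiplicatively. Assuming every requested node is granted and every worker slot is filled, the upper bound on concurrent workers is `max_blocks * nodes_per_block * max_workers_per_node`. A small sketch of that calculation follows; `max_concurrent_workers` is a hypothetical helper, not part of Globus Compute.

```python
def max_concurrent_workers(
    max_blocks: int, nodes_per_block: int, max_workers_per_node: int
) -> int:
    """Upper bound on workers running at once for a given configuration."""
    return max_blocks * nodes_per_block * max_workers_per_node


# Values from the Midway configuration above:
# max_blocks: 1, nodes_per_block: 2, max_workers_per_node: 2
print(max_concurrent_workers(1, 2, 2))  # 4
```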
The following configuration example uses an Apptainer (formerly Singularity) container on Midway.
user_config_template.yaml.j2

engine:
  type: GlobusComputeEngine
  label: Midway@RCC.UChicago
  max_workers_per_node: 10
  address:
    type: address_by_interface
    ifname: bond0
  container_type: apptainer
  container_cmd_options: -H /home/$USER
  container_uri: {{ CONTAINER_URI }}
  provider:
    type: SlurmProvider
    launcher:
      type: SrunLauncher
    # e.g., pi-compute
    account: {{ ACCOUNT }}
    # e.g., caslake
    partition: {{ PARTITION }}
    # string to prepend to #SBATCH blocks in the submit
    # script to the scheduler
    # e.g., "#SBATCH --gres=gpu:4"
    scheduler_options: {{ OPTIONS }}
    # Command to be run before starting a worker
    # e.g., "module load Anaconda; source activate compute-env"
    worker_init: {{ COMMAND }}
    # Scale between 0-1 blocks with 2 nodes per block
    nodes_per_block: 2
    init_blocks: 0
    min_blocks: 0
    max_blocks: 1
    # Hold blocks for 30 minutes
    walltime: 00:30:00
Kubernetes Clusters
Kubernetes is an open-source system for managing containers, including
automating their deployment and scaling. The snippet below shows an example
configuration for deploying pods as workers on a Kubernetes cluster. The
KubernetesProvider uses the Python Kubernetes API, which assumes that you
have a kube config in ~/.kube/config.
user_config_template.yaml.j2

heartbeat_period: 15
heartbeat_threshold: 200
engine:
  type: GlobusComputeEngine
  max_workers_per_node: 1
  # Encryption is not currently supported for KubernetesProvider
  encrypted: false
  address:
    type: address_by_route
  provider:
    type: KubernetesProvider
    init_blocks: 0
    min_blocks: 0
    max_blocks: 2
    init_cpu: 1
    max_cpu: 4
    init_mem: 1024Mi
    max_mem: 4096Mi
    # e.g., default
    namespace: {{ NAMESPACE }}
    # e.g., python:3.12-bookworm
    image: {{ IMAGE }}
    # The secret key to download the image
    secret: {{ SECRET }}
    # e.g., "pip install --force-reinstall globus-compute-endpoint"
    worker_init: {{ COMMAND }}
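Because the provider expects a kube config at `~/.kube/config`, a quick check before starting the endpoint can save a failed launch. The sketch below uses only the standard library; the helper names are hypothetical, and it checks only that the file exists, not that its contents are valid.

```python
from pathlib import Path


def default_kubeconfig_path() -> Path:
    """Conventional location of the kube config for the current user."""
    return Path.home() / ".kube" / "config"


def kubeconfig_present() -> bool:
    """True if a regular file exists at the conventional location."""
    return default_kubeconfig_path().is_file()


if __name__ == "__main__":
    status = "found" if kubeconfig_present() else "missing"
    print(f"{default_kubeconfig_path()}: {status}")
```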
Polaris (ALCF)
The following snippet shows an example configuration for executing on Argonne
Leadership Computing Facility’s Polaris cluster. This example uses the
GlobusComputeEngine and connects to Polaris’s PBS scheduler using the
PBSProProvider. This configuration assumes that the script is being
executed on the login node of Polaris.
user_config_template.yaml.j2

engine:
  type: GlobusComputeEngine
  max_workers_per_node: 4
  # Un-comment to give each worker exclusive access to a single GPU
  # available_accelerators: 4
  address:
    type: address_by_interface
    ifname: hsn0
  provider:
    type: PBSProProvider
    launcher:
      type: MpiExecLauncher
      # Ensures 1 manager per node, work on all 64 cores
      bind_cmd: --cpu-bind
      overrides: --depth=64 --ppn 1
    account: {{ POLARIS_ACCOUNT }}
    queue: debug-scaling
    cpus_per_node: 32
    select_options: ngpus=4
    # e.g., "#PBS -l filesystems=home:grand:eagle\n#PBS -k doe"
    scheduler_options: "#PBS -l filesystems=home:grand:eagle"
    # Node setup: activate necessary conda environment and such
    worker_init: {{ COMMAND }}
    walltime: 01:00:00
    nodes_per_block: 1
    init_blocks: 0
    min_blocks: 0
    max_blocks: 2
Perlmutter (NERSC)
The following snippet shows an example configuration for accessing NERSC’s
Perlmutter supercomputer. This example uses the GlobusComputeEngine and
connects to Perlmutter’s Slurm scheduler. It is configured to request 2 nodes,
with 1 TaskBlock per node. Finally, it includes override information
to request a particular node type (GPU) and to configure a specific Python
environment on the worker nodes using Anaconda.
user_config_template.yaml.j2

engine:
  type: GlobusComputeEngine
  worker_debug: False
  address:
    type: address_by_interface
    ifname: hsn0
  provider:
    type: SlurmProvider
    partition: debug
    # We request all hyperthreads on a node.
    # GPU nodes have 128 threads, CPU nodes have 256 threads
    launcher:
      type: SrunLauncher
      overrides: -c 128
    # string to prepend to #SBATCH blocks in the submit
    # script to the scheduler
    # For GPUs in the debug qos, e.g., "#SBATCH --constraint=gpu\n#SBATCH --gpus-per-node=4"
    scheduler_options: {{ OPTIONS }}
    # Your NERSC account, e.g., "m0000"
    account: {{ NERSC_ACCOUNT }}
    # Command to be run before starting a worker
    # e.g., "module load Anaconda; source activate parsl_env"
    worker_init: {{ COMMAND }}
    # increase the command timeouts
    cmd_timeout: 120
    # Scale between 0-1 blocks with 2 nodes per block
    nodes_per_block: 2
    init_blocks: 0
    min_blocks: 0
    max_blocks: 1
    # Hold blocks for 10 minutes
    walltime: 00:10:00
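`scheduler_options` is a single string, so several directives are joined with newlines, as in the `#SBATCH` example in the comment above. The sketch below builds such a string; `sbatch_options` is a hypothetical helper, and how the resulting value is passed into the template is left to your own tooling.

```python
def sbatch_options(*flags: str) -> str:
    """Join Slurm flags into one newline-separated block of #SBATCH lines."""
    return "\n".join(f"#SBATCH {flag}" for flag in flags)


# The GPU example from the comment in the configuration above.
options = sbatch_options("--constraint=gpu", "--gpus-per-node=4")
print(options)
```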
Frontera (TACC)
The following snippet shows an example configuration for accessing the Frontera
system at TACC. The configuration below assumes that the user is running on a
login node, uses the SlurmProvider to interface with the scheduler, and uses
the SrunLauncher to launch workers.
user_config_template.yaml.j2

engine:
  type: GlobusComputeEngine
  max_workers_per_node: 2
  worker_debug: False
  address:
    type: address_by_interface
    ifname: ib0
  provider:
    type: SlurmProvider
    # e.g., EAR22001
    account: {{ FRONTERA_ACCOUNT }}
    # e.g., development
    partition: {{ PARTITION }}
    launcher:
      type: SrunLauncher
    # Enter scheduler_options if needed
    scheduler_options: {{ OPTIONS }}
    # Command to be run before starting a worker
    # e.g., "module load Anaconda; source activate parsl_env"
    worker_init: {{ COMMAND }}
    # Add extra time for slow scheduler responses
    cmd_timeout: 60
    # Scale between 0-1 blocks with 2 nodes per block
    nodes_per_block: 2
    init_blocks: 0
    min_blocks: 0
    max_blocks: 1
    # Hold blocks for 30 minutes
    walltime: 00:30:00
Bebop (LCRC, ANL)
The following snippet shows an example configuration for accessing the Bebop
system at Argonne’s LCRC. The configuration below assumes that the user is
running on a login node, uses the SlurmProvider to interface with the
scheduler, and uses the SrunLauncher to launch workers.
user_config_template.yaml.j2

engine:
  type: GlobusComputeEngine
  max_workers_per_node: 2
  worker_debug: False
  address:
    type: address_by_interface
    ifname: ib0
  provider:
    type: SlurmProvider
    partition: {{ PARTITION }}  # e.g., bdws
    launcher:
      type: SrunLauncher
    # Command to be run before starting a worker
    # e.g., "module load anaconda; source activate gce_env"
    worker_init: {{ COMMAND }}
    nodes_per_block: 1
    init_blocks: 0
    min_blocks: 0
    max_blocks: 1
    walltime: 00:30:00
Bridges-2 (PSC)
The following snippet shows an example configuration for accessing the Bridges-2
system at PSC. The configuration below assumes that the user is running on a
login node, uses the SlurmProvider to interface with the scheduler, and uses
the SrunLauncher to launch workers.
user_config_template.yaml.j2

engine:
  type: GlobusComputeEngine
  max_workers_per_node: 2
  worker_debug: False
  address:
    type: address_by_interface
    ifname: ens3f0
  provider:
    type: SlurmProvider
    partition: RM-small
    launcher:
      type: SrunLauncher
    # string to prepend to #SBATCH blocks in the submit
    # script to the scheduler
    # e.g., "#SBATCH --constraint=knl,quad,cache"
    scheduler_options: {{ OPTIONS }}
    # Command to be run before starting a worker
    # e.g., "module load Anaconda; source activate parsl_env"
    worker_init: {{ COMMAND }}
    # Scale between 0-1 blocks with 2 nodes per block
    nodes_per_block: 2
    init_blocks: 0
    min_blocks: 0
    max_blocks: 1
    # Hold blocks for 30 minutes
    walltime: 00:30:00
FASTER (TAMU)
The following snippet shows an example configuration for accessing the FASTER
system at Texas A&M University (TAMU). The configuration
below assumes that the user is running on a login node, uses the
SlurmProvider to interface with the scheduler, and uses the SrunLauncher
to launch workers.
user_config_template.yaml.j2

amqp_port: 443
engine:
  type: GlobusComputeEngine
  worker_debug: False
  strategy:
    type: SimpleStrategy
    max_idletime: 90
  address:
    type: address_by_interface
    ifname: eno8303
  provider:
    type: SlurmProvider
    partition: cpu
    mem_per_node: 128
    launcher:
      type: SrunLauncher
    scheduler_options: {{ OPTIONS }}
    worker_init: {{ COMMAND }}
    # increase the command timeouts
    cmd_timeout: 120
    # Scale between 0-1 blocks with 1 node per block
    nodes_per_block: 1
    init_blocks: 0
    min_blocks: 0
    max_blocks: 1
    # Hold blocks for 10 minutes
    walltime: 00:10:00
Open Science Pool
The Open Science Pool
is a pool of opportunistic computing resources operated for all US-associated
open science by the OSG consortium.
Unlike traditional HPC clusters, these computational resources are offered from
campus and research cluster resources that are loosely connected. The
configuration below uses the CondorProvider to interface with the scheduler,
and uses Apptainer to distribute the computational environment to the
workers.
Warning
GlobusComputeEngine relies on a shared filesystem to distribute the keys used
for encrypting communication between the endpoint and workers. Since OSPool
does not support a writable shared filesystem, encryption is disabled in
the configuration below.
user_config_template.yaml.j2

engine:
  type: GlobusComputeEngine
  max_workers_per_node: 1
  # This config uses apptainer containerization to ensure a consistent
  # python environment on the worker side. Since apptainer limits writable
  # directory paths, set the working directory paths used by the worker to /tmp
  # Note: these file paths remain private to the container and will not be
  # accessible on the host system
  worker_logdir_root: /tmp/logs
  working_dir: /tmp/tasks_dir
  # GlobusComputeEngine relies on a shared filesystem to distribute keys used
  # for encrypting communication between the endpoint and workers.
  # Since OSPool does not support a writable shared filesystem,
  # **encryption** is disabled in the configuration below.
  encrypted: False
  provider:
    type: CondorProvider
    init_blocks: 1
    max_blocks: 1
    min_blocks: 0
    # Specify ProjectName and Apptainer image
    scheduler_options: >
      +ProjectName = {{ PROJECT_NAME }}
      # To use apptainer on OSPool, build an apptainer image, copy it to
      # OSDF, and specify the full apptainer image path, e.g.,
      # "osdf:///ospool/ap20/data/USERNAME/globus_compute_py3.11.v1.sif"
      +SingularityImage = {{ APPTAINER_IMAGE_PATH }}
      # Add a condor requirement to guarantee that worker nodes support apptainer
      Requirements = HAS_SINGULARITY == True && OSG_HOST_KERNEL_VERSION >= 31000
Stampede3 (TACC)
Stampede3 is a Dell Technologies and Intel-based supercomputer at the Texas
Advanced Computing Center (TACC). The following snippet shows an example
configuration that uses the SlurmProvider to interface with the batch
scheduler, and uses the SrunLauncher to launch workers across nodes.
user_config_template.yaml.j2

engine:
  type: GlobusComputeEngine
  max_workers_per_node: 2
  address:
    type: address_by_interface
    ifname: ibp10s0
  provider:
    type: SlurmProvider
    # e.g., EAR22001
    account: {{ TACC_ALLOCATION }}
    # e.g., skx-dev
    partition: {{ PARTITION }}
    launcher:
      type: SrunLauncher
    # Enter scheduler_options if needed
    scheduler_options: {{ OPTIONS }}
    # Command to be run before starting a worker
    # e.g., "module load Anaconda; source activate parsl_env"
    worker_init: {{ COMMAND }}
    # Add extra time for slow scheduler responses
    cmd_timeout: 60
    # Scale between 0-1 blocks with 1 node per block
    nodes_per_block: 1
    init_blocks: 0
    min_blocks: 0
    max_blocks: 1
    # Hold blocks for 30 minutes
    walltime: 00:30:00
Pinning Workers to devices
Many modern clusters provide multiple accelerators per compute node, yet many
applications are best suited to using a single accelerator per task. Globus
Compute supports pinning each worker to a different accelerator using the
available_accelerators option of the GlobusComputeEngine. Provide
either the number of accelerators (Globus Compute will assume they are named
with integers starting from zero) or a list of the names of the accelerators
available on the node. Each Globus Compute worker will have the following
environment variables set to the accelerator assigned to that worker:
CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, SYCL_DEVICE_FILTER.
user_config_template.yaml.j2

engine:
  type: GlobusComputeEngine
  max_workers_per_node: 4
  # `available_accelerators` may be a natural number or a list of strings.
  # If an integer, then each worker launched will have an automatically
  # generated environment variable. In this case, one of 0, 1, 2, or 3.
  # Alternatively, specific strings may be utilized.
  available_accelerators: 4
  # available_accelerators: ["opencl:gpu:1", "opencl:gpu:2"]  # alternative
  provider:
    type: LocalProvider
    init_blocks: 1
    min_blocks: 0
    max_blocks: 1
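Inside a task, the assigned accelerator can be read back from the environment variables listed above. The sketch below shows one way to do that with the standard library. `assigned_accelerators` is a hypothetical helper, and the dictionary passed in at the end stands in for a worker's environment purely for demonstration.

```python
import os
from typing import Mapping, Optional

# Variables named in the text above.
ACCELERATOR_ENV_VARS = (
    "CUDA_VISIBLE_DEVICES",
    "ROCR_VISIBLE_DEVICES",
    "SYCL_DEVICE_FILTER",
)


def assigned_accelerators(environ: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return the accelerator variables that are set, keyed by variable name."""
    env = os.environ if environ is None else environ
    return {name: env[name] for name in ACCELERATOR_ENV_VARS if name in env}


# Simulated worker environment (assumption: this worker was pinned to device "2").
simulated = {"CUDA_VISIBLE_DEVICES": "2", "PATH": "/usr/bin"}
print(assigned_accelerators(simulated))
```

Calling `assigned_accelerators()` with no argument reads the real process environment, which is what a function running on a worker would do.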