Example Globus Compute Endpoint Configurations¶
Globus Compute has been used on various systems around the world. Below are example configurations for commonly used systems. If you would like to add your system to this list please contact the Globus Compute Team via Slack.
Note
All configuration examples below must be customized for the user’s allocation, Python environment, file system, etc.
GlobusComputeEngine¶
GlobusComputeEngine is the execution backend that Globus Compute uses
to execute functions. To execute functions at scale, Globus Compute can be
configured to use a range of Providers which allows it to connect to Batch schedulers
like Slurm and PBSTorque to provision compute nodes dynamically in response to workload.
These capabilities are largely borrowed from Parsl’s HighThroughputExecutor and therefore
all of HighThroughputExecutor’s parameter options are supported as passthrough.
Note::
As of globus-compute-endpoint==2.12.0 GlobusComputeEngine, replaces the
HighThroughputEngine as the default executor.
Here are GlobusComputeEngine specific features:
Retries¶
Functions submitted to the GlobusComputeEngine can fail due to infrastructure
failures, for example, the worker executing the task might terminate due to it running
out of memory, or all workers under a batch job could fail due to the batch job
exiting as it reaches the walltime limit. GlobusComputeEngine can be configured
to automatically retry these tasks by setting max_retries_on_system_failure=N
where N is the number of retries allowed. The endpoint config sets default retries
to 0 since functions can be computationally expensive, not idempotent, or leave
side effects that affect subsequent retries.
Example config snippet:
amqp_port: 443
display_name: Retry_2_times
engine:
type: GlobusComputeEngine
max_retries_on_system_failure: 2 # Default=0
Auto-Scaling¶
GlobusComputeEngine by default automatically scales workers in response to workload.
Strategy configuration is limited to two options:
max_idletime: Maximum duration in seconds that workers are allowed to idle before they are marked for terminationstrategy_period: Set the # of seconds between strategy attempting auto-scaling events
The bounds for scaling are determined by the options to the Provider
(init_blocks, min_blocks, max_blocks). Please refer to the
https://parsl.readthedocs.io/en/stable/userguide/execution.html#elasticity
for more info.
Here’s an example configuration:
engine:
type: GlobusComputeEngine
job_status_kwargs:
max_idletime: 60.0 # Default = 120s
strategy_period: 120.0 # Default = 5s
Anvil (RCAC, Purdue)¶
The following snippet shows an example configuration for executing remotely on Anvil, a
supercomputer at Purdue University’s Rosen Center for Advanced Computing (RCAC). The
configuration assumes the user is running on a login node, uses the SlurmProvider to
interface with the scheduler, and uses the SrunLauncher to launch workers.
amqp_port: 443
display_name: Anvil CPU
engine:
type: GlobusComputeEngine
max_workers_per_node: 2
address:
type: address_by_interface
ifname: ib0
provider:
type: SlurmProvider
partition: debug
account: {{ ACCOUNT }}
launcher:
type: SrunLauncher
# string to prepend to #SBATCH blocks in the submit
# script to the scheduler
# e.g., "#SBATCH --constraint=knl,quad,cache"
scheduler_options: {{ OPTIONS }}
# Command to be run before starting a worker
# e.g., "module load anaconda; source activate gce_env
worker_init: {{ COMMAND }}
init_blocks: 1
max_blocks: 1
min_blocks: 0
walltime: 00:05:00
Delta (NCSA)¶
The following snippet shows an example configuration for executing remotely on Delta, a
supercomputer at the National Center for Supercomputing Applications. The configuration
assumes the user is running on a login node, uses the SlurmProvider to interface
with the scheduler, and uses the SrunLauncher to launch workers.
amqp_port: 443
display_name: NCSA Delta 2 CPU
engine:
type: GlobusComputeEngine
max_workers_per_node: 2
address:
type: address_by_interface
ifname: eth6.560
provider:
type: SlurmProvider
partition: cpu
account: {{ ACCOUNT NAME }}
launcher:
type: SrunLauncher
# Command to be run before starting a worker
# e.g., "module load anaconda3; source activate gce_env"
worker_init: {{ COMMAND }}
init_blocks: 1
min_blocks: 0
max_blocks: 1
walltime: 00:30:00
Expanse (SDSC)¶
The following snippet shows an example configuration for executing remotely on Expanse,
a supercomputer at the San Diego Supercomputer Center. The configuration assumes the
user is running on a login node, uses the SlurmProvider to interface with the
scheduler, and uses the SrunLauncher to launch workers.
display_name: Expanse@SDSC
engine:
type: GlobusComputeEngine
max_workers_per_node: 2
worker_debug: False
address:
type: address_by_interface
ifname: ib0
provider:
type: SlurmProvider
partition: compute
account: {{ ACCOUNT }}
launcher:
type: SrunLauncher
# string to prepend to #SBATCH blocks in the submit
# script to the scheduler
# e.g., "#SBATCH --constraint=knl,quad,cache"
scheduler_options: {{ OPTIONS }}
# Command to be run before starting a worker
# e.g., "module load anaconda3; source activate gce_env"
worker_init: {{ COMMAND }}
# Command to be run before starting a worker
# e.g., "module load anaconda3; source activate gce_env"
worker_init: "source ~/setup.sh"
init_blocks: 0
min_blocks: 0
max_blocks: 1
walltime: 00:05:00
UChicago AI Cluster¶
The following snippet shows an example configuration for the University of Chicago’s AI
Cluster. The configuration assumes the user is running on a login node and uses the
SlurmProvider to interface with the scheduler and launch onto the GPUs.
Link to docs.
display_name: AI Cluster CS@UChicago
engine:
type: GlobusComputeEngine
label: fe.cs.uchicago
worker_debug: False
address:
type: address_by_interface
ifname: ens2f1
provider:
type: SlurmProvider
partition: general
# This is a hack. We use hostname ; to terminate the srun command, and
# start our own.
launcher:
type: SrunLauncher
overrides: >
hostname; srun --ntasks={{ TOTAL_WORKERS }}
--ntasks-per-node={{ WORKERS_PER_NODE }}
--gpus-per-task=rtx2080ti:{{ GPUS_PER_WORKER }}
--gpu-bind=map_gpu:{{ GPU_MAP }} \
# To request a single gpu, use the following:
# hostname; srun --ntasks=1
# --ntasks-per-node=1
# --gres=gpu:1 \
# Scale between 0-1 blocks with 2 nodes per block
nodes_per_block: 1
init_blocks: 0
min_blocks: 0
max_blocks: 1
# Hold blocks for 30 minutes
walltime: 00:30:00
Here is some Python that demonstrates how to compute the variables in the YAML example above:
# Launch 4 managers per node, each bound to 1 GPU
# Modify before use
NODES_PER_JOB = 2
GPUS_PER_NODE = 4
GPUS_PER_WORKER = 2
# DO NOT MODIFY
TOTAL_WORKERS = int((NODES_PER_JOB * GPUS_PER_NODE) / GPUS_PER_WORKER)
WORKERS_PER_NODE = int(GPUS_PER_NODE / GPUS_PER_WORKER)
GPU_MAP = ",".join([str(x) for x in range(1, TOTAL_WORKERS + 1)])
Midway (RCC, UChicago)¶
The Midway cluster is a campus cluster hosted by the Research Computing Center at the
University of Chicago. The snippet below shows an example configuration for executing
remotely on Midway. The configuration assumes the user is running on a login node and
uses the SlurmProvider to interface with the scheduler, and uses the
SrunLauncher to launch workers.
display_name: Midway3@rcc.uchicago.edu
engine:
type: HighThroughputEngine
max_workers_per_node: 2
worker_debug: False
address:
type: address_by_interface
ifname: bond0
provider:
type: SlurmProvider
partition: broadwl
launcher:
type: SrunLauncher
# string to prepend to #SBATCH blocks in the submit
# script to the scheduler
# e.g., "#SBATCH --constraint=knl,quad,cache"
scheduler_options: {{ OPTIONS }}
# Command to be run before starting a worker
# e.g., module load Anaconda; source activate parsl_env
worker_init: {{ COMMAND }}
# Scale between 0-1 blocks with 2 nodes per block
nodes_per_block: 2
init_blocks: 0
min_blocks: 0
max_blocks: 1
# Hold blocks for 30 minutes
walltime: 00:30:00
The following configuration is an example to use singularity container on Midway.
engine:
type: HighThroughputEngine
max_workers_per_node: 10
address:
type: address_by_interface
ifname: bond0
scheduler_mode: soft
worker_mode: singularity_reuse
container_type: singularity
container_cmd_options: -H /home/$USER
provider:
type: SlurmProvider
partition: broadwl
launcher:
type: SrunLauncher
# string to prepend to #SBATCH blocks in the submit
# script to the scheduler
# eg: "#SBATCH --constraint=knl,quad,cache"
scheduler_options: {{ OPTIONS }}
# Command to be run before starting a worker
# e.g., "module load Anaconda; source activate parsl_env"
worker_init: {{ COMMAND }}
# Scale between 0-1 blocks with 2 nodes per block
nodes_per_block: 2
init_blocks: 0
min_blocks: 0
max_blocks: 1
# Hold blocks for 30 minutes
walltime: 00:30:00
Kubernetes Clusters¶
Kubernetes is an open-source system for container management, such as automating
deployment and scaling of containers. The snippet below shows an example configuration
for deploying pods as workers on a Kubernetes cluster. The KubernetesProvider exploits
the Python Kubernetes API, which assumes that you have kube config in
~/.kube/config.
heartbeat_period: 15
heartbeat_threshold: 200
log_dir: "."
engine:
type: HighThroughputEngine
label: Kubernetes_funcX
max_workers_per_node: 1
address:
type: address_by_route
scheduler_mode: hard
container_type: docker
strategy:
type: KubeSimpleStrategy
max_idletime: 3600
provider:
type: KubernetesProvider
init_blocks: 0
min_blocks: 0
max_blocks: 2
init_cpu: 1
max_cpu: 4
init_mem: 1024Mi
max_mem: 4096Mi
# e.g., python:3.8-buster
image: {{ IMAGE }}
# e.g., "pip install --force-reinstall globus_compute_endpoint>=2.0.1"
worker_init: {{ COMMAND }}
# e.g., default
namespace: {{ NAMESPACE }}
incluster_config: False
Polaris (ALCF)¶
The following snippet shows an example configuration for executing on Argonne Leadership
Computing Facility’s Polaris cluster. This example uses the HighThroughputEngine
and connects to Polaris’s PBS scheduler using the PBSProProvider. This
configuration assumes that the script is being executed on the login node of Polaris.
display_name: Polaris@ALCF
engine:
type: GlobusComputeEngine
max_workers_per_node: 4
# Un-comment to give each worker exclusive access to a single GPU
# available_accelerators: 4
address:
type: address_by_interface
ifname: bond0
provider:
type: PBSProProvider
launcher:
type: MpiExecLauncher
# Ensures 1 manger per node, work on all 64 cores
bind_cmd: --cpu-bind
overrides: --depth=64 --ppn 1
account: {{ YOUR_POLARIS_ACCOUNT }}
queue: debug-scaling
cpus_per_node: 32
select_options: ngpus=4
# e.g., "#PBS -l filesystems=home:grand:eagle\n#PBS -k doe"
scheduler_options: "#PBS -l filesystems=home:grand:eagle"
# Node setup: activate necessary conda environment and such
worker_init: {{ COMMAND }}
walltime: 01:00:00
nodes_per_block: 1
init_blocks: 0
min_blocks: 0
max_blocks: 2
Perlmutter (NERSC)¶
The following snippet shows an example configuration for accessing NERSC’s
Perlmutter supercomputer. This example uses the HighThroughputEngine and
connects to Perlmutters’s Slurm scheduler. It is configured to request 2 nodes
configured with 1 TaskBlock per node. Finally, it includes override information to
request a particular node type (GPU) and to configure a specific Python environment on
the worker nodes using Anaconda.
display_name: Permutter@NERSC
engine:
type: GlobusComputeEngine
worker_debug: False
address:
type: address_by_interface
ifname: hsn0
provider:
type: SlurmProvider
partition: debug
# We request all hyperthreads on a node.
# GPU nodes have 128 threads, CPU nodes have 256 threads
launcher:
type: SrunLauncher
overrides: -c 128
# string to prepend to #SBATCH blocks in the submit
# script to the scheduler
# For GPUs in the debug qos eg: "#SBATCH --constraint=gpu"
scheduler_options: {{ OPTIONS }}
# Your NERSC account, eg: "m0000"
account: {{ NERSC_ACCOUNT }}
# Command to be run before starting a worker
# e.g., "module load Anaconda; source activate parsl_env"
worker_init: {{ COMMAND }}
# increase the command timeouts
cmd_timeout: 120
# Scale between 0-1 blocks with 2 nodes per block
nodes_per_block: 2
init_blocks: 0
min_blocks: 0
max_blocks: 1
# Hold blocks for 10 minutes
walltime: 00:10:00
Frontera (TACC)¶
The following snippet shows an example configuration for accessing the Frontera system
at TACC. The configuration below assumes that the user is running on a login node, uses
the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to
launch workers.
display_name: Frontera@TACC
engine:
type: GlobusComputeEngine
max_workers_per_node: 2
worker_debug: False
address:
type: address_by_interface
ifname: ib0
provider:
type: SlurmProvider
# e.g., EAR22001
account: {{ YOUR_FRONTERA_ACCOUNT }}
# e.g., development
partition: {{ PARTITION }}
launcher:
type: SrunLauncher
# Enter scheduler_options if needed
scheduler_options: {{ OPTIONS }}
# Command to be run before starting a worker
# e.g., "module load Anaconda; source activate parsl_env"
worker_init: {{ COMMAND }}
# Add extra time for slow scheduler responses
cmd_timeout: 60
# Scale between 0-1 blocks with 2 nodes per block
nodes_per_block: 2
init_blocks: 0
min_blocks: 0
max_blocks: 1
# Hold blocks for 30 minutes
walltime: 00:30:00
Bebop (LCRC, ANL)¶
The following snippet shows an example configuration for accessing the Bebop system at
Argonne’s LCRC. The configuration below assumes that the user is running on a login
node, uses the SlurmProvider to interface with the scheduler, and uses the
SrunLauncher to launch workers.
display_name: Bebop@ANL
engine:
type: GlobusComputeEngine
max_workers_per_node: 2
worker_debug: False
address:
type: address_by_interface
ifname: ib0
provider:
type: SlurmProvider
partition: {{ PARTITION }} # e.g., bdws
launcher:
type: SrunLauncher
# Command to be run before starting a worker
# e.g., "module load anaconda; source activate gce_env"
worker_init: {{ COMMAND }}
nodes_per_block: 1
init_blocks: 0
min_blocks: 0
max_blocks: 1
walltime: 00:30:00
Bridges-2 (PSC)¶
The following snippet shows an example configuration for accessing the Bridges-2 system
at PSC. The configuration below assumes that the user is running on a login node, uses
the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to
launch workers.
engine:
type: GlobusComputeEngine
max_workers_per_node: 2
worker_debug: False
address:
type: address_by_interface
ifname: ens3f0
provider:
type: SlurmProvider
partition: RM-small
launcher:
type: SrunLauncher
# string to prepend to #SBATCH blocks in the submit
# script to the scheduler
# e.g., "#SBATCH --constraint=knl,quad,cache"
scheduler_options: {{ OPTIONS }}
# Command to be run before starting a worker
# e.g., module load Anaconda; source activate parsl_env
worker_init: {{ COMMAND }}
# Scale between 0-1 blocks with 2 nodes per block
nodes_per_block: 2
init_blocks: 0
min_blocks: 0
max_blocks: 1
# Hold blocks for 30 minutes
walltime: 00:30:00
FASTER (TAMU)¶
The following snippet shows an example configuration for accessing the FASTER system at
Texas A & M (TAMU). The configuration below assumes that
the user is running on a login node, uses the SlurmProvider to interface with the
scheduler, and uses the SrunLauncher to launch workers.
amqp_port: 443
display_name: Access Tamu Faster
engine:
type: GlobusComputeEngine
worker_debug: False
strategy:
type: SimpleStrategy
max_idletime: 90
address:
type: address_by_interface
ifname: eno8303
provider:
type: SlurmProvider
partition: cpu
mem_per_node: 128
launcher:
type: SrunLauncher
scheduler_options: {{ OPTIONS }}
worker_init: {{ COMMAND }}
# increase the command timeouts
cmd_timeout: 120
# Scale between 0-1 blocks with 1 nodes per block
nodes_per_block: 1
init_blocks: 0
min_blocks: 0
max_blocks: 1
# Hold blocks for 10 minutes
walltime: 00:10:00
Pinning Workers to devices¶
Many modern clusters provide multiple accelerators per compute note, yet many
applications are best suited to using a single accelerator per task. Globus Compute
supports pinning each worker to different accelerators using the
available_accelerators option of the GlobusComputeEngine. Provide either the
number of accelerators (Globus Compute will assume they are named in integers starting
from zero) or a list of the names of the accelerators available on the node. Each
Globus Compute worker will have the following environment variables set to the worker
specific identity assigned: CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES,
SYCL_DEVICE_FILTER.
engine:
type: GlobusComputeEngine
max_workers_per_node: 4
# `available_accelerators` may be a natural number or a list of strings.
# If an integer, then each worker launched will have an automatically
# generated environment variable. In this case, one of 0, 1, 2, or 3.
# Alternatively, specific strings may be utilized.
available_accelerators: 4
# available_accelerators: ["opencl:gpu:1", "opencl:gpu:2"] # alternative
provider:
type: LocalProvider
init_blocks: 1
min_blocks: 0
max_blocks: 1