Example Globus Compute Endpoint Configurations

Globus Compute has been used on various systems around the world. Below are example configurations for commonly used systems. If you would like to add your system to this list please contact the Globus Compute Team via Slack.

Note

All configuration examples below must be customized for the user’s allocation, Python environment, file system, etc.

GlobusComputeEngine

GlobusComputeEngine is the execution backend that Globus Compute uses to execute functions. To execute functions at scale, Globus Compute can be configured to use a range of Providers which allows it to connect to Batch schedulers like Slurm and PBSTorque to provision compute nodes dynamically in response to workload. These capabilities are largely borrowed from Parsl’s HighThroughputExecutor and therefore all of HighThroughputExecutor’s parameter options are supported as passthrough.

Note:: As of globus-compute-endpoint==2.12.0 GlobusComputeEngine, replaces the HighThroughputEngine as the default executor.

Here are GlobusComputeEngine specific features:

Retries

Functions submitted to the GlobusComputeEngine can fail due to infrastructure failures, for example, the worker executing the task might terminate due to it running out of memory, or all workers under a batch job could fail due to the batch job exiting as it reaches the walltime limit. GlobusComputeEngine can be configured to automatically retry these tasks by setting max_retries_on_system_failure=N where N is the number of retries allowed. The endpoint config sets default retries to 0 since functions can be computationally expensive, not idempotent, or leave side effects that affect subsequent retries.

Example config snippet:

amqp_port: 443
display_name: Retry_2_times
engine:
    type: GlobusComputeEngine
    max_retries_on_system_failure: 2  # Default=0

Auto-Scaling

GlobusComputeEngine by default automatically scales workers in response to workload.

Strategy configuration is limited to two options:

  1. max_idletime: Maximum duration in seconds that workers are allowed to idle before they are marked for termination

  2. strategy_period: Set the # of seconds between strategy attempting auto-scaling events

The bounds for scaling are determined by the options to the Provider (init_blocks, min_blocks, max_blocks). Please refer to the https://parsl.readthedocs.io/en/stable/userguide/execution.html#elasticity for more info.

Here’s an example configuration:

engine:
    type: GlobusComputeEngine
    job_status_kwargs:
        max_idletime: 60.0      # Default = 120s
        strategy_period: 120.0  # Default = 5s

Anvil (RCAC, Purdue)

../_images/anvil.jpeg

The following snippet shows an example configuration for executing remotely on Anvil, a supercomputer at Purdue University’s Rosen Center for Advanced Computing (RCAC). The configuration assumes the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

amqp_port: 443
display_name: Anvil CPU
engine:
  type: GlobusComputeEngine
  max_workers_per_node: 2

  address:
    type: address_by_interface
    ifname: ib0

  provider:
    type: SlurmProvider
    partition: debug

    account: {{ ACCOUNT }}
    launcher:
        type: SrunLauncher

    # string to prepend to #SBATCH blocks in the submit
    # script to the scheduler
    # e.g., "#SBATCH --constraint=knl,quad,cache"
    scheduler_options: {{ OPTIONS }}

    # Command to be run before starting a worker
    # e.g., "module load anaconda; source activate gce_env
    worker_init: {{ COMMAND }}

    init_blocks: 1
    max_blocks: 1
    min_blocks: 0

    walltime: 00:05:00

Delta (NCSA)

../_images/delta_front.png

The following snippet shows an example configuration for executing remotely on Delta, a supercomputer at the National Center for Supercomputing Applications. The configuration assumes the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

amqp_port: 443
display_name: NCSA Delta 2 CPU
engine:
    type: GlobusComputeEngine
    max_workers_per_node: 2

    address:
        type: address_by_interface
        ifname: eth6.560

    provider:
        type: SlurmProvider
        partition: cpu
        account: {{ ACCOUNT NAME }}

        launcher:
            type: SrunLauncher

        # Command to be run before starting a worker
        # e.g., "module load anaconda3; source activate gce_env"
        worker_init: {{ COMMAND }}

        init_blocks: 1
        min_blocks: 0
        max_blocks: 1

        walltime: 00:30:00

Expanse (SDSC)

../_images/expanse.jpeg

The following snippet shows an example configuration for executing remotely on Expanse, a supercomputer at the San Diego Supercomputer Center. The configuration assumes the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

display_name: Expanse@SDSC

engine:
    type: GlobusComputeEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
        type: address_by_interface
        ifname: ib0

    provider:
        type: SlurmProvider
        partition: compute
        account: {{ ACCOUNT }}

        launcher:
            type: SrunLauncher

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # e.g., "#SBATCH --constraint=knl,quad,cache"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load anaconda3; source activate gce_env"
        worker_init: {{ COMMAND }}

        # Command to be run before starting a worker
        # e.g., "module load anaconda3; source activate gce_env"
        worker_init: "source ~/setup.sh"

        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        walltime: 00:05:00

UChicago AI Cluster

../_images/ai-science-web.jpeg

The following snippet shows an example configuration for the University of Chicago’s AI Cluster. The configuration assumes the user is running on a login node and uses the SlurmProvider to interface with the scheduler and launch onto the GPUs.

Link to docs.

display_name: AI Cluster CS@UChicago
engine:
    type: GlobusComputeEngine
    label: fe.cs.uchicago
    worker_debug: False

    address:
        type: address_by_interface
        ifname: ens2f1

    provider:
        type: SlurmProvider
        partition: general

        # This is a hack. We use hostname ; to terminate the srun command, and
        # start our own.
        launcher:
            type: SrunLauncher
            overrides: >
                hostname; srun --ntasks={{ TOTAL_WORKERS }}
                --ntasks-per-node={{ WORKERS_PER_NODE }}
                --gpus-per-task=rtx2080ti:{{ GPUS_PER_WORKER }}
                --gpu-bind=map_gpu:{{ GPU_MAP }} \
            # To request a single gpu, use the following:
            #   hostname; srun --ntasks=1
            #   --ntasks-per-node=1
            #   --gres=gpu:1 \

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 1
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Here is some Python that demonstrates how to compute the variables in the YAML example above:

# Launch 4 managers per node, each bound to 1 GPU
# Modify before use
NODES_PER_JOB = 2
GPUS_PER_NODE = 4
GPUS_PER_WORKER = 2

# DO NOT MODIFY
TOTAL_WORKERS = int((NODES_PER_JOB * GPUS_PER_NODE) / GPUS_PER_WORKER)
WORKERS_PER_NODE = int(GPUS_PER_NODE / GPUS_PER_WORKER)
GPU_MAP = ",".join([str(x) for x in range(1, TOTAL_WORKERS + 1)])

Midway (RCC, UChicago)

../_images/20140430_RCC_8978.jpg

The Midway cluster is a campus cluster hosted by the Research Computing Center at the University of Chicago. The snippet below shows an example configuration for executing remotely on Midway. The configuration assumes the user is running on a login node and uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

display_name: Midway3@rcc.uchicago.edu

engine:
    type: HighThroughputEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
        type: address_by_interface
        ifname: bond0

    provider:
        type: SlurmProvider
        partition: broadwl

        launcher:
            type: SrunLauncher

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # e.g., "#SBATCH --constraint=knl,quad,cache"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., module load Anaconda; source activate parsl_env
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

The following configuration is an example to use singularity container on Midway.

engine:
    type: HighThroughputEngine
    max_workers_per_node: 10

    address:
        type: address_by_interface
        ifname: bond0

    scheduler_mode: soft
    worker_mode: singularity_reuse
    container_type: singularity
    container_cmd_options: -H /home/$USER

    provider:
        type: SlurmProvider
        partition: broadwl

        launcher:
            type: SrunLauncher

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # eg: "#SBATCH --constraint=knl,quad,cache"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate parsl_env"
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Kubernetes Clusters

../_images/kuberneteslogo.eabc6359f48c8e30b7a138c18177f3fd39338e05.png

Kubernetes is an open-source system for container management, such as automating deployment and scaling of containers. The snippet below shows an example configuration for deploying pods as workers on a Kubernetes cluster. The KubernetesProvider exploits the Python Kubernetes API, which assumes that you have kube config in ~/.kube/config.

heartbeat_period: 15
heartbeat_threshold: 200
log_dir: "."

engine:
    type: HighThroughputEngine
    label: Kubernetes_funcX
    max_workers_per_node: 1

    address:
      type: address_by_route

    scheduler_mode: hard
    container_type: docker

    strategy:
        type: KubeSimpleStrategy
        max_idletime: 3600

    provider:
        type: KubernetesProvider
        init_blocks: 0
        min_blocks: 0
        max_blocks: 2
        init_cpu: 1
        max_cpu: 4
        init_mem: 1024Mi
        max_mem: 4096Mi

        # e.g., python:3.8-buster
        image: {{ IMAGE }}

        # e.g., "pip install --force-reinstall globus_compute_endpoint>=2.0.1"
        worker_init: {{ COMMAND }}

        # e.g., default
        namespace: {{ NAMESPACE }}

        incluster_config: False

Polaris (ALCF)

../_images/ALCF_Polaris.jpeg

The following snippet shows an example configuration for executing on Argonne Leadership Computing Facility’s Polaris cluster. This example uses the HighThroughputEngine and connects to Polaris’s PBS scheduler using the PBSProProvider. This configuration assumes that the script is being executed on the login node of Polaris.

display_name: Polaris@ALCF

engine:
  type: GlobusComputeEngine
  max_workers_per_node: 4

  # Un-comment to give each worker exclusive access to a single GPU
  # available_accelerators: 4

  address:
    type: address_by_interface
    ifname: bond0

  provider:
    type: PBSProProvider

    launcher:
      type: MpiExecLauncher
      # Ensures 1 manger per node, work on all 64 cores
      bind_cmd: --cpu-bind
      overrides: --depth=64 --ppn 1

    account: {{ YOUR_POLARIS_ACCOUNT }}
    queue: debug-scaling
    cpus_per_node: 32
    select_options: ngpus=4

    # e.g., "#PBS -l filesystems=home:grand:eagle\n#PBS -k doe"
    scheduler_options: "#PBS -l filesystems=home:grand:eagle"

    # Node setup: activate necessary conda environment and such
    worker_init: {{ COMMAND }}

    walltime: 01:00:00
    nodes_per_block: 1
    init_blocks: 0
    min_blocks: 0
    max_blocks: 2

Perlmutter (NERSC)

../_images/Nersc9-image-compnew-sizer7-group-type-4-1.jpg

The following snippet shows an example configuration for accessing NERSC’s Perlmutter supercomputer. This example uses the HighThroughputEngine and connects to Perlmutters’s Slurm scheduler. It is configured to request 2 nodes configured with 1 TaskBlock per node. Finally, it includes override information to request a particular node type (GPU) and to configure a specific Python environment on the worker nodes using Anaconda.

display_name: Permutter@NERSC
engine:
    type: GlobusComputeEngine
    worker_debug: False

    address:
        type: address_by_interface
        ifname: hsn0

    provider:
        type: SlurmProvider
        partition: debug

        # We request all hyperthreads on a node.
        # GPU nodes have 128 threads, CPU nodes have 256 threads
        launcher:
            type: SrunLauncher
            overrides: -c 128

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # For GPUs in the debug qos eg: "#SBATCH --constraint=gpu"
        scheduler_options: {{ OPTIONS }}

        # Your NERSC account, eg: "m0000"
        account: {{ NERSC_ACCOUNT }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate parsl_env"
        worker_init: {{ COMMAND }}

        # increase the command timeouts
        cmd_timeout: 120

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 10 minutes
        walltime: 00:10:00

Frontera (TACC)

../_images/frontera-banner-home.jpg

The following snippet shows an example configuration for accessing the Frontera system at TACC. The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

display_name: Frontera@TACC

engine:
    type: GlobusComputeEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
      type: address_by_interface
      ifname: ib0

    provider:
        type: SlurmProvider

        # e.g., EAR22001
        account: {{ YOUR_FRONTERA_ACCOUNT }}

        # e.g., development
        partition: {{ PARTITION }}

        launcher:
            type: SrunLauncher

        # Enter scheduler_options if needed
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate parsl_env"
        worker_init: {{ COMMAND }}

        # Add extra time for slow scheduler responses
        cmd_timeout: 60

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Bebop (LCRC, ANL)

../_images/Bebop.jpeg

The following snippet shows an example configuration for accessing the Bebop system at Argonne’s LCRC. The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

display_name: Bebop@ANL

engine:
    type: GlobusComputeEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
        type: address_by_interface
        ifname: ib0

    provider:
        type: SlurmProvider
        partition: {{ PARTITION }}  # e.g., bdws
        launcher:
          type: SrunLauncher

        # Command to be run before starting a worker
        # e.g., "module load anaconda; source activate gce_env"
        worker_init: {{ COMMAND }}

        nodes_per_block: 1
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1
        walltime: 00:30:00

Bridges-2 (PSC)

../_images/bridges-2.png

The following snippet shows an example configuration for accessing the Bridges-2 system at PSC. The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

engine:
    type: GlobusComputeEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
      type: address_by_interface
      ifname: ens3f0

    provider:
        type: SlurmProvider
        partition: RM-small

        launcher:
            type: SrunLauncher

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # e.g., "#SBATCH --constraint=knl,quad,cache"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., module load Anaconda; source activate parsl_env
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

FASTER (TAMU)

The following snippet shows an example configuration for accessing the FASTER system at Texas A & M (TAMU). The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

amqp_port: 443
display_name: Access Tamu Faster
engine:
    type: GlobusComputeEngine
    worker_debug: False

    strategy:
        type: SimpleStrategy
        max_idletime: 90

    address:
        type: address_by_interface
        ifname: eno8303

    provider:
        type: SlurmProvider
        partition: cpu
        mem_per_node: 128

        launcher:
            type: SrunLauncher

        scheduler_options: {{ OPTIONS }}

        worker_init: {{ COMMAND }}

        # increase the command timeouts
        cmd_timeout: 120

        # Scale between 0-1 blocks with 1 nodes per block
        nodes_per_block: 1
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 10 minutes
        walltime: 00:10:00

Pinning Workers to devices

Many modern clusters provide multiple accelerators per compute note, yet many applications are best suited to using a single accelerator per task. Globus Compute supports pinning each worker to different accelerators using the available_accelerators option of the GlobusComputeEngine. Provide either the number of accelerators (Globus Compute will assume they are named in integers starting from zero) or a list of the names of the accelerators available on the node. Each Globus Compute worker will have the following environment variables set to the worker specific identity assigned: CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, SYCL_DEVICE_FILTER.

engine:
    type: GlobusComputeEngine
    max_workers_per_node: 4

    # `available_accelerators` may be a natural number or a list of strings.
    # If an integer, then each worker launched will have an automatically
    # generated environment variable. In this case, one of 0, 1, 2, or 3.
    # Alternatively, specific strings may be utilized.
    available_accelerators: 4
    # available_accelerators: ["opencl:gpu:1", "opencl:gpu:2"]  # alternative

    provider:
        type: LocalProvider
        init_blocks: 1
        min_blocks: 0
        max_blocks: 1