Example Configurations

While Globus Compute is in use on various systems around the world, arriving at a working configuration that matches the constraints of the underlying system and the requirements of the site administrator often takes some trial and error. Below are known-working example configurations for several well-known systems, provided as a reference for getting started.

If you would like to add your system to this list, please contact the Globus Compute team via Slack. (The #help channel is a good place to start.)

Note

All configuration examples below must be customized for the user’s allocation, Python environment, file system, etc.
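
Each example below is the endpoint's YAML configuration, which typically lives in ~/.globus_compute/<endpoint-name>/config.yaml and takes effect the next time the endpoint is started. Once an endpoint is running with one of these configurations, tasks can be submitted to it with the Globus Compute SDK. The snippet below is a minimal sketch of such a submission; the function name and the endpoint UUID are placeholders, the latter standing in for the UUID reported when your endpoint starts.

# Minimal sketch: submit a function to a running endpoint with the SDK.
# The endpoint UUID below is a placeholder; substitute the UUID reported
# by your endpoint.
from globus_compute_sdk import Executor

def double(x):
    return 2 * x

endpoint_id = "00000000-0000-0000-0000-000000000000"  # placeholder

with Executor(endpoint_id=endpoint_id) as gce:
    future = gce.submit(double, 21)
    print(future.result())  # 42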

Anvil (RCAC, Purdue)

The following snippet shows an example configuration for executing remotely on Anvil, a supercomputer at Purdue University’s Rosen Center for Advanced Computing (RCAC). The configuration assumes the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

amqp_port: 443
display_name: Anvil CPU
engine:
  type: GlobusComputeEngine
  max_workers_per_node: 2

  address:
    type: address_by_interface
    ifname: ib0

  provider:
    type: SlurmProvider
    partition: debug

    account: {{ ACCOUNT }}
    launcher:
        type: SrunLauncher

    # string to prepend to #SBATCH blocks in the submit
    # script to the scheduler
    # e.g., "#SBATCH --constraint=knl,quad,cache"
    scheduler_options: {{ OPTIONS }}

    # Command to be run before starting a worker
    # e.g., "module load anaconda; source activate gce_env
    worker_init: {{ COMMAND }}

    init_blocks: 1
    max_blocks: 1
    min_blocks: 0

    walltime: 00:05:00

Delta (NCSA)

The following snippet shows an example configuration for executing remotely on Delta, a supercomputer at the National Center for Supercomputing Applications. The configuration assumes the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

amqp_port: 443
display_name: NCSA Delta 2 CPU
engine:
    type: GlobusComputeEngine
    max_workers_per_node: 2

    address:
        type: address_by_interface
        ifname: eth6.560

    provider:
        type: SlurmProvider
        partition: cpu
        account: {{ ACCOUNT NAME }}

        launcher:
            type: SrunLauncher

        # Command to be run before starting a worker
        # e.g., "module load anaconda3; source activate gce_env"
        worker_init: {{ COMMAND }}

        init_blocks: 1
        min_blocks: 0
        max_blocks: 1

        walltime: 00:30:00

Expanse (SDSC)

The following snippet shows an example configuration for executing remotely on Expanse, a supercomputer at the San Diego Supercomputer Center. The configuration assumes the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

display_name: Expanse@SDSC

engine:
    type: GlobusComputeEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
        type: address_by_interface
        ifname: ib0

    provider:
        type: SlurmProvider
        partition: compute
        account: {{ ACCOUNT }}

        launcher:
            type: SrunLauncher

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # e.g., "#SBATCH --constraint=knl,quad,cache"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load anaconda3; source activate gce_env"
        worker_init: {{ COMMAND }}

        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        walltime: 00:05:00

The GlobusMPIEngine adds support for running MPI applications. The following snippet shows an example configuration for Expanse that uses the SlurmProvider to provision batch jobs of 4 nodes each, which can then be dynamically partitioned to launch MPI functions with srun.

display_name: ExpanseMPI@SDSC

engine:
    type: GlobusMPIEngine
    mpi_launcher: srun

    address:
        type: address_by_interface
        ifname: ib0

    provider:
        type: SlurmProvider
        partition: compute
        account: {{ ACCOUNT }}

        launcher:
            type: SimpleLauncher

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # e.g., "#SBATCH --constraint=knl,quad,cache"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load anaconda3; source activate gce_env"
        worker_init: {{ COMMAND }}

        nodes_per_block: 4
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        walltime: 00:05:00
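
Once such an endpoint is running, each task can carry a resource specification describing how many nodes and ranks it needs, and the engine partitions the provisioned nodes accordingly. The sketch below illustrates the idea; the resource_specification attribute and the PARSL_MPI_PREFIX variable follow the Globus Compute and Parsl MPI documentation but should be treated as assumptions to verify against your installed SDK version, and the endpoint UUID and function name are placeholders.

# Hedged sketch: submitting an MPI-style task to a GlobusMPIEngine endpoint.
# resource_specification and $PARSL_MPI_PREFIX are assumptions drawn from the
# Globus Compute / Parsl MPI docs; verify against your SDK version.
from globus_compute_sdk import Executor

def mpi_hostnames():
    import subprocess
    # $PARSL_MPI_PREFIX is expected to expand to the srun invocation for the
    # nodes assigned to this task (assumption).
    return subprocess.run(
        "$PARSL_MPI_PREFIX hostname", shell=True, capture_output=True, text=True
    ).stdout

with Executor(endpoint_id="00000000-0000-0000-0000-000000000000") as gce:
    # Ask for 2 of the 4 provisioned nodes, with 2 ranks on each
    gce.resource_specification = {"num_nodes": 2, "ranks_per_node": 2}
    print(gce.submit(mpi_hostnames).result())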

UChicago AI Cluster

The following snippet shows an example configuration for the University of Chicago’s AI Cluster. The configuration assumes the user is running on a login node and uses the SlurmProvider to interface with the scheduler and launch onto the GPUs.

display_name: AI Cluster CS@UChicago
engine:
    type: GlobusComputeEngine
    label: fe.cs.uchicago
    worker_debug: False

    address:
        type: address_by_interface
        ifname: ens2f1

    provider:
        type: SlurmProvider
        partition: general

        # This is a workaround: `hostname;` terminates the srun command that
        # the launcher generates, allowing us to start our own srun invocation.
        launcher:
            type: SrunLauncher
            overrides: >
                hostname; srun --ntasks={{ TOTAL_WORKERS }}
                --ntasks-per-node={{ WORKERS_PER_NODE }}
                --gpus-per-task=rtx2080ti:{{ GPUS_PER_WORKER }}
                --gpu-bind=map_gpu:{{ GPU_MAP }} \
            # To request a single gpu, use the following:
            #   hostname; srun --ntasks=1
            #   --ntasks-per-node=1
            #   --gres=gpu:1 \

        # Scale between 0-1 blocks with 1 node per block
        nodes_per_block: 1
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Here is some Python that demonstrates how to compute the variables in the YAML example above:

# Example: 2 nodes, 4 GPUs per node, 2 GPUs per worker (2 workers per node)
# Modify before use
NODES_PER_JOB = 2
GPUS_PER_NODE = 4
GPUS_PER_WORKER = 2

# DO NOT MODIFY
TOTAL_WORKERS = int((NODES_PER_JOB * GPUS_PER_NODE) / GPUS_PER_WORKER)
WORKERS_PER_NODE = int(GPUS_PER_NODE / GPUS_PER_WORKER)
GPU_MAP = ",".join([str(x) for x in range(1, TOTAL_WORKERS + 1)])
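
With the example values above, TOTAL_WORKERS evaluates to 4, WORKERS_PER_NODE to 2, and GPU_MAP to "1,2,3,4".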

Midway (RCC, UChicago)

The Midway cluster is a campus cluster hosted by the Research Computing Center (RCC) at the University of Chicago. The snippet below shows an example configuration for executing remotely on Midway. The configuration assumes the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

engine:
    type: GlobusComputeEngine
    label: Midway@RCC.UChicago
    max_workers_per_node: 2

    address:
        type: address_by_interface
        ifname: bond0

    provider:
        type: SlurmProvider

        launcher:
            type: SrunLauncher

        # e.g., pi-compute
        account: {{ ACCOUNT }}

        # e.g., caslake
        partition: {{ PARTITION }}

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # e.g., "#SBATCH --gres=gpu:4"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate compute-env"
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

The following configuration example uses an Apptainer (formerly Singularity) container on Midway.

engine:
    type: GlobusComputeEngine
    label: Midway@RCC.UChicago
    max_workers_per_node: 10

    address:
        type: address_by_interface
        ifname: bond0

    container_type: apptainer
    container_cmd_options: -H /home/$USER
    container_uri: {{ CONTAINER_URI }}

    provider:
        type: SlurmProvider

        launcher:
            type: SrunLauncher

        # e.g., pi-compute
        account: {{ ACCOUNT }}

        # e.g., caslake
        partition: {{ PARTITION }}

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # e.g., "#SBATCH --gres=gpu:4"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate compute-env"
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Kubernetes Clusters

Kubernetes is an open-source system for container management, automating the deployment and scaling of containers. The snippet below shows an example configuration for deploying pods as workers on a Kubernetes cluster. The KubernetesProvider uses the Python Kubernetes API, which assumes that a kubeconfig file is available at ~/.kube/config.

heartbeat_period: 15
heartbeat_threshold: 200

engine:
    type: GlobusComputeEngine
    max_workers_per_node: 1

    # Encryption is not currently supported for KubernetesProvider
    encrypted: false

    address:
      type: address_by_route

    provider:
        type: KubernetesProvider
        init_blocks: 0
        min_blocks: 0
        max_blocks: 2
        init_cpu: 1
        max_cpu: 4
        init_mem: 1024Mi
        max_mem: 4096Mi

        # e.g., default
        namespace: {{ NAMESPACE }}

        # e.g., python:3.12-bookworm
        image: {{ IMAGE }}

        # The secret key to download the image
        secret: {{ SECRET }}

        # e.g., "pip install --force-reinstall globus-compute-endpoint"
        worker_init: {{ COMMAND }}

Polaris (ALCF)

The following snippet shows an example configuration for executing on Argonne Leadership Computing Facility’s Polaris cluster. This example uses the GlobusComputeEngine and connects to Polaris’s PBS scheduler using the PBSProProvider. This configuration assumes that the script is being executed on the login node of Polaris.

display_name: Polaris@ALCF

engine:
  type: GlobusComputeEngine
  max_workers_per_node: 4

  # Un-comment to give each worker exclusive access to a single GPU
  # available_accelerators: 4

  address:
    type: address_by_interface
    ifname: hsn0

  provider:
    type: PBSProProvider

    launcher:
      type: MpiExecLauncher
      # Ensures one manager per node, with work distributed across all 64 cores
      bind_cmd: --cpu-bind
      overrides: --depth=64 --ppn 1

    account: {{ YOUR_POLARIS_ACCOUNT }}
    queue: debug-scaling
    cpus_per_node: 32
    select_options: ngpus=4

    # e.g., "#PBS -l filesystems=home:grand:eagle\n#PBS -k doe"
    scheduler_options: "#PBS -l filesystems=home:grand:eagle"

    # Node setup: activate necessary conda environment and such
    worker_init: {{ COMMAND }}

    walltime: 01:00:00
    nodes_per_block: 1
    init_blocks: 0
    min_blocks: 0
    max_blocks: 2

Perlmutter (NERSC)

The following snippet shows an example configuration for accessing NERSC's Perlmutter supercomputer. This example uses the GlobusComputeEngine and connects to Perlmutter's Slurm scheduler. It is configured to request 2 nodes with 1 TaskBlock per node. Finally, it includes override information to request a particular node type (GPU) and to configure a specific Python environment on the worker nodes using Anaconda.

display_name: Perlmutter@NERSC
engine:
    type: GlobusComputeEngine
    worker_debug: False

    address:
        type: address_by_interface
        ifname: hsn0

    provider:
        type: SlurmProvider
        partition: debug

        # We request all hyperthreads on a node.
        # GPU nodes have 128 threads, CPU nodes have 256 threads
        launcher:
            type: SrunLauncher
            overrides: -c 128

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # For GPUs in the debug qos, e.g., "#SBATCH --constraint=gpu\n#SBATCH --gpus-per-node=4"
        scheduler_options: {{ OPTIONS }}

        # Your NERSC account, e.g., "m0000"
        account: {{ NERSC_ACCOUNT }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate parsl_env"
        worker_init: {{ COMMAND }}

        # increase the command timeouts
        cmd_timeout: 120

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 10 minutes
        walltime: 00:10:00

Frontera (TACC)

The following snippet shows an example configuration for accessing the Frontera system at TACC. The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

display_name: Frontera@TACC

engine:
    type: GlobusComputeEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
      type: address_by_interface
      ifname: ib0

    provider:
        type: SlurmProvider

        # e.g., EAR22001
        account: {{ YOUR_FRONTERA_ACCOUNT }}

        # e.g., development
        partition: {{ PARTITION }}

        launcher:
            type: SrunLauncher

        # Enter scheduler_options if needed
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate parsl_env"
        worker_init: {{ COMMAND }}

        # Add extra time for slow scheduler responses
        cmd_timeout: 60

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Bebop (LCRC, ANL)

The following snippet shows an example configuration for accessing the Bebop system at Argonne’s LCRC. The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

display_name: Bebop@ANL

engine:
    type: GlobusComputeEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
        type: address_by_interface
        ifname: ib0

    provider:
        type: SlurmProvider
        partition: {{ PARTITION }}  # e.g., bdws
        launcher:
          type: SrunLauncher

        # Command to be run before starting a worker
        # e.g., "module load anaconda; source activate gce_env"
        worker_init: {{ COMMAND }}

        nodes_per_block: 1
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1
        walltime: 00:30:00

Bridges-2 (PSC)

The following snippet shows an example configuration for accessing the Bridges-2 system at PSC. The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

engine:
    type: GlobusComputeEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
      type: address_by_interface
      ifname: ens3f0

    provider:
        type: SlurmProvider
        partition: RM-small

        launcher:
            type: SrunLauncher

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # e.g., "#SBATCH --constraint=knl,quad,cache"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., module load Anaconda; source activate parsl_env
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

FASTER (TAMU)

The following snippet shows an example configuration for accessing the FASTER system at Texas A&M University (TAMU). The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

amqp_port: 443
display_name: Access Tamu Faster
engine:
    type: GlobusComputeEngine
    worker_debug: False

    strategy:
        type: SimpleStrategy
        max_idletime: 90

    address:
        type: address_by_interface
        ifname: eno8303

    provider:
        type: SlurmProvider
        partition: cpu
        mem_per_node: 128

        launcher:
            type: SrunLauncher

        scheduler_options: {{ OPTIONS }}

        worker_init: {{ COMMAND }}

        # increase the command timeouts
        cmd_timeout: 120

        # Scale between 0-1 blocks with 1 node per block
        nodes_per_block: 1
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 10 minutes
        walltime: 00:10:00

Open Science Pool

The Open Science Pool is a pool of opportunistic computing resources operated for all US-associated open science by the OSG consortium. Unlike traditional HPC clusters, these computational resources are offered by loosely connected campus and research clusters. The configuration below uses the CondorProvider to interface with the scheduler, and uses Apptainer to distribute the computational environment to the workers.

Warning

GlobusComputeEngine relies on a shared filesystem to distribute the keys used for encrypting communication between the endpoint and workers. Since OSPool does not provide a writable shared filesystem, encryption is disabled in the configuration below.

display_name: OSPool
engine:
  type: GlobusComputeEngine
  max_workers_per_node: 1

  # This config uses Apptainer containerization to ensure a consistent
  # Python environment on the worker side. Since Apptainer limits writable
  # directory paths, set the working directory paths used by the worker to /tmp.
  # Note: these filepaths remain private to the container and will not be
  #       accessible on the host system.
  worker_logdir_root: /tmp/logs
  working_dir: /tmp/tasks_dir

  # GlobusComputeEngine relies on a shared filesystem to distribute keys used
  # for encrypting communication between the endpoint and workers.
  # Since OSPool does not provide a writable shared filesystem,
  # **encryption** is disabled in the configuration below.
  encrypted: False

  provider:

    type: CondorProvider
    init_blocks: 1
    max_blocks: 1
    min_blocks: 0

    # Specify ProjectName and Apptainer image
    scheduler_options: >
      +ProjectName = {{ PROJECT_NAME }}

      # To use Apptainer on OSPool, build an Apptainer image, copy it to
      # OSDF, and specify the full image path, e.g.:
      # "osdf:///ospool/ap20/data/USERNAME/globus_compute_py3.11.v1.sif"

      +SingularityImage = {{ APPTAINER_IMAGE_PATH }}

      # Add an HTCondor requirement to guarantee that worker nodes support Apptainer

      Requirements = HAS_SINGULARITY == True && OSG_HOST_KERNEL_VERSION >= 31000

Stampede3 (TACC)

Stampede3 is a Dell Technologies and Intel-based supercomputer at the Texas Advanced Computing Center (TACC). The following snippet shows an example configuration that uses the SlurmProvider to interface with the batch scheduler, and uses the SrunLauncher to launch workers across nodes.

display_name: Stampede3@TACC

engine:
  type: GlobusComputeEngine
  max_workers_per_node: 2

  address:
    type: address_by_interface
    ifname: ibp10s0

  provider:
    type: SlurmProvider

    # e.g., EAR22001
    account: {{ YOUR_TACC_ALLOCATION }}

    # e.g., skx-dev
    partition: {{ PARTITION }}

    launcher:
      type: SrunLauncher

    # Enter scheduler_options if needed
    scheduler_options: {{ OPTIONS }}

    # Command to be run before starting a worker
    # e.g., "module load Anaconda; source activate parsl_env"
    worker_init: {{ COMMAND }}

    # Add extra time for slow scheduler responses
    cmd_timeout: 60

    # Scale between 0-1 blocks with 1 node per block
    nodes_per_block: 1
    init_blocks: 0
    min_blocks: 0
    max_blocks: 1

    # Hold blocks for 30 minutes
    walltime: 00:30:00

Pinning Workers to Devices

Many modern clusters provide multiple accelerators per compute node, yet many applications are best suited to using a single accelerator per task. Globus Compute supports pinning each worker to a different accelerator via the available_accelerators option of the GlobusComputeEngine. Provide either the number of accelerators (Globus Compute will assume they are named by integers starting from zero) or a list of the names of the accelerators available on the node. Each Globus Compute worker will have the following environment variables set to the worker-specific identity assigned: CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, SYCL_DEVICE_FILTER.

engine:
    type: GlobusComputeEngine
    max_workers_per_node: 4

    # `available_accelerators` may be a natural number or a list of strings.
    # If an integer, then each worker launched will have an automatically
    # generated environment variable. In this case, one of 0, 1, 2, or 3.
    # Alternatively, specific strings may be utilized.
    available_accelerators: 4
    # available_accelerators: ["opencl:gpu:1", "opencl:gpu:2"]  # alternative

    provider:
        type: LocalProvider
        init_blocks: 1
        min_blocks: 0
        max_blocks: 1
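
A quick way to confirm the pinning is to submit a task that reports the accelerator environment variables it sees. The sketch below does exactly that; the endpoint UUID and the function name are placeholders for illustration.

# Minimal sketch: report which accelerator(s) a worker was pinned to.
# The endpoint UUID is a placeholder for a running endpoint configured
# with available_accelerators as above.
from globus_compute_sdk import Executor

def visible_devices():
    import os
    return {
        var: os.environ.get(var)
        for var in ("CUDA_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "SYCL_DEVICE_FILTER")
    }

with Executor(endpoint_id="00000000-0000-0000-0000-000000000000") as gce:
    print(gce.submit(visible_devices).result())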