Flower simulation works when run directly on a compute node but fails silently under srun and sbatch?

I’m running a Flower simulation on a SLURM HPC cluster and getting inconsistent behavior depending on how I launch it. Hoping someone with SLURM experience can help.

Setup:

  • Flower [1.28.0], Ray [2.51.1], Python 3.12, conda environment

  • Plain FedAvg, 5 clients, small PyTorch MLP on a tabular dataset

  • Simulation backend (Ray), flwr run . --stream

What works:

  • ssh directly onto a compute node, activate my conda env, flwr run . --stream → completes all 3 rounds in ~2 min.

  • Also works on the login node

What fails:

  • In an interactive srun --pty session and in an sbatch job, the run reaches the following point and then stops:

[ROUND 1]
configure_fit: strategy sampled 5 clients (out of 5)

No error message is printed.

The sbatch script:

#!/bin/bash
#SBATCH --job-name=fl_run
#SBATCH --cpus-per-task=32
#SBATCH --mem=64G
#SBATCH --partition=long
#SBATCH --time=04:00:00
#SBATCH --output=logs/fl_%j.out
#SBATCH --account=2001786

# Activate your environment
spack load miniconda3
source activate Masterthesis

export RAY_DISABLE_RUNTIME_ENV_FILE_DEPTH_LIMIT=1

flwr federation simulation-config --num-supernodes 5 --client-resources-num-cpus 5 --client-resources-num-gpus 0.0

# Run
flwr run . --stream

echo ">>> Done. DB saved as mlflow_${SLURM_JOB_ID}.db"

Output:

:white_check_mark: Updated simulation configuration.
:confetti_ball: Successfully started run 2901953363831888993
INFO : Starting logstream for run_id 2901953363831888993
INFO : Starting Flower Simulation
DEBUG:flwr:Initialising: RayBackend
DEBUG:flwr:Backend config: {'client_resources': {'num_cpus': 5, 'num_gpus': 0.0}, 'init_args': {'logging_level': 'WARNING', 'log_to_driver': True}, 'actor': {'tensorflow': 0}}
INFO : Starting Flower ServerApp, config: num_rounds=3, no round_timeout
INFO:flwr:Starting Flower ServerApp, config: num_rounds=3, no round_timeout
INFO :
INFO:flwr:
INFO : [INIT]
INFO:flwr:[INIT]
INFO : Using initial global parameters provided by strategy
INFO:flwr:Using initial global parameters provided by strategy
INFO : Starting evaluation of initial global parameters
INFO:flwr:Starting evaluation of initial global parameters
INFO : Evaluation returned no results (None)
INFO:flwr:Evaluation returned no results (None)
INFO :
INFO:flwr:
INFO : [ROUND 1]
INFO:flwr:[ROUND 1]
INFO : configure_fit: strategy sampled 5 clients (out of 5)
INFO:flwr:configure_fit: strategy sampled 5 clients (out of 5)
Done. DB saved as mlflow_745334.db
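
One thing I can still check is whether the allocation exposes enough CPUs for the client resources I configured (5 clients at 5 CPUs each). The sketch below is only a diagnostic, not a fix, and rests on an assumption I have not verified on this cluster: that SLURM restricts the job through CPU affinity or cgroups, so that nproc reports the CPUs actually usable inside the allocation. The helper names required_cpus and concurrent_clients are made up for illustration. I would run it once over plain ssh and once inside srun/sbatch and compare.

```shell
#!/bin/bash
# Diagnostic sketch: compare the CPUs visible inside the allocation with
# what the simulation config asks for. Helper names are hypothetical.

# CPUs needed to run all clients at the same time.
required_cpus() {
    local num_clients=$1 cpus_per_client=$2
    echo $(( num_clients * cpus_per_client ))
}

# How many clients fit concurrently into a given CPU budget.
concurrent_clients() {
    local visible=$1 cpus_per_client=$2
    echo $(( visible / cpus_per_client ))
}

# nproc reports the processing units available to the current process,
# which can be fewer than the node's total if an affinity mask is set.
visible=$(nproc)

echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-unset}"
echo "SLURM_CPUS_ON_NODE=${SLURM_CPUS_ON_NODE:-unset}"
echo "CPUs visible to this shell: ${visible}"
echo "CPUs for 5 clients at once: $(required_cpus 5 5)"
echo "Clients that fit at once:   $(concurrent_clients "$visible" 5)"
```

If "Clients that fit at once" comes out as 0 inside the allocation but not over ssh, that would at least be consistent with clients never being scheduled after configure_fit, although it would not prove that this is the cause.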