Simulation works when run directly on a compute node but fails silently under srun and under sbatch?

I’m running a Flower simulation on a SLURM HPC cluster and getting inconsistent behavior depending on how I launch it. Hoping someone with SLURM experience can help.

Setup:

  • Flower [1.28.0], Ray [2.51.1], Python 3.12, conda environment

  • Plain FedAvg, 5 clients, small PyTorch MLP on a tabular dataset

  • Simulation backend (Ray), flwr run . --stream

What works:

  • ssh directly onto a compute node, activate my conda env, flwr run . --stream → completes all 3 rounds in ~2 min.

  • Also works on the login node

What fails:

  • Inside an interactive srun --pty session, and also when submitted with sbatch, the run reaches the following point and then stops:

[ROUND 1]
configure_fit: strategy sampled 5 clients (out of 5)

No error message is printed.

Contents of the sbatch file:

#!/bin/bash
#SBATCH --job-name=fl_run
#SBATCH --cpus-per-task=32
#SBATCH --mem=64G
#SBATCH --partition=long
#SBATCH --time=04:00:00
#SBATCH --output=logs/fl_%j.out
#SBATCH --account=2001786

# Activate your environment
spack load miniconda3
source activate Masterthesis

export RAY_DISABLE_RUNTIME_ENV_FILE_DEPTH_LIMIT=1

flwr federation simulation-config --num-supernodes 5 --client-resources-num-cpus 5 --client-resources-num-gpus 0.0

# Run
flwr run . --stream

echo ">>> Done. DB saved as mlflow_${SLURM_JOB_ID}.db"

Output:

✅ Updated simulation configuration.
🎊 Successfully started run 2901953363831888993
INFO : Starting logstream for run_id 2901953363831888993
INFO : Starting Flower Simulation
DEBUG:flwr:Initialising: RayBackend
DEBUG:flwr:Backend config: {'client_resources': {'num_cpus': 5, 'num_gpus': 0.0}, 'init_args': {'logging_level': 'WARNING', 'log_to_driver': True}, 'actor': {'tensorflow': 0}}
INFO : Starting Flower ServerApp, config: num_rounds=3, no round_timeout
INFO:flwr:Starting Flower ServerApp, config: num_rounds=3, no round_timeout
INFO :
INFO:flwr:
INFO : [INIT]
INFO:flwr:[INIT]
INFO : Using initial global parameters provided by strategy
INFO:flwr:Using initial global parameters provided by strategy
INFO : Starting evaluation of initial global parameters
INFO:flwr:Starting evaluation of initial global parameters
INFO : Evaluation returned no results (None)
INFO:flwr:Evaluation returned no results (None)
INFO :
INFO:flwr:
INFO : [ROUND 1]
INFO:flwr:[ROUND 1]
INFO : configure_fit: strategy sampled 5 clients (out of 5)
INFO:flwr:configure_fit: strategy sampled 5 clients (out of 5)
Done. DB saved as mlflow_745334.db

Hi @mariza, thanks for the question.
I would say this is almost certainly a Ray and SLURM launch problem, not a FedAvg problem.

Look at this part of the output:
configure_fit: strategy sampled 5 clients (out of 5)
Done. DB saved ...

Flower should next log something like aggregate_fit: received..., and since it does not, the run is most likely dying while dispatching work to the Ray ClientApp actors. Your script then prints Done regardless of whether flwr run failed, because nothing in it checks the exit status.

I’d first make the script fail as soon as flwr run fails, with:

set -euo pipefail
flwr run . --stream
echo ">>> Done"

and then check the SLURM exit code and the Ray logs.
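To make that check concrete, here is a minimal sketch of how the exit status and the Ray logs could be captured from inside the job. The helper names run_and_report and collect_ray_logs are hypothetical, and the path /tmp/ray/session_latest/logs assumes Ray is using its default temp directory on the compute node; adjust it if your cluster redirects temp files elsewhere.

```shell
#!/bin/bash
# Sketch only: run_and_report and collect_ray_logs are hypothetical helper
# names, and /tmp/ray is assumed to be Ray's default temp directory.

# Run a command, print its exit status, and return that same status.
run_and_report() {
    local status=0
    "$@" || status=$?
    echo ">>> '$*' exited with status ${status}"
    return "${status}"
}

# Copy Ray's session logs from the node-local temp dir into the job's
# log directory so they remain readable after the job ends.
collect_ray_logs() {
    local src="${1:-/tmp/ray/session_latest/logs}"
    local dest="${2:-logs/ray_${SLURM_JOB_ID:-manual}}"
    if [ -d "${src}" ]; then
        mkdir -p "${dest}"
        cp -r "${src}/." "${dest}/"
        echo ">>> Ray logs copied to ${dest}"
    else
        echo ">>> No Ray logs found at ${src}"
    fi
}

# Intended use inside the sbatch script (commented out so the sketch
# can be sourced without launching anything):
#   status=0
#   run_and_report flwr run . --stream || status=$?
#   collect_ray_logs
#   exit "${status}"
```

Exiting with the status of flwr run means SLURM records the job as failed when the simulation fails. Once the job has finished, something like sacct -j 745334 --format=JobID,State,ExitCode should show that state, provided job accounting is enabled on your cluster.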