Simulation works when run directly on a compute node but fails silently under srun and under sbatch?

mariza · June 4, 2026, 2:54pm

I’m running a Flower simulation on a SLURM HPC cluster and getting inconsistent behavior depending on how I launch it. Hoping someone with SLURM experience can help.

Setup:

Flower [1.28.0], Ray [2.51.1], Python 3.12, conda environment
Plain FedAvg, 5 clients, small PyTorch MLP on a tabular dataset
Simulation backend (Ray), flwr run . --stream

What works:

SSH directly onto a compute node, activate my conda env, run flwr run . --stream → completes all 3 rounds in ~2 min.
The same works on the login node.

What fails:

Inside an interactive srun --pty session and inside an sbatch allocation, the run reaches:

[ROUND 1]
configure_fit: strategy sampled 5 clients (out of 5)

and then stops, with no error message.

Contents of the sbatch file:

#!/bin/bash
#SBATCH --job-name=fl_run
#SBATCH --cpus-per-task=32
#SBATCH --mem=64G
#SBATCH --partition=long
#SBATCH --time=04:00:00
#SBATCH --output=logs/fl_%j.out
#SBATCH --account=2001786

# Activate your environment
spack load miniconda3
source activate Masterthesis

export RAY_DISABLE_RUNTIME_ENV_FILE_DEPTH_LIMIT=1

flwr federation simulation-config --num-supernodes 5 --client-resources-num-cpus 5 --client-resources-num-gpus 0.0

# Run
flwr run . --stream

echo ">>> Done. DB saved as mlflow_${SLURM_JOB_ID}.db"

Output:

Updated simulation configuration.
Successfully started run 2901953363831888993
e[92mINFO e[0m: Starting logstream for run_id 2901953363831888993
e[92mINFO e[0m: Starting Flower Simulation
DEBUG:flwr:Initialising: RayBackend
DEBUG:flwr:Backend config: {'client_resources': {'num_cpus': 5, 'num_gpus': 0.0}, 'init_args': {'logging_level': 'WARNING', 'log_to_driver': True}, 'actor': {'tensorflow': 0}}
e[92mINFO e[0m: Starting Flower ServerApp, config: num_rounds=3, no round_timeout
INFO:flwr:Starting Flower ServerApp, config: num_rounds=3, no round_timeout
e[92mINFO e[0m:
INFO:flwr:
e[92mINFO e[0m: [INIT]
INFO:flwr:[INIT]
e[92mINFO e[0m: Using initial global parameters provided by strategy
INFO:flwr:Using initial global parameters provided by strategy
e[92mINFO e[0m: Starting evaluation of initial global parameters
INFO:flwr:Starting evaluation of initial global parameters
e[92mINFO e[0m: Evaluation returned no results (None)
INFO:flwr:Evaluation returned no results (None)
e[92mINFO e[0m:
INFO:flwr:
e[92mINFO e[0m: [ROUND 1]
INFO:flwr:[ROUND 1]
e[92mINFO e[0m: configure_fit: strategy sampled 5 clients (out of 5)
INFO:flwr:configure_fit: strategy sampled 5 clients (out of 5)
Done. DB saved as mlflow_745334.db

mohammad · June 12, 2026, 7:32pm

Hi @mariza, thanks for the question.
I would say this is almost certainly a Ray and SLURM launch problem, not a FedAvg one.

Look at this part of the output:
configure_fit: strategy sampled 5 clients (out of 5)
Done. DB saved ...

Flower should next log something like aggregate_fit: received..., and since it does not, the run is most likely dying while it dispatches work to the Ray ClientApp actors. Your script then prints Done regardless of whether flwr run failed, because nothing in it checks the exit status.

I’d first make the script fail with:

set -euo pipefail
flwr run . --stream
echo ">>> Done"

and then check the SLURM exit code plus Ray logs.
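
A minimal sketch of that check, in case it helps. It assumes Ray writes to its default log location, /tmp/ray/session_latest/logs, on the node where the job ran; the helper names run_and_report and list_ray_errors are made up for illustration and are not part of Flower, Ray, or SLURM.

```shell
#!/bin/bash
# Sketch only: report a command's exit status and list Ray log files
# that mention a failure. Helper names are hypothetical.

run_and_report() {
    # Run the given command, print its exit status, and return that status.
    local status=0
    "$@" || status=$?
    echo ">>> exit status: ${status}"
    return "${status}"
}

list_ray_errors() {
    # List log files that mention an error. The default path is Ray's
    # standard session log directory; pass another directory if your
    # cluster redirects Ray's temp dir (assumption: it does not).
    local log_dir="${1:-/tmp/ray/session_latest/logs}"
    if [ -d "${log_dir}" ]; then
        grep -rilE "error|killed|out of memory" "${log_dir}" \
            || echo "no matching files in ${log_dir}"
    else
        echo "no Ray log directory at ${log_dir}"
    fi
}

# In the batch script, in place of the bare flwr call (left commented
# out here so the sketch can be sourced safely):
# run_and_report flwr run . --stream || { list_ray_errors; exit 1; }
```

After the job ends, sacct -j 745334 --format=JobID,State,ExitCode (using the job ID from your log, and assuming accounting is enabled on your cluster) shows what SLURM recorded. A state such as OUT_OF_MEMORY or TIMEOUT would point at the allocation rather than at Flower. One thing to verify: depending on the conda version, activation scripts can trip over set -u, so it may be safer to put set -euo pipefail after the source activate line.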
