I’m running a Flower simulation on a SLURM HPC cluster and getting inconsistent behavior depending on how I launch it. Hoping someone with SLURM experience can help.
Setup:
- Flower 1.28.0, Ray 2.51.1, Python 3.12, conda environment
- Plain FedAvg, 5 clients, small PyTorch MLP on a tabular dataset
- Simulation backend (Ray), launched with `flwr run . --stream`
What works:
- ssh directly onto a compute node, activate my conda env, `flwr run . --stream` → completes all 3 rounds in ~2 min.
- Also works on the login node
What fails:
- Inside an interactive `srun --pty` session, and also inside an sbatch allocation, the run reaches:
  [ROUND 1]
  configure_fit: strategy sampled 5 clients (out of 5)
  and then stops, with no error message.
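Since the same command works over ssh but not under `srun`/`sbatch`, one way to narrow this down is to compare what the process can see in each launch mode. The following is only a rough sketch: `report_env` is a hypothetical helper name, and the variables it prints are a guess at what might differ, not a known cause.

```shell
#!/bin/bash
# Print facts that often differ between a plain ssh shell and a SLURM job step.
# Run once in each launch mode and compare the outputs.
report_env() {
    echo "nproc=$(nproc)"                  # CPUs usable by this process (respects CPU affinity)
    echo "nproc_all=$(nproc --all)"        # CPUs installed on the node
    echo "SLURM_JOB_ID=${SLURM_JOB_ID:-unset}"
    echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-unset}"
    echo "SLURM_CPUS_ON_NODE=${SLURM_CPUS_ON_NODE:-unset}"
    echo "TMPDIR=${TMPDIR:-unset}"
    echo "python=$(command -v python || echo not-found)"
}

report_env
```

Saving the output from the ssh session and from the `srun` session to two files and running `diff` on them would show whether the usable CPU count or the Python interpreter differs between the two.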
Contents of the sbatch file:
#!/bin/bash
#SBATCH --job-name=fl_run
#SBATCH --cpus-per-task=32
#SBATCH --mem=64G
#SBATCH --partition=long
#SBATCH --time=04:00:00
#SBATCH --output=logs/fl_%j.out
#SBATCH --account=2001786
# Activate your environment
spack load miniconda3
source activate Masterthesis
export RAY_DISABLE_RUNTIME_ENV_FILE_DEPTH_LIMIT=1
flwr federation simulation-config --num-supernodes 5 --client-resources-num-cpus 5 --client-resources-num-gpus 0.0
# Run
flwr run . --stream
echo ">>> Done. DB saved as mlflow_${SLURM_JOB_ID}.db"
Output:
Updated simulation configuration.
Successfully started run 2901953363831888993
INFO : Starting logstream for run_id 2901953363831888993
INFO : Starting Flower Simulation
DEBUG:flwr:Initialising: RayBackend
DEBUG:flwr:Backend config: {'client_resources': {'num_cpus': 5, 'num_gpus': 0.0}, 'init_args': {'logging_level': 'WARNING', 'log_to_driver': True}, 'actor': {'tensorflow': 0}}
INFO : Starting Flower ServerApp, config: num_rounds=3, no round_timeout
INFO:flwr:Starting Flower ServerApp, config: num_rounds=3, no round_timeout
INFO :
INFO:flwr:
INFO : [INIT]
INFO:flwr:[INIT]
INFO : Using initial global parameters provided by strategy
INFO:flwr:Using initial global parameters provided by strategy
INFO : Starting evaluation of initial global parameters
INFO:flwr:Starting evaluation of initial global parameters
INFO : Evaluation returned no results (None)
INFO:flwr:Evaluation returned no results (None)
INFO :
INFO:flwr:
INFO : [ROUND 1]
INFO:flwr:[ROUND 1]
INFO : configure_fit: strategy sampled 5 clients (out of 5)
INFO:flwr:configure_fit: strategy sampled 5 clients (out of 5)
Done. DB saved as mlflow_745334.db
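The last line of the output comes from the final `echo` in the job script, which suggests that `flwr run . --stream` returned control to the script instead of hanging. Recording its exit status might help tell a clean return from a failure. A minimal sketch, assuming a hypothetical wrapper named `run_and_report` (not part of Flower or SLURM):

```shell
#!/bin/bash
# Hypothetical helper: run a command, report its exit status on stderr,
# and return that same status to the caller.
run_and_report() {
    local status=0
    "$@" || status=$?
    echo "command '$*' exited with status ${status}" >&2
    return "${status}"
}

# In the job script this would replace the bare call:
#   run_and_report flwr run . --stream
```

Because the message goes to stderr, it would end up in the job's output file unless `#SBATCH --error` redirects stderr elsewhere.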