Server keeps waiting after all clients have crashed?

Hello,

I am trying to use Flower on a real compute cluster. I have noticed that with a strategy such as FedAvg, the server waits until it gets results from at least a minimum number of clients, as set by the min_*_clients parameters (min_fit_clients, min_evaluate_clients, min_available_clients, …).

This means the server keeps waiting even when all clients have crashed (no response).

Right now I stop the server by killing it manually. I wrote a script that kills the server and clients on failure by periodically checking the output in the log file, but it is a bit annoying to use.
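For reference, the kind of log-checking script described above could look roughly like the sketch below. It is only an illustration and not part of Flower: the function names, the idea of treating an unmodified log file as a sign of a hang, and the polling interval are all assumptions.

```python
import os
import signal
import time
from pathlib import Path
from typing import Optional


def is_stale(log_path: Path, max_idle_seconds: float, now: Optional[float] = None) -> bool:
    """Return True if the log file has not been modified for max_idle_seconds."""
    now = time.time() if now is None else now
    return (now - log_path.stat().st_mtime) > max_idle_seconds


def watch_and_kill(pid: int, log_path: Path, max_idle_seconds: float,
                   poll_seconds: float = 30.0) -> None:
    """Poll the log file and send SIGTERM to pid once the log goes quiet.

    Assumption: a healthy server keeps writing to its log, so a log that
    stops changing is treated as a hang.
    """
    while not is_stale(log_path, max_idle_seconds):
        time.sleep(poll_seconds)
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        # The process already exited on its own; nothing to do.
        pass
```

The staleness check is kept separate from the kill loop so that the threshold can be tried out on an old log file before it is pointed at a real server process.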

I looked at the topic about “drop_out”, but it does not seem to solve my issue. Is there any way to automatically detect this case and kill the task? For example, can I set a time limit on how long the server waits?

Hello @huongdm, welcome to Flower Discuss! Apologies for the late reply.

Can I find out a bit more about your setup? Are you running Flower in simulation mode on the cluster (without spinning up SuperNodes)? In our flwr == 1.14.0 release yesterday, we introduced the flwr stop command, which you can run to terminate a specific run by its run ID. At present, users need to run flwr stop explicitly.

One possibility is for you to track your experiments using W&B or TensorBoard, and if any experiment is taking longer than expected, you could ssh to the cluster and execute flwr stop for that run.
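If you would rather not watch the run by hand, the same idea can be scripted. Below is a minimal sketch, assuming the run ID is passed to flwr stop as a positional argument and that flwr is on the PATH where the script runs. The helper names are made up for illustration, and depending on your setup flwr stop may need extra arguments (such as the app directory or federation name).

```python
import subprocess
import time
from typing import List, Optional


def build_stop_command(run_id: str) -> List[str]:
    """Build the CLI call. Assumption: run ID is a positional argument."""
    return ["flwr", "stop", run_id]


def is_overdue(started_at: float, max_seconds: float, now: Optional[float] = None) -> bool:
    """Return True once the run has been going for longer than max_seconds."""
    now = time.time() if now is None else now
    return (now - started_at) > max_seconds


def stop_if_overdue(run_id: str, started_at: float, max_seconds: float) -> bool:
    """Call `flwr stop` for the run if it is past its time budget."""
    if not is_overdue(started_at, max_seconds):
        return False
    # check=True raises CalledProcessError if the CLI reports a failure.
    subprocess.run(build_stop_command(run_id), check=True)
    return True
```

You could call stop_if_overdue from a cron job or a small polling loop on the cluster, with started_at recorded when the run was submitted.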