I am trying to use Flower on a real computer cluster. I realised that with a strategy such as FedAvg, the server waits until it gets feedback from at least a minimum number of clients (min_fit_clients, min_evaluate_clients, min_available_clients, …).
This means the server keeps waiting even if all clients have crashed (no response).
Right now I stop the server by killing it manually. I wrote a script that periodically checks the log file output and kills the server/clients on failure, but it is a bit annoying to use.
I looked at the topic about “drop_out”, but it does not seem to solve my issue. Is there any “auto detect and kill task” feature for this case? For example, setting a time limit on how long the server waits?
Hello @huongdm, welcome to Flower Discuss! Apologies for the late reply.
Can I find out a bit more about your setup? Are you running Flower in simulation mode on the cluster (without spinning up SuperNodes)? In our flwr == 1.14.0 release yesterday, we introduced the flwr stop command, which you can run to terminate a specific run by its run-id. Presently, users need to run the flwr stop command explicitly; nothing triggers it automatically.
One possibility is for you to track your experiments using W&B or TensorBoard, and if any experiment is taking longer than expected, you could ssh to the cluster and execute flwr stop for that run.
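If you would rather automate the "taking longer than expected" check than watch it by hand, a small watchdog around the launch command is one option. The following is only a minimal sketch using the Python standard library, not a Flower feature: the helper name run_with_time_limit, the limit values, and the placeholder command are all assumptions for illustration, and you would substitute whatever command starts your server or run.

```python
import subprocess
import sys


def run_with_time_limit(cmd, limit_s, grace_s=5.0):
    """Run cmd and return its exit code, or None if it exceeded limit_s.

    Hypothetical helper for illustration; not part of Flower.
    """
    proc = subprocess.Popen(cmd)
    try:
        # Normal case: the process finishes within the time limit.
        return proc.wait(timeout=limit_s)
    except subprocess.TimeoutExpired:
        # Ask the process to stop first (SIGTERM on POSIX).
        proc.terminate()
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            # Force-stop it if it ignored the polite request.
            proc.kill()
            proc.wait()
        return None


if __name__ == "__main__":
    # Placeholder standing in for a hung server: sleeps for 60 seconds.
    hung = [sys.executable, "-c", "import time; time.sleep(60)"]
    result = run_with_time_limit(hung, limit_s=1.0)
    print("timed out" if result is None else f"exit code {result}")
```

The limit here is wall-clock time for the whole process, so it needs to be set generously enough to cover a healthy run. It also only stops the process it launched; clients started elsewhere on the cluster would need their own watchdog or scheduler-level time limit.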
Thank you for your reply. I don't use SuperNodes. Actually, it is not a problem with Flower; it comes from how we deploy on our cluster. I found a solution for my case: use MPI parallel processes instead of subprocess to control our experiments, so I can manage them through MPI.