Server automatically shuts down while running Flower on 3 devices

Hi guys,
I am implementing 1 server and 2 clients on 3 different machines, using FL on encrypted data. When I tested my code in a local simulation, it ran perfectly. However, when I deployed it on three different devices, the Flower framework could only run for 5 epochs, and the client reported the following errors:

When I checked the log on the server side, it had no errors. Can someone help me with this issue?


Hi @manhbui, by “could only run in 5 epochs” do you mean your experiment ran for 5 FL rounds and then this error appeared at the client side? Could you share a bit more detail about your setup? Was your code similar to the one in examples/quickstart-pytorch, or maybe even examples/embedded-devices?

If you could share the log generated at the server side, maybe that gives us some hints about what went wrong.


Hi @Javier, yes, I can only run 5 rounds of FL training. When the client receives the train message from the server, it starts training, but after about 10 minutes the server shuts down and the client's gRPC channel changes to IDLE.

When I hit the error at round 5, I was using the default keepalive ping settings and round_timeout = None. After I changed round_timeout to 3600, it ran for 6 rounds and then hit the same error, but when I increase round_timeout further, it still only runs for 6 rounds.
Note that the machines are running in an IHPC environment (behind the industry Wi-Fi and network).
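For reference, the round timeout discussed here is set on the server via ServerConfig. This is a minimal sketch assuming Flower ≥ 1.0; the address, round count, and timeout value are placeholders, not the poster's actual settings:

```python
import flwr as fl

# Sketch only: server_address and num_rounds are assumptions for illustration
fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(
        num_rounds=10,        # total FL rounds
        round_timeout=7200.0, # seconds per round; None disables the timeout
    ),
)
```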


@Javier, it is quite weird that the log on the server side has no errors at all. From my point of view, something may be shutting the server down automatically.


It seems your rounds take quite a long time to complete (just over an hour for Round 2 to finish, for example). Can you try a faster workload to see whether your setup works fine? Are you using SLURM?

Another question is: how are you running your experiment, with start_server and 2x start_client?


Yes, I have tried another simple setup and the code ran perfectly, but because it is simple, it finishes the learning rounds very fast. That's why I think the issue may be related to the timeout and keepalive ping settings.

Yes, I am running each machine on Ubuntu and launching via start_server and start_client.
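One way to try the faster workload suggested above is a no-op client that skips training entirely, which isolates connectivity from compute time. This is a sketch, assuming Flower ≥ 1.0; DummyClient and SERVER_IP are made up for illustration:

```python
import flwr as fl

class DummyClient(fl.client.NumPyClient):
    """Hypothetical no-op client, useful only for checking connectivity."""
    def get_parameters(self, config):
        return []  # no model parameters
    def fit(self, parameters, config):
        return [], 1, {}  # no training; claim one example so FedAvg can average
    def evaluate(self, parameters, config):
        return 0.0, 1, {}  # dummy loss

# SERVER_IP is a placeholder for the actual server machine's address
fl.client.start_numpy_client(server_address="SERVER_IP:8080", client=DummyClient())
```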


Ok, then it should be safe to increase the timeout to, let's say, 10000 :laughing:? We set it to 3600 since it's not too big, not too small, but it should be adjusted based on the application.


Yeah, I tried increasing the server's round_timeout to 7200 and 10000, but it still cannot run reliably; the server still shuts down in round 6. If I set the timeout to None, it only runs for 5 rounds.


Is there a maximum time limit on the HPC system where you are running these experiments? Some compute clusters impose maximum wall-time limits on jobs submitted to a queue…
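If the cluster does use SLURM, the wall-time limits can be checked from the shell. These are standard SLURM commands, but the job ID below is a placeholder:

```shell
# Maximum wall time per partition ("%l" is the partition's time limit)
sinfo --format="%P %l"

# Time limit of a specific job (12345 is a placeholder job ID)
scontrol show job 12345 | grep -i TimeLimit
```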
