Server automatically shuts down while running Flower on 3 devices

Hi guys,
I am implementing 1 server and 2 clients on 3 different machines, using FL on encrypted data. When I tested my code in a local simulation, it ran perfectly. However, when I deployed it on three different devices, the Flower framework could only run for 5 epochs, and the client reported the following errors:

When I checked the log on the server side, it had no errors. Can someone help me with this issue?


Hi @manhbui, by “could only run in 5 epochs” do you mean your experiment ran for 5 FL rounds and then this error appeared at the client side? Could you share a bit more detail about your setup? Was your code similar to the one in examples/quickstart-pytorch, or maybe even examples/embedded-devices?

If you could share the log generated at the server side, maybe that gives us some hints about what went wrong.


Hi @Javier, yes, I can only run 5 rounds of FL training. When the client receives the train message from the server, it starts training, but after about 10 minutes the server shuts down and the client's gRPC channel changes to IDLE.

When I hit the error at round 5, I was using the default keepalive ping settings and round_timeout = None. After I changed round_timeout to 3600, it ran for 6 rounds and then hit the same error, but when I increase round_timeout further, it still only runs for 6 rounds.
Note that the machines are running in an IHPC environment (behind the industry Wi-Fi and network).
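For reference, the round timeout discussed here is set on the server via ServerConfig. This is a minimal sketch assuming Flower ≥ 1.0; the address, round count, and timeout value are placeholders, not the poster's actual settings:

```python
import flwr as fl

# Sketch only: server_address and num_rounds are assumptions for illustration
fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(
        num_rounds=10,        # total FL rounds
        round_timeout=7200.0, # seconds per round; None disables the timeout
    ),
)
```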


@Javier, it is quite weird that the log on the server side has no errors at all. From my point of view, something may be shutting the server down automatically.


It seems your rounds take quite a long time to complete (just over an hour for Round 2 to finish, for example). Can you try a faster workload to see whether your setup works fine? Are you using SLURM?

Another question is: how are you running your experiment, with start_server and 2x start_client?


Yes, I have tried another simple setup and the code ran perfectly, but because it is simple, it finishes the learning rounds very fast. That's why I think the issue may be related to the timeout and keepalive ping settings.

Yes, I am running each machine on Ubuntu and launching via start_server and start_client.
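One way to try the faster workload suggested above is a no-op client that skips training entirely, which isolates connectivity from compute time. This is a sketch, assuming Flower ≥ 1.0; DummyClient and SERVER_IP are made up for illustration:

```python
import flwr as fl

class DummyClient(fl.client.NumPyClient):
    """Hypothetical no-op client, useful only for checking connectivity."""
    def get_parameters(self, config):
        return []  # no model parameters
    def fit(self, parameters, config):
        return [], 1, {}  # no training; claim one example so FedAvg can average
    def evaluate(self, parameters, config):
        return 0.0, 1, {}  # dummy loss

# SERVER_IP is a placeholder for the actual server machine's address
fl.client.start_numpy_client(server_address="SERVER_IP:8080", client=DummyClient())
```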


Ok, then it should be safe to increase the timeout to, let's say, 10000 :laughing:? We set it to 3600 since it's not too big, not too small, but it should be adjusted based on the application.


Yeah, I tried increasing the server's round_timeout to 7200 and 10000, but it still cannot run reliably; the server still shuts down in round 6. If I set the timeout to None, it only runs for 5 rounds.


Is there a maximum time limit on the HPC system where you are running these experiments? Some compute clusters impose maximum wall-time limits on jobs submitted to a queue…
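If the cluster does use SLURM, the wall-time limits can be checked from the shell. These are standard SLURM commands, but the job ID below is a placeholder:

```shell
# Maximum wall time per partition ("%l" is the partition's time limit)
sinfo --format="%P %l"

# Time limit of a specific job (12345 is a placeholder job ID)
scontrol show job 12345 | grep -i TimeLimit
```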
