Hi!
In flwr 1.18, I noticed that when I launch the server and clients in process isolation mode, a client process keeps running even after training fails. It does not terminate on its own, and the server behaves the same way. I have to run flwr stop manually, and in practice the cleanest workaround has been to restart the processes.
Is there an existing mechanism, or one currently under development, that automatically terminates all components when a failure occurs? For example, if the server receives at least one failure response from any client, it should trigger a coordinated shutdown of all serverapp/clientapp processes and mark the run as stopped.
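To make the proposal concrete, here is a rough sketch of the fail-fast policy I have in mind. It is plain Python, not Flower code: ClientReply, Run, RunStatus, fail_fast and the shutdown hook are all hypothetical names. The hook stands in for whatever would actually stop the serverapp/clientapp processes of the run.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class RunStatus(Enum):
    RUNNING = "running"
    STOPPED = "stopped"


@dataclass
class ClientReply:
    # Hypothetical stand-in for a ClientApp reply (not a Flower class).
    node_id: int
    error: Optional[str] = None  # None means the client succeeded


@dataclass
class Run:
    # Hypothetical run record; in Flower the run status lives elsewhere.
    run_id: int
    status: RunStatus = RunStatus.RUNNING
    stop_reason: Optional[str] = None


def fail_fast(run: Run, replies: List[ClientReply],
              shutdown: Callable[[int], None]) -> bool:
    """Return True if training may continue.

    If at least one reply carries an error, mark the run as stopped,
    call the (hypothetical) shutdown hook once for the whole run,
    and return False.
    """
    if run.status is RunStatus.STOPPED:
        return False  # already shut down; do not trigger the hook twice
    failed = [r for r in replies if r.error is not None]
    if not failed:
        return True
    run.status = RunStatus.STOPPED
    run.stop_reason = "; ".join(f"node {r.node_id}: {r.error}" for r in failed)
    shutdown(run.run_id)  # e.g. stop every serverapp/clientapp process of this run
    return False
```

A single failed reply is enough to stop the run here. Tolerating a configurable number of client failures before shutting down would be an easy variation, if strict fail-fast turns out to be too aggressive.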
Is such behavior already supported when using subprocess isolation mode?
P.S. If this is not already under development, I would like to implement it. Any ideas on how to do it properly and in the most logical way would be appreciated.