Hello
I'm trying to run training with one client that uses two or more GPUs via the DDP strategy.
The issue is that DDP interferes with the ClientApp: it spawns/forks additional processes, which re-enter the ClientApp runtime and clash with the gRPC ports/TLS flags.
The only solution I've found so far is to move the training logic out of the client into a separate script and run it from inside the ClientApp with torchrun.
Has anybody else encountered this issue? What other options do I have?
Hi @paul75, the solution you propose, launching a subprocess with torchrun and waiting until it completes, is probably the easiest approach right now. But I'm also curious to hear if others have attempted something else.
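For anyone landing here, a minimal sketch of that pattern. The function names, the `train.py` entry point, and its arguments are all hypothetical placeholders; the key ideas are that `--standalone` gives torchrun its own single-node rendezvous (so it doesn't collide with the ClientApp's gRPC ports) and that `subprocess.run` blocks until every DDP worker has exited:

```python
import subprocess


def build_torchrun_cmd(script, nproc, extra_args=None):
    """Build the torchrun command line for a single-node multi-GPU run.

    `script`, `nproc`, and `extra_args` are placeholders; point them at
    your own training entry point.
    """
    return [
        "torchrun",
        f"--nproc_per_node={nproc}",
        "--standalone",  # single-node rendezvous, no master addr/port setup
        script,
    ] + (extra_args or [])


def run_ddp_training(script="train.py", nproc=2, extra_args=None):
    """Launch training in a subprocess and block until it finishes.

    Because this runs in a fresh process tree, the DDP workers never
    re-enter the ClientApp runtime. Raises CalledProcessError on failure.
    """
    cmd = build_torchrun_cmd(script, nproc, extra_args)
    result = subprocess.run(cmd, check=True)
    return result.returncode
```

Since the subprocess can't return tensors directly, one common approach is to have `train.py` save the updated `state_dict` to a file (path passed via `extra_args`) and have the ClientApp load it back after `run_ddp_training` returns.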