ClientApp with multi-GPU training from PyTorch Lightning

Hello
I'm trying to run training with one client that uses two or more GPUs with the DDP strategy.

I have an issue where the DDP strategy interferes with the ClientApp: it spawns/forks additional processes, which re-enter the ClientApp runtime and interfere with the gRPC ports/TLS flags.

The only solution I've found so far is to move the training logic out of the client into a separate script and launch that script from inside the ClientApp with torchrun.
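Roughly, this is what the wrapper looks like (a sketch, not my exact code; `train_ddp.py` is a placeholder name for the script that holds the Lightning Trainer logic):

```python
import subprocess


def build_torchrun_cmd(num_gpus: int, script: str = "train_ddp.py") -> list[str]:
    # Build a torchrun invocation that spawns one DDP worker per local GPU.
    # --standalone uses a single-node rendezvous, so no external store is needed.
    return [
        "torchrun",
        "--standalone",
        f"--nproc_per_node={num_gpus}",
        script,
    ]


def run_training(num_gpus: int) -> None:
    # Block inside the ClientApp's fit() until the subprocess finishes,
    # then read results (checkpoint, metrics) back from disk.
    # Because DDP lives entirely in the child processes, it never touches
    # the ClientApp's own gRPC connection.
    subprocess.run(build_torchrun_cmd(num_gpus), check=True)
```

The key point is that the parent ClientApp process stays single-process and just waits on the child, exchanging weights via files (or any other IPC you prefer).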

Anybody else encountered this issue? What other solutions do I have?


Hi @paul75, the solution you propose, launching a subprocess with torchrun and waiting until it's completed, is probably the easiest approach right now. But I'm also curious to hear if others have attempted something else.


Another issue I encountered is how to run the ClientApp with multi-node GPUs.

At the moment I don't have a solution.
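One direction I'm considering (untested, so treat it as a sketch) is extending the same torchrun wrapper with its multi-node flags; the flags below are standard torchrun options, but `train_ddp.py`, the rendezvous port, and how each node learns its rank are all assumptions that would need to be coordinated outside Flower:

```python
def build_multinode_cmd(
    nnodes: int,
    node_rank: int,
    master_addr: str,
    nproc_per_node: int,
    script: str = "train_ddp.py",
) -> list[str]:
    # Each participating node runs this same command with its own node_rank;
    # all nodes rendezvous at master_addr via the c10d backend.
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={nproc_per_node}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={master_addr}:29400",  # 29400 is an arbitrary port choice
        script,
    ]
```

The open question for me is where the ClientApp should run (presumably only on node 0) and how it triggers the launch on the other nodes.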

I'm also following this thread, as this is something I want to implement as well. @javier I will try your suggestion!