ClientApp with multi-GPU training from PyTorch Lightning

Hello,
I'm trying to run a training job with 1 client that uses 2 or more GPUs with the DDP strategy.

I've run into an issue where the DDP strategy interferes with the ClientApp: it spawns/forks additional processes which re-enter the ClientApp runtime and clash with the gRPC ports/TLS flags.

The only solution I've found so far is to move the training logic out of the client into a separate script and launch it from inside the ClientApp with `torchrun`.
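A minimal sketch of that workaround, assuming a hypothetical standalone `train.py` that exchanges weights through checkpoint files (the script name and `--in`/`--out` arguments are placeholders, not Flower or Lightning APIs):

```python
# Sketch: instead of calling Trainer.fit() inside the ClientApp process
# (where DDP's process spawning clashes with the gRPC runtime), write the
# current weights to disk, launch the training script in a fresh process
# group via torchrun, wait until it completes, then read the updated
# weights back inside the ClientApp.
import subprocess
from pathlib import Path


def build_torchrun_cmd(script: str, nproc_per_node: int, *script_args: str) -> list[str]:
    """Build a single-node torchrun command line."""
    return [
        "torchrun",
        "--standalone",  # single node: no external rendezvous needed
        f"--nproc_per_node={nproc_per_node}",
        script,
        *script_args,
    ]


def run_ddp_round(weights_in: Path, weights_out: Path, nproc: int = 2) -> None:
    """Run one round of local DDP training in a subprocess.

    train.py is assumed to load weights_in, run Trainer.fit(...) with
    strategy="ddp" and devices=nproc, and save its state dict to
    weights_out.
    """
    cmd = build_torchrun_cmd(
        "train.py", nproc, f"--in={weights_in}", f"--out={weights_out}"
    )
    subprocess.run(cmd, check=True)  # check=True surfaces training failures
```

The client's `fit` method would then call `run_ddp_round(...)` and load `weights_out` to produce its reply, keeping all DDP process management outside the ClientApp runtime.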

Has anybody else encountered this issue? What other solutions do I have?


Hi @paul75, the solution you propose, launching a `torchrun` subprocess and waiting until it completes, is probably the easiest approach right now. But I'm also curious to hear if others have attempted something else.


Another issue I encountered is how to run the ClientApp with multi-node GPUs.

At the moment I don't have a solution.
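For what it's worth, `torchrun` itself supports multi-node jobs through its c10d rendezvous backend, so the same subprocess workaround could in principle be extended: every node launches the same command pointed at one shared rendezvous endpoint. A hedged sketch only (hostnames, port, job id, and script name are placeholders; I haven't tested this together with a ClientApp):

```python
# Sketch: build a multi-node torchrun command line. Each participating
# node runs this same command; torchrun's c10d rendezvous backend
# coordinates ranks across nodes via the shared endpoint.
def build_multinode_torchrun_cmd(
    script: str, nnodes: int, nproc_per_node: int, rdzv_endpoint: str
) -> list[str]:
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={rdzv_endpoint}",  # e.g. "node0:29400" (placeholder host/port)
        "--rdzv_id=flower-round",  # any id shared by all nodes of the job
        script,
    ]
```

The open question with a ClientApp is who triggers the launch on the secondary nodes, since only one node hosts the client process; that coordination is not covered by this sketch.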