Unable to start federated training on CUDA 11.8 when using YOLOv8 model on NVIDIA A10G GPU

shubham.ecebtech14 · March 12, 2024, 8:52am

I would like to report a strange observation. My federated learning code ran fine on CPU. Today I installed CUDA 11.8 and tried to run the code on the GPU, but it gets stuck and training never starts.
To confirm the issue, I ran it again on CPU only, and it worked fine.
I installed CUDA and cuDNN versions compatible with PyTorch after checking the official PyTorch website, and both installed successfully without any warning or error message. I am using an AWS Windows Server 2022 instance with an NVIDIA A10G GPU. Any other centralized learning code I run on CUDA works fine. But when I run the federated learning code, it fails: training does not start, and the gRPC channel closes after waiting for some time.

javier · March 20, 2024, 8:33pm

Hi @shubham.ecebtech14, could you paste the error you get here? What version of Flower are you using, and what version of Python? Have you tried running the examples/simulation-pytorch example?
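
In case it helps to gather those details in one go, here is a minimal sketch that prints the Python, Flower, and PyTorch versions along with what PyTorch sees of the GPU. It assumes the packages are installed under their usual PyPI names (`flwr`, `torch`, `ultralytics`); the helper names `package_version` and `collect_env_report` are made up for illustration and are not part of Flower.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def collect_env_report():
    """Collect version and GPU details worth pasting into a bug report."""
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
        # Assumed PyPI distribution names; adjust if your install differs.
        "flwr": package_version("flwr"),
        "torch": package_version("torch"),
        "ultralytics": package_version("ultralytics"),
    }
    try:
        import torch
    except (ImportError, OSError):
        # PyTorch is missing or failed to load; GPU status is unknown.
        report["cuda_available"] = None
        return report

    report["cuda_available"] = torch.cuda.is_available()
    # CUDA version PyTorch was built against (None for CPU-only builds).
    report["torch_cuda_build"] = torch.version.cuda
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
        report["cudnn"] = torch.backends.cudnn.version()
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` comes back `True` here but the federated run still hangs, that would suggest the problem is in how the clients are launched rather than in the CUDA install itself, though the actual error output is still needed to say for sure.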
