Environment:
- Flower
- Ray version: 2.40.0
- Python version: 3.12
- HPC cluster with Slurm, 2 GPU nodes (NVIDIA L40S)
I’m running a simulation with Flower. After a few rounds, I get this error:
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
WARNING: Actor(570fcc632143567f7c821bb801000000) will be remove from pool.
ERROR: Traceback (most recent call last):
File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 94, in _submit_job
out_mssg, updated_context = self.actor_pool.get_client_result(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 398, in get_client_result
return self._fetch_future_result(cid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 288, in _fetch_future_result
raise ex
File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 279, in _fetch_future_result
res_cid, out_mssg, updated_context = ray.get(
^^^^^^^^
File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/ray/_private/worker.py", line 2755, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/ray/_private/worker.py", line 908, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
However, I don't see any OOM kills or obvious Python exceptions in /tmp/ray/session_latest/logs. The HPC environment sometimes kills processes that overuse resources, but memory usage looks fine throughout the run.
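For reference, this is roughly how I scan the session logs for kill/OOM traces (a minimal sketch; the log directory and the search patterns are assumptions based on Ray's default log layout):

```python
# Minimal sketch: scan Ray session logs for OOM/SIGKILL/SIGSEGV traces.
# The log directory and patterns are assumptions; adjust for your cluster.
import pathlib
import re

LOG_DIR = pathlib.Path("/tmp/ray/session_latest/logs")
PATTERN = re.compile(r"oom|out of memory|sigkill|sigsegv|killed", re.IGNORECASE)

for log_file in sorted(LOG_DIR.glob("*")):
    if not log_file.is_file():
        continue
    for lineno, line in enumerate(
        log_file.read_text(errors="ignore").splitlines(), start=1
    ):
        if PATTERN.search(line):
            print(f"{log_file.name}:{lineno}: {line.strip()}")
```

This turns up nothing relevant beyond the ActorDiedError above.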
I tried lowering concurrency (fewer concurrent clients, a larger num_gpus fraction per client) to avoid resource contention, but the error persists.
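Concretely, the change looks roughly like this (a simplified sketch using the legacy start_simulation entry point; MyFlowerClient and the resource numbers are placeholders, not my real values; with flwr run the same limits would go into the backend config instead):

```python
# Simplified sketch of how I limit client concurrency (placeholder values).
import flwr as fl

def client_fn(cid: str) -> fl.client.Client:
    # MyFlowerClient is a placeholder for my real NumPyClient subclass.
    return MyFlowerClient(cid).to_client()

history = fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=10,
    config=fl.server.ServerConfig(num_rounds=50),
    client_resources={
        # A larger num_gpus fraction means fewer actors fit on one GPU,
        # so fewer clients run concurrently.
        "num_cpus": 2,
        "num_gpus": 0.5,
    },
)
```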
Has anyone encountered Ray actors dying with connection error code 2 on an HPC cluster when using Flower? Any pointers would be greatly appreciated. Thank you!