Actor dies unexpectedly in Flower simulation on HPC

Environment:

Flower
Ray version: 2.40.0
Python version: 3.12
HPC cluster with Slurm, 2 GPU nodes (L40s)

I’m running a simulation with Flower. After a few rounds, I get this error:

The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
WARNING:   Actor(570fcc632143567f7c821bb801000000) will be remove from pool.
ERROR:     Traceback (most recent call last):
  File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 94, in _submit_job
    out_mssg, updated_context = self.actor_pool.get_client_result(
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 398, in get_client_result
    return self._fetch_future_result(cid)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 288, in _fetch_future_result
    raise ex
  File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 279, in _fetch_future_result
    res_cid, out_mssg, updated_context = ray.get(
                                         ^^^^^^^^
  File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/ray/_private/worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xi.han/.conda/envs/pytorch/lib/python3.12/site-packages/ray/_private/worker.py", line 908, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.

but I don’t see any OOM kills or obvious Python exceptions in /tmp/ray/session_latest/logs. Sometimes the HPC environment kills processes when resources are overused, but memory usage looks fine here.
I tried lowering concurrency (fewer clients, a larger num_gpus fraction per client) to avoid resource conflicts, roughly as in the sketch below, but the error persists.
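For context, this is roughly how I launch the simulation and set per-client resources (a simplified sketch; client_fn, num_clients, the strategy and the resource numbers are illustrative placeholders, not my exact setup):

import flwr as fl

# Simplified sketch: client_fn, num_clients, the strategy and the resource
# numbers below are placeholders, not my real configuration.
def client_fn(cid: str):
    ...  # build and return the Flower client for partition `cid`

fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=10,
    config=fl.server.ServerConfig(num_rounds=20),
    strategy=fl.server.strategy.FedAvg(),
    # A larger num_gpus fraction means fewer client actors share one GPU,
    # which is how I tried to reduce concurrency.
    client_resources={"num_cpus": 4, "num_gpus": 0.5},
)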
Has anyone encountered Ray actors dying with connection error code 2 on an HPC cluster when using Flower?
Any help would be much appreciated, thank you!


Hi @xixilikesunshine, are you being allocated the entire nodes when you submit your job to Slurm, or are these nodes shared with other users (e.g. you use half of a node and someone else uses the other half)?
