Launching multiple clients in the simulation environment

Hi,

I’ve been trying out the FedAvg strategy with the pytorchexample sample. I see a fraction_fit parameter that changes the fraction of clients used during training. But when I print the partition_id in client_app.py, inside client_fn(context: Context), only one partition_id is printed, even if I ask for ten or more clients.

I wondered whether this is a workaround in the Flower simulation environment, but if I print the number of times the fit function is called in each round, it is also only once.

Is this expected behavior? Am I missing something?

Best regards,
Mr. Sunshine.

TLDR: To reproduce my problem, I ran flwr new, created a new PyTorch sample, and added print("Launching client : " + str(partition_id)) just before returning FlowerClient(trainloader, valloader, local_epochs, learning_rate).to_client()

Update:

I’ve been trying out a few solutions. At first, I wondered whether the problem was related to using the print function in the Flower simulation environment, so I replaced print with the official logger: from flwr.common.logger import log.

I added log(INFO, f"Client {self.cid} is doing fit()") and a cid attribute to my FlowerClient(NumPyClient) (keep in mind I am still using the basic flwr new sample).

And I see something odd in the logs. Below is the output of a single simulation:


Success
INFO :      Starting Flower ServerApp, config: num_rounds=5, no round_timeout
INFO :      
INFO :      [INIT]
INFO :      Using initial global parameters provided by strategy
INFO :      Starting evaluation of initial global parameters
INFO :      Evaluation returned no results (`None`)
INFO :      
INFO :      [ROUND 1]
INFO :      configure_fit: strategy sampled 20 clients (out of 100)
(ClientAppActor pid=31520) C:\Users\thoma\miniconda3\Lib\site-packages\datasets\utils\_dill.py:379: DeprecationWarning: co_lnotab is deprecated, use co_lines instead.
(ClientAppActor pid=31520)   obj.co_lnotab,  # for < python 3.10 [not counted in args]
(ClientAppActor pid=31520) INFO :      Client 7 is doing fit()
(ClientAppActor pid=34488) C:\Users\thoma\miniconda3\Lib\site-packages\datasets\utils\_dill.py:379: DeprecationWarning: co_lnotab is deprecated, use co_lines instead. [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ClientAppActor pid=34488)   obj.co_lnotab,  # for < python 3.10 [not counted in args] [repeated 7x across cluster]
(ClientAppActor pid=35644) INFO :      Client 98 is doing fit() [repeated 19x across cluster]
INFO :      aggregate_fit: received 20 results and 0 failures
WARNING :   No fit_metrics_aggregation_fn provided
INFO :      configure_evaluate: strategy sampled 100 clients (out of 100)
INFO :      aggregate_evaluate: received 100 results and 0 failures
accuracy : 0.1518
INFO :
INFO :      [ROUND 2]
INFO :      configure_fit: strategy sampled 20 clients (out of 100)
(ClientAppActor pid=34488) INFO :      Client 23 is doing fit()
(ClientAppActor pid=35104) INFO :      Client 18 is doing fit()
(ClientAppActor pid=35104) INFO :      Client 94 is doing fit() [repeated 17x across cluster]
INFO :      aggregate_fit: received 20 results and 0 failures
INFO :      configure_evaluate: strategy sampled 100 clients (out of 100)
INFO :      aggregate_evaluate: received 100 results and 0 failures
accuracy : 0.1838
INFO :
INFO :      [ROUND 3]
INFO :      configure_fit: strategy sampled 20 clients (out of 100)
(ClientAppActor pid=34488) INFO :      Client 18 is doing fit() [repeated 2x across cluster]
(ClientAppActor pid=35104) INFO :      Client 87 is doing fit() [repeated 18x across cluster]
INFO :      aggregate_fit: received 20 results and 0 failures
INFO :      configure_evaluate: strategy sampled 100 clients (out of 100)
INFO :      aggregate_evaluate: received 100 results and 0 failures
accuracy : 0.26336
INFO :
INFO :      [ROUND 4]
INFO :      configure_fit: strategy sampled 20 clients (out of 100)
(ClientAppActor pid=34488) INFO :      Client 29 is doing fit() [repeated 2x across cluster]
(ClientAppActor pid=37928) INFO :      Client 92 is doing fit() [repeated 18x across cluster]
INFO :      aggregate_fit: received 20 results and 0 failures
INFO :      configure_evaluate: strategy sampled 100 clients (out of 100)
INFO :      aggregate_evaluate: received 100 results and 0 failures
accuracy : 0.3157
INFO :
INFO :      [ROUND 5]
INFO :      configure_fit: strategy sampled 20 clients (out of 100)
(ClientAppActor pid=34488) INFO :      Client 34 is doing fit() [repeated 2x across cluster]
(ClientAppActor pid=37928) INFO :      Client 99 is doing fit() [repeated 19x across cluster]
INFO :      aggregate_fit: received 20 results and 0 failures
INFO :      configure_evaluate: strategy sampled 100 clients (out of 100)
INFO :      aggregate_evaluate: received 100 results and 0 failures
accuracy : 0.37408
INFO :
INFO :      [SUMMARY]
INFO :      Run finished 5 round(s) in 88.34s
INFO :          History (loss, distributed):
INFO :                  round 1: 2.280168843911468
INFO :                  round 2: 2.1135510445972994
INFO :                  round 3: 1.9379437228041336
INFO :                  round 4: 1.787573513222865
INFO :                  round 5: 1.6535548150626296
INFO :          History (metrics, distributed, evaluate):
INFO :          {'accuracy': [(1, 0.1518),
INFO :                        (2, 0.1838),
INFO :                        (3, 0.26336),
INFO :                        (4, 0.3157),
INFO :                        (5, 0.37408)]}
INFO :

If I change fraction_fit to 0.1 (for 10 clients), I see a single client line each round:

Success
INFO :      Starting Flower ServerApp, config: num_rounds=5, no round_timeout
INFO :      
INFO :      [INIT]
INFO :      Using initial global parameters provided by strategy
INFO :      Starting evaluation of initial global parameters
INFO :      Evaluation returned no results (`None`)
INFO :      
INFO :      [ROUND 1]
INFO :      configure_fit: strategy sampled 10 clients (out of 100)
(ClientAppActor pid=35456) C:\Users\thoma\miniconda3\Lib\site-packages\datasets\utils\_dill.py:379: DeprecationWarning: co_lnotab is deprecated, use co_lines instead.
(ClientAppActor pid=35456)   obj.co_lnotab,  # for < python 3.10 [not counted in args]
(ClientAppActor pid=35456) INFO :      Client 81 is doing fit()
INFO :      aggregate_fit: received 10 results and 0 failures
WARNING :   No fit_metrics_aggregation_fn provided
INFO :      configure_evaluate: strategy sampled 100 clients (out of 100)
INFO :      aggregate_evaluate: received 100 results and 0 failures
accuracy : 0.13336
INFO :
INFO :      [ROUND 2]
INFO :      configure_fit: strategy sampled 10 clients (out of 100)
(ClientAppActor pid=33152) C:\Users\thoma\miniconda3\Lib\site-packages\datasets\utils\_dill.py:379: DeprecationWarning: co_lnotab is deprecated, use co_lines instead. [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ClientAppActor pid=33152)   obj.co_lnotab,  # for < python 3.10 [not counted in args] [repeated 7x across cluster]
(ClientAppActor pid=14188) INFO :      Client 82 is doing fit() [repeated 10x across cluster]
INFO :      aggregate_fit: received 10 results and 0 failures
INFO :      configure_evaluate: strategy sampled 100 clients (out of 100)
INFO :      aggregate_evaluate: received 100 results and 0 failures
accuracy : 0.19632
INFO :
INFO :      [ROUND 3]
INFO :      configure_fit: strategy sampled 10 clients (out of 100)
(ClientAppActor pid=14188) INFO :      Client 75 is doing fit() [repeated 10x across cluster]
INFO :      aggregate_fit: received 10 results and 0 failures
INFO :      configure_evaluate: strategy sampled 100 clients (out of 100)
INFO :      aggregate_evaluate: received 100 results and 0 failures
accuracy : 0.26988
INFO :
INFO :      [ROUND 4]
INFO :      configure_fit: strategy sampled 10 clients (out of 100)
(ClientAppActor pid=14188) INFO :      Client 23 is doing fit() [repeated 10x across cluster]
INFO :      aggregate_fit: received 10 results and 0 failures
INFO :      configure_evaluate: strategy sampled 100 clients (out of 100)
INFO :      aggregate_evaluate: received 100 results and 0 failures
accuracy : 0.29572
INFO :
INFO :      [ROUND 5]
INFO :      configure_fit: strategy sampled 10 clients (out of 100)
(ClientAppActor pid=14188) INFO :      Client 73 is doing fit() [repeated 10x across cluster]
INFO :      aggregate_fit: received 10 results and 0 failures
INFO :      configure_evaluate: strategy sampled 100 clients (out of 100)
INFO :      aggregate_evaluate: received 100 results and 0 failures
accuracy : 0.30388
INFO :
INFO :      [SUMMARY]
INFO :      Run finished 5 round(s) in 70.18s
INFO :          History (loss, distributed):
INFO :                  round 1: 2.3023559468114967
INFO :                  round 2: 2.143152470584345
INFO :                  round 3: 1.9923679619058787
INFO :                  round 4: 1.8736832554891765
INFO :                  round 5: 1.8385567528231124
INFO :          History (metrics, distributed, evaluate):
INFO :          {'accuracy': [(1, 0.13336),
INFO :                        (2, 0.19632),
INFO :                        (3, 0.26988),
INFO :                        (4, 0.29572),
INFO :                        (5, 0.30388)]}
INFO :

The only thing that changes is the [repeated x across cluster]
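
The Ray message in the logs above says that repeated lines are deduplicated by default and that RAY_DEDUP_LOGS=0 disables this. Here is a minimal sketch of launching the run with that variable set, assuming the project is started with `flwr run` from the project directory. The helper names build_env and run_simulation are hypothetical, and whether the variable actually reaches the Ray workers may depend on how the simulation is launched:

```python
import os
import subprocess


def build_env(base=None):
    """Return a copy of the environment with Ray log deduplication disabled."""
    env = dict(os.environ if base is None else base)
    # Variable name taken from the Ray message printed in the logs above.
    env["RAY_DEDUP_LOGS"] = "0"
    return env


def run_simulation(project_dir="."):
    """Assumption: the app created by `flwr new` is launched with `flwr run <dir>`."""
    return subprocess.run(["flwr", "run", project_dir], env=build_env(), check=False)
```

Calling run_simulation() from the project directory would then start the run; if the variable takes effect, each client's "is doing fit()" line should be printed separately instead of being collapsed into a "[repeated x across cluster]" suffix.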

After looking at the code in flwr/server/server.py, I see that in

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        submitted_fs = {
            executor.submit(fit_client, client_proxy, ins, timeout, group_id)
            for client_proxy, ins in client_instructions
        }
        finished_fs, _ = concurrent.futures.wait(
            fs=submitted_fs,
            timeout=None,  # Handled in the respective communication stack
        )

len(finished_fs) = 10. So I assume that either the logging output is wrong, or the same partition is being repeated x times?

Have a good Sunday.
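
To check what that pattern does on its own, here is a small standalone sketch that uses the same submit/wait structure with a stand-in fit_client (not Flower's real function, just a dummy that returns the id it receives). It shows that one task per client is submitted and that wait() returns one finished future for each:

```python
import concurrent.futures


def fit_client(client_id):
    """Stand-in for Flower's fit_client: returns the id it was called with."""
    return client_id


def run_round(client_ids, max_workers=None):
    """Submit one task per client and wait for all of them, as in the excerpt."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        submitted_fs = {executor.submit(fit_client, cid) for cid in client_ids}
        finished_fs, _ = concurrent.futures.wait(fs=submitted_fs, timeout=None)
    return sorted(f.result() for f in finished_fs)


results = run_round(range(10))
print(len(results), len(set(results)))  # prints "10 10"
```

Ten distinct results here only show that the executor pattern itself does not merge or repeat tasks; it does not prove anything about what Flower passes in client_instructions.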

Final update (good news):

I’ve found that the messages printed by the logging system are inconsistent and do not show everything that was logged. When I log from ten clients, only the last message is shown; if I add more clients, a few more appear. This may be because the logging system does not handle several clients logging at the same time. But when I ran a test in which every sampled client creates a file each round, one file was created per client id, and the number of files matched the number of sampled clients. This test saved my Sunday!
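
For anyone who wants to repeat the file-based check, here is a rough sketch of the idea. The helper names record_fit and clients_seen are hypothetical, and it assumes the client knows the current round number (for example, if it is passed to fit() through the config), which the default sample may not provide:

```python
from pathlib import Path


def record_fit(out_dir, server_round, partition_id):
    """Create an empty marker file for one client in one round."""
    round_dir = Path(out_dir) / f"round_{server_round}"
    round_dir.mkdir(parents=True, exist_ok=True)
    marker = round_dir / f"client_{partition_id}.txt"
    marker.touch()
    return marker


def clients_seen(out_dir, server_round):
    """Return the sorted partition ids that left a marker in the given round."""
    round_dir = Path(out_dir) / f"round_{server_round}"
    return sorted(int(p.stem.split("_")[1]) for p in round_dir.glob("client_*.txt"))
```

Calling record_fit(...) at the start of fit() and comparing len(clients_seen(...)) with the "strategy sampled N clients" line after each round gives a count that does not depend on how the log output is displayed.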