Model aggregation before all clients have finished the round

Hi
I am simulating 5 clients, and I have set up their resources like this so that only one client runs at a time, which lets me max out the batch size:

[tool.flwr.federations.local-sim-gpu]
options.num-supernodes = 5
options.backend.client-resources.num-cpus = 4
options.backend.client-resources.num-gpus = 1.0
options.backend.init_args.num_cpus = 4 # Only expose 4 CPU to the simulation
options.backend.init_args.num_gpus = 1 # Expose a single GPU to the simulation
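To spell out the arithmetic I am relying on: my understanding is that the number of clients that can run at the same time is bounded by how many times the per-client request fits into the exposed resources. The helper below is only a sketch of that reasoning; `max_concurrent_clients` is a name I made up for illustration, not a Flower function, and it assumes both per-client requests are positive.

```python
import math


def max_concurrent_clients(total_cpus, total_gpus, client_cpus, client_gpus):
    """Upper bound on how many clients fit into the exposed resources at once.

    Hypothetical helper: assumes client_cpus and client_gpus are both > 0.
    """
    by_cpu = math.floor(total_cpus / client_cpus)
    by_gpu = math.floor(total_gpus / client_gpus)
    # The scarcer resource decides how many clients can run in parallel.
    return min(by_cpu, by_gpu)


# Values from the configuration above: 4 CPUs and 1 GPU exposed,
# each client asking for 4 CPUs and 1.0 GPU.
print(max_concurrent_clients(total_cpus=4, total_gpus=1, client_cpus=4, client_gpus=1.0))
```

With my settings both bounds come out to one, so the five clients should be processed one after the other within a round.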

I have set all of these parameters in the hope of making sure that a round does not end, and the model is not aggregated, before all clients are done:

    strategy = CustomFedAvg(
        run_config=context.run_config,
        use_wandb=context.run_config["use-wandb"],
        project_name=project_name,
        fraction_fit=1.0,
        fraction_evaluate=1.0,
        min_fit_clients=5,
        min_evaluate_clients=5,
        min_available_clients=5,
        initial_parameters=parameters,
        on_fit_config_fn=on_fit_config,
        accept_failures=False,
        evaluate_fn=gen_evaluate_fn(testloader, device=server_device),
        evaluate_metrics_aggregation_fn=weighted_average,
    )

Now, when running the simulation, the first two clients run all 10 epochs of the first round without problems. But then, before the third client runs, the server begins aggregating and logs that only two results were received along with three failures (my aggregate_fit raises an exception if any failures are detected).

I cannot figure out why the server is not waiting for all clients to finish the round. I would have thought it was some kind of timeout, but round_timeout in the ServerConfig is None.
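Since the failures list is what triggers my exception, one thing I could do is print each entry before raising, to see whether the three clients actually errored out rather than timed out. Below is a minimal sketch of that, assuming each entry in the failures list passed to aggregate_fit is either an exception or a (client, FitRes) tuple; `describe_failures` is a hypothetical helper name of my own, not part of Flower.

```python
def describe_failures(failures):
    """Turn the failures list from aggregate_fit into readable lines.

    Hypothetical debugging helper: assumes each entry is either an
    exception or a (client_proxy, fit_res) tuple.
    """
    lines = []
    for i, failure in enumerate(failures):
        if isinstance(failure, BaseException):
            # The client raised (or its worker died) before returning a result.
            lines.append(f"failure {i}: {type(failure).__name__}: {failure}")
        else:
            # The client returned a result, but it was not counted as a success.
            _client, fit_res = failure
            status = getattr(fit_res, "status", None)
            lines.append(f"failure {i}: client returned status {status!r}")
    return lines


# Example with a made-up exception, just to show the output format.
for line in describe_failures([RuntimeError("example error")]):
    print(line)
```

Calling this at the top of aggregate_fit in CustomFedAvg, before the exception is raised, should at least show what kind of failure the server is seeing.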

Does someone know how to fix this?

Hope to hear from anyone!

// Johan
