Issue with updating config_train in Flower simulation causing client failures

matskrit · November 26, 2024, 8:15am

I am working on a Flower simulation where I dynamically update the config_train dictionary using a layers_template that is passed through an API. Despite updating config_train successfully, my simulation results in all clients failing during the fit phase, as seen in the terminal logs below. I’m trying to figure out why this is happening.

Here is my code:

`run_simulation_flower` Function

def run_simulation_flower(path_metrics , model_json=None):
    global metrics_file, net, config_train
    metrics_file = path_metrics
    net = create_model_from_config(model_json['layers']).to(DEVICE)
    config_train = update_config_from_template(config_train, model_json)
    server = ServerApp(server_fn=server_fn)
    backend_config = {"client_resources": {"num_cpus": 1, "num_gpus": 0.0}}
    if DEVICE.type == "cuda":
        backend_config = {"client_resources": {"num_cpus": 1, "num_gpus": 1.0}}

    if not os.path.exists(metrics_file):
        with open(metrics_file, "w") as f:
            data = {"train": []}
            json.dump(data, f)
    run_simulation(
            server_app=server,
            client_app=client,
            num_supernodes=4,
            backend_config=backend_config,
        )

`update_config_from_template` Function


def update_config_from_template(config_train: dict, layers_template: dict) -> dict:
    """
    Update the config_train dictionary to include all relevant information from layers_template.
    """
    try:
        # Parámetros generales
        config_train.update({
            "layers": layers_template.get("layers", []),
            "optimizer": layers_template.get("optimizer", {}),
            "loss": layers_template.get("loss", {}),
            "federated_name": layers_template.get("federated", {}).get("name", "FedAvg"),
            "federated": layers_template.get("federated", {}).get("parameters", {}),
            "epochs": layers_template.get("train", {}).get("epochs", config_train.get("epochs", 10)),
            "batch_size": layers_template.get("train", {}).get("batch_size", config_train.get("batch_size", 32)),
            "rounds": layers_template.get("train", {}).get("rounds", config_train.get("num_rounds", 5)),
            "fraction_fit": layers_template.get("federated", {}).get("parameters", {}).get("fraction_fit", 1.0),
            "fraction_evaluate": layers_template.get("federated", {}).get("parameters", {}).get("fraction_eval", 0.3),
            "min_fit_clients": layers_template.get("federated", {}).get("parameters", {}).get("min_fit_clients", 2),
            "min_evaluate_clients": layers_template.get("federated", {}).get("parameters", {}).get("min_eval_clients", 2),
            "min_available_clients": layers_template.get("federated", {}).get("parameters", {}).get("min_available_clients", 2),
        })
        print("Updated config_train:", config_train)
        return config_train
    except Exception as e:
        print("Error updating config_train from layers_template:", e)
        return {}

Sample `layers_template`

layers_template = {
    "layers": [
        {"type": "Conv2d", "in_channels": 3, "out_channels": 6, "kernel_size": 5},
        {"type": "ReLU"},
        {"type": "MaxPool2d", "kernel_size": 2, "stride": 2},
        {"type": "Conv2d", "in_channels": 6, "out_channels": 16, "kernel_size": 5},
        {"type": "ReLU"},
        {"type": "MaxPool2d", "kernel_size": 2, "stride": 2},
        {"type": "Flatten"},
        {"type": "Linear", "in_features": 16 * 5 * 5, "out_features": 120},
        {"type": "ReLU"},
        {"type": "Linear", "in_features": 120, "out_features": 84},
        {"type": "ReLU"},
        {"type": "Linear", "in_features": 84, "out_features": 10},
    ],
    "optimizer": {
        "name": "Adam",
        "parameters": {
            "lr": 0.001,
            "betas": (0.9, 0.999),
            "eps": 1e-08,
            "weight_decay": 0,
            "amsgrad": False
        }
    },
    "loss": {"name": "CrossEntropyLoss", "parameters": {}},
    "federated": {
        "name": "test2000",
        "parameters": {
            "fraction_fit": 1.0,
            "fraction_eval": 0.3,
            "min_fit_clients": 4,
            "min_eval_clients": 2,
            "min_available_clients": 4
        }
    },
    "train": {"epochs": 10, "batch_size": 32, "rounds": 5},
    "connections": [{}]
}

Terminal output


11/25/2024 16:41:43:INFO:Received initial parameters from one random client
INFO :      Evaluating initial global parameters
11/25/2024 16:41:43:INFO:Evaluating initial global parameters
INFO :
11/25/2024 16:41:43:INFO:
INFO :      [ROUND 1]
11/25/2024 16:41:43:INFO:[ROUND 1]
INFO :      configure_fit: strategy sampled 4 clients (out of 4)
11/25/2024 16:41:43:INFO:configure_fit: strategy sampled 4 clients (out of 4)
INFO :      aggregate_fit: received 0 results and 4 failures
11/25/2024 16:41:43:INFO:aggregate_fit: received 0 results and 4 failures

What could be causing the client failures during the fit phase? Is there an issue with how config_train is updated or passed to the run_simulation function?

Any insights or suggestions would be appreciated. Thank you!

javier · November 30, 2024, 8:00am

Hi @matskrit, could you provide more details on how config_train is actually being used? From your run_simulation_flower() function it seems it’s not used after calling the update_config_from_template function?

Also, please note that the recommended way of running simulations is not run_simulation() but launching them from the command line with flwr run. In addition, it is recommended having the code in a file structure like that present in most of the examples in the repo (e.g. examples/quickstart-pytorch, or it’s more advanced version examples/advanced-pytorch). Naturally you can add more files as needed.

The run_simulation is mostly designed for environments where access to the CLI isn’t possible. For example, in a Jupyter Notebook or Google Colab setup. As a result, it’s support for simulations is restricted.

matskrit · December 2, 2024, 9:02am

Hi i could solve the problem. The problem was the configt fit… i didn’t read that the dict only use simple types.

Thanks for your help.

Topic		Replies	Views
Simulating Client Failure Flower Help - Beginners simulation	2	120	November 18, 2024
Integrating Flower in Discrete Event Simulation Flower Help - Beginners	0	16	November 13, 2024
Launching multiple clients in simulation environnement Flower Help - Beginners	1	140	November 3, 2024
How to avoid Flower Next from destroying my model on every fit and every evaluate Flower Help - Beginners	5	118	December 21, 2024
Debugging Flower Flower Help - Beginners flower	3	162	January 8, 2025

Issue with updating config_train in Flower simulation causing client failures

run_simulation_flower Function

update_config_from_template Function

Sample layers_template

Related topics

`run_simulation_flower` Function

`update_config_from_template` Function

Sample `layers_template`