metrics_distributed always stays empty {}

This question was migrated from GitHub Discussions.

Original Question:

I am experimenting for the first time with a federated ML library and noticed what is probably a bug in version 1.0.0 of Flower, related to the output of metrics_distributed at the end of a federation run.

Right now, the log just shows this at the end:
INFO flower 2022-07-29 23:49:51,942 | app.py:181 | app_fit: metrics_distributed {}

This happens even though you may be returning a distributed metric from the FlowerClient.evaluate function.

When implementing a FlowerClient according to fl.client.NumPyClient, your evaluate function has to return a tuple with 3 values, namely Tuple[float, int, Dict[str, Scalar]].
As suggested in the current Flower with PyTorch tutorial, one can return the loss and the accuracy in the evaluate function:

    return float(loss), len(self.valloader), {"accuracy": float(accuracy)}

This matches the fl.client.NumPyClient interface.
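For context, here is a minimal sketch of such an evaluate method, assuming the tutorial's PyTorch setup (test, self.net, and self.valloader are placeholders from the tutorial, not part of the Flower API):

    def evaluate(self, parameters, config):
        self.set_parameters(parameters)  # load the latest global weights
        loss, accuracy = test(self.net, self.valloader)  # tutorial eval helper
        # 1st element: loss (float), 2nd: number of examples (int),
        # 3rd: metrics dict (Dict[str, Scalar])
        return float(loss), len(self.valloader), {"accuracy": float(accuracy)}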

Now, let’s look at the Flower server code in server/server.py.
In every federation round, the centralized loss and metrics are determined first, followed by the client-side evaluation.

# Evaluate model on a sample of available clients
res_fed = self.evaluate_round(server_round=current_round, timeout=timeout)
if res_fed:
    loss_fed, evaluate_metrics_fed, _ = res_fed
    if loss_fed:
        history.add_loss_distributed(
            server_round=current_round, loss=loss_fed
        )
        history.add_metrics_distributed(
            server_round=current_round, metrics=evaluate_metrics_fed
        )

So, the evaluate function in the FlowerClient is invoked, and the 2nd element of res_fed is assumed to be the metrics dict; however, it is just a scalar (remember, the evaluate return type is Tuple[float, int, Dict[str, Scalar]]). The 3rd element of res_fed, which would be the actual dictionary of metrics containing the accuracy, is ignored.
As a result, add_metrics_distributed in history.py will not add any metric to metrics_distributed, since it expects a dictionary as its 2nd argument, i.e. add_metrics_distributed(server_round: int, metrics: Dict[str, Scalar]).

If you change the return value of your Flower client’s evaluate function so that the metrics dictionary is the 2nd element, you get the following error instead, because the tuple no longer matches the expected signature.

    raise Exception(EXCEPTION_MESSAGE_WRONG_RETURN_TYPE_EVALUATE)
Exception: 
NumPyClient.evaluate did not return a tuple with 3 elements.
The returned values should have the following type signature:

    Tuple[float, int, Dict[str, Scalar]]

Example
-------

    0.5, 10, {"accuracy": 0.95}

In my opinion, either server.py or numpy_client.py has to be adjusted to support the collection of distributed metrics.
But right now, you always get an empty dict.

Best regards,
Bernhard

Steps/Code to Reproduce

Install flwr==1.0.0.
Try the Flower 1 with PyTorch example, which uses federated (client-side) evaluation: https://github.com/adap/flower/blob/main/tutorials/Flower-1-Intro-to-FL-PyTorch.ipynb

Expected Results

After Flower has finished the federation rounds, it should output the distributed metrics from the client-side evaluation.

For example, if your Flower client’s evaluate function returns the loss and the accuracy, as below, the accuracy should be logged after the run.

    return float(loss), len(self.valloader), {"accuracy": float(accuracy)}

Actual Results

Right now, you cannot get any metrics_distributed; the result is always an empty dict.

INFO flower 2022-07-29 23:49:51,942 | server.py:144 | FL finished in 50.397818416999996
INFO flower 2022-07-29 23:49:51,942 | app.py:180 | app_fit: losses_distributed [(1, 1.9442849159240723), (2, 1.5206300020217896), (3, 1.4368793964385986), (4, 1.3635241985321045), (5, 1.311201572418213)]
INFO flower 2022-07-29 23:49:51,942 | app.py:181 | app_fit: metrics_distributed {}
INFO flower 2022-07-29 23:49:51,942 | app.py:182 | app_fit: losses_centralized [(0, 2.305103033114546), (1, 1.9442848614610422), (2, 1.5206300107815776), (3, 1.4368794017706434), (4, 1.3635242449970673), (5, 1.3112015478527204)]
INFO flower 2022-07-29 23:49:51,942 | app.py:183 | app_fit: metrics_centralized {'accuracy': [(0, 0.1011), (1, 0.3091), (2, 0.4434), (3, 0.4785), (4, 0.5115), (5, 0.5285)]}

Answer:

Hi, thank you for the detailed report!

The behaviour described above is expected.

Let me explain why: the History object only records aggregated metrics, not individual metrics dicts coming from single clients. There are currently four types of metrics recorded (see the attribute sketch after this list):

  • loss centralized: no need to aggregate, just a single value
  • metrics centralized: no need to aggregate, just a single value for each key
  • loss distributed: can be automatically aggregated, strategy knows how
  • metrics distributed: must be aggregated, but can not be done automatically
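
In terms of the History object's attributes (names taken from the log output above), this corresponds to:

    history.losses_centralized   # [(round, loss), ...], single value per round
    history.metrics_centralized  # {"accuracy": [(round, value), ...], ...}
    history.losses_distributed   # [(round, aggregated_loss), ...]
    history.metrics_distributed  # {} unless an aggregation fn is configured (see below)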

Distributed metrics are the “odd one out”: they cannot be automatically aggregated because the strategy cannot know which keys (and value types) to expect. This is why metrics_distributed is empty by default.

How can custom metric dicts be aggregated on the server side?

The built-in strategies all support passing both a fit_metrics_aggregation_fn and an evaluate_metrics_aggregation_fn. The concept is quite easy to understand: Flower calls these functions, hands them the metrics dicts it received from the clients, lets them aggregate those dictionaries, and records the aggregated result in the History. Here’s an example: https://github.com/adap/flower/blob/main/examples/quickstart-pytorch/server.py
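
To illustrate, here is a minimal sketch along the lines of that example, assuming every client returns an "accuracy" key from evaluate; the server then records a weighted average per round:

    from typing import List, Tuple

    import flwr as fl
    from flwr.common import Metrics


    def weighted_average(metrics: List[Tuple[int, Metrics]]) -> Metrics:
        # Each entry is (num_examples, metrics_dict) from one client
        accuracies = [num_examples * m["accuracy"] for num_examples, m in metrics]
        examples = [num_examples for num_examples, _ in metrics]
        # Weight each client's accuracy by its number of evaluation examples
        return {"accuracy": sum(accuracies) / sum(examples)}


    strategy = fl.server.strategy.FedAvg(
        evaluate_metrics_aggregation_fn=weighted_average,
    )

    fl.server.start_server(
        server_address="0.0.0.0:8080",
        config=fl.server.ServerConfig(num_rounds=5),
        strategy=strategy,
    )

With this in place, app_fit: metrics_distributed should show an accuracy entry per round instead of {}.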

PS

Regarding the code in question: evaluate_round returns the aggregated loss and the aggregated metrics dict, which means that at this point in the code aggregation has already happened. The type signature reflects that:

    def evaluate_round(
        self,
        server_round: int,
        timeout: Optional[float],
    ) -> Optional[
        Tuple[Optional[float], Dict[str, Scalar], EvaluateResultsAndFailures]
    ]:
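
For completeness, here is a rough paraphrase (not verbatim library code) of what FedAvg.aggregate_evaluate does with the client metrics before evaluate_round returns; it shows why the dict stays empty unless an evaluate_metrics_aggregation_fn is configured:

    # Inside FedAvg.aggregate_evaluate (paraphrased, not verbatim):
    metrics_aggregated: Dict[str, Scalar] = {}
    if self.evaluate_metrics_aggregation_fn:
        eval_metrics = [(res.num_examples, res.metrics) for _, res in results]
        metrics_aggregated = self.evaluate_metrics_aggregation_fn(eval_metrics)
    return loss_aggregated, metrics_aggregated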