metrics_distributed always stays empty {}

*This question was migrated from GitHub Discussions.*

Original Question:

I am experimenting for the first time with a federated ML library and noticed what is probably a bug in version 1.0.0 of Flower, related to the output of metrics_distributed at the end of a federation run.

Right now, this is all you see at the end of the log:
INFO flower 2022-07-29 23:49:51,942 | app.py:181 | app_fit: metrics_distributed {}

This happens even though you may well be returning a distributed metric from the FlowerClient.evaluate function.

When implementing a FlowerClient based on fl.client.NumPyClient, the evaluate method has to return a tuple with 3 values, namely Tuple[float, int, Dict[str, Scalar]].
As suggested in the current Flower with PyTorch tutorial, one can return the loss and the accuracy from the evaluate function:
return float(loss), len(self.valloader), {"accuracy": float(accuracy)}
This does match the fl.client.NumPyClient interface.
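
For context, here is a minimal sketch of such a client, following the tutorial's pattern. The set_parameters and test helpers, as well as the net and valloader attributes, are hypothetical stand-ins modeled on the tutorial, not verbatim code:

from typing import Dict, List, Tuple

import flwr as fl
import numpy as np
from flwr.common import Scalar


class FlowerClient(fl.client.NumPyClient):
    # Minimal sketch following the tutorial's pattern; set_parameters and test
    # are hypothetical stand-ins for the tutorial's helper functions.
    def __init__(self, net, valloader) -> None:
        self.net = net
        self.valloader = valloader

    def evaluate(
        self, parameters: List[np.ndarray], config: Dict[str, Scalar]
    ) -> Tuple[float, int, Dict[str, Scalar]]:
        set_parameters(self.net, parameters)  # hypothetical helper: load weights into the model
        loss, accuracy = test(self.net, self.valloader)  # hypothetical helper: evaluate the model
        # 1st element: loss (float), 2nd: number of evaluation examples (int), 3rd: metrics (dict)
        return float(loss), len(self.valloader), {"accuracy": float(accuracy)}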

Now, let’s look at the Flower server code in server/server.py.
In every federation round, the centralized loss and metrics are determined first, followed by the client-side evaluation:

# Evaluate model on a sample of available clients
res_fed = self.evaluate_round(server_round=current_round, timeout=timeout)
if res_fed:
    loss_fed, evaluate_metrics_fed, _ = res_fed
    if loss_fed:
        history.add_loss_distributed(
            server_round=current_round, loss=loss_fed
        )
        history.add_metrics_distributed(
            server_round=current_round, metrics=evaluate_metrics_fed
        )

So, the evaluate function in the FlowerClient is invoked, and the 2nd element of the result is assumed to be the metrics dict. However, it is just a scalar (remember, the evaluate return type is Tuple[float, int, Dict[str, Scalar]]). The 3rd element of res_fed, which would be the actual dictionary of metrics containing the accuracy, is discarded via the _ placeholder.
As a result, add_metrics_distributed in history.py will not add any metric to metrics_distributed, since it expects a dictionary as its 2nd argument, i.e. server_round: int, metrics: Dict[str, Scalar].
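
To make this concrete, here is my reading of the relevant part of history.py (a sketch, not a verbatim copy): the method iterates over the keys of the metrics argument, so unless a populated dictionary arrives, nothing is ever appended to metrics_distributed.

from typing import Dict, List, Tuple

from flwr.common import Scalar


class History:
    # Sketch of the relevant part of flwr.server.history.History (my reading, not verbatim).
    def __init__(self) -> None:
        self.metrics_distributed: Dict[str, List[Tuple[int, Scalar]]] = {}

    def add_metrics_distributed(self, server_round: int, metrics: Dict[str, Scalar]) -> None:
        # Iterates over the keys of `metrics`: an empty dict adds nothing,
        # and metrics_distributed stays {}.
        for key in metrics:
            self.metrics_distributed.setdefault(key, []).append((server_round, metrics[key]))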

If you change the return value of your Flower client’s evaluate function to return the metrics dictionary as the 2nd element instead, you get the following error, because the return value no longer matches the expected signature:

    raise Exception(EXCEPTION_MESSAGE_WRONG_RETURN_TYPE_EVALUATE)
Exception: 
NumPyClient.evaluate did not return a tuple with 3 elements.
The returned values should have the following type signature:

    Tuple[float, int, Dict[str, Scalar]]

Example
-------

    0.5, 10, {"accuracy": 0.95}
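
Presumably, the NumPyClient wrapper validates the shape of the returned tuple before forwarding it, along these lines (a sketch of my reading of numpy_client.py, not verbatim):

def _check_evaluate_result(results) -> None:
    # Sketch: rejects anything that is not (float, int, dict), which is why
    # moving the metrics dict to the 2nd position raises the exception above.
    if not (
        len(results) == 3
        and isinstance(results[0], float)  # loss
        and isinstance(results[1], int)  # number of evaluation examples
        and isinstance(results[2], dict)  # metrics
    ):
        raise Exception(EXCEPTION_MESSAGE_WRONG_RETURN_TYPE_EVALUATE)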

In my opinion, either server.py or numpy_client.py has to be adjusted to support the collection of distributed metrics.
But right now, you always get an empty dict.
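
As an aside, if I read the 1.0.0 strategy API correctly, the built-in strategies accept an evaluate_metrics_aggregation_fn parameter intended to aggregate the per-client metrics dicts. Here is a sketch of a weighted average, in case this is the intended path rather than a change to server.py:

from typing import Dict, List, Tuple

import flwr as fl
from flwr.common import Scalar


def weighted_average(metrics: List[Tuple[int, Dict[str, Scalar]]]) -> Dict[str, Scalar]:
    # Each entry is (num_examples, metrics_dict) from one client's evaluate call;
    # weight each client's accuracy by its number of evaluation examples.
    total_examples = sum(num_examples for num_examples, _ in metrics)
    accuracy = sum(num_examples * m["accuracy"] for num_examples, m in metrics) / total_examples
    return {"accuracy": accuracy}


strategy = fl.server.strategy.FedAvg(evaluate_metrics_aggregation_fn=weighted_average)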

Best regards,
Bernhard

Steps/Code to Reproduce

Install flwr==1.0.0.
Run the Flower 1: Intro to FL with PyTorch tutorial, which uses federated (client-side) evaluation: https://github.com/adap/flower/blob/main/tutorials/Flower-1-Intro-to-FL-PyTorch.ipynb

Expected Results

After Flower has finished the federation run, it should output the distributed metrics from the client-side evaluation.

For example, if your Flower client’s evaluate function returns the loss and the accuracy, as below, the accuracy should be logged after the Flower run:
return float(loss), len(self.valloader), {"accuracy": float(accuracy)}

Actual Results

Right now, you cannot get any metrics_distributed; it is always an empty dict:

INFO flower 2022-07-29 23:49:51,942 | server.py:144 | FL finished in 50.397818416999996
INFO flower 2022-07-29 23:49:51,942 | app.py:180 | app_fit: losses_distributed [(1, 1.9442849159240723), (2, 1.5206300020217896), (3, 1.4368793964385986), (4, 1.3635241985321045), (5, 1.311201572418213)]
INFO flower 2022-07-29 23:49:51,942 | app.py:181 | app_fit: metrics_distributed {}
INFO flower 2022-07-29 23:49:51,942 | app.py:182 | app_fit: losses_centralized [(0, 2.305103033114546), (1, 1.9442848614610422), (2, 1.5206300107815776), (3, 1.4368794017706434), (4, 1.3635242449970673), (5, 1.3112015478527204)]
INFO flower 2022-07-29 23:49:51,942 | app.py:183 | app_fit: metrics_centralized {'accuracy': [(0, 0.1011), (1, 0.3091), (2, 0.4434), (3, 0.4785), (4, 0.5115), (5, 0.5285)]}