Hi, thank you for the detailed report!
The behaviour described above is expected.
Let me explain why: the History
object only records aggregated metrics, not the individual metrics dicts coming from single clients. There are currently four types of metrics recorded:
- loss, centralized: no aggregation needed, just a single value
- metrics, centralized: no aggregation needed, just a single value for each key
- loss, distributed: can be aggregated automatically, the strategy knows how
- metrics, distributed: must be aggregated, but cannot be aggregated automatically
Distributed metrics are the “odd one out”: they cannot be aggregated automatically because the strategy cannot know which keys (and value types) to expect. This is why metrics_distributed
is empty by default.
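To make this concrete, here is a minimal sketch (plain Python, no Flower imports; the example values are invented) of what the server might receive from three clients. There is no generic rule that could combine these dicts:

```python
# Each client is free to put any keys and value types in its metrics dict,
# so the server cannot know how to aggregate them without user guidance.
client_results = [
    (10, {"accuracy": 0.91, "f1": 0.88}),    # client A: two float metrics
    (20, {"accuracy": 0.87, "note": "ok"}),  # client B: a string value
    (15, {"top5_accuracy": 0.95}),           # client C: a different key
]

# How should "note" (a string) be aggregated? What about keys that only
# some clients report? Flower cannot decide this for you, which is why
# metrics_distributed stays empty unless you provide an aggregation fn.
```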
How can custom metric dicts be aggregated on the server-side?
The built-in strategies all support passing both a fit_metrics_aggregation_fn
and an evaluate_metrics_aggregation_fn
. The concept is straightforward: Flower calls these functions and hands them the metrics dicts it received from the clients, the functions aggregate those dicts into a single one, and Flower records the aggregated result in the History
. Here’s an example: flower/examples/quickstart-pytorch/server.py (main branch of adap/flower on GitHub)
PS
Regarding the code in question: evaluate_round
returns the aggregated loss and the aggregated metrics dict, which means that at this point in the code the aggregation has already happened. The type signature reflects that:
```python
def evaluate_round(
    self,
    server_round: int,
    timeout: Optional[float],
) -> Optional[
    Tuple[Optional[float], Dict[str, Scalar], EvaluateResultsAndFailures]
]:
```