How to return aggregated confusion matrix and classification report


I wanted to ask how to return an aggregated confusion matrix and an aggregated classification report. I see that the evaluate function in my defined Flower Client class expects three return values: (loss: float, num_examples: int, and metrics: Dict[str, Scalar]). This is from the EvaluateRes definition.

This means that it is easy to return scalar values from the evaluate method, such as {'accuracy': acc, 'f1-score': f1, ...}, but it doesn’t work for the classification report from sklearn or for the confusion matrix. For example, with output_dict=True the classification report has the structure {'0': {'precision': 0....., 'recall': 0...., 'f1-score': 0...., 'support': xxxx}, '1': {'precision': xxxxx, ...}}, which isn’t compatible with Dict[str, Scalar].

My idea was to extend the return values of the evaluate function in my Flower Client class from three to four. This would require changes to many of Flower's Python files. For example, my idea was to add a new metrics_custom field to the EvaluateRes definition, as shown below. Of course, you could instead save each value as its own key-value pair, resulting in numerous entries, but I wanted to ask if there is a different way to achieve this.

class EvaluateRes:
    """Evaluate response from a client."""

    status: Status
    loss: float
    num_examples: int
    metrics: Dict[str, Scalar]
    metrics_custom: Dict[str, Dict[str, float]]  # proposed new field

We plan to keep the metrics (here and in the newer versions of it) simple so that we can provide some out-of-the-box aggregation in the future.

I see a few solutions to your current problem:

  1. Flatten the nested dict into a single-level dict
    The conversion could concatenate the keys, e.g. “0_precision”: value, …

  2. Serialize the dict and return it in metrics
    Serialize it with pickle.dumps(my_dict) and return the resulting bytes under a single key, e.g. “confusion_matrix”: serialized_dict. On the server, deserialize it with pickle.loads(serialized_dict).

  3. Serialize the dict and save it in a ConfigsRecord (when using the low-level API)
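
Options 1 and 2 could be sketched roughly as follows. This is a minimal illustration, not Flower code: flatten_report is a hypothetical helper, the Scalar alias is written out locally on the assumption that it matches Flower's (which accepts bytes), and the report and matrix values are made up.

```python
import pickle
from typing import Dict, Union

# Local stand-in for Flower's Scalar type (assumption: bytes is allowed).
Scalar = Union[bool, bytes, float, int, str]


def flatten_report(report: dict, sep: str = "_") -> Dict[str, Scalar]:
    """Hypothetical helper: turn a nested report into a single-level dict."""
    flat: Dict[str, Scalar] = {}
    for label, value in report.items():
        if isinstance(value, dict):
            # Nested entry, e.g. "0": {"precision": ...} -> "0_precision"
            for metric, score in value.items():
                flat[f"{label}{sep}{metric}"] = score
        else:
            # Top-level scalars such as "accuracy" are kept unchanged.
            flat[label] = value
    return flat


# Illustrative values, shaped like classification_report(..., output_dict=True).
report = {
    "0": {"precision": 0.83, "recall": 0.50, "f1-score": 0.62, "support": 10},
    "1": {"precision": 0.69, "recall": 0.92, "f1-score": 0.79, "support": 12},
    "accuracy": 0.73,
}
confusion_matrix = [[5, 5], [1, 11]]

# Client side: option 1 for the report, option 2 for the confusion matrix.
metrics = flatten_report(report)
metrics["confusion_matrix"] = pickle.dumps(confusion_matrix)

# Server side: recover the matrix from the bytes value.
restored = pickle.loads(metrics["confusion_matrix"])
```

Keep in mind that pickle.loads can execute arbitrary code, so it should only be used on data from clients you trust. For plain lists and dicts, json.dumps and json.loads would be a safer way to produce a string value.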

I hope this helps.


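Also, note that averaging the clients' F1 scores in the standard way (simple or weighted average) will give incorrect results. This is the same issue as averaging per-fold F1 scores in cross-validation. I’d recommend summing the TP, FP, FN, and TN counts from all clients and recalculating the F1 score from those totals.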
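
A small sketch of that recalculation, assuming each client returns its confusion matrix with rows as true labels and columns as predicted labels (the sklearn convention). The helper names and the two client matrices are made up for illustration.

```python
from typing import List

Matrix = List[List[int]]


def sum_matrices(matrices: List[Matrix]) -> Matrix:
    """Element-wise sum of same-sized confusion matrices."""
    n = len(matrices[0])
    total = [[0] * n for _ in range(n)]
    for m in matrices:
        for i in range(n):
            for j in range(n):
                total[i][j] += m[i][j]
    return total


def f1_per_class(cm: Matrix) -> List[float]:
    """Per-class F1 = 2*TP / (2*TP + FP + FN), taken from a confusion matrix."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true label differs
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Two hypothetical clients with different class balance.
client_a = [[8, 2], [1, 9]]    # 20 examples
client_b = [[1, 4], [0, 45]]   # 50 examples

# Recommended: pool the counts first, then compute F1 once.
pooled = sum_matrices([client_a, client_b])
pooled_f1 = f1_per_class(pooled)

# For comparison: example-weighted average of the per-client F1 for class 0.
f1_a = f1_per_class(client_a)[0]
f1_b = f1_per_class(client_b)[0]
naive_f1_class0 = (20 * f1_a + 50 * f1_b) / 70

print("pooled F1 (class 0):", round(pooled_f1[0], 3))
print("averaged F1 (class 0):", round(naive_f1_class0, 3))
```

With these made-up numbers the two values differ noticeably, because F1 is not linear in the underlying counts.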

