Simulation succeeding, but only showing eval metric (no train metric)

Hi, I am observing some weird behavior when running a local Flower simulation on an HPC cluster. I have to develop and test the project with synthetic data, since I am not allowed to access the real data. The synthetic data has the same shape and properties as the real data, but is artificially generated.

To verify that everything also works with the real data, a colleague who is allowed to access it runs the code on the real data. It may also be worth mentioning that we both run the exact same git commit on the same HPC cluster. The only difference is therefore the data the training is executed with.

The project depends on the following Flower packages:
flwr 1.23.0
flwr-datasets 0.5.0

and runs on Python 3.11.

Here’s what happens: when the Flower simulation is executed on the HPC cluster with the real data, the following output is printed (this was a simplified run with only one client and one round, i.e. essentially centralized training, but the exact same thing happens in a full federated simulation):

```
INFO : Starting FedAvg strategy:
INFO : ├── Number of rounds: 1
INFO : ├── ArrayRecord (0.25 MB)
INFO : ├── ConfigRecord (train): {'lr': 0.0005, 'local-epochs': 200, 'weight-decay': 1e-06, 'patience': 10, 'patience-early': 30}
INFO : ├── ConfigRecord (evaluate): (empty!)
INFO : ├──> Sampling:
INFO : │ ├── Fraction: train (0.50) | evaluate (1.00)
INFO : │ ├── Minimum nodes: train (1) | evaluate (1)
INFO : │ └── Minimum available nodes: 1
INFO : └──> Keys in records:
INFO :      ├── Weighted by: 'num-examples'
INFO :      ├── ArrayRecord key: 'arrays'
INFO :      └── ConfigRecord key: 'config'
INFO :
INFO :
INFO : [ROUND 1/1]
INFO : configure_train: Sampled 1 nodes (out of 1)
INFO : configure_evaluate: Sampled 1 nodes (out of 1)
INFO : aggregate_evaluate: Received 1 results and 0 failures
INFO : └──> Aggregated MetricRecord: {'eval_loss': 0.13906299602005615}
INFO :
INFO : Strategy execution finished in 4006.91s
INFO :
INFO : Final results:
INFO :
INFO : Global Arrays:
INFO :      ArrayRecord (0.000 MB)
INFO :
INFO : Aggregated ClientApp-side Train Metrics:
INFO :      {}
INFO :
INFO : Aggregated ClientApp-side Evaluate Metrics:
INFO :      {1: {'eval_loss': '1.3906e-01'}}
INFO :
INFO : ServerApp-side Evaluate Metrics:
INFO :      {}
INFO :
```

As you can see, it reports an `eval_loss`, but no `train_loss`.
The really strange thing is that this only happens with the real data, not with the synthetic data. With the synthetic data, everything is executed and printed as expected (including the train loss), no matter whether it runs on the HPC cluster or directly on my MacBook, and no matter whether I run it or my colleague does. The only situation in which this strange behavior appears is when my colleague runs it with the real data.

Does anyone know what could cause this?
Unfortunately, this is a tricky situation, since I cannot access the real data to debug it, and my colleague, who is allowed to access the data, is not a computer scientist and hence cannot debug it.

Any help or guesses about the cause of this are much appreciated! :slight_smile:


hi @kochjo, great to have you here in the community!

Flower only prints a `train_loss` if the client’s `train()` method returns metrics in a `MetricRecord` (as of the latest stable version, 1.25). If `train()` returns `{}` (or exits early), you see exactly what you are seeing: eval metrics but no train metrics.
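
For reference, here is a minimal sketch of what such a `train()` handler can look like, written against the Message-based `ClientApp` API of recent Flower releases. `load_model()`, `load_data()` and `train_one_round()` are hypothetical helpers, and the `"metrics"` reply key is an assumption, not taken from your project:

```python
# Minimal sketch of a ClientApp train() handler that returns a MetricRecord.
# Assumes the Message-based API (flwr.app / flwr.clientapp); the model and
# data helpers below are placeholders, not the original project's code.
from flwr.app import ArrayRecord, Context, Message, MetricRecord, RecordDict
from flwr.clientapp import ClientApp

app = ClientApp()


@app.train()
def train(msg: Message, context: Context) -> Message:
    model = load_model()              # hypothetical helper
    model.load_state_dict(msg.content["arrays"].to_torch_state_dict())
    trainloader = load_data(context)  # hypothetical helper

    # Run local training and compute an average loss (placeholder helper).
    train_loss = train_one_round(model, trainloader, msg.content["config"])

    # If this MetricRecord is empty, or the handler never reaches the return
    # statement, the server prints "Aggregated ClientApp-side Train Metrics: {}".
    metrics = MetricRecord(
        {"train_loss": float(train_loss), "num-examples": len(trainloader.dataset)}
    )
    reply = RecordDict(
        {"arrays": ArrayRecord(model.state_dict()), "metrics": metrics}
    )
    return Message(content=reply, reply_to=msg)
```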

Given that this only happens with real data, the most likely causes are:

  1. Training is skipped or exits early on the real data
  • Empty train dataset / dataloader after filtering
  • Early stopping triggers immediately
  • No valid batches
  2. Data-specific failure inside training
  • NaNs/Infs, unexpected dtypes, missing labels
  • Exception swallowed in the training loop, while evaluation still runs

Could you check for the above and let me know whether any of these apply? The quick sanity-check sketch below might help your colleague inspect the real training data.
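
Something along these lines (plain PyTorch, no Flower involved; `trainloader` stands in for whatever loader your project builds) should be straightforward to run against the real training partition:

```python
import torch


def sanity_check_trainloader(trainloader) -> None:
    """Fail loudly if the train dataloader is empty or contains bad values."""
    n_batches = 0
    for features, labels in trainloader:
        n_batches += 1
        if not torch.isfinite(features).all():
            raise ValueError(f"Non-finite feature values in batch {n_batches}")
        if labels.numel() == 0:
            raise ValueError(f"Batch {n_batches} has no labels")
    if n_batches == 0:
        raise ValueError("Train dataloader yielded no batches at all")
    print(f"OK: {n_batches} batches, all feature values finite, labels present")
```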

Best regards,
William


Hi! Thanks so much for the answer!
Luckily, we found the cause of the issue. The problem was that the local epochs take significantly longer with the real data than with the synthetic data (because there are more data points), and this caused the local training to time out.

I should have double-checked whether there is such a thing as a local training timeout, but I didn’t think of it because there was no error message.

Maybe it would be an idea to add a line to the final report stating that train iterations were skipped due to a timeout? I think that would be helpful :slight_smile:

EDIT: If anyone else stumbles over this problem: you have to increase the `round_timeout` in the server app to prevent training rounds from being skipped.
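
For anyone looking for where that knob lives: below is a minimal sketch assuming the `ServerConfig`-based ServerApp setup. The strategy arguments and the timeout value are placeholders; adapt them to your own project (or to the newer strategy API if that is what you use):

```python
# Sketch: raise the per-round timeout so that long local training rounds are
# not silently dropped. round_timeout is in seconds; None disables the limit.
from flwr.common import Context
from flwr.server import ServerApp, ServerAppComponents, ServerConfig
from flwr.server.strategy import FedAvg


def server_fn(context: Context) -> ServerAppComponents:
    strategy = FedAvg(fraction_fit=0.5, fraction_evaluate=1.0)  # placeholder args
    config = ServerConfig(num_rounds=1, round_timeout=14400.0)  # 4 hours per round
    return ServerAppComponents(strategy=strategy, config=config)


app = ServerApp(server_fn=server_fn)
```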
