Hi, I am observing some weird behavior when executing a Flower local simulation on an HPC cluster. I have to develop and test the project locally with synthetic data, since I am not allowed to access the real data. The synthetic data is artificially generated, but has the same shape and properties as the real data.
To verify that everything also works with the real data, a colleague who is allowed to access the real data executes the code with it. It may also be worth mentioning that we both execute the exact same git commit on the same HPC cluster machine, so the only difference between the two runs is the data the training is executed on.
The project depends on the following Flower packages:
flwr 1.23.0
flwr-datasets 0.5.0
and Python 3.11.
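For completeness, the versions on the cluster can be double-checked with a quick standard-library snippet like the one below (just a sanity check, not part of the project code):

```python
# Quick sanity check: confirm interpreter and package versions on the cluster
import sys
from importlib.metadata import version

print("python       :", sys.version.split()[0])
print("flwr         :", version("flwr"))
print("flwr-datasets:", version("flwr-datasets"))
```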
Here's what happens: when the Flower simulation is executed on the HPC cluster with the real data, the following output is printed (this was a simplified run with only one client and one round, so basically a centralized training, but exactly the same thing happens in a federated simulation):
INFO :      Starting FedAvg strategy:
INFO :            ├── Number of rounds: 1
INFO :            ├── ArrayRecord (0.25 MB)
INFO :            ├── ConfigRecord (train): {'lr': 0.0005, 'local-epochs': 200, 'weight-decay': 1e-06, 'patience': 10, 'patience-early': 30}
INFO :            ├── ConfigRecord (evaluate): (empty!)
INFO :            ├──> Sampling:
INFO :            │    ├──Fraction: train (0.50) | evaluate ( 1.00)
INFO :            │    ├──Minimum nodes: train (1) | evaluate (1)
INFO :            │    └──Minimum available nodes: 1
INFO :            └──> Keys in records:
INFO :                  ├── Weighted by: 'num-examples'
INFO :                  ├── ArrayRecord key: 'arrays'
INFO :                  └── ConfigRecord key: 'config'
INFO :
INFO :
INFO :      [ROUND 1/1]
INFO :      configure_train: Sampled 1 nodes (out of 1)
INFO :      configure_evaluate: Sampled 1 nodes (out of 1)
INFO :      aggregate_evaluate: Received 1 results and 0 failures
INFO :            └──> Aggregated MetricRecord: {'eval_loss': 0.13906299602005615}
INFO :
INFO :      Strategy execution finished in 4006.91s
INFO :
INFO :      Final results:
INFO :
INFO :      Global Arrays:
INFO :            ArrayRecord (0.000 MB)
INFO :
INFO :      Aggregated ClientApp-side Train Metrics:
INFO :            {}
INFO :
INFO :      Aggregated ClientApp-side Evaluate Metrics:
INFO :            {1: {'eval_loss': '1.3906e-01'}}
INFO :
INFO :      ServerApp-side Evaluate Metrics:
INFO :            {}
INFO :
As you can see, it reports an "eval_loss" but no "train_loss" (there is also no aggregate_train line at all, and the aggregated train metrics are empty).
The really strange thing is that this only happens with the real data, not with the synthetic data. With the synthetic data, everything is printed and executed as expected (including the train loss), no matter whether it runs on the HPC cluster or directly on the MacBook I am working with, and no matter whether I execute it or my colleague does. The only situation in which this strange behavior shows up is when my colleague runs it with the real data.
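For context, the ClientApp-side train handler in the project follows the usual Flower Message-API pattern. Below is a heavily simplified sketch of what such a handler looks like (placeholder model and random placeholder data, adapted from the Flower quickstart pattern, so not our actual project code), just to show where "train_loss" and "num-examples" get packed into the reply:

```python
# Simplified sketch of a ClientApp train handler, NOT the actual project code.
# Model, data, and training loop are placeholders; only the reply structure at
# the end mirrors what the FedAvg strategy output above expects.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from flwr.client import ClientApp
from flwr.common import ArrayRecord, Context, Message, MetricRecord, RecordDict

app = ClientApp()


@app.train()
def train(msg: Message, context: Context) -> Message:
    # Rebuild the model and load the global weights sent under the "arrays" key
    model = nn.Linear(10, 1)  # placeholder architecture
    model.load_state_dict(msg.content["arrays"].to_torch_state_dict())

    # Placeholder data; the real code loads the (synthetic or real) partition here
    dataset = TensorDataset(torch.randn(32, 10), torch.randn(32, 1))
    trainloader = DataLoader(dataset, batch_size=8)

    # Local training using the ConfigRecord sent under the "config" key
    cfg = msg.content["config"]
    optimizer = torch.optim.Adam(
        model.parameters(), lr=cfg["lr"], weight_decay=cfg["weight-decay"]
    )
    criterion = nn.MSELoss()
    train_loss = 0.0
    for _ in range(int(cfg["local-epochs"])):
        for x, y in trainloader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            train_loss = loss.item()

    # This reply is where "train_loss" and "num-examples" are reported back;
    # if this point is never reached or the metrics get dropped, the
    # "Aggregated ClientApp-side Train Metrics" stay empty as in the log above.
    content = RecordDict(
        {
            "arrays": ArrayRecord(model.state_dict()),
            "metrics": MetricRecord(
                {"train_loss": train_loss, "num-examples": len(dataset)}
            ),
        }
    )
    return Message(content=content, reply_to=msg)
```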
Does anyone know what could cause this?
Unfortunately, this is a tricky situation: I cannot access the real data to debug it, and my colleague who is allowed to access the data is not a computer scientist and hence cannot debug it either.
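The best idea I have so far is to send my colleague a small diagnostic script that prints only aggregate statistics of the real partitions (shapes, NaN counts, label distribution), so that nothing sensitive leaves the secure environment. A rough sketch of what I mean, where load_partition is a hypothetical placeholder for however the project actually loads a client's data:

```python
# Hypothetical diagnostic helper: prints only aggregate statistics of a data
# partition (no raw values), so the output can be shared for debugging.
import numpy as np


def summarize_partition(features: np.ndarray, labels: np.ndarray) -> None:
    print("features shape/dtype:", features.shape, features.dtype)
    print("labels shape/dtype  :", labels.shape, labels.dtype)
    print("NaNs in features    :", int(np.isnan(features).sum()))
    print("Infs in features    :", int(np.isinf(features).sum()))
    print("feature min/max     :", float(np.nanmin(features)), float(np.nanmax(features)))
    values, counts = np.unique(labels, return_counts=True)
    print("label distribution  :", dict(zip(values.tolist(), counts.tolist())))


if __name__ == "__main__":
    # 'load_partition' is a placeholder for the project's real data loading code:
    # features, labels = load_partition(partition_id=0)
    # summarize_partition(features, labels)
    ...
```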
Any help or guesses about the cause of this are much appreciated!