Increasing or generally suspiciously high loss after the first round of training

Hi
I have been simulating my federated setup a number of times now, trying to investigate how different factors affect performance.

My question, though, is more related to a suspicious tendency I have noticed. Before training, the model is initialized with random weights and evaluated on the test data, which gives a poor initial performance; this is round 0 in the pictures below. Then the federated training begins, and the loss of the aggregated model on the test data decreases. But what I have noticed is an increase in loss from the initial model to the aggregated model after round 1, which seems a bit odd to me. I have reviewed the code many times and can’t really see anything that should lead to incorrect aggregation after the first round…

But maybe someone has experienced the same thing or is able to figure out what is going on?

Hope to hear from someone!

// Johan

Seems weird. May I ask how you calculate the federated loss? Is it aggregated from the local test losses? Could you also pick the data of one client and train on its own data alone (ignoring FL for now)? Does its loss go up and then down too?
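
Something along these lines would do as a sanity check; the toy model and synthetic tensors below are just placeholders for your actual client data and architecture (a sketch in PyTorch, assuming a classification setup):

```python
# Sanity check: train on a single client's data only (no FL) and log the
# test loss after every epoch. The model and data here are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic data standing in for one client's train split and the global test set
train_ds = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))
test_ds = TensorDataset(torch.randn(128, 20), torch.randint(0, 2, (128,)))
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=64)

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def test_loss(model, loader):
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            total += criterion(model(x), y).item() * len(y)
            n += len(y)
    return total / n

print(f"epoch 0 (random init): test loss {test_loss(model, test_loader):.4f}")
for epoch in range(1, 11):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: test loss {test_loss(model, test_loader):.4f}")
```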

The federated loss is calculated by evaluating the aggregated model on a test set after each round. When I train a normal centralized model, the loss decreases and converges in the usual way. I am also tracking the progress during training on each client, on local train and validation datasets, and each client shows a normal decrease in both train and validation loss.
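
(For reference, this kind of per-round centralized evaluation can be wired up roughly as in the sketch below, assuming a Flower 1.x-style FedAvg with a server-side evaluate_fn; `model` and `test_loader` are placeholders for the actual network and held-out test set.)

```python
# Sketch: centralized evaluation of the aggregated model after every round,
# assuming a Flower 1.x-style FedAvg strategy. `model` and `test_loader` are
# placeholders for the actual network and test set.
from typing import Dict, Optional, Tuple

import flwr as fl
import torch
import torch.nn as nn


def get_evaluate_fn(model: nn.Module, test_loader):
    criterion = nn.CrossEntropyLoss()

    def evaluate(
        server_round: int,
        parameters: fl.common.NDArrays,
        config: Dict[str, fl.common.Scalar],
    ) -> Optional[Tuple[float, Dict[str, fl.common.Scalar]]]:
        # Load the aggregated weights into the model
        # (assumes the same key order the client uses when sending parameters)
        state_dict = {
            k: torch.tensor(v) for k, v in zip(model.state_dict().keys(), parameters)
        }
        model.load_state_dict(state_dict, strict=True)

        # server_round == 0 is the randomly initialized model before any training
        model.eval()
        total, n = 0.0, 0
        with torch.no_grad():
            for x, y in test_loader:
                total += criterion(model(x), y).item() * len(y)
                n += len(y)
        return total / n, {}

    return evaluate


# strategy = fl.server.strategy.FedAvg(evaluate_fn=get_evaluate_fn(model, test_loader))
```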

I believe this is likely due to divergence among the local models. In the first round, since the model is initialized randomly, training on different local datasets can cause the local models to diverge significantly. As a result, the test error increases after the first round. One potential solution is to reduce the number of local epochs, particularly for the first round. However, it’s not that surprising to see the loss increase in the first round, especially with non-IID data, so no need to worry too much about it, I think.
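
If you are using Flower, one way to do that would be along these lines (just a sketch, assuming a 1.x-style FedAvg; the "local_epochs" key is only a convention that your client's fit() would need to read from its config):

```python
# Sketch: shorter local training in round 1 via the fit config, assuming a
# Flower 1.x-style FedAvg strategy. The "local_epochs" key is a convention;
# the client's fit() has to read it and train for that many epochs.
from typing import Dict

import flwr as fl


def fit_config(server_round: int) -> Dict[str, fl.common.Scalar]:
    # Only 1 local epoch in the first round to limit divergence from the
    # random initialization, then the usual number afterwards.
    return {"local_epochs": 1 if server_round == 1 else 5}


strategy = fl.server.strategy.FedAvg(on_fit_config_fn=fit_config)
```

On the client side you would then read it with something like `epochs = int(config["local_epochs"])` inside `fit()`.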

Thanks for the response! That makes sense! Another thing I have noticed, which I can’t really explain, is related to an experiment with one bad client.

So I am simulating 5 clients, and one of them is considered bad: it has drastically worse labels to train on than the others. The bad properties learned from the bad client are aggregated, which makes the aggregated model (federated loss) worse on the test set, as expected. But what is quite weird is that when I track the training progress on each client, the bad performance shifts between clients from round to round, as can be seen in the image.
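
To give an idea of what "worse labels" means here: think of something along the lines of randomly reassigning a large share of that client's labels, as in the simplified sketch below (the flip ratio, dataset shapes, and helper are illustrative, not my exact code):

```python
# Illustrative only: one common way to simulate a "bad" client is to randomly
# reassign a large fraction of its labels. The 60% ratio and the synthetic
# TensorDataset below are made up for the sketch.
import torch
from torch.utils.data import TensorDataset

torch.manual_seed(0)

def corrupt_labels(dataset: TensorDataset, flip_ratio: float, num_classes: int) -> TensorDataset:
    x, y = dataset.tensors
    y = y.clone()
    n_flip = int(flip_ratio * len(y))
    idx = torch.randperm(len(y))[:n_flip]
    y[idx] = torch.randint(0, num_classes, (n_flip,))  # random replacement labels
    return TensorDataset(x, y)

# Clients 0-3 keep clean labels; client 4 gets the corrupted copy
clean = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))
client_datasets = [clean] * 4 + [corrupt_labels(clean, flip_ratio=0.6, num_classes=2)]
```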

But the way I have set it up, all clients receive the same, newest aggregated model before training in the next round, so it doesn’t really make sense why the bad performance shifts between clients instead of being roughly the same (worse) for all of them.

That’s indeed an interesting case. I’m not entirely sure, but it might be that the model was trying to learn from the hard cases, which caused the test loss on another client to increase. I’m not sure, though, why it seems to show a periodic pattern. Perhaps you could try setting the number of local epochs to a very small number, say 1, and see whether the pattern persists. I can’t really tell too much from a single plot, I’m afraid.

Btw, have you registered for the Flower Summit 2025? You are welcome to join us online or in person! Flower AI Summit 2025

Sounds nice, will definitely consider it, thanks!
