Problem with BatchNormalization, more precisely with num_batches_tracked

Hello everyone,

I have a problem with training: the first round goes great, but then an error occurs in the second one. Could someone help me?
The first round works well:

INFO :      Flower ECE: gRPC server running (20 rounds), SSL is disabled
INFO :      [INIT]
INFO :      Requesting initial parameters from one random client
Number of availbale Clients 1
INFO :      Received initial parameters from one random client
INFO :      Evaluating initial global parameters
INFO :      
INFO :      [ROUND 1]
INFO :      configure_fit: strategy sampled 2 clients (out of 2)
Number of availbale Clients 2
INFO :      aggregate_fit: received 2 results and 0 failures

In the server file I get the following error:

    hist = run_fl(
  File ".../miniconda/envs/f1/lib/python3.10/site-packages/flwr/server/server.py", line 483, in run_fl
    hist, elapsed_time = server.fit(
  File ".../miniconda/envs/f1/lib/python3.10/site-packages/flwr/server/server.py", line 113, in fit
    res_fit = self.fit_round(
  File ".../multisensor_data_preparation_federated_learning/examples/flowers/testtt/testscaffold/server_scaffold.py", line 237, in fit_round
    aggregated_result_combined = parameters_to_ndarrays(aggregated_result[0])
  File "/.../miniconda/envs/f1/lib/python3.10/site-packages/flwr/common/parameter.py", line 34, in parameters_to_ndarrays
    return [bytes_to_ndarray(tensor) for tensor in parameters.tensors]
AttributeError: 'NoneType' object has no attribute 'tensors'
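For what it's worth, the immediate crash is that `aggregate_fit` handed back `None` parameters. A defensive check makes that failure mode explicit instead of letting `parameters_to_ndarrays` hit the `None` (this is a sketch; `unwrap_aggregated` is a hypothetical helper name, assuming the usual `(parameters, metrics)` tuple convention for the aggregation result):

```python
def unwrap_aggregated(aggregated_result, to_ndarrays):
    """Hypothetical guard: fail loudly if aggregation produced no parameters
    instead of letting the conversion function dereference a None."""
    params = aggregated_result[0] if aggregated_result is not None else None
    if params is None:
        raise RuntimeError(
            "aggregate_fit returned no parameters; check client-side failures"
        )
    return to_ndarrays(params)
```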

This error occurs because num_batches_tracked is empty:

  File ".../miniconda/envs/f1/lib/python3.10/site-packages/flwr/client/client.py", line 234, in maybe_call_fit
    return client.fit(fit_ins)
  File "/home/mwalczewski/miniconda/envs/f1/lib/python3.10/site-packages/flwr/client/numpy_client.py", line 238, in _fit
    results = self.numpy_client.fit(parameters, ins.config)  # type: ignore
  File ".../multisensor_data_preparation_federated_learning/examples/flowers/testtt/testscaffold/client_scaffold.py", line 451, in fit
    self.set_parameters(model_parameter)
  File ".../multisensor_data_preparation_federated_learning/examples/flowers/testtt/testscaffold/client_scaffold.py", line 428, in set_parameters
    raise ValueError(f"Parameter {k} is empty.")
ValueError: Parameter conv.batch.num_batches_tracked is empty.

Here is my code in the client file:

class FlowerClient1(fl.client.NumPyClient):

    def __init__(self, client_index):
        args = parse_args()
        self.client_index = args.client_index
        self.model = model
        train1,test2 = train_loader, test_loader
        self.train = train1[self.client_index]
        self.test = test2[self.client_index]
        self.client_cvalue = []
        for param in self.model.parameters():
            self.client_cvalue.append(torch.zeros(param.shape))
        #save_dir = ""
        #if save_dir == "":
        #    save_dir = "clients_cvs"
        self.dir = "client_cvs"
        if not os.path.exists(self.dir):
            os.makedirs(self.dir)

    def get_parameters(self, config):
        return [val.cpu().numpy() for _, val in self.model.state_dict().items()]

    def set_parameters(self, parameters):
        params_dict = zip(self.model.state_dict().keys(), parameters)

        state_dict = OrderedDict({k: torch.tensor(v) for k, v in params_dict})
        # Check for empty tensors
        for k, v in state_dict.items():
            if v.numel() == 0:
                raise ValueError(f"Parameter {k} is empty.")
            print(f"Parameter {k} shape: {v.shape}")
        self.model.load_state_dict(state_dict, strict=True)
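One way to work around this (a sketch under the assumption that the incoming num_batches_tracked array can arrive empty after aggregation; `merge_with_local` is a hypothetical helper, shown torch-free with NumPy) is to fall back to the client's own buffer instead of raising:

```python
import numpy as np
from collections import OrderedDict

def merge_with_local(keys, incoming, local_state):
    """Hypothetical helper: pair state_dict keys with incoming arrays, but
    fall back to the locally kept buffer when a num_batches_tracked array
    arrives empty (size 0)."""
    merged = OrderedDict()
    for k, arr in zip(keys, incoming):
        arr = np.asarray(arr)
        if "num_batches_tracked" in k and arr.size == 0:
            arr = np.asarray(local_state[k])  # keep the client's own counter
        merged[k] = arr
    return merged
```

In `set_parameters` the merged dict could then be converted with `torch.tensor` and loaded as before.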

The error occurs because, as mentioned before, num_batches_tracked is empty; an excerpt of the printed shapes is shown here:

Parameter resblock5.diconv1.weight shape: torch.Size([256, 64, 3])
Parameter resblock5.diconv1.bias shape: torch.Size([256])
Parameter resblock5.batch1.weight shape: torch.Size([256])
Parameter resblock5.batch1.bias shape: torch.Size([256])
Parameter resblock5.batch1.running_mean shape: torch.Size([256])
Parameter resblock5.batch1.running_var shape: torch.Size([256])
Parameter resblock5.batch1.num_batches_tracked shape: torch.Size([])

It's empty, but the first round works fine. Does anyone have an idea what to do? num_batches_tracked is essential for a working BatchNorm layer, and I don't know what can be done. If further code is needed, please let me know. Outside the Flower framework the model works well. It also works when I set only 1 round on the server and more rounds on the clients, but it fails when I configure multiple rounds on the server.
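A note for anyone debugging the same symptom: `torch.Size([])` on its own is a 0-dim scalar (one element), not an empty tensor, so the `v.numel() == 0` check only fires when the array truly has zero elements. That suggests the buffer arrives with shape `(0,)` rather than `()` in later rounds. The distinction, shown in NumPy terms:

```python
import numpy as np

scalar = np.zeros((), dtype=np.int64)   # shape (), like torch.Size([])
empty = np.zeros((0,), dtype=np.int64)  # shape (0,): genuinely empty

assert scalar.size == 1  # a 0-dim array still holds one element
assert empty.size == 0   # only this would trip `numel() == 0`
```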

Hi @matt08,

Thanks for posting your question here. Is there a specific example that you are following from our GitHub?

Hi,

there is no specific example from GitHub. I want to use my own SCAFFOLD strategy implementation, inspired by the one in your GitHub baselines (flower/baselines/niid_bench at main · adap/flower · GitHub), with a few changes. However, I’m encountering an error that I can’t seem to resolve. The issue is that the num_batches_tracked value in the model’s batch norm buffer is empty. Interestingly, when I train the model using standard (non-federated) training without Flower, everything works perfectly. I also tried manually setting the buffer values during training, and in that case the model runs for three rounds before the issue reappears. I find this behavior unusual. The results after three rounds are promising, so I’m eager to solve this problem.

Here is an example model that I used. I also tried this with my own implemented models:

model = models.resnet34(weights=None)  # `pretrained=False` is deprecated
model.conv1 = nn.Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
model.fc = nn.Linear(in_features=512, out_features=25, bias=True)
model = model.to(DEVICE)
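If the buffer keeps getting mangled in transit, another workaround sometimes used in federated setups (FedBN-style) is to keep the BatchNorm statistics on the client and exchange only the remaining state_dict entries. A sketch of the key split (`split_state_keys` is a hypothetical helper, assuming standard PyTorch buffer naming):

```python
# Suffixes of BatchNorm buffers under standard PyTorch naming.
BN_BUFFER_SUFFIXES = ("running_mean", "running_var", "num_batches_tracked")

def split_state_keys(keys):
    """Split state_dict keys into entries exchanged with the server and
    BatchNorm buffers kept local on the client."""
    shared = [k for k in keys if not k.endswith(BN_BUFFER_SUFFIXES)]
    local = [k for k in keys if k.endswith(BN_BUFFER_SUFFIXES)]
    return shared, local
```

`get_parameters` would then send only the `shared` entries, and `set_parameters` would load them with `strict=False` while leaving the `local` buffers untouched.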

If someone needs more code, I can share it.