How to prevent Flower Next from destroying my model on every fit and evaluate

I wanted to use Flower Next to set up a secure network with my partners. While migrating my federated learning setup from Flower to Flower Next using the guide, I noticed that my app (my_script) is re-executed frequently, which makes training impractical.

Then I reproduced the issue with the PyTorch quickstart template generated by flwr new. I added two print statements that should be printed only once per training run:

# line 41
def client_fn(context: Context):
    # Load model and data
    print("INITIALIZING CLIENTFN")
    net = Net()

and

class FlowerClient(NumPyClient):
    def __init__(self, net, trainloader, valloader, local_epochs):
        print("INITIALIZING CLIENT")
        self.net = net

How can I make these print statements run only once per training run? In other words, how can I use Flower Next without my app being destroyed and rebuilt for every fit and evaluate?


Hello @telcrome, glad to hear you’re using the latest Flower APIs and thanks for your feedback. You’re correct that the app is relaunched in every round of training in the example. The reason for this behaviour is that when you run these templates, the ClientApp is a short-lived process that runs only when required, which optimizes resource utilization in the clients’ compute environment.

One way to get a long-running ClientApp is to deploy the SuperNodes and ClientApps in Docker containers. In this mode, which we call the deployment mode, the SuperNode is launched with the additional --isolation process option.

To start, follow this method of running the flwr/supernode here and replace the --isolation subprocess option with --isolation process. Then, launch another container using the flwr/clientapp image - this is the long-running ClientApp.

For reference, the list of published flwr Docker images can be found here: https://hub.docker.com/u/flwr.

Hope that helps!


Thanks for the quick reply!

Unfortunately, I ran into a problem as soon as I tried the solution in my development environment with Flower version 1.12.0. To simulate a “real” setup, I used flower-superlink --insecure and flower-server-app . --insecure.

First, I started my supernodes with flower-supernode . --insecure --isolation process. After starting the sample server app (again, I used flwr init), both nodes got stuck with:

$ flower-supernode . --insecure --isolation process
INFO :      Starting Flower SuperNode
WARNING :   Option `--insecure` was set. Starting insecure HTTP client connected to 0.0.0.0:9092.
INFO :      Starting Flower ClientAppIo gRPC server on 0.0.0.0:9094
INFO :
INFO :      [RUN 14290138308533441979, ROUND 1]
INFO :      Received: train message c52ab732-bf0b-4c77-9878-2c5010b6673a
INFO :      Sent reply

Then, I tried the subprocess option: flower-supernode . --insecure --isolation subprocess. This resulted in an error:

$ flower-supernode . --insecure --isolation subprocess
INFO :      Starting Flower SuperNode
WARNING :   Option `--insecure` was set. Starting insecure HTTP client connected to 0.0.0.0:9092.
INFO :      Starting Flower ClientAppIo gRPC server on 0.0.0.0:9094
INFO :
INFO :      [RUN 10533597333898177981, ROUND 1]
INFO :      Received: train message 6981bcde-4c13-4eaf-b45c-2ea909ad9c9a
INFO :      Starting Flower ClientApp
INFO :      Pulling ClientAppInputs for token 12402382628435006975
Traceback (most recent call last):
  File "/usr/local/bin/flwr-clientapp", line 8, in <module>
    sys.exit(flwr_clientapp())
  File "/usr/local/lib/python3.10/dist-packages/flwr/client/clientapp/app.py", line 82, in flwr_clientapp
    run_clientapp(supernode=args.supernode, token=args.token)
  File "/usr/local/lib/python3.10/dist-packages/flwr/client/clientapp/app.py", line 125, in run_clientapp
    install_from_fab(fab.content, flwr_dir=None, skip_prompt=True)
  File "/usr/local/lib/python3.10/dist-packages/flwr/cli/install.py", line 105, in install_from_fab
    with zipfile.ZipFile(fab_file_archive, "r") as zipf:
  File "/usr/lib/python3.10/zipfile.py", line 1271, in __init__
    self._RealGetContents()
  File "/usr/lib/python3.10/zipfile.py", line 1338, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
ERROR :     ClientApp raised an exception

Is there a way to resolve those issues?
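For context on the second traceback: `zipfile.ZipFile` raises `BadZipFile` whenever the bytes it is given are not a zip archive, for example when they are empty. The sketch below reproduces that with the standard library only. That the FAB content was empty or malformed in the run above is an assumption, and `looks_like_zip` and `try_open` are hypothetical helper names.

```python
import io
import zipfile


def looks_like_zip(data: bytes) -> bool:
    """Check whether the bytes form a zip archive without opening it."""
    return zipfile.is_zipfile(io.BytesIO(data))


def try_open(data: bytes) -> str:
    """Open the bytes as a zip archive and report what happened."""
    try:
        with zipfile.ZipFile(io.BytesIO(data), "r") as zipf:
            return f"ok: {len(zipf.namelist())} entries"
    except zipfile.BadZipFile as err:
        return f"BadZipFile: {err}"


# Build a small valid archive in memory for comparison.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zipf:
    zipf.writestr("pyproject.toml", "[project]\nname = 'demo'\n")
valid = buf.getvalue()

print(try_open(valid))
print(try_open(b""))
```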

Hi @telcrome, I see the problem.

To execute a Flower run, you need to use the CLI command flwr run. Running using flower-server-app is no longer supported.

Here are the steps to run an example using the Deployment Engine (this is where you spin up several SuperNodes for training) - I’ll use our flwr new templates as an example:

  1. Create a new Flower project from the template using flwr new. Follow the prompts to select the app name, username, and framework.
  2. In the pyproject.toml, add the following lines. They define a federation named mylocalfederation, give the address of the SuperLink that you’re going to connect to (0.0.0.0 is for a locally hosted SuperLink), and set the insecure parameter.
[tool.flwr.federations.mylocalfederation]
address = "0.0.0.0:9093"
insecure = "true"
  3. Now, start a SuperLink with flower-superlink --insecure.
  4. Start a SuperExec with flower-superexec --insecure.
  5. Then, in two separate terminals (inside the project that you created earlier), start the SuperNodes as follows:
# In the first terminal
flower-supernode /path/to/app/ --insecure --node-config "partition-id=0 num-partitions=2"
# In the second terminal
flower-supernode /path/to/app/ --insecure --node-config "partition-id=1 num-partitions=2"
  6. Finally, run it using flwr run /path/to/app mylocalfederation.

An important footnote: from flwr>1.12, you will no longer have to launch the SuperExec, which makes the setup far easier. The SuperExec step (point 4 above) will then no longer be necessary.
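If you need more than two SuperNodes, the per-terminal commands from the steps above can be generated rather than typed by hand. This is only a sketch: `supernode_commands` is a hypothetical helper, and it uses only the flags already shown above (--insecure and --node-config). It builds the command strings and does not launch anything.

```python
import shlex
from typing import List


def supernode_commands(app_path: str, num_partitions: int) -> List[str]:
    """Build one flower-supernode command line per partition."""
    commands = []
    for partition_id in range(num_partitions):
        node_config = (
            f"partition-id={partition_id} num-partitions={num_partitions}"
        )
        argv = [
            "flower-supernode",
            app_path,
            "--insecure",
            "--node-config",
            node_config,
        ]
        # shlex.join quotes the node-config value so it stays one argument.
        commands.append(shlex.join(argv))
    return commands


for command in supernode_commands("/path/to/app/", 2):
    print(command)
```

Each printed line can then be pasted into its own terminal, as in the SuperNode step above.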

Let me know if that helps.

Hi @chongshenng, thanks for your reply!

I was able to run your instructions with both --isolation subprocess and --isolation process using flwr==1.13.1. Unfortunately, with both options all state is destroyed and rebuilt with every fit. I am loading large static resources (logger and a foundation model), which makes this a big problem. Can you help me once more and maybe spot my mistake?

  1. Set up the app with flwr init, add print statements to confirm that static resources are rebuilt, and add your snippet for setting up mylocalfederation.
  2. Start the infrastructure: the SuperNodes with flower-supernode --insecure --node-config "partition-id=0 num-partitions=2" --isolation process and flower-supernode --insecure --node-config "partition-id=1 num-partitions=2" --clientappio-api-address 0.0.0.0:9096 --isolation process, and the SuperLink with flower-superlink --insecure --isolation process
  3. Start processes: flwr-serverapp --insecure --serverappio-api-address 0.0.0.0:9091, flower-supernode --insecure --node-config "partition-id=0 num-partitions=2" --isolation process, flower-supernode --insecure --node-config "partition-id=1 num-partitions=2" --clientappio-api-address 0.0.0.0:9096 --isolation process.
  4. flwr run . mylocalfederation

Is there maybe a different way to load large static resources only once and then keep them in client memory forever?
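One generic Python pattern for this is to cache the loader at module level, sketched below with `functools.lru_cache`. It is not Flower-specific, and it only helps under the assumption that the same Python process handles successive messages; if a fresh process is started for each fit, the cache starts empty again. `get_foundation_model`, the placeholder dictionary, and `load_calls` are hypothetical.

```python
from functools import lru_cache

load_calls = 0


@lru_cache(maxsize=1)
def get_foundation_model():
    """Load the large static resource; the body runs once per process."""
    global load_calls
    load_calls += 1
    return {"weights": [0.0] * 4}  # placeholder for a real model


def client_fn():
    # Later calls in the same process reuse the cached object.
    return get_foundation_model()


for _ in range(5):
    client_fn()

print(load_calls)
```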

Hello @telcrome, apologies for the late reply. We released this guide a few weeks back that can walk you through the steps for implementing stateful clients. I think this is exactly what you’re looking for. Can you try it out and let me know if you still run into issues?
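As a rough illustration of the idea behind stateful clients, the sketch below keeps a counter in a state object that outlives the client instance, so each newly constructed client picks up where the previous one stopped. This is not the Flower API: `FakeContext` with a plain dict `state` is a hypothetical stand-in for the context object, and the guide describes the actual record types to use.

```python
from dataclasses import dataclass, field


@dataclass
class FakeContext:
    """Hypothetical stand-in for a context whose state persists."""

    state: dict = field(default_factory=dict)


class StatefulClient:
    def __init__(self, context):
        # Keep a reference to the persistent state, not a copy.
        self.state = context.state
        self.state.setdefault("fit_calls", 0)

    def fit(self):
        self.state["fit_calls"] += 1
        return self.state["fit_calls"]


context = FakeContext()
# A new client object is built for each call, but the state survives.
results = [StatefulClient(context).fit() for _ in range(3)]
print(results)
```

Note that this pattern suits small values such as counters or metrics; whether it fits a large foundation model depends on what the state records can hold.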