Experiment architecture with the new flwr tool

Hello there!

I have been using Flower for a while now, probably since 0.15 or so. However, I have recently fallen behind with the new flwr tool and the architectural changes that came with it. Notably, I am wondering about the best way to test and compare multiple experiments, including existing baselines, which now seem to use the new Flower project architecture (centered around pyproject.toml and {server,client}_app.py).

So here comes my main question, I guess: how do you organize and automate experiments using this new tooling? I used to implement everything myself and use Hydra for configuration and experiment management, but it feels like reinventing the wheel. What would a typical project look like where you run the same experiments over your own contribution and one or more baselines, obviously with the same parameters, model, and dataset? Likewise, how do you collect metrics now that the History object has disappeared?

I’m very interested in hearing about your experience, and most importantly, how you intended the tool to be used, because I currently have trouble wrapping my head around that. Thank you for your time!


Hello @leo-l, great question!

You are right that before we made the transition to the new flwr run style, it was quite neat to use tools like Hydra to parameterize pretty much every aspect of the experiment at hand (strategy, client, data partitioning, etc.). This is exactly what motivated us to coordinate the first round of baselines (which made use of Hydra). However, as Flower evolved, that way of parameterizing experiments turned out to be a bit of a dead end: it was great for simulation but not applicable (or only partially, at best) to real-world deployments.

Now, answering your question. With the new structure of Flower Apps (a ClientApp + a ServerApp + a pyproject.toml, executed via flwr run), you can override the settings defined in the [tool.flwr.app.config] section of your App's pyproject.toml from the CLI. You do this by passing --run-config="....." to your flwr run command. For example, as shown in the quickstart-pytorch example:

# override the default number of rounds and learning rate
# defaults are defined in the pyproject.toml
flwr run . --run-config "num-server-rounds=5 learning-rate=0.05"
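
On the App side, those values are read from context.run_config inside your ServerApp/ClientApp. Here is a minimal sketch of the server side, assuming key names matching the command above; the FedAvg/ServerConfig wiring is illustrative, following the quickstart pattern:

# server_app.py (sketch): consume values set in [tool.flwr.app.config]
# or overridden on the CLI via --run-config
from flwr.common import Context
from flwr.server import ServerApp, ServerAppComponents, ServerConfig
from flwr.server.strategy import FedAvg

def server_fn(context: Context) -> ServerAppComponents:
    # Defaults come from pyproject.toml; --run-config overrides them at launch time
    num_rounds = context.run_config["num-server-rounds"]
    strategy = FedAvg()
    return ServerAppComponents(strategy=strategy, config=ServerConfig(num_rounds=num_rounds))

app = ServerApp(server_fn=server_fn)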

This config system is not yet as powerful as what Hydra offers, but we are working on making it better. We'd welcome feedback if you have some after you give it a try! :folded_hands:

W.r.t. the History object: no, it hasn't gone away, but it is now only an internal object and is no longer returned at the end of, say, a simulation. The History object is, after all, quite minimal and not that useful on its own. A more versatile way of storing metrics and other results from your runs is by means of a custom Strategy: for example, one that inherits from an existing strategy and overrides a couple of methods to save results to a JSON file or as a torch checkpoint. We put together an example showing this (and more) a little while ago: flower/examples/advanced-pytorch at main · adap/flower · GitHub
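
For concreteness, here is a minimal sketch of that idea, assuming a FedAvg-based strategy and a hypothetical results.json output file (the linked example is more complete):

# Sketch: a Strategy that logs aggregated evaluation results to JSON each round
import json
from pathlib import Path
from flwr.server.strategy import FedAvg

class FedAvgWithLogging(FedAvg):
    def __init__(self, *args, results_path: str = "results.json", **kwargs):
        super().__init__(*args, **kwargs)
        self.results_path = Path(results_path)
        self.results = []

    def aggregate_evaluate(self, server_round, results, failures):
        # Let FedAvg do the actual aggregation, then record the outcome
        loss, metrics = super().aggregate_evaluate(server_round, results, failures)
        self.results.append({"round": server_round, "loss": loss, "metrics": metrics})
        self.results_path.write_text(json.dumps(self.results, indent=2))
        return loss, metrics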

Happy to elaborate on the points above!

Thank you @javier for getting back to me! I would be interested in some more details.

I had actually already found the solution you mention, but it seems to me that it is more tailored to configuring a single experiment than to running several at once. I see that, assuming all your experiments share the same codebase, you could replicate the parameters in each of your simulations, but this seems convoluted. And what about swapping out pieces of the architecture altogether? I relied heavily on Hydra's instantiation mechanism to change the model, dataset, or strategy, coupled with parameter sweeping to replay the entire experiment on variations of the architecture. Would that be possible with this new setup?
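
For reference, here is a toy sketch of the Hydra instantiation pattern I mean (the config values are made up, not taken from an actual baseline):

# Hydra builds the object named by _target_, so swapping a component is a config change
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {"model": {"_target_": "torchvision.models.resnet18", "num_classes": 10}}
)
model = instantiate(cfg.model)  # equivalent to torchvision.models.resnet18(num_classes=10)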

More generally, I wonder about comparing multiple baselines in this context. Do you use multiple Flower projects (i.e., multiple pyproject.toml files)? If not, how do you merge existing baselines into your setup to compare them with your own algorithm?

About the History object, this is indeed what I ended up doing; thank you for confirming that it is the recommended way. Any reason the default Strategy does not already collect metrics, though? I think it would simplify the UX, both for simulation and real deployments, to have metrics recorded over rounds, provided (obviously) that the user has implemented methods like evaluate or passed the {fit,evaluate}_metrics_aggregation_fn parameters.
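
To illustrate what I mean, a standard weighted-average aggregation function like the one below (the helper name is mine) already gives the strategy aggregated metrics every round, so in principle it could keep them around:

from typing import List, Tuple
from flwr.common import Metrics
from flwr.server.strategy import FedAvg

def weighted_average(metrics: List[Tuple[int, Metrics]]) -> Metrics:
    # Weight each client's reported accuracy by its number of examples
    accuracies = [num_examples * m["accuracy"] for num_examples, m in metrics]
    examples = [num_examples for num_examples, _ in metrics]
    return {"accuracy": sum(accuracies) / sum(examples)}

strategy = FedAvg(evaluate_metrics_aggregation_fn=weighted_average)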