How to resume flwr run from a checkpoint?

How do you resume training from a checkpoint with flwr run? For various reasons beyond my control, my system reboots. I’d like to have flwr run resume from the latest checkpoint (or from a named checkpoint). What is the best way to handle this?

Hi @griffith! I think one simple solution is to always save checkpoints during training. Then, in the ServerApp code, you can load the checkpoint from the path provided in context.run_config["checkpoint"] (or any other key name you like instead of "checkpoint").

For example, you could run this to continue from a named checkpoint:


flwr run ... --run-config 'checkpoint="path/to/your/saved/checkpoint"'
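
On the ServerApp side, picking the checkpoint could look roughly like the sketch below. It only uses the standard library and treats the run config as a plain mapping. The helper name resolve_checkpoint, the "checkpoint-dir" key, the default "checkpoints" folder and the ".pt" extension are all my own assumptions for illustration, not part of Flower, so adjust them to however you save your checkpoints.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_checkpoint(
    run_config: Mapping[str, object],
    default_dir: str = "checkpoints",  # assumed folder name, not a Flower default
) -> Optional[Path]:
    """Return the checkpoint to resume from, or None to start from scratch."""
    # 1. A named checkpoint passed via --run-config takes priority.
    named = str(run_config.get("checkpoint", "") or "")
    if named:
        path = Path(named)
        if not path.is_file():
            raise FileNotFoundError(f"Checkpoint not found: {path}")
        return path

    # 2. Otherwise fall back to the most recently modified file in the
    #    checkpoint directory ("checkpoint-dir" is a hypothetical key).
    directory = Path(str(run_config.get("checkpoint-dir", default_dir)))
    if not directory.is_dir():
        return None
    candidates = sorted(directory.glob("*.pt"), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None
```

In the ServerApp you would then call resolve_checkpoint(context.run_config) and, if it returns a path, load the weights with whatever your framework provides and use them as the initial parameters. If it returns None, just start training as usual. That way a plain flwr run after a reboot picks up the latest checkpoint, and passing "checkpoint" explicitly selects a specific one.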