How do you resume training from a checkpoint with `flwr run`?
For various reasons beyond my control, my system reboots. I'd like to have `flwr run` resume from the latest checkpoint (or from a named checkpoint). What is the best way to handle this?
Hi @griffith! I think one simple solution is to always save checkpoints. Then, in the `ServerApp` code, you can load the checkpoint from the path provided in `context.run_config["checkpoint"]` (or any other key name you like instead of `"checkpoint"`).
For example, you could run this to continue from a named checkpoint:
flwr run ... --run-config 'checkpoint="path/to/your/saved/checkpoint"'
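
To also cover the "resume from the latest checkpoint" case, here is a minimal sketch of how the path could be picked before loading. It only uses the standard library and takes a plain mapping, so you would pass `context.run_config` to it yourself. The helper name `resolve_checkpoint`, the `"checkpoint-dir"` key, and the default `checkpoints` folder are all hypothetical names for illustration, not part of Flower; "latest" here is assumed to mean the most recently modified file.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_checkpoint(
    run_config: Mapping[str, object],
    default_dir: str = "checkpoints",  # hypothetical default folder
) -> Optional[Path]:
    """Return the checkpoint to resume from, or None to start fresh."""
    # 1) An explicitly named checkpoint wins (the "checkpoint" key from above).
    named = str(run_config.get("checkpoint", "") or "")
    if named:
        path = Path(named)
        if not path.is_file():
            raise FileNotFoundError(f"Checkpoint not found: {path}")
        return path

    # 2) Otherwise fall back to the most recently modified file in the
    #    checkpoint directory ("checkpoint-dir" is an assumed key name).
    ckpt_dir = Path(str(run_config.get("checkpoint-dir", default_dir)))
    if not ckpt_dir.is_dir():
        return None
    candidates = [p for p in ckpt_dir.iterdir() if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

In the `ServerApp` code you could then call `resolve_checkpoint(context.run_config)` and, if it returns a path, load the initial parameters from that file with whatever format you saved them in. After a reboot, running the same `flwr run` command without a `checkpoint` override would then pick up the newest file, provided checkpoints are written to that folder every round.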