Export metrics for Prometheus

Hello,

I followed the guide at Quickstart with Docker - Flower Framework and everything is running smoothly.
I would like to know if it’s possible to expose the metrics for Prometheus.


Hi @fbcale, great to hear you got it working smoothly! What do you mean by metrics? Are you referring to, for example, the training loss, or are you asking about system-level metrics like the RAM/CPU/IO each container consumes?


Hi @Javier,

I'll try to be more specific and also write down what I've tried in the meantime.

I deployed a federated learning infrastructure with one SuperLink and three SuperNodes.
I need to display the following information in Grafana:

  • CPU/GPU usage
  • Aggregated metrics

For the first one I used cAdvisor; for the second one I used prometheus_client on the ClientApp side with the Gauge class, and then called start_http_server to expose the /metrics endpoint.
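
In rough terms, the client-side part looks like this (a simplified sketch, not my actual repo code; the gauge name and port are placeholders):

```python
from prometheus_client import Gauge, start_http_server

# Placeholder gauge; in practice I track the aggregated evaluation metrics
accuracy_gauge = Gauge("flwr_eval_accuracy", "Latest evaluation accuracy")

# Expose the /metrics endpoint (this spawns a daemon thread)
start_http_server(8001)

def report_round_metrics(metrics: dict) -> None:
    # Called from the app code after each round
    accuracy_gauge.set(metrics["accuracy"])
```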

Finally, I used Prometheus to scrape both.

However, I noticed that once the remote simulation ends, the ClientApp remains idle (I assume because of the daemon thread created by start_http_server).
If I run the simulation again, the ClientApp raises an exception because the exposed port is already in use.
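
For now I work around the second issue with something like this (just a sketch, not my actual code; I simply swallow the bind error if a previous run already started the exporter):

```python
from prometheus_client import start_http_server

METRICS_PORT = 8001  # placeholder port

def start_exporter_once() -> None:
    try:
        start_http_server(METRICS_PORT)
    except OSError:
        # The port is already bound by an exporter from a previous run,
        # so reuse it instead of crashing the ClientApp.
        pass
```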

Is there a better way to handle this?


Hey @fbcale, I have created simple setups with Grafana, Prometheus and cAdvisor in the past. I also helped put together this (now outdated) example in our repository.

I’m not sure I quite understand the problem you are describing. Do you have this code in a public repository I could take a look at? If not, could you create a repository with a very simple setup?


Hi @javier,

I published an example on GitHub, using your sklearn quickstart example.

When I run the remote simulation, once it ends, the ServerApp stays idle due to the start_http_server method.
Is there a way to handle Prometheus with the Deployment Engine?


Hi @fbcale, thanks for sharing the example. I think it would be better to also put the SuperNode/SuperLink etc. as services in your compose file, so everything can be spawned with a single command.

It is expected that all components remain running (but idle) once the run has finished. The “infrastructure” in Flower is technically detached from the “application”.