At a time when Green Computing is becoming ever more important, it is desirable for users to be aware of the energy consumed by their compute jobs. Measuring the energy consumption of whole compute platforms has always been challenging, let alone measuring the consumption of individual batch jobs on multi-tenant systems like HPC platforms. The Compute Energy & Emissions Monitoring Stack (CEEMS) [1] has been created to address this challenge: it is capable of reporting the energy usage and equivalent emissions of individual compute units.
Most resource managers on Linux (SLURM, libvirt, Kubernetes, etc.) use cgroups to allocate and manage the resources of individual compute units. Thus, the CPU, memory, IO, etc. usage of each compute unit is readily available in the cgroups pseudo filesystem. For energy usage, Running Average Power Limit (RAPL) metrics are available on most modern processors, but they only cover CPU and DRAM consumption. The Intelligent Platform Management Interface (IPMI), on the other hand, reports energy usage at the node level and hence offers a more complete overview. GPU vendors provide tools equivalent to IPMI for retrieving the energy usage of GPUs. To estimate emissions, the general practice is to rely on emission factors; real-time emission factors based on the electricity mix are available from RTE eCO2mix (France only) [2] and Electricity Maps [3], to name a few. Thus, by combining data from cgroups, RAPL, IPMI, GPU tools, emission factors and the Power Usage Effectiveness (PUE), it is possible to estimate the total energy consumption and emissions of individual compute units, as sketched in the example below.
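As a rough illustration of how these data sources can be combined, the following Python sketch attributes node-level IPMI energy to one job in proportion to its cgroup CPU time, adds the GPU energy reported by vendor tools, and converts the result to emissions using an emission factor and the PUE. This is a simplified model of the strategy, not the actual CEEMS algorithm, and all numerical values are made up.

```python
"""Back-of-the-envelope estimate of one job's energy and emissions (illustrative only)."""

# Quantities measured over the same time window (e.g. the job's runtime).
node_energy_wh = 350.0        # node-level energy from IPMI (Wh)
node_cpu_seconds = 7200.0     # total CPU time of all cgroups on the node (s)
job_cpu_seconds = 1800.0      # CPU time of this job's cgroup (s)
job_gpu_energy_wh = 120.0     # energy of the GPUs allocated to the job, from vendor tools (Wh)

emission_factor_g_per_kwh = 55.0  # g CO2e/kWh, e.g. from RTE eCO2mix or Electricity Maps
pue = 1.4                         # Power Usage Effectiveness of the data centre

# Share of the node's energy attributed to the job, weighted by its relative CPU usage.
job_cpu_energy_wh = node_energy_wh * job_cpu_seconds / node_cpu_seconds

# Total energy of the compute unit, scaled by the PUE of the facility.
job_energy_kwh = (job_cpu_energy_wh + job_gpu_energy_wh) * pue / 1000.0

# Equivalent emissions using the real-time emission factor.
job_emissions_g = job_energy_kwh * emission_factor_g_per_kwh

print(f"Energy: {job_energy_kwh:.3f} kWh, emissions: {job_emissions_g:.1f} g CO2e")
```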
The strategy described above has been implemented in CEEMS, which is meant to be used alongside other open-source tools like Prometheus and Grafana. An attractive feature of CEEMS is its very simple deployment model, consisting of a Prometheus exporter, a REST API server and an optional load balancer. CEEMS gives users access to time series data of the energy consumption of their compute units; an illustrative query is shown below. At the same time, system administrators and operators get a global overview of the energy consumed by their compute platforms, of the users/projects that have consumed the most energy in a given time period, etc. Although the work started in the context of HPC platforms, the framework can easily be extended to other resource managers.
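For instance, once Prometheus scrapes the CEEMS exporter, a user can retrieve the power time series of one of their jobs through the standard Prometheus HTTP API, as in the sketch below. The Prometheus address, the metric name `job_power_watts` and the `jobid` label are placeholders chosen for illustration; they are not necessarily the names actually exposed by CEEMS.

```python
"""Fetch a job's power time series from Prometheus (illustrative metric and label names)."""
import requests

PROMETHEUS_URL = "http://localhost:9090"   # assumed Prometheus address
QUERY = 'job_power_watts{jobid="123456"}'  # hypothetical metric and label

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": "2024-01-01T00:00:00Z",
        "end": "2024-01-01T06:00:00Z",
        "step": "300s",  # one sample every 5 minutes
    },
    timeout=10,
)
resp.raise_for_status()

# Prometheus returns one result per matching series, each with (timestamp, value) pairs.
for series in resp.json()["data"]["result"]:
    for timestamp, value in series["values"]:
        print(timestamp, float(value), "W")
```

From such time series, dashboards in Grafana can then aggregate energy per job, per user or per project over any time window.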
References:
[1] https://github.com/mahendrapaipuri/ceems
[2] https://www.rte-france.com/en/eco2mix/co2-emissions
[3] https://api-portal.electricitymaps.com/
Anne Cadiou and Pierre Navaro