Exposing and Visualizing GPU Metrics in a Kubernetes Cluster

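In the previous post, I wrote about how to make Docker and Kubernetes recognize GPU nodes and use them. During heavy training phases, ML and deep learning programs also need visibility into GPU utilization, memory usage, and temperature statistics. Exposing GPU metrics so that they can be monitored takes some work. In this post, I want to share what I did to set up GPU metrics monitoring in my Kubernetes cluster.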

My Kubernetes version is 1.20.2, and I have Tesla V100 GPUs. These instructions were tested on CentOS 7.

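Metrics are usually collected, stored, and visualized by a metrics server. Prometheus (an open-source monitoring system and time-series database) combined with Grafana for visualization is well suited to this purpose. You may have seen this architecture before, and it is worth looking at it again in this context.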

We will install the Prometheus stack as the monitoring solution. Then we will use NVIDIA Data Center GPU Manager (DCGM) to expose GPU metrics to Prometheus.

1. Installing Prometheus

This is easy to do with helm. We only need to make a few small configuration additions at install time. First, add the prometheus helm repository.

$ helm repo add prometheus-community \
   https://prometheus-community.github.io/helm-charts

Update the repository index.

$ helm repo update

Now we need to modify some settings. To do so, we will inspect the chart and save its values to a file.

$ helm inspect values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values

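Change these values so that the service is exposed through a NodePort and the dashboard can be reached from a browser outside the cluster. Look for this section, the service settings for Prometheus, in the kube-prometheus-stack.values file.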

    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 30903
    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    ## Additional ports to open for Alertmanager service
    additionalPorts: []
    externalIPs: []
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    ## Service type
    ##
    type: ClusterIP

Change the values of nodePort and type.

nodePort: 30090
....
type: NodePort

Also set serviceMonitorSelectorNilUsesHelmValues to false.

    ## If true, a nil or {} value for prometheus.prometheusSpec.serviceMonitorSelector will cause the
    ## prometheus resource to be created with selectors based on values in the helm deployment,
    ## which will also match the servicemonitors created
    ##
    serviceMonitorSelectorNilUsesHelmValues: false

Then add a gpu-metrics scrape job under additionalScrapeConfigs.

## AdditionalScrapeConfigs allows specifying additional Prometheus scrape configurations. Scrape configurations
## are appended to the configurations generated by the Prometheus Operator. Job configurations must have the form
## as specified in the official Prometheus documentation:
## https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config. As scrape configs are
## appended, the user is responsible to make sure it is valid. Note that using this feature may expose the possibility
## to break upgrades of Prometheus. It is advised to review Prometheus release notes to ensure that no incompatible
## scrape configs are going to break Prometheus after the upgrade.
##
## The scrape configuration example below will find master nodes, provided they have the name .*mst.*, relabel the
## port to 2379 and allow etcd scraping provided it is running on all Kubernetes master nodes
##
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator-resources
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node

Now save the file. We are ready to install Prometheus and Grafana. Note that we create a new namespace, prometheus, and install into it.

$ helm install prometheus-community/kube-prometheus-stack \
   --create-namespace --namespace prometheus \
   --generate-name \
   --values /tmp/kube-prometheus-stack.values

You should see that the installation succeeded. You can also verify it by checking the pods.

$ kubectl get pods -n prometheus
NAME                                                              READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-1614-alertmanager-0            2/2     Running   0          2m
kube-prometheus-stack-1614-operator-5887789674-brbzf              1/1     Running   0          2m
kube-prometheus-stack-1614100923-grafana-6f6c5999f9-gspmp         2/2     Running   0          2m
kube-prometheus-stack-1614100923-kube-state-metrics-69bcc9zbj76   1/1     Running   0          2m
kube-prometheus-stack-1614100923-prometheus-node-exporter-6t9ln   1/1     Running   0          2m
kube-prometheus-stack-1614100923-prometheus-node-exporter-pp5pn   1/1     Running   0          2m
prometheus-kube-prometheus-stack-1614-prometheus-0                2/2     Running   0          2m
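With many pods, scanning the listing by eye gets tedious. A minimal sketch of a filter that prints only the pods that are not in the Running state is shown here. The function name is hypothetical, and it assumes the default table output of kubectl get pods, where STATUS is the third column.

```shell
# Print the names of pods whose STATUS column is not "Running".
# Reads `kubectl get pods` table output on stdin and skips the header row.
not_running_pods() {
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Possible usage (needs cluster access):
#   kubectl get pods -n prometheus | not_running_pods
```

Empty output means every listed pod reports Running. The READY column is not checked, so a pod that is Running but not yet ready would not be flagged.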

You can also check the services to confirm that Prometheus is available on port 30090, as we wanted.


$ kubectl get svc -n prometheus
NAME                                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                                       ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   3m
kube-prometheus-stack-1614-alertmanager                     ClusterIP   10.97.150.200    <none>        9093/TCP                     3m
kube-prometheus-stack-1614-operator                         ClusterIP   10.107.120.24    <none>        443/TCP                      3m
kube-prometheus-stack-1614-prometheus                       NodePort    10.106.243.247   <none>        9090:30090/TCP               3m
kube-prometheus-stack-1614100923-grafana                    ClusterIP   10.100.30.205    <none>        80/TCP                       3m
kube-prometheus-stack-1614100923-kube-state-metrics         ClusterIP   10.111.133.205   <none>        8080/TCP                     3m
kube-prometheus-stack-1614100923-prometheus-node-exporter   ClusterIP   10.110.160.60    <none>        9100/TCP                     3m
prometheus-operated                                         ClusterIP   None             <none>        9090/TCP                     3m

2. Installing the DCGM Exporter

First, clone NVIDIA's gpu-monitoring-tools repository.

$ git clone https://github.com/NVIDIA/gpu-monitoring-tools.git

Now edit the default-counters.csv and dcp-metrics-include.csv files under the gpu-monitoring-tools/etc/dcgm-exporter folder to add the metrics we want.

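In the csv files, look for the DCGM_FI_DEV_GPU_UTIL entry under the Utilization section and uncomment that line. While you are there, you can comment out any other metrics in the file that you do not need. Some metrics are computationally expensive to collect, so commenting out the ones we do not need reduces the load on the system.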
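The edit can also be scripted. The following is a minimal sketch, assuming that disabled entries in the csv files are lines prefixed with "#" and that each entry starts with the field name followed by a comma. The function names are made up for illustration.

```shell
# Uncomment the line for one DCGM field in a counters csv file.
# Assumption: disabled metrics start with "#", optionally followed by spaces.
enable_metric() {
  local field="$1" file="$2"
  # -i.bak writes a backup copy and works with both GNU and BSD sed
  sed -i.bak -E "s/^#[[:space:]]*(${field},)/\1/" "$file"
}

# Comment out the line for one DCGM field so it is no longer collected.
disable_metric() {
  local field="$1" file="$2"
  sed -i.bak -E "s/^(${field},)/# \1/" "$file"
}

# Possible usage, from the gpu-monitoring-tools checkout:
#   enable_metric DCGM_FI_DEV_GPU_UTIL etc/dcgm-exporter/default-counters.csv
```

Matching on the field name plus the trailing comma keeps the substitution from touching other fields that merely share a prefix.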

Now, from the gpu-monitoring-tools/deployment folder, run the following command to install dcgm-exporter.

$ helm install --generate-name dcgm-exporter/

You can see the dcgm-exporter pods in the default namespace. Note that I have two GPU nodes in my cluster.

$ kubectl get pods
NAME                             READY   STATUS      RESTARTS   AGE
dcgm-exporter-1615588874-l8pzq   1/1     Running     0          2m
dcgm-exporter-1615588874-sdj56   1/1     Running     0          2m
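Before moving on to Prometheus, it can help to look at what an exporter pod actually serves. The sketch below filters one metric out of Prometheus text-format output. The function name is hypothetical, and the usage comment assumes that dcgm-exporter listens on port 9400, which I believe is its default but is not stated in this post.

```shell
# Keep only the sample lines for one metric from Prometheus text-format output.
# HELP/TYPE comment lines are dropped because they start with "#",
# and metrics that merely share the prefix are excluded by the "{" or space check.
metric_samples() {
  local name="$1"
  grep -E "^${name}(\{| )"
}

# Possible usage (needs cluster access; port 9400 is an assumption):
#   kubectl port-forward pod/dcgm-exporter-1615588874-l8pzq 9400:9400 &
#   curl -s http://localhost:9400/metrics | metric_samples DCGM_FI_DEV_GPU_UTIL
```

If the filter prints nothing, the metric is probably still commented out in the csv files.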

3. Prometheus Dashboard

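Now let's verify that the GPU metrics are exposed in the Prometheus dashboard by opening http://<machine-ip-address>:30090. You can type DCGM_FI_DEV_GPU_UTIL, the metric we enabled in the csv files, in the expression box (or any other metric you are interested in) and view the results.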
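The same check can be made from the command line through the Prometheus HTTP API, which offers an instant-query endpoint at /api/v1/query. This is a minimal sketch with a hypothetical function name. The base URL is the NodePort address used above, and the second argument is whichever metric name you would type in the box.

```shell
# Run an instant query against the Prometheus HTTP API.
# $1: base URL such as http://<machine-ip-address>:30090
# $2: PromQL expression, for example a metric name
prom_query() {
  local base_url="$1" expr="$2"
  # -G sends the url-encoded data as a query string on a GET request
  curl -sG "${base_url}/api/v1/query" --data-urlencode "query=${expr}"
}

# Possible usage (needs network access to the cluster):
#   prom_query "http://<machine-ip-address>:30090" "<metric-name>"
```

The response is JSON. An empty result list usually means that Prometheus is not scraping the exporter yet.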

4. Configuring Grafana

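Similar to what we did for the Prometheus dashboard, we can change Grafana's default ClusterIP exposure to NodePort so that the Grafana dashboard can also be reached from a browser outside the cluster. Let's create a patch file.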

$ cat << EOF | tee grafana-patch.yaml
spec:
  # No nodePort is set here, so Kubernetes assigns one from the NodePort range.
  type: NodePort
EOF

Now patch the service with this yaml. You will see a message saying that the service was patched.

$ kubectl patch svc kube-prometheus-stack-1614100923-grafana \
   -n prometheus \
   --patch "$(cat grafana-patch.yaml)"

Note the externally accessible port.

$ kubectl get svc -n prometheus|grep grafana
kube-prometheus-stack-1614100923-grafana                    NodePort    10.100.30.205    <none>        80:32197/TCP                 3m
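If you want the port number on its own, for example to use it in a script, kubectl can print it directly with a jsonpath expression. This is a minimal sketch with a hypothetical function name. It assumes the service has a single port, so index 0 is the one we want.

```shell
# Print the first node port of a Service.
# $1: namespace
# $2: service name (generated by --generate-name, so it differs per install)
svc_node_port() {
  local ns="$1" svc="$2"
  kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.ports[0].nodePort}'
}

# Possible usage (needs cluster access):
#   svc_node_port prometheus kube-prometheus-stack-1614100923-grafana
```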

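In my case, the port is 32197. Open http://<machine-ip-address>:32197 in a browser. You will be prompted to log in. Use admin as the username. The password is available in the values file we changed earlier.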

$ cat /tmp/kube-prometheus-stack.values|grep -i password
  adminPassword: prom-operator
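The values file shows the password that was requested at install time. To read the password that is actually deployed, you could decode it from the Grafana Secret instead. The sketch below rests on two assumptions that are not stated in this post: that the Secret has the same name as the Grafana service, and that the password is stored under the key admin-password, as the Grafana chart does by default. The function name is hypothetical.

```shell
# Print the Grafana admin password stored in the release's Secret.
# $1: namespace
# $2: Secret name (assumed to match the Grafana service name)
grafana_admin_password() {
  local ns="$1" secret="$2"
  # Secret values are base64-encoded, so decode before printing
  kubectl get secret "$secret" -n "$ns" \
    -o jsonpath='{.data.admin-password}' | base64 --decode
}

# Possible usage (needs cluster access):
#   grafana_admin_password prometheus kube-prometheus-stack-1614100923-grafana
```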

5. Grafana Dashboard

We can use the DCGM dashboard provided by NVIDIA. To import it, go to Dashboards -> Manage -> Import in the Grafana page. On the resulting page, enter https://grafana.com/grafana/dashboards/12239 as the URL.

Click Load and select Prometheus as the data source. Once the import finishes, you can view the new dashboard.

You can of course create custom dashboards to visualize more metrics. In the next post, we will see how to use these metrics to autoscale our workloads.

References

https://medium.com/@rajupavuluri/how-to-expose-and-visualize-gpu-metrics-in-your-kubernetes-cluster-2520a7c3ba37
