
Kubernetes Prometheus 

Einic Yeo · September 2, 2021

I. Introduction

Prometheus is a Cloud Native Computing Foundation (CNCF) project. Fully open source, it is used to monitor systems, services, containers, databases, and more: it scrapes the metrics configured for each client target, evaluates alerting expressions against them, graphs the data, and fires alerts when thresholds are crossed.

Compared with other monitoring systems, it has the following characteristics:

  1. A multi-dimensional data model (time series identified by key/value label pairs)
  2. A flexible query language
  3. No dependence on external distributed storage; each node stores its data autonomously
  4. Time-series collection over an HTTP pull model
  5. Targets discovered through service discovery or static configuration files
  6. Support for multiple styles of dashboard

Time-series data can also be pushed to the Prometheus server by way of a push gateway.

Collection modes

Because scraped samples can occasionally be lost, Prometheus is not suited to scenarios that demand 100% accuracy of the collected data. For recording time-series data, however, it offers a major querying advantage, and it fits microservice architectures well.

Pull mode

Prometheus collects data with a pull model, scraping metrics over HTTP. Any application that can expose an HTTP endpoint can be wired into the monitoring system, which is simpler to develop against than a private or binary protocol.
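
A quick way to see what a scrape target serves is to curl its metrics endpoint yourself (a sketch; it assumes a node-exporter listening on its default port 9100 on the local host):

# print the first few lines of the plain-text exposition format
curl -s http://localhost:9100/metrics | head

Each sample is a line of the form metric_name{label="value"} <number>, preceded by # HELP and # TYPE comments.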

Push mode

For short-lived workloads such as scheduled jobs, the pull model can miss data entirely: the task may finish before Prometheus gets around to scraping it. In that case, add an intermediate layer: the client pushes its metrics to a Push Gateway, which caches them, and Prometheus pulls the metrics from the gateway. (This requires running an additional Push Gateway, plus a new job that scrapes the gateway.)
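
A minimal push looks like the following (a sketch; <pushgateway-host> is a placeholder for the real address, and the Pushgateway's HTTP API accepts pushes under /metrics/job/<job_name>):

# push one cached sample for job "nightly_backup" to the gateway
echo "backup_duration_seconds 42" | curl --data-binary @- http://<pushgateway-host>:9091/metrics/job/nightly_backup

Prometheus then scrapes the cached sample from the gateway on its normal pull schedule.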

II. Component architecture


Prometheus components

  • Prometheus Server: the time-series database; data processing, storage, and querying, alerting rules, and graphing of the data
  • alertmanager: receives and dispatches alerts; provides alert template configuration, alert sending, and alert routing
  • Pushgateway: a relay station for collected data; caches pushed metrics
  • exporters: data collection agents
  • Client library: instruments application code so it can expose its own metrics

Architecture diagram

III. Deployment

Deployment method: helm

Reference documentation:

The official guide to configuring service discovery for monitoring Kubernetes:

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config

Kubernetes service discovery exposes the following roles for Prometheus to scrape (Prometheus reads these target resources from the Kubernetes REST API and keeps its view synchronized with cluster state); a minimal example follows the list:

  • node
  • endpoint
  • service
  • pod
  • ingress
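
A minimal scrape job using the node role might look like the sketch below (assumptions: Prometheus runs in-cluster, so the service-account CA and token are mounted at the standard paths, and the job name and file path are illustrative):

cat <<'EOF' > /tmp/k8s-sd-example.yml
scrape_configs:
  - job_name: kubernetes-nodes-example
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
EOF

The helm chart generates fuller versions of such jobs automatically (see the serverFiles section of the values file further down).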

Installation options

  1. Installing Prometheus outside the cluster requires configuring the cluster's CA certificate in prometheus.yml.
  2. Installing Prometheus inside the cluster requires creating an RBAC policy. For specifics, see the official helm installation docs; this guide installs with helm.

There are two sources for the helm chart: the official helm/charts repository, now deprecated, whose entries point to Artifact Hub; and Artifact Hub itself, which hosts the community charts that package CNCF projects for Kubernetes:

  1. https://github.com/helm/charts
  2. https://artifacthub.io

The Prometheus chart on artifacthub.io:

https://artifacthub.io/packages/helm/prometheus-community/prometheus

That page spells out in detail how to install Prometheus with helm, along with the installation prerequisites.

Why the dependency on kube-state-metrics?

Because kube-state-metrics is the component that watches the Kubernetes API and exposes the activity state of workload objects such as ReplicaSets as metrics.
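
You can eyeball those metrics directly once the chart is installed in step 2 below (a sketch; the service name comes from that release, and the port-forward is backgrounded so curl can run in the same shell):

kubectl port-forward svc/prometheus-kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | grep '^kube_deployment' | head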

  1. Add the repos:
<root@PROD-K8S-CP1 ~># helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
"prometheus-community" has been added to your repositories
<root@PROD-K8S-CP1 ~># helm repo add kube-state-metrics https://kubernetes.github.io/kube-state-metrics
"kube-state-metrics" has been added to your repositories
<root@PROD-K8S-CP1 ~># helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "kube-state-metrics" chart repository
...Successfully got an update from the "cilium" chart repository
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈ Happy Helming!⎈
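
Optionally, confirm the chart version before installing (a sketch):

helm search repo prometheus-community/prometheus --versions | head -n 5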

2. Install Prometheus

<root@PROD-K8S-CP1 ~>#  helm install prometheus prometheus-community/prometheus --version 14.1.0
NAME: prometheus
LAST DEPLOYED: Thu Sep  2 15:57:19 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Prometheus server can be accessed via port 80 on the following DNS name from within your cluster:
prometheus-server.default.svc.cluster.local


Get the Prometheus server URL by running these commands in the same shell:
  export POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
  kubectl --namespace default port-forward $POD_NAME 9090


The Prometheus alertmanager can be accessed via port 80 on the following DNS name from within your cluster:
prometheus-alertmanager.default.svc.cluster.local


Get the Alertmanager URL by running these commands in the same shell:
  export POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=alertmanager" -o jsonpath="{.items[0].metadata.name}")
  kubectl --namespace default port-forward $POD_NAME 9093
#################################################################################
######   WARNING: Pod Security Policy has been moved to a global property.  #####
######            use .Values.podSecurityPolicy.enabled with pod-based      #####
######            annotations                                               #####
######            (e.g. .Values.nodeExporter.podSecurityPolicy.annotations) #####
#################################################################################


The Prometheus PushGateway can be accessed via port 9091 on the following DNS name from within your cluster:
prometheus-pushgateway.default.svc.cluster.local


Get the PushGateway URL by running these commands in the same shell:
  export POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=pushgateway" -o jsonpath="{.items[0].metadata.name}")
  kubectl --namespace default port-forward $POD_NAME 9091

For more information on running Prometheus, visit:
https://prometheus.io/

3. Check the deployment status

<root@PROD-K8S-CP1 ~># kubectl get pods --all-namespaces  -o wide| grep node
default         prometheus-node-exporter-9rm22                   1/1     Running   0          51m     10.1.17.237     prod-be-k8s-wn7     <none>           <none>
default         prometheus-node-exporter-qfxgz                   1/1     Running   0          51m     10.1.17.238     prod-be-k8s-wn8     <none>           <none>
default         prometheus-node-exporter-z5znx                   1/1     Running   0          51m     10.1.17.236     prod-be-k8s-wn6     <none>           <none>
<root@PROD-K8S-CP1 ~># kubectl get pods --all-namespaces  -o wide| grep prometheus
default         prometheus-alertmanager-755d84cf4f-rfz4n         0/2     Pending   0          51m     <none>          <none>              <none>           <none>
default         prometheus-kube-state-metrics-86dc6bb59f-wlpcl   1/1     Running   0          51m     172.21.12.167   prod-be-k8s-wn8     <none>           <none>
default         prometheus-node-exporter-9rm22                   1/1     Running   0          51m     10.1.17.237     prod-be-k8s-wn7     <none>           <none>
default         prometheus-node-exporter-qfxgz                   1/1     Running   0          51m     10.1.17.238     prod-be-k8s-wn8     <none>           <none>
default         prometheus-node-exporter-z5znx                   1/1     Running   0          51m     10.1.17.236     prod-be-k8s-wn6     <none>           <none>
default         prometheus-pushgateway-745d67dd5f-7ckvv          1/1     Running   0          51m     172.21.12.2     prod-be-k8s-wn7     <none>           <none>
default         prometheus-server-867f854484-lcrq6 # The default deployment enables persistent storage but does not pin a concrete PVC, so adjust the persistence settings after installing; alternatively, skip persistence entirely and map a local volume into the Pod instead.
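
If prometheus-server or alertmanager sits in Pending, the persistent volume claim is usually the culprit; check whether the claims have bound (a sketch; the claim names follow the release name):

kubectl get pvc -n default
kubectl describe pvc prometheus-server -n default | tail -n 5

An unbound claim means no matching StorageClass/PV exists yet, which is exactly what the customization below addresses.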

4. Customized installation

<root@PROD-K8S-CP1 ~># helm show values prometheus-community/prometheus --version 14.1.0 > prometheus-values.yaml
<root@PROD-K8S-CP1 ~># helm install prometheus prometheus-community/prometheus -f prometheus-values.yaml
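
To apply later edits of prometheus-values.yaml to an already-installed release, roll it forward instead (a sketch):

helm upgrade prometheus prometheus-community/prometheus --version 14.1.0 -f prometheus-values.yaml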

The revised values file:
rbac:
  create: true

podSecurityPolicy:
  enabled: false

imagePullSecrets:
# - name: "image-pull-secret"

## Define serviceAccount names for components. Defaults to component's fully qualified name.
##
serviceAccounts:
  alertmanager:
    create: true
    name:
    annotations: {}
  nodeExporter:
    create: true
    name:
    annotations: {}
  pushgateway:
    create: true
    name:
    annotations: {}
  server:
    create: true
    name:
    annotations: {}

alertmanager:
  ## If false, alertmanager will not be installed
  ##
  enabled: true

  ## Use a ClusterRole (and ClusterRoleBinding)
  ## - If set to false - we define a Role and RoleBinding in the defined namespaces ONLY
  ## This makes alertmanager work - for users who do not have ClusterAdmin privs, but want alertmanager to operate on their own namespaces, instead of clusterwide.
  useClusterRole: true

  ## Set to a rolename to use existing role - skipping role creating - but still doing serviceaccount and rolebinding to the rolename set here.
  useExistingRole: false

  ## alertmanager container name
  ##
  name: alertmanager

  ## alertmanager container image
  ##
  image:
    repository: quay.io/prometheus/alertmanager
    tag: v0.21.0
    pullPolicy: IfNotPresent

  ## alertmanager priorityClassName
  ##
  priorityClassName: ""

  ## Additional alertmanager container arguments
  ##
  extraArgs: {}

  ## Additional InitContainers to initialize the pod
  ##
  extraInitContainers: []

  ## The URL prefix at which the container can be accessed. Useful in the case the '-web.external-url' includes a slug
  ## so that the various internal URLs are still able to access as they are in the default case.
  ## (Optional)
  prefixURL: ""

  ## External URL which can access alertmanager
  baseURL: "http://localhost:9093"

  ## Additional alertmanager container environment variable
  ## For instance to add a http_proxy
  ##
  extraEnv: {}

  ## Additional alertmanager Secret mounts
  # Defines additional mounts with secrets. Secrets must be manually created in the namespace.
  extraSecretMounts: []
    # - name: secret-files
    #   mountPath: /etc/secrets
    #   subPath: ""
    #   secretName: alertmanager-secret-files
    #   readOnly: true

  ## ConfigMap override where fullname is {{.Release.Name}}-{{.Values.alertmanager.configMapOverrideName}}
  ## Defining configMapOverrideName will cause templates/alertmanager-configmap.yaml
  ## to NOT generate a ConfigMap resource
  ##
  configMapOverrideName: ""

  ## The name of a secret in the same kubernetes namespace which contains the Alertmanager config
  ## Defining configFromSecret will cause templates/alertmanager-configmap.yaml
  ## to NOT generate a ConfigMap resource
  ##
  configFromSecret: ""

  ## The configuration file name to be loaded to alertmanager
  ## Must match the key within configuration loaded from ConfigMap/Secret
  ##
  configFileName: alertmanager.yml

  ingress:
    ## If true, alertmanager Ingress will be created
    ##
    enabled: false

    ## alertmanager Ingress annotations
    ##
    annotations: {}
    #   kubernetes.io/ingress.class: nginx
    #   kubernetes.io/tls-acme: 'true'

    ## alertmanager Ingress additional labels
    ##
    extraLabels: {}

    ## alertmanager Ingress hostnames with optional path
    ## Must be provided if Ingress is enabled
    ##
    hosts: []
    #   - alertmanager.domain.com
    #   - domain.com/alertmanager

    ## Extra paths to prepend to every host configuration. This is useful when working with annotation based services.
    extraPaths: []
    # - path: /*
    #   backend:
    #     serviceName: ssl-redirect
    #     servicePort: use-annotation

    ## alertmanager Ingress TLS configuration
    ## Secrets must be manually created in the namespace
    ##
    tls: []
    #   - secretName: prometheus-alerts-tls
    #     hosts:
    #       - alertmanager.domain.com

  ## Alertmanager Deployment Strategy type
  # strategy:
  #   type: Recreate

  ## Node tolerations for alertmanager scheduling to nodes with taints
  ## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
  ##
  ## Configure taint tolerations for alertmanager
  tolerations:
    - key: resource
    #   operator: "Equal|Exists"
      value: base
    #   effect: "NoSchedule|PreferNoSchedule|NoExecute(1.6 only)"
      effect: NoExecute
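
  ## The toleration above matches nodes tainted as in this sketch
  ## (the node name is illustrative):
  ##   kubectl taint nodes prod-sys-k8s-wn3 resource=base:NoExecute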

  ## Node labels for alertmanager pod assignment
  ## Ref: https://kubernetes.io/docs/user-guide/node-selection/
  ##
  nodeSelector:
    kubernetes.io/hostname: prod-sys-k8s-wn3

  ## Pod affinity
  ##
  ## Configure node affinity for alertmanager
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/resource
                operator: In
                values:
                  - base
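
  ## The affinity above only matches nodes carrying the corresponding label,
  ## e.g. (a sketch): kubectl label nodes prod-sys-k8s-wn3 kubernetes.io/resource=base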

  ## PodDisruptionBudget settings
  ## ref: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
  ##
  podDisruptionBudget:
    enabled: false
    maxUnavailable: 1

  ## Use an alternate scheduler, e.g. "stork".
  ## ref: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
  ##
  # schedulerName:

  persistentVolume:
    ## If true, alertmanager will create/use a Persistent Volume Claim
    ## If false, use emptyDir
    ##
    enabled: true

    ## alertmanager data Persistent Volume access modes
    ## Must match those of existing PV or dynamic provisioner
    ## Ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
    ##
    accessModes:
      - ReadWriteOnce

    ## alertmanager data Persistent Volume Claim annotations
    ##
    annotations: {}

    ## alertmanager data Persistent Volume existing claim name
    ## Requires alertmanager.persistentVolume.enabled: true
    ## If defined, PVC must be created manually before volume will be bound
    existingClaim: ""

    ## alertmanager data Persistent Volume mount root path
    ##
    mountPath: /data

    ## alertmanager data Persistent Volume size
    ##
    size: 20Gi

    ## alertmanager data Persistent Volume Storage Class
    ## If defined, storageClassName: <storageClass>
    ## If set to "-", storageClassName: "", which disables dynamic provisioning
    ## If undefined (the default) or set to null, no storageClassName spec is
    ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
    ##   GKE, AWS & OpenStack)
    ##
    # storageClass: "-"
    ## Configure persistent storage for alertmanager
    storageClass: "alicloud-disk-essd"
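
    ## Verify the class exists before relying on it (a sketch):
    ##   kubectl get storageclass alicloud-disk-essd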

    ## alertmanager data Persistent Volume Binding Mode
    ## If defined, volumeBindingMode: <volumeBindingMode>
    ## If undefined (the default) or set to null, no volumeBindingMode spec is
    ##   set, choosing the default mode.
    ##
    # volumeBindingMode: ""

    ## Subdirectory of alertmanager data Persistent Volume to mount
    ## Useful if the volume's root directory is not empty
    ##
    subPath: ""

  emptyDir:
    ## alertmanager emptyDir volume size limit
    ##
    sizeLimit: ""

  ## Annotations to be added to alertmanager pods
  ##
  podAnnotations: {}
    ## Tell prometheus to use a specific set of alertmanager pods
    ## instead of all alertmanager pods found in the same namespace
    ## Useful if you deploy multiple releases within the same namespace
    ##
    ## prometheus.io/probe: alertmanager-teamA

  ## Labels to be added to Prometheus AlertManager pods
  ##
  podLabels: {}

  ## Specify if a Pod Security Policy for node-exporter must be created
  ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/
  ##
  podSecurityPolicy:
    annotations: {}
      ## Specify pod annotations
      ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#apparmor
      ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp
      ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#sysctl
      ##
      # seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
      # seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
      # apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'

  ## Use a StatefulSet if replicaCount needs to be greater than 1 (see below)
  ##
  replicaCount: 1

  ## Annotations to be added to deployment
  ##
  deploymentAnnotations: {}

  statefulSet:
    ## If true, use a statefulset instead of a deployment for pod management.
    ## This allows scaling replicas to more than 1 pod
    ##
    enabled: false

    annotations: {}
    labels: {}
    podManagementPolicy: OrderedReady

    ## Alertmanager headless service to use for the statefulset
    ##
    headless:
      annotations: {}
      labels: {}

      ## Enabling peer mesh service end points for enabling the HA alert manager
      ## Ref: https://github.com/prometheus/alertmanager/blob/master/README.md
      enableMeshPeer: false

      servicePort: 80

  ## alertmanager resource requests and limits
  ## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
  ##
  ## Configure resource requests and limits
  resources:
    limits:
      cpu: 2
      memory: 2Gi
    requests:
      cpu: 10m
      memory: 32Mi

  # Custom DNS configuration to be added to alertmanager pods
  dnsConfig: {}
    # nameservers:
    #   - 1.2.3.4
    # searches:
    #   - ns1.svc.cluster-domain.example
    #   - my.dns.search.suffix
    # options:
    #   - name: ndots
    #     value: "2"
  #   - name: edns0

  ## Configure the network mode
  hostNetwork: false

  ## Security context to be added to alertmanager pods
  ##
  securityContext:
    runAsUser: 65534
    runAsNonRoot: true
    runAsGroup: 65534
    fsGroup: 65534

  service:
    annotations: {}
    labels: {}
    clusterIP: ""

    ## Enabling peer mesh service end points for enabling the HA alert manager
    ## Ref: https://github.com/prometheus/alertmanager/blob/master/README.md
    # enableMeshPeer : true

    ## List of IP addresses at which the alertmanager service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    ## Configure external IPs for the alertmanager service
    externalIPs:
      - 10.1.0.11
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    servicePort: 9093
    # nodePort: 30000
    sessionAffinity: None
    type: ClusterIP

## Monitors ConfigMap changes and POSTs to a URL
## Ref: https://github.com/jimmidyson/configmap-reload
##
configmapReload:
  prometheus:
    ## If false, the configmap-reload container will not be deployed
    ##
    enabled: true

    ## configmap-reload container name
    ##
    name: configmap-reload

    ## configmap-reload container image
    ##
    image:
      repository: jimmidyson/configmap-reload
      tag: v0.5.0
      pullPolicy: IfNotPresent

    ## Additional configmap-reload container arguments
    ##
    extraArgs: {}
    ## Additional configmap-reload volume directories
    ##
    extraVolumeDirs: []


    ## Additional configmap-reload mounts
    ##
    extraConfigmapMounts: []
      # - name: prometheus-alerts
      #   mountPath: /etc/alerts.d
      #   subPath: ""
      #   configMap: prometheus-alerts
      #   readOnly: true


    ## configmap-reload resource requests and limits
    ## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
    ##
    resources: {}
  alertmanager:
    ## If false, the configmap-reload container will not be deployed
    ##
    enabled: true

    ## configmap-reload container name
    ##
    name: configmap-reload

    ## configmap-reload container image
    ##
    image:
      repository: jimmidyson/configmap-reload
      tag: v0.5.0
      pullPolicy: IfNotPresent

    ## Additional configmap-reload container arguments
    ##
    extraArgs: {}
    ## Additional configmap-reload volume directories
    ##
    extraVolumeDirs: []


    ## Additional configmap-reload mounts
    ##
    extraConfigmapMounts: []
      # - name: prometheus-alerts
      #   mountPath: /etc/alerts.d
      #   subPath: ""
      #   configMap: prometheus-alerts
      #   readOnly: true


    ## configmap-reload resource requests and limits
    ## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
    ##
    resources: {}

kubeStateMetrics:
  ## If false, kube-state-metrics sub-chart will not be installed
  ##
  enabled: true
  ## Configure node selection and affinity
  nodeSelector:
    kubernetes.io/hostname: prod-sys-k8s-wn3
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/resource
                operator: In
                values:
                  - base
  priorityClassName: "monitor-service"
  resources:
    limits:
      cpu: 1
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 30Mi
  tolerations:
    - key: resource
    #   operator: "Equal|Exists"
      value: base
      effect: NoExecute

## kube-state-metrics sub-chart configurable values
## Please see https://github.com/kubernetes/kube-state-metrics/tree/master/charts/kube-state-metrics
##
# kube-state-metrics:

nodeExporter:
  ## If false, node-exporter will not be installed
  ##
  enabled: true

  ## If true, node-exporter pods share the host network namespace
  ##
  hostNetwork: true

  ## If true, node-exporter pods share the host PID namespace
  ##
  hostPID: true

  ## If true, node-exporter pods mounts host / at /host/root
  ##
  hostRootfs: true

  ## node-exporter container name
  ##
  name: node-exporter

  ## node-exporter container image
  ##
  image:
    repository: quay.io/prometheus/node-exporter
    tag: v1.1.2
    pullPolicy: IfNotPresent

  ## Specify if a Pod Security Policy for node-exporter must be created
  ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/
  ##
  podSecurityPolicy:
    annotations: {}
      ## Specify pod annotations
      ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#apparmor
      ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp
      ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#sysctl
      ##
      # seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
      # seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
      # apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'

  ## node-exporter priorityClassName
  priorityClassName: "monitor-service"

  ## Custom Update Strategy
  ## Configure the rolling update strategy
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1

  ## Additional node-exporter container arguments
  ##
  extraArgs: {}

  ## Additional InitContainers to initialize the pod
  ##
  extraInitContainers: []

  ## Additional node-exporter hostPath mounts
  ##
  extraHostPathMounts: []
    # - name: textfile-dir
    #   mountPath: /srv/txt_collector
    #   hostPath: /var/lib/node-exporter
    #   readOnly: true
    #   mountPropagation: HostToContainer

  extraConfigmapMounts: []
    # - name: certs-configmap
    #   mountPath: /prometheus
    #   configMap: certs-configmap
    #   readOnly: true

  ## Node tolerations for node-exporter scheduling to nodes with taints
  ## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
  ##
  ## Configure taint tolerations for node-exporter
  tolerations:
    # - key: "key"
    #   operator: "Equal|Exists"
    - operator: Exists
    #   value: "value"
    #   effect: "NoSchedule|PreferNoSchedule|NoExecute(1.6 only)"

  ## Node labels for node-exporter pod assignment
  ## Ref: https://kubernetes.io/docs/user-guide/node-selection/
  ##
  nodeSelector: {}

  ## Annotations to be added to node-exporter pods
  ##
  podAnnotations: {}

  ## Labels to be added to node-exporter pods
  ##
  pod:
    labels: {}

  ## PodDisruptionBudget settings
  ## ref: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
  ##
  podDisruptionBudget:
    enabled: false
    maxUnavailable: 1

  ## node-exporter resource limits & requests
  ## Ref: https://kubernetes.io/docs/user-guide/compute-resources/
  ##
  ## Configure resource requests and limits for node-exporter
  resources:
    limits:
      cpu: 1
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 30Mi

  # Custom DNS configuration to be added to node-exporter pods
  dnsConfig: {}
    # nameservers:
    #   - 1.2.3.4
    # searches:
    #   - ns1.svc.cluster-domain.example
    #   - my.dns.search.suffix
    # options:
    #   - name: ndots
    #     value: "2"
  #   - name: edns0

  ## Security context to be added to node-exporter pods
  ##
  securityContext:
    fsGroup: 65534
    runAsGroup: 65534
    runAsNonRoot: true
    runAsUser: 65534

  service:
    annotations:
      prometheus.io/scrape: "true"
    labels: {}

    # Exposed as a headless service:
    # https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
    clusterIP: None

    ## List of IP addresses at which the node-exporter service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    externalIPs: []

    hostPort: 9100
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    servicePort: 9100
    type: ClusterIP

server:
  ## Prometheus server container name
  ##
  enabled: true

  ## Use a ClusterRole (and ClusterRoleBinding)
  ## - If set to false - we define a RoleBinding in the defined namespaces ONLY
  ##
  ## NB: because we need a Role with nonResourceURL's ("/metrics") - you must get someone with Cluster-admin privileges to define this role for you, before running with this setting enabled.
  ##     This makes prometheus work - for users who do not have ClusterAdmin privs, but want prometheus to operate on their own namespaces, instead of clusterwide.
  ##
  ## You MUST also set namespaces to the ones you have access to and want monitored by Prometheus.
  ##
  # useExistingClusterRoleName: nameofclusterrole

  ## namespaces to monitor (instead of monitoring all - clusterwide). Needed if you want to run without Cluster-admin privileges.
  # namespaces:
  #   - yournamespace

  ## Configure the Prometheus server container name
  name: server

  # sidecarContainers - add more containers to prometheus server
  # Key/Value where Key is the sidecar `- name: <Key>`
  # Example:
  #   sidecarContainers:
  #      webserver:
  #        image: nginx
  sidecarContainers: {}

  ## Prometheus server container image
  ##
  image:
    repository: quay.io/prometheus/prometheus
    tag: v2.26.0
    pullPolicy: IfNotPresent

  ## prometheus server priorityClassName
  ##
  priorityClassName: "monitor-service"

  ## EnableServiceLinks indicates whether information about services should be injected
  ## into pod's environment variables, matching the syntax of Docker links.
  ## WARNING: the field is unsupported and will be skipped in K8s prior to v1.13.0.
  ##
  enableServiceLinks: true

  ## The URL prefix at which the container can be accessed. Useful in the case the '-web.external-url' includes a slug
  ## so that the various internal URLs are still able to access as they are in the default case.
  ## (Optional)
  prefixURL: ""

  ## External URL which can access prometheus
  ## Maybe same with Ingress host name
  baseURL: ""

  ## Additional server container environment variables
  ##
  ## You specify this manually like you would a raw deployment manifest.
  ## This means you can bind in environment variables from secrets.
  ##
  ## e.g. static environment variable:
  ##  - name: DEMO_GREETING
  ##    value: "Hello from the environment"
  ##
  ## e.g. secret environment variable:
  ## - name: USERNAME
  ##   valueFrom:
  ##     secretKeyRef:
  ##       name: mysecret
  ##       key: username
  env: []

  extraFlags:
    - web.enable-lifecycle
    ## web.enable-admin-api flag controls access to the administrative HTTP API which includes functionality such as
    ## deleting time series. This is disabled by default.
    - web.enable-admin-api
    ##
    ## storage.tsdb.no-lockfile flag controls DB locking
    # - storage.tsdb.no-lockfile
    ##
    ## storage.tsdb.wal-compression flag enables compression of the write-ahead log (WAL)
    # - storage.tsdb.wal-compression
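
    ## With web.enable-lifecycle set, the running server reloads its configuration
    ## on demand; a sketch (substitute the real server address and port):
    ##   curl -X POST http://prometheus-server.default.svc.cluster.local:9090/-/reload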

  ## Path to a configuration file on prometheus server container FS
  configPath: /etc/config/prometheus.yml

  ### The data directory used by prometheus to set --storage.tsdb.path
  ### When empty server.persistentVolume.mountPath is used instead
  ## Configure the Prometheus data storage path
  storagePath: ""

  ## Customizations to the Prometheus global config
  global:
    ## How frequently to scrape targets by default
    ##
    scrape_interval: 15s
    ## How long until a scrape request times out
    ##
    scrape_timeout: 10s
    ## How frequently to evaluate rules
    ##
    evaluation_interval: 15s
  ## https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
  ##
  remoteWrite: []
  ## https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_read
  ##
  remoteRead: []

  ## Additional Prometheus server container arguments
  ##
  extraArgs: {}

  ## Additional InitContainers to initialize the pod
  ##
  extraInitContainers: []

  ## Additional Prometheus server Volume mounts
  ##
  extraVolumeMounts: []

  ## Additional Prometheus server Volumes
  ##
  extraVolumes: []

  ## Additional Prometheus server hostPath mounts
  ##
  extraHostPathMounts: []
    # - name: certs-dir
    #   mountPath: /etc/kubernetes/certs
    #   subPath: ""
    #   hostPath: /etc/kubernetes/certs
    #   readOnly: true

  extraConfigmapMounts: []
    # - name: certs-configmap
    #   mountPath: /prometheus
    #   subPath: ""
    #   configMap: certs-configmap
    #   readOnly: true

  ## Additional Prometheus server Secret mounts
  # Defines additional mounts with secrets. Secrets must be manually created in the namespace.
  extraSecretMounts: []
    # - name: secret-files
    #   mountPath: /etc/secrets
    #   subPath: ""
    #   secretName: prom-secret-files
    #   readOnly: true

  ## ConfigMap override where fullname is {{.Release.Name}}-{{.Values.server.configMapOverrideName}}
  ## Defining configMapOverrideName will cause templates/server-configmap.yaml
  ## to NOT generate a ConfigMap resource
  ##
  configMapOverrideName: ""

  ingress:
    ## If true, Prometheus server Ingress will be created
    ##
    enabled: false

    ## Prometheus server Ingress annotations
    ##
    annotations: {}
    #   kubernetes.io/ingress.class: nginx
    #   kubernetes.io/tls-acme: 'true'

    ## Prometheus server Ingress additional labels
    ##
    extraLabels: {}

    ## Prometheus server Ingress hostnames with optional path
    ## Must be provided if Ingress is enabled
    ##
    hosts: []
    #   - prometheus.domain.com
    #   - domain.com/prometheus

    ## Extra paths to prepend to every host configuration. This is useful when working with annotation based services.
    extraPaths: []
    # - path: /*
    #   backend:
    #     serviceName: ssl-redirect
    #     servicePort: use-annotation

    ## Prometheus server Ingress TLS configuration
    ## Secrets must be manually created in the namespace
    ##
    tls: []
    #   - secretName: prometheus-server-tls
    #     hosts:
    #       - prometheus.domain.com

  ## Server Deployment Strategy type
  # strategy:
  #   type: Recreate

  ## hostAliases allows adding entries to /etc/hosts inside the containers
  hostAliases: []
  #   - ip: "127.0.0.1"
  #     hostnames:
  #       - "example.com"

  ## Node tolerations for server scheduling to nodes with taints
  ## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
  ##
  ## Configure taint tolerations for prometheus-server
  tolerations:
    - key: resource
    #   operator: "Equal|Exists"
      value: base
    #   effect: "NoSchedule|PreferNoSchedule|NoExecute(1.6 only)"
      effect: NoExecute

  ## Node labels for Prometheus server pod assignment
  ## Ref: https://kubernetes.io/docs/user-guide/node-selection/
  ##
  nodeSelector:
    kubernetes.io/hostname: prod-sys-k8s-wn4

  ## Pod affinity
  ##
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/resource
                operator: In
                values:
                  - base

  ## PodDisruptionBudget settings
  ## ref: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
  ##
  podDisruptionBudget:
    enabled: false
    maxUnavailable: 1

  ## Use an alternate scheduler, e.g. "stork".
  ## ref: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
  ##
  # schedulerName:

  persistentVolume:
    ## If true, Prometheus server will create/use a Persistent Volume Claim
    ## If false, use emptyDir
    ##
    enabled: true

    ## Prometheus server data Persistent Volume access modes
    ## Must match those of existing PV or dynamic provisioner
    ## Ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
    ##
    accessModes:
      - ReadWriteOnce

    ## Prometheus server data Persistent Volume annotations
    ##
    annotations: {}

    ## Prometheus server data Persistent Volume existing claim name
    ## Requires server.persistentVolume.enabled: true
    ## If defined, PVC must be created manually before volume will be bound
    existingClaim: ""

    ## Prometheus server data Persistent Volume mount root path
    ##
    mountPath: /data

    ## Prometheus server data Persistent Volume size
    ##
    size: 500Gi

    ## Prometheus server data Persistent Volume Storage Class
    ## If defined, storageClassName: <storageClass>
    ## If set to "-", storageClassName: "", which disables dynamic provisioning
    ## If undefined (the default) or set to null, no storageClassName spec is
    ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
    ##   GKE, AWS & OpenStack)
    ##
    storageClass: "alicloud-disk-essd"

    ## Prometheus server data Persistent Volume Binding Mode
    ## If defined, volumeBindingMode: <volumeBindingMode>
    ## If undefined (the default) or set to null, no volumeBindingMode spec is
    ##   set, choosing the default mode.
    ##
    # volumeBindingMode: ""

    ## Subdirectory of Prometheus server data Persistent Volume to mount
    ## Useful if the volume's root directory is not empty
    ##
    subPath: ""

  emptyDir:
    ## Prometheus server emptyDir volume size limit
    ##
    sizeLimit: ""

  ## Annotations to be added to Prometheus server pods
  ##
  podAnnotations: {}
    # iam.amazonaws.com/role: prometheus

  ## Labels to be added to Prometheus server pods
  ##
  podLabels: {}

  ## Prometheus AlertManager configuration
  ##
  alertmanagers: []

  ## Specify if a Pod Security Policy for node-exporter must be created
  ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/
  ##
  podSecurityPolicy:
    annotations: {}
      ## Specify pod annotations
      ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#apparmor
      ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp
      ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#sysctl
      ##
      # seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
      # seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
      # apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'

  ## Use a StatefulSet if replicaCount needs to be greater than 1 (see below)
  ##
  replicaCount: 1

  ## Annotations to be added to deployment
  ##
  deploymentAnnotations: {}

  statefulSet:
    ## If true, use a statefulset instead of a deployment for pod management.
    ## This allows scaling replicas to more than 1 pod
    ##
    enabled: false

    annotations: {}
    labels: {}
    podManagementPolicy: OrderedReady

    ## Alertmanager headless service to use for the statefulset
    ##
    headless:
      annotations: {}
      labels: {}
      servicePort: 80
      ## Enable gRPC port on service to allow auto discovery with thanos-querier
      gRPC:
        enabled: false
        servicePort: 10901
        # nodePort: 10901

  ## Prometheus server readiness and liveness probe initial delay and timeout
  ## Ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
  ##
  readinessProbeInitialDelay: 30
  readinessProbePeriodSeconds: 5
  readinessProbeTimeout: 4
  readinessProbeFailureThreshold: 3
  readinessProbeSuccessThreshold: 1
  livenessProbeInitialDelay: 30
  livenessProbePeriodSeconds: 15
  livenessProbeTimeout: 10
  livenessProbeFailureThreshold: 3
  livenessProbeSuccessThreshold: 1

  ## Prometheus server resource requests and limits
  ## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
  ##
  ## Resource requests and limits for prometheus-server
  resources:
    limits:
      cpu: 4
      memory: 30Gi
    requests:
      cpu: 500m
      memory: 5Gi

  # Required for use in managed kubernetes clusters (such as AWS EKS) with custom CNI (such as calico),
  # because control-plane managed by AWS cannot communicate with pods' IP CIDR and admission webhooks are not working
  ##
  hostNetwork: false

  # When hostNetwork is enabled, you probably want to set this to ClusterFirstWithHostNet
  dnsPolicy: ClusterFirst

  ## Vertical Pod Autoscaler config
  ## Ref: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
  verticalAutoscaler:
    ## If true a VPA object will be created for the controller (either StatefulSet or Deployment, based on above configs)
    enabled: false
    # updateMode: "Auto"
    # containerPolicies:
    # - containerName: 'prometheus-server'

  # Custom DNS configuration to be added to prometheus server pods
  dnsConfig: {}
    # nameservers:
    #   - 1.2.3.4
    # searches:
    #   - ns1.svc.cluster-domain.example
    #   - my.dns.search.suffix
    # options:
    #   - name: ndots
    #     value: "2"
  #   - name: edns0
  ## Security context to be added to server pods
  ##
  securityContext:
    runAsUser: 65534
    runAsNonRoot: true
    runAsGroup: 65534
    fsGroup: 65534

  service:
    annotations: {}
    labels: {}
    clusterIP: ""

    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    ## Configure external IP access for the prometheus-server service
    externalIPs:
      - 10.1.0.10

    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    servicePort: 9090
    sessionAffinity: None
    type: ClusterIP

    ## Enable gRPC port on service to allow auto discovery with thanos-querier
    gRPC:
      enabled: false
      servicePort: 10901
      # nodePort: 10901

    ## If using a statefulSet (statefulSet.enabled=true), configure the
    ## service to connect to a specific replica to have a consistent view
    ## of the data.
    statefulsetReplica:
      enabled: false
      replica: 0

  ## Prometheus server pod termination grace period
  ##
  terminationGracePeriodSeconds: 300

  ## Prometheus data retention period (default if not specified is 15 days)
  ##
  retention: "30d"

pushgateway:
  ## If false, pushgateway will not be installed
  ##
  enabled: true

  ## Use an alternate scheduler, e.g. "stork".
  ## ref: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
  ##
  # schedulerName:

  ## pushgateway container name
  ##
  name: pushgateway

  ## pushgateway container image
  ##
  image:
    repository: prom/pushgateway
    tag: v1.3.1
    pullPolicy: IfNotPresent

  ## pushgateway priorityClassName
  ##
  priorityClassName: "monitor-service"

  ## Additional pushgateway container arguments
  ##
  ## for example: persistence.file: /data/pushgateway.data
  extraArgs: {}

  ## Additional InitContainers to initialize the pod
  ##
  extraInitContainers: []

  ingress:
    ## If true, pushgateway Ingress will be created
    ##
    enabled: false

    ## pushgateway Ingress annotations
    ##
    annotations: {}
    #   kubernetes.io/ingress.class: nginx
    #   kubernetes.io/tls-acme: 'true'

    ## pushgateway Ingress hostnames with optional path
    ## Must be provided if Ingress is enabled
    ##
    hosts: []
    #   - pushgateway.domain.com
    #   - domain.com/pushgateway

    ## Extra paths to prepend to every host configuration. This is useful when working with annotation based services.
    extraPaths: []
    # - path: /*
    #   backend:
    #     serviceName: ssl-redirect
    #     servicePort: use-annotation

    ## pushgateway Ingress TLS configuration
    ## Secrets must be manually created in the namespace
    ##
    tls: []
    #   - secretName: prometheus-alerts-tls
    #     hosts:
    #       - pushgateway.domain.com

  ## Node tolerations for pushgateway scheduling to nodes with taints
  ## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
  ##
  ## Configure taint tolerations for pushgateway
  tolerations:
    - key: resource
    #   operator: "Equal|Exists"
      value: base
    #   effect: "NoSchedule|PreferNoSchedule|NoExecute(1.6 only)"
      effect: NoExecute

  ## Node labels for pushgateway pod assignment
  ## Ref: https://kubernetes.io/docs/user-guide/node-selection/
  ##
  nodeSelector: {}

  ## Configure node affinity for pushgateway
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/resource
                operator: In
                values:
                  - base

  ## Annotations to be added to pushgateway pods
  ##
  podAnnotations: {}

  ## Labels to be added to pushgateway pods
  ##
  podLabels: {}

  ## Specify if a Pod Security Policy for node-exporter must be created
  ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/
  ##
  podSecurityPolicy:
    annotations: {}
      ## Specify pod annotations
      ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#apparmor
      ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp
      ## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#sysctl
      ##
      # seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
      # seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
      # apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'

  replicaCount: 1

  ## Annotations to be added to deployment
  ##
  deploymentAnnotations: {}

  ## PodDisruptionBudget settings
  ## ref: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
  ##
  podDisruptionBudget:
    enabled: false
    maxUnavailable: 1

  ## pushgateway resource requests and limits
  ## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
  ##
  ## Configure resource requests and limits for pushgateway
  resources:
    limits:
      cpu: 10m
      memory: 32Mi
    requests:
      cpu: 10m
      memory: 32Mi

  # Custom DNS configuration to be added to push-gateway pods
  dnsConfig: {}
    # nameservers:
    #   - 1.2.3.4
    # searches:
    #   - ns1.svc.cluster-domain.example
    #   - my.dns.search.suffix
    # options:
    #   - name: ndots
    #     value: "2"
  #   - name: edns0

  ## Security context to be added to push-gateway pods
  ##
  securityContext:
    runAsUser: 65534
    runAsNonRoot: true

  service:
    annotations:
      prometheus.io/probe: pushgateway
    labels: {}
    clusterIP: ""

    ## List of IP addresses at which the pushgateway service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    externalIPs: []

    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    servicePort: 9091
    type: ClusterIP

  ## pushgateway Deployment Strategy type
  # strategy:
  #   type: Recreate

  persistentVolume:
    ## If true, pushgateway will create/use a Persistent Volume Claim
    ##
    enabled: false

    ## pushgateway data Persistent Volume access modes
    ## Must match those of existing PV or dynamic provisioner
    ## Ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
    ##
    accessModes:
      - ReadWriteOnce

    ## pushgateway data Persistent Volume Claim annotations
    ##
    annotations: {}

    ## pushgateway data Persistent Volume existing claim name
    ## Requires pushgateway.persistentVolume.enabled: true
    ## If defined, PVC must be created manually before volume will be bound
    existingClaim: ""

    ## pushgateway data Persistent Volume mount root path
    ##
    mountPath: /data

    ## pushgateway data Persistent Volume size
    ##
    size: 2Gi

    ## pushgateway data Persistent Volume Storage Class
    ## If defined, storageClassName: <storageClass>
    ## If set to "-", storageClassName: "", which disables dynamic provisioning
    ## If undefined (the default) or set to null, no storageClassName spec is
    ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
    ##   GKE, AWS & OpenStack)
    ##
    # storageClass: "-"

    ## pushgateway data Persistent Volume Binding Mode
    ## If defined, volumeBindingMode: <volumeBindingMode>
    ## If undefined (the default) or set to null, no volumeBindingMode spec is
    ##   set, choosing the default mode.
    ##
    # volumeBindingMode: ""

    ## Subdirectory of pushgateway data Persistent Volume to mount
    ## Useful if the volume's root directory is not empty
    ##
    subPath: ""


## alertmanager ConfigMap entries
##
alertmanagerFiles:
  alertmanager.yml:
    global: {}
      # slack_api_url: ''

    receivers:
      - name: default-receiver
        # slack_configs:
        #  - channel: '@you'
        #    send_resolved: true

    route:
      group_wait: 10s
      group_interval: 5m
      receiver: default-receiver
      repeat_interval: 1h

## Prometheus server ConfigMap entries
##
serverFiles:

  ## Alerts configuration
  ## Ref: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
  alerting_rules.yml: {}
  # groups:
  #   - name: Instances
  #     rules:
  #       - alert: InstanceDown
  #         expr: up == 0
  #         for: 5m
  #         labels:
  #           severity: page
  #         annotations:
  #           description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
  #           summary: 'Instance {{ $labels.instance }} down'
  ## DEPRECATED DEFAULT VALUE, unless explicitly naming your files, please use alerting_rules.yml
  alerts: {}

  ## Records configuration
  ## Ref: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
  recording_rules.yml: {}
  ## DEPRECATED DEFAULT VALUE, unless explicitly naming your files, please use recording_rules.yml
  rules: {}

  prometheus.yml:
    rule_files:
      - /etc/config/recording_rules.yml
      - /etc/config/alerting_rules.yml
    ## Below two files are DEPRECATED will be removed from this default values file
      - /etc/config/rules
      - /etc/config/alerts

    scrape_configs:
      - job_name: prometheus
        static_configs:
          - targets:
            - localhost:9090

      # A scrape configuration for running Prometheus on a Kubernetes cluster.
      # This uses separate scrape configs for cluster components (i.e. API server, node)
      # and services to allow each to use different authentication configs.
      #
      # Kubernetes labels will be added as Prometheus labels on metrics via the
      # `labelmap` relabeling action.

      # Scrape config for API servers.
      #
      # Kubernetes exposes API servers as endpoints to the default/kubernetes
      # service so this uses `endpoints` role and uses relabelling to only keep
      # the endpoints associated with the default/kubernetes service using the
      # default named port `https`. This works for single API server deployments as
      # well as HA API server deployments.
      - job_name: 'kubernetes-apiservers'

        kubernetes_sd_configs:
          - role: endpoints

        # Default to scraping over https. If required, just disable this or change to
        # `http`.
        scheme: https

        # This TLS & bearer token file config is used to connect to the actual scrape
        # endpoints for cluster components. This is separate to discovery auth
        # configuration because discovery & scraping are two separate concerns in
        # Prometheus. The discovery auth config is automatic if Prometheus runs inside
        # the cluster. Otherwise, more config options have to be provided within the
        # <kubernetes_sd_config>.
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          # If your node certificates are self-signed or use a different CA to the
          # master CA, then disable certificate verification below. Note that
          # certificate verification is an integral part of a secure infrastructure
          # so this should only be disabled in a controlled environment. You can
          # disable certificate verification by uncommenting the line below.
          #
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        # Keep only the default/kubernetes service endpoints for the https port. This
        # will add targets for each API server which Kubernetes adds an endpoint to
        # the default/kubernetes service.
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'

        # Default to scraping over https. If required, just disable this or change to
        # `http`.
        scheme: https

        # This TLS & bearer token file config is used to connect to the actual scrape
        # endpoints for cluster components. This is separate to discovery auth
        # configuration because discovery & scraping are two separate concerns in
        # Prometheus. The discovery auth config is automatic if Prometheus runs inside
        # the cluster. Otherwise, more config options have to be provided within the
        # <kubernetes_sd_config>.
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          # If your node certificates are self-signed or use a different CA to the
          # master CA, then disable certificate verification below. Note that
          # certificate verification is an integral part of a secure infrastructure
          # so this should only be disabled in a controlled environment. You can
          # disable certificate verification by uncommenting the line below.
          #
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
          - role: node

        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/$1/proxy/metrics


      - job_name: 'kubernetes-nodes-cadvisor'

        # Default to scraping over https. If required, just disable this or change to
        # `http`.
        scheme: https

        # This TLS & bearer token file config is used to connect to the actual scrape
        # endpoints for cluster components. This is separate to discovery auth
        # configuration because discovery & scraping are two separate concerns in
        # Prometheus. The discovery auth config is automatic if Prometheus runs inside
        # the cluster. Otherwise, more config options have to be provided within the
        # <kubernetes_sd_config>.
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          # If your node certificates are self-signed or use a different CA to the
          # master CA, then disable certificate verification below. Note that
          # certificate verification is an integral part of a secure infrastructure
          # so this should only be disabled in a controlled environment. You can
          # disable certificate verification by uncommenting the line below.
          #
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
          - role: node

        # This configuration will work only on kubelet 1.7.3+
        # As the scrape endpoints for cAdvisor have changed
        # if you are using older version you need to change the replacement to
        # replacement: /api/v1/nodes/$1:4194/proxy/metrics
        # more info here https://github.com/coreos/prometheus-operator/issues/633
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor

      # Scrape config for service endpoints.
      #
      # The relabeling allows the actual service scrape endpoint to be configured
      # via the following annotations:
      #
      # * `prometheus.io/scrape`: Only scrape services that have a value of `true`
      # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
      # to set this to `https` & most likely set the `tls_config` of the scrape config.
      # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
      # * `prometheus.io/port`: If the metrics are exposed on a different port to the
      # service then set this appropriately.
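      #
      # For example, a Service opting in to scraping might carry (a sketch):
      #   metadata:
      #     annotations:
      #       prometheus.io/scrape: "true"
      #       prometheus.io/port: "8080"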
      - job_name: 'kubernetes-service-endpoints'

        kubernetes_sd_configs:
          - role: endpoints

        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: kubernetes_node

      # Scrape config for slow service endpoints; same as above, but with a larger
      # timeout and a larger interval
      #
      # The relabeling allows the actual service scrape endpoint to be configured
      # via the following annotations:
      #
      # * `prometheus.io/scrape-slow`: Only scrape services that have a value of `true`
      # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
      # to set this to `https` & most likely set the `tls_config` of the scrape config.
      # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
      # * `prometheus.io/port`: If the metrics are exposed on a different port to the
      # service then set this appropriately.
      - job_name: 'kubernetes-service-endpoints-slow'

        scrape_interval: 5m
        scrape_timeout: 30s

        kubernetes_sd_configs:
          - role: endpoints

        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape_slow]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: kubernetes_node

      - job_name: 'prometheus-pushgateway'
        honor_labels: true

        kubernetes_sd_configs:
          - role: service

        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
            action: keep
            regex: pushgateway

      # Example scrape config for probing services via the Blackbox Exporter.
      #
      # The relabeling allows the actual service scrape endpoint to be configured
      # via the following annotations:
      #
      # * `prometheus.io/probe`: Only probe services that have a value of `true`
      - job_name: 'kubernetes-services'

        metrics_path: /probe
        params:
          module: [http_2xx]

        kubernetes_sd_configs:
          - role: service

        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
            action: keep
            regex: true
          - source_labels: [__address__]
            target_label: __param_target
          - target_label: __address__
            replacement: blackbox
          - source_labels: [__param_target]
            target_label: instance
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            target_label: kubernetes_name

      # Example scrape config for pods
      #
      # The relabeling allows the actual pod scrape endpoint to be configured via the
      # following annotations:
      #
      # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
      # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
      # to set this to `https` & most likely set the `tls_config` of the scrape config.
      # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
      # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the default of `9102`.
      - job_name: 'kubernetes-pods'

        kubernetes_sd_configs:
          - role: pod

        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
            action: replace
            regex: (https?)
            target_label: __scheme__
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
          - source_labels: [__meta_kubernetes_pod_phase]
            regex: Pending|Succeeded|Failed
            action: drop

      # Example scrape config for pods which should be scraped slower. A useful example
      # would be stackdriver-exporter, which queries an API on every scrape of the pod
      #
      # The relabeling allows the actual pod scrape endpoint to be configured via the
      # following annotations:
      #
      # * `prometheus.io/scrape-slow`: Only scrape pods that have a value of `true`
      # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
      # to set this to `https` & most likely set the `tls_config` of the scrape config.
      # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
      # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the default of `9102`.
      - job_name: 'kubernetes-pods-slow'

        scrape_interval: 5m
        scrape_timeout: 30s

        kubernetes_sd_configs:
          - role: pod

        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape_slow]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
            action: replace
            regex: (https?)
            target_label: __scheme__
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
          - source_labels: [__meta_kubernetes_pod_phase]
            regex: Pending|Succeeded|Failed
            action: drop

# adds additional scrape configs to prometheus.yml
# must be a string so you have to add a | after extraScrapeConfigs:
# example adds prometheus-blackbox-exporter scrape config
extraScrapeConfigs:
  # - job_name: 'prometheus-blackbox-exporter'
  #   metrics_path: /probe
  #   params:
  #     module: [http_2xx]
  #   static_configs:
  #     - targets:
  #       - https://example.com
  #   relabel_configs:
  #     - source_labels: [__address__]
  #       target_label: __param_target
  #     - source_labels: [__param_target]
  #       target_label: instance
  #     - target_label: __address__
  #       replacement: prometheus-blackbox-exporter:9115

# Adds option to add alert_relabel_configs to avoid duplicate alerts in alertmanager
# useful in H/A prometheus with different external labels but the same alerts
alertRelabelConfigs:
  # alert_relabel_configs:
  # - source_labels: [dc]
  #   regex: (.+)\d+
  #   target_label: dc

networkPolicy:
  ## Enable creation of NetworkPolicy resources.
  ##
  enabled: false

# Force namespace of namespaced resources
forceNamespace: null
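
As the comments above note, extraScrapeConfigs must be a YAML string (hence the | after the key). A minimal sketch of enabling the commented blackbox example through a Helm values override; the file name values-extra.yaml is illustrative, and the blackbox exporter itself must already be running in the cluster:

# values-extra.yaml (illustrative)
extraScrapeConfigs: |
  - job_name: 'prometheus-blackbox-exporter'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: prometheus-blackbox-exporter:9115

Apply it with helm upgrade prometheus prometheus-community/prometheus -f values-extra.yaml.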

5. Access the Prometheus & Alertmanager web UIs

<root@PROD-K8S-CP1 ~># kubectl describe svc prometheus-prometheus-server 
Name:              prometheus-prometheus-server
Namespace:         default
Labels:            app=prometheus
                   app.kubernetes.io/managed-by=Helm
                   chart=prometheus-14.6.0
                   component=prometheus-server
                   heritage=Helm
                   release=prometheus
Annotations:       meta.helm.sh/release-name: prometheus
                   meta.helm.sh/release-namespace: default
Selector:          app=prometheus,component=prometheus-server,release=prometheus
Type:              ClusterIP
IP:                10.12.0.20
External IPs:      10.1.0.10
Port:              http  9090/TCP
TargetPort:        9090/TCP
Endpoints:         172.21.3.157:9090
Session Affinity:  None
Events:            <none>
<root@PROD-K8S-CP1 ~># kubectl describe svc prometheus-alertmanager 
Name:              prometheus-alertmanager
Namespace:         default
Labels:            app=prometheus
                   app.kubernetes.io/managed-by=Helm
                   chart=prometheus-14.6.0
                   component=alertmanager
                   heritage=Helm
                   release=prometheus
Annotations:       meta.helm.sh/release-name: prometheus
                   meta.helm.sh/release-namespace: default
Selector:          app=prometheus,component=alertmanager,release=prometheus
Type:              ClusterIP
IP:                10.12.0.13
External IPs:      10.1.0.11
Port:              http  9093/TCP
TargetPort:        9093/TCP
Endpoints:         172.21.3.251:9093
Session Affinity:  None
Events:            <none>

6. Access it at 10.1.0.10:9090 (the External IP and port shown in the Service above)

7. Fix the container time zone (edited directly via Lens): mount the host's /etc/localtime using the volumeMount and volume fragments below

        - name: localtime
          mountPath: /etc/localtime

----------------------------------------------

        - name: localtime
          hostPath:
            path: /etc/localtime
            type: ''
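
For context, a minimal sketch of where the two fragments sit in the server Deployment (the container name is illustrative; only the localtime pieces come from the snippet above):

spec:
  template:
    spec:
      containers:
        - name: prometheus-server        # illustrative container name
          volumeMounts:
            - name: localtime
              mountPath: /etc/localtime
      volumes:
        - name: localtime
          hostPath:
            path: /etc/localtime
            type: ''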

8. Note: change the Prometheus server's rolling-update strategy so maxUnavailable is 1; otherwise the replacement pod starts while the old pod still holds the TSDB lock on the data volume, producing a DB lock error. Allowing one unavailable pod lets the old pod exit (and release the lock) before the new one starts.

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
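
An alternative that achieves the same end (my suggestion, not from the original): the Recreate strategy tears the old pod down before starting the new one, which likewise avoids the TSDB lock conflict for a single-replica server.

  strategy:
    type: Recreate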

9. If you manually check the metrics a node exposes (here, the kube-apiserver on :6443):

<root@PRE-K8S-CP1 ~># curl -ik https://10.1.0.233:6443/metrics
HTTP/1.1 403 Forbidden
Cache-Control: no-cache, private
Content-Type: application/json
X-Content-Type-Options: nosniff
Date: Wed, 02 Jun 2021 03:40:55 GMT
Content-Length: 240

{
"kind": "Status",
"apiVersion": "v1",
"metadata": {

},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
"reason": "Forbidden",
"details": {

},
"code": 403
}


The fix is below. The request fails because, by default, Kubernetes denies unauthenticated (anonymous) access to cluster resources. Note that binding cluster-admin to system:anonymous, as in the one-liner, is fine for a quick test but far too broad for production; a least-privilege alternative follows.

kubectl create clusterrolebinding prometheus-admin --clusterrole cluster-admin --user system:anonymous
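
A more restrained sketch (my suggestion, not from the original): grant anonymous users GET on the /metrics non-resource URL only.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metrics-reader
rules:
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics-reader-anonymous
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: metrics-reader
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: system:anonymous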

四、Prometheus Configuration File


Explanation of the relabel_configs action parameter

  • replace: the default; regex is matched against the joined source_labels values, and target_label is set to replacement, which may reference regex capture groups (a short example follows this list)
  • keep: drop targets for which regex does not match the joined source_labels (keep the matching ones)
  • drop: drop targets for which regex matches the joined source_labels
  • labelmap: match regex against all label names, then copy the values of matching labels to new label names given by replacement with group references (${1}, ${2}, …)
  • labeldrop: remove labels whose names match regex
  • labelkeep: remove labels whose names do not match regex
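
A minimal sketch (the pod labels are illustrative) showing replace and labelmap together:

relabel_configs:
  # replace: join the discovered pod IP and the prometheus.io/port annotation into host:port
  - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: (.+);(\d+)
    replacement: $1:$2
    target_label: __address__
    action: replace
  # labelmap: __meta_kubernetes_pod_label_app="web" becomes app="web"
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)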

Default configuration file

job_name generally identifies what kind of metrics a job collects. The jobs below collect, in turn:

  1. apiserver performance metrics; the configuration is explained inline below
- job_name: kubernetes-apiservers
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  authorization:
    type: Bearer
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  follow_redirects: true
  relabel_configs:   ## label rewriting rules
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]  ## the source labels whose values form the match condition
    separator: ; ## separator inserted between the source label values before matching
    regex: default;kubernetes;https ## matches the joined source label values; e.g. with __meta_kubernetes_service_name="kubernetes", __meta_kubernetes_namespace="default" and __meta_kubernetes_endpoint_port_name="https", the rule matches only the endpoint whose port name is "https" (an "endpoint" here is a scrape target, not to be confused with a Kubernetes Endpoints object)
    replacement: $1 ## default
    action: keep ## keep targets whose joined source labels match regex and drop the rest
  kubernetes_sd_configs:
  - role: endpoints
    follow_redirects: true

2. Node performance metrics: job_name kubernetes-nodes

- job_name: kubernetes-nodes
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  authorization:
    type: Bearer
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  follow_redirects: true
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap ## keep the ".+" capture of __meta_kubernetes_node_label_(.+) as new label names
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace ## add/overwrite __address__ with the replacement value
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+) ## matches any source label value (in practice, the node name)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/$1/proxy/metrics
    action: replace ## the $1 above is the node name captured by regex
  kubernetes_sd_configs:
  - role: node
    follow_redirects: true

Taken together, this block rewrites the default __address__: the first rule swaps the scraped address for the in-cluster apiserver DNS name, and the second rewrites the default /metrics path to /api/v1/nodes/$1/proxy/metrics. The final target is therefore, for example, https://kubernetes.default.svc/api/v1/nodes/pre-k8s-cp3/proxy/metrics (you can reproduce it by hand with kubectl get --raw /api/v1/nodes/pre-k8s-cp3/proxy/metrics).

3. Node container (cAdvisor) metrics: job_name kubernetes-nodes-cadvisor

- job_name: kubernetes-nodes-cadvisor
  honor_timestamps: true
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  authorization:
    type: Bearer
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  follow_redirects: true
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    # the only difference from the node metrics job: the URI path
    replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
    action: replace
  kubernetes_sd_configs:
  - role: node
    follow_redirects: true

4. Monitoring a service's endpoints (configured via annotations on the Service)

apiVersion: v1
kind: Service
metadata:
  name: prometheus-kube-state-metrics
  namespace: default
  selfLink: /api/v1/namespaces/default/services/prometheus-kube-state-metrics
  uid: 161fff0c-fffe-4561-90ba-0bec0608fbe4
  resourceVersion: '23976299'
  creationTimestamp: '2021-06-01T06:51:21Z'
  labels:
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    helm.sh/chart: kube-state-metrics-3.1.0
  annotations:
    meta.helm.sh/release-name: prometheus
    meta.helm.sh/release-namespace: default
    prometheus.io/scrape: 'true'
- job_name: kubernetes-service-endpoints
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep ## the crucial rule: it decides which endpoints this job scrapes. Keep endpoints whose Service carries the annotation prometheus.io/scrape="true" and drop all endpoints without that annotation value
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace  ## overwrite __scheme__ with the matched annotation value
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace ## assign the prometheus.io/path annotation value to __metrics_path__
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace ## join the host part of __address__ with the prometheus.io/port annotation into host:port and write it back to __address__
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap ## promote the ".+" capture to a new label
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace ## set the label value, as above
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_node
    replacement: $1
    action: replace
  kubernetes_sd_configs:
  - role: endpoints
    follow_redirects: true
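
Tying it together: to have this job scrape a service on a non-default path or port, annotate the Service (the name and values here are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: demo-app                                # illustrative
  annotations:
    prometheus.io/scrape: 'true'                # required by the keep rule above
    prometheus.io/scheme: 'http'
    prometheus.io/path: '/actuator/prometheus'  # overrides /metrics
    prometheus.io/port: '8080'                  # overrides the discovered port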

5. Service probing config (blackbox http_2xx module)

- job_name: kubernetes-services
  honor_timestamps: true
  params:
    module:
    - http_2xx ## the blackbox module to use
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /probe
  scheme: http
  follow_redirects: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep ## keep services annotated with prometheus.io/probe and drop all others
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace ## copy the original __address__ into the __param_target query parameter
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: blackbox
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace ## after the relabeling above, instance ends up holding the original __address__ value
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace
  kubernetes_sd_configs:
  - role: service
    follow_redirects: true
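
Note that replacement: blackbox assumes a blackbox-exporter Service named "blackbox" is reachable in the cluster. To have a service probed, annotate it (illustrative):

apiVersion: v1
kind: Service
metadata:
  name: demo-web                     # illustrative
  annotations:
    prometheus.io/probe: 'true'      # matched by the keep rule above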

6. Pod monitoring config

- job_name: kubernetes-pods
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_phase]
    separator: ;
    regex: Pending|Succeeded|Failed
    replacement: $1
    action: drop
  kubernetes_sd_configs:
  - role: pod
    follow_redirects: true

五、JVM & RabbitMQ Monitoring


  1. JVM monitoring (the prometheus.io/scrape annotations must be set in the Deployment's pod template), as below
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pre-common-gateway
      component: spring
      part-of: pre
      tier: backend
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: pre-common-gateway
        component: spring
        part-of: pre
        tier: backend
      annotations: # add the monitoring annotations
        prometheus.io/port: '8888'
        prometheus.io/scrape: 'true'
#############  JVM monitoring   ##############
  - job_name: kubernetes-pods-jvm
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: keep
        regex: true
        source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels: 
        - __meta_kubernetes_pod_container_name
        target_label: kubernetes_container_name
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
      - action: drop
        regex: Pending|Succeeded|Failed
        source_labels:
          - __meta_kubernetes_pod_phase
    scrape_interval: 10s
    scrape_timeout: 10s

2. RabbitMQ monitoring

  • rabbitmq-exporter (third-party Prometheus exporter)
    https://github.com/kbudde/rabbitmq_exporter
    Focuses on RabbitMQ usage state: queues/channels/exchanges/connections, per-queue memory consumption, node memory/disk, and so on
  • rabbitmq-prometheus (ships with RabbitMQ)
    https://www.rabbitmq.com/prometheus.html
    Focuses on the RabbitMQ runtime itself: metadata, Erlang process resource usage, memory configuration, CPU usage, open-file allocation; in short, lower-level metrics
The configuration below is for reference only; this is the Service state after creation:
apiVersion: v1
kind: Service
metadata:
  name: pre-rabbitmq-monitor
  namespace: pre
  annotations:
    prometheus.io/scrape: rabbitmq
spec:
  ports:
    - name: rabbitmq-exporter
      protocol: TCP
      port: 9419
      targetPort: 9419
    - name: rabbitmq-prometheus-port
      protocol: TCP
      port: 15692
      targetPort: 15692
  selector:
    app: pre-rabbitmq
  clusterIP: 10.11.0.90
  type: ClusterIP
  sessionAffinity: None
#############  RabbitMQ monitoring   ##############
  - job_name: kubernetes-service-rabbitmq
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # keep on the custom annotation value "rabbitmq" instead of "true"
      - action: keep
        regex: rabbitmq
        source_labels:
          - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
          - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
          - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      # drop the ports that should not be monitored
      - action: drop
        regex: (5672|15672)
        source_labels:
          - __meta_kubernetes_pod_container_port_number
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: labelmap
        regex: __meta_kubernetes_pod_label_statefulset_kubernetes_io_(.+)
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_service_name
        target_label: kubernetes_name
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: kubernetes_node

3. Reference rabbitmq-exporter.yaml config (edited directly via Lens)

        - name: rabbitmq-exporter
          image: hub.qiangyun.com/rabbitmq-exporter
          ports:
            - name: mq-monitor
              containerPort: 9419
              protocol: TCP
          env:
            - name: RABBIT_USER
              value: guest
            - name: RABBIT_PASSWORD
              value: guest
            - name: RABBIT_CAPABILITIES
              value: bert
          resources:
            limits:
              cpu: 500m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /metrics
              port: 9419
              scheme: HTTP
            initialDelaySeconds: 60
            timeoutSeconds: 15
            periodSeconds: 60
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /metrics
              port: 9419
              scheme: HTTP
            initialDelaySeconds: 20
            timeoutSeconds: 10
            periodSeconds: 60
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent

4. rabbitmq-monitor-service.yaml

apiVersion: v1
kind: Service
metadata:
  annotations:
    # note: this annotation pairs with the kubernetes-service-rabbitmq scrape job in the Prometheus config
    prometheus.io/scrape: rabbitmq
  name: rabbitmq-monitor
  namespace: prod
spec:
  ports:
    - name: rabbitmq-exporter
      port: 9419
      protocol: TCP
      targetPort: 9419
    - name: rabbitmq-prometheus-port
      port: 15692
      protocol: TCP
      targetPort: 15692
  selector:
    app: rabbitmq
  type: ClusterIP
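
One assumption behind port 15692 above: the built-in rabbitmq_prometheus plugin must be enabled on the broker for that port to serve metrics. A hedged sketch of enabling it declaratively (a ConfigMap mounted at /etc/rabbitmq/enabled_plugins; names are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-plugins        # illustrative
  namespace: prod
data:
  enabled_plugins: |
    [rabbitmq_management,rabbitmq_prometheus].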

六、Reference Configuration


  • Full Prometheus configuration
global:
  evaluation_interval: 15s
  scrape_interval: 15s
  scrape_timeout: 10s
rule_files:
- /etc/config/recording_rules.yml
- /etc/config/alerting_rules.yml
- /etc/config/rules
- /etc/config/alerts
scrape_configs:
- job_name: prometheus
  static_configs:
  - targets:
    - localhost:9090

##############  Kubernetes apiserver monitoring config ##############
- job_name: kubernetes-apiservers
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - action: keep
    regex: default;kubernetes;https
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_service_name
    - __meta_kubernetes_endpoint_port_name
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true

##############  Kubernetes node metrics monitoring config ##############
- job_name: kubernetes-nodes
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - replacement: kubernetes.default.svc:443
    target_label: __address__
  - regex: (.+)
    replacement: /api/v1/nodes/$1/proxy/metrics
    source_labels:
    - __meta_kubernetes_node_name
    target_label: __metrics_path__
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true

##############  Kubernetes per-node pod (cAdvisor) metrics monitoring config ##############
- job_name: kubernetes-nodes-cadvisor
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - replacement: kubernetes.default.svc:443
    target_label: __address__
  - regex: (.+)
    replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
    source_labels:
    - __meta_kubernetes_node_name
    target_label: __metrics_path__
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true

##############  Kubernetes service endpoints monitoring config ##############
- job_name: kubernetes-service-endpoints
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scrape
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scheme
    target_label: __scheme__
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_service_annotation_prometheus_io_port
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_service_name
    target_label: kubernetes_name
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_node_name
    target_label: kubernetes_node

##############  Kubernetes service endpoints RabbitMQ monitoring config ##############
- job_name: kubernetes-service-rabbitmq
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - action: keep
    regex: rabbitmq
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scrape
  - action: keep
    regex: (15692|9419)
    source_labels:
    - __meta_kubernetes_pod_container_port_number
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scheme
    target_label: __scheme__
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_service_annotation_prometheus_io_port
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_service_name
    target_label: kubernetes_name
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_node_name
    target_label: kubernetes_node

##############  Slow service-endpoints monitoring config (same as above, larger interval/timeout) ##############
- job_name: kubernetes-service-endpoints-slow
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scheme
    target_label: __scheme__
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_service_annotation_prometheus_io_port
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_service_name
    target_label: kubernetes_name
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_node_name
    target_label: kubernetes_node
  scrape_interval: 5m
  scrape_timeout: 30s

##############  Prometheus pushgateway monitoring config ##############
- job_name: prometheus-pushgateway
  honor_labels: true
  kubernetes_sd_configs:
  - role: service
  relabel_configs:
  - action: keep
    regex: pushgateway
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_probe

##############  Kubernetes service http_2xx probe monitoring config ##############
- job_name: kubernetes-services
  kubernetes_sd_configs:
  - role: service
  metrics_path: /probe
  params:
    module:
    - http_2xx
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_probe
  - source_labels:
    - __address__
    target_label: __param_target
  - replacement: blackbox
    target_label: __address__
  - source_labels:
    - __param_target
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: kubernetes_name

##############  Kubernetes user application pod monitoring config ##############
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scrape
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scheme
    target_label: __scheme__
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_name
    target_label: kubernetes_pod_name
  - action: drop
    regex: Pending|Succeeded|Failed
    source_labels:
    - __meta_kubernetes_pod_phase

##############  Kubernetes user application pod JVM monitoring config ##############
- job_name: kubernetes-pods-jvm
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_jvm_scrape
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scheme
    target_label: __scheme__
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_pod_annotation_prometheus_io_jvm_port
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_name
    target_label: kubernetes_pod_name
  - action: drop
    regex: Pending|Succeeded|Failed
    source_labels:
    - __meta_kubernetes_pod_phase

##############  Slow pod monitoring config (same as above, larger interval/timeout) ##############
- job_name: kubernetes-pods-slow
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scheme
    target_label: __scheme__
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_name
    target_label: kubernetes_pod_name
  - action: drop
    regex: Pending|Succeeded|Failed
    source_labels:
    - __meta_kubernetes_pod_phase
  scrape_interval: 5m
  scrape_timeout: 30s
alerting:
  alertmanagers:
  - kubernetes_sd_configs:
      - role: pod
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace]
      regex: default
      action: keep
    - source_labels: [__meta_kubernetes_pod_label_app]
      regex: prometheus
      action: keep
    - source_labels: [__meta_kubernetes_pod_label_component]
      regex: alertmanager
      action: keep
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
      regex: .*
      action: keep
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      regex: "9093"
      action: keep

As a ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server
  namespace: default
  labels:
    app: prometheus
    app.kubernetes.io/managed-by: Helm
    chart: prometheus-14.6.0
    component: server
    heritage: Helm
    release: prometheus
data:
  alerting_rules.yml: |
    {}
  alerts: |
    {}
  prometheus.yml: |
    global:
      evaluation_interval: 15s
      scrape_interval: 15s
      scrape_timeout: 10s
    rule_files:
    - /etc/config/recording_rules.yml
    - /etc/config/alerting_rules.yml
    - /etc/config/rules
    - /etc/config/alerts
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090

    ##############  Kubernetes apiserver monitoring config ##############
    - job_name: kubernetes-apiservers
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true

    ##############  Kubernetes node metrics monitoring config ##############
    - job_name: kubernetes-nodes
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - replacement: kubernetes.default.svc:443
        target_label: __address__
      - regex: (.+)
        replacement: /api/v1/nodes/$1/proxy/metrics
        source_labels:
        - __meta_kubernetes_node_name
        target_label: __metrics_path__
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true

    ##############  Kubernetes per-node pod (cAdvisor) metrics monitoring config ##############
    - job_name: kubernetes-nodes-cadvisor
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - replacement: kubernetes.default.svc:443
        target_label: __address__
      - regex: (.+)
        replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
        source_labels:
        - __meta_kubernetes_node_name
        target_label: __metrics_path__
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true

    ##############  Kubernetes service endpoints monitoring config ##############
    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_node_name
        target_label: kubernetes_node

    ##############  Kubernetes service endpoints RabbitMQ monitoring config ##############
    - job_name: kubernetes-service-rabbitmq
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: rabbitmq
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: keep
        regex: (15692|9419)
        source_labels:
        - __meta_kubernetes_pod_container_port_number
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_node_name
        target_label: kubernetes_node

    ##############  Slow service-endpoints monitoring config (same as above, larger interval/timeout) ##############
    - job_name: kubernetes-service-endpoints-slow
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_node_name
        target_label: kubernetes_node
      scrape_interval: 5m
      scrape_timeout: 30s

    ##############  Prometheus pushgateway monitoring config ##############
    - job_name: prometheus-pushgateway
      honor_labels: true
      kubernetes_sd_configs:
      - role: service
      relabel_configs:
      - action: keep
        regex: pushgateway
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe

    ##############  Kubernetes service http_2xx probe monitoring config ##############
    - job_name: kubernetes-services
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: blackbox
        target_label: __address__
      - source_labels:
        - __param_target
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    ##############  Kubernetes user application pod monitoring config ##############
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
      - action: drop
        regex: Pending|Succeeded|Failed
        source_labels:
        - __meta_kubernetes_pod_phase

    ##############  Kubernetes user application pod JVM monitoring config ##############
    - job_name: kubernetes-pods-jvm
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_jvm_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_jvm_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels: 
        - __meta_kubernetes_pod_container_name
        target_label: kubernetes_container_name
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
      - action: drop
        regex: Pending|Succeeded|Failed
        source_labels:
        - __meta_kubernetes_pod_phase

    ##############  Slow pod monitoring config (same as above, larger interval/timeout) ##############
    - job_name: kubernetes-pods-slow
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
      - action: drop
        regex: Pending|Succeeded|Failed
        source_labels:
        - __meta_kubernetes_pod_phase
      scrape_interval: 5m
      scrape_timeout: 30s
    alerting:
      alertmanagers:
      - kubernetes_sd_configs:
          - role: pod
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace]
          regex: default
          action: keep
        - source_labels: [__meta_kubernetes_pod_label_app]
          regex: prometheus
          action: keep
        - source_labels: [__meta_kubernetes_pod_label_component]
          regex: alertmanager
          action: keep
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
          regex: .*
          action: keep
        - source_labels: [__meta_kubernetes_pod_container_port_number]
          regex: "9093"
          action: keep
  recording_rules.yml: |
    {}
  rules: |
    {}

七、Alerting Rules


Common alerting rule configuration (a sketch of wiring these files into the chart follows)
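
These rule files land in the ConfigMap keys shown in section 六 (alerting_rules.yml, recording_rules.yml, alerts, rules). A hedged sketch of injecting them through Helm values; the serverFiles block is this chart's mechanism as I understand it, so verify against your chart version, and validate the files with promtool check rules:

# values.yaml fragment (assumed chart mechanism)
serverFiles:
  alerting_rules.yml:
    groups:
      - name: CpuAlertRule
        rules: []   # paste the alert rules below here
  recording_rules.yml:
    groups:
      - name: CpuRecordRules
        rules: []   # paste the recording rules below here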

  • alerts
## CPU alert rules
groups:
- name: CpuAlertRule
  rules:
  - alert: PodCPUAlert
    expr: onecore:pod > 80 or twocore:pod / 2 > 80 or squarecore:pod / 4 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "CPU usage above 80%"
      value: "{{$value}}%"
      #summary: 'CPU usage above 80%, current value {{.Value}}%, CPU usage: {{ printf `ceil(100 - ((avg by (instance)(irate(node_cpu_seconds_total{mode="idle",instance="%s"}[1m]))) *100))` $labels.instance | query | first | value }}%'
  - alert: NodeCPUAlert
    expr: round(100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m]))by(kubernetes_node)*100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "CPU usage above 80%"
      value: "{{$value}}%"
      #summary: 'CPU usage above 80%, current value {{.Value}}%, CPU usage: {{ printf `ceil(100 - ((avg by (instance)(irate(node_cpu_seconds_total{mode="idle",instance="%s"}[1m]))) *100))` $labels.instance | query | first | value }}%'

## Disk alert rules
- name: DiskAlertRule
  rules:
  - alert: PodDiskAlert
    expr: round(container_fs_usage_bytes{container=~".+",container!~"POD"}/1024/1024/1024/10*100) > 85
    for: 1m
    labels:
      severity: warning
    annotations:
      description: "Disk usage above 85%"
      value: "{{$value}}%"
  - alert: NodeDiskAlert
    expr: round((1- node_filesystem_avail_bytes{fstype=~"ext.+|nfs.+",mountpoint!~".*docker.*"}/node_filesystem_size_bytes{fstype=~"ext.+|nfs.+",mountpoint!~".*docker.*"})*100) > 85
    for: 1m
    labels:
      severity: warning
    annotations:
      description: "Disk usage above 85%"
      value: "{{$value}}%"

## Memory alert rules
- name: MemAlertRule
  rules:
  - alert: PodMemoryAlert
    # note: numerator and denominator must select the same containers (the original denominator had ".reload" instead of ".+reload")
    expr: round(container_memory_usage_bytes{container=~".+",container!~"POD|.+reload",pod!~"^csi.+"}/container_spec_memory_limit_bytes{container=~".+",container!~"POD|.+reload",pod!~"^csi.+"}*100) > 85
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "Memory usage above 85%"
      value: "{{$value}}%"
  - alert: NodeMemoryAlert
    expr: round(100-((node_memory_MemAvailable_bytes*100)/node_memory_MemTotal_bytes)) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "Memory usage above 80%"
      value: "{{$value}}%"

## Unexpected pod restarts
- name: PodRestartAlertRule
  rules:
  - alert: PodRestartAlert
    expr: delta(kube_pod_container_status_restarts_total[1m]) > 0
    for: 1s
    labels:
      severity: warning
    annotations:
      description: "Pod restarted unexpectedly"

## JVM CMS Old GC
- name: PodJvmOldGCAlertRule
  rules:
  - alert: PodJvmCMSOldGC
    expr: round((jvm_memory_pool_bytes_used{pool=~".+Old Gen"}/jvm_memory_pool_bytes_max{pool=~".+Old Gen"})*100) > 89
    for: 5s
    labels:
      severity: warning
    annotations:
      description: "Pod heap usage is high enough to trigger CMS Old GC"
      value: "{{$value}}%"

## Abnormal pod instances
- name: ContainerInstanceAlertRule
  rules:
  - alert: PodInstanceAbnormal
    expr: kube_pod_container_status_ready - kube_pod_container_status_running > 0
    for: 20s
    labels:
      severity: warning
    annotations:
      description: "Container instance abnormal"

## Pod instance OOM
- name: ContainerOOMAlertRule
  rules:
  - alert: PodInstanceOOM
    expr: kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0
    for: 1s
    labels:
      severity: warning
    annotations:
      description: "Container instance OOMKilled"

## Pod instance eviction
- name: ContainerEvictionAlertRule
  rules:
  - alert: PodInstanceEvicted
    expr: kube_pod_container_status_terminated_reason{reason="Evicted"} > 0
    for: 1s
    labels:
      severity: warning
    annotations:
      description: "Container instance evicted"

## MQ memory alerts
- name: MQMemoryAlertRule
  rules:
  - alert: MQMemoryHighWatermark
    expr: rabbitmq_node_mem_alarm{job=~".*rabbitmq.*"} == 1
    for: 1s
    labels:
      severity: warning
    annotations:
      description: "RabbitMQ memory high-watermark alarm"
      summary: RabbitMQ {{`{{ $labels.instance }}`}} High Memory Alarm is going off, which means the node hit the high-water mark and has cut off network connectivity; see the RabbitMQ web UI
  - alert: MQMemoryUsageAlert
    expr: round(avg(rabbitmq_node_mem_used{job=~".*rabbitmq.*"} / rabbitmq_node_mem_limit{job=~".*rabbitmq.*"})by(node,kubernetes_namespace)*100) > 90
    for: 10s
    labels:
      severity: warning
    annotations:
      description: "RabbitMQ memory usage alert"
      value: "{{$value}}%"
      summary: RabbitMQ {{`{{ $labels.instance }}`}} Memory Usage > 90%

## Java process failure in pods
- name: PodJavaProcessAlertRule
  rules:
  - alert: PodJavaProcessDown
    expr: sum(up{job="kubernetes-pods-jvm"})by(kubernetes_container_name,kubernetes_pod_name) == 0
    for: 10s
    labels:
      severity: warning
    annotations:
      description: "Java process in the pod is down"
      summary: "Check immediately: the Java process is not responding"

  • recording_rules: these define the onecore:pod / twocore:pod / squarecore:pod series referenced by the PodCPUAlert expression above; the long container regexes pin which workloads count as the 1-, 2- and 4-core classes

groups:
  - name: CpuRecordRules
    rules:
    - record: onecore:pod
      expr: round(sum by(pod, container, instance, namespace, name) (irate(container_cpu_usage_seconds_total{container!~"|POD|prod-xianxiang-edu-loan|prod-xy-fund|prod-common-callcenter|prod-risk-service|prod-qn-web-api|prod-xc-fund|prod-xc-user|sys-ingress|etcd|prod-qn-mp|prod-xc-common|prod-xianxiang-zuul|prod-qn-user|prod-xc-riskapi|prod-common-message|prod-common-trust-service|prod-xc-collection|kube-controller-manager|prod-qn-risk|prod-xy-zuul|metrics-server|prod-nflow-manager|kube-scheduler|prod-qn-gateway|prod-xc-pay|coredns|kube-apiserver|prod-qn-oms|prod-common-service|prod-nfsp-service|pord-ingress|prod-qn-cms|prod-internal-ingress|prod-xc-loan|prod-rabbitmq|prometheus-server"}[5m]) * 100))
    - record: twocore:pod
      expr: round(sum by(pod, container, instance, namespace, name) (irate(container_cpu_usage_seconds_total{container=~"prod-xianxiang-edu-loan|prod-xy-fund|prod-common-callcenter|prod-risk-service|prod-qn-web-api|prod-xc-fund|prod-xc-user|sys-ingress|etcd|prod-qn-mp|prod-xc-common|prod-xianxiang-zuul|prod-qn-user|prod-xc-riskapi|prod-common-message|prod-common-trust-service|prod-xc-collection|kube-controller-manager|prod-qn-risk|prod-xy-zuul|metrics-server|prod-nflow-manager|kube-scheduler|prod-qn-gateway|prod-xc-pay|coredns|kube-apiserver|prod-qn-oms|prod-common-service|prod-nfsp-service|pord-ingress|prod-qn-cms|prod-internal-ingress|prod-xc-loan"}[5m]) * 100))
    - record: squarecore:pod
      expr: round(sum by(pod, container, instance, namespace, name) (irate(container_cpu_usage_seconds_total{container=~"prod-rabbitmq|prometheus-server"}[5m]) * 100))