I. Introduction
Prometheus is a Cloud Native Computing Foundation (CNCF) project and is fully open source. It monitors systems, services, containers, databases, and more: it collects metrics from the targets configured on the client side, evaluates rule expressions against them, graphs the data, and fires alerts when thresholds are crossed.
Compared with other monitoring systems, it has the following characteristics:
- A multi-dimensional data model (time series identified by metric name and key/value label pairs)
- A flexible query language (PromQL)
- No reliance on external distributed storage; single server nodes are autonomous
- Time series collection via an HTTP pull model
- Targets discovered via service discovery or static configuration files
- Support for multiple modes of dashboarding
Time series data can also be pushed to the Prometheus server via an intermediary push gateway.
Collection modes
Because scrapes can occasionally be missed, Prometheus is not suited to cases where the collected data must be 100% accurate (for example, per-request billing). For recording and querying time series data, however, Prometheus is a strong fit.
Pull mode
Prometheus collects data with a pull model: it scrapes metrics over HTTP. Any application that can expose an HTTP endpoint can be wired into the monitoring system, which is far simpler to develop against than a proprietary or binary protocol.
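For illustration, a scrape target simply serves plain text in the Prometheus exposition format; a minimal sketch of what a /metrics response looks like (the metric name and labels here are made up):
# HELP http_requests_total Total number of HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027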
Push mode
For short-lived workloads such as cron jobs, pull mode may miss the metrics entirely: the task can finish before Prometheus gets a chance to scrape it. In that case, add an intermediate hop: the client pushes its metrics to a Push Gateway, which caches them, and Prometheus pulls the metrics from the Push Gateway. (This requires deploying a Push Gateway and adding a job that scrapes it; see the sketch below.)
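A minimal sketch of a batch job pushing one metric to the Pushgateway over its HTTP API (the service DNS name matches the chart install later in this guide; the metric and job names are made up):
echo "backup_duration_seconds 42" | curl --data-binary @- http://prometheus-pushgateway.default.svc.cluster.local:9091/metrics/job/nightly_backup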
II. Component Architecture
Prometheus components
- Prometheus Server: the time series database plus data processing, storage, querying, alerting rule evaluation, and graphing
- Alertmanager: receives and dispatches alerts; provides alert template configuration, alert delivery, and alert routing
- Pushgateway: intermediary for metric collection; caches pushed metrics
- Exporters: metric collection agents
- Client libraries: for instrumenting application code
Architecture diagram

III. Deployment
Deployment via Helm
Reference documentation:
Official guide on configuring service discovery to monitor Kubernetes:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config
Kubernetes service discovery exposes the following roles as scrape targets (Prometheus retrieves these target resources through the Kubernetes REST API and stays synchronized with cluster state); a minimal sketch follows the list:
- node
- endpoints
- service
- pod
- ingress
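A minimal sketch of a scrape job using one of these roles (the job name is arbitrary):
scrape_configs:
  - job_name: 'example-pods'
    kubernetes_sd_configs:
      - role: pod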
Installation options
- Installing Prometheus outside the cluster requires configuring the cluster's CA certificate in prometheus.yml (see the sketch after this list)
- Installing Prometheus inside the cluster requires creating RBAC policies; see the official Helm instructions for details. This guide uses the Helm install.
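For the out-of-cluster case, a minimal sketch of the extra prometheus.yml settings (the API server address and credential file paths here are assumptions for illustration):
kubernetes_sd_configs:
  - role: node
    api_server: https://10.1.0.233:6443
    tls_config:
      ca_file: /etc/prometheus/k8s-ca.crt
    bearer_token_file: /etc/prometheus/k8s-token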
There are two chart sources for the Helm install: the official helm/charts repository, which is now deprecated and redirects to Artifact Hub, and Artifact Hub itself, which hosts the community-maintained charts for CNCF projects.
Prometheus chart on artifacthub.io:
https://artifacthub.io/packages/helm/prometheus-community/prometheus
The link above explains in detail how to install Prometheus with Helm.
Why does it depend on kube-state-metrics?
Because kube-state-metrics listens to the Kubernetes API server and generates metrics about the state of cluster objects such as Deployments, ReplicaSets, and Pods, which node-level exporters do not cover.
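For example, with kube-state-metrics in place you can alert on under-replicated Deployments using its standard metrics:
kube_deployment_status_replicas_available < kube_deployment_spec_replicas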
1. Add the chart repos:
<root@PROD-K8S-CP1 ~># helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
"prometheus-community" has been added to your repositories
<root@PROD-K8S-CP1 ~># helm repo add kube-state-metrics https://kubernetes.github.io/kube-state-metrics
"kube-state-metrics" has been added to your repositories
<root@PROD-K8S-CP1 ~># helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "kube-state-metrics" chart repository
...Successfully got an update from the "cilium" chart repository
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈ Happy Helming!⎈
2. Install Prometheus
<root@PROD-K8S-CP1 ~># helm install prometheus prometheus-community/prometheus --version 14.1.0
NAME: prometheus
LAST DEPLOYED: Thu Sep 2 15:57:19 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Prometheus server can be accessed via port 80 on the following DNS name from within your cluster:
prometheus-server.default.svc.cluster.local
Get the Prometheus server URL by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $POD_NAME 9090
The Prometheus alertmanager can be accessed via port 80 on the following DNS name from within your cluster:
prometheus-alertmanager.default.svc.cluster.local
Get the Alertmanager URL by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=alertmanager" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $POD_NAME 9093
#################################################################################
###### WARNING: Pod Security Policy has been moved to a global property. #####
###### use .Values.podSecurityPolicy.enabled with pod-based #####
###### annotations #####
###### (e.g. .Values.nodeExporter.podSecurityPolicy.annotations) #####
#################################################################################
The Prometheus PushGateway can be accessed via port 9091 on the following DNS name from within your cluster:
prometheus-pushgateway.default.svc.cluster.local
Get the PushGateway URL by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=pushgateway" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $POD_NAME 9091
For more information on running Prometheus, visit:
https://prometheus.io/
3. Check the deployment status
<root@PROD-K8S-CP1 ~># kubectl get pods --all-namespaces -o wide| grep node
default prometheus-node-exporter-9rm22 1/1 Running 0 51m 10.1.17.237 prod-be-k8s-wn7 <none> <none>
default prometheus-node-exporter-qfxgz 1/1 Running 0 51m 10.1.17.238 prod-be-k8s-wn8 <none> <none>
default prometheus-node-exporter-z5znx 1/1 Running 0 51m 10.1.17.236 prod-be-k8s-wn6 <none> <none>
<root@PROD-K8S-CP1 ~># kubectl get pods --all-namespaces -o wide| grep prometheus
default prometheus-alertmanager-755d84cf4f-rfz4n 0/2 Pending 0 51m <none> <none> <none> <none>
default prometheus-kube-state-metrics-86dc6bb59f-wlpcl 1/1 Running 0 51m 172.21.12.167 prod-be-k8s-wn8 <none> <none>
default prometheus-node-exporter-9rm22 1/1 Running 0 51m 10.1.17.237 prod-be-k8s-wn7 <none> <none>
default prometheus-node-exporter-qfxgz 1/1 Running 0 51m 10.1.17.238 prod-be-k8s-wn8 <none> <none>
default prometheus-node-exporter-z5znx 1/1 Running 0 51m 10.1.17.236 prod-be-k8s-wn6 <none> <none>
default prometheus-pushgateway-745d67dd5f-7ckvv 1/1 Running 0 51m 172.21.12.2 prod-be-k8s-wn7 <none> <none>
default prometheus-server-867f854484-lcrq6 # Default deployment with persistence enabled; the default persistence does not specify a concrete PVC, so adjust the persistence settings after installation. Alternatively, disable persistence and map a local volume into the Pod instead.
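The Pending pods above are usually waiting on an unbound PersistentVolumeClaim; confirm with:
<root@PROD-K8S-CP1 ~># kubectl get pvc --namespace default
<root@PROD-K8S-CP1 ~># kubectl describe pod prometheus-alertmanager-755d84cf4f-rfz4n --namespace default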
4. Custom installation
Install with a revised values file:
helm install prometheus prometheus-community/prometheus -f prometheus-values.yaml
First dump the chart's default values to create that file:
<root@PROD-K8S-CP1 ~># helm show values prometheus-community/prometheus --version 14.1.0 > prometheus-values.yaml
Revised values (prometheus-values.yaml):
rbac:
create: true
podSecurityPolicy:
enabled: false
imagePullSecrets:
# - name: "image-pull-secret"
## Define serviceAccount names for components. Defaults to component's fully qualified name.
##
serviceAccounts:
alertmanager:
create: true
name:
annotations: {}
nodeExporter:
create: true
name:
annotations: {}
pushgateway:
create: true
name:
annotations: {}
server:
create: true
name:
annotations: {}
alertmanager:
## If false, alertmanager will not be installed
##
enabled: true
## Use a ClusterRole (and ClusterRoleBinding)
## - If set to false - we define a Role and RoleBinding in the defined namespaces ONLY
## This makes alertmanager work - for users who do not have ClusterAdmin privs, but wants alertmanager to operate on their own namespaces, instead of clusterwide.
useClusterRole: true
## Set to a rolename to use existing role - skipping role creating - but still doing serviceaccount and rolebinding to the rolename set here.
useExistingRole: false
## alertmanager container name
##
name: alertmanager
## alertmanager container image
##
image:
repository: quay.io/prometheus/alertmanager
tag: v0.21.0
pullPolicy: IfNotPresent
## alertmanager priorityClassName
##
priorityClassName: ""
## Additional alertmanager container arguments
##
extraArgs: {}
## Additional InitContainers to initialize the pod
##
extraInitContainers: []
## The URL prefix at which the container can be accessed. Useful in the case the '-web.external-url' includes a slug
## so that the various internal URLs are still able to access as they are in the default case.
## (Optional)
prefixURL: ""
## External URL which can access alertmanager
baseURL: "http://localhost:9093"
## Additional alertmanager container environment variable
## For instance to add a http_proxy
##
extraEnv: {}
## Additional alertmanager Secret mounts
# Defines additional mounts with secrets. Secrets must be manually created in the namespace.
extraSecretMounts: []
# - name: secret-files
# mountPath: /etc/secrets
# subPath: ""
# secretName: alertmanager-secret-files
# readOnly: true
## ConfigMap override where fullname is {{.Release.Name}}-{{.Values.alertmanager.configMapOverrideName}}
## Defining configMapOverrideName will cause templates/alertmanager-configmap.yaml
## to NOT generate a ConfigMap resource
##
configMapOverrideName: ""
## The name of a secret in the same kubernetes namespace which contains the Alertmanager config
## Defining configFromSecret will cause templates/alertmanager-configmap.yaml
## to NOT generate a ConfigMap resource
##
configFromSecret: ""
## The configuration file name to be loaded to alertmanager
## Must match the key within configuration loaded from ConfigMap/Secret
##
configFileName: alertmanager.yml
ingress:
## If true, alertmanager Ingress will be created
##
enabled: false
## alertmanager Ingress annotations
##
annotations: {}
# kubernetes.io/ingress.class: nginx
# kubernetes.io/tls-acme: 'true'
## alertmanager Ingress additional labels
##
extraLabels: {}
## alertmanager Ingress hostnames with optional path
## Must be provided if Ingress is enabled
##
hosts: []
# - alertmanager.domain.com
# - domain.com/alertmanager
## Extra paths to prepend to every host configuration. This is useful when working with annotation based services.
extraPaths: []
# - path: /*
# backend:
# serviceName: ssl-redirect
# servicePort: use-annotation
## alertmanager Ingress TLS configuration
## Secrets must be manually created in the namespace
##
tls: []
# - secretName: prometheus-alerts-tls
# hosts:
# - alertmanager.domain.com
## Alertmanager Deployment Strategy type
# strategy:
# type: Recreate
## Node tolerations for alertmanager scheduling to nodes with taints
## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
##
  ## Configure alertmanager taint tolerations
tolerations:
- key: resource
# operator: "Equal|Exists"
value: base
# effect: "NoSchedule|PreferNoSchedule|NoExecute(1.6 only)"
effect: NoExecute
## Node labels for alertmanager pod assignment
## Ref: https://kubernetes.io/docs/user-guide/node-selection/
##
nodeSelector:
kubernetes.io/hostname: prod-sys-k8s-wn3
## Pod affinity
##
  ## Configure alertmanager node affinity
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/resource
operator: In
values:
- base
## PodDisruptionBudget settings
## ref: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
##
podDisruptionBudget:
enabled: false
maxUnavailable: 1
## Use an alternate scheduler, e.g. "stork".
## ref: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
##
# schedulerName:
persistentVolume:
## If true, alertmanager will create/use a Persistent Volume Claim
## If false, use emptyDir
##
enabled: true
## alertmanager data Persistent Volume access modes
## Must match those of existing PV or dynamic provisioner
## Ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
##
accessModes:
- ReadWriteOnce
## alertmanager data Persistent Volume Claim annotations
##
annotations: {}
## alertmanager data Persistent Volume existing claim name
## Requires alertmanager.persistentVolume.enabled: true
## If defined, PVC must be created manually before volume will be bound
existingClaim: ""
## alertmanager data Persistent Volume mount root path
##
mountPath: /data
## alertmanager data Persistent Volume size
##
size: 20Gi
## alertmanager data Persistent Volume Storage Class
## If defined, storageClassName: <storageClass>
## If set to "-", storageClassName: "", which disables dynamic provisioning
## If undefined (the default) or set to null, no storageClassName spec is
## set, choosing the default provisioner. (gp2 on AWS, standard on
## GKE, AWS & OpenStack)
##
# storageClass: "-"
    ## Configure the alertmanager persistent storage class
storageClass: "alicloud-disk-essd"
## alertmanager data Persistent Volume Binding Mode
## If defined, volumeBindingMode: <volumeBindingMode>
## If undefined (the default) or set to null, no volumeBindingMode spec is
## set, choosing the default mode.
##
# volumeBindingMode: ""
## Subdirectory of alertmanager data Persistent Volume to mount
## Useful if the volume's root directory is not empty
##
subPath: ""
emptyDir:
## alertmanager emptyDir volume size limit
##
sizeLimit: ""
## Annotations to be added to alertmanager pods
##
podAnnotations: {}
## Tell prometheus to use a specific set of alertmanager pods
## instead of all alertmanager pods found in the same namespace
## Useful if you deploy multiple releases within the same namespace
##
## prometheus.io/probe: alertmanager-teamA
## Labels to be added to Prometheus AlertManager pods
##
podLabels: {}
## Specify if a Pod Security Policy for node-exporter must be created
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/
##
podSecurityPolicy:
annotations: {}
## Specify pod annotations
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#apparmor
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#sysctl
##
# seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
# seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
# apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'
## Use a StatefulSet if replicaCount needs to be greater than 1 (see below)
##
replicaCount: 1
## Annotations to be added to deployment
##
deploymentAnnotations: {}
statefulSet:
## If true, use a statefulset instead of a deployment for pod management.
## This allows to scale replicas to more than 1 pod
##
enabled: false
annotations: {}
labels: {}
podManagementPolicy: OrderedReady
## Alertmanager headless service to use for the statefulset
##
headless:
annotations: {}
labels: {}
## Enabling peer mesh service end points for enabling the HA alert manager
## Ref: https://github.com/prometheus/alertmanager/blob/master/README.md
enableMeshPeer: false
servicePort: 80
## alertmanager resource requests and limits
## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
##
  ## Configure alertmanager resource limits
resources:
limits:
cpu: 2
memory: 2Gi
requests:
cpu: 10m
memory: 32Mi
# Custom DNS configuration to be added to alertmanager pods
dnsConfig: {}
# nameservers:
# - 1.2.3.4
# searches:
# - ns1.svc.cluster-domain.example
# - my.dns.search.suffix
# options:
# - name: ndots
# value: "2"
# - name: edns0
  ## Configure the network mode
hostNetwork: false
## Security context to be added to alertmanager pods
##
securityContext:
runAsUser: 65534
runAsNonRoot: true
runAsGroup: 65534
fsGroup: 65534
service:
annotations: {}
labels: {}
clusterIP: ""
## Enabling peer mesh service end points for enabling the HA alert manager
## Ref: https://github.com/prometheus/alertmanager/blob/master/README.md
# enableMeshPeer : true
## List of IP addresses at which the alertmanager service is available
## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
##
    ## Configure external IPs for the alertmanager service
externalIPs:
- 10.1.0.11
loadBalancerIP: ""
loadBalancerSourceRanges: []
servicePort: 9093
# nodePort: 30000
sessionAffinity: None
type: ClusterIP
## Monitors ConfigMap changes and POSTs to a URL
## Ref: https://github.com/jimmidyson/configmap-reload
##
configmapReload:
prometheus:
## If false, the configmap-reload container will not be deployed
##
enabled: true
## configmap-reload container name
##
name: configmap-reload
## configmap-reload container image
##
image:
repository: jimmidyson/configmap-reload
tag: v0.5.0
pullPolicy: IfNotPresent
## Additional configmap-reload container arguments
##
extraArgs: {}
## Additional configmap-reload volume directories
##
extraVolumeDirs: []
## Additional configmap-reload mounts
##
extraConfigmapMounts: []
# - name: prometheus-alerts
# mountPath: /etc/alerts.d
# subPath: ""
# configMap: prometheus-alerts
# readOnly: true
## configmap-reload resource requests and limits
## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
##
resources: {}
alertmanager:
## If false, the configmap-reload container will not be deployed
##
enabled: true
## configmap-reload container name
##
name: configmap-reload
## configmap-reload container image
##
image:
repository: jimmidyson/configmap-reload
tag: v0.5.0
pullPolicy: IfNotPresent
## Additional configmap-reload container arguments
##
extraArgs: {}
## Additional configmap-reload volume directories
##
extraVolumeDirs: []
## Additional configmap-reload mounts
##
extraConfigmapMounts: []
# - name: prometheus-alerts
# mountPath: /etc/alerts.d
# subPath: ""
# configMap: prometheus-alerts
# readOnly: true
## configmap-reload resource requests and limits
## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
##
resources: {}
kubeStateMetrics:
## If false, kube-state-metrics sub-chart will not be installed
##
enabled: true
  ## Pin kube-state-metrics to a node and configure affinity
nodeSelector:
kubernetes.io/hostname: prod-sys-k8s-wn3
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/resource
operator: In
values:
- base
priorityClassName: "monitor-service"
resources:
limits:
cpu: 1
memory: 128Mi
requests:
cpu: 100m
memory: 30Mi
tolerations:
- key: resource
# operator: "Equal|Exists"
value: base
effect: NoExecute
## kube-state-metrics sub-chart configurable values
## Please see https://github.com/kubernetes/kube-state-metrics/tree/master/charts/kube-state-metrics
##
# kube-state-metrics:
nodeExporter:
## If false, node-exporter will not be installed
##
enabled: true
## If true, node-exporter pods share the host network namespace
##
hostNetwork: true
## If true, node-exporter pods share the host PID namespace
##
hostPID: true
## If true, node-exporter pods mounts host / at /host/root
##
hostRootfs: true
## node-exporter container name
##
name: node-exporter
## node-exporter container image
##
image:
repository: quay.io/prometheus/node-exporter
tag: v1.1.2
pullPolicy: IfNotPresent
## Specify if a Pod Security Policy for node-exporter must be created
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/
##
podSecurityPolicy:
annotations: {}
## Specify pod annotations
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#apparmor
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#sysctl
##
# seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
# seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
# apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'
## node-exporter priorityClassName
  ## Configure node-exporter priority
priorityClassName: "monitor-service"
## Custom Update Strategy
  ## Configure the rolling update strategy
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
## Additional node-exporter container arguments
##
extraArgs: {}
## Additional InitContainers to initialize the pod
##
extraInitContainers: []
## Additional node-exporter hostPath mounts
##
extraHostPathMounts: []
# - name: textfile-dir
# mountPath: /srv/txt_collector
# hostPath: /var/lib/node-exporter
# readOnly: true
# mountPropagation: HostToContainer
extraConfigmapMounts: []
# - name: certs-configmap
# mountPath: /prometheus
# configMap: certs-configmap
# readOnly: true
## Node tolerations for node-exporter scheduling to nodes with taints
## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
##
  ## Configure node-exporter taint tolerations
tolerations:
# - key: "key"
# operator: "Equal|Exists"
- operator: Exists
# value: "value"
# effect: "NoSchedule|PreferNoSchedule|NoExecute(1.6 only)"
## Node labels for node-exporter pod assignment
## Ref: https://kubernetes.io/docs/user-guide/node-selection/
##
nodeSelector: {}
## Annotations to be added to node-exporter pods
##
podAnnotations: {}
## Labels to be added to node-exporter pods
##
pod:
labels: {}
## PodDisruptionBudget settings
## ref: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
##
podDisruptionBudget:
enabled: false
maxUnavailable: 1
## node-exporter resource limits & requests
## Ref: https://kubernetes.io/docs/user-guide/compute-resources/
##
  ## Configure node-exporter resource limits
resources:
limits:
cpu: 1
memory: 128Mi
requests:
cpu: 100m
memory: 30Mi
# Custom DNS configuration to be added to node-exporter pods
dnsConfig: {}
# nameservers:
# - 1.2.3.4
# searches:
# - ns1.svc.cluster-domain.example
# - my.dns.search.suffix
# options:
# - name: ndots
# value: "2"
# - name: edns0
## Security context to be added to node-exporter pods
##
securityContext:
fsGroup: 65534
runAsGroup: 65534
runAsNonRoot: true
runAsUser: 65534
service:
annotations:
prometheus.io/scrape: "true"
labels: {}
# Exposed as a headless service:
# https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
clusterIP: None
## List of IP addresses at which the node-exporter service is available
## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
##
externalIPs: []
hostPort: 9100
loadBalancerIP: ""
loadBalancerSourceRanges: []
servicePort: 9100
type: ClusterIP
server:
## Prometheus server container name
##
enabled: true
## Use a ClusterRole (and ClusterRoleBinding)
## - If set to false - we define a RoleBinding in the defined namespaces ONLY
##
## NB: because we need a Role with nonResourceURL's ("/metrics") - you must get someone with Cluster-admin privileges to define this role for you, before running with this setting enabled.
## This makes prometheus work - for users who do not have ClusterAdmin privs, but wants prometheus to operate on their own namespaces, instead of clusterwide.
##
## You MUST also set namespaces to the ones you have access to and want monitored by Prometheus.
##
# useExistingClusterRoleName: nameofclusterrole
## namespaces to monitor (instead of monitoring all - clusterwide). Needed if you want to run without Cluster-admin privileges.
# namespaces:
# - yournamespace
  ## Configure the Prometheus server container name
name: server
# sidecarContainers - add more containers to prometheus server
# Key/Value where Key is the sidecar `- name: <Key>`
# Example:
# sidecarContainers:
# webserver:
# image: nginx
sidecarContainers: {}
## Prometheus server container image
##
image:
repository: quay.io/prometheus/prometheus
tag: v2.26.0
pullPolicy: IfNotPresent
## prometheus server priorityClassName
##
priorityClassName: "monitor-service"
## EnableServiceLinks indicates whether information about services should be injected
## into pod's environment variables, matching the syntax of Docker links.
## WARNING: the field is unsupported and will be skipped in K8s prior to v1.13.0.
##
enableServiceLinks: true
## The URL prefix at which the container can be accessed. Useful in the case the '-web.external-url' includes a slug
## so that the various internal URLs are still able to access as they are in the default case.
## (Optional)
prefixURL: ""
## External URL which can access prometheus
## Maybe same with Ingress host name
baseURL: ""
## Additional server container environment variables
##
## You specify this manually like you would a raw deployment manifest.
## This means you can bind in environment variables from secrets.
##
## e.g. static environment variable:
## - name: DEMO_GREETING
## value: "Hello from the environment"
##
## e.g. secret environment variable:
## - name: USERNAME
## valueFrom:
## secretKeyRef:
## name: mysecret
## key: username
env: []
extraFlags:
- web.enable-lifecycle
## web.enable-admin-api flag controls access to the administrative HTTP API which includes functionality such as
## deleting time series. This is disabled by default.
- web.enable-admin-api
##
## storage.tsdb.no-lockfile flag controls BD locking
# - storage.tsdb.no-lockfile
##
## storage.tsdb.wal-compression flag enables compression of the write-ahead log (WAL)
# - storage.tsdb.wal-compression
## Path to a configuration file on prometheus server container FS
configPath: /etc/config/prometheus.yml
### The data directory used by prometheus to set --storage.tsdb.path
### When empty server.persistentVolume.mountPath is used instead
storagePath: ""
  ## Customize the Prometheus global configuration
global:
## How frequently to scrape targets by default
##
scrape_interval: 15s
## How long until a scrape request times out
##
scrape_timeout: 10s
## How frequently to evaluate rules
##
evaluation_interval: 15s
## https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
##
remoteWrite: []
## https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_read
##
remoteRead: []
## Additional Prometheus server container arguments
##
extraArgs: {}
## Additional InitContainers to initialize the pod
##
extraInitContainers: []
## Additional Prometheus server Volume mounts
##
extraVolumeMounts: []
## Additional Prometheus server Volumes
##
extraVolumes: []
## Additional Prometheus server hostPath mounts
##
extraHostPathMounts: []
# - name: certs-dir
# mountPath: /etc/kubernetes/certs
# subPath: ""
# hostPath: /etc/kubernetes/certs
# readOnly: true
extraConfigmapMounts: []
# - name: certs-configmap
# mountPath: /prometheus
# subPath: ""
# configMap: certs-configmap
# readOnly: true
## Additional Prometheus server Secret mounts
# Defines additional mounts with secrets. Secrets must be manually created in the namespace.
extraSecretMounts: []
# - name: secret-files
# mountPath: /etc/secrets
# subPath: ""
# secretName: prom-secret-files
# readOnly: true
## ConfigMap override where fullname is {{.Release.Name}}-{{.Values.server.configMapOverrideName}}
## Defining configMapOverrideName will cause templates/server-configmap.yaml
## to NOT generate a ConfigMap resource
##
configMapOverrideName: ""
ingress:
## If true, Prometheus server Ingress will be created
##
enabled: false
## Prometheus server Ingress annotations
##
annotations: {}
# kubernetes.io/ingress.class: nginx
# kubernetes.io/tls-acme: 'true'
## Prometheus server Ingress additional labels
##
extraLabels: {}
## Prometheus server Ingress hostnames with optional path
## Must be provided if Ingress is enabled
##
hosts: []
# - prometheus.domain.com
# - domain.com/prometheus
## Extra paths to prepend to every host configuration. This is useful when working with annotation based services.
extraPaths: []
# - path: /*
# backend:
# serviceName: ssl-redirect
# servicePort: use-annotation
## Prometheus server Ingress TLS configuration
## Secrets must be manually created in the namespace
##
tls: []
# - secretName: prometheus-server-tls
# hosts:
# - prometheus.domain.com
## Server Deployment Strategy type
# strategy:
# type: Recreate
## hostAliases allows adding entries to /etc/hosts inside the containers
hostAliases: []
# - ip: "127.0.0.1"
# hostnames:
# - "example.com"
## Node tolerations for server scheduling to nodes with taints
## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
##
  ## Configure prometheus-server taint tolerations
tolerations:
- key: resource
# operator: "Equal|Exists"
value: base
# effect: "NoSchedule|PreferNoSchedule|NoExecute(1.6 only)"
effect: NoExecute
## Node labels for Prometheus server pod assignment
## Ref: https://kubernetes.io/docs/user-guide/node-selection/
##
nodeSelector:
kubernetes.io/hostname: prod-sys-k8s-wn4
## Pod affinity
##
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/resource
operator: In
values:
- base
## PodDisruptionBudget settings
## ref: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
##
podDisruptionBudget:
enabled: false
maxUnavailable: 1
## Use an alternate scheduler, e.g. "stork".
## ref: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
##
# schedulerName:
persistentVolume:
## If true, Prometheus server will create/use a Persistent Volume Claim
## If false, use emptyDir
##
enabled: true
## Prometheus server data Persistent Volume access modes
## Must match those of existing PV or dynamic provisioner
## Ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
##
accessModes:
- ReadWriteOnce
## Prometheus server data Persistent Volume annotations
##
annotations: {}
## Prometheus server data Persistent Volume existing claim name
## Requires server.persistentVolume.enabled: true
## If defined, PVC must be created manually before volume will be bound
existingClaim: ""
## Prometheus server data Persistent Volume mount root path
##
mountPath: /data
## Prometheus server data Persistent Volume size
##
size: 500Gi
## Prometheus server data Persistent Volume Storage Class
## If defined, storageClassName: <storageClass>
## If set to "-", storageClassName: "", which disables dynamic provisioning
## If undefined (the default) or set to null, no storageClassName spec is
## set, choosing the default provisioner. (gp2 on AWS, standard on
## GKE, AWS & OpenStack)
##
storageClass: "alicloud-disk-essd"
## Prometheus server data Persistent Volume Binding Mode
## If defined, volumeBindingMode: <volumeBindingMode>
## If undefined (the default) or set to null, no volumeBindingMode spec is
## set, choosing the default mode.
##
# volumeBindingMode: ""
## Subdirectory of Prometheus server data Persistent Volume to mount
## Useful if the volume's root directory is not empty
##
subPath: ""
emptyDir:
## Prometheus server emptyDir volume size limit
##
sizeLimit: ""
## Annotations to be added to Prometheus server pods
##
podAnnotations: {}
# iam.amazonaws.com/role: prometheus
## Labels to be added to Prometheus server pods
##
podLabels: {}
## Prometheus AlertManager configuration
##
alertmanagers: []
## Specify if a Pod Security Policy for node-exporter must be created
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/
##
podSecurityPolicy:
annotations: {}
## Specify pod annotations
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#apparmor
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#sysctl
##
# seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
# seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
# apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'
## Use a StatefulSet if replicaCount needs to be greater than 1 (see below)
##
replicaCount: 1
## Annotations to be added to deployment
##
deploymentAnnotations: {}
statefulSet:
## If true, use a statefulset instead of a deployment for pod management.
## This allows to scale replicas to more than 1 pod
##
enabled: false
annotations: {}
labels: {}
podManagementPolicy: OrderedReady
## Alertmanager headless service to use for the statefulset
##
headless:
annotations: {}
labels: {}
servicePort: 80
## Enable gRPC port on service to allow auto discovery with thanos-querier
gRPC:
enabled: false
servicePort: 10901
# nodePort: 10901
## Prometheus server readiness and liveness probe initial delay and timeout
## Ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
##
readinessProbeInitialDelay: 30
readinessProbePeriodSeconds: 5
readinessProbeTimeout: 4
readinessProbeFailureThreshold: 3
readinessProbeSuccessThreshold: 1
livenessProbeInitialDelay: 30
livenessProbePeriodSeconds: 15
livenessProbeTimeout: 10
livenessProbeFailureThreshold: 3
livenessProbeSuccessThreshold: 1
## Prometheus server resource requests and limits
## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
##
  ## prometheus-server resource limits
resources:
limits:
cpu: 4
memory: 30Gi
requests:
cpu: 500m
memory: 5Gi
# Required for use in managed kubernetes clusters (such as AWS EKS) with custom CNI (such as calico),
# because control-plane managed by AWS cannot communicate with pods' IP CIDR and admission webhooks are not working
##
hostNetwork: false
# When hostNetwork is enabled, you probably want to set this to ClusterFirstWithHostNet
dnsPolicy: ClusterFirst
## Vertical Pod Autoscaler config
## Ref: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
verticalAutoscaler:
## If true a VPA object will be created for the controller (either StatefulSet or Deployemnt, based on above configs)
enabled: false
# updateMode: "Auto"
# containerPolicies:
# - containerName: 'prometheus-server'
# Custom DNS configuration to be added to prometheus server pods
dnsConfig: {}
# nameservers:
# - 1.2.3.4
# searches:
# - ns1.svc.cluster-domain.example
# - my.dns.search.suffix
# options:
# - name: ndots
# value: "2"
# - name: edns0
## Security context to be added to server pods
##
securityContext:
runAsUser: 65534
runAsNonRoot: true
runAsGroup: 65534
fsGroup: 65534
service:
annotations: {}
labels: {}
clusterIP: ""
## List of IP addresses at which the Prometheus server service is available
## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
##
    ## Configure external IP access for the prometheus-server service
externalIPs:
- 10.1.0.10
loadBalancerIP: ""
loadBalancerSourceRanges: []
servicePort: 9090
sessionAffinity: None
type: ClusterIP
## Enable gRPC port on service to allow auto discovery with thanos-querier
gRPC:
enabled: false
servicePort: 10901
# nodePort: 10901
## If using a statefulSet (statefulSet.enabled=true), configure the
## service to connect to a specific replica to have a consistent view
## of the data.
statefulsetReplica:
enabled: false
replica: 0
## Prometheus server pod termination grace period
##
terminationGracePeriodSeconds: 300
## Prometheus data retention period (default if not specified is 15 days)
##
retention: "30d"
pushgateway:
## If false, pushgateway will not be installed
##
enabled: true
## Use an alternate scheduler, e.g. "stork".
## ref: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
##
# schedulerName:
## pushgateway container name
##
name: pushgateway
## pushgateway container image
##
image:
repository: prom/pushgateway
tag: v1.3.1
pullPolicy: IfNotPresent
## pushgateway priorityClassName
##
priorityClassName: "monitor-service"
## Additional pushgateway container arguments
##
## for example: persistence.file: /data/pushgateway.data
extraArgs: {}
## Additional InitContainers to initialize the pod
##
extraInitContainers: []
ingress:
## If true, pushgateway Ingress will be created
##
enabled: false
## pushgateway Ingress annotations
##
annotations: {}
# kubernetes.io/ingress.class: nginx
# kubernetes.io/tls-acme: 'true'
## pushgateway Ingress hostnames with optional path
## Must be provided if Ingress is enabled
##
hosts: []
# - pushgateway.domain.com
# - domain.com/pushgateway
## Extra paths to prepend to every host configuration. This is useful when working with annotation based services.
extraPaths: []
# - path: /*
# backend:
# serviceName: ssl-redirect
# servicePort: use-annotation
## pushgateway Ingress TLS configuration
## Secrets must be manually created in the namespace
##
tls: []
# - secretName: prometheus-alerts-tls
# hosts:
# - pushgateway.domain.com
## Node tolerations for pushgateway scheduling to nodes with taints
## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
##
  ## Configure Pushgateway taint tolerations
tolerations:
- key: resource
# operator: "Equal|Exists"
value: base
# effect: "NoSchedule|PreferNoSchedule|NoExecute(1.6 only)"
effect: NoExecute
## Node labels for pushgateway pod assignment
## Ref: https://kubernetes.io/docs/user-guide/node-selection/
##
nodeSelector: {}
  ## Configure Pushgateway node affinity
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/resource
operator: In
values:
- base
## Annotations to be added to pushgateway pods
##
podAnnotations: {}
## Labels to be added to pushgateway pods
##
podLabels: {}
## Specify if a Pod Security Policy for node-exporter must be created
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/
##
podSecurityPolicy:
annotations: {}
## Specify pod annotations
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#apparmor
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp
## Ref: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#sysctl
##
# seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
# seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
# apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'
replicaCount: 1
## Annotations to be added to deployment
##
deploymentAnnotations: {}
## PodDisruptionBudget settings
## ref: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
##
podDisruptionBudget:
enabled: false
maxUnavailable: 1
## pushgateway resource requests and limits
## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
##
  ## Configure Pushgateway resource limits
resources:
limits:
cpu: 10m
memory: 32Mi
requests:
cpu: 10m
memory: 32Mi
# Custom DNS configuration to be added to push-gateway pods
dnsConfig: {}
# nameservers:
# - 1.2.3.4
# searches:
# - ns1.svc.cluster-domain.example
# - my.dns.search.suffix
# options:
# - name: ndots
# value: "2"
# - name: edns0
## Security context to be added to push-gateway pods
##
securityContext:
runAsUser: 65534
runAsNonRoot: true
service:
annotations:
prometheus.io/probe: pushgateway
labels: {}
clusterIP: ""
## List of IP addresses at which the pushgateway service is available
## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
##
externalIPs: []
loadBalancerIP: ""
loadBalancerSourceRanges: []
servicePort: 9091
type: ClusterIP
## pushgateway Deployment Strategy type
# strategy:
# type: Recreate
persistentVolume:
## If true, pushgateway will create/use a Persistent Volume Claim
##
enabled: false
## pushgateway data Persistent Volume access modes
## Must match those of existing PV or dynamic provisioner
## Ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
##
accessModes:
- ReadWriteOnce
## pushgateway data Persistent Volume Claim annotations
##
annotations: {}
## pushgateway data Persistent Volume existing claim name
## Requires pushgateway.persistentVolume.enabled: true
## If defined, PVC must be created manually before volume will be bound
existingClaim: ""
## pushgateway data Persistent Volume mount root path
##
mountPath: /data
## pushgateway data Persistent Volume size
##
size: 2Gi
## pushgateway data Persistent Volume Storage Class
## If defined, storageClassName: <storageClass>
## If set to "-", storageClassName: "", which disables dynamic provisioning
## If undefined (the default) or set to null, no storageClassName spec is
## set, choosing the default provisioner. (gp2 on AWS, standard on
## GKE, AWS & OpenStack)
##
# storageClass: "-"
## pushgateway data Persistent Volume Binding Mode
## If defined, volumeBindingMode: <volumeBindingMode>
## If undefined (the default) or set to null, no volumeBindingMode spec is
## set, choosing the default mode.
##
# volumeBindingMode: ""
## Subdirectory of pushgateway data Persistent Volume to mount
## Useful if the volume's root directory is not empty
##
subPath: ""
## alertmanager ConfigMap entries
##
alertmanagerFiles:
alertmanager.yml:
global: {}
# slack_api_url: ''
receivers:
- name: default-receiver
# slack_configs:
# - channel: '@you'
# send_resolved: true
route:
group_wait: 10s
group_interval: 5m
receiver: default-receiver
repeat_interval: 1h
## Prometheus server ConfigMap entries
##
serverFiles:
## Alerts configuration
## Ref: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
alerting_rules.yml: {}
# groups:
# - name: Instances
# rules:
# - alert: InstanceDown
# expr: up == 0
# for: 5m
# labels:
# severity: page
# annotations:
# description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
# summary: 'Instance {{ $labels.instance }} down'
## DEPRECATED DEFAULT VALUE, unless explicitly naming your files, please use alerting_rules.yml
alerts: {}
## Records configuration
## Ref: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
recording_rules.yml: {}
## DEPRECATED DEFAULT VALUE, unless explicitly naming your files, please use recording_rules.yml
rules: {}
prometheus.yml:
rule_files:
- /etc/config/recording_rules.yml
- /etc/config/alerting_rules.yml
## Below two files are DEPRECATED will be removed from this default values file
- /etc/config/rules
- /etc/config/alerts
scrape_configs:
- job_name: prometheus
static_configs:
- targets:
- localhost:9090
# A scrape configuration for running Prometheus on a Kubernetes cluster.
# This uses separate scrape configs for cluster components (i.e. API server, node)
# and services to allow each to use different authentication configs.
#
# Kubernetes labels will be added as Prometheus labels on metrics via the
# `labelmap` relabeling action.
# Scrape config for API servers.
#
# Kubernetes exposes API servers as endpoints to the default/kubernetes
# service so this uses `endpoints` role and uses relabelling to only keep
# the endpoints associated with the default/kubernetes service using the
# default named port `https`. This works for single API server deployments as
# well as HA API server deployments.
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
# Default to scraping over https. If required, just disable this or change to
# `http`.
scheme: https
# This TLS & bearer token file config is used to connect to the actual scrape
# endpoints for cluster components. This is separate to discovery auth
# configuration because discovery & scraping are two separate concerns in
# Prometheus. The discovery auth config is automatic if Prometheus runs inside
# the cluster. Otherwise, more config options have to be provided within the
# <kubernetes_sd_config>.
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# If your node certificates are self-signed or use a different CA to the
# master CA, then disable certificate verification below. Note that
# certificate verification is an integral part of a secure infrastructure
# so this should only be disabled in a controlled environment. You can
# disable certificate verification by uncommenting the line below.
#
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
# Keep only the default/kubernetes service endpoints for the https port. This
# will add targets for each API server which Kubernetes adds an endpoint to
# the default/kubernetes service.
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
# Default to scraping over https. If required, just disable this or change to
# `http`.
scheme: https
# This TLS & bearer token file config is used to connect to the actual scrape
# endpoints for cluster components. This is separate to discovery auth
# configuration because discovery & scraping are two separate concerns in
# Prometheus. The discovery auth config is automatic if Prometheus runs inside
# the cluster. Otherwise, more config options have to be provided within the
# <kubernetes_sd_config>.
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# If your node certificates are self-signed or use a different CA to the
# master CA, then disable certificate verification below. Note that
# certificate verification is an integral part of a secure infrastructure
# so this should only be disabled in a controlled environment. You can
# disable certificate verification by uncommenting the line below.
#
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/$1/proxy/metrics
- job_name: 'kubernetes-nodes-cadvisor'
# Default to scraping over https. If required, just disable this or change to
# `http`.
scheme: https
# This TLS & bearer token file config is used to connect to the actual scrape
# endpoints for cluster components. This is separate to discovery auth
# configuration because discovery & scraping are two separate concerns in
# Prometheus. The discovery auth config is automatic if Prometheus runs inside
# the cluster. Otherwise, more config options have to be provided within the
# <kubernetes_sd_config>.
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# If your node certificates are self-signed or use a different CA to the
# master CA, then disable certificate verification below. Note that
# certificate verification is an integral part of a secure infrastructure
# so this should only be disabled in a controlled environment. You can
# disable certificate verification by uncommenting the line below.
#
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
# This configuration will work only on kubelet 1.7.3+
# As the scrape endpoints for cAdvisor have changed
# if you are using older version you need to change the replacement to
# replacement: /api/v1/nodes/$1:4194/proxy/metrics
# more info here https://github.com/coreos/prometheus-operator/issues/633
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
# Scrape config for service endpoints.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/scrape`: Only scrape services that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
# to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: If the metrics are exposed on a different port to the
# service then set this appropriately.
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
- source_labels: [__meta_kubernetes_pod_node_name]
action: replace
target_label: kubernetes_node
# Scrape config for slow service endpoints; same as above, but with a larger
# timeout and a larger interval
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/scrape-slow`: Only scrape services that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
# to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: If the metrics are exposed on a different port to the
# service then set this appropriately.
- job_name: 'kubernetes-service-endpoints-slow'
scrape_interval: 5m
scrape_timeout: 30s
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape_slow]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
- source_labels: [__meta_kubernetes_pod_node_name]
action: replace
target_label: kubernetes_node
- job_name: 'prometheus-pushgateway'
honor_labels: true
kubernetes_sd_configs:
- role: service
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: pushgateway
# Example scrape config for probing services via the Blackbox Exporter.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/probe`: Only probe services that have a value of `true`
- job_name: 'kubernetes-services'
metrics_path: /probe
params:
module: [http_2xx]
kubernetes_sd_configs:
- role: service
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: kubernetes_name
# Example scrape config for pods
#
# The relabeling allows the actual pod scrape endpoint to be configured via the
# following annotations:
#
# * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
# to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: Scrape the pod on the indicated port instead of the default of `9102`.
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
action: replace
regex: (https?)
target_label: __scheme__
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- source_labels: [__meta_kubernetes_pod_phase]
regex: Pending|Succeeded|Failed
action: drop
# Example Scrape config for pods which should be scraped slower. An useful example
# would be stackriver-exporter which queries an API on every scrape of the pod
#
# The relabeling allows the actual pod scrape endpoint to be configured via the
# following annotations:
#
# * `prometheus.io/scrape-slow`: Only scrape pods that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
# to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: Scrape the pod on the indicated port instead of the default of `9102`.
- job_name: 'kubernetes-pods-slow'
scrape_interval: 5m
scrape_timeout: 30s
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape_slow]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
action: replace
regex: (https?)
target_label: __scheme__
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- source_labels: [__meta_kubernetes_pod_phase]
regex: Pending|Succeeded|Failed
action: drop
# adds additional scrape configs to prometheus.yml
# must be a string so you have to add a | after extraScrapeConfigs:
# example adds prometheus-blackbox-exporter scrape config
extraScrapeConfigs:
# - job_name: 'prometheus-blackbox-exporter'
# metrics_path: /probe
# params:
# module: [http_2xx]
# static_configs:
# - targets:
# - https://example.com
# relabel_configs:
# - source_labels: [__address__]
# target_label: __param_target
# - source_labels: [__param_target]
# target_label: instance
# - target_label: __address__
# replacement: prometheus-blackbox-exporter:9115
# Adds option to add alert_relabel_configs to avoid duplicate alerts in alertmanager
# useful in H/A prometheus with different external labels but the same alerts
alertRelabelConfigs:
# alert_relabel_configs:
# - source_labels: [dc]
# regex: (.+)\d+
# target_label: dc
networkPolicy:
## Enable creation of NetworkPolicy resources.
##
enabled: false
# Force namespace of namespaced resources
forceNamespace: null
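After editing prometheus-values.yaml, the revised values can be applied to the running release (release and chart names as used above):
helm upgrade prometheus prometheus-community/prometheus --version 14.1.0 -f prometheus-values.yaml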
5. Access the Prometheus & Alertmanager web UIs
<root@PROD-K8S-CP1 ~># kubectl describe svc prometheus-prometheus-server
Name: prometheus-prometheus-server
Namespace: default
Labels: app=prometheus
app.kubernetes.io/managed-by=Helm
chart=prometheus-14.6.0
component=prometheus-server
heritage=Helm
release=prometheus
Annotations: meta.helm.sh/release-name: prometheus
meta.helm.sh/release-namespace: default
Selector: app=prometheus,component=prometheus-server,release=prometheus
Type: ClusterIP
IP: 10.12.0.20
External IPs: 10.1.0.10
Port: http 9090/TCP
TargetPort: 9090/TCP
Endpoints: 172.21.3.157:9090
Session Affinity: None
Events: <none>
<root@PROD-K8S-CP1 ~># kubectl describe svc prometheus-alertmanager
Name: prometheus-alertmanager
Namespace: default
Labels: app=prometheus
app.kubernetes.io/managed-by=Helm
chart=prometheus-14.6.0
component=alertmanager
heritage=Helm
release=prometheus
Annotations: meta.helm.sh/release-name: prometheus
meta.helm.sh/release-namespace: default
Selector: app=prometheus,component=alertmanager,release=prometheus
Type: ClusterIP
IP: 10.12.0.13
External IPs: 10.1.0.11
Port: http 9093/TCP
TargetPort: 9093/TCP
Endpoints: 172.21.3.251:9093
Session Affinity: None
Events: <none>
6. Access: Prometheus at http://10.1.0.10:9090 and Alertmanager at http://10.1.0.11:9093 (the external IPs and service ports configured above)
7. Correct the container time zone (can be edited directly in Lens):
- name: localtime
mountPath: /etc/localtime
# the entry above goes under the container's volumeMounts; the entries below go under the pod's volumes:
- name: localtime
hostPath:
path: /etc/localtime
type: ''
8. Be sure to change the Prometheus server's rollout strategy so that maxUnavailable is 1; otherwise the old and new server pods can run against the same volume during an update, and you will hit TSDB lock errors:
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
9. Manually inspecting the metrics exposed by a node (API server endpoint)
<root@PRE-K8S-CP1 ~># curl -ik https://10.1.0.233:6443/metrics
HTTP/1.1 403 Forbidden
Cache-Control: no-cache, private
Content-Type: application/json
X-Content-Type-Options: nosniff
Date: Wed, 02 Jun 2021 03:40:55 GMT
Content-Length: 240
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
"reason": "Forbidden",
"details": {
},
"code": 403
}
The fix is as follows. By default, Kubernetes denies unauthenticated access to cluster resources, so grant the anonymous user access (note that this binds cluster-admin to system:anonymous, which is far too broad for production; see the narrower sketch below):
kubectl create clusterrolebinding prometheus-admin --clusterrole cluster-admin --user system:anonymous
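A less permissive alternative sketch: create a dedicated ServiceAccount, grant it only the /metrics non-resource URL, and curl with its bearer token (the account and role names here are made up for illustration; kubectl create token requires Kubernetes v1.24+):
kubectl create serviceaccount metrics-reader
kubectl create clusterrole metrics-viewer --verb=get --non-resource-url=/metrics
kubectl create clusterrolebinding metrics-viewer --clusterrole=metrics-viewer --serviceaccount=default:metrics-reader
TOKEN=$(kubectl create token metrics-reader)
curl -k -H "Authorization: Bearer $TOKEN" https://10.1.0.233:6443/metrics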
IV. Prometheus Configuration File
Explanation of the relabel_configs action parameter (a minimal sketch follows this list):
- replace: the default. Match regex against the joined source_labels values and set the target label to replacement, with references to the captured groups
- keep: drop targets for which regex does not match the joined source_labels
- drop: drop targets for which regex does match the joined source_labels
- labelmap: match regex against all label names, then copy the values of the matching labels to label names given by replacement, referencing groups (${1}, ${2}, ...)
- labeldrop: remove labels whose names match regex
- labelkeep: remove labels whose names do not match regex
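A minimal sketch of a replace rule (the label names are chosen for illustration): copy a pod's app label onto the scraped series as a plain app label:
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    regex: (.+)
    target_label: app
    replacement: $1
    action: replace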
Default configuration file
A job_name generally defines what class of metrics a job collects. The job below collects:
- API server performance metrics; the configuration is explained inline:
- job_name: kubernetes-apiservers
honor_timestamps: true
scrape_interval: 1m
scrape_timeout: 10s
metrics_path: /metrics
scheme: https
authorization:
type: Bearer
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
follow_redirects: true
  relabel_configs: ## rewrite labels before scraping
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] ## the source labels whose values form the match condition
    separator: ; ## the character used to join the source label values
    regex: default;kubernetes;https ## matched against the joined source label values; e.g. a target with __meta_kubernetes_service_name="kubernetes", __meta_kubernetes_namespace="default", __meta_kubernetes_endpoint_port_name="https" matches. Each matched target is a scrape endpoint; do not confuse it with a Kubernetes Endpoints object
    replacement: $1 ## default
    action: keep ## keep only targets whose joined source labels match regex; drop the rest
kubernetes_sd_configs:
- role: endpoints
follow_redirects: true
2. Node performance metrics, job_name kubernetes-nodes
- job_name: kubernetes-nodes
honor_timestamps: true
scrape_interval: 1m
scrape_timeout: 10s
metrics_path: /metrics
scheme: https
authorization:
type: Bearer
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
follow_redirects: true
relabel_configs:
- separator: ;
regex: __meta_kubernetes_node_label_(.+)
replacement: $1
action: labelmap ## copy every node label __meta_kubernetes_node_label_(.+) onto the target as a label named by the captured ".+"
- separator: ;
regex: (.*)
target_label: __address__
replacement: kubernetes.default.svc:443
action: replace ## add the __address__ label, overwriting its value with the replacement
- source_labels: [__meta_kubernetes_node_name]
separator: ;
regex: (.+) ## matches the source label's value (in practice, the node name)
target_label: __metrics_path__
replacement: /api/v1/nodes/$1/proxy/metrics
action: replace ## $1 in the replacement is the node name captured by regex
kubernetes_sd_configs:
- role: node
follow_redirects: true
Taken together this job rewrites the default __address__: the first rule swaps the node IP for the in-cluster apiserver DNS name, and the second rewrites the default /metrics path to /api/v1/nodes/$1/proxy/metrics, so the final scrape URL is e.g. https://kubernetes.default.svc/api/v1/nodes/pre-k8s-cp3/proxy/metrics
3. Node container (cAdvisor) metrics, job_name kubernetes-nodes-cadvisor
- job_name: kubernetes-nodes-cadvisor
honor_timestamps: true
scrape_interval: 10s
scrape_timeout: 10s
metrics_path: /metrics
scheme: https
authorization:
type: Bearer
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
follow_redirects: true
relabel_configs:
- separator: ;
regex: __meta_kubernetes_node_label_(.+)
replacement: $1
action: labelmap
- separator: ;
regex: (.*)
target_label: __address__
replacement: kubernetes.default.svc:443
action: replace
- source_labels: [__meta_kubernetes_node_name]
separator: ;
regex: (.+)
target_label: __metrics_path__
# the only difference from the node performance job: the metrics URI path
replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
action: replace
kubernetes_sd_configs:
- role: node
follow_redirects: true
4. Monitoring of Service endpoints (enabled via annotations on the Service)
apiVersion: v1
kind: Service
metadata:
name: prometheus-kube-state-metrics
namespace: default
selfLink: /api/v1/namespaces/default/services/prometheus-kube-state-metrics
uid: 161fff0c-fffe-4561-90ba-0bec0608fbe4
resourceVersion: '23976299'
creationTimestamp: '2021-06-01T06:51:21Z'
labels:
app.kubernetes.io/instance: prometheus
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: kube-state-metrics
helm.sh/chart: kube-state-metrics-3.1.0
annotations:
meta.helm.sh/release-name: prometheus
meta.helm.sh/release-namespace: default
prometheus.io/scrape: 'true'
- job_name: kubernetes-service-endpoints
honor_timestamps: true
scrape_interval: 1m
scrape_timeout: 10s
metrics_path: /metrics
scheme: http
follow_redirects: true
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
separator: ;
regex: "true"
replacement: $1
action: keep ## important: this rule alone decides which endpoints the job scrapes. Keep only endpoints whose Service carries the annotation prometheus.io/scrape with value "true"; endpoints without that annotation value are dropped
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
separator: ;
regex: (https?)
target_label: __scheme__
replacement: $1
action: replace ## overwrite __scheme__ with the matched annotation value (http or https)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: $1
action: replace ## assign the value of the prometheus.io/path annotation to __metrics_path__
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
separator: ;
regex: ([^:]+)(?::\d+)?;(\d+)
target_label: __address__
replacement: $1:$2
action: replace ## rebuild __address__ as host:port from the current __address__ and the prometheus.io/port annotation, per the regex capture groups
- separator: ;
regex: __meta_kubernetes_service_label_(.+)
replacement: $1
action: labelmap ## promote every Service label to a target label named by the captured ".+"
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: kubernetes_namespace
replacement: $1
action: replace ## copy the namespace into kubernetes_namespace; the rules below work the same way
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: kubernetes_name
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_pod_node_name]
separator: ;
regex: (.*)
target_label: kubernetes_node
replacement: $1
action: replace
kubernetes_sd_configs:
- role: endpoints
follow_redirects: true
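For comparison, a minimal Service (name and port are hypothetical) that this job would discover; each annotation feeds one of the relabel rules above:
apiVersion: v1
kind: Service
metadata:
  name: demo-exporter # hypothetical
  annotations:
    prometheus.io/scrape: "true" # matched by the keep rule
    prometheus.io/path: /metrics # becomes __metrics_path__
    prometheus.io/port: "9100"   # rewrites __address__ to <endpoint-ip>:9100
spec:
  selector:
    app: demo-exporter
  ports:
  - name: metrics
    port: 9100
    targetPort: 9100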
5. Service probe monitoring via the blackbox exporter, module http_2xx
- job_name: kubernetes-services
honor_timestamps: true
params:
module:
- http_2xx ## blackbox-exporter module to run
scrape_interval: 1m
scrape_timeout: 10s
metrics_path: /probe
scheme: http
follow_redirects: true
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
separator: ;
regex: "true"
replacement: $1
action: keep ## keep Services annotated prometheus.io/probe: "true" and drop every target without that annotation
- source_labels: [__address__]
separator: ;
regex: (.*)
target_label: __param_target
replacement: $1
action: replace ## pass the original __address__ to the exporter as the probe target parameter
- separator: ;
regex: (.*)
target_label: __address__
replacement: blackbox
action: replace
- source_labels: [__param_target]
separator: ;
regex: (.*)
target_label: instance
replacement: $1
action: replace ## after the relabeling above, instance holds the original __address__ (the probed target), while __address__ now points at the blackbox exporter
- separator: ;
regex: __meta_kubernetes_service_label_(.+)
replacement: $1
action: labelmap
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: kubernetes_namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: kubernetes_name
replacement: $1
action: replace
kubernetes_sd_configs:
- role: service
follow_redirects: true
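A Service opts into this probing with the prometheus.io/probe annotation; a sketch with hypothetical names (the blackbox exporter then probes the Service address with the http_2xx module):
apiVersion: v1
kind: Service
metadata:
  name: demo-web # hypothetical
  annotations:
    prometheus.io/probe: "true" # matched by this job's keep rule
spec:
  selector:
    app: demo-web
  ports:
  - port: 80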
6. Pod monitoring configuration
- job_name: kubernetes-pods
honor_timestamps: true
scrape_interval: 1m
scrape_timeout: 10s
metrics_path: /metrics
scheme: http
follow_redirects: true
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
separator: ;
regex: "true"
replacement: $1
action: keep ## keep only pods annotated prometheus.io/scrape: "true"
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
separator: ;
regex: (https?)
target_label: __scheme__
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: $1
action: replace
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
separator: ;
regex: ([^:]+)(?::\d+)?;(\d+)
target_label: __address__
replacement: $1:$2
action: replace
- separator: ;
regex: __meta_kubernetes_pod_label_(.+)
replacement: $1
action: labelmap
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: kubernetes_namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_pod_name]
separator: ;
regex: (.*)
target_label: kubernetes_pod_name
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_pod_phase]
separator: ;
regex: Pending|Succeeded|Failed
replacement: $1
action: drop
kubernetes_sd_configs:
- role: pod
follow_redirects: true
五、JVM and RabbitMQ monitoring
- JVM monitoring (the prometheus.io/scrape annotation must be set in the Deployment's pod template), as below
spec:
replicas: 1
selector:
matchLabels:
app: pre-common-gateway
component: spring
part-of: pre
tier: backend
template:
metadata:
creationTimestamp: null
labels:
app: pre-common-gateway
component: spring
part-of: pre
tier: backend
annotations: # add the scrape annotations
prometheus.io/port: '8888'
prometheus.io/scrape: 'true'
############# JVM monitoring ##############
- job_name: kubernetes-pods-jvm
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_pod_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
target_label: kubernetes_container_name
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: kubernetes_pod_name
- action: drop
regex: Pending|Succeeded|Failed
source_labels:
- __meta_kubernetes_pod_phase
scrape_interval: 10s
scrape_timeout: 10s
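Note that the reference config in section 六 keys its JVM job off dedicated annotations rather than the generic prometheus.io/scrape, so JVM metrics can live on their own port; a sketch of the pod-template annotations it expects (port and path are illustrative; the path assumes a Spring Boot actuator/Micrometer endpoint):
annotations:
  prometheus.io/jvm-scrape: 'true' # matched by the kubernetes-pods-jvm keep rule in section 六
  prometheus.io/jvm-port: '8888'   # rewrites __address__ for the JVM job only
  prometheus.io/path: /actuator/prometheus # assumed metrics path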
2. RabbitMQ monitoring
- rabbitmq-exporter (a third-party Prometheus exporter)
https://github.com/kbudde/rabbitmq_exporter
Mainly covers RabbitMQ service-level usage: queue/channel/exchange/connection state, per-queue memory consumption, and node memory/disk usage.
- rabbitmq-prometheus (ships with RabbitMQ; enabling the rabbitmq_prometheus plugin exposes metrics on port 15692)
https://www.rabbitmq.com/prometheus.html
Mainly covers RabbitMQ infrastructure: metadata, Erlang process resource usage, memory configuration, CPU usage, open-file allocation. In short, lower-level monitoring.
The following configuration is for reference; this is the Service after creation
apiVersion: v1
kind: Service
metadata:
name: pre-rabbitmq-monitor
namespace: pre
annotations:
prometheus.io/scrape: rabbitmq
spec:
ports:
- name: rabbitmq-exporter
protocol: TCP
port: 9419
targetPort: 9419
- name: rabbitmq-prometheus-port
protocol: TCP
port: 15692
targetPort: 15692
selector:
app: pre-rabbitmq
clusterIP: 10.11.0.90
type: ClusterIP
sessionAffinity: None
############# RabbitMQ monitoring ##############
- job_name: kubernetes-service-rabbitmq
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
# the custom annotation value "rabbitmq" marks Services for this job
- action: keep
regex: rabbitmq
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
# drop the ports that should not be scraped (AMQP 5672 and the management UI 15672)
- action: drop
regex: (5672|15672)
source_labels:
- __meta_kubernetes_pod_container_port_number
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- action: labelmap
regex: __meta_kubernetes_pod_label_statefulset_kubernetes_io_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: kubernetes_node
3. Reference rabbitmq-exporter.yaml container spec (edited directly in Lens)
- name: rabbitmq-exporter
image: hub.qiangyun.com/rabbitmq-exporter
ports:
- name: mq-monitor
containerPort: 9419
protocol: TCP
env:
- name: RABBIT_USER
value: guest
- name: RABBIT_PASSWORD
value: guest
- name: RABBIT_CAPABILITIES
value: bert
resources:
limits:
cpu: 500m
memory: 1Gi
livenessProbe:
httpGet:
path: /metrics
port: 9419
scheme: HTTP
initialDelaySeconds: 60
timeoutSeconds: 15
periodSeconds: 60
successThreshold: 1
failureThreshold: 3
readinessProbe:
httpGet:
path: /metrics
port: 9419
scheme: HTTP
initialDelaySeconds: 20
timeoutSeconds: 10
periodSeconds: 60
successThreshold: 1
failureThreshold: 3
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
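The guest credentials above are inline for brevity; a sketch that sources them from a Secret instead (the Secret name and keys are illustrative):
env:
- name: RABBIT_USER
  valueFrom:
    secretKeyRef:
      name: rabbitmq-exporter-creds # hypothetical Secret
      key: username
- name: RABBIT_PASSWORD
  valueFrom:
    secretKeyRef:
      name: rabbitmq-exporter-creds
      key: password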
4. rabbitmq-monitor-service.yaml
apiVersion: v1
kind: Service
metadata:
annotations:
# note: this annotation must match the keep rule of the Prometheus rabbitmq job
prometheus.io/scrape: rabbitmq
name: rabbitmq-monitor
namespace: prod
spec:
ports:
- name: rabbitmq-exporter
port: 9419
protocol: TCP
targetPort: 9419
- name: rabbitmq-prometheus-port
port: 15692
protocol: TCP
targetPort: 15692
selector:
app: rabbitmq
type: ClusterIP
六、Reference configuration
- Full Prometheus configuration
global:
evaluation_interval: 15s
scrape_interval: 15s
scrape_timeout: 10s
rule_files:
- /etc/config/recording_rules.yml
- /etc/config/alerting_rules.yml
- /etc/config/rules
- /etc/config/alerts
scrape_configs:
- job_name: prometheus
static_configs:
- targets:
- localhost:9090
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
############## Kubernetes apiserver scrape config ##############
job_name: kubernetes-apiservers
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: default;kubernetes;https
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_service_name
- __meta_kubernetes_endpoint_port_name
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
############## Kubernetes node performance metrics scrape config ##############
job_name: kubernetes-nodes
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- replacement: kubernetes.default.svc:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$1/proxy/metrics
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
############## Kubernetes node container (cAdvisor) metrics scrape config ##############
job_name: kubernetes-nodes-cadvisor
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- replacement: kubernetes.default.svc:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
############## Kubernetes service endpoints scrape config ##############
- job_name: kubernetes-service-endpoints
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: kubernetes_node
############## Kubernetes service endpoints RabbitMQ scrape config ##############
- job_name: kubernetes-service-rabbitmq
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: rabbitmq
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
- action: keep
regex: (15692|9419)
source_labels:
- __meta_kubernetes_pod_container_port_number
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: kubernetes_node
############## Kubernetes service endpoints "slow" scrape config (targets annotated prometheus.io/scrape-slow, scraped every 5m) ##############
- job_name: kubernetes-service-endpoints-slow
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: kubernetes_node
scrape_interval: 5m
scrape_timeout: 30s
- honor_labels: true
############## Prometheus pushgateway scrape config ##############
job_name: prometheus-pushgateway
kubernetes_sd_configs:
- role: service
relabel_configs:
- action: keep
regex: pushgateway
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_probe
############## Kubernetes service http_2xx probe config ##############
- job_name: kubernetes-services
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module:
- http_2xx
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_probe
- source_labels:
- __address__
target_label: __param_target
- replacement: blackbox
target_label: __address__
- source_labels:
- __param_target
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
############## Kubernetes application pod custom scrape config ##############
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_pod_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: kubernetes_pod_name
- action: drop
regex: Pending|Succeeded|Failed
source_labels:
- __meta_kubernetes_pod_phase
############## Kubernetes application pod JVM scrape config ##############
- job_name: kubernetes-pods-jvm
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_jvm_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_pod_annotation_prometheus_io_jvm_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: kubernetes_pod_name
- action: drop
regex: Pending|Succeeded|Failed
source_labels:
- __meta_kubernetes_pod_phase
############## Kubernetes pod "slow" scrape config (pods annotated prometheus.io/scrape-slow, scraped every 5m) ##############
- job_name: kubernetes-pods-slow
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_pod_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: kubernetes_pod_name
- action: drop
regex: Pending|Succeeded|Failed
source_labels:
- __meta_kubernetes_pod_phase
scrape_interval: 5m
scrape_timeout: 30s
alerting:
alertmanagers:
- kubernetes_sd_configs:
- role: pod
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace]
regex: default
action: keep
- source_labels: [__meta_kubernetes_pod_label_app]
regex: prometheus
action: keep
- source_labels: [__meta_kubernetes_pod_label_component]
regex: alertmanager
action: keep
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
regex: .*
action: keep
- source_labels: [__meta_kubernetes_pod_container_port_number]
regex: "9093"
action: keep
The same configuration in ConfigMap form
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-server
namespace: default
labels:
app: prometheus
app.kubernetes.io/managed-by: Helm
chart: prometheus-14.6.0
component: server
heritage: Helm
release: prometheus
data:
alerting_rules.yml: |
{}
alerts: |
{}
prometheus.yml: |
global:
evaluation_interval: 15s
scrape_interval: 15s
scrape_timeout: 10s
rule_files:
- /etc/config/recording_rules.yml
- /etc/config/alerting_rules.yml
- /etc/config/rules
- /etc/config/alerts
scrape_configs:
- job_name: prometheus
static_configs:
- targets:
- localhost:9090
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
############## Kubernetes apiserver scrape config ##############
job_name: kubernetes-apiservers
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: default;kubernetes;https
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_service_name
- __meta_kubernetes_endpoint_port_name
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
############## Kubernetes node performance metrics scrape config ##############
job_name: kubernetes-nodes
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- replacement: kubernetes.default.svc:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$1/proxy/metrics
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
############## Kubernetes node container (cAdvisor) metrics scrape config ##############
job_name: kubernetes-nodes-cadvisor
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- replacement: kubernetes.default.svc:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
############## Kubernetes service endpoints scrape config ##############
- job_name: kubernetes-service-endpoints
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: kubernetes_node
############## Kubernetes service endpoints RabbitMQ scrape config ##############
- job_name: kubernetes-service-rabbitmq
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: rabbitmq
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
- action: keep
regex: (15692|9419)
source_labels:
- __meta_kubernetes_pod_container_port_number
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: kubernetes_node
############## Kubernetes service endpoints "slow" scrape config (targets annotated prometheus.io/scrape-slow, scraped every 5m) ##############
- job_name: kubernetes-service-endpoints-slow
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: kubernetes_node
scrape_interval: 5m
scrape_timeout: 30s
- honor_labels: true
############## Prometheus pushgateway scrape config ##############
job_name: prometheus-pushgateway
kubernetes_sd_configs:
- role: service
relabel_configs:
- action: keep
regex: pushgateway
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_probe
############## Kubernetes service http_2xx probe config ##############
- job_name: kubernetes-services
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module:
- http_2xx
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_probe
- source_labels:
- __address__
target_label: __param_target
- replacement: blackbox
target_label: __address__
- source_labels:
- __param_target
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
############## Kubernetes application pod custom scrape config ##############
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_pod_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: kubernetes_pod_name
- action: drop
regex: Pending|Succeeded|Failed
source_labels:
- __meta_kubernetes_pod_phase
############## Kubernetes application pod JVM scrape config ##############
- job_name: kubernetes-pods-jvm
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_jvm_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_pod_annotation_prometheus_io_jvm_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
target_label: kubernetes_container_name
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: kubernetes_pod_name
- action: drop
regex: Pending|Succeeded|Failed
source_labels:
- __meta_kubernetes_pod_phase
############## Kubernetes pod "slow" scrape config (pods annotated prometheus.io/scrape-slow, scraped every 5m) ##############
- job_name: kubernetes-pods-slow
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_pod_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: kubernetes_pod_name
- action: drop
regex: Pending|Succeeded|Failed
source_labels:
- __meta_kubernetes_pod_phase
scrape_interval: 5m
scrape_timeout: 30s
alerting:
alertmanagers:
- kubernetes_sd_configs:
- role: pod
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace]
regex: default
action: keep
- source_labels: [__meta_kubernetes_pod_label_app]
regex: prometheus
action: keep
- source_labels: [__meta_kubernetes_pod_label_component]
regex: alertmanager
action: keep
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
regex: .*
action: keep
- source_labels: [__meta_kubernetes_pod_container_port_number]
regex: "9093"
action: keep
recording_rules.yml: |
{}
rules: |
{}
七、Alerting rules
Common alerting rule configuration
- alerts
## CPU alert rules
groups:
- name: CpuAlertRule
rules:
- alert: PodCPU告警
expr: onecore:pod > 80 or twocore:pod / 2 > 80 or squarecore:pod / 4 > 80
for: 2m
labels:
severity: warning
annotations:
description: "CPU使用率大于80%"
value: "{{$value}}%"
#summary: 'CPU usage above 80%, current value {{.Value}}%, CPU usage: {{ printf `ceil(100 - ((avg by (instance)(irate(node_cpu_seconds_total{mode="idle",instance="%s"}[1m]))) *100))` $labels.instance | query | first | value }}%'
- alert: NodeCPU告警
expr: round(100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m]))by(kubernetes_node)*100) > 80
for: 2m
labels:
severity: warning
annotations:
description: "CPU使用率大于80%"
value: "{{$value}}%"
#summary: 'CPU usage above 80%, current value {{.Value}}%, CPU usage: {{ printf `ceil(100 - ((avg by (instance)(irate(node_cpu_seconds_total{mode="idle",instance="%s"}[1m]))) *100))` $labels.instance | query | first | value }}%'
## Disk alert rules
- name: DiskAlertRule
rules:
- alert: Pod磁盘告警
expr: round(container_fs_usage_bytes{container=~".+",container!~"POD"}/1024/1024/1024/10*100) > 85
for: 1m
labels:
severity: warning
annotations:
description: "磁盘使用率大于85%"
value: "{{$value}}%"
- alert: Node磁盘告警
expr: round((1- node_filesystem_avail_bytes{fstype=~"ext.+|nfs.+",mountpoint!~".*docker.*"}/node_filesystem_size_bytes{fstype=~"ext.+|nfs.+",mountpoint!~".*docker.*"})*100) > 85
for: 1m
labels:
severity: warning
annotations:
description: "磁盘使用率大于85%"
value: "{{$value}}%"
## Memory alert rules
- name: MemAlertRule
rules:
- alert: Pod内存告警
expr: round(container_memory_usage_bytes{container=~".+",container!~"POD|.+reload",pod!~"^csi.+"}/container_spec_memory_limit_bytes{container=~".+",container!~"POD|.reload",pod!~"^csi.+"}*100) > 85
for: 2m
labels:
severity: warning
annotations:
description: "内存使用率大于85%"
value: "{{$value}}%"
- alert: Node内存告警
expr: round(100-((node_memory_MemAvailable_bytes*100)/node_memory_MemTotal_bytes)) > 80
for: 2m
labels:
severity: warning
annotations:
description: "内存使用率大于85%"
value: "{{$value}}%"
## Unexpected pod restarts
- name: PodRestartAlertRule
rules:
- alert: Pod重启告警
expr: delta(kube_pod_container_status_restarts_total[1m]) > 0
for: 1s
labels:
severity: warning
annotations:
description: "Pod发生意外重启事件"
## JvmCMSOldGC
- name: PodJvmOldGCAlertRule
rules:
- alert: PodJvmCMSOldGC
expr: round((jvm_memory_pool_bytes_used{pool=~".+Old Gen"}/jvm_memory_pool_bytes_max{pool=~".+Old Gen"})*100) > 89
for: 5s
labels:
severity: warning
annotations:
description: "Pod堆内存触发CMSOldGC"
value: "{{$value}}%"
## Abnormal pod instances
- name: ContainerInstanceAlertRule
rules:
- alert: Pod实例异常
expr: kube_pod_container_status_ready - kube_pod_container_status_running > 0
for: 20s
labels:
severity: warning
annotations:
description: "Container实例异常"
## Pod instance OOM
- name: ContainerOOMAlertRule
rules:
- alert: Pod实例OOM
expr: kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0
for: 1s
labels:
severity: warning
annotations:
description: "Container实例OOM"
## Pod instance evicted
- name: ContainerEvictionAlertRule
rules:
- alert: Pod实例驱逐
expr: kube_pod_container_status_terminated_reason{reason="Evicted"} > 0
for: 1s
labels:
severity: warning
annotations:
description: "Container实例驱逐"
## MQ memory alerts
- name: MQMemoryAlertRule
rules:
- alert: MQ内存水位线
expr: rabbitmq_node_mem_alarm{job=~".*rabbitmq.*"} == 1
for: 1s
labels:
severity: warning
annotations:
description: "RabbitMQ内存高水位线告警"
summary: RabbitMQ {{`{{ $labels.instance }}`}} High Memory Alarm is going off. Which means the node hit highwater mark and has cut off network connectivity, see RabbitMQ WebUI
- alert: MQ内存使用告警
expr: round(avg(rabbitmq_node_mem_used{job=~".*rabbitmq.*"} / rabbitmq_node_mem_limit{job=~".*rabbitmq.*"})by(node,kubernetes_namespace)*100) > 90
for: 10s
labels:
severity: warning
annotations:
description: "RabbitMQ使用告警"
value: "{{$value}}%"
summary: RabbitMQ {{`{{ $labels.instance }}`}} Memory Usage > 90%
## Pod Java process failure
- name: PodJavaProcessAlertRule
rules:
- alert: PodJava进程异常
expr: sum(up{job="kubernetes-pods-jvm"})by(kubernetes_container_name,kubernetes_pod_name) == 0
for: 10s
labels:
severity: warning
annotations:
description: "PodJava进程异常"
summary: "赶快看看吧,顶不住了"
- recording_rules
groups:
- name: CpuRecordRules
rules:
- record: onecore:pod
expr: round(sum by(pod, container, instance, namespace, name) (irate(container_cpu_usage_seconds_total{container!~"|POD|prod-xianxiang-edu-loan|prod-xy-fund|prod-common-callcenter|prod-risk-service|prod-qn-web-api|prod-xc-fund|prod-xc-user|sys-ingress|etcd|prod-qn-mp|prod-xc-common|prod-xianxiang-zuul|prod-qn-user|prod-xc-riskapi|prod-common-message|prod-common-trust-service|prod-xc-collection|kube-controller-manager|prod-qn-risk|prod-xy-zuul|metrics-server|prod-nflow-manager|kube-scheduler|prod-qn-gateway|prod-xc-pay|coredns|kube-apiserver|prod-qn-oms|prod-common-service|prod-nfsp-service|pord-ingress|prod-qn-cms|prod-internal-ingress|prod-xc-loan|prod-rabbitmq|prometheus-server"}[5m]) * 100))
- record: twocore:pod
expr: round(sum by(pod, container, instance, namespace, name) (irate(container_cpu_usage_seconds_total{container=~"prod-xianxiang-edu-loan|prod-xy-fund|prod-common-callcenter|prod-risk-service|prod-qn-web-api|prod-xc-fund|prod-xc-user|sys-ingress|etcd|prod-qn-mp|prod-xc-common|prod-xianxiang-zuul|prod-qn-user|prod-xc-riskapi|prod-common-message|prod-common-trust-service|prod-xc-collection|kube-controller-manager|prod-qn-risk|prod-xy-zuul|metrics-server|prod-nflow-manager|kube-scheduler|prod-qn-gateway|prod-xc-pay|coredns|kube-apiserver|prod-qn-oms|prod-common-service|prod-nfsp-service|pord-ingress|prod-qn-cms|prod-internal-ingress|prod-xc-loan"}[5m]) * 100))
- record: squarecore:pod
expr: round(sum by(pod, container, instance, namespace, name) (irate(container_cpu_usage_seconds_total{container=~"prod-rabbitmq|prometheus-server"}[5m]) * 100))
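The recorded series onecore:pod, twocore:pod and squarecore:pod feed the PodCPU alert expr at the top of this section; dividing by 2 and 4 normalizes two-core and four-core containers onto a 0-100% scale. A more maintainable sketch (not the author's rule) normalizes by each container's CPU limit instead of hard-coding container names, assuming kube-state-metrics v2 exposes kube_pod_container_resource_limits{resource="cpu"}:
groups:
- name: CpuRecordRulesByLimit # hypothetical group
  rules:
  - record: cpu_usage_percent:pod # hypothetical rule name
    expr: >
      round(
        100 *
        sum by (namespace, pod, container) (irate(container_cpu_usage_seconds_total{container!=""}[5m]))
        /
        sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="cpu"})
      )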