Step-by-Step Tutorial: Deploying Prometheus node-exporter as a DaemonSet in a K8s Cluster (Complete YAML Included)
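Kubernetes in Practice: The Ultimate Guide to Deploying Prometheus node-exporter with a DaemonSet

As a Kubernetes cluster grows, collecting node-level monitoring data becomes a top priority for operations work. Imagine being woken by an alert at 3 a.m. and being unable to tell quickly which node's disk is about to fill up; every operations engineer knows that pain. This article walks step by step through a dependable node-exporter deployment, provides a YAML configuration that has been validated in production, and covers practical details that the official documentation leaves out.

1. Environment Preparation and How It Works

Before deploying, a few key concepts need to be clear. node-exporter is one of the most basic components in the Prometheus ecosystem. It runs on every node as a DaemonSet and collects host-level metrics. Unlike the more common Deployment, a DaemonSet ensures that every node in the cluster, including nodes added later, automatically runs one replica.

Prerequisite checks:

```bash
# Check that kubectl is configured correctly
kubectl cluster-info
# Verify node status
kubectl get nodes -o wide
```

Permission issues that need particular attention:

- The exporter needs access to the host's /proc and /sys filesystems
- It may need to run in privileged mode
- Some cloud platforms require additional RBAC configuration

The table below compares the trade-offs of different deployment methods:

| Deployment method | Resource usage | Isolation | Maintenance difficulty | Use case |
|---|---|---|---|---|
| DaemonSet | Low | Weak | Easy | Cluster node monitoring |
| Sidecar | Medium | Strong | Medium | Monitoring specific Pods |
| Standalone host deployment | High | Strong | Hard | Traditional server monitoring |

2. Complete YAML Configuration Explained

Below is the DaemonSet configuration we have validated in several production environments, with explanatory comments:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/version: v1.3.1
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
      annotations:
        # Annotation values must be strings, so they are quoted
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostNetwork: true
      hostPID: true
      hostIPC: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.3.1
          args:
            # Point the collectors at the host filesystems mounted below
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            # Added so that the /rootfs mount is actually used by the exporter
            - --path.rootfs=/rootfs
            - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
            - --collector.netclass.ignored-devices=^(veth|docker|cali|flannel).*$
          ports:
            - containerPort: 9100
              protocol: TCP
          resources:
            limits:
              cpu: 200m
              memory: 100Mi
            requests:
              cpu: 100m
              memory: 50Mi
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /rootfs
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
      tolerations:
        # Allow scheduling onto tainted master nodes
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        # Newer Kubernetes versions taint these nodes with the control-plane key instead
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
```

Key settings explained:

- `hostNetwork: true` - uses the host network directly, avoiding extra network overhead
- `ignored-mount-points` - ignores mount points that belong to containers, avoiding double counting
- `ignored-devices` - filters out virtual network devices, reducing noisy metrics
- Resource limits - keep the monitoring component itself from consuming too many resources

3. Deployment and Verification

Before running the deployment, it is a good idea to create a dedicated monitoring namespace:

```bash
kubectl create namespace monitoring
```

Apply the DaemonSet configuration:

```bash
kubectl apply -f node-exporter-daemonset.yaml
```

Several ways to verify the deployment:

```bash
# Check Pod status
kubectl get pods -n monitoring -l app.kubernetes.io/name=node-exporter -o wide
# Check the logs to confirm there are no errors
kubectl logs -n monitoring <pod-name>
# Temporary port-forward for testing
kubectl port-forward -n monitoring <pod-name> 9100:9100
# In another terminal
curl http://localhost:9100/metrics
```

Common troubleshooting table:

| Symptom | Likely cause | Fix |
|---|---|---|
| Pod is in CrashLoopBackOff | Insufficient permissions | Add `securityContext.privileged` |
| Metrics endpoint returns 403 | Network policy restriction | Check the NetworkPolicy configuration |
| Disk metrics are missing | Incorrect mount path | Verify the `volumeMounts` configuration |
| No Pod runs on master nodes | Missing toleration | Add a toleration for master nodes |

4. Advanced Configuration and Tuning Tips

For production environments, the following advanced settings are also worth considering.

Metric collection tuning (disable the default collectors and enable only the ones you need):

```yaml
args:
  - --collector.disable-defaults
  - --collector.cpu
  - --collector.meminfo
  - --collector.diskstats
  - --collector.netdev
  - --collector.filesystem
  - --collector.loadavg
```

Security hardening (container-level securityContext):

```yaml
securityContext:
  runAsUser: 65534
  runAsGroup: 65534
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
    add:
      - CHOWN
      - DAC_OVERRIDE
      - SETGID
      - SETUID
```

Suggested metrics for a resource monitoring dashboard:

- CPU usage: `1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))`
- Memory pressure (fraction of memory still available; lower means more pressure): `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes`
- Disk warning (free space predicted to run out within 4 hours): `predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0`

5. Integrating with Prometheus

Once the deployment is complete, a scrape configuration has to be added to Prometheus. Below is a matching ServiceMonitor example. Note that a ServiceMonitor selects Services rather than Pods, so a Service that carries the `app.kubernetes.io/name: node-exporter` label and exposes a port named `http-metrics` must exist in the `monitoring` namespace.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    release: prometheus-operator
spec:
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  namespaceSelector:
    matchNames:
      - monitoring
```

If you are not using the Operator, you can add the following directly to prometheus.yml:

```yaml
scrape_configs:
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only Pods labeled as node-exporter
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: node-exporter
      # Rewrite the target address to use port 9100
      - source_labels: [__address__]
        action: replace
        regex: '([^:]+)(?::\d+)?'
        replacement: '${1}:9100'
        target_label: __address__
```

6. Real-World Case: Troubleshooting a Performance Problem

We once ran into a typical case in which node-exporter scrapes in one cluster timed out intermittently. The following steps eventually pinned down the problem.

First, check the resource usage of the Pods:

```bash
kubectl top pods -n monitoring
```

Memory usage turned out to be close to the limit, so we raised the resource limits:

```yaml
resources:
  limits:
    memory: 200Mi
  requests:
    memory: 100Mi
```

Further analysis showed that the filesystem collector was taking too long, so we tuned the collection configuration to skip virtual and pseudo filesystem types:

```yaml
args:
  - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
```

The lesson from this case is that the default configuration does not fit every scenario and has to be adjusted to the actual environment.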
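
The ServiceMonitor in section 5 discovers targets through a Service, while the manifests in this article define only a DaemonSet. The following is a minimal sketch of a Service that could fill that gap. It assumes the `monitoring` namespace, the `app.kubernetes.io/name: node-exporter` label, and the `http-metrics` port name used in that ServiceMonitor. Making it headless (`clusterIP: None`) is an illustrative choice, not something taken from the original setup.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    # Must match the ServiceMonitor's spec.selector.matchLabels
    app.kubernetes.io/name: node-exporter
spec:
  # Assumption: headless, since Prometheus scrapes each endpoint directly
  clusterIP: None
  selector:
    # Must match the Pod labels set in the DaemonSet template
    app.kubernetes.io/name: node-exporter
  ports:
    # The port name is what the ServiceMonitor's endpoints[].port refers to
    - name: http-metrics
      port: 9100
      targetPort: 9100
      protocol: TCP
```

After applying it, `kubectl get endpoints -n monitoring node-exporter` should list one address for each node that runs the exporter. Whether Prometheus picks up the ServiceMonitor still depends on the `release: prometheus-operator` label matching the selector configured on your Prometheus instance.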