# Keep Open-Source Alert Management Platform: A Complete Deployment Guide from Zero to Production
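[Free download link] keep: The open-source AIOps and alert management platform. Project URL: https://gitcode.com/GitHub_Trending/kee/keep

In the cloud-native era, alert management has become a core challenge for operations teams. Alert storms, duplicate alerts, and missing context all slow down incident response. Keep is an open-source AIOps and alert management platform that covers the whole path from a quick Docker trial to a production deployment on Kubernetes. This article walks through Keep's core features and then guides you step by step from proof of concept to production.

## Understanding Keep's Core Value: Why Choose This Open-Source AIOps Platform

Keep is an open-source alert management and automation platform designed for developers and operations teams. With AI-driven alert processing, correlation analysis, and automated workflows, it helps teams move from reactive response to proactive operations.

### Core Feature Comparison

| Feature | Pain point in traditional alert management | Keep's solution | Business value |
| --- | --- | --- | --- |
| Unified alert view | Alerts are scattered across monitoring tools | Alerts from all monitoring tools are managed in one place | Removes alert silos and improves visibility |
| Deduplication and correlation | Duplicate alerts pile up and the root cause is hard to identify | AI-based alert deduplication and correlation analysis | Less alert noise, easier root cause identification |
| Automated workflows | Manual alert response is slow | Visual workflow orchestration engine | Automated response with less manual intervention |
| Bidirectional integration | Tool integrations are complex and data is inconsistent | Deep integration with 100+ monitoring tools | A single alert handling process |
| Service topology | Fault propagation paths are unclear | Dynamic mapping of service dependencies | Fault propagation paths are located quickly |

### Technical Architecture Overview

Keep uses a microservice architecture with the following core components:

- Frontend: a web interface built on Next.js
- Backend: a FastAPI service that handles all business logic
- WebSocket Server: a real-time notification service based on Soketi
- Database: PostgreSQL, MySQL, and SQLite are supported

Figure: Keep's AI workflow assistant interface, showing the AI-driven alert handling flow.

## Environment Preparation and System Requirements

Before deploying Keep, make sure your environment meets the following requirements.

### Development/Test Environment

- Docker Engine 20.10 or later, or Docker Desktop 4.0 or later
- Docker Compose 2.0 or later
- 4GB of available memory
- 10GB of available disk space
- Linux, macOS, or Windows

### Production Environment

- A Kubernetes cluster, version 1.24 or later
- The Helm package manager, version 3.8 or later
- Persistent storage (for example NFS, Ceph, or cloud storage)
- A monitoring and log collection stack (Prometheus, Loki, etc.)
- At least 8GB of memory and 4 CPU cores

### Network and Security Considerations

| Port | Service | Purpose | Production recommendation |
| --- | --- | --- | --- |
| 3000 | Frontend | Web interface access | Expose through Ingress, configure HTTPS |
| 8080 | Backend API | API service port | Internal access, proxied through Ingress |
| 6001 | WebSocket | Real-time notification service | Expose through Ingress, configure WSS |
| 5432 | PostgreSQL | Database port | Internal access only, configure SSL encryption |

## Quick Docker Deployment: A 5-Minute Proof of Concept

For teams that want to try Keep quickly, Docker Compose is the best choice. The full deployment steps follow.

### One-Step Deployment

```bash
# Clone the project repository
git clone https://gitcode.com/GitHub_Trending/kee/keep
cd keep

# Use the official install script
curl https://raw.githubusercontent.com/keephq/keep/main/start.sh | sh
```

### Starting with a Custom Configuration

If you need a custom configuration, you can work with the docker-compose file directly:

```yaml
# docker-compose.yml: key configuration example
services:
  keep-backend:
    environment:
      # Database configuration
      DATABASE_CONNECTION_STRING: sqlite:///state/keep.db
      # JWT secret
      KEEP_JWT_SECRET: your-secure-jwt-secret-key
      # Time zone
      TZ: Asia/Shanghai
      # Enable metrics collection (booleans must be quoted in Compose)
      KEEP_METRICS: "true"
  keep-frontend:
    environment:
      # API endpoint
      NEXT_PUBLIC_API_URL: http://localhost:8080
      # WebSocket endpoint
      NEXT_PUBLIC_WS_URL: ws://localhost:6001
      # Authentication type
      AUTH_TYPE: NO_AUTH
```

### Verifying After Startup

Once the services are up, you can verify the deployment as follows:

```bash
# Show container status
docker-compose ps

# Follow the backend logs
docker-compose logs -f keep-backend

# Check service health
curl http://localhost:8080/health

# Open the web interface in a browser: http://localhost:3000
# Default username/password: keep/keep (if DB authentication is used)
```

### Enabling Authentication

For production, enabling authentication is strongly recommended. Keep supports several authentication methods:

```yaml
# docker-compose-with-auth.yml: configuration example
services:
  keep-backend:
    environment:
      - AUTH_TYPE=DB
      - KEEP_JWT_SECRET=verysecretkey
      - KEEP_DEFAULT_USERNAME=admin
      - KEEP_DEFAULT_PASSWORD=securepassword123
  keep-frontend:
    environment:
      - AUTH_TYPE=DB
      - NEXTAUTH_SECRET=verysecretkey
```

## Production Deployment: Kubernetes Best Practices

For production, deploying with Helm is strongly recommended for better maintainability and scalability.

### Helm Chart Installation

```bash
# Add the Helm repository
helm repo add keep https://keephq.github.io/helm-charts
helm repo update

# Create the namespace
kubectl create namespace keep

# Create the values file
cat > values.yaml <<'EOF'
global:
  ingress:
    enabled: true
    className: nginx
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
      - host: keep.yourdomain.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: keep-tls
        hosts:
          - keep.yourdomain.com
backend:
  replicaCount: 2
  resources:
    requests:
      memory: 512Mi
      cpu: 250m
    limits:
      memory: 2Gi
      cpu: 1000m
frontend:
  replicaCount: 2
  resources:
    requests:
      memory: 256Mi
      cpu: 100m
    limits:
      memory: 512Mi
      cpu: 500m
EOF

# Install Keep
helm install keep keep/keep -n keep -f values.yaml
```

### High-Availability Architecture

Figure: Keep's service topology view, showing dependencies between services and fault propagation paths.

Keep's high-availability architecture on Kubernetes includes the following key pieces.

Replica strategy for the core services:

```yaml
backend:
  replicaCount: 3
  podDisruptionBudget:
    minAvailable: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
frontend:
  replicaCount: 2
  podDisruptionBudget:
    minAvailable: 1
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - keep-frontend
            topologyKey: kubernetes.io/hostname
```

Database high-availability configuration:

```yaml
database:
  enabled: true
  architecture: replication
  primary:
    persistence:
      size: 50Gi
      storageClass: gp2
    resources:
      requests:
        memory: 1Gi
        cpu: 500m
  readReplicas:
    replicaCount: 2
    persistence:
      size: 20Gi
```

## Key Configuration and Customization Best Practices

### Database Selection and Tuning

Keep supports several database backends. Recommendations for choosing one:

| Database | Use case | Connection string example | Tuning advice |
| --- | --- | --- | --- |
| PostgreSQL | First choice for production | `postgresql://user:pass@host:5432/db` | Configure a connection pool, enable `pg_stat_statements` |
| MySQL | Existing MySQL environments | `mysql://user:pass@host:3306/db` | Adjust `innodb_buffer_pool_size` |
| SQLite | Development/testing | `sqlite:///data/keep.db` | Proof of concept only, not recommended for production |

PostgreSQL tuning example:

```sql
-- Create a dedicated database and user
CREATE DATABASE keep;
CREATE USER keep WITH ENCRYPTED PASSWORD 'secure_password';
GRANT ALL PRIVILEGES ON DATABASE keep TO keep;

-- Server-level parameters: these cannot be set per database,
-- and both require a PostgreSQL restart to take effect
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
ALTER SYSTEM SET max_connections = 200;

-- Per-database parameters: applied to new sessions on the keep database
ALTER DATABASE keep SET work_mem = '16MB';
ALTER DATABASE keep SET maintenance_work_mem = '64MB';
```

### Authentication and Security Configuration

JWT secret management best practices:

```bash
# Generate a secure JWT secret
openssl rand -base64 32

# Store the secrets in Kubernetes
kubectl create secret generic keep-secrets \
  --from-literal=jwt-secret="$(openssl rand -base64 32)" \
  --from-literal=nextauth-secret="$(openssl rand -base64 32)" \
  --namespace keep
```

OAuth2 integration configuration:

```yaml
backend:
  env:
    - name: AUTH_TYPE
      value: oauth2
    - name: OAUTH2_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: oauth2-secrets
          key: client-id
    - name: OAUTH2_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: oauth2-secrets
          key: client-secret
    - name: OAUTH2_AUTHORIZATION_URL
      value: https://your-auth-server.com/oauth2/authorize
    - name: OAUTH2_TOKEN_URL
      value: https://your-auth-server.com/oauth2/token
    - name: OAUTH2_USERINFO_URL
      value: https://your-auth-server.com/oauth2/userinfo
```

### Monitoring and Alerting Configuration

Integrate OpenTelemetry for end-to-end monitoring:

```yaml
backend:
  env:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: http://otel-collector:4317
    - name: OTEL_SERVICE_NAME
      value: keep-backend
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: service.namespace=keep,service.version=1.0.0
    # Kubernetes env values must be strings, so booleans are quoted
    - name: KEEP_METRICS
      value: "true"
frontend:
  env:
    - name: NEXT_PUBLIC_OTEL_ENABLED
      value: "true"
    - name: NEXT_PUBLIC_OTEL_EXPORTER
      value: otlp
    - name: NEXT_PUBLIC_OTEL_ENDPOINT
      value: http://otel-collector:4317
```

## Advanced Features and Integration in Practice

### AI-Driven Alert Correlation

Keep's AI features can correlate related alerts automatically and reduce alert noise:

```yaml
# AI correlation configuration example
ai:
  enabled: true
  provider: openai
  config:
    api_key: "{{ secrets.OPENAI_API_KEY }}"
    model: gpt-4
    temperature: 0.1
  correlation:
    enabled: true
    similarity_threshold: 0.8
    max_cluster_size: 10
    auto_resolve: true
    resolution_threshold: 0.9
```

Figure: Keep's alert correlation and topology analysis interface, showing cross-service root cause localization.

### Automated Workflow Orchestration

Complex alert handling workflows are defined in YAML:

```yaml
workflow:
  id: kubernetes-pod-restart
  name: Automatically restart failed Kubernetes Pods
  description: Automatically detect and restart Pods that keep failing
  triggers:
    - type: interval
      value: 300  # check every 5 minutes
  steps:
    - name: get_failed_pods
      provider:
        type: kubernetes
        config: "{{ providers.kubernetes }}"
        with:
          action: get_pods
          namespace: production
          label_selector: app=critical
    - name: check_pod_status
      foreach: "{{ steps.get_failed_pods.results }}"
      if: "{{ item.status.phase == 'Failed' and item.status.containerStatuses[0].restartCount > 3 }}"
      provider:
        type: kubernetes
        with:
          action: delete_pod
          name: "{{ item.metadata.name }}"
          namespace: "{{ item.metadata.namespace }}"
    - name: send_notification
      provider:
        type: slack
        config: "{{ providers.slack }}"
        with:
          # Quoted so that YAML does not treat "#alerts" as a comment
          channel: "#alerts"
          message: |
            Automatically restarted failed Pods
            The following Pods were restarted:
            {% for pod in steps.check_pod_status.results %}
            - {{ pod.metadata.name }} ({{ pod.metadata.namespace }})
            {% endfor %}
```

### Multi-Platform Monitoring Integration

Keep supports deep integration with mainstream monitoring platforms:

| Monitoring platform | Integration method | Configuration example | Key features |
| --- | --- | --- | --- |
| Prometheus | Direct pull | Alertmanager webhook | Real-time alerts, PromQL support |
| Datadog | Webhook receiver | API key configuration | Rich metrics and log integration |
| Grafana | Alertmanager integration | Webhook configuration | Visual alert dashboards |
| New Relic | Events API | Event sending configuration | APM and infrastructure monitoring |

Prometheus integration example:

```yaml
# Prometheus Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - keep-backend:8080
      scheme: http
      path_prefix: /alerts

# Prometheus provider configuration in Keep
providers:
  prometheus:
    type: prometheus
    config:
      url: http://prometheus:9090
      auth:
        type: bearer
        token: "{{ secrets.PROMETHEUS_TOKEN }}"
    alerts:
      enabled: true
      scrape_interval: 30s
      rules:
        - name: high_cpu_usage
          expr: '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High CPU usage on {{ $labels.instance }}"
```

## Operations Monitoring and Troubleshooting

### Health Check Configuration

Configure health checks for every service:

```yaml
backend:
  livenessProbe:
    httpGet:
      path: /health
      port: 8080
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
    successThreshold: 1
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 3
    failureThreshold: 3
    successThreshold: 1
  startupProbe:
    httpGet:
      path: /health
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 30
```

### Log Collection Strategy

Configure structured log collection:

```yaml
# Log format configuration
backend:
  env:
    - name: LOG_LEVEL
      value: INFO
    - name: LOG_FORMAT
      value: json
    - name: LOG_JSON_INDENT
      value: "0"
    - name: LOG_AUTH_PAYLOAD
      value: "false"

# Fluent Bit sidecar configuration
logging:
  enabled: true
  fluentbit:
    enabled: true
    image:
      repository: fluent/fluent-bit
      tag: "2.2"
    config:
      inputs: |
        [INPUT]
            Name              tail
            Path              /var/log/containers/*keep*.log
            Parser            docker
            Tag               kube.*
            Mem_Buf_Limit     5MB
            Skip_Long_Lines   On
      outputs: |
        [OUTPUT]
            Name    loki
            Match   *
            Host    loki.monitoring.svc.cluster.local
            Port    3100
            Labels  app=keep
```

### Common Troubleshooting Guide

Database connection problems:

```bash
# Check the database connection
kubectl exec -it deploy/keep-backend -n keep -- \
  python -c 'import psycopg2; psycopg2.connect("postgresql://keep:password@keep-postgresql:5432/keep")'

# Check the database logs
kubectl logs -f statefulset/keep-postgresql -n keep

# Check the database migration state
kubectl exec -it deploy/keep-backend -n keep -- \
  alembic current
```

WebSocket connection failures:

```bash
# Test the WebSocket connection (the port-forward runs in the background)
kubectl port-forward svc/keep-websocket 6001:6001 -n keep &
wscat -c ws://localhost:6001

# Check the WebSocket service logs
kubectl logs -f deploy/keep-websocket-server -n keep

# Check network policies
kubectl get networkpolicy -n keep
```

Frontend not reachable:

```bash
# Check the Ingress configuration
kubectl get ingress -n keep

# Check the frontend service
kubectl get svc keep-frontend -n keep

# Check the frontend Pod logs
kubectl logs -f deploy/keep-frontend -n keep
```

## Performance Optimization and Scaling

### Horizontal Scaling

Adjust the number of replicas dynamically according to load:

```yaml
backend:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
          - type: Percent
            value: 10
            periodSeconds: 60
      scaleUp:
        stabilizationWindowSeconds: 60
        policies:
          - type: Percent
            value: 100
            periodSeconds: 60
frontend:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 5
    targetCPUUtilizationPercentage: 60
```

### Caching and Performance

```yaml
# Redis cache configuration
backend:
  env:
    - name: REDIS_URL
      value: redis://keep-redis:6379/0
    - name: CACHE_TTL
      value: "300"  # 5-minute cache
    - name: ALERT_CACHE_SIZE
      value: "10000"  # alert cache size
    - name: WORKFLOW_CACHE_SIZE
      value: "1000"  # workflow cache size

# Database connection pool configuration
database:
  connectionPool:
    maxSize: 20
    minSize: 5
    idleTimeout: 300000  # 5 minutes
    maxLifetime: 1800000  # 30 minutes
```

### Data Retention

```yaml
# Alert data retention policy
retention:
  alerts:
    enabled: true
    days: 90  # keep for 90 days
    archive: true  # archive old data
    archivePath: /data/archive
  incidents:
    enabled: true
    days: 365  # keep for 1 year
  workflowExecutions:
    enabled: true
    days: 30  # keep for 30 days
    maxRows: 1000000  # maximum row count
```

## Security Hardening and Compliance

### Network Policies

```yaml
# Network policy configuration
networkPolicy:
  enabled: true
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - port: 8080
          protocol: TCP
    - from:
        - ipBlock:
            cidr: 10.0.0.0/8
      ports:
        - port: 3000
          protocol: TCP
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - port: 443
          protocol: TCP
        - port: 80
          protocol: TCP
```

### Data Encryption

```yaml
# Encryption in transit
backend:
  env:
    - name: ENCRYPTION_KEY
      valueFrom:
        secretKeyRef:
          name: keep-encryption
          key: encryption-key
    - name: TLS_ENABLED
      value: "true"
    - name: TLS_CERT_PATH
      value: /etc/ssl/certs/tls.crt
    - name: TLS_KEY_PATH
      value: /etc/ssl/private/tls.key

# Database encryption
database:
  encryption:
    enabled: true
    key: "{{ secrets.DB_ENCRYPTION_KEY }}"
  ssl:
    enabled: true
    caCert: "{{ secrets.DB_CA_CERT }}"
    clientCert: "{{ secrets.DB_CLIENT_CERT }}"
    clientKey: "{{ secrets.DB_CLIENT_KEY }}"
```

## Summary and Next Steps

This guide has covered the whole process of deploying Keep, from a quick Docker trial to a production environment on Kubernetes. As an open-source alert management platform, Keep offers AIOps capabilities and flexible deployment options.

### Deployment Path Summary

1. Proof of concept: start quickly with Docker Compose and verify basic functionality
2. Development environment: configure persistent storage and basic integrations
3. Pre-production: deploy to Kubernetes and configure monitoring and backups
4. Production: implement high availability, security hardening, and performance optimization

Figure: Keep's alert summary dashboard, showing centralized alert management and multi-dimensional filtering.

### Follow-Up Optimization Suggestions

Short term (1-2 weeks):

- Configure alert notification channels (Slack, Teams, email, etc.)
- Set up basic workflow automation rules
- Integrate existing monitoring tools (Prometheus, Datadog, etc.)
- Configure basic alert routing and assignment rules

Medium term (1-3 months):

- Roll out AI-driven alert correlation and deduplication
- Build the service topology map and dependency relationships
- Configure complex workflow rules and automated responses
- Set up alert escalation and on-call management policies

Long term (3-6 months):

- Enable cross-team alert collaboration and knowledge sharing
- Build an alert knowledge base and best-practice documentation
- Improve alert response SLA and MTTR metrics
- Implement predictive alerting and capacity planning

### Resources and Support

- Official documentation: docs/deployment/configuration.mdx
- Example configurations: the examples/workflows/ directory
- Integration plugins: the keep/providers/ directory
- Community support: technical support through GitHub Issues
- Enterprise support: consider commercial support options for professional guidance

By following the practices in this guide, you can build a stable, efficient, and scalable alert management platform and improve your team's operational efficiency and response capability. Because Keep is open source, it is transparent and customizable, which makes it a good fit for modern cloud-native environments.

Author's statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.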
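The post-startup verification described in this guide (a request to the backend's `/health` endpoint on port 8080 and a visit to the web interface on port 3000) can also be scripted. The following is a minimal sketch, not part of Keep itself: the function names `http_status` and `check_endpoints` and the `ENDPOINTS` mapping are hypothetical helpers introduced here, and the URLs assume the default Docker Compose ports mentioned in this guide. It uses only the Python standard library.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Assumed defaults from the Docker Compose setup in this guide;
# adjust the URLs for your own environment.
ENDPOINTS: Dict[str, str] = {
    "backend": "http://localhost:8080/health",
    "frontend": "http://localhost:3000",
}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, or malformed URL.
        return 0


def check_endpoints(
    endpoints: Dict[str, str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each service name to True when its endpoint returns a 2xx status."""
    return {name: 200 <= fetch(url) < 300 for name, url in endpoints.items()}


def report(results: Dict[str, bool]) -> str:
    """Format the results as one 'name: OK/FAILED' line per service."""
    return "\n".join(
        f"{name}: {'OK' if ok else 'FAILED'}" for name, ok in results.items()
    )
```

Calling `print(report(check_endpoints(ENDPOINTS)))` after `docker-compose up` prints one line per service. The `fetch` parameter is injectable so the logic can be exercised without a running deployment, for example with a stub that returns fixed status codes.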