## 1. Project Overview: Building an MLOps Platform on a Bare-Metal RKE2 Kubernetes Cluster

In today's data-driven business environment, industrialized deployment of machine learning models has become a key component of enterprise competitiveness. This article records, step by step, how we built a complete MLOps platform on bare-metal servers using Rancher's RKE2 Kubernetes distribution. This architecture is particularly well suited to scenarios with strict data-sovereignty requirements and a need to maximize hardware utilization, such as financial risk-control systems and industrial quality-inspection platforms: latency-sensitive applications that process large volumes of unstructured data.

## 2. Environment Planning and Cluster Deployment

### 2.1 Recommended Hardware Configuration

We used three Dell PowerEdge R740xd servers to form a highly available cluster. Each node is configured with:

- Dual Intel Xeon Gold 6248R (48 cores / 96 threads total)
- 384 GB DDR4 ECC memory
- 4× NVIDIA T4 GPUs (16 GB VRAM per card)
- 2× 1.6 TB NVMe SSD (RAID 1 system disks)
- 8× 4 TB HDD (Ceph storage pool)

Key tip: GPU nodes additionally need the NVIDIA container runtime and the corresponding device plugin. We recommend pre-installing driver version 470.82.01 on all GPU nodes before initializing the cluster.

### 2.2 RKE2 Cluster Initialization

Create the configuration file `/etc/rancher/rke2/config.yaml`:

```yaml
token: <your-secure-token>
tls-san:
  - mlops-cluster.example.com
node-taint:
  - "nvidia.com/gpu=true:NoSchedule"
kubelet-arg:
  - "max-pods=250"
cni: cilium
disable:
  - rke2-ingress-nginx
```

Bootstrap the cluster:

```bash
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION=v1.25.9+rke2r1 sh -
systemctl enable rke2-server.service
systemctl start rke2-server.service
```

## 3. Deploying the Core MLOps Components

### 3.1 Machine Learning Workflow Engine

We use Argo Workflows as the orchestration engine, deployed via Helm:

```bash
helm repo add argo https://argoproj.github.io/argo-helm
helm install argo-workflows argo/argo-workflows \
  --namespace mlops \
  --set server.service.type=LoadBalancer \
  --set executor.imagePullPolicy=IfNotPresent \
  --set singleNamespace=false
```

Key tuning parameters:

- `workflow.controller.workflowTTLSeconds`: how long completed workflows are retained
- `executor.resources.requests.cpu`: CPU reservation for each task container
- `artifactRepository.archiveLogs`: enable archiving of logs to S3

### 3.2 Model Versioning and Registry

Deploy the MLflow tracking server:

```yaml
# mlflow-values.yaml
backendStore:
  postgresql:
    enabled: true
    postgresqlDatabase: mlflow
    postgresqlUsername: mlflowuser
artifactRoot: s3://mlflow-artifacts-prod
ingress:
  enabled: true
  hosts:
    - host: mlflow.example.com
```
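With the tracking server from section 3.2 in place, training jobs can record runs against it. The sketch below is a minimal client-side example; the tracking URI matches the ingress host above, while the experiment name and the deferred `import mlflow` (so the module loads even where the client library is absent) are assumptions of this illustration, not part of the deployment:

```python
def artifact_uri(bucket: str, experiment: str, run_id: str) -> str:
    """Build the S3 path a run's artifacts land under (illustrative layout)."""
    return f"s3://{bucket}/{experiment}/{run_id}/artifacts"

def log_training_run(params: dict, metrics: dict,
                     tracking_uri: str = "http://mlflow.example.com") -> None:
    """Record one training run against the MLflow server from section 3.2."""
    import mlflow  # deferred so this module imports even without mlflow installed

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment("fraud-detection")  # hypothetical experiment name
    with mlflow.start_run():
        mlflow.log_params(params)            # e.g. learning rate, batch size
        for name, value in metrics.items():
            mlflow.log_metric(name, value)   # e.g. validation accuracy
```

A training pod would call `log_training_run({"lr": 1e-3}, {"val_acc": 0.94})` at the end of each run; artifacts then land under the `s3://mlflow-artifacts-prod` root configured in `mlflow-values.yaml`.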
## 4. Performance Optimization and Monitoring

### 4.1 GPU Resource Scheduling Strategy

Create a priority class for GPU workloads:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-critical
value: 1000000
description: For GPU-accelerated workloads
```

Use Node Feature Discovery to label GPU nodes automatically:

```bash
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v0.11.2/deployment/overlays/default/kustomization.yaml
```

### 4.2 Monitoring and Alerting

Key metrics for the Grafana dashboard:

- GPU utilization (provided by the DCGM exporter)
- Model inference latency (Prometheus histogram)
- Batch job queue depth
- Storage IOPS / throughput

Example alert rule:

```yaml
- alert: HighGPUThrottling
  expr: avg(dcgm_gpu_throttle_reasons_sw_thermal_slowdown) by (pod) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: GPU thermal throttling detected on {{ $labels.pod }}
```

## 5. Security Hardening

### 5.1 Network Policies

Restrict network access from the training namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-only-model-registry
spec:
  podSelector:
    matchLabels:
      role: training-job
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: mlflow-server
      ports:
        - protocol: TCP
          port: 5000
```

### 5.2 Image Signature Verification

Configure the admission controller:

```bash
cosign generate-key-pair
kubectl create secret generic cosign-pub-key --from-file=cosign.pub
```

Deploy the validating webhook:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
  - name: validator.chainguard.dev
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
```

## 6. Continuous Delivery Pipeline Design

### 6.1 GitOps Workflow Architecture

```mermaid
graph LR
  A[Commit] --> B[Code scan]
  B --> C[Container build]
  C --> D[Model training]
  D --> E[Performance test]
  E --> F[Automatic deploy]
```

### 6.2 Tekton Pipeline Example

Define the model training pipeline:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: model-training-pipeline
spec:
  workspaces:
    - name: shared-data
  tasks:
    - name: fetch-dataset
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-data
```
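The transition from the performance-test stage to automatic deployment in the 6.1 flow needs a gate that fails the pipeline when a candidate model regresses. A minimal sketch of such a gate, where the metric names, SLO thresholds, and allowed-regression margin are all illustrative assumptions rather than values from our deployment:

```python
# Hypothetical quality gate between the performance-test and deploy stages:
# check the candidate's metrics against hard SLOs and against the current
# production baseline. A CI wrapper would exit nonzero if any reason is returned.

THRESHOLDS = {"p99_latency_ms": 150.0, "accuracy": 0.92}  # assumed SLOs

def gate(candidate: dict, baseline: dict, max_regression: float = 0.01) -> list:
    """Return a list of failure reasons; an empty list means promote."""
    failures = []
    if candidate["p99_latency_ms"] > THRESHOLDS["p99_latency_ms"]:
        failures.append("p99 latency above SLO")
    if candidate["accuracy"] < THRESHOLDS["accuracy"]:
        failures.append("accuracy below SLO")
    if candidate["accuracy"] < baseline["accuracy"] - max_regression:
        failures.append("accuracy regressed vs. production baseline")
    return failures
```

Running this as a Tekton task between the test and deploy steps keeps promotion decisions versioned alongside the pipeline instead of living in someone's head.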
## 7. Troubleshooting Guide

### 7.1 GPU Resource Allocation Failures

Common symptoms:

- Pod stuck in `Pending`
- `Insufficient nvidia.com/gpu` in the event log

Troubleshooting steps:

1. Check node resource capacity:

```bash
kubectl describe node <node-name> | grep -A 10 Allocatable
```

2. Verify that the device plugin is running:

```bash
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
```

### 7.2 Model Serving Cold-Start Latency

Mitigations:

- Use Knative Serving autoscaling
- Preload the model into a warm cache:

```python
import threading

from tensorflow import keras

def preload_model():
    keras.models.load_model("/models/production/1")

threading.Thread(target=preload_model).start()
```

## 8. Cost Optimization

### 8.1 Elastic Scaling

Cluster autoscaler (Karpenter) provisioner:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-worker
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g4dn.2xlarge", "p3.2xlarge"]
  consolidation:
    enabled: true
  ttlSecondsAfterEmpty: 300
```

### 8.2 Spot Instances for Training Jobs

Argo Workflows template fragment:

```yaml
template:
  retryStrategy:
    limit: 3
    retryPolicy: Always
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: k8s.amazonaws.com/spot
                operator: In
                values: ["true"]
```

In real deployments we found that RKE2's automatic certificate rotation has compatibility issues with parts of the ML toolchain. When deploying Istio, we recommend explicitly configuring the certificate lifetime:

```yaml
meshConfig:
  defaultConfig:
    proxyMetadata:
      SECRET_TTL: 720h
```
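For the spot-instance retries in 8.2 (`retryStrategy` with `limit: 3`) to actually save money, a preempted training job must resume from a checkpoint instead of restarting from epoch 0. A framework-agnostic sketch of that resume logic; the checkpoint path (assumed to sit on a PVC that survives pod preemption) and the JSON checkpoint format are assumptions of this example:

```python
import json
import os

CKPT = "/mnt/shared/checkpoint.json"  # assumed path on a preemption-surviving PVC

def load_checkpoint(path: str = CKPT) -> int:
    """Return the epoch to resume from (0 on a fresh start)."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["next_epoch"]
    return 0

def save_checkpoint(next_epoch: int, path: str = CKPT) -> None:
    """Record progress atomically so a retried pod never sees a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_epoch": next_epoch}, f)
    os.replace(tmp, path)  # atomic rename

def train(total_epochs: int = 10, path: str = CKPT) -> list:
    """Run (or resume) training; returns the epochs executed by this attempt."""
    ran = []
    for epoch in range(load_checkpoint(path), total_epochs):
        # ... one epoch of real training would go here ...
        ran.append(epoch)
        save_checkpoint(epoch + 1, path)
    return ran
```

With this in place, the Argo retry after a spot reclaim only pays for the epochs that had not yet completed, which is what makes `retryPolicy: Always` on preemptible capacity economical.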