Kubernetes AI Cluster Setup: Production-Ready Infrastructure for Machine Learning Workloads

Kubernetes has become the de facto standard for orchestrating AI and machine learning workloads at scale. This comprehensive guide covers everything needed to build, configure, and manage production-ready Kubernetes clusters specifically optimized for AI applications, from GPU scheduling to model serving and auto-scaling.

AI-Optimized Cluster Architecture

Cluster Design Principles

Designing Kubernetes clusters for AI workloads requires specific considerations:

  • Heterogeneous Node Types: mixing CPU-only and GPU-enabled nodes
  • Resource Isolation: preventing interference between training and inference workloads
  • Storage Strategy: handling large datasets and model artifacts efficiently
  • Network Optimization: minimizing latency for real-time inference
  • Scalability Planning: accommodating varying computational demands

Node Pool Configuration

Structuring node pools for different AI workload types. The manifests below are illustrative of node shape and labeling; in practice, capacity is reported by the kubelet rather than declared, and labels and taints are set on the node group or applied with kubectl, as shown after the manifests:

# GPU Training Node Pool
apiVersion: v1
kind: Node
metadata:
  name: gpu-training-node
  labels:
    node-type: gpu-training
    accelerator: nvidia-v100
    workload: training
spec:
  capacity:
    cpu: "32"
    memory: "128Gi"
    nvidia.com/gpu: "4"
    ephemeral-storage: "1Ti"
  taints:
  - key: nvidia.com/gpu
    value: "training"
    effect: NoSchedule

---
# GPU Inference Node Pool
apiVersion: v1
kind: Node
metadata:
  name: gpu-inference-node
  labels:
    node-type: gpu-inference
    accelerator: nvidia-t4
    workload: inference
spec:
  capacity:
    cpu: "16"
    memory: "64Gi"
    nvidia.com/gpu: "2"
    ephemeral-storage: "500Gi"
  taints:
  - key: nvidia.com/gpu
    value: "inference"
    effect: NoSchedule

---
# CPU-Only Node Pool
apiVersion: v1
kind: Node
metadata:
  name: cpu-node
  labels:
    node-type: cpu-only
    workload: general
spec:
  capacity:
    cpu: "16"
    memory: "32Gi"
    ephemeral-storage: "200Gi"
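
A minimal sketch of applying the same labels and taints to an existing node with kubectl (the node name is illustrative):

# Label and taint an existing GPU training node
kubectl label node gpu-training-node-1 \
  node-type=gpu-training accelerator=nvidia-v100 workload=training
kubectl taint node gpu-training-node-1 nvidia.com/gpu=training:NoSchedule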

Cluster Installation and Setup

Prerequisites and Planning

Essential requirements for AI cluster deployment:

  1. Hardware Requirements

    • GPU nodes with NVIDIA V100, A100, or T4 GPUs
    • High-memory nodes for large model loading
    • Fast SSD storage for model and dataset caching
    • High-bandwidth networking (10Gbps+)
  2. Software Dependencies

    • Kubernetes 1.28+ with GPU support
    • NVIDIA GPU Operator for GPU management
    • Container runtime with GPU support (containerd/Docker)
    • Network CNI plugin (Calico, Flannel, or Cilium)

Cluster Bootstrap with kubeadm

Setting up the control plane and worker nodes:

# Initialize control plane
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --service-cidr=10.96.0.0/12 \
  --kubernetes-version=v1.28.0 \
  --control-plane-endpoint=k8s-api.yourdomain.com

# Configure kubectl
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install CNI plugin (Calico)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/calico.yaml

# Join worker nodes
kubeadm join k8s-api.yourdomain.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>
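
If the bootstrap token has expired, the full join command (including a fresh token and CA hash) can be regenerated on a control-plane node:

# Regenerate the worker join command
kubeadm token create --print-join-command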

GPU Operator Installation

Deploying NVIDIA GPU Operator for GPU management:

# Add NVIDIA Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set nodeStatusExporter.enabled=true \
  --set gfd.enabled=true \
  --set migManager.enabled=false

# Verify GPU Operator installation
kubectl get pods -n gpu-operator
kubectl describe nodes | grep nvidia.com/gpu
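
Once the operator reports the nvidia.com/gpu resource, a short-lived pod that requests a GPU is a quick scheduling smoke test; a minimal sketch (the CUDA image tag is an assumption):

# GPU smoke-test pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda
    image: nvidia/cuda:12.1.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

kubectl logs gpu-smoke-test should print the familiar nvidia-smi table once the pod completes.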

Storage Configuration for AI Workloads

Persistent Volume Setup

Configuring storage for datasets and model artifacts:

# High-performance SSD StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  csi.storage.k8s.io/fstype: ext4
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

---
# Shared dataset storage
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: shared-datasets
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-12345678
  directoryPerms: "0755"
volumeBindingMode: Immediate

---
# Model registry PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-registry-pvc
  namespace: ai-models
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: shared-datasets
  resources:
    requests:
      storage: 1Ti

Dataset Management

Implementing efficient dataset handling:

# Dataset cache DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dataset-cache
  namespace: ai-infrastructure
spec:
  selector:
    matchLabels:
      app: dataset-cache
  template:
    metadata:
      labels:
        app: dataset-cache
    spec:
      containers:
      - name: cache-manager
        image: redis:7-alpine
        resources:
          requests:
            memory: "4Gi"
            cpu: "1"
          limits:
            memory: "8Gi"
            cpu: "2"
        volumeMounts:
        - name: cache-storage
          mountPath: /data
        # redis:7-alpine does not read REDIS_* env vars; pass settings as server flags
        args:
        - "--maxmemory"
        - "6gb"
        - "--maxmemory-policy"
        - "allkeys-lru"
      volumes:
      - name: cache-storage
        hostPath:
          path: /var/lib/dataset-cache
          type: DirectoryOrCreate
      nodeSelector:
        workload: training

AI Workload Scheduling and Resource Management

GPU Resource Quotas

Implementing resource quotas for GPU utilization:

# GPU resource quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-training
spec:
  hard:
    requests.nvidia.com/gpu: "16"
    limits.nvidia.com/gpu: "16"
    requests.memory: "512Gi"
    limits.memory: "512Gi"
    requests.cpu: "128"
    limits.cpu: "128"

---
# Limit ranges for AI workloads
apiVersion: v1
kind: LimitRange
metadata:
  name: ai-workload-limits
  namespace: ai-training
spec:
  limits:
  - type: Container
    default:
      nvidia.com/gpu: "1"
      memory: "8Gi"
      cpu: "4"
    defaultRequest:
      nvidia.com/gpu: "1"
      memory: "4Gi"
      cpu: "2"
    max:
      nvidia.com/gpu: "8"
      memory: "64Gi"
      cpu: "32"
    min:
      nvidia.com/gpu: "1"
      memory: "1Gi"
      cpu: "1"

Priority Classes for AI Workloads

Defining scheduling priorities for different workload types:

# High priority for inference workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-inference-priority
value: 1000
globalDefault: false
description: "High priority for AI inference workloads"

---
# Medium priority for training workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-training-priority
value: 500
globalDefault: false
description: "Medium priority for AI training workloads"

---
# Low priority for batch processing
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-batch-priority
value: 100
globalDefault: false
description: "Low priority for AI batch processing"

Model Serving Infrastructure

Model Server Deployment

Deploying scalable model serving infrastructure:

# Model server deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-model-server
  namespace: ai-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-model-server
  template:
    metadata:
      labels:
        app: ai-model-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      priorityClassName: ai-inference-priority
      nodeSelector:
        workload: inference
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: "inference"
        effect: NoSchedule
      containers:
      - name: model-server
        image: your-registry/ai-model-server:latest
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8080
          name: metrics
        env:
        - name: MODEL_PATH
          value: "/models"
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: BATCH_SIZE
          value: "8"
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "4"
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "8"
        volumeMounts:
        - name: model-storage
          mountPath: /models
          readOnly: true
        - name: cache-storage
          mountPath: /cache
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        startupProbe:
          httpGet:
            path: /startup
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 30
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-registry-pvc
      - name: cache-storage
        emptyDir:
          sizeLimit: 10Gi

---
# Model server service
apiVersion: v1
kind: Service
metadata:
  name: ai-model-service
  namespace: ai-inference
spec:
  selector:
    app: ai-model-server
  ports:
  - name: http
    port: 80
    targetPort: 8000
  - name: metrics
    port: 8080
    targetPort: 8080
  type: ClusterIP

---
# Ingress for external access
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ai-model-ingress
  namespace: ai-inference
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - ai-api.yourdomain.com
    secretName: ai-api-tls
  rules:
  - host: ai-api.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ai-model-service
            port:
              number: 80
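
With the Service and Ingress in place, clients reach the model server over HTTPS. A sketch of a request, assuming the server exposes a JSON prediction endpoint (the /v1/predict path and payload shape are assumptions about your model server image):

# Example inference request through the ingress
curl -sk https://ai-api.yourdomain.com/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": [[0.1, 0.2, 0.3]]}'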

Horizontal Pod Autoscaling

Implementing intelligent auto-scaling for AI workloads:

# HPA for model servers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-model-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-model-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: inference_queue_length
      target:
        type: AverageValue
        averageValue: "10"
  - type: Pods
    pods:
      metric:
        name: gpu_utilization_percent
      target:
        type: AverageValue
        averageValue: "75"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60

---
# Vertical Pod Autoscaler
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ai-model-vpa
  namespace: ai-inference
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-model-server
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: model-server
      maxAllowed:
        cpu: "16"
        memory: "32Gi"
      minAllowed:
        cpu: "2"
        memory: "4Gi"
      controlledResources: ["cpu", "memory"]
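
The Pods metrics above (inference_queue_length, gpu_utilization_percent) are not built into Kubernetes; they must be served through the custom metrics API, typically by prometheus-adapter. Also note that running the VPA in Auto mode on the same CPU and memory resources the HPA scales on can make the two controllers fight; a common compromise is updateMode: "Off" so the VPA only emits recommendations. A minimal prometheus-adapter rule for the queue-length metric (the metric name and labels are assumptions about the model server's instrumentation):

# prometheus-adapter rule fragment (e.g., in the Helm chart's values)
rules:
  custom:
  - seriesQuery: 'inference_queue_length{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "inference_queue_length"
      as: "inference_queue_length"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'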

Training Workload Management

Distributed Training Setup

Configuring distributed training with multiple GPUs:

# PyTorch distributed training job
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
  namespace: ai-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          priorityClassName: ai-training-priority
          nodeSelector:
            workload: training
          tolerations:
          - key: nvidia.com/gpu
            operator: Equal
            value: "training"
            effect: NoSchedule
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
            command:
            - python
            - /workspace/train.py
            - --backend=nccl
            - --epochs=100
            - --batch-size=32
            resources:
              requests:
                nvidia.com/gpu: 4
                memory: "32Gi"
                cpu: "16"
              limits:
                nvidia.com/gpu: 4
                memory: "64Gi"
                cpu: "32"
            volumeMounts:
            - name: training-data
              mountPath: /data
            - name: model-output
              mountPath: /output
            - name: workspace
              mountPath: /workspace
            env:
            - name: NCCL_DEBUG
              value: "INFO"
            - name: PYTHONUNBUFFERED
              value: "1"
          volumes:
          - name: training-data
            persistentVolumeClaim:
              claimName: training-dataset-pvc
          - name: model-output
            persistentVolumeClaim:
              claimName: model-output-pvc
          - name: workspace
            configMap:
              name: training-scripts
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          priorityClassName: ai-training-priority
          nodeSelector:
            workload: training
          tolerations:
          - key: nvidia.com/gpu
            operator: Equal
            value: "training"
            effect: NoSchedule
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
            command:
            - python
            - /workspace/train.py
            - --backend=nccl
            - --epochs=100
            - --batch-size=32
            resources:
              requests:
                nvidia.com/gpu: 4
                memory: "32Gi"
                cpu: "16"
              limits:
                nvidia.com/gpu: 4
                memory: "64Gi"
                cpu: "32"
            volumeMounts:
            - name: training-data
              mountPath: /data
            - name: workspace
              mountPath: /workspace
            env:
            - name: NCCL_DEBUG
              value: "INFO"
            - name: PYTHONUNBUFFERED
              value: "1"
          volumes:
          - name: training-data
            persistentVolumeClaim:
              claimName: training-dataset-pvc
          - name: workspace
            configMap:
              name: training-scripts
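
PyTorchJob is a custom resource provided by the Kubeflow Training Operator, which must be installed before the manifest above will apply; the operator also injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into each replica. A typical install (the release ref is an assumption; check the current version):

# Install the Kubeflow Training Operator (standalone overlay)
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"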

Job Scheduling and Queuing

Implementing job queuing for efficient resource utilization:

# Volcano scheduler for batch jobs
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack

---
# Queue for training jobs
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ai-training-queue
spec:
  weight: 100
  reclaimable: true
  capability:
    nvidia.com/gpu: 32
    memory: "512Gi"
    cpu: "256"

---
# Queue for inference jobs
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ai-inference-queue
spec:
  weight: 200
  reclaimable: false
  capability:
    nvidia.com/gpu: 16
    memory: "256Gi"
    cpu: "128"

Monitoring and Observability

Prometheus Monitoring Stack

Comprehensive monitoring for AI workloads:

# Prometheus configuration for AI metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    rule_files:
    - "/etc/prometheus/rules/*.yml"
    
    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
    
    - job_name: 'gpu-metrics'
      static_configs:
      - targets: ['dcgm-exporter:9400']
    
    - job_name: 'ai-model-servers'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['ai-inference']
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: ai-model-server

---
# GPU monitoring with DCGM exporter
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04
        ports:
        - name: metrics
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
        env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      nodeSelector:
        # label set by GPU Feature Discovery (gfd.enabled=true in the GPU Operator install)
        nvidia.com/gpu.present: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
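
The static gpu-metrics scrape target above ('dcgm-exporter:9400') assumes a Service in front of the exporter pods; a minimal sketch:

# Service exposing the DCGM exporter to Prometheus
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400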

Custom AI Metrics

Implementing AI-specific monitoring metrics:

# ServiceMonitor for AI workloads
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-workload-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      monitoring: ai-workloads
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

---
# PrometheusRule for AI alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-workload-alerts
  namespace: monitoring
spec:
  groups:
  - name: ai-inference
    rules:
    - alert: HighInferenceLatency
      expr: histogram_quantile(0.95, sum(rate(ai_inference_duration_seconds_bucket[5m])) by (le)) > 1.0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High inference latency detected"
        description: "95th percentile inference latency is {{ $value }}s"
    
    - alert: GPUMemoryHigh
      expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "GPU memory usage is high"
        description: "GPU memory usage is {{ $value | humanizePercentage }}"
    
    - alert: ModelServerDown
      expr: up{job="ai-model-servers"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "AI model server is down"
        description: "Model server {{ $labels.instance }} is not responding"

Security and Compliance

Network Policies

Implementing network security for AI workloads:

# Network policy for AI inference namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-inference-policy
  namespace: ai-inference
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 8000
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: ai-models
    ports:
    - protocol: TCP
      port: 443
  - to: []
    ports:
    - protocol: TCP
      port: 53
    - protocol: UDP
      port: 53

---
# Pod Security Standards for AI namespaces (PodSecurityPolicy was removed in Kubernetes 1.25)
apiVersion: v1
kind: Namespace
metadata:
  name: ai-inference
  labels:
    # tighten enforce to "restricted" once workloads run as non-root with dropped capabilities
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted

RBAC Configuration

Role-based access control for AI operations:

# ClusterRole for AI operators
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ai-operator
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["kubeflow.org"]
  resources: ["pytorchjobs", "tfjobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

---
# ServiceAccount for AI workloads
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ai-workload-sa
  namespace: ai-training

---
# RoleBinding for AI operators
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ai-operator-binding
subjects:
- kind: ServiceAccount
  name: ai-workload-sa
  namespace: ai-training
roleRef:
  kind: ClusterRole
  name: ai-operator
  apiGroup: rbac.authorization.k8s.io
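
Permissions can be verified before any workloads are deployed by impersonating the ServiceAccount:

# Check the ServiceAccount's effective permissions
kubectl auth can-i create pytorchjobs.kubeflow.org -n ai-training \
  --as=system:serviceaccount:ai-training:ai-workload-sa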

Backup and Disaster Recovery

Model and Data Backup

Implementing comprehensive backup strategies:

# Velero backup for AI workloads
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ai-workloads-backup
  namespace: velero
spec:
  includedNamespaces:
  - ai-training
  - ai-inference
  - ai-models
  includedResources:
  - persistentvolumeclaims
  - persistentvolumes
  - deployments
  - services
  - configmaps
  - secrets
  labelSelector:
    matchLabels:
      backup: "true"
  storageLocation: default
  ttl: 720h0m0s

---
# Scheduled backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: ai-daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
    - ai-training
    - ai-inference
    - ai-models
    storageLocation: default
    ttl: 168h0m0s
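
Backups are only useful if restores are rehearsed; a restore from the on-demand backup above uses the Velero CLI:

# Restore AI namespaces from a completed backup
velero restore create ai-workloads-restore --from-backup ai-workloads-backup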

Performance Optimization

Cluster Autoscaling

Implementing intelligent cluster scaling:

# Cluster Autoscaler configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 300Mi
          requests:
            cpu: 100m
            memory: 300Mi
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/ai-cluster
        - --balance-similar-node-groups
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --max-node-provision-time=15m
        env:
        - name: AWS_REGION
          value: us-west-2
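
For GPU node groups that scale from zero, the autoscaler cannot inspect a live node to learn its resources, labels, and taints, so they must be advertised as ASG tags; a sketch of the tag set for the training group (values are examples):

# ASG tags for a GPU training node group (scale-from-zero hints)
k8s.io/cluster-autoscaler/enabled = true
k8s.io/cluster-autoscaler/ai-cluster = owned
k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu = 4
k8s.io/cluster-autoscaler/node-template/label/workload = training
k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu = training:NoSchedule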

Troubleshooting and Maintenance

Common Issues and Solutions

Resolving frequent AI cluster problems:

  1. GPU Scheduling Issues

    # Check GPU availability
    kubectl describe nodes | grep nvidia.com/gpu
    
    # Verify GPU operator pods
    kubectl get pods -n gpu-operator
    
    # Check device plugin logs
    kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset
    
  2. Model Loading Timeouts

    # Increase startup probe timeout
    kubectl patch deployment ai-model-server -n ai-inference -p '{"spec":{"template":{"spec":{"containers":[{"name":"model-server","startupProbe":{"failureThreshold":60}}]}}}}'
    
    # Check model loading logs
    kubectl logs -f deployment/ai-model-server -c model-server -n ai-inference
    
    # Verify model storage access
    kubectl exec -it deployment/ai-model-server -n ai-inference -- ls -la /models
    
  3. Resource Contention

    # Check resource usage
    kubectl top nodes
    kubectl top pods -n ai-training --sort-by=memory
    
    # Identify resource bottlenecks
    kubectl describe node <node-name> | grep -A 10 "Allocated resources"
    
    # Review resource quotas
    kubectl describe resourcequota -n ai-training
    
  4. Network Connectivity Issues

    # Test pod-to-pod communication
    kubectl exec -it <pod-name> -- nslookup ai-model-service.ai-inference.svc.cluster.local
    
    # Check network policies
    kubectl get networkpolicy -A
    
    # Verify ingress configuration
    kubectl describe ingress ai-model-ingress -n ai-inference
    

Maintenance Procedures

Regular maintenance tasks for optimal cluster performance:

#!/bin/bash
# Weekly cluster health check
echo "=== Kubernetes AI Cluster Health Check ==="

# Check node status
echo "Node Status:"
kubectl get nodes -o wide

# Check GPU availability
echo -e "\nGPU Resources:"
kubectl describe nodes | grep -A 5 "nvidia.com/gpu"

# Check critical pods
echo -e "\nCritical Pod Status:"
kubectl get pods -n kube-system | grep -E "(coredns|kube-proxy|calico)"
kubectl get pods -n gpu-operator
kubectl get pods -n ai-inference
kubectl get pods -n ai-training

# Check resource usage
echo -e "\nResource Usage:"
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20

# Check persistent volumes
echo -e "\nStorage Status:"
kubectl get pv,pvc -A

# Check recent events
echo -e "\nRecent Events:"
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Generate report
echo -e "\n=== Health Check Complete ==="
date

Performance Tuning

Optimizing cluster performance for AI workloads:

# Kubelet configuration for AI nodes
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubelet-config
  namespace: kube-system
data:
  kubelet: |
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    maxPods: 110
    podsPerCore: 10
    evictionHard:
      memory.available: "1Gi"
      nodefs.available: "10%"
      imagefs.available: "10%"
    evictionSoft:
      memory.available: "2Gi"
      nodefs.available: "15%"
      imagefs.available: "15%"
    evictionSoftGracePeriod:
      memory.available: "2m"
      nodefs.available: "2m"
      imagefs.available: "2m"
    imageGCHighThresholdPercent: 85
    imageGCLowThresholdPercent: 80
    cpuManagerPolicy: "static"
    topologyManagerPolicy: "single-numa-node"
    systemReserved:
      cpu: "1"
      memory: "2Gi"
      ephemeral-storage: "10Gi"
    kubeReserved:
      cpu: "1"
      memory: "2Gi"
      ephemeral-storage: "10Gi"
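
Note that the static CPU manager pins exclusive CPUs only for containers in the Guaranteed QoS class with integer CPU requests (requests equal to limits), and single-numa-node alignment builds on that. A pod shaped to benefit from this, reusing the model-server image from earlier:

# Guaranteed QoS pod eligible for CPU pinning and NUMA alignment
apiVersion: v1
kind: Pod
metadata:
  name: pinned-inference
  namespace: ai-inference
spec:
  containers:
  - name: model-server
    image: your-registry/ai-model-server:latest
    resources:
      requests:
        cpu: "4"
        memory: "16Gi"
        nvidia.com/gpu: 1
      limits:
        cpu: "4"
        memory: "16Gi"
        nvidia.com/gpu: 1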

Best Practices and Recommendations

Production Deployment Guidelines

Essential practices for production AI clusters:

  1. Resource Planning

    • Size GPU nodes based on model requirements
    • Plan for 20-30% overhead for system processes
    • Implement proper resource quotas and limits
    • Use node affinity for workload placement
  2. Security Hardening

    • Enable Pod Security Standards
    • Implement network policies
    • Use service mesh for encrypted communication
    • Regular security scanning and updates
  3. Monitoring and Alerting

    • Monitor GPU utilization and memory
    • Track model inference latency and throughput
    • Set up alerts for resource exhaustion
    • Implement distributed tracing for complex workflows
  4. Backup and Recovery

    • Regular backups of model artifacts and configurations
    • Test disaster recovery procedures
    • Document recovery processes
    • Implement cross-region replication for critical models

Cost Optimization Strategies

Reducing operational costs while maintaining performance:

# Spot instance node pool for training
apiVersion: v1
kind: Node
metadata:
  name: spot-training-node
  labels:
    node-type: spot-training
    lifecycle: spot
    workload: training
spec:
  taints:
  - key: node.kubernetes.io/spot
    value: "true"
    effect: NoSchedule
  - key: nvidia.com/gpu
    value: "training"
    effect: NoSchedule

---
# Tolerations for spot instances
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-training-job
spec:
  selector:
    matchLabels:
      app: batch-training-job
  template:
    metadata:
      labels:
        app: batch-training-job
    spec:
      tolerations:
      - key: node.kubernetes.io/spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      - key: nvidia.com/gpu
        operator: Equal
        value: "training"
        effect: NoSchedule
      nodeSelector:
        lifecycle: spot
        workload: training

Future Considerations

Emerging Technologies

Preparing for next-generation AI infrastructure:

  1. Multi-Instance GPU (MIG) Support

    • Partition A100 GPUs for better utilization
    • Implement MIG-aware scheduling
    • Optimize resource allocation for mixed workloads
  2. Edge AI Integration

    • Deploy lightweight models to edge nodes
    • Implement federated learning workflows
    • Manage model synchronization across locations
  3. Quantum Computing Integration

    • Prepare for hybrid classical-quantum workloads
    • Implement quantum simulator support
    • Plan for quantum-classical communication protocols

Scaling Strategies

Planning for growth and evolution. Note that KubeFed, shown below, has since been archived; newer multi-cluster projects such as Karmada serve a similar role, though the federation pattern itself carries over:

# Multi-cluster federation setup
apiVersion: core.kubefed.io/v1beta1
kind: KubeFedCluster
metadata:
  name: ai-cluster-west
  namespace: kube-federation-system
spec:
  apiEndpoint: https://ai-cluster-west.example.com
  caBundle: <base64-encoded-ca-bundle>
  secretRef:
    name: ai-cluster-west-secret

---
# Federated deployment for global model serving
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: global-ai-model-server
  namespace: ai-inference
spec:
  template:
    metadata:
      labels:
        app: global-ai-model-server
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: global-ai-model-server
      template:
        metadata:
          labels:
            app: global-ai-model-server
        spec:
          containers:
          - name: model-server
            image: your-registry/ai-model-server:latest
  placement:
    clusters:
    - name: ai-cluster-west
    - name: ai-cluster-east
    - name: ai-cluster-europe

Conclusion

Building and managing Kubernetes clusters for AI workloads requires careful planning, specialized configuration, and ongoing optimization. This comprehensive guide provides the foundation for deploying production-ready AI infrastructure that can scale with your organization's needs.

Key takeaways for successful AI cluster deployment:

  • Start with proper resource planning and node pool design
  • Implement comprehensive monitoring from day one
  • Prioritize security and compliance throughout the deployment
  • Plan for scalability and future technology adoption
  • Establish robust backup and recovery procedures
  • Continuously optimize for performance and cost efficiency

The AI landscape continues to evolve rapidly, and your Kubernetes infrastructure should be designed to adapt to new technologies, frameworks, and deployment patterns. Regular updates, monitoring, and optimization will ensure your AI cluster remains efficient, secure, and capable of supporting your organization's machine learning initiatives.

For organizations just beginning their AI journey, start with a smaller cluster and gradually expand as you gain experience and understand your specific workload requirements. The investment in proper infrastructure will pay dividends in improved model performance, reduced operational overhead, and faster time-to-market for AI applications.
