Kubernetes AI Cluster Setup: Production-Ready Infrastructure for Machine Learning Workloads
Kubernetes has become the de facto standard for orchestrating AI and machine learning workloads at scale. This comprehensive guide covers everything needed to build, configure, and manage production-ready Kubernetes clusters specifically optimized for AI applications, from GPU scheduling to model serving and auto-scaling.
AI-Optimized Cluster Architecture
Cluster Design Principles
Designing Kubernetes clusters for AI workloads requires specific considerations:
- Heterogeneous Node Types: mixing CPU-only and GPU-enabled nodes
- Resource Isolation: preventing interference between training and inference workloads (see the scheduling sketch after this list)
- Storage Strategy: handling large datasets and model artifacts efficiently
- Network Optimization: minimizing latency for real-time inference
- Scalability Planning: accommodating varying computational demands
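As a concrete illustration of resource isolation, the sketch below pins an inference pod to inference nodes and tolerates the matching taint; it assumes the labels and taints defined in the next section, and the image name is a hypothetical placeholder:
# Illustrative: pairing a nodeSelector with a matching toleration
apiVersion: v1
kind: Pod
metadata:
  name: isolation-example
spec:
  nodeSelector:
    workload: inference
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "inference"
    effect: NoSchedule
  containers:
  - name: app
    image: your-registry/inference-app:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1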
Node Pool Configuration
Structuring node pools for different AI workload types:
# GPU Training Node Pool
# Note: these Node manifests are illustrative. In practice, node pools are
# provisioned by your cloud provider or cluster tooling, and capacity is
# reported by the kubelet under status rather than authored by hand.
apiVersion: v1
kind: Node
metadata:
  name: gpu-training-node
  labels:
    node-type: gpu-training
    accelerator: nvidia-v100
    workload: training
spec:
  taints:
  - key: nvidia.com/gpu
    value: "training"
    effect: NoSchedule
status:
  capacity:
    cpu: "32"
    memory: "128Gi"
    nvidia.com/gpu: "4"
    ephemeral-storage: "1Ti"
---
# GPU Inference Node Pool
apiVersion: v1
kind: Node
metadata:
  name: gpu-inference-node
  labels:
    node-type: gpu-inference
    accelerator: nvidia-t4
    workload: inference
spec:
  taints:
  - key: nvidia.com/gpu
    value: "inference"
    effect: NoSchedule
status:
  capacity:
    cpu: "16"
    memory: "64Gi"
    nvidia.com/gpu: "2"
    ephemeral-storage: "500Gi"
---
# CPU-Only Node Pool
apiVersion: v1
kind: Node
metadata:
  name: cpu-node
  labels:
    node-type: cpu-only
    workload: general
status:
  capacity:
    cpu: "16"
    memory: "32Gi"
    ephemeral-storage: "200Gi"
Cluster Installation and Setup
Prerequisites and Planning
Essential requirements for AI cluster deployment:
Hardware Requirements
- GPU nodes with NVIDIA V100, A100, or T4 GPUs
- High-memory nodes for large model loading
- Fast SSD storage for model and dataset caching
- High-bandwidth networking (10Gbps+)
Software Dependencies
- Kubernetes 1.28+ with GPU support
- NVIDIA GPU Operator for GPU management
- Container runtime with GPU support (containerd/Docker)
- Network CNI plugin (Calico, Flannel, or Cilium)
Cluster Bootstrap with kubeadm
Setting up the control plane and worker nodes:
# Initialize control plane
sudo kubeadm init \
--pod-network-cidr=10.244.0.0/16 \
--service-cidr=10.96.0.0/12 \
--kubernetes-version=v1.28.0 \
--control-plane-endpoint=k8s-api.yourdomain.com
# Configure kubectl
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# Install CNI plugin (Calico)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/calico.yaml
# Join worker nodes
sudo kubeadm join k8s-api.yourdomain.com:6443 \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash>
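The same settings can be captured declaratively in a kubeadm configuration file and passed with kubeadm init --config; a minimal sketch mirroring the flags above:
# kubeadm-config.yaml (minimal sketch of the flags used above)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.0
controlPlaneEndpoint: k8s-api.yourdomain.com
networking:
  podSubnet: 10.244.0.0/16
  serviceSubnet: 10.96.0.0/12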
GPU Operator Installation
Deploying NVIDIA GPU Operator for GPU management:
# Add NVIDIA Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set nodeStatusExporter.enabled=true \
--set gfd.enabled=true \
--set migManager.enabled=false
# Verify GPU Operator installation
kubectl get pods -n gpu-operator
kubectl describe nodes | grep nvidia.com/gpu
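A quick way to confirm that GPUs are actually schedulable is a one-shot test pod that runs nvidia-smi; the image tag below is an assumption, so substitute any CUDA base image available to your cluster:
# GPU smoke-test pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda
    image: nvidia/cuda:12.1.1-base-ubuntu22.04   # assumed tag; any CUDA base image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1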
Storage Configuration for AI Workloads
Persistent Volume Setup
Configuring storage for datasets and model artifacts:
# High-performance SSD StorageClass (uses the EBS CSI driver; the in-tree
# kubernetes.io/aws-ebs provisioner does not support gp3 iops/throughput)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  csi.storage.k8s.io/fstype: ext4
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
# Shared dataset storage (EFS CSI driver)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: shared-datasets
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-12345678
  directoryPerms: "0755"
volumeBindingMode: Immediate
---
# Model registry PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-registry-pvc
  namespace: ai-models
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: shared-datasets
  resources:
    requests:
      storage: 1Ti
Dataset Management
Implementing efficient dataset handling:
# Dataset cache DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dataset-cache
  namespace: ai-infrastructure
spec:
  selector:
    matchLabels:
      app: dataset-cache
  template:
    metadata:
      labels:
        app: dataset-cache
    spec:
      containers:
      - name: cache-manager
        image: redis:7-alpine
        # The stock Redis image does not read REDIS_* environment variables;
        # pass memory settings as server arguments instead.
        args:
        - --maxmemory
        - 6gb
        - --maxmemory-policy
        - allkeys-lru
        resources:
          requests:
            memory: "4Gi"
            cpu: "1"
          limits:
            memory: "8Gi"
            cpu: "2"
        volumeMounts:
        - name: cache-storage
          mountPath: /data
      volumes:
      - name: cache-storage
        hostPath:
          path: /var/lib/dataset-cache
          type: DirectoryOrCreate
      nodeSelector:
        workload: training
      # Required: training nodes carry the nvidia.com/gpu=training taint.
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: "training"
        effect: NoSchedule
AI Workload Scheduling and Resource Management
GPU Resource Quotas
Implementing resource quotas for GPU utilization:
# GPU resource quota (extended resources such as nvidia.com/gpu only
# support quota items with the requests. prefix)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-training
spec:
  hard:
    requests.nvidia.com/gpu: "16"
    requests.memory: "512Gi"
    limits.memory: "512Gi"
    requests.cpu: "128"
    limits.cpu: "128"
---
# Limit ranges for AI workloads
apiVersion: v1
kind: LimitRange
metadata:
  name: ai-workload-limits
  namespace: ai-training
spec:
  limits:
  - type: Container
    default:
      nvidia.com/gpu: "1"
      memory: "8Gi"
      cpu: "4"
    defaultRequest:
      nvidia.com/gpu: "1"
      memory: "4Gi"
      cpu: "2"
    max:
      nvidia.com/gpu: "8"
      memory: "64Gi"
      cpu: "32"
    # No GPU minimum, so sidecar and utility containers can run without one.
    min:
      memory: "1Gi"
      cpu: "1"
Priority Classes for AI Workloads
Defining scheduling priorities for different workload types:
# High priority for inference workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-inference-priority
value: 1000
globalDefault: false
description: "High priority for AI inference workloads"
---
# Medium priority for training workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-training-priority
value: 500
globalDefault: false
description: "Medium priority for AI training workloads"
---
# Low priority for batch processing
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-batch-priority
value: 100
globalDefault: false
description: "Low priority for AI batch processing"
Model Serving Infrastructure
Model Server Deployment
Deploying scalable model serving infrastructure:
# Model server deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-model-server
  namespace: ai-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-model-server
  template:
    metadata:
      labels:
        app: ai-model-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      priorityClassName: ai-inference-priority
      nodeSelector:
        workload: inference
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: "inference"
        effect: NoSchedule
      containers:
      - name: model-server
        image: your-registry/ai-model-server:latest
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8080
          name: metrics
        env:
        - name: MODEL_PATH
          value: "/models"
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: BATCH_SIZE
          value: "8"
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "4"
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "8"
        volumeMounts:
        - name: model-storage
          mountPath: /models
          readOnly: true
        - name: cache-storage
          mountPath: /cache
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        startupProbe:
          httpGet:
            path: /startup
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 30
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-registry-pvc
      - name: cache-storage
        emptyDir:
          sizeLimit: 10Gi
---
# Model server service
apiVersion: v1
kind: Service
metadata:
  name: ai-model-service
  namespace: ai-inference
spec:
  selector:
    app: ai-model-server
  ports:
  - name: http
    port: 80
    targetPort: 8000
  - name: metrics
    port: 8080
    targetPort: 8080
  type: ClusterIP
---
# Ingress for external access
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ai-model-ingress
  namespace: ai-inference
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - ai-api.yourdomain.com
    secretName: ai-api-tls
  rules:
  - host: ai-api.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ai-model-service
            port:
              number: 80
Horizontal Pod Autoscaling
Implementing intelligent auto-scaling for AI workloads:
# HPA for model servers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-model-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-model-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # The Pods metrics below are custom metrics and require a metrics
  # adapter such as prometheus-adapter (see the sketch after this block).
  - type: Pods
    pods:
      metric:
        name: inference_queue_length
      target:
        type: AverageValue
        averageValue: "10"
  - type: Pods
    pods:
      metric:
        name: gpu_utilization_percent
      target:
        type: AverageValue
        averageValue: "75"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
---
# Vertical Pod Autoscaler (requires the VPA components to be installed)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ai-model-vpa
  namespace: ai-inference
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-model-server
  updatePolicy:
    # "Auto" would fight the HPA above, which also scales on CPU and memory;
    # run the VPA in recommendation-only mode instead.
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: model-server
      maxAllowed:
        cpu: "16"
        memory: "32Gi"
      minAllowed:
        cpu: "2"
        memory: "4Gi"
      controlledResources: ["cpu", "memory"]
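The custom Pods metrics above are not built into Kubernetes; they assume a custom-metrics adapter such as prometheus-adapter is installed and mapping Prometheus series into the custom metrics API. A hedged sketch of one adapter rule, assuming the model server actually exports inference_queue_length (the exact Helm values layout varies by chart version):
# prometheus-adapter rule sketch (assumes the series exists in Prometheus)
rules:
- seriesQuery: 'inference_queue_length{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "inference_queue_length"
    as: "inference_queue_length"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'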
Training Workload Management
Distributed Training Setup
Configuring distributed training with multiple GPUs (the PyTorchJob resource below requires the Kubeflow Training Operator to be installed):
# PyTorch distributed training job
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
  namespace: ai-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          priorityClassName: ai-training-priority
          nodeSelector:
            workload: training
          tolerations:
          - key: nvidia.com/gpu
            operator: Equal
            value: "training"
            effect: NoSchedule
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
            command:
            - python
            - /workspace/train.py
            - --backend=nccl
            - --epochs=100
            - --batch-size=32
            resources:
              requests:
                nvidia.com/gpu: 4
                memory: "32Gi"
                cpu: "16"
              limits:
                nvidia.com/gpu: 4
                memory: "64Gi"
                cpu: "32"
            volumeMounts:
            - name: training-data
              mountPath: /data
            - name: model-output
              mountPath: /output
            - name: workspace
              mountPath: /workspace
            env:
            - name: NCCL_DEBUG
              value: "INFO"
            - name: PYTHONUNBUFFERED
              value: "1"
          volumes:
          - name: training-data
            persistentVolumeClaim:
              claimName: training-dataset-pvc
          - name: model-output
            persistentVolumeClaim:
              claimName: model-output-pvc
          - name: workspace
            configMap:
              name: training-scripts
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          priorityClassName: ai-training-priority
          nodeSelector:
            workload: training
          tolerations:
          - key: nvidia.com/gpu
            operator: Equal
            value: "training"
            effect: NoSchedule
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
            command:
            - python
            - /workspace/train.py
            - --backend=nccl
            - --epochs=100
            - --batch-size=32
            resources:
              requests:
                nvidia.com/gpu: 4
                memory: "32Gi"
                cpu: "16"
              limits:
                nvidia.com/gpu: 4
                memory: "64Gi"
                cpu: "32"
            volumeMounts:
            - name: training-data
              mountPath: /data
            - name: workspace
              mountPath: /workspace
            env:
            - name: NCCL_DEBUG
              value: "INFO"
            - name: PYTHONUNBUFFERED
              value: "1"
          volumes:
          - name: training-data
            persistentVolumeClaim:
              claimName: training-dataset-pvc
          - name: workspace
            configMap:
              name: training-scripts
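The job mounts a training-scripts ConfigMap that is referenced but not shown; a placeholder sketch follows (the actual train.py is your own training code, and the distributed environment variables shown are the ones the Training Operator injects):
# training-scripts ConfigMap referenced by the PyTorchJob
apiVersion: v1
kind: ConfigMap
metadata:
  name: training-scripts
  namespace: ai-training
data:
  train.py: |
    # Placeholder: replace with your distributed training entrypoint.
    # The Training Operator sets MASTER_ADDR, MASTER_PORT, WORLD_SIZE and
    # RANK, so torch.distributed can initialize from the environment.
    import torch.distributed as dist
    dist.init_process_group(backend="nccl")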
Job Scheduling and Queuing
Implementing job queuing for efficient resource utilization:
# Volcano scheduler for batch jobs
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
---
# Queue for training jobs
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ai-training-queue
spec:
  weight: 100
  reclaimable: true
  capability:
    nvidia.com/gpu: 32
    memory: "512Gi"
    cpu: "256"
---
# Queue for inference jobs
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ai-inference-queue
spec:
  weight: 200
  reclaimable: false
  capability:
    nvidia.com/gpu: 16
    memory: "256Gi"
    cpu: "128"
Monitoring and Observability
Prometheus Monitoring Stack
Comprehensive monitoring for AI workloads:
# Prometheus configuration for AI metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    rule_files:
    - "/etc/prometheus/rules/*.yml"
    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
    - job_name: 'gpu-metrics'
      static_configs:
      - targets: ['dcgm-exporter:9400']
    - job_name: 'ai-model-servers'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['ai-inference']
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: ai-model-server
---
# GPU monitoring with DCGM exporter
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04
        ports:
        - name: metrics
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
        env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      nodeSelector:
        # Label set by GPU Feature Discovery (enabled in the GPU Operator
        # install); the accelerator labels used earlier (nvidia-v100,
        # nvidia-t4) would never match a single selector here.
        nvidia.com/gpu.present: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
Custom AI Metrics
Implementing AI-specific monitoring metrics:
# ServiceMonitor for AI workloads
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-workload-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      monitoring: ai-workloads
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
---
# PrometheusRule for AI alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-workload-alerts
  namespace: monitoring
spec:
  groups:
  - name: ai-inference
    rules:
    - alert: HighInferenceLatency
      # Assumes the model server exports a histogram named
      # ai_inference_duration_seconds; quantiles are computed over
      # the rate of its buckets.
      expr: histogram_quantile(0.95, sum(rate(ai_inference_duration_seconds_bucket[5m])) by (le)) > 1.0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High inference latency detected"
        description: "95th percentile inference latency is {{ $value }}s"
    - alert: GPUMemoryHigh
      # dcgm-exporter exposes framebuffer usage as DCGM_FI_DEV_FB_USED
      # and DCGM_FI_DEV_FB_FREE; there is no single "total" metric.
      expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "GPU memory usage is high"
        description: "GPU memory usage is {{ $value | humanizePercentage }}"
    - alert: ModelServerDown
      expr: up{job="ai-model-servers"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "AI model server is down"
        description: "Model server {{ $labels.instance }} is not responding"
Security and Compliance
Network Policies
Implementing network security for AI workloads:
# Network policy for AI inference namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-inference-policy
  namespace: ai-inference
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring
    ports:
    - protocol: TCP
      port: 8000
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ai-models
    ports:
    - protocol: TCP
      port: 443
  - to: []
    ports:
    - protocol: TCP
      port: 53
    - protocol: UDP
      port: 53
---
# Pod Security Admission for AI workloads
# (PodSecurityPolicy was removed in Kubernetes 1.25, so on a 1.28+ cluster
# enforce the restricted Pod Security Standard via namespace labels instead.)
apiVersion: v1
kind: Namespace
metadata:
  name: ai-inference
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
RBAC Configuration
Role-based access control for AI operations:
# ClusterRole for AI operators
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ai-operator
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["kubeflow.org"]
  resources: ["pytorchjobs", "tfjobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# ServiceAccount for AI workloads
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ai-workload-sa
  namespace: ai-training
---
# ClusterRoleBinding for AI operators
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ai-operator-binding
subjects:
- kind: ServiceAccount
  name: ai-workload-sa
  namespace: ai-training
roleRef:
  kind: ClusterRole
  name: ai-operator
  apiGroup: rbac.authorization.k8s.io
Backup and Disaster Recovery
Model and Data Backup
Implementing comprehensive backup strategies:
# Velero backup for AI workloads
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ai-workloads-backup
  namespace: velero
spec:
  includedNamespaces:
  - ai-training
  - ai-inference
  - ai-models
  includedResources:
  - persistentvolumeclaims
  - persistentvolumes
  - deployments
  - services
  - configmaps
  - secrets
  labelSelector:
    matchLabels:
      backup: "true"
  storageLocation: default
  ttl: 720h0m0s
---
# Scheduled backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: ai-daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
    - ai-training
    - ai-inference
    - ai-models
    storageLocation: default
    ttl: 168h0m0s
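Restores are driven by a matching Restore resource (or the velero CLI); a minimal sketch referencing the backup defined above:
# Restore the AI namespaces from a named backup
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: ai-workloads-restore
  namespace: velero
spec:
  backupName: ai-workloads-backup
  includedNamespaces:
  - ai-training
  - ai-inference
  - ai-models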
Performance Optimization
Cluster Autoscaling
Implementing intelligent cluster scaling:
# Cluster Autoscaler configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      # k8s.gcr.io is deprecated; images now live on registry.k8s.io
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 300Mi
          requests:
            cpu: 100m
            memory: 300Mi
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/ai-cluster
        - --balance-similar-node-groups
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --max-node-provision-time=15m
        env:
        - name: AWS_REGION
          value: us-west-2
Troubleshooting and Maintenance
Common Issues and Solutions
Resolving frequent AI cluster problems:
GPU Scheduling Issues
# Check GPU availability
kubectl describe nodes | grep nvidia.com/gpu
# Verify GPU Operator pods
kubectl get pods -n gpu-operator
# Check device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset
Model Loading Timeouts
# Increase the startup probe timeout
kubectl patch deployment ai-model-server -p '{"spec":{"template":{"spec":{"containers":[{"name":"model-server","startupProbe":{"failureThreshold":60}}]}}}}'
# Check model loading logs
kubectl logs -f deployment/ai-model-server -c model-server
# Verify model storage access
kubectl exec -it deployment/ai-model-server -- ls -la /models
Resource Contention
# Check resource usage
kubectl top nodes
kubectl top pods -n ai-training --sort-by=memory
# Identify resource bottlenecks
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
# Review resource quotas
kubectl describe resourcequota -n ai-training
Network Connectivity Issues
# Test pod-to-pod communication
kubectl exec -it <pod-name> -- nslookup ai-model-service.ai-inference.svc.cluster.local
# Check network policies
kubectl get networkpolicy -A
# Verify ingress configuration
kubectl describe ingress ai-model-ingress -n ai-inference
Maintenance Procedures
Regular maintenance tasks for optimal cluster performance:
#!/bin/bash
# Weekly cluster health check
echo "=== Kubernetes AI Cluster Health Check ==="
# Check node status
echo "Node Status:"
kubectl get nodes -o wide
# Check GPU availability
echo -e "\nGPU Resources:"
kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
# Check critical pods
echo -e "\nCritical Pod Status:"
kubectl get pods -n kube-system | grep -E "(coredns|kube-proxy|calico)"
kubectl get pods -n gpu-operator
kubectl get pods -n ai-inference
kubectl get pods -n ai-training
# Check resource usage
echo -e "\nResource Usage:"
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20
# Check persistent volumes
echo -e "\nStorage Status:"
kubectl get pv,pvc -A
# Check recent events
echo -e "\nRecent Events:"
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Generate report
echo -e "\n=== Health Check Complete ==="
date
Performance Tuning
Optimizing cluster performance for AI workloads:
# Kubelet configuration for AI nodes
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubelet-config
  namespace: kube-system
data:
  kubelet: |
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    maxPods: 110
    podsPerCore: 10
    evictionHard:
      memory.available: "1Gi"
      nodefs.available: "10%"
      imagefs.available: "10%"
    evictionSoft:
      memory.available: "2Gi"
      nodefs.available: "15%"
      imagefs.available: "15%"
    evictionSoftGracePeriod:
      memory.available: "2m"
      nodefs.available: "2m"
      imagefs.available: "2m"
    imageGCHighThresholdPercent: 85
    imageGCLowThresholdPercent: 80
    cpuManagerPolicy: "static"
    topologyManagerPolicy: "single-numa-node"
    systemReserved:
      cpu: "1"
      memory: "2Gi"
      ephemeral-storage: "10Gi"
    kubeReserved:
      cpu: "1"
      memory: "2Gi"
      ephemeral-storage: "10Gi"
Best Practices and Recommendations
Production Deployment Guidelines
Essential practices for production AI clusters:
Resource Planning
- Size GPU nodes based on model requirements
- Plan for 20-30% overhead for system processes
- Implement proper resource quotas and limits
- Use node affinity for workload placement (see the sketch after this list)
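Node affinity can combine a hard requirement with a soft preference, which a plain nodeSelector cannot express; a minimal sketch using the labels defined earlier (the image is a hypothetical placeholder):
# Require training nodes, prefer V100-equipped ones
apiVersion: v1
kind: Pod
metadata:
  name: affinity-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: workload
            operator: In
            values: ["training"]
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: accelerator
            operator: In
            values: ["nvidia-v100"]
  containers:
  - name: app
    image: your-registry/trainer:latest   # hypothetical image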
Security Hardening
- Enable Pod Security Standards
- Implement network policies
- Use service mesh for encrypted communication
- Regular security scanning and updates
Monitoring and Alerting
- Monitor GPU utilization and memory
- Track model inference latency and throughput
- Set up alerts for resource exhaustion
- Implement distributed tracing for complex workflows
Backup and Recovery
- Regular backups of model artifacts and configurations
- Test disaster recovery procedures
- Document recovery processes
- Implement cross-region replication for critical models
Cost Optimization Strategies
Reducing operational costs while maintaining performance:
# Spot instance node pool for training
# (Illustrative, as above: spot pools are normally created via your cloud
# provider's node group tooling; shown here for the labels and taints.)
apiVersion: v1
kind: Node
metadata:
  name: spot-training-node
  labels:
    node-type: spot-training
    lifecycle: spot
    workload: training
spec:
  taints:
  - key: node.kubernetes.io/spot
    value: "true"
    effect: NoSchedule
  - key: nvidia.com/gpu
    value: "training"
    effect: NoSchedule
---
# Tolerations for spot instances (fragment: selector and containers omitted)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-training-job
spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      - key: nvidia.com/gpu
        operator: Equal
        value: "training"
        effect: NoSchedule
      nodeSelector:
        lifecycle: spot
        workload: training
Future Considerations
Emerging Technologies
Preparing for next-generation AI infrastructure:
Multi-Instance GPU (MIG) Support
- Partition A100 GPUs for better utilization
- Implement MIG-aware scheduling (see the sketch after this list)
- Optimize resource allocation for mixed workloads
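With MIG enabled (the GPU Operator install above set migManager.enabled=false, so it would need to be turned on), A100 slices surface as their own extended resources. A hedged sketch of a pod requesting one 1g.5gb slice, assuming the mixed MIG strategy and a hypothetical image:
# Requesting a MIG slice instead of a full GPU
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  containers:
  - name: app
    image: your-registry/inference-app:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # resource name under the mixed MIG strategy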
Edge AI Integration
- Deploy lightweight models to edge nodes
- Implement federated learning workflows
- Manage model synchronization across locations
Quantum Computing Integration
- Prepare for hybrid classical-quantum workloads
- Implement quantum simulator support
- Plan for quantum-classical communication protocols
Scaling Strategies
Planning for growth and evolution (note that the KubeFed project shown below has since been archived; newer options such as Karmada provide similar multi-cluster federation):
# Multi-cluster federation setup
apiVersion: core.kubefed.io/v1beta1
kind: KubeFedCluster
metadata:
  name: ai-cluster-west
  namespace: kube-federation-system
spec:
  apiEndpoint: https://ai-cluster-west.example.com
  caBundle: <base64-encoded-ca-bundle>
  secretRef:
    name: ai-cluster-west-secret
---
# Federated deployment for global model serving
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: global-ai-model-server
  namespace: ai-inference
spec:
  template:
    metadata:
      labels:
        app: global-ai-model-server
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: global-ai-model-server
      template:
        metadata:
          labels:
            app: global-ai-model-server
        spec:
          containers:
          - name: model-server
            image: your-registry/ai-model-server:latest
  placement:
    clusters:
    - name: ai-cluster-west
    - name: ai-cluster-east
    - name: ai-cluster-europe
Conclusion
Building and managing Kubernetes clusters for AI workloads requires careful planning, specialized configuration, and ongoing optimization. This comprehensive guide provides the foundation for deploying production-ready AI infrastructure that can scale with your organization's needs.
Key takeaways for successful AI cluster deployment:
- Start with proper resource planning and node pool design
- Implement comprehensive monitoring from day one
- Prioritize security and compliance throughout the deployment
- Plan for scalability and future technology adoption
- Establish robust backup and recovery procedures
- Continuously optimize for performance and cost efficiency
The AI landscape continues to evolve rapidly, and your Kubernetes infrastructure should be designed to adapt to new technologies, frameworks, and deployment patterns. Regular updates, monitoring, and optimization will ensure your AI cluster remains efficient, secure, and capable of supporting your organization's machine learning initiatives.
For organizations just beginning their AI journey, start with a smaller cluster and gradually expand as you gain experience and understand your specific workload requirements. The investment in proper infrastructure will pay dividends in improved model performance, reduced operational overhead, and faster time-to-market for AI applications.