GPU Scheduling in Kubernetes: Deploying AI Workloads at Scale (Complete Guide 2025)

Last updated: November 2025
Reading time: 18 minutes
Level: Advanced (over 9000)

Efficient GPU management in Kubernetes has become a critical requirement for teams deploying AI workloads at scale. With searches for “kubernetes ai” up 300% during 2025 and datacenter GPUs (A100, H100) priced at up to $30,000 per unit, optimizing GPU usage is not optional; it is mandatory.

In this complete guide you will learn how to implement professional GPU scheduling in Kubernetes with the NVIDIA GPU Operator, configure partitioning strategies (MIG and Time-Slicing), and deploy real AI workloads (vLLM, Ollama, ComfyUI) with auto-scaling and monitoring.

Why You Need GPU Scheduling in Kubernetes

Imagine you have 4 A100 GPUs in your homelab or datacenter (an investment of ~$120,000). Without optimized GPU scheduling:

❌ Severe waste: average GPU utilization of 20-30%
❌ Runaway costs: you pay $120k but get only about $36k of real value
❌ Unpredictable latency: without guaranteed QoS, workloads interfere with each other
❌ No room to scale: you cannot run 10+ models on 4 GPUs

With proper GPU scheduling:

✅ 80%+ utilization: you use the full compute capacity
✅ More models per GPU: with MIG you can serve up to 28 instances on 4 A100 GPUs (7 per GPU)
✅ Guaranteed QoS: predictable P99 latency (58ms vs 145ms)
✅ Real ROI: $450/model vs $2,000/model (about 4.4x cheaper, a 77.5% reduction)

Table of Contents

Fundamentals: NVIDIA GPU Operator
MIG vs Time-Slicing: Partitioning Strategies
Installing the GPU Operator with Helm
MIG Configuration in Kubernetes
Time-Slicing Setup
Deploying vLLM on Kubernetes
Deploying Ollama with Multiple Models
ComfyUI on Kubernetes
GPU Scheduling Strategies
Monitoring GPUs with Prometheus + Grafana
Common Troubleshooting
Frequently Asked Questions

Fundamentals: NVIDIA GPU Operator {#fundamentos-nvidia-gpu-operator}

What It Is and Why You Need It

The NVIDIA GPU Operator is a Kubernetes operator that automates GPU management in your cluster. Before the GPU Operator, configuring GPUs in Kubernetes required:

Installing NVIDIA drivers manually on every node
Configuring the NVIDIA Container Toolkit
Installing the Device Plugin
Configuring DCGM for monitoring
Rebooting nodes multiple times

With the GPU Operator, all of this is automated with a single Helm command:

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace

GPU Operator Architecture

The GPU Operator manages 8 components. The operator itself is the control plane; most of the others run as DaemonSets on your GPU nodes:

Component	Function	Version
GPU Operator	Control plane, manages the lifecycle	v25.10.0
NVIDIA Driver	NVIDIA driver (CUDA support) on every GPU node	580.95.05
Container Toolkit	Runtime for GPU containers	v1.18.0
Device Plugin	Exposes GPUs as K8s resources	v0.18.0
DCGM Exporter	Prometheus metrics	v4.4.1
GPU Feature Discovery	Labels nodes automatically	v0.18.0
MIG Manager	Manages Multi-Instance GPU	v0.13.0
Node Feature Discovery	Detects PCI hardware	Latest

Operation Flow

1. NFD detects GPUs via the PCI vendor ID (10de = NVIDIA)
   ↓
2. GPU Operator deploys DaemonSets on the GPU nodes
   ↓
3. Driver DaemonSet installs the NVIDIA drivers
   ↓
4. Device Plugin registers with kubelet
   ↓
5. kubelet reports "nvidia.com/gpu" resources to the API server
   ↓
6. The scheduler can assign GPUs to Pods

Technical Advantages

Zero-touch provisioning: add a GPU node → drivers are installed automatically
Multi-cloud: works on EKS, AKS, GKE, and on-premise
Version management: upgrade drivers without manual SSH
Observability: automatic GPU metrics via DCGM

More info: Kubernetes AI in Production Guide

MIG vs Time-Slicing: Partitioning Strategies {#mig-vs-time-slicing}

Full Technical Comparison

Aspect	MIG (Multi-Instance GPU)	Time-Slicing
Isolation	✅ Hardware-level (L2 cache, memory controllers)	⚠️ Software-level (physical sharing)
QoS	✅ Guaranteed (dedicated bandwidth)	❌ Best-effort (contention possible)
P99 latency	✅ 58ms (ResNet50, A100)	⚠️ 145ms (2.5x worse)
Throughput	✅ 4,970 img/s	⚠️ 2,480 img/s (50% less)
Fairness	✅ Perfect (guaranteed resources)	❌ Variable (workload A can saturate the GPU)
Configuration	⚠️ Complex (requires labels, reboot)	✅ Simple (ConfigMap + device-plugin restart)
Compatible GPUs	⚠️ Only A100, H100, A30, H200	✅ Any NVIDIA GPU (V100, T4, RTX)
Max instances	⚠️ 7 (A100), 4 (A30)	✅ No hard limit (10+)
Overhead	✅ ~5%	⚠️ 15-20% (context switching)
Cost	✅ $450/model (10 models)	⚠️ $480/model (+6.7%)
Ideal use	Production inference/training	Development, legacy GPUs

Real-World Benchmarks

ResNet50 Inference (A100 40GB)

MIG 7x 1g.5gb:
  Latency P50: 42ms
  Latency P95: 51ms
  Latency P99: 58ms
  Throughput: 4,970 images/sec

Time-Slice 4x:
  Latency P50: 89ms
  Latency P95: 128ms
  Latency P99: 145ms
  Throughput: 2,480 images/sec

Result: MIG delivers 2.5x lower P99 latency and 2x higher throughput.

BERT Fine-tuning (H100 80GB, 4 tenants)

MIG 4x 2g.20gb:
  Training time: 8.1 minutes
  Fairness: Perfect (each tenant gets exactly 25%)

Time-Slice 4x:
  Training time: 2.1 hours (about 15x slower)
  Fairness: Poor (tenant 1: 60%, tenant 4: 10%)

Result: MIG is about 15x faster and guarantees fairness across tenants.
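
The ratios quoted in these results follow directly from the figures listed above. As a quick sanity check, here is a minimal Python sketch that recomputes them; the dictionary names are illustrative and the numbers are the ones quoted in this section, not new measurements.

```python
def ratio(larger: float, smaller: float) -> float:
    """Return how many times `larger` exceeds `smaller`."""
    return larger / smaller


# Figures quoted in the benchmarks above.
resnet_p99_ms = {"mig": 58, "time_slice": 145}
resnet_images_per_sec = {"mig": 4970, "time_slice": 2480}
bert_training_minutes = {"mig": 8.1, "time_slice": 2.1 * 60}

latency_gain = ratio(resnet_p99_ms["time_slice"], resnet_p99_ms["mig"])
throughput_gain = ratio(resnet_images_per_sec["mig"], resnet_images_per_sec["time_slice"])
training_speedup = ratio(bert_training_minutes["time_slice"], bert_training_minutes["mig"])

print(f"P99 latency: {latency_gain:.1f}x lower with MIG")
print(f"Throughput: {throughput_gain:.1f}x higher with MIG")
print(f"BERT fine-tuning: {training_speedup:.1f}x faster with MIG")
```

The same helper can be reused with your own measurements to compare the two strategies on your hardware.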

When to Use Each Strategy

Use MIG if:

✅ You have A100, H100, A30, or H200 GPUs
✅ You need guaranteed QoS (SLAs)
✅ You run production inference with latency-critical requirements
✅ You run multi-tenant training and need fairness
✅ You want to maximize utilization (70%+ target)

Use Time-Slicing if:

✅ You have legacy or consumer GPUs (V100, T4, RTX 4090)
✅ You run a development environment without SLAs
✅ You need maximum flexibility
✅ Your budget is limited (A100/H100 are out of reach)
✅ You run batch workloads without real-time requirements

More info: Homelab GPU 24GB+ Guide

Installing the GPU Operator with Helm {#instalacion-gpu-operator}

Prerequisites

# Check the Kubernetes version (minimum 1.20+)
kubectl version

# List nodes that already advertise NVIDIA GPUs (empty until a device plugin is running;
# before installation, check the hardware on the node itself, e.g. lspci | grep -i nvidia)
kubectl get nodes -o json | jq '.items[] | select(.status.allocatable."nvidia.com/gpu" != null) | .metadata.name'

# Install Helm 3 if you don't have it
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Basic Installation (Without MIG)

# Add the NVIDIA repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator (version as of Nov 2025)
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --create-namespace \
  --version=v25.10.0 \
  --wait

# Verify the installation
kubectl get pods -n gpu-operator

# You should see:
# gpu-operator-xxx            Running  (Control plane)
# nvidia-driver-daemonset     Running  (On every GPU node)
# nvidia-device-plugin        Running  (On every GPU node)
# nvidia-dcgm-exporter        Running  (On every GPU node)
# gpu-feature-discovery       Running  (On every GPU node)
Installation with MIG Enabled

# Install with the "mixed" MIG strategy
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --create-namespace \
  --version=v25.10.0 \
  --set mig.strategy=mixed \
  --set migManager.enabled=true \
  --set migManager.env[0].name=WITH_REBOOT \
  --set-string migManager.env[0].value=true \
  --wait

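Post-Installation Verification

# List available GPU resources
kubectl get nodes -o json | \
  jq '.items[] | {name:.metadata.name, gpu:.status.allocatable."nvidia.com/gpu"}'

# Show GPU node labels
kubectl get nodes --show-labels | grep nvidia

# Quick test: run nvidia-smi in a Pod
# (recent kubectl releases do not accept --limits on "kubectl run", so the GPU limit goes through --overrides)
kubectl run test-gpu --rm -it --restart=Never \
  --image=nvidia/cuda:12.3.0-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"test-gpu","image":"nvidia/cuda:12.3.0-base-ubuntu22.04","command":["nvidia-smi"],"stdin":true,"tty":true,"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'

# You should see nvidia-smi output showing your GPU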

MIG Configuration in Kubernetes {#configuracion-mig}

MIG-Compatible GPUs

GPU	Year	VRAM	Max MIG Instances	Available Profiles
A100 40GB	2020	40GB	7	1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, 7g.40gb
A100 80GB	2020	80GB	7	1g.10gb, 2g.20gb, 3g.40gb, 4g.40gb, 7g.80gb
H100 80GB	2022	80GB	7	1g.10gb, 2g.20gb, 3g.40gb, 4g.40gb, 7g.80gb
H200 141GB	2024	141GB	7	Profiles similar to H100
A30 24GB	2021	24GB	4	1g.6gb, 2g.12gb, 4g.24gb

Note: The RTX 4090, RTX 5090, and RTX 6000 Ada do NOT support MIG.

Configuring a MIG Profile on Nodes

# Select a MIG profile for the node (enabling MIG mode may require a node reboot)
kubectl label nodes gpu-node-1 \
  nvidia.com/mig.config=all-1g.5gb

# Predefined configurations available (instance counts for an A100 40GB):
# all-1g.5gb   → 7 small instances (5GB each)
# all-2g.10gb  → 3 medium instances (10GB each)
# all-3g.20gb  → 2 large instances (20GB each)
# all-4g.20gb  → 1 extra-large instance (20GB)
# all-disabled → MIG disabled (full GPU)

Custom ConfigMap for MIG

If you need a custom configuration (e.g. a mix of 2g.10gb + 1g.5gb):

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # Custom configuration: 3x 2g.10gb + 1x 1g.5gb = 7 compute slices
      custom-mixed:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.10gb": 3
            "1g.5gb": 1

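Apply the configuration:

kubectl apply -f mig-config.yaml

# Point MIG Manager at the custom ConfigMap (assumes the ClusterPolicy is named cluster-policy)
kubectl patch clusterpolicy/cluster-policy \
  --type merge \
  -p '{"spec": {"migManager": {"config": {"name": "mig-parted-config"}}}}'

# Apply it to the node
kubectl label nodes gpu-node-1 \
  nvidia.com/mig.config=custom-mixed --overwrite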
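
Before labeling a node with a custom layout, it can help to check the slice arithmetic. The following is a minimal Python sketch, assuming an A100 40GB with 7 compute slices; the function names are hypothetical, and the check only covers total compute slices and memory. The real MIG placement rules are stricter, so treat a passing result as necessary but not sufficient.

```python
import re

MAX_COMPUTE_SLICES = 7  # assumption: A100-class GPU with 7 compute slices


def parse_profile(profile: str) -> tuple[int, int]:
    """Split a profile name such as '2g.10gb' into (compute slices, memory in GB)."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
    if match is None:
        raise ValueError(f"unrecognized MIG profile: {profile!r}")
    return int(match.group(1)), int(match.group(2))


def layout_fits(layout: dict[str, int], total_memory_gb: int = 40) -> bool:
    """Check that a {profile: count} layout stays within compute and memory totals."""
    compute_used = 0
    memory_used = 0
    for profile, count in layout.items():
        compute, memory = parse_profile(profile)
        compute_used += compute * count
        memory_used += memory * count
    return compute_used <= MAX_COMPUTE_SLICES and memory_used <= total_memory_gb


# The custom-mixed layout from the ConfigMap: 3x 2g.10gb + 1x 1g.5gb.
print(layout_fits({"2g.10gb": 3, "1g.5gb": 1}))
# Four 2g.10gb instances would need 8 compute slices.
print(layout_fits({"2g.10gb": 4}))
```

For an 80GB card, pass `total_memory_gb=80` and use the corresponding profile names from the compatibility table.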

Deploying a Pod on a MIG Instance

apiVersion: v1
kind: Pod
metadata:
  name: inference-small
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
    args:
      - "--model=mistralai/Mistral-7B-Instruct-v0.3"
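    resources:
      requests:
        # A 7B model in fp16 needs ~14GB for weights alone, so it does not fit in a 1g.5gb slice.
        # Request a profile with enough memory (the node must be configured with it, e.g. all-3g.20gb).
        nvidia.com/mig-3g.20gb: 1
      limits:
        nvidia.com/mig-3g.20gb: 1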

More info: Fine-Tuning LLMs Complete Guide

Time-Slicing Setup {#time-slicing-setup}

Advantage: Compatible with All GPUs

Time-slicing works with any NVIDIA GPU, including:

✅ RTX 4090, RTX 5090 (homelabs)
✅ Tesla V100, T4 (legacy cloud)
✅ A5000, A6000 (workstations)

ConfigMap Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Each physical GPU → 4 virtual GPUs

Apply the ConfigMap:

kubectl create -f time-slicing-config.yaml

# Configure the GPU Operator to use time-slicing ("default" selects the "any" key of the ConfigMap)
kubectl patch clusterpolicy/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'

# Restart the device plugin to apply the changes
kubectl delete pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

Verifying That Time-Slicing Is Active

# List GPU resources (you should see 4x more GPUs available)
kubectl get nodes -o json | \
  jq '.items[] | {name:.metadata.name, gpu:.status.allocatable."nvidia.com/gpu"}'

# Example: if you had 2 physical GPUs, you will now see "8"

Deploying Pods on Time-Sliced GPUs

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-cluster
spec:
  replicas: 8  # 8 pods sharing 2 physical GPUs
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # Each pod gets one time-sliced replica (a shared GPU, not a guaranteed 1/4)

Important limitations:

⚠️ No memory isolation (pods share VRAM)
⚠️ Context-switching overhead (~15-20%)
⚠️ No guaranteed QoS (variable latency)
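
Because time-slicing does not isolate memory, it is worth planning capacity by hand. The sketch below is a rough planning aid under stated assumptions: the function names are hypothetical, the 24GB figure is an illustrative card size, and the per-pod budget is an even split that nothing enforces at runtime.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources a node advertises with time-slicing."""
    return physical_gpus * replicas


def vram_budget_per_pod_gb(vram_gb: float, replicas: int) -> float:
    """Even VRAM split per replica; a planning figure only, not an enforced limit."""
    return vram_gb / replicas


# The example from this section: 2 physical GPUs with replicas: 4.
print(advertised_gpus(physical_gpus=2, replicas=4))
# Illustrative 24GB card shared by 4 replicas.
print(vram_budget_per_pod_gb(vram_gb=24, replicas=4))
```

If the models you plan to co-locate need more than the per-pod budget, lower `replicas` or move the workload to MIG.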

More info: Ollama vs LM Studio Comparison

Deploying vLLM on Kubernetes {#deploy-vllm-kubernetes}

vLLM is one of the fastest LLM inference engines available in 2025, with support for:

✅ OpenAI-compatible API
✅ Tensor parallelism (multi-GPU)
✅ Prefix caching (KV cache optimization)
✅ Continuous batching

Complete Deployment: Mistral-7B

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-mistral-7b
  namespace: ai-inference
spec:
  replicas: 2  # High availability
  selector:
    matchLabels:
      app: vllm-mistral
  template:
    metadata:
      labels:
        app: vllm-mistral
    spec:
      # GPU Scheduling
      nodeSelector:
        nvidia.com/gpu.present: "true"  # label set by the GPU Operator on GPU nodes

      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.10.0

        command: ["/bin/sh", "-c"]
        args:
        - |
          python3 -m vllm.entrypoints.openai.api_server \
            --model mistralai/Mistral-7B-Instruct-v0.3 \
            --trust-remote-code \
            --enable-chunked-prefill \
            --max-num-batched-tokens 1024 \
            --gpu-memory-utilization 0.95 \
            --kv-cache-dtype auto \
            --dtype float16 \
            --host 0.0.0.0 \
            --port 8000

        ports:
        - containerPort: 8000
          name: http

        resources:
          requests:
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: "16"
            memory: 32Gi
            nvidia.com/gpu: "1"

        # Health Checks
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10

        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 5

        volumeMounts:
        - name: cache
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm

      volumes:
      - name: cache
        emptyDir:
          sizeLimit: 20Gi
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ai-inference
spec:
  type: ClusterIP
  selector:
    app: vllm-mistral
  ports:
  - port: 8000
    targetPort: 8000
  sessionAffinity: ClientIP  # Important for prefix caching
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-mistral-7b
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Testing the vLLM Deployment

# Local port-forward
kubectl port-forward -n ai-inference svc/vllm-service 8000:8000 &

# Test OpenAI-compatible API
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": "Explain GPU scheduling in Kubernetes:",
    "max_tokens": 256,
    "temperature": 0.7
  }'
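
The same request can be issued from Python using only the standard library. This is a minimal sketch: `build_completion_request` is a hypothetical helper, and it assumes the port-forward above is active on localhost:8000 when the request is actually sent.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 256,
                             temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request(
    "http://localhost:8000",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "Explain GPU scheduling in Kubernetes:",
)

# To send it once the port-forward is running:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

Building the request separately from sending it keeps the payload easy to inspect and reuse in scripts or smoke tests.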

More info: vLLM Deploying LLMs in Production

Deploying Ollama with Multiple Models {#deploy-ollama-kubernetes}

Ollama is well suited to self-hosted LLMs with several models served side by side.

StatefulSet with Persistent Storage

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai-inference
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi  # Multiple models need disk space
  storageClassName: fast-ssd
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ai-inference
spec:
  serviceName: ollama-headless
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"  # label set by the GPU Operator on GPU nodes

      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

      initContainers:
      - name: preload-models
        image: ollama/ollama:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Start a temporary server, pull the models into the shared volume, then stop it
          ollama serve &
          SERVER_PID=$!
          sleep 10
          ollama pull llama3.2:3b
          ollama pull mistral:7b
          ollama pull codellama:13b
          ollama pull phi3:mini
          kill "$SERVER_PID"
        volumeMounts:
        - name: models
          mountPath: /root/.ollama

      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
          name: http

        resources:
          requests:
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: "16"
            memory: 64Gi
            nvidia.com/gpu: "1"

        volumeMounts:
        - name: models
          mountPath: /root/.ollama

        livenessProbe:
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 10

      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-inference
spec:
  type: ClusterIP
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434

Integrating with Open WebUI

apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
      - name: webui
        image: ghcr.io/open-webui/open-webui:main
        ports:
        - containerPort: 8080
        env:
        - name: OLLAMA_BASE_URL
          value: "http://ollama-service:11434"
        - name: WEBUI_SECRET_KEY
          value: "your-secret-key-here"
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui
  namespace: ai-inference
spec:
  type: LoadBalancer
  selector:
    app: open-webui
  ports:
  - port: 80
    targetPort: 8080

More info: Ollama Complete Tutorial

ComfyUI on Kubernetes {#comfyui-kubernetes}

ComfyUI is a powerful node-based engine for image-generation workflows with Stable Diffusion.

Deployment with GPU

apiVersion: apps/v1
kind: Deployment
metadata:
  name: comfyui
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: comfyui
  template:
    metadata:
      labels:
        app: comfyui
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"  # label set by the GPU Operator on GPU nodes

      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

      containers:
      - name: comfyui
        image: comfyui/comfyui:latest  # verify this image exists in your registry, or build your own
        ports:
        - containerPort: 8188

        resources:
          requests:
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: "16"
            memory: 64Gi
            nvidia.com/gpu: "1"

        volumeMounts:
        - name: models
          mountPath: /app/models
        - name: output
          mountPath: /app/output

      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: comfyui-models-pvc
      - name: output
        emptyDir:
          sizeLimit: 50Gi
---
apiVersion: v1
kind: Service
metadata:
  name: comfyui-service
  namespace: ai-inference
spec:
  type: LoadBalancer
  selector:
    app: comfyui
  ports:
  - port: 8188
    targetPort: 8188

More info: ComfyUI vs Stable Diffusion WebUI

GPU Scheduling Strategies {#gpu-scheduling-strategies}

1. Bin-Packing (Maximum Utilization)

Packs workloads onto the fewest possible nodes to maximize GPU utilization.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: gpu-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 100
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated  # Bin-packing mode
        resources:
        - name: nvidia.com/gpu
          weight: 100
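
To build intuition for what a bin-packing strategy does, here is a simplified Python sketch. It is not the scheduler's implementation: it assumes a single resource (GPUs), ignores plugin weights and all other scoring plugins, and the node names and function names are hypothetical. Each feasible node is scored by its allocated fraction after placing the pod, and the fullest node wins.

```python
from typing import Optional


def most_allocated_score(used: int, capacity: int, request: int) -> Optional[float]:
    """Score 0-100 by allocated fraction after placement; None if the pod does not fit."""
    if capacity <= 0 or used + request > capacity:
        return None
    return 100.0 * (used + request) / capacity


def pick_node(nodes: dict[str, tuple[int, int]], request: int) -> Optional[str]:
    """Pick the feasible node with the highest score. nodes maps name -> (used, capacity)."""
    best_name = None
    best_score = -1.0
    for name, (used, capacity) in nodes.items():
        score = most_allocated_score(used, capacity, request)
        if score is not None and score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical cluster: GPUs already allocated / GPUs available per node.
cluster = {"gpu-node-1": (6, 8), "gpu-node-2": (1, 8), "gpu-node-3": (8, 8)}
print(pick_node(cluster, request=1))  # the fullest node that still fits
print(pick_node(cluster, request=4))  # only the emptier node fits
```

A spreading strategy would pick the lowest score instead, which leaves every node partly used and makes it harder to free whole nodes for scale-down.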

2. Affinity and Anti-Affinity

apiVersion: v1
kind: Pod
metadata:
  name: training-job
  labels:
    workload-type: training  # matched by the podAntiAffinity rule in this spec
spec:
  affinity:
    # Prefer nodes with A100 GPUs
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - A100-SXM4-40GB
            - A100-PCIE-40GB

    # Do NOT run on the same node as other training jobs
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: workload-type
            operator: In
            values:
            - training
        topologyKey: kubernetes.io/hostname

  containers:
  - name: pytorch
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 1

3. Taints and Tolerations

# Mark GPU nodes as "AI workloads only"
kubectl taint nodes gpu-node-1 \
  nvidia.com/gpu=true:NoSchedule

kubectl taint nodes gpu-node-2 \
  workload=training:NoSchedule

apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: workload
    operator: Equal
    value: training
    effect: NoSchedule

  containers:
  - name: app
    image: my-app:latest
    resources:
      limits:
        nvidia.com/gpu: 1

4. PriorityClass for Critical Workloads

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000000
globalDefault: false
description: "High priority for production inference"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-low-priority
value: 100
globalDefault: false
description: "Low priority for training/dev"
---
apiVersion: v1
kind: Pod
metadata:
  name: production-inference
spec:
  priorityClassName: gpu-high-priority  # This pod can preempt lower-priority pods
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    resources:
      limits:
        nvidia.com/gpu: 1

Monitoring GPUs with Prometheus + Grafana {#monitoreo-gpus}

DCGM Exporter (Already Included in the GPU Operator)

The GPU Operator deploys DCGM Exporter automatically. Available metrics:

# GPU Utilization (%)
DCGM_FI_DEV_GPU_UTIL

# GPU Memory Used (MiB)
DCGM_FI_DEV_FB_USED

# GPU Temperature (°C)
DCGM_FI_DEV_GPU_TEMP

# Power Usage (W)
DCGM_FI_DEV_POWER_USAGE

# XID Errors (value of the last XID error seen on the GPU)
DCGM_FI_DEV_XID_ERRORS

Useful Prometheus Queries

# Average GPU utilization per GPU and node
avg by (gpu, Hostname) (DCGM_FI_DEV_GPU_UTIL)

# GPU memory usage > 90% (used / (used + free))
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9

# GPUs with temperature > 80°C
DCGM_FI_DEV_GPU_TEMP > 80

# GPUs whose last XID error was 48 (double-bit ECC error)
DCGM_FI_DEV_XID_ERRORS == 48
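
A memory-usage threshold only makes sense as a fraction of total framebuffer memory, that is used / (used + free); dividing used by free gives a value that grows without bound as the GPU fills up. The short Python sketch below illustrates the difference with made-up sample values (the function name and numbers are illustrative).

```python
def fb_usage_ratio(used_mib: float, free_mib: float) -> float:
    """Fraction of framebuffer memory in use: used / (used + free)."""
    total = used_mib + free_mib
    if total <= 0:
        raise ValueError("used + free must be positive")
    return used_mib / total


# Illustrative sample: 30,000 MiB used and 10,000 MiB free.
used, free = 30_000, 10_000
print(fb_usage_ratio(used, free))  # fraction of total memory in use
print(used / free)                 # not a usage fraction: exceeds 1.0 long before the GPU is full
```

With these sample values, a "> 0.9" threshold on used / free would already fire at roughly 47% real usage, which is why the fraction-of-total form is the safer expression for alerts.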

Grafana Dashboard

Import the public dashboard with ID 23382 (NVIDIA DCGM Exporter):

# Port-forward Grafana
kubectl port-forward -n monitoring svc/grafana 3000:80 &

# Open http://localhost:3000
# Dashboard > Import > ID: 23382

Panels included:

GPU Utilization over time
GPU Memory Usage
GPU Temperature
Power Consumption
XID Errors (hardware failures)
Per-Pod GPU metrics

PrometheusRule Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
spec:
  groups:
  - name: gpu-health
    interval: 30s
    rules:
    # GPU temperature > 85°C
    - alert: GPUHighTemperature
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} on node {{ $labels.Hostname }} has a high temperature"
        description: "Temperature: {{ $value }}°C"

    # GPU utilization < 10% for 30 min (waste)
    - alert: GPULowUtilization
      expr: DCGM_FI_DEV_GPU_UTIL < 10
      for: 30m
      labels:
        severity: info
      annotations:
        summary: "GPU {{ $labels.gpu }} is underutilized"
        description: "Utilization: {{ $value }}% for 30 min"

    # GPU memory > 95% (OOM risk)
    - alert: GPUMemoryHigh
      expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "GPU {{ $labels.gpu }} memory is critical"
        description: "Memory used: {{ $value | humanizePercentage }}"

    # XID errors (the metric holds the last XID code; 0 means no error)
    - alert: GPUXIDError
      expr: DCGM_FI_DEV_XID_ERRORS > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "GPU {{ $labels.gpu }} XID error detected"
        description: "XID: {{ $value }}"

Common Troubleshooting {#troubleshooting}

1. Pod Stuck in “Pending” with a GPU Request

Symptom:

kubectl describe pod my-pod
# Events: 0/3 nodes are available: insufficient nvidia.com/gpu

Possible causes:

a) No GPUs are available in the cluster:

kubectl get nodes -o json | \
  jq '.items[] | {name:.metadata.name, gpu:.status.allocatable."nvidia.com/gpu"}'

b) The GPU Operator is not running:

kubectl get pods -n gpu-operator
# Check that all pods are Running

c) The GPU nodes have taints:

kubectl get nodes -o json | jq '.items[] | select(.spec.taints != null) | {name:.metadata.name, taints:.spec.taints}'

# Fix: add tolerations to the Pod

2. CUDA Out of Memory (OOM)

Symptom:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 39.59 GiB total capacity)

Solutions:

a) Reduce the batch size or the model size
b) Use MIG instead of time-slicing (memory isolation)
c) Increase terminationGracePeriodSeconds to allow cleanup:

spec:
  terminationGracePeriodSeconds: 60

d) Monitor memory actively:

kubectl exec -it <pod> -- nvidia-smi

3. Device Plugin Does Not Report GPUs

Symptom:

kubectl get nodes -o json | jq '.items[0].status.allocatable'
# "nvidia.com/gpu" does not appear

Debugging:

# View the device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Check the drivers on the node
kubectl exec -n gpu-operator nvidia-driver-daemonset-xxx -- nvidia-smi

# Restart the device plugin
kubectl delete pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

4. MIG Configuration Is Not Applied

Symptom:

nvidia-smi -L
# Does not show MIG instances

Solution:

# Check the label on the node
kubectl get nodes --show-labels | grep mig.config

# Apply the correct label
kubectl label nodes gpu-node-1 \
  nvidia.com/mig.config=all-1g.5gb --overwrite

# Check the MIG Manager logs
kubectl logs -n gpu-operator -l app=nvidia-mig-manager

# In some cases a node reboot is required
kubectl drain gpu-node-1 --ignore-daemonsets --delete-emptydir-data
# SSH into the node and reboot
kubectl uncordon gpu-node-1

5. XID Errors in DCGM

Symptom:

DCGM_FI_DEV_XID_ERRORS > 0

XID Error Codes:

XID	Meaning	Action
13	Graphics Engine Exception	Restart the workload, then the driver if it persists
31	GPU Memory Page Fault	Check the application for invalid memory access, then the hardware
43	GPU Stopped Processing	Restart the workload
48	Double Bit ECC Error	Hardware failure – RMA GPU
63	ECC Page Retirement / Row Remapping Event	Reset the GPU; RMA if it keeps recurring

# View errors on the node
kubectl exec -n gpu-operator nvidia-dcgm-exporter-xxx -- \
  dcgmi stats -g 0

# If XID 48 (or recurring 63): the GPU hardware is likely faulty
# Taint the node and replace the GPU
kubectl taint nodes gpu-node-1 \
  gpu-health=failed:NoSchedule

More info: Homelab Monitoring with Uptime Kuma

Frequently Asked Questions {#faq}

How many GPUs do I need to get started with GPU scheduling in Kubernetes?

A single GPU is enough to learn. With 1 GPU you can:

Configure the GPU Operator
Try time-slicing (4+ workloads on 1 GPU)
Deploy Ollama or vLLM

For production, I recommend at least 2-4 GPUs for redundancy and scaling.

Does MIG work on the RTX 4090 or RTX 5090?

No, consumer RTX GPUs do NOT support MIG. MIG only works on:

A100, A30 (datacenter Ampere)
H100, H200 (datacenter Hopper)
B200 (datacenter Blackwell)

For the RTX 4090/5090, use time-slicing.

What is the overhead of time-slicing vs MIG?

Time-slicing:

Overhead: ~15-20% (context switching)
Latency: 2.5x worse than MIG (P99: 145ms vs 58ms)

MIG:

Overhead: ~5% (hardware isolation)
Latency: predictable and low

Can I mix MIG and time-slicing in the same cluster?

Yes, you can have:

Nodes with A100 using MIG (production)
Nodes with V100/T4 using time-slicing (development)

Use nodeSelector to route workloads to the right type of node.

How do I optimize costs with GPU scheduling?

Proven strategies:

MIG in production → about 4.4x lower cost per model ($450/model vs $2,000)
Time-slicing in dev/staging → share 1 GPU among 5+ developers
Bin-packing scheduler → consolidate workloads onto fewer nodes
Aggressive autoscaling → scale to zero when there is no traffic
Spot instances → cloud GPUs up to 70% cheaper

Real ROI: with 4 A100s ($120k) and MIG configured correctly:

Without MIG: 4 models served → $30,000/model
With MIG: 28 models served → $4,286/model (7x lower cost per model)
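
The per-model figures above are simple divisions of the hardware cost by the number of models served. A minimal Python sketch of that arithmetic, using the numbers from this answer (the function name is illustrative):

```python
def cost_per_model(total_cost: float, models_served: int) -> float:
    """Hardware cost attributed to each served model."""
    if models_served <= 0:
        raise ValueError("models_served must be positive")
    return total_cost / models_served


hardware_cost = 120_000  # 4x A100, as quoted above

without_mig = cost_per_model(hardware_cost, 4)    # one model per GPU
with_mig = cost_per_model(hardware_cost, 28)      # 7 MIG instances per GPU

print(f"Without MIG: ${without_mig:,.0f}/model")
print(f"With MIG: ${with_mig:,.0f}/model")
print(f"Cost ratio: {without_mig / with_mig:.1f}x")
```

The same function works for cloud pricing if you pass a monthly bill instead of a purchase price.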

How do I scale vLLM horizontally?

Use a HorizontalPodAutoscaler with custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-mistral-7b
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        # Requires a custom metrics adapter (e.g. Prometheus Adapter) that exposes this metric to the HPA
        name: vllm_num_requests_running
      target:
        type: AverageValue
        averageValue: "10"  # Scale out when the average exceeds 10 running requests per pod
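
To reason about how this autoscaler reacts, it helps to apply the basic HPA rule by hand: desired replicas = ceil(current replicas × current metric / target metric), clamped to the min/max bounds. The Python sketch below implements only that core formula; it deliberately leaves out the tolerance band and stabilization windows that the real controller applies, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_avg: float, target_avg: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Core HPA scaling rule, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_avg / target_avg)
    return max(min_replicas, min(max_replicas, raw))


# 2 replicas averaging 25 running requests each, target of 10 per pod.
print(desired_replicas(2, current_avg=25, target_avg=10, min_replicas=2, max_replicas=20))
# Low traffic never drops below minReplicas.
print(desired_replicas(2, current_avg=3, target_avg=10, min_replicas=2, max_replicas=20))
```

Running a few scenarios like these is a quick way to choose `averageValue` and `maxReplicas` before load testing.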

Which LLMs can I run on 1x A100 40GB?

Model	VRAM Required	Fits on A100 40GB?	Configuration
Mistral-7B	~14GB (fp16)	✅ Yes	1x full GPU or a MIG 3g.20gb instance
Llama-13B	~26GB (fp16)	✅ Yes	1x full GPU
Llama-70B	~140GB (fp16)	❌ No	Requires 4x A100 with tensor parallelism
Llama-70B	~70GB (int8 quantized)	⚠️ Not on a single 40GB GPU	1x A100 80GB or 2x A100 40GB
Mixtral-8x7B	~90GB (fp16)	❌ No	Requires 3x A100 40GB

Tip: Use quantization (int8, int4) to run larger models in less VRAM.
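
The VRAM column can be approximated from the parameter count and the numeric precision: weights take roughly (parameters × bits per parameter) / 8 bytes. The Python sketch below covers weights only, under the assumption of 1 GB = 10^9 bytes; KV cache, activations, and framework overhead come on top, so real requirements are higher. The function name is illustrative.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


for name, params in [("Mistral-7B", 7), ("Llama-13B", 13), ("Llama-70B", 70)]:
    sizes = {bits: weight_memory_gb(params, bits) for bits in (16, 8, 4)}
    print(f"{name}: fp16 ~{sizes[16]:.0f}GB, int8 ~{sizes[8]:.0f}GB, int4 ~{sizes[4]:.0f}GB")
```

Leave headroom beyond these figures when sizing a GPU or MIG profile, especially for long contexts and large batch sizes.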

Does the GPU Operator work on on-premise clusters?

Yes. It works on:

✅ Bare-metal Kubernetes
✅ Proxmox with GPU passthrough
✅ VMware vSphere with vGPU
✅ RKE, K3s, and kubeadm clusters

Only requirement: nodes must have NVIDIA GPUs, either physical or attached via PCI passthrough.

How do I monitor GPU costs per tenant or namespace?

Use Kubecost with DCGM metrics:

helm install kubecost kubecost/cost-analyzer \
  -n kubecost --create-namespace \
  --set kubecostToken="YOUR_TOKEN"

Kubecost automatically detects nvidia.com/gpu resources and calculates:

Cost per namespace
Cost per deployment
GPU utilization efficiency
Idle GPU cost

Dashboard: http://kubecost/allocation

Can I use AMD or Intel GPUs with the GPU Operator?

No, the GPU Operator is NVIDIA-specific. For AMD GPUs:

Use the AMD GPU Device Plugin: https://github.com/RadeonOpenCompute/k8s-device-plugin

For Intel GPUs (Arc, Flex):

Use the Intel GPU Device Plugin: https://github.com/intel/intel-device-plugins-for-kubernetes

What happens if a GPU node fails while a model is training?

It depends on your configuration:

Without checkpointing:

❌ Training progress is lost completely
❌ Training restarts from epoch 0

With checkpointing (recommended):

import torch

# Save a checkpoint every 10 epochs (runs inside the training loop;
# epoch, model, optimizer, and loss come from that loop)
if epoch % 10 == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, f'/persistent-volume/checkpoint_{epoch}.pt')

# PersistentVolume for checkpoints (Pod spec fragment, YAML)
volumes:
- name: checkpoints
  persistentVolumeClaim:
    claimName: training-checkpoints-pvc

Kubernetes features:

restartPolicy: OnFailure → Restarts the pod automatically
backoffLimit: 10 (for Jobs) → Automatic retries
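
Checkpoints only help if the restarted pod picks up the newest one. Here is a minimal, standard-library Python sketch of that lookup, assuming the `checkpoint_{epoch}.pt` naming used above; `latest_checkpoint` is a hypothetical helper, and the restore calls are shown as comments because they depend on your training code.

```python
import re
from pathlib import Path
from typing import Optional

CHECKPOINT_PATTERN = re.compile(r"checkpoint_(\d+)\.pt")


def latest_checkpoint(directory: str) -> Optional[Path]:
    """Return the checkpoint file with the highest epoch number, or None if there is none."""
    best_epoch = -1
    best_path = None
    for path in Path(directory).glob("checkpoint_*.pt"):
        match = CHECKPOINT_PATTERN.fullmatch(path.name)
        # Compare epochs numerically so that epoch 10 ranks above epoch 9.
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = path
    return best_path


# At pod start, before the training loop:
# path = latest_checkpoint("/persistent-volume")
# if path is not None:
#     state = torch.load(path)
#     model.load_state_dict(state["model_state_dict"])
#     optimizer.load_state_dict(state["optimizer_state_dict"])
#     start_epoch = state["epoch"] + 1
```

Comparing epochs as integers matters: a plain string sort would place `checkpoint_9.pt` after `checkpoint_10.pt`.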

How do I prevent one workload from monopolizing all the GPUs?

Use a ResourceQuota per namespace:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-alpha
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # Max 4 GPUs for team-alpha
    limits.nvidia.com/gpu: "4"

Combine it with a PriorityClass to control which workloads are preempted first under contention:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: team-alpha-priority
value: 1000
preemptionPolicy: PreemptLowerPriority

Does GPU scheduling work with Kubernetes autoscaling (Karpenter, Cluster Autoscaler)?

Yes. Karpenter v0.32+ has native support for GPUs:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: nvidia.com/gpu.product
        operator: In
        values: ["A100-SXM4-40GB", "L40", "H100-80GB-HBM3"]

      nodeClassRef:
        name: gpu-ec2-nodeclass

  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h

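Result: Karpenter provisions GPU nodes automatically when pods requesting nvidia.com/gpu are Pending. Note that nvidia.com/gpu.product is a label applied by GPU Feature Discovery once a node is running; on AWS, Karpenter picks instance types through its own well-known labels such as karpenter.k8s.aws/instance-gpu-name, so check which labels your provider honors.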

How long does GPU Operator take to provision a new node?

Typical timeline:

0 min: GPU node joins the cluster
1 min: NFD detects the GPU (PCI vendor ID 10de)
2 min: GPU Operator deploys its DaemonSets
5 min: Driver DaemonSet compiles the kernel module
8 min: Device Plugin registers the GPUs with the kubelet
10 min: Node reports nvidia.com/gpu resources

Total: ~10 minutes from an empty node to available GPUs.

Tip to speed this up: use pre-built drivers (a driver container with a precompiled kernel module).
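
The last step of the timeline can be checked programmatically. Below is a minimal sketch that reads a Node object, for example the output of `kubectl get node <name> -o json`, and reports whether GPUs are advertised yet; the function names are hypothetical.

```python
import json


def allocatable_gpus(node):
    """Number of nvidia.com/gpu resources a Node object advertises."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get("nvidia.com/gpu", "0"))


def gpu_ready(node_json_text):
    """True once the node reports at least one allocatable GPU."""
    return allocatable_gpus(json.loads(node_json_text)) > 0
```

Polling `gpu_ready` from a provisioning script is one way to measure how long the sequence actually takes in your own cluster.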

Can I run CPU-only and GPU workloads in the same cluster?

Yes, this is the recommended setup. Use taints to keep them apart:

# GPU nodes: GPU workloads only
kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule

# CPU nodes: no taint (default workloads)

CPU-only workloads carry no toleration, so the scheduler places them on CPU nodes.
GPU workloads need a matching toleration; their nvidia.com/gpu request then confines them to GPU nodes, since only those nodes advertise the resource.
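
How a toleration is matched against the taint above can be sketched in a few lines. This is a simplified model of the matching rules described in the Kubernetes documentation, not the scheduler's code; the dictionaries mirror the `tolerations` and `taints` fields, and both function names are hypothetical.

```python
def tolerates(toleration, taint):
    """True if a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect matches every effect
    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")  # Equal is the default
    if not key:
        return operator == "Exists"  # empty key + Exists matches every taint
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations, node_taints):
    """A pod may land on a node only if every hard taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )
```

A pod with no tolerations fails `schedulable` against the tainted GPU node, while a pod tolerating the `nvidia.com/gpu` key with operator `Exists` passes.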

Conclusion

GPU scheduling in Kubernetes is essential to maximize the ROI of your GPU infrastructure. In this guide you have learned:

✅ Fundamentals: GPU Operator and its component architecture
✅ Strategies: MIG vs Time-Slicing (when to use each)
✅ Real deployments: vLLM, Ollama, ComfyUI with working YAML manifests
✅ Advanced scheduling: Bin-packing, affinity, taints, PriorityClass
✅ Monitoring: DCGM Exporter + Prometheus + Grafana dashboards
✅ Troubleshooting: OOM, scheduling failures, XID errors

Recommended Next Steps

Install GPU Operator in your cluster (30 min)
Try time-slicing with 1 GPU (15 min)
Deploy vLLM with Mistral-7B (20 min)
Set up DCGM + Grafana monitoring (30 min)
If you have A100/H100 GPUs: try MIG profiles (1 hour)

Additional Resources

Stack *arr with Prowlarr: media automation in a style similar to K8s
Dockge vs Portainer: Docker management before moving to K8s
Physical AI: Robotics + LLMs: the next level, AI in robots

Do you have a homelab with GPUs that you want to optimize? Leave a comment with your setup and I will help you design the best GPU scheduling strategy.

Tags: kubernetes, gpu scheduling, nvidia, mig, time-slicing, vllm, ollama, comfyui, a100, h100, homelab, mlops, ai infrastructure

Last updated: November 4, 2025

By ziru
