Hidden pitfalls when the K8s HPA refuses to scale out

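You have an HPA (HorizontalPodAutoscaler) configured, traffic ramps up, and the Pods stubbornly refuse to scale out. It is one of the most frequent traps in K8s operations. Here are the 6 typical causes I have run into.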

Commands to start with

kubectl describe hpa <name> -n <ns>
kubectl get hpa <name> -n <ns> -w
kubectl logs -n kube-system deploy/metrics-server | tail

The Conditions and Events at the bottom of the describe hpa output are the key clues.

The most common cause is a missing or broken metrics-server. If kubectl top pods fails with the error below, this is it:

error: Metrics API not available

Fix:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

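On clusters whose kubelets use self-signed certificates, you also have to add --kubelet-insecure-tls to the metrics-server container args.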
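
One way to add that flag without hand-editing the manifest is a JSON patch. This is only a sketch: it assumes the stock components.yaml layout (Deployment metrics-server in kube-system, the first container, an args list that already exists), and the function name is my own.

```shell
# Append --kubelet-insecure-tls to metrics-server, wait for the rollout,
# then check that the Metrics API answers.
# Assumption: stock components.yaml layout (container index 0 with an args list).
enable_kubelet_insecure_tls() {
  kubectl -n kube-system patch deployment metrics-server --type=json \
    -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]' &&
  kubectl -n kube-system rollout status deployment/metrics-server --timeout=120s &&
  kubectl top nodes
}
```

Call enable_kubelet_insecure_tls once; if kubectl top nodes prints numbers at the end, the Metrics API is back. The flag turns off kubelet certificate verification, so I would treat it as a lab or stopgap setting.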

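For a utilization target, the HPA computes current usage / requests. If requests are not set, it has nothing to divide by and cannot compute anything.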

kubectl get hpa <name> -n <ns>
# <unknown> in the TARGETS column (e.g. <unknown>/50%) while metrics-server is healthy points to missing requests

Fix: in the Deployment, give the container a resources.requests.cpu value, for example 100m.
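
Requests have to be present on every container in the Pod, sidecars included, or utilization stays undefined. Here is a small sketch for spotting the gaps; the function name and its two arguments are placeholders of mine.

```shell
# Print the containers of a Deployment that have no CPU request.
# Usage: containers_missing_cpu_request <namespace> <deployment>
containers_missing_cpu_request() {
  # One line per container: name, a tab, then the CPU request (empty if unset).
  kubectl get deployment "$2" -n "$1" \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources.requests.cpu}{"\n"}{end}' |
  awk -F'\t' '$2 == "" { print $1 }'
}
```

Anything it prints still needs a request, for instance via kubectl set resources deployment web -n prod -c sidecar --requests=cpu=100m (web, prod and sidecar are made-up names).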

The HPA formula:

desiredReplicas = ceil(currentReplicas × (currentMetric / desiredMetric))

Example: 2 Pods running, average CPU utilization 60%, target 50%:

desired = ceil(2 × (60/50)) = ceil(2.4) = 3

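There is, however, a 10% tolerance band (horizontal-pod-autoscaler-tolerance, default 0.1): the HPA acts only when currentMetric / desiredMetric falls outside 0.9 to 1.1. Here 60/50 = 1.2 is above 1.1, so it scales out. With a 50% target, a currentMetric that stays between 45% and 55% never triggers scaling in either direction.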
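
To play with the numbers, the formula plus the tolerance check can be sketched in a few lines of shell. This is a simplification of mine, not the controller's code: it ignores minReplicas/maxReplicas, unready Pods and missing samples, and hpa_desired is a made-up name.

```shell
# desiredReplicas with the tolerance band applied.
# Usage: hpa_desired <currentReplicas> <currentMetric> <desiredMetric> [tolerance]
hpa_desired() {
  awk -v r="$1" -v cur="$2" -v tgt="$3" -v tol="${4:-0.1}" 'BEGIN {
    ratio = cur / tgt
    diff = ratio - 1; if (diff < 0) diff = -diff
    if (diff <= tol) { print r; exit }      # inside the band: keep current replicas
    d = r * ratio
    c = int(d); if (c < d) c++              # ceil()
    print c
  }'
}

hpa_desired 2 60 50   # prints 3 (ratio 1.2, outside the band)
hpa_desired 2 54 50   # prints 2 (ratio 1.08, inside the band)
hpa_desired 4 20 50   # prints 2 (ratio 0.4, scale in)
```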

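HPA v2 introduced behavior, and the default scale-down stabilization window is 5 minutes. If traffic has dropped but the Pods are still there, that may be perfectly normal. These are the defaults: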

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
  scaleUp:
    stabilizationWindowSeconds: 0

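For a faster scale-down, lower scaleDown.stabilizationWindowSeconds. The scaleUp window is already 0 by default, so scale-up is not being delayed by it.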
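
For a quick experiment, the window can be changed on a live autoscaling/v2 HPA with a merge patch. A sketch under assumptions: the function name and arguments are mine, and the API accepts values from 0 to 3600 seconds.

```shell
# Set the scale-down stabilization window on an autoscaling/v2 HPA.
# Usage: set_scaledown_window <namespace> <hpa-name> <seconds>
set_scaledown_window() {
  kubectl patch hpa "$2" -n "$1" --type=merge \
    -p "{\"spec\":{\"behavior\":{\"scaleDown\":{\"stabilizationWindowSeconds\":$3}}}}"
}
```

For example, set_scaledown_window prod web 60 (made-up names) shortens the window to one minute. A shorter window makes replica flapping more likely, and if the HPA is managed from Git the change belongs in the manifest rather than in a live patch.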

When scaling on custom or external metrics through Prometheus Adapter or KEDA:

kubectl get apiservice | grep -E 'custom\.metrics|external\.metrics'
# v1beta1.custom.metrics.k8s.io (Prometheus Adapter) or v1beta1.external.metrics.k8s.io (KEDA) must be Available

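If it is not Available, the adapter is down or its configuration did not take effect. After changing the Adapter config you must restart its Pod; it does not hot-reload.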
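
The Available check can be scripted so it also reports a missing registration. A minimal sketch, assuming the two usual APIService names; metrics_api_status is a name I made up.

```shell
# Report the Available condition of the custom and external metrics APIs.
# Prints "<apiservice> True|False|missing" for each.
metrics_api_status() {
  for svc in v1beta1.custom.metrics.k8s.io v1beta1.external.metrics.k8s.io; do
    status=$(kubectl get apiservice "$svc" \
      -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null) || status=""
    echo "$svc ${status:-missing}"
  done
}
```

Once an API shows True, kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 lists the metrics the adapter actually exposes, which tells you whether the metric your HPA references exists at all.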

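Something else may be fighting the HPA over replicas. An Argo CD / Flux sync that overwrites the field is the usual reason the HPA changes replicas and it is changed straight back. A StatefulSet's ordinal ordering (Pods are created one at a time, each waiting for the previous one to become Ready) can also keep the Pod count from following the HPA. A PodDisruptionBudget only restricts evictions, so it is rarely the culprit here, but rule it out.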

kubectl get events -n <ns> | grep -i horizontalpodautoscaler
kubectl get deployment <name> -n <ns> -o yaml | grep -B2 -A2 replicas

The standard approach under GitOps: let the HPA own replicas and have Argo CD ignore the replicas field:

# Application
spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas

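Change replicas by hand and watch the HPA react: kubectl scale deploy/x --replicas=5. The HPA should pull it back; a scale-up correction arrives within seconds, a scale-down only after the 5-minute stabilization window.
Temporarily turn on verbose logging: kubectl -n kube-system edit deployment metrics-server and add --v=4 to the container args.
Work out whether the metric was never collected or was collected but computed to "no change": look at the metrics in describe hpa; <unknown> and a real value send you down different branches.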
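
That last branch can be read straight from the HPA's status conditions. A rough triage sketch; the function name and message wording are mine, the condition types are the standard AbleToScale, ScalingActive and ScalingLimited.

```shell
# Summarize which HPA condition, if any, is blocking scaling.
# Usage: hpa_triage <namespace> <hpa-name>
hpa_triage() {
  # One line per condition: type, status, reason (tab-separated).
  kubectl get hpa "$2" -n "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}' |
  awk -F'\t' '
    $1 == "ScalingActive"  && $2 == "False" { print "metric not collected: " $3; hit = 1 }
    $1 == "ScalingLimited" && $2 == "True"  { print "replica count capped: " $3; hit = 1 }
    $1 == "AbleToScale"    && $2 == "False" { print "scale call blocked: " $3; hit = 1 }
    END { if (!hit) print "no blocking condition: check tolerance and behavior" }
  '
}
```

A "metric not collected" line sends you back to metrics-server, requests, or the adapter; "replica count capped" usually means minReplicas or maxReplicas has been reached.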

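The lesson in one sentence: 90% of the time an HPA that will not scale comes down to metrics-server or unset requests; the remaining 10% is the behavior window or the algorithm's tolerance band.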