K8s 所有状态都在 etcd 里。etcd 坏了集群就死了,但 etcd 备份恢复其实不难。本文是完整流程。

备份

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F-%H%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

校验:

ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-xxx.db -w table

自动化(crontab)

0 2 * * * /usr/local/bin/etcd-backup.sh

脚本:

#!/bin/bash
BACKUP_DIR=/backup/etcd
mkdir -p $BACKUP_DIR
FILE=$BACKUP_DIR/etcd-$(date +%F-%H%M).db
ETCDCTL_API=3 etcdctl snapshot save $FILE \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# 保留 7 天
find $BACKUP_DIR -name "etcd-*.db" -mtime +7 -delete

# rsync 异地
rsync -av $FILE backup@<backup-server>:/etcd-backups/

恢复(单节点 etcd)

# 1. 停 kube-apiserver 和 etcd
systemctl stop kube-apiserver
mv /etc/kubernetes/manifests/etcd.yaml /tmp/   # 如果是 kubeadm

# 2. 清理 etcd 数据
mv /var/lib/etcd /var/lib/etcd.bak

# 3. 恢复
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-xxx.db \
  --data-dir /var/lib/etcd \
  --name <node-name> \
  --initial-cluster <node-name>=https://<node-ip>:2380 \
  --initial-advertise-peer-urls https://<node-ip>:2380

# 4. 起服务
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
systemctl start kube-apiserver

恢复(HA etcd 集群)

3 节点 etcd 恢复要在所有节点同时做:

# 在每个节点:
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-xxx.db \
  --data-dir /var/lib/etcd \
  --name <本节点名> \
  --initial-cluster node1=https://<n1ip>:2380,node2=https://<n2ip>:2380,node3=https://<n3ip>:2380 \
  --initial-advertise-peer-urls https://<本节点ip>:2380 \
  --initial-cluster-token <token>

# 同时启动所有 etcd

--initial-cluster-token 必须三个节点都一样且和原集群不同(避免脑裂)。

配合 Velero 做应用级备份

etcd 备份只保了 K8s 对象,PV 数据没保。生产再加 Velero:

velero install \
  --provider aws --bucket k8s-backup \
  --backup-location-config region=us-east-1 \
  --use-volume-snapshots=true

velero backup create daily-backup --include-namespaces production

Velero + etcd 双备份,定期演练恢复,是 K8s 生产标配。

教训:备份脚本写好后至少做一次完整恢复演练,没演练过的备份等同于没有。

标签: none

添加新评论