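K8s etcd Backup and Restore: Dragging the Whole Cluster Back
All Kubernetes cluster state lives in etcd. If etcd is lost, the cluster is dead, but backing up and restoring etcd is not actually hard. This article walks through the full process.
Backup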
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F-%H%M).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Verify:
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-xxx.db -w table
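The `etcd-xxx.db` above is a placeholder. As a minimal sketch, the newest snapshot can be picked by name, assuming files follow the `etcd-YYYY-MM-DD-HHMM.db` pattern produced by `date +%F-%H%M` (so name order equals time order). The function name `latest_snapshot` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the newest etcd-*.db file in a directory.
# Relies on the timestamped file names sorting chronologically.
latest_snapshot() {
  local dir=$1 newest="" f
  for f in "$dir"/etcd-*.db; do
    [ -e "$f" ] || continue   # the glob matched nothing
    newest=$f                 # globs expand in sorted order, so the last one wins
  done
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}
```

It could then be used as `ETCDCTL_API=3 etcdctl snapshot status "$(latest_snapshot /backup)" -w table`.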
Automation (crontab)
0 2 * * * /usr/local/bin/etcd-backup.sh
Script:
#!/bin/bash
# Abort on the first error so a failed snapshot never triggers cleanup or upload
set -euo pipefail
BACKUP_DIR=/backup/etcd
mkdir -p "$BACKUP_DIR"
FILE="$BACKUP_DIR/etcd-$(date +%F-%H%M).db"
ETCDCTL_API=3 etcdctl snapshot save "$FILE" \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Keep 7 days of local backups
find "$BACKUP_DIR" -name "etcd-*.db" -mtime +7 -delete
# Copy the new snapshot off-site with rsync
rsync -av "$FILE" backup@<backup-server>:/etcd-backups/
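Before trusting a `-delete` rule, it can help to list what it would remove. The sketch below uses the same `find` test with `-print` instead of `-delete`; the function name `list_expired_backups` is hypothetical. Note that `-mtime +7` counts whole 24-hour periods, so a file is matched only once it is at least 8 days old.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print the backups the retention rule would delete.
# Usage: list_expired_backups DIR [DAYS]   (DAYS defaults to 7)
list_expired_backups() {
  local dir=$1 days=${2:-7}
  # Same match as the cleanup line in the script, but only prints the paths
  find "$dir" -name 'etcd-*.db' -mtime +"$days" -print
}
```

Running `list_expired_backups /backup/etcd` by hand once shows whether the retention window behaves as expected.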
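Restore (single-node etcd)
# 1. Stop kube-apiserver and etcd
systemctl stop kube-apiserver # only if kube-apiserver runs as a systemd unit
mv /etc/kubernetes/manifests/etcd.yaml /tmp/ # kubeadm: etcd is a static pod
# On kubeadm, kube-apiserver is a static pod as well; move its manifest aside the same way
# 2. Move the old etcd data out of the way
mv /var/lib/etcd /var/lib/etcd.bak
# 3. Restore the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-xxx.db \
--data-dir /var/lib/etcd \
--name <node-name> \
--initial-cluster <node-name>=https://<node-ip>:2380 \
--initial-advertise-peer-urls https://<node-ip>:2380
# 4. Start the services again
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
systemctl start kube-apiserver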
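The restore in step 3 refuses to write into a data directory that is still in place, so a small preflight check before running it can save a failed attempt. This is a minimal sketch; the function name `restore_preflight` is hypothetical and the checks are assumptions about what is worth verifying, not a full validation.

```shell
#!/usr/bin/env bash
# Hypothetical preflight: succeed only if the snapshot is usable and the
# target data directory has already been moved aside.
# Usage: restore_preflight SNAPSHOT DATA_DIR
restore_preflight() {
  local snapshot=$1 data_dir=$2
  if [ ! -r "$snapshot" ] || [ ! -s "$snapshot" ]; then
    echo "snapshot missing, unreadable, or empty: $snapshot" >&2
    return 1
  fi
  if [ -e "$data_dir" ]; then
    echo "data dir still exists, move it aside first: $data_dir" >&2
    return 1
  fi
}
```

For example, `restore_preflight /backup/etcd-xxx.db /var/lib/etcd && echo "ok to restore"` would run between steps 2 and 3.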
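Restore (HA etcd cluster)
Restoring a 3-node etcd cluster means running the restore on every node, from the same snapshot file:
# On each node (after stopping etcd and moving the old data dir aside, as in the single-node steps):
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-xxx.db \
--data-dir /var/lib/etcd \
--name <this-node-name> \
--initial-cluster node1=https://<n1ip>:2380,node2=https://<n2ip>:2380,node3=https://<n3ip>:2380 \
--initial-advertise-peer-urls https://<this-node-ip>:2380 \
--initial-cluster-token <token>
# Then start etcd on all nodes at the same time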
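--initial-cluster-token must be identical on all three nodes and different from the original cluster's token, so that the restored members form a new cluster and cannot accidentally interact with members of the old one (this is what guards against split brain).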
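Because the `--initial-cluster` value has to be byte-for-byte identical on all nodes, it can be safer to generate it once than to type it three times. A minimal sketch, assuming peer URLs use https on port 2380 as in the commands above; the function name `build_initial_cluster` and the IP addresses in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the --initial-cluster value from name=ip pairs.
# Usage: build_initial_cluster node1=<ip1> node2=<ip2> node3=<ip3>
build_initial_cluster() {
  local out="" pair name ip
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first '='
    ip=${pair#*=}      # text after the first '='
    # Prefix a comma only when something has already been appended
    out+="${out:+,}${name}=https://${ip}:2380"
  done
  printf '%s\n' "$out"
}
```

For example, `build_initial_cluster node1=10.0.0.1 node2=10.0.0.2 node3=10.0.0.3` produces one string that can be pasted into the restore command on every node. The value passed to `--name` on each node has to match one of the names in that string.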
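Application-level backups with Velero
An etcd backup only covers Kubernetes objects; the data inside PVs is not included. In production, add Velero on top: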
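# The aws provider needs its plugin image and a credentials file; values in <> are placeholders
velero install \
--provider aws --bucket k8s-backup \
--plugins velero/velero-plugin-for-aws:<plugin-version> \
--secret-file <path-to-aws-credentials> \
--backup-location-config region=us-east-1 \
--use-volume-snapshots=true
velero backup create daily-backup --include-namespaces production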
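Velero plus etcd backups, with regular restore drills, is the standard setup for Kubernetes in production.
Lesson: once the backup script is written, run at least one full restore drill. A backup that has never been restored in a drill is as good as no backup.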