Ops Notes · Ops & AI

Getting Started with MCP Server Development: Letting Claude Call Your Internal Services

Author: root
Date: 2026-05-17
Category: AI & Agents

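MCP (Model Context Protocol) is an open standard introduced by Anthropic that lets LLMs call external tools. This post uses an internal CMDB lookup as the running example.

What is MCP

In short: MCP exposes your service as a tool the LLM can call directly. The model does not have to write HTTP request code; it just calls something like query_cmdb(host="...").

A minimal example

An MCP server that provides a "look up host info in the CMDB" tool: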

# server.py
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent

app = Server("cmdb")

@app.list_tools()
async def list_tools():
    return [
        Tool(
            name="query_host",
            description="Look up a host in the CMDB by hostname (IP, OS, owner, data center)",
            inputSchema={
                "type": "object",
                "properties": {
                    "hostname": {"type": "string", "description": "Hostname"}
                },
                "required": ["hostname"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict):
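    if name != "query_host":
        raise ValueError(f"Unknown tool: {name}")
    import httpx
    # Async client with a timeout, so a slow CMDB cannot block the event loop
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(f"http://cmdb.internal/api/hosts/{arguments['hostname']}")
    return [TextContent(type="text", text=resp.text)]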

async def main():
    async with stdio_server() as (r, w):
        await app.run(r, w, app.create_initialization_options())

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

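Registering with Claude Code

.mcp.json in the project root (the exact file location depends on the Claude Code version and scope; claude mcp add can also register the server for you):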

{
  "mcpServers": {
    "cmdb": {
      "command": "python",
      "args": ["/path/to/server.py"]
    }
  }
}

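Restart Claude Code and ask "who owns prod-node1?". Claude will call the tool on its own with hostname="prod-node1" (in Claude Code the tool shows up as mcp__cmdb__query_host).

Going further: exposing Resources over MCP

Tools are "actions"; Resources are "data sets". For example, the state of a whole K8s cluster can be exposed as a resource: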

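import asyncio
import subprocess
from mcp.types import Resource

@app.list_resources()
async def list_resources():
    return [
        Resource(
            uri="k8s://cluster/pods",
            name="K8s Pods",
            mimeType="application/json"
        )
    ]

@app.read_resource()
async def read_resource(uri):
    # The SDK passes a URL object, so compare its string form
    if str(uri) == "k8s://cluster/pods":
        cmd = ["kubectl", "get", "pods", "-A", "-o", "json"]
        # Run kubectl in a worker thread so it does not block the event loop
        out = await asyncio.to_thread(subprocess.check_output, cmd)
        # Assumption: the SDK version in use accepts a plain str as resource content
        return out.decode()
    raise ValueError(f"Unknown resource: {uri}")

Claude can then "read" this resource directly instead of composing and running the kubectl command itself each time. Note that, as written, the server still runs kubectl on every read; add caching if that matters.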

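How we use it in production

Our company's MCP servers provide:

cmdb__query_host: CMDB lookup
monitor__query_metric: Prometheus queries
incident__list_recent: list of recent incidents
runbook__search: search runbooks in the wiki
oncall__current: who is currently on call

When troubleshooting, Claude can call several tools for a single question, which is much faster than clicking through five dashboards by hand.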

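Security notes

The MCP server runs locally, but it acts on your behalf with your credentials. So:

Parameter allowlist: restrict CMDB queries to hosts the caller is authorized to see
Audit log: log every tool call
Read-only first: route write operations through manual confirmation
Timeouts: do not let an MCP call hang forever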
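
The allowlist, audit log, and timeout items above can be combined in one small wrapper around the tool handler. The following is only a sketch under assumptions: ALLOWED_PREFIXES, validate_hostname, guarded_call, and fake_cmdb are hypothetical names, and the prefix-based policy stands in for whatever authorization model your CMDB really has.

```python
import asyncio
import logging
import re

audit_log = logging.getLogger("mcp.audit")

# Hypothetical policy: only hostnames with these prefixes may be queried.
ALLOWED_PREFIXES = ("prod-", "staging-")
# Plain hostnames only, so path tricks like "../x" never reach the CMDB URL.
HOSTNAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{0,62}$")


def validate_hostname(hostname: str) -> str:
    """Reject anything that is not a plain, allowlisted hostname."""
    if not HOSTNAME_RE.fullmatch(hostname):
        raise ValueError(f"invalid hostname: {hostname!r}")
    if not hostname.startswith(ALLOWED_PREFIXES):
        raise PermissionError(f"host not in allowlist: {hostname}")
    return hostname


async def guarded_call(tool: str, hostname: str, backend, timeout_s: float = 5.0):
    """Validate the argument, write an audit record, and time-box the call."""
    host = validate_hostname(hostname)
    audit_log.info("tool=%s hostname=%s", tool, host)
    # asyncio.wait_for cancels the backend call once the timeout expires.
    return await asyncio.wait_for(backend(host), timeout=timeout_s)


async def fake_cmdb(host: str) -> str:
    # Stand-in for the real CMDB request, used only for illustration.
    return f'{{"hostname": "{host}", "owner": "alice"}}'
```

Inside call_tool, the real HTTP request would take the place of fake_cmdb, so every call goes through the same validation, logging, and timeout path.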

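MCP vs. plain Bash

Why not just let Claude run curl http://cmdb.internal/api/...?

MCP tools carry a schema, so Claude knows how to fill in the parameters
MCP is a cross-tool standard, so the same server can plug into Cursor, Continue, and others
MCP calls can be audited
Complex requests (OAuth, request signing) are encapsulated inside the MCP server

Lesson: in our experience MCP is the most dependable way to connect an agent to internal enterprise tools, and far more stable than having the agent write a pile of curl commands.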

Hidden Pitfalls When the K8s HPA Refuses to Scale Out

Author: root
Date: 2026-05-17
Category: Kubernetes

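The HPA (HorizontalPodAutoscaler) is configured, traffic is ramping up, and the Pods still will not scale out: a classic K8s operations pitfall. This post lists the 6 typical causes I have seen.

First commands to run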

kubectl describe hpa <name> -n <ns>
kubectl get hpa <name> -n <ns> -w
kubectl logs -n kube-system deploy/metrics-server | tail

The Conditions and Events at the bottom of describe hpa are the key.

Pitfall 1: metrics-server is missing or down

The most common one. If kubectl top pods fails with this error, this is it:

error: Metrics API not available

Fix:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

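On clusters with self-signed kubelet certificates you also need to add --kubelet-insecure-tls to the metrics-server arguments.

Pitfall 2: the Pod has no resources.requests

For utilization targets, HPA computes current usage / requests. Without requests, HPA has nothing to compute against.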

kubectl get hpa <name> -n <ns>
# <unknown> in the TARGETS column (e.g. <unknown>/50%) points to this cause

Fix: add resources.requests (for example cpu: 100m) to the containers in the Deployment.

Pitfall 3: the threshold algorithm is not what you think

The HPA formula:

desiredReplicas = ceil(currentReplicas × (currentMetric / desiredMetric))

Example: 2 Pods currently, average CPU utilization 60%, target 50%:

desired = ceil(2 × (60/50)) = ceil(2.4) = 3

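There is also a 10% tolerance band (horizontal-pod-autoscaler-tolerance, default 0.1): HPA only acts when the ratio currentMetric / desiredMetric falls outside 0.9 to 1.1. Here 60/50 = 1.2 exceeds 1.1, so it scales out. As long as currentMetric stays between 45% and 55% of a 50% target, nothing is scaled in either direction.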
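
The formula and the tolerance band can be checked with a few lines of Python. This is a simplified sketch of the calculation described above, not the controller's actual code; desired_replicas is a hypothetical helper, and it ignores min/max replicas, unready Pods, and missing metrics.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Apply the HPA ratio formula, skipping changes inside the tolerance band."""
    ratio = current_metric / target_metric
    # Inside the band (|ratio - 1| <= tolerance) the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(2, 60, 50))   # ratio 1.2, outside the band
print(desired_replicas(2, 54, 50))   # ratio 1.08, inside the band
```

With the example numbers from the text, the first call scales from 2 to 3 Pods, while the second stays at 2 because 54% is still within the tolerance band around the 50% target.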

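Pitfall 4: scale-up / scale-down windows (v2 behavior)

HPA v2 (autoscaling/v2) introduced behavior, and the default scale-down stabilization window is 5 minutes. If traffic has dropped but Pods are not scaling in, that may be perfectly normal: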

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
  scaleUp:
    stabilizationWindowSeconds: 0

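For a faster scale-in response, lower scaleDown.stabilizationWindowSeconds. scaleUp already defaults to 0, so there is nothing left to shave off there.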
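
To see why Pods linger after traffic drops, it helps to model the scale-down window: the controller keeps recent recommendations and, when scaling in, acts on the highest one still inside the window. The sketch below is a simplified model of that behavior; stabilized_replicas and its argument layout are hypothetical.

```python
def stabilized_replicas(history, now_s, current_replicas, window_s=300):
    """Pick the scale-down target from recent recommendations.

    history: list of (timestamp_s, recommended_replicas) tuples.
    """
    recent = [rec for ts, rec in history if now_s - ts <= window_s]
    if not recent:
        # No recommendation inside the window: keep what we have.
        return current_replicas
    # The highest recent recommendation wins, so one old spike
    # holds the replica count up until it ages out of the window.
    return max(recent)


history = [(0, 10), (100, 8), (290, 4)]
print(stabilized_replicas(history, now_s=300, current_replicas=10))
print(stabilized_replicas(history, now_s=450, current_replicas=10))
```

At 300 seconds the recommendation of 10 from time 0 is still inside the window, so nothing shrinks; at 450 seconds only the recommendation of 4 remains and the scale-in finally happens.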

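Pitfall 5: custom metrics are not wired up

When using Prometheus Adapter or KEDA:

kubectl get apiservice | grep -E 'custom.metrics|external.metrics'
# v1beta1.custom.metrics.k8s.io (Prometheus Adapter) or v1beta1.external.metrics.k8s.io (KEDA) must be Available

If it is not Available, the adapter is down or its configuration did not take effect. After changing the adapter configuration, restart its Pod; in my experience the change is not picked up otherwise.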

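Pitfall 6: something else is managing the target Deployment's replicas

Argo CD or Flux sync is the usual culprit: HPA changes replicas and the GitOps controller immediately resets it to the value in Git. Any other controller or CI job that writes spec.replicas has the same effect. PodDisruptionBudgets and StatefulSet ordering constraints are also worth ruling out when the replica count does not move as expected.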

kubectl get events -n <ns> | grep -i hpa
kubectl get deployment <name> -o yaml | grep -B2 -A2 replicas

The standard approach with GitOps: let HPA own replicas and have Argo CD ignore the replicas field:

# Application
spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas

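A few debugging tricks

Change replicas by hand and watch HPA react: kubectl scale deploy/x --replicas=5. If the load does not justify 5 replicas, HPA should pull it back once the scale-down stabilization window (5 minutes by default) has passed
Temporarily turn on verbose logging: kubectl -n kube-system edit deployment metrics-server and add --v=4
Tell "metric not collected" apart from "computed but not scaling": check the current metric in describe hpa; <unknown> and a real value send you down different branches

One-line lesson: in my experience about 90% of "HPA will not scale" cases come down to metrics-server or missing requests, and the remaining 10% are the behavior window or the algorithm's tolerance band.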

Lessons from Using Claude Code for Ops Automation

Author: root
Date: 2026-05-17
Category: AI & Agents

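I have been using Claude Code for day-to-day ops work for about half a year, going from "AI writes my scripts" at the start to "the agent runs operations semi-automatically" today, and I hit plenty of pitfalls along the way.

Lesson 1: always read before you write

By default Claude Code Reads a file and then Edits it. If you skip that step and ask it to write directly, it may make up content.

✅ Good: "Take a look at /etc/nginx/nginx.conf and change worker_processes to 8"
❌ Bad: "Change nginx worker_processes to 8" (no read first, so it may edit the wrong place)

Lesson 2: restrict Bash with permission rules

Configure command rules in settings.json: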

{
  "permissions": {
    "allow": [
      "Bash(kubectl get *)",
      "Bash(kubectl describe *)",
      "Bash(kubectl logs *)",
      "Bash(ceph status)",
      "Bash(ceph osd *)"
    ],
    "ask": [
      "Bash(kubectl apply *)",
      "Bash(kubectl delete *)"
    ],
    "deny": [
      "Bash(rm -rf /*)",
      "Bash(* | sudo *)"
    ]
  }
}

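allow runs without prompting, ask shows a confirmation prompt, and deny rejects outright. Most ops tasks are read-only, so with a good allowlist the agent moves fast. Keep the allow patterns narrow, though: Bash(ceph osd *) also matches mutating subcommands such as ceph osd out.

Lesson 3: turn SOPs into Skills

Every team has a pile of SOPs: release, rollback, scale-out, and so on. Spelling them out for the agent every time is tiring. Write them as a Skill: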

~/.claude/skills/k8s-deploy/SKILL.md
---
name: k8s-deploy
description: K8s application release SOP, including canary, monitoring, and rollback
---

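## Before release
1. Check that the PR has been merged into main
2. Check that CI has passed
3. Confirm the on-call engineer is online

## Release
1. Change the image tag in values.yaml
2. Confirm the change with helm diff
3. helm upgrade --set replicas=1 as a canary first
4. Observe for 5 minutes
5. If the canary looks good, roll out fully

## Rollback
helm rollback <release> <revision>

After that, say "release user-api to prod" and the agent follows the SOP on its own.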

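Lesson 4: always confirm long tasks step by step

I once asked the agent to "clean up expired resources in 30 namespaces" in one go, and it really did delete them... and deleted too much.

The improved prompt:

"Clean up expired resources in 30 namespaces. First list what would be deleted in each namespace, and only delete after I confirm."

The agent then does a dry run first, shows you the deletion list, and executes only after you confirm.

Lesson 5: use SSH for cross-machine operations

By default Claude Code only operates on the local machine. For other machines you need SSH (preferably with key-based login):

# Have the agent run a remote command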
ssh prod-node1 'docker ps'

# Use a heredoc for more complex tasks
ssh prod-node1 'bash -s' <<'EOF'
set -e
systemctl status nginx
nginx -t
systemctl reload nginx
EOF

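Lesson 6: let the agent summarize logs

The most tiring part of troubleshooting is digging through logs. Let the agent do it:

"From the nginx access log of the past 24 hours, take the 5xx requests, group them by URL, and give me the top 10"

The agent assembles the grep + awk pipeline itself and returns the result in seconds.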
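
For a sense of what the agent ends up computing, here is a rough Python equivalent of such a pipeline. It is a sketch that assumes nginx's default "combined" log format; top_5xx and LINE_RE are hypothetical names, and the 24-hour time filter is left out to keep it short.

```python
import re
from collections import Counter

# Assumes the "combined" format: ... "METHOD /path HTTP/x.y" STATUS BYTES ...
LINE_RE = re.compile(r'"(?:[A-Z]+) (\S+) [^"]*" (\d{3}) ')


def top_5xx(lines, n=10):
    """Count 5xx responses per URL path and return the n most frequent."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group(2).startswith("5"):
            # Drop the query string so /api?x=1 and /api?x=2 group together.
            path = m.group(1).split("?", 1)[0]
            counts[path] += 1
    return counts.most_common(n)


sample = [
    '1.2.3.4 - - [17/May/2026:10:00:00 +0800] "GET /api/users?id=1 HTTP/1.1" 502 157 "-" "curl/8.0"',
    '1.2.3.4 - - [17/May/2026:10:00:01 +0800] "GET /api/users?id=2 HTTP/1.1" 502 157 "-" "curl/8.0"',
    '1.2.3.4 - - [17/May/2026:10:00:02 +0800] "GET /health HTTP/1.1" 500 12 "-" "curl/8.0"',
    '1.2.3.4 - - [17/May/2026:10:00:03 +0800] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
]
print(top_5xx(sample))
```

Whatever the agent generates, it is worth spot-checking the result against a few raw log lines, in line with the "verify its output" advice in this post.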

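Lesson 7: use Plan mode for big changes

For any task with more than 5 steps, have the agent output a plan first:

"I want to upgrade the whole cluster from 1.27 to 1.30. Give me the upgrade plan first; do not execute anything."

Once the agent has laid out the detailed steps, review them and then let it execute.

What not to do

❌ Letting the agent change production config without leaving a trace in git: all IaC changes go through a PR
❌ Letting the agent decide capacity planning: it gives "rule of thumb" numbers, not ones based on real data from your environment
❌ Letting the agent write the incident report: it can assist, but it cannot ghostwrite, because details get made up

Biggest takeaway

Agents are well suited to work that is "repetitive, low-risk, and has clear success criteria". Decisions and judgment still have to come from people.

Lesson: treat the agent as a "very fast junior ops engineer": give it clear tasks and boundaries, and verify its output.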

Claude Code Hooks in Practice: Making the Agent Follow Team Rules Automatically

Author: root
Date: 2026-05-16
Category: AI & Agents

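Hooks are an easily overlooked but very powerful Claude Code feature: they intercept tool calls and run a command you define before or after each operation. This post shows a few production uses.

What is a hook

Configure it in ~/.claude/settings.json: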

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "/usr/local/bin/check-bash.sh"
        }]
      }
    ]
  }
}

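Before the agent runs any Bash command, check-bash.sh is called first, with the tool call passed as JSON on stdin. If the script exits with code 2, the call is blocked and stderr is fed back to the agent; other non-zero exit codes are treated as non-blocking errors.

Use 1: forbid certain commands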

#!/bin/bash
# /usr/local/bin/check-bash.sh
INPUT=$(cat)
CMD=$(echo "$INPUT" | jq -r '.tool_input.command')

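# Block rm -rf on any absolute path (a quoted right-hand side of =~ is a literal substring match)
if [[ "$CMD" =~ "rm -rf /" ]]; then
  echo "DANGEROUS COMMAND BLOCKED" >&2
  exit 2
fi

# Block force push to main; the regex must live in an unquoted variable,
# otherwise .* is matched literally and the rule never fires
FORCE_PUSH_RE='git push.*(--force|-f).*main'
if [[ "$CMD" =~ $FORCE_PUSH_RE ]]; then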
  echo "Force push to main is forbidden" >&2
  exit 2
fi

exit 0
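
The same check can be written in Python, which tends to be easier to unit-test than shell. This is a sketch under assumptions: DENY_RULES and run_hook are hypothetical names, and the payload shape (tool_input.command) is taken from the jq expression in the script above.

```python
import json
import re

# Hypothetical deny rules: (compiled regex, message shown to the agent).
DENY_RULES = [
    (re.compile(r"rm\s+-rf\s+/"), "DANGEROUS COMMAND BLOCKED"),
    (re.compile(r"git push.*(--force|-f\b).*main"), "Force push to main is forbidden"),
]


def run_hook(raw_json: str):
    """Return (exit_code, stderr_message) for one PreToolUse payload."""
    payload = json.loads(raw_json)
    command = payload.get("tool_input", {}).get("command", "")
    for pattern, message in DENY_RULES:
        if pattern.search(command):
            # Exit code 2 is what blocks the tool call.
            return 2, message
    return 0, ""


print(run_hook(json.dumps({"tool_input": {"command": "git push --force origin main"}})))
print(run_hook(json.dumps({"tool_input": {"command": "kubectl get pods"}})))
```

To use it as a real hook, a small entry point would read sys.stdin, print the message to sys.stderr, and call sys.exit with the returned code.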

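Use 2: lint automatically after every file change

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{
          "type": "command",
          "command": "bash -c 'PATH=$PATH:/usr/local/bin; FILE=$(jq -r .tool_input.file_path); [[ $FILE == *.py ]] || exit 0; ruff check \"$FILE\" >&2 || exit 2'"
        }]
      }
    ]
  }
}

After a Python file is modified, ruff runs automatically. Non-Python files exit 0 right away; for Python files the hook sends ruff's findings to stderr and exits with code 2, so the agent sees them and can fix them.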

Use 3: require a second confirmation before touching production

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "/usr/local/bin/prod-check.sh"
        }]
      }
    ]
  }
}

#!/bin/bash
INPUT=$(cat)
CMD=$(echo "$INPUT" | jq -r '.tool_input.command')

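# Commands that touch production
if echo "$CMD" | grep -qE "kubectl.*-n.*prod|ssh.*prod-"; then
  # macOS notification with fixed text, so the command cannot break the AppleScript quoting
  osascript -e 'display notification "Claude wants to run a prod command" with title "Prod Action"'
  # Hand the decision to the user instead of hard-blocking
  jq -n '{"hookSpecificOutput":{"hookEventName":"PreToolUse","permissionDecision":"ask","permissionDecisionReason":"Prod operation needs manual approval"}}'
  exit 0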
fi

exit 0

Use 4: add a fixed footer to every git commit

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "/usr/local/bin/inject-footer.sh"
        }]
      }
    ]
  }
}

#!/bin/bash
INPUT=$(cat)
CMD=$(echo "$INPUT" | jq -r '.tool_input.command')

if [[ "$CMD" =~ ^git\ commit ]]; then
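  # Rewrite the command: insert a trailer right after "git commit"
  # (git commit --trailer needs Git 2.32+; unlike \n in a sed replacement it behaves the same on macOS and Linux)
  NEW_CMD="git commit --trailer 'Reviewed-by: AI'${CMD#git commit}"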
  jq -n --arg cmd "$NEW_CMD" '{"hookSpecificOutput":{"hookEventName":"PreToolUse","permissionDecision":"allow","permissionDecisionReason":"ok","updatedInput":{"command":$cmd}}}'
fi

exit 0

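Common hook events

Event	Fires when
PreToolUse	Before a tool call
PostToolUse	After a tool call
Notification	When Claude Code sends a notification (for example, it is waiting for user input)
Stop	When the agent finishes responding
SubagentStop	When a subagent finishes

Debugging hooks

Append 2>>/tmp/hook.log to keep stderr. Do this only while debugging: with stderr redirected to a file, the agent no longer sees the block reason.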

"command": "/usr/local/bin/check-bash.sh 2>>/tmp/hook.log"

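Then run tail -f /tmp/hook.log to watch what actually fires.

Lesson: hooks are the most reliable way to put "hard constraints" on the agent, and far more dependable than writing "do not do X" in CLAUDE.md. The agent may forget a rule; a hook will not.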

A Full Troubleshooting Checklist for K8s Pods Stuck in Pending

Author: root
Date: 2026-05-16
Category: Kubernetes

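When kubectl get pods shows Pending, newcomers tend to freeze. This post lists 8 classes of causes in order of how often they show up, each with the commands to pinpoint it; walking through them in order resolves most cases.

The universal opening move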

kubectl describe pod <pod> -n <ns> | tail -30
kubectl get events -n <ns> --sort-by=.lastTimestamp | tail -20

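The Events at the bottom of describe are the most direct clue; for roughly 80% of Pending Pods, in my experience, this is all you need.

1. Insufficient resources (FailedScheduling: Insufficient cpu/memory)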

0/10 nodes are available: 10 Insufficient cpu

Check:

kubectl top nodes
kubectl describe node <node> | grep -A5 "Allocated resources"

Note that Allocated is the sum of requests, not actual usage. A Pod whose requests are too large will also get stuck:

kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'
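
The scheduler's CPU check boils down to comparing the Pod's request with what is left after subtracting the requests already placed on the node. The sketch below illustrates that arithmetic only; cpu_millis and fits are hypothetical helpers, and real scheduling also considers memory, Pod counts, and other resources.

```python
def cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def fits(pod_request: str, node_allocatable: str, already_requested: list) -> bool:
    """True if the Pod's CPU request fits into the node's unrequested CPU."""
    free = cpu_millis(node_allocatable) - sum(cpu_millis(r) for r in already_requested)
    return cpu_millis(pod_request) <= free


# Node with 4 cores allocatable, 3.5 cores already requested by other Pods.
print(fits("500m", "4", ["1500m", "2"]))
print(fits("600m", "4", ["1500m", "2"]))
```

This is why a node can look idle in kubectl top nodes and still reject a Pod: only requests count, not actual usage.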

2. No suitable node (NodeAffinity / NodeSelector mismatch)

0/10 nodes are available: 10 node(s) didn't match Pod's node affinity/selector

kubectl get pod <pod> -o yaml | grep -A20 "nodeSelector\|affinity"
kubectl get nodes --show-labels

A common case: nodeSelector says node-role.kubernetes.io/worker: "true" but no node actually carries that label.

3. Untolerated taints (Taints / Tolerations)

0/10 nodes are available: 10 node(s) had untolerated taint

kubectl describe node <node> | grep Taint

There are two fixes: add a toleration to the Pod, or remove the taint. In production, prefer adding a toleration; taints are usually there on purpose.

4. PVC not yet bound
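
When it is not obvious which toleration is missing, the matching rule can be spelled out in code. This is a simplified sketch of how a toleration is compared with a taint (key, operator, value, effect); tolerates and untolerated are hypothetical helpers, and tolerationSeconds is ignored.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified taint/toleration match."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key matches every key (meant to be used with operator Exists).
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated(taints: list, tolerations: list) -> list:
    """Return the taints that would keep the Pod off the node."""
    return [
        t for t in taints
        # PreferNoSchedule is only a soft preference, so it never blocks.
        if t["effect"] in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]


gpu_taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
print(untolerated([gpu_taint], []))
print(untolerated([gpu_taint], [{"key": "dedicated", "operator": "Exists"}]))
```

The first call reports the taint as blocking; the second returns an empty list because an Exists toleration on the same key covers any value.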

pod has unbound immediate PersistentVolumeClaims

kubectl get pvc -n <ns>
kubectl describe pvc <pvc> -n <ns>

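Possible causes:

The StorageClass does not exist or is misspelled
The provisioner is down (e.g. ebs-csi-controller)
With static PV binding, the selector does not match

5. Image cannot be pulled

The Pod status may be ImagePullBackOff or ErrImagePull (the phase is still Pending), but sometimes it shows up as ContainerCreating with an error in Events: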

Failed to pull image "xxx": rpc error: code = Unknown

Private image without imagePullSecrets
Wrong image tag (:latest vs :v1.2.3)
docker.io slow or unreachable from mainland China → use a mirror

6. Blocked by ResourceQuota / LimitRange

exceeded quota: compute-resources

kubectl describe quota -n <ns>
kubectl describe limitrange -n <ns>

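A common case: the namespace's quota requires every Pod to declare requests/limits (or a LimitRange sets min/max bounds) and the Pod does not comply. When the quota rejects a Pod, the Pod is never created, so also check the events of the owning ReplicaSet or Job.

7. Blocked by PodSecurityAdmission (1.25+)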

violates PodSecurity "restricted:v1.28"

kubectl get ns <ns> -o jsonpath='{.metadata.labels}'

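If the namespace carries pod-security.kubernetes.io/enforce=restricted, a Pod with privileged: true or hostNetwork: true is rejected at admission, so the error shows up on the owning ReplicaSet rather than on a Pod.

8. CNI not ready / webhook blocking

Rare but nasty:

The node is Ready but the CNI plugin is not yet, so the Pod gets scheduled and then hangs in ContainerCreating
A ValidatingWebhook is down, so the Pod creation API call fails and the Pod is never scheduled at all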

kubectl get pods -A | grep -v Running | grep -v Completed
kubectl get validatingwebhookconfigurations

Debug power tools

If there are no events at all, start a debug Pod on the node to see things from the node's point of view:

kubectl debug node/<node> -it --image=busybox

Or go straight to the kubelet logs:

ssh <node> 'journalctl -u kubelet -f | grep <pod-name>'

One-line lesson: for Pending, the Events in describe are always step one; do not jump straight into the controller-manager logs.