OpenStack Monitoring: Ceilometer or Prometheus?
OpenStack ships its own monitoring ecosystem (Ceilometer, Gnocchi, Aodh), while the newer approach is built on Prometheus exporters. Which should you pick?
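The old stack: the Ceilometer family
- Ceilometer: collection
- Gnocchi: time-series storage
- Aodh: alerting
- Panko (deprecated): event storage
Characteristics:
- Deeply integrated with OpenStack; supports billing use cases
- Complex to configure and hard to operate
- Gnocchi becomes a clear performance bottleneck and struggles on large clusters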
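The new stack: Prometheus
- openstack-exporter: scrapes the OpenStack APIs
- libvirt-exporter: scrapes the hypervisor
- node-exporter: scrapes the physical hosts
- Alertmanager: alerting
- Grafana: visualization
Characteristics:
- Strong performance and a broad ecosystem
- Not suited to billing: it tracks current operational state rather than metered usage
- Simple to configure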
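My picks
| Scenario | Pick |
|---|---|
| Public cloud, billing required | Ceilometer + Gnocchi |
| Private cloud, monitoring and alerting only | Prometheus |
| Large scale (>500 nodes) | Prometheus |
| Small scale with simple needs | Prometheus |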
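Prometheus deployment essentials

    scrape_configs:
      # openstack-exporter: every scrape fans out into OpenStack API calls
      - job_name: 'openstack'
        static_configs:
          - targets: ['exporter-host:9180']
        scrape_interval: 60s
        scrape_timeout: 30s
      # libvirt-exporter: one target per compute node
      - job_name: 'libvirt'
        static_configs:
          - targets:
              - 'compute1:9177'
              - 'compute2:9177'
        scrape_interval: 30s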
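openstack-exporter collects a large number of metrics by default, and every scrape turns into a round of OpenStack API calls. Do not set scrape_interval below 60s, or the extra load can take Keystone down.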
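Listing every compute node under static_configs gets tedious as the fleet grows. One option is Prometheus's file-based service discovery, which picks up target changes without touching the main config. The following is a minimal sketch only: the directory /etc/prometheus/targets/, the file name, and the `role` label are assumptions made for illustration, not part of the setup described above.

```yaml
# prometheus.yml (fragment): replaces the static libvirt job
scrape_configs:
  - job_name: 'libvirt'
    file_sd_configs:
      - files:
          # hypothetical path; any glob Prometheus can read works
          - '/etc/prometheus/targets/libvirt-*.yml'
        refresh_interval: 5m   # periodic re-read; file changes are also picked up when detected
    scrape_interval: 30s
```

```yaml
# /etc/prometheus/targets/libvirt-rack1.yml (hypothetical file)
- targets:
    - 'compute1:9177'
    - 'compute2:9177'
  labels:
    role: compute   # illustrative label, attached to every target in this group
```

The target files can then be generated by whatever inventory tooling already knows the compute nodes, and Prometheus needs no restart when they change.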
Must-have alerts

    groups:
      - name: openstack
        rules:
          - alert: NovaComputeDown
            expr: openstack_nova_agent_state{service="nova-compute"} == 0
            for: 5m
          - alert: NeutronAgentDown
            expr: openstack_neutron_agent_state == 0
            for: 5m
          - alert: HypervisorMemoryHigh
            # used / total hypervisor memory; check the names against your exporter's /metrics
            expr: openstack_nova_memory_used_bytes / openstack_nova_memory_available_bytes > 0.85
            for: 10m
          - alert: VolumeStuck
            # per-status volume count; openstack_cinder_volumes itself carries no status label
            expr: openstack_cinder_volume_status_counter{status=~"creating|deleting|error"} > 0
            for: 30m
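Rules like these are easy to get subtly wrong, so it can help to unit-test them with promtool before loading them. The following is a minimal sketch for the NovaComputeDown rule. It assumes the rules above are saved as openstack-alerts.yml (a hypothetical file name); the `hostname` label and its value are also made up for the test series.

```yaml
# openstack-alerts-test.yml (hypothetical file name)
rule_files:
  - openstack-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # agent is up at 0m and 1m, then reports 0 from 2m through 12m
      - series: 'openstack_nova_agent_state{service="nova-compute", hostname="compute1"}'
        values: '1 1 0+0x10'
    alert_rule_test:
      # 4m: the condition has held for only 2m, so the alert is still pending
      - eval_time: 4m
        alertname: NovaComputeDown
        exp_alerts: []
      # 8m: the condition has held longer than "for: 5m", so the alert fires
      - eval_time: 8m
        alertname: NovaComputeDown
        exp_alerts:
          - exp_labels:
              service: nova-compute
              hostname: compute1
```

Run it with `promtool test rules openstack-alerts-test.yml`. The same pattern extends to the other rules by adding input series for their metrics.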
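Grafana dashboard
Search the official Grafana dashboard library for "OpenStack"; I recommend ID 9701, which is built for openstack-exporter.
Lesson learned: the Ceilometer pipeline is long (collect → store → query), and if any stage fails you see no data at all. Prometheus uses a plain, blunt pull model, which in my experience makes faults about ten times easier to pin down.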