Real-Time Data Monitoring
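This section shows how to use Grafana for real-time data monitoring, covering alert rule configuration, notification channel setup, and multi-dimensional monitoring. After completing it, you will be able to:
- Configure Grafana alert rules
- Set up multi-channel alert notifications (email, Telegram)
- Implement multi-dimensional data monitoring
- Build an operations monitoring dashboard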
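Before starting this section, make sure that:
- A Grafana dashboard has been created
- Data is being written to InfluxDB continuously
- You are familiar with basic alerting concepts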
Real-Time Monitoring Architecture

Data Update Flow

    ESP32 sensor (every 3 seconds)
        ↓ MQTT
    Node-RED (real-time processing)
        ↓ InfluxDB write
    InfluxDB (time-series storage)
        ↓ Flux query
    Grafana (5-second polling)
        ↓ alert rule check
    Notification (email/Telegram)

Latency Budget

| Stage | Latency | Notes |
|---|---|---|
| Sensor acquisition | < 100 ms | ESP32 reads the sensor |
| MQTT transport | < 50 ms | Within the local network |
| Node-RED processing | < 50 ms | JSON parsing + conversion |
| InfluxDB write | < 50 ms | Time-series database write |
| Grafana polling | 5 s | Auto-refresh interval |
| End-to-end latency | ~5.25 s | From acquisition to display |
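Pre-sales tip: During a demo, explain that the displayed data lags by about 5 seconds. If lower latency is required, the Grafana refresh interval can be set to 1 second, at the cost of higher server load.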
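The end-to-end figure in the latency budget is the sum of the per-stage upper bounds plus one full refresh interval. A minimal sketch of that arithmetic follows; the function and dictionary names are illustrative, not part of any tool, and the stage values are the upper bounds from the table.

```python
# Upper-bound latency per stage in milliseconds, taken from the latency budget table.
STAGE_BUDGET_MS = {
    "sensor_read": 100,
    "mqtt_transport": 50,
    "node_red_processing": 50,
    "influxdb_write": 50,
}


def end_to_end_latency_s(refresh_interval_s: float) -> float:
    """Worst case: pipeline latency plus one full Grafana refresh interval."""
    pipeline_s = sum(STAGE_BUDGET_MS.values()) / 1000.0
    return pipeline_s + refresh_interval_s


if __name__ == "__main__":
    for refresh in (5, 1):
        print(f"refresh={refresh}s -> up to {end_to_end_latency_s(refresh):.2f}s")
```

With a 5-second refresh the bound is 5.25 s; dropping the refresh to 1 second brings it to 1.25 s, which shows that the polling interval dominates the budget.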
Grafana Alerting Setup

Step 1: Enable Unified Alerting

1. In the left-hand menu, go to Alerting → Alert rules.
2. On first use, enable unified alerting.
3. Click "Go to configuration" to configure notification channels.

Step 2: Configure Contact Points

Email notification:

1. Go to Alerting → Contact points.
2. Click "Add contact point".
3. Configure:
   - Name: Email Alert
   - Integration: Email
   - Addresses: ops@factory.com
4. Click "Test" to verify.

Telegram notification (optional):

1. Create a Telegram bot:
   - Search for @BotFather in Telegram
   - Send /newbot
   - Copy the Bot Token
2. Configure Grafana:
   - Integration: Telegram
   - Bot Token: YOUR_BOT_TOKEN
   - Chat ID: YOUR_CHAT_ID

Step 3: Create Alert Rules
High-temperature alert rule:

1. Go to Alerting → Alert rules → "New alert rule".
2. Configure the query:

    Name: High Temperature Alert
    Folder: Factory Monitoring

    Query:
    from(bucket: "nodered")
      |> range(start: -2m)
      |> filter(fn: (r) => r._measurement == "environment")
      |> filter(fn: (r) => r._field == "temperature")
      |> aggregateWindow(every: 30s, fn: mean)
      |> last()

    Condition: last() > 30°C
    Evaluation: Every 1m, For: 1m

3. Configure the notification:
   - Contact Point: Email Alert
   - Message: "⚠️ Temperature alert: current temperature {{ $values.A.Value }}°C exceeds the 30°C threshold"
4. Click "Save rule".

Humidity alert rule:

    Query:
    from(bucket: "nodered")
      |> range(start: -2m)
      |> filter(fn: (r) => r._measurement == "environment")
      |> filter(fn: (r) => r._field == "humidity")
      |> aggregateWindow(every: 30s, fn: mean)
      |> last()

    Condition: last() < 20% OR last() > 90%
    Evaluation: Every 5m, For: 5m
    Message: "⚠️ Humidity out of range: current humidity {{ $values.A.Value }}%, normal range 20%-90%"

Multi-Dimensional Monitoring
Device Status Monitoring

Create an operations monitoring dashboard that shows the status of every device.

Panel 1: Device online status (Stat)

Query:

    from(bucket: "nodered")
      |> range(start: -1m)
      |> filter(fn: (r) => r._measurement == "environment")
      |> filter(fn: (r) => r._field == "temperature")
      |> last()
      |> group(columns: ["device_id"])
      |> count()

If a device has written no data in the last 1 minute, it is considered offline.

Network Quality Monitoring

Panel 2: WiFi signal strength (Time Series)

Query:

    from(bucket: "nodered")
      |> range(start: -1h)
      |> filter(fn: (r) => r._measurement == "environment")
      |> filter(fn: (r) => r._field == "signal_rssi")
      |> aggregateWindow(every: 1m, fn: mean)

Data Update Health

Panel 3: Data update frequency (Stat)

Query (time elapsed since the most recent data update, in seconds):

    from(bucket: "nodered")
      |> range(start: -5m)
      |> filter(fn: (r) => r._measurement == "environment")
      |> filter(fn: (r) => r._field == "temperature")
      |> last()
      // Convert both times to nanosecond epoch integers before subtracting,
      // then divide to get whole seconds.
      |> map(fn: (r) => ({ r with time_since: (int(v: now()) - int(v: r._time)) / 1000000000 }))

Advanced Dashboard Features
Section titled “Advanced Dashboard Features”Dashboard Annotations
Section titled “Dashboard Annotations”添加事件注解标记重要事件:
1. 仪表板 Settings → Annotations2. 添加 Annotation: - Name: Manual Events - Enabled: ✓ - Type: Dashboard3. 使用时点击仪表板上的图标添加注解Template Variables
Section titled “Template Variables”添加变量实现多维度切换:
变量 1: device (设备选择)Type: QueryQuery: schema.tagValues(bucket: "nodered", tag: "device_id")
变量 2: time_range (时间范围)Type: CustomValues: 15m, 30m, 1h, 6h, 24h, 7d
变量 3: metric (指标选择)Type: QueryQuery:import "influxdata/influxdb/schema"schema.measurementTagValues( bucket: "nodered", measurement: "environment", tag: "_field")Dashboard Permissions
User Management

Create a read-only user for buyer access:

1. Go to Configuration → Users → Invite.
2. Configure:
   - Email: buyer@factory.com
   - Role: Viewer (read-only)
3. Set the dashboard permissions:
   - Dashboard Settings → Permissions
   - Add Permission: Viewers → View

Snapshot Sharing

Generate a dashboard snapshot to share with the buyer:

1. Click the dashboard "Share" icon.
2. Select the "Snapshot" tab.
3. Set:
   - Time range: 1h
   - Include current time: ✓
4. Click "Publish to snapshots.raintank.io".
5. Copy the share link.

Alert Verification
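1. Raise the temperature manually (cover the sensor with your hand).
2. Watch the alert state in Grafana:
   - Dashboard → the Alert icon should turn red
   - The Alerting page should show the rule as triggered
3. Check that the alert email arrives in the inbox.
4. Once the temperature returns to normal, the alert resolves automatically.

Monitoring Verification Checklist

- The temperature alert fires when the reading exceeds 30°C
- The alert email arrives within 2 minutes
- A device that has been silent for more than 1 minute is marked offline
- Dashboard auto-refresh works
- Data updates correctly after switching devices with the variable
- The snapshot link is accessible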
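The offline item in the checklist comes down to comparing each device's newest timestamp with a 1-minute threshold. The sketch below shows that comparison on its own, outside Grafana. The function name, device IDs, and timestamps are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta, timezone

OFFLINE_AFTER = timedelta(minutes=1)  # threshold from the checklist


def device_status(last_seen: dict[str, datetime], now: datetime) -> dict[str, str]:
    """Mark a device offline when its newest point is older than OFFLINE_AFTER."""
    return {
        device_id: "offline" if now - seen > OFFLINE_AFTER else "online"
        for device_id, seen in last_seen.items()
    }


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    # Hypothetical device IDs and timestamps, for illustration only.
    last_seen = {
        "esp32-01": now - timedelta(seconds=3),
        "esp32-02": now - timedelta(seconds=95),
    }
    print(device_status(last_seen, now))
```

A device that reported exactly 60 seconds ago still counts as online here, because the checklist wording is "more than 1 minute".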
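Issue 1: Alert Not Triggered

Symptom: Data exceeds the threshold, but no notification is received.

Possible causes:
- The evaluation interval is too long
- The For duration has not been met
- The notification channel is misconfigured

Solutions:
- Shorten the evaluation interval (for example, 30s)
- Reduce the For duration (for example, 0m to fire immediately)
- Test the notification channel (Contact points → Test)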
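To see why an unmet For duration suppresses notifications, it can help to step through the evaluations by hand. The sketch below is a simplified model of a threshold rule, not Grafana's actual implementation: `simulate_rule` and the sample readings are assumptions made for illustration.

```python
def simulate_rule(values, threshold, interval_s, for_s):
    """Return the rule state after each evaluation.

    Simplified model: a breach moves the rule to "Pending"; it becomes
    "Firing" once the breach has lasted at least for_s seconds. Any
    reading at or below the threshold resets the rule to "Normal".
    """
    states = []
    pending_since = None  # time of the first evaluation in the current breach
    for i, value in enumerate(values):
        t = i * interval_s
        if value > threshold:
            if pending_since is None:
                pending_since = t
            state = "Firing" if t - pending_since >= for_s else "Pending"
        else:
            pending_since = None
            state = "Normal"
        states.append(state)
    return states


if __name__ == "__main__":
    readings = [28, 31, 29, 31, 32, 32]  # one hypothetical reading per evaluation
    print(simulate_rule(readings, threshold=30, interval_s=60, for_s=60))
    print(simulate_rule(readings, threshold=30, interval_s=60, for_s=0))
```

In this model, with Every 1m and For 1m, the single spike at the second evaluation only reaches Pending and never notifies, while the sustained breach fires one evaluation after it starts. With For set to 0 the rule fires on the first breaching evaluation.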
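Issue 2: Dashboard Loads Slowly

Symptom: The dashboard stutters after the time range is changed.

Possible causes:
- The query time range is too large
- Too many data points
- Too many panels

Solutions:
- Use aggregation queries to reduce the number of data points
- Set a sensible maximum data point limit
- Split the dashboard into several dashboards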
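The effect of aggregation on point counts is easy to estimate. The following sketch assumes the 3-second sampling interval from the data flow and a hypothetical limit of 1000 points per series. The function names are illustrative.

```python
import math


def raw_point_count(range_s: int, sample_interval_s: int) -> int:
    """Points one series returns without aggregation."""
    return range_s // sample_interval_s


def min_window_s(range_s: int, max_points: int) -> int:
    """Smallest aggregation window (seconds) that keeps a series within max_points."""
    return math.ceil(range_s / max_points)


if __name__ == "__main__":
    seven_days_s = 7 * 24 * 3600
    print(raw_point_count(seven_days_s, 3))   # raw points for a 7d range
    print(min_window_s(seven_days_s, 1000))   # window needed for <= 1000 points
```

For a 7d range, one series holds 201600 raw points at a 3-second interval; an aggregation window of about 605 seconds (roughly 10 minutes) keeps it under 1000 points.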
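- ✅ Recommended: Assign different alert priorities (P1/P2/P3) to different types of anomalies
- ✅ Recommended: Include the specific value, location, and suggested action in the alert message
- ✅ Recommended: Set alert silence periods so that false alarms do not disturb staff at night
- ❌ Avoid: Overly sensitive alert thresholds that cause alert fatigue
- ❌ Avoid: Alerting on every metric (focus on the key metrics only)
- ❌ Avoid: Exposing alert information on public dashboards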
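One way to combine alert priorities with a silence period is sketched below. The quiet-hours window (22:00 to 07:00), the rule that P1 bypasses silence, and the function names are all assumptions for illustration. In Grafana itself this kind of policy would be expressed through its notification settings rather than custom code.

```python
from datetime import time

# Hypothetical policy: the hours and severity handling are illustrative only.
QUIET_START = time(22, 0)
QUIET_END = time(7, 0)


def in_quiet_hours(t: time, start: time = QUIET_START, end: time = QUIET_END) -> bool:
    """True when t falls in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def should_notify(severity: str, t: time) -> bool:
    """P1 always notifies; P2 and P3 are held back during quiet hours."""
    if severity == "P1":
        return True
    return not in_quiet_hours(t)


if __name__ == "__main__":
    print(should_notify("P1", time(2, 0)))  # critical alert at night
    print(should_notify("P3", time(2, 0)))  # low-priority alert at night
```

The midnight-crossing branch matters: a naive `start <= t < end` check would never match a window that begins in the evening and ends the next morning.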
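Summary

Key points of this section:
- Real-time monitoring: Grafana refreshes every 5 seconds, giving an end-to-end latency of about 5 seconds
- Alert configuration: threshold-based alert rules with multi-channel notifications
- Multi-dimensional monitoring: device status, signal quality, and data health
- Dashboard enhancements: annotations, variables, snapshots, and permission management
- Operations practices: alert prioritization, notification channels, and silence period management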