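Troubleshooting Common Issues
This section collects the most common failures in a Docker IoT environment together with their fixes. After completing it, you will be able to:
- Quickly diagnose and resolve container runtime problems
- Troubleshoot MQTT communication failures
- Handle and recover from data persistence problems
- Give customers a sound explanation of common issues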

Quick Diagnostic Script
Create a quick diagnostic script named iot-diagnose.sh:

    #!/bin/bash
    # IoT Stack quick diagnostic script
    echo "=== IoT Stack Diagnostic Report ==="
    echo "Generated: $(date)"
    echo ""

    # 1. System information
    echo "--- System Info ---"
    echo "OS: $(uname -a)"
    echo "Disk:"
    df -h / | tail -1
    echo "Memory:"
    free -h | grep Mem
    echo ""

    # 2. Docker status
    echo "--- Docker Status ---"
    echo "Docker: $(docker --version 2>/dev/null || echo 'NOT INSTALLED')"
    echo "Compose: $(docker-compose --version 2>/dev/null || echo 'NOT INSTALLED')"
    echo ""

    # 3. Container status
    echo "--- Container Status ---"
    docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}" 2>/dev/null
    echo ""

    # 4. Resource usage
    echo "--- Resource Usage ---"
    docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" 2>/dev/null
    echo ""

    # 5. Port usage (lsof may need sudo to see sockets owned by other users)
    echo "--- Port Usage ---"
    for port in 1883 8883 1880 8086 3000 9000; do
        if lsof -i :"${port}" > /dev/null 2>&1; then
            echo "Port ${port}: IN USE"
        else
            echo "Port ${port}: AVAILABLE"
        fi
    done
    echo ""

    # 6. Disk space
    echo "--- Docker Disk Usage ---"
    docker system df 2>/dev/null
    echo ""

    echo "=== Diagnostic Complete ==="

Category 1: Docker Installation Issues

Issue 1.1: Docker commands require sudo
Symptom:

    docker: permission denied while trying to connect
    Got permission denied while trying to connect to the Docker daemon socket

Cause: the current user is not in the docker group.
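
To confirm this cause before changing anything, group membership can be checked directly. The following is a minimal sketch; `has_group` and `user_in_docker_group` are hypothetical helper names, not part of Docker. It relies on `id -nG` printing a space-separated list of group names.

```shell
# has_group GROUP "LIST": succeed if GROUP appears as a whole word in the
# space-separated LIST (the format printed by `id -nG`).
has_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# user_in_docker_group: check the groups of the current shell session.
# Defined only; call it manually on the Docker host.
user_in_docker_group() {
    if has_group docker "$(id -nG)"; then
        echo "current session is in the docker group"
    else
        echo "current session is NOT in the docker group"
    fi
}
```

Note that `id -nG` without a user name reports the groups of the running session, so it keeps reporting the old membership after `usermod` until you log in again, while `id -nG "$USER"` reads the group database and reflects the change immediately.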
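Solution:

    # 1. Add the user to the docker group
    sudo usermod -aG docker $USER

    # 2. Log out and back in (or refresh group membership in the current shell)
    newgrp docker

    # 3. Verify
    docker ps

Issue 1.2: Docker service not running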
Symptom:

    Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Solution:

    # 1. Start the Docker service
    sudo systemctl start docker

    # 2. Enable start on boot
    sudo systemctl enable docker

    # 3. Check the status
    sudo systemctl status docker

Issue 1.3: Insufficient disk space
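Symptom:

    No space left on device
    write /var/lib/docker: no space left on device

Solution:

    # 1. Check disk usage
    docker system df

    # 2. Remove unused resources
    #    WARNING: --volumes also deletes unused volumes and the data in them
    docker system prune -a --volumes

    # 3. Clean up specific resource types
    docker image prune -a     # remove unused images
    docker container prune    # remove stopped containers
    docker volume prune       # remove unused volumes
    docker builder prune      # remove build cache

    # 4. Check usage per directory (root is needed to read /var/lib/docker)
    sudo sh -c 'du -sh /var/lib/docker/*'

Category 2: Container Issues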

Issue 2.1: Container keeps restarting
Symptom:

    nodered   Restarting (1) 10 seconds ago

Diagnostic steps:

    # 1. View the container logs
    docker logs nodered

    # 2. Inspect why the container last exited
    docker inspect nodered --format '{{.State.ExitCode}}'
    docker inspect nodered --format '{{.State.Error}}'
    docker inspect nodered --format '{{.State.OOMKilled}}'

    # 3. Check resource usage against the configured limits
    docker stats nodered --no-stream

Common causes and solutions:

| Cause | Log signature | Solution |
|---|---|---|
| Port conflict | Error: listen EADDRINUSE | Change the port mapping |
| Permission problem | Permission denied | Check data volume permissions |
| Configuration error | Error loading config | Check the configuration file |
| Out of memory | Killed / OOM error | Raise the memory limit |
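
The table above can be turned into a quick first-pass classifier for log output. This is only a sketch: `classify_restart_cause` is a hypothetical helper, and the patterns are the log signatures from the table, which real logs may phrase differently.

```shell
# classify_restart_cause: read log text on stdin and print the first cause
# from the table whose signature appears, or "unknown" if none matches.
classify_restart_cause() {
    log=$(cat)
    case "$log" in
        *EADDRINUSE*)             echo "port conflict" ;;
        *"Permission denied"*)    echo "permission problem" ;;
        *"Error loading config"*) echo "configuration error" ;;
        *Killed*|*OOM*)           echo "out of memory" ;;
        *)                        echo "unknown" ;;
    esac
}
```

A typical call is `docker logs --tail 200 nodered 2>&1 | classify_restart_cause`. The `2>&1` matters because `docker logs` replays the container's stderr on its own stderr. An "unknown" result only means none of these four signatures appeared, so the full log still needs reading.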

Issue 2.2: Port conflict
Symptom:

    Error: cannot assign requested address: listen tcp4 :1880: bind: address already in use

Solution:

    # 1. Find the process occupying the port
    lsof -i :1880
    # or
    netstat -tlnp | grep 1880

    # 2. Stop that process (fall back to kill -9 only if it ignores the default signal)
    kill <PID>

    # 3. Or change the port mapping in docker-compose.yml
    services:
      nodered:
        ports:
          - "1881:1880"   # change the host port

Issue 2.3: Container fails to start
Symptom: the container exits immediately after starting.
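
A container that exits immediately still leaves an exit code behind, and a few values have conventional meanings. The sketch below maps them to short explanations; `describe_exit_code` is a hypothetical helper, and codes outside this short list come from the application itself.

```shell
# describe_exit_code CODE: explain a container exit code.
# 125-127 are reserved by `docker run`; 128+N means "killed by signal N".
describe_exit_code() {
    case "$1" in
        0)   echo "main process finished normally" ;;
        125) echo "docker run itself failed" ;;
        126) echo "command found but could not be invoked" ;;
        127) echo "command not found in the image" ;;
        137) echo "killed by SIGKILL (128+9)" ;;
        143) echo "terminated by SIGTERM (128+15)" ;;
        *)   echo "application-specific error, check the logs" ;;
    esac
}
```

For example, `describe_exit_code "$(docker inspect nodered --format '{{.State.ExitCode}}')"` explains the last exit of the `nodered` container. Exit code 137 often, but not always, points to the kernel OOM killer.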
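Diagnosis:

    # 1. Check the exit status
    docker ps -a --filter "name=nodered"

    # 2. View detailed logs
    docker logs --tail 50 nodered

    # 3. Inspect the image from a throwaway container
    #    (--entrypoint replaces the image's own entrypoint so a shell opens)
    docker run --rm -it --entrypoint sh nodered/node-red:latest

Category 3: MQTT Issues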

Issue 3.1: MQTT connection refused
Symptom:

    Connection refused: not authorised

Solution:

    # 1. Check whether password authentication is enabled
    docker exec mosquitto cat /mosquitto/config/mosquitto.conf | grep allow_anonymous

    # 2. Check the password file
    docker exec mosquitto cat /mosquitto/config/password.txt

    # 3. If the password is the problem, set a new one
    docker exec mosquitto mosquitto_passwd -b /mosquitto/config/password.txt iot_user NewPassword

    # 4. Restart so that Mosquitto reloads the password file
    docker restart mosquitto

Issue 3.2: MQTT cannot connect
Symptom:

    Error: No connection could be made because the target machine actively refused it

Troubleshooting:

    # 1. Check that Mosquitto is running
    docker ps | grep mosquitto

    # 2. Check that the port is listening
    netstat -tlnp | grep 1883

    # 3. Test port reachability (from an external machine)
    nc -zv <server-ip> 1883

    # 4. Check firewall rules
    sudo ufw status
    sudo iptables -L -n | grep 1883

    # 5. View the Mosquitto logs
    docker logs --tail 50 mosquitto

Issue 3.3: MQTT TLS connection error
Symptom:

    Error: A TLS error occurred.

Solution:

    # 1. Check certificate paths and permissions
    docker exec mosquitto ls -la /mosquitto/certs/

    # 2. Verify the certificate (read it from the container, parse it with the host's openssl)
    docker exec mosquitto cat /mosquitto/certs/server.crt | openssl x509 -text -noout

    # 3. Test the TLS connection
    openssl s_client -connect localhost:8883 -CAfile ca.crt

    # 4. Check the Mosquitto TLS configuration
    docker exec mosquitto grep -A 10 "listener 8883" /mosquitto/config/mosquitto.conf

Category 4: Node-RED Issues

Issue 4.1: Node-RED login page unreachable
Symptom: the browser gets no response on port 1880.
Solution:

    # 1. Check the container status
    docker ps | grep nodered

    # 2. View the logs
    docker logs --tail 20 nodered

    # 3. Check the port Node-RED listens on inside the container
    docker exec nodered cat /data/settings.js | grep uiPort

    # 4. Restart Node-RED
    docker restart nodered

Issue 4.2: Flow data lost
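Symptom: the flow configuration disappears after the container is restarted (more precisely, after it is removed and recreated, for example by docker-compose down followed by up).
Cause: no data volume is configured for persistence.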
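
Whether `/data` survives a recreate depends on what kind of mount backs it. The sketch below classifies a mount from the `Type` and `Name` fields that `docker inspect` reports; `mount_kind` is a hypothetical helper. It assumes Docker's convention of naming anonymous volumes with a 64-character hex ID.

```shell
# mount_kind TYPE NAME: classify one entry of a container's .Mounts list.
mount_kind() {
    case "$1" in
        bind)
            echo "bind mount" ;;
        volume)
            # Anonymous volumes get a 64-character hex ID as their name.
            if printf '%s' "$2" | grep -Eq '^[0-9a-f]{64}$'; then
                echo "anonymous volume"
            else
                echo "named volume"
            fi ;;
        *)
            echo "other" ;;
    esac
}
```

The inputs can be read from `docker inspect nodered --format '{{json .Mounts}}'`. A bind mount or named volume at `/data` keeps flows across recreation; an anonymous volume is replaced by a fresh one when the container is recreated, which looks exactly like lost flows.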
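Solution:

    # 1. Create the data directory
    #    (the official Node-RED image runs as UID 1000, so that user must be able to write here)
    mkdir -p /opt/iot/nodered/data
    sudo chown -R 1000:1000 /opt/iot/nodered/data

    # 2. Add the volume to docker-compose.yml
    volumes:
      - /opt/iot/nodered/data:/data

    # 3. Recreate the container
    docker-compose up -d

Issue 4.3: Missing node module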
Symptom:

    Error: Cannot find module 'node-red-contrib-influxdb'

Solution:

    # 1. Enter the container
    docker exec -it nodered sh

    # 2. Install the missing node (inside the container, in the user directory)
    cd /data
    npm install node-red-contrib-influxdb

    # 3. Restart the service (from the host)
    docker restart nodered

    # 4. Or install persistently (recommended)
    # Create package.json
    cat > nodered/data/package.json << 'EOF'
    {
      "name": "nodered-custom",
      "version": "1.0.0",
      "dependencies": {
        "node-red-contrib-influxdb": "latest",
        "node-red-dashboard": "latest"
      }
    }
    EOF

Category 5: InfluxDB Issues

Issue 5.1: InfluxDB cannot write data
Symptom: errors occur when Node-RED writes data.
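
A useful first check for this symptom is whether InfluxDB itself reports healthy. The sketch below pulls the `status` field out of the JSON returned by the `/health` endpoint; `health_status` is a hypothetical helper. It assumes the InfluxDB 2.x response shape, where `status` is `pass` or `fail`.

```shell
# health_status: read an InfluxDB /health JSON document on stdin and print
# the value of its "status" field (prints nothing if the field is absent).
health_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

Run it as `curl -s http://localhost:8086/health | health_status`. If the result is `pass`, the server is up and the write error is more likely caused by the token, organization, or bucket than by the service itself.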
Solution:

    # 1. Check the InfluxDB container status
    docker ps | grep influxdb

    # 2. Check the health endpoint
    curl http://localhost:8086/health

    # 3. Check the token
    docker exec influxdb influx auth list

    # 4. Create a new token (if needed)
    docker exec influxdb influx auth create \
      --org iot-demo \
      --all-access

Issue 5.2: InfluxDB out of disk space
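Symptom:

    error writing data: engine: disk space is exhausted

Solution:

    # 1. Check the data size (root is needed to read Docker's volume directory)
    sudo sh -c 'du -sh /var/lib/docker/volumes/influxdb_data/*'

    # 2. Set a data retention period
    #    (bucket update selects the bucket by ID; look it up first)
    docker exec influxdb influx bucket list --name nodered
    docker exec influxdb influx bucket update \
      --id <bucket-id> \
      --retention 30d

    # 3. Verify the write-ahead log (this checks WAL integrity; it does not compact data)
    docker exec influxdb influxd inspect verify-wal \
      --engine-path /var/lib/influxdb2/engine/

Category 6: Network Issues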

Issue 6.1: Containers cannot communicate with each other
Symptom: Node-RED cannot connect to Mosquitto.
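
The most common reason for this symptom is that the two containers share no Docker network. A minimal sketch for comparing their network lists follows; `shared_networks` and `container_networks` are hypothetical helpers, and the container names are the ones used in this section.

```shell
# shared_networks "LIST_A" "LIST_B": print every network name that occurs
# in both space-separated lists, one per line.
shared_networks() {
    for a in $1; do
        for b in $2; do
            [ "$a" = "$b" ] && echo "$a"
        done
    done
    return 0
}

# container_networks NAME: print the networks a container is attached to,
# separated by spaces. Defined only; requires a running Docker daemon.
container_networks() {
    docker inspect "$1" \
        --format '{{range $name, $net := .NetworkSettings.Networks}}{{$name}} {{end}}'
}
```

Calling `shared_networks "$(container_networks nodered)" "$(container_networks mosquitto)"` prints the networks both containers are on. Empty output means they cannot reach each other by name. Name resolution also requires a user-defined network, since Docker's default `bridge` network does not resolve container names.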
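Solution:

    # 1. Check that both containers are on the same network
    docker inspect nodered --format '{{json .NetworkSettings.Networks}}'
    docker inspect mosquitto --format '{{json .NetworkSettings.Networks}}'

    # 2. Test container-to-container connectivity
    docker exec nodered ping -c 3 mosquitto

    # 3. Make sure the broker host name is correct
    #    (MQTT broker nodes are stored with the flows, so check the flow file as well)
    docker exec nodered grep -n broker /data/settings.js /data/flows.json

    # 4. Manual connection test (inside the container; assumes mosquitto_sub is installed there)
    docker exec -it nodered sh
    mosquitto_sub -h mosquitto -t "test" -v

Issue 6.2: Services unreachable from outside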
Symptom: Node-RED cannot be reached from an external browser.
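
One cause worth ruling out early is a port published only on the loopback interface. The sketch below classifies an address as printed by `docker port`; `binding_scope` is a hypothetical helper, and the patterns assume the usual `ADDRESS:PORT` output format.

```shell
# binding_scope ADDR: classify one line of `docker port` output.
binding_scope() {
    case "$1" in
        "")                       echo "not published" ;;
        127.*|"[::1]"*)           echo "loopback only" ;;
        0.0.0.0:*|"[::]"*|:::*)   echo "all interfaces" ;;
        *)                        echo "specific address" ;;
    esac
}
```

For example, `docker port nodered 1880 | while read -r addr; do binding_scope "$addr"; done` prints one result per published binding. "loopback only" means the service is reachable from the host itself but never from another machine, regardless of firewall rules.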
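Solution:

    # 1. Check the port mapping
    docker port nodered 1880

    # 2. Check the firewall
    sudo ufw status
    sudo ufw allow 1880/tcp

    # 3. Check the cloud provider's security group
    # AWS: EC2 → Security Groups → Inbound Rules
    # Alibaba Cloud: Security Groups → Inbound → Add Rule

    # 4. Check the bind address
    # ports in docker-compose.yml should use "1880:1880"
    # rather than "127.0.0.1:1880:1880" (local access only)

Quick Reference Card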
Troubleshooting Command Cheat Sheet

| Command | Purpose | Example |
|---|---|---|
| docker ps -a | List all containers and their status | docker ps -a \| grep nodered |
| docker logs <container> | View container logs | docker logs --tail 50 mosquitto |
| docker inspect <container> | View container details | docker inspect nodered \| grep "IPAddress" |
| docker stats | View resource usage | docker stats --no-stream |
| docker exec -it <container> sh | Open a shell in a container | docker exec -it influxdb sh |
| docker system df | View Docker disk usage | docker system df |
| docker system prune | Remove unused resources | docker system prune -a |
| lsof -i :<port> | Show what is using a port | lsof -i :1883 |
| netstat -tlnp | List listening ports | netstat -tlnp \| grep 1880 |
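
Several of the commands in the table are usually run together for one container, so they can be wrapped into a small report. This is a sketch only: `run_section` and `container_report` are hypothetical helpers, and the choice of commands and the `--tail 50` limit are assumptions to adjust.

```shell
# run_section TITLE CMD [ARGS...]: print a header, run the command, and
# report a failure instead of aborting, so one broken check does not
# hide the remaining ones.
run_section() {
    title=$1
    shift
    echo "--- ${title} ---"
    "$@" 2>&1 || echo "(command failed: $*)"
    echo ""
}

# container_report NAME: status, recent logs, and resource usage for one
# container. Defined only; requires a running Docker daemon.
container_report() {
    run_section "Status"         docker ps -a --filter "name=$1"
    run_section "Recent logs"    docker logs --tail 50 "$1"
    run_section "Resource usage" docker stats --no-stream "$1"
}
```

Running `container_report mosquitto > mosquitto-report.txt` produces a single file that can be attached to a support ticket. Because failures are reported inline, the report stays useful even when the container does not exist or Docker is not running.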
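
Summary
- More than half of all problems can be located by reading the logs
- Port conflicts and permission problems are the most common container issues
- Most MQTT connection problems are caused by authentication or network configuration
- Data loss is usually the result of missing volume persistence
- The diagnostic script helps locate common problems quickly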