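The lab server broke down again. To make resource monitoring and troubleshooting easier from now on, I designed this setup after reinstalling the system.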

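  • Operating system: Ubuntu 24
  • Web UI port: 20000
  • Goals: easy to deploy, easy to operate, and any problem traceable to a specific user
  • Scope: CPU / memory / disk / network / GPU resource monitoring, plus auditing of high-risk operations
    Prometheus stores the time-series metrics, auditd records Linux audit events, and atop records system-level and process-level activity history. node_exporter collects host CPU, memory, disk, and network metrics, and dcgm-exporter collects NVIDIA GPU metrics; both expose a /metrics endpoint. Grafana is the only web portal, on port 20000.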

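Components

The system follows three principles: a single entry point, separation of reads and writes, and a lightweight architecture. Without introducing a heavyweight scheduler, it focuses on two needs: seeing the current state and establishing accountability. The responsibilities of each core component are as follows:

1. Presentation layer

  • Grafana: the only web portal (listening on port 20000), providing a unified set of dashboards. With tiered permissions, regular users see the whole-machine resource overview (no sensitive information), while administrators see detailed per-user and per-process troubleshooting data. This keeps the front end isolated from the underlying data.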

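2. Resource monitoring layer (metric collection and storage)

Collects and stores time-series state data (recommended retention: 15 to 30 days):

  • Prometheus: the core time-series database (TSDB). It scrapes metrics from each exporter, persists them to disk for the retention period, and serves as Grafana's query data source.
  • node_exporter: collects basic host-level metrics, including CPU usage and load, available memory, disk I/O, and network throughput.
  • dcgm-exporter: collects the key NVIDIA GPU metrics, covering per-GPU utilization, GPU memory usage, power draw, and temperature.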

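3. User tracking and behavior auditing layer (logs and traceability)

Leaves a trail of high-risk actions and lets past anomalies be replayed, which addresses the requirement that a problem can be traced to the person responsible:

  • auditd: the userspace daemon of the Linux kernel audit subsystem. It records high-risk operations such as privilege escalation (for example sudo), execution of sensitive commands, and changes to configuration files (for example SSH settings and user groups), so that mistakes and violations can be reviewed and attributed after a failure.
  • atop: a recorder of system-level and process-level history. It periodically writes snapshots of system and process activity to log files. When the machine hung at some point in the past because a resource was exhausted, or compute usage looked abnormal, administrators can use the historical logs to pinpoint the PID, command line, and user that exhausted the resource.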

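Deployment

Note: this server has a 500 GB system SSD and three 8 TB hard disks, mounted at /home, /data1, and /data2 respectively. Adjust the /data1 and /data2 paths used below to match your own layout.

Create the log directories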

sudo mkdir -p /data2/prometheus
sudo mkdir -p /data2/audit
sudo mkdir -p /data2/atop
sudo mkdir -p /data2/reports
sudo chown -R prometheus:prometheus /data2/prometheus 2>/dev/null || true
sudo chmod 750 /data2/prometheus
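
Because everything below writes to /data2, it may be worth confirming that the data disk is actually mounted before creating directories; otherwise the data silently lands on the system SSD. The following is a minimal sketch, not part of the original procedure: `is_mountpoint` and `require_mount` are hypothetical helper names, and the check is the usual heuristic of comparing a directory's device number with its parent's (the util-linux `mountpoint -q` command performs an equivalent check).

```shell
# Return 0 if $1 is a mount point, 1 otherwise.
# Heuristic: a mount point lives on a different device than its parent
# directory, or (for "/") is its own parent.
is_mountpoint() (
  dir=$1
  [ -d "$dir" ] || exit 1
  dev=$(stat -c %d "$dir")
  parent_dev=$(stat -c %d "$dir/..")
  ino=$(stat -c %i "$dir")
  parent_ino=$(stat -c %i "$dir/..")
  [ "$dev" != "$parent_dev" ] || [ "$ino" = "$parent_ino" ]
)

# Hypothetical guard: print an error and fail when the path is not mounted.
require_mount() {
  if ! is_mountpoint "$1"; then
    echo "ERROR: $1 is not a mount point; check /etc/fstab first" >&2
    return 1
  fi
}
```

Usage: run `require_mount /data2` before the mkdir commands and stop if it fails. Bind mounts inside the same filesystem are not detected by this heuristic.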

Install the base packages

sudo apt update
sudo apt install -y prometheus prometheus-node-exporter auditd atop sysstat curl wget gnupg2 ca-certificates apt-transport-https

Check the services

systemctl status prometheus --no-pager
systemctl status prometheus-node-exporter --no-pager
systemctl status auditd --no-pager
systemctl status atop --no-pager

Add the official Grafana repository

sudo mkdir -p /etc/apt/keyrings
sudo wget -O /etc/apt/keyrings/grafana.asc https://apt.grafana.com/gpg-full.key
sudo chmod 644 /etc/apt/keyrings/grafana.asc
echo "deb [signed-by=/etc/apt/keyrings/grafana.asc] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update

Install Grafana

sudo apt install -y grafana

Change the Grafana listening port

sudo cp /etc/grafana/grafana.ini /etc/grafana/grafana.ini.bak
sudo sed -i 's/^;http_port = 3000/http_port = 20000/' /etc/grafana/grafana.ini
grep -n "^http_port" /etc/grafana/grafana.ini
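
The sed command above only matches the exact line `;http_port = 3000`; if the line was already uncommented or spaced differently, nothing changes and the only hint is an empty grep result. A slightly more tolerant sketch is shown here. `set_http_port` is a hypothetical helper name, and it assumes GNU sed (for `-i -E`), which is what Ubuntu ships.

```shell
# set_http_port FILE PORT
# Rewrite every "http_port = ..." line (commented with ';' or '#', or not)
# to the given port, then fail if the file does not contain that setting.
set_http_port() (
  file=$1 port=$2
  sed -i -E "s/^[;#]?[[:space:]]*http_port[[:space:]]*=.*/http_port = ${port}/" "$file"
  grep -qx "http_port = ${port}" "$file"
)
```

Usage, as root: `set_http_port /etc/grafana/grafana.ini 20000 || echo "http_port not set"`. A non-zero exit status means the key was not found and has to be added by hand under the [server] section.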

Open the port

sudo ufw allow 20000/tcp

Start Grafana and enable it at boot

sudo systemctl daemon-reload
sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server --no-pager

Back up the Prometheus configuration and write the new one

sudo cp /etc/prometheus/prometheus.yml /etc/prometheus/prometheus.yml.bak

sudo tee /etc/prometheus/prometheus.yml > /dev/null <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['127.0.0.1:9100']

  - job_name: 'dcgm'
    static_configs:
      - targets: ['127.0.0.1:9400']
EOF
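
Since the design keeps every exporter on the loopback interface, a quick check that the scrape configuration only points at local targets can catch copy-and-paste mistakes. This is a rough sketch under the assumption that targets are written inline on the `- targets:` line as in the file above; `list_targets` and `non_loopback_targets` are hypothetical helper names, and the grep-based parsing is not a real YAML parser.

```shell
# list_targets FILE
# Print every host:port found on a "- targets:" line, one per line.
list_targets() {
  grep -E '^[[:space:]]*-[[:space:]]*targets:' "$1" \
    | grep -oE '[A-Za-z0-9_.-]+:[0-9]+'
}

# non_loopback_targets FILE
# Print only the targets that are not 127.0.0.1 or localhost.
non_loopback_targets() {
  list_targets "$1" | grep -vE '^(127\.0\.0\.1|localhost):' || true
}
```

Usage: `non_loopback_targets /etc/prometheus/prometheus.yml` should print nothing for the configuration above. If `promtool` is installed, `promtool check config /etc/prometheus/prometheus.yml` additionally validates the syntax.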

Override the Prometheus systemd start parameters

sudo mkdir -p /etc/systemd/system/prometheus.service.d
sudo tee /etc/systemd/system/prometheus.service.d/override.conf > /dev/null <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/data2/prometheus \
  --storage.tsdb.retention.time=15d \
  --web.listen-address=127.0.0.1:9090
EOF
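
To judge whether 15 or 30 days of retention fits comfortably on the data disk, the Prometheus documentation gives a rough rule: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that rule with integer arithmetic. The function name is hypothetical, and the series count in the usage example is an assumed figure, not a measurement from this server; the real value can be read from the `prometheus_tsdb_head_series` metric.

```shell
# estimate_tsdb_bytes DAYS SERIES SCRAPE_INTERVAL_SECONDS BYTES_PER_SAMPLE
# samples per second = SERIES / SCRAPE_INTERVAL_SECONDS
# The multiplication is done first so the integer division loses little.
estimate_tsdb_bytes() {
  echo $(( $1 * 86400 * $2 * $4 / $3 ))
}
```

For example, `estimate_tsdb_bytes 15 5000 15 2` assumes 5000 active series scraped every 15 s at 2 bytes per sample and prints 864000000, that is, under 1 GB for 15 days. Doubling the retention to 30 days doubles the estimate.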

Reload and start

sudo systemctl daemon-reload
sudo systemctl restart prometheus
sudo systemctl enable prometheus
sudo systemctl status prometheus --no-pager

Fix permissions and AppArmor access for the new data path

sudo mkdir -p /data2/prometheus
sudo chown -R prometheus:prometheus /data2/prometheus
sudo chmod 750 /data2/prometheus
sudo chmod 755 /data2
namei -l /data2/prometheus
sudo mkdir -p /etc/apparmor.d/local
sudo tee /etc/apparmor.d/local/usr.bin.prometheus > /dev/null <<'EOF'
/data2/prometheus/ rw,
/data2/prometheus/** rwk,
EOF

sudo apparmor_parser -r /etc/apparmor.d/usr.bin.prometheus
sudo systemctl restart prometheus
sudo systemctl status prometheus --no-pager -l
curl -s http://127.0.0.1:9090/-/healthy
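
Prometheus may need a few seconds after a restart before the health endpoint answers, so a single curl right after `systemctl restart` can fail even though the service is fine. A small retry wrapper is one way to handle this. `wait_until` is a hypothetical helper, not part of any of the tools used here.

```shell
# wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND up to TRIES times, sleeping DELAY_SECONDS between attempts.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_until() (
  tries=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$tries" ]; then
      sleep "$delay"
    fi
  done
  return 1
)
```

Usage: `wait_until 10 2 curl -fsS http://127.0.0.1:9090/-/healthy && echo "Prometheus is healthy"`. The same wrapper works for the exporter endpoints on ports 9100 and 9400.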

Configure node_exporter to listen on localhost only

sudo mkdir -p /etc/systemd/system/prometheus-node-exporter.service.d
sudo tee /etc/systemd/system/prometheus-node-exporter.service.d/override.conf > /dev/null <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/prometheus-node-exporter --web.listen-address=127.0.0.1:9100
EOF
sudo systemctl daemon-reload
sudo systemctl restart prometheus-node-exporter
sudo systemctl enable prometheus-node-exporter
sudo systemctl status prometheus-node-exporter --no-pager

Set up Docker

sudo apt install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL http://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] http://mirrors.aliyun.com/docker-ce/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo vim /etc/docker/daemon.json

Change the file content to:
{
"registry-mirrors": [
"https://docker.1ms.run"
]
}

sudo systemctl restart docker
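
For a scripted install, the interactive vim step can be replaced by writing the file directly. The sketch below is one possible non-interactive variant. `write_registry_mirror` is a hypothetical helper, and it deliberately refuses to touch a non-empty file, because daemon.json may already hold other settings that a plain overwrite would destroy.

```shell
# write_registry_mirror FILE MIRROR_URL
# Create FILE with a single registry mirror entry.
# Fails instead of overwriting when FILE already has content.
write_registry_mirror() (
  file=$1 mirror=$2
  if [ -s "$file" ]; then
    echo "refusing to overwrite non-empty $file" >&2
    exit 1
  fi
  mkdir -p "$(dirname "$file")"
  cat > "$file" <<EOF
{
  "registry-mirrors": [
    "${mirror}"
  ]
}
EOF
)
```

Usage, as root: `write_registry_mirror /etc/docker/daemon.json https://docker.1ms.run`, followed by the Docker restart shown above.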

Install the NVIDIA Container Toolkit

sudo apt-get update && sudo apt-get install -y --no-install-recommends ca-certificates curl gnupg2
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Deploy dcgm-exporter and restrict it to localhost

sudo tee /etc/systemd/system/dcgm-exporter.service > /dev/null <<'EOF'
[Unit]
Description=DCGM Exporter for Prometheus
Requires=docker.service
After=docker.service

[Service]
Type=simple
Environment=DCGM_EXPORTER_VERSION=2.1.4-2.3.1
ExecStartPre=-/usr/bin/docker rm -f dcgm-exporter
ExecStart=/usr/bin/docker run --rm --name dcgm-exporter \
  --gpus all \
  --net host \
  --cap-add SYS_ADMIN \
  nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu20.04 \
  -a 127.0.0.1:9400 \
  -f /etc/dcgm-exporter/dcp-metrics-included.csv
ExecStop=/usr/bin/docker stop dcgm-exporter
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now dcgm-exporter
sudo systemctl status dcgm-exporter --no-pager
ss -lntp | grep 9400
curl -s http://127.0.0.1:9400/metrics | head
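
At this point Prometheus (9090), node_exporter (9100), and dcgm-exporter (9400) should all be bound to 127.0.0.1, leaving Grafana on port 20000 as the only externally reachable service. The following sketch filters `ss -lnt` output to flag any of those ports that is bound to a non-loopback address. `exposed_listeners` is a hypothetical helper, and it assumes the usual `ss` column layout in which the fourth field is the local address and port.

```shell
# exposed_listeners PORT...
# Reads `ss -lnt` output on stdin and prints "address:port" for every
# listed port that is bound to something other than 127.0.0.1 or [::1].
exposed_listeners() {
  awk -v ports="$*" '
    BEGIN {
      n = split(ports, p, " ")
      for (i = 1; i <= n; i++) want[p[i]] = 1
    }
    $1 == "LISTEN" {
      addr = $4
      port = addr; sub(/^.*:/, "", port)      # text after the last colon
      host = addr; sub(/:[^:]*$/, "", host)   # text before the last colon
      if ((port in want) && host != "127.0.0.1" && host != "[::1]") print addr
    }'
}
```

Usage: `ss -lnt | exposed_listeners 9090 9100 9400`. Empty output means none of the three ports is reachable from outside the machine.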

Configure auditd to write its logs to /data2/audit

sudo mkdir -p /data2/audit
sudo chown root:root /data2/audit
sudo chmod 700 /data2/audit

sudo cp /etc/audit/auditd.conf /etc/audit/auditd.conf.bak

sudo sed -i 's#^log_file = .*#log_file = /data2/audit/audit.log#' /etc/audit/auditd.conf
sudo sed -i 's/^max_log_file = .*/max_log_file = 100/' /etc/audit/auditd.conf
sudo sed -i 's/^num_logs = .*/num_logs = 20/' /etc/audit/auditd.conf
sudo sed -i 's/^max_log_file_action = .*/max_log_file_action = ROTATE/' /etc/audit/auditd.conf

sudo tee /etc/audit/rules.d/server-monitor.rules > /dev/null <<'EOF'
## User and identity files
-w /etc/passwd -p wa -k identity
-w /etc/shadow -p wa -k identity
-w /etc/group -p wa -k identity
-w /etc/gshadow -p wa -k identity
-w /etc/sudoers -p wa -k scope
-w /etc/sudoers.d/ -p wa -k scope

## SSH, firewall, systemd
-w /etc/ssh/sshd_config -p wa -k sshd
-w /etc/ufw/ -p wa -k firewall
-w /etc/systemd/system/ -p wa -k systemd
-w /lib/systemd/system/ -p wa -k systemd

## Key commands and syscalls
-a always,exit -F arch=b64 -S execve -F euid=0 -k privileged-cmd
-a always,exit -F arch=b64 -S chmod,fchmod,fchmodat,chown,fchown,fchownat,lchown -k perm_mod
-a always,exit -F arch=b64 -S unlink,unlinkat,rename,renameat,rmdir -k delete_ops
EOF

sudo augenrules --load
sudo systemctl restart auditd
sudo systemctl enable auditd
sudo systemctl status auditd --no-pager

sudo true
sudo ausearch -k privileged-cmd | tail
sudo aureport -x --summary | head
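
With `max_log_file_action = ROTATE`, auditd keeps at most `num_logs` files of `max_log_file` megabytes each, so the settings above cap the audit log footprint at about 100 × 20 = 2000 MB. The sketch below reads both values from the configuration file and multiplies them, which is handy after changing either setting. `audit_log_cap_mb` is a hypothetical helper name.

```shell
# audit_log_cap_mb CONF
# Print max_log_file (in MB) multiplied by num_logs from an auditd.conf file.
audit_log_cap_mb() {
  awk -F '[[:space:]]*=[[:space:]]*' '
    $1 == "max_log_file" { size = $2 }
    $1 == "num_logs"     { count = $2 }
    END { print size * count }' "$1"
}
```

Usage: `sudo sh -c '. ./audit_cap.sh; audit_log_cap_mb /etc/audit/auditd.conf'`, where audit_cap.sh is an assumed file name holding the function. Because the rule on root `execve` calls can be noisy, comparing this cap with how quickly the logs rotate shows how many days of history are actually kept.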

Configure atop and store its history logs in /data2/atop

sudo mkdir -p /data2/atop
sudo chown root:root /data2/atop
sudo chmod 755 /data2/atop
sudo cp /etc/default/atop /etc/default/atop.bak
cat /etc/default/atop
sudo systemctl stop atop
sudo systemctl stop atopacct
sudo mv /var/log/atop /var/log/atop.bak
sudo ln -s /data2/atop /var/log/atop
sudo systemctl start atopacct
sudo systemctl start atop
sudo systemctl status atop --no-pager
sudo systemctl status atopacct --no-pager
ls -ld /var/log/atop
ls -lah /data2/atop
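
Once logs accumulate, a past incident can be replayed with `atop -r`, which reads a daily raw file named atop_YYYYMMDD. The helper below is a small sketch that builds the file path for a given date. `atop_log_path` is a hypothetical name, and it relies on GNU `date -d`, which is available on Ubuntu.

```shell
# atop_log_path DATE [LOG_DIR]
# Print the path of the atop raw log for DATE (any format GNU date accepts).
# LOG_DIR defaults to /var/log/atop, which is symlinked to /data2/atop above.
atop_log_path() {
  echo "${2:-/var/log/atop}/atop_$(date -d "$1" +%Y%m%d)"
}
```

Usage: `sudo atop -r "$(atop_log_path yesterday)" -b 02:00 -e 03:00` opens yesterday's log restricted to the window between 02:00 and 03:00. Inside the viewer, `t` steps forward and `T` steps backward through the recorded samples.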

Connect Grafana to Prometheus

After Grafana is installed, open it in a browser:

http://<server-IP>:20000

The default credentials are usually:
admin
admin

Then add a data source:
Type: Prometheus
URL: http://127.0.0.1:9090
Save & Test
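
As an alternative to clicking through the UI, Grafana can load data sources from provisioning files at startup. The sketch below writes such a file for the local Prometheus. `write_prometheus_datasource` is a hypothetical helper, and the target directory in the usage note is the default provisioning path of the Debian/Ubuntu package, which should be verified on the actual system.

```shell
# write_prometheus_datasource FILE [URL]
# Write a Grafana data source provisioning file that points at Prometheus.
# URL defaults to the loopback address used throughout this setup.
write_prometheus_datasource() (
  file=$1 url=${2:-http://127.0.0.1:9090}
  mkdir -p "$(dirname "$file")"
  cat > "$file" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${url}
    isDefault: true
EOF
)
```

Usage, as root: `write_prometheus_datasource /etc/grafana/provisioning/datasources/prometheus.yaml`, then `systemctl restart grafana-server`. The data source should then appear in the UI without manual steps.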

Grafana configuration

Create two dashboards, one for regular users and one for administrators.
Public:

  • Panel 1: CPU Usage
    • PromQL: 100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))
    • Visualizations: Stat
    • Unit: percent (0-100)

As shown in the figure, switch the query editor to Code mode when entering the PromQL, then click Run queries.

compressed_image.png

  • Panel 2: Memory Usage

    • PromQL: 100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
    • Visualizations: Stat
    • Unit: percent (0-100)
  • Panel 3: Root Filesystem Usage

    • PromQL: 100 * (1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{mountpoint="/",fstype!~"tmpfs|overlay"}))
    • Visualizations: Stat
    • Unit: percent (0-100)
  • Panel 4: /data1 Usage

    • PromQL: 100 * (1 - (node_filesystem_avail_bytes{mountpoint="/data1",fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{mountpoint="/data1",fstype!~"tmpfs|overlay"}))
    • Visualizations: Stat
    • Unit: percent (0-100)
  • Panel 5: /data2 Usage

    • PromQL: 100 * (1 - (node_filesystem_avail_bytes{mountpoint="/data2",fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{mountpoint="/data2",fstype!~"tmpfs|overlay"}))
    • Visualizations: Stat
    • Unit: percent (0-100)
  • Panel 6: Network Traffic

    • PromQL1: sum(rate(node_network_receive_bytes_total{device!~"lo"}[5m]))
    • PromQL2: sum(rate(node_network_transmit_bytes_total{device!~"lo"}[5m]))
    • Visualizations: Time series
    • Unit: bytes/sec
  • Panel 7: GPU Utilization

    • PromQL: 100 * clamp_min(clamp_max((DCGM_FI_DEV_POWER_USAGE - 25) / (349 - 25), 1), 0)
    • Visualizations: Time series
    • Unit: percent (0-100)
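      The GPU load percentage could not be read directly, so it is roughly approximated from power draw. The cards in this server are 3090s rated at 350 W, and after boot the four cards idle at 9 to 25 W, so the query maps the range from 25 W to 349 W linearly onto 0 to 100% and clamps anything outside that range.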
  • Panel 8: GPU Memory Usage

    • PromQL: DCGM_FI_DEV_FB_USED
    • Visualizations: Time series
    • Unit: mebibytes (MiB)
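
The power-based estimate in Panel 7 is easier to sanity-check with a few concrete numbers. The sketch below reproduces the same arithmetic outside Prometheus: subtract the idle power, divide by the usable range, clamp to the interval from 0 to 1, and scale to a percentage. `power_to_percent` is a hypothetical helper, and its defaults of 25 W and 349 W are simply the constants from the query above.

```shell
# power_to_percent WATTS [IDLE_WATTS] [MAX_WATTS]
# Mirror of the Panel 7 expression:
#   100 * clamp_min(clamp_max((power - idle) / (max - idle), 1), 0)
power_to_percent() {
  awk -v w="$1" -v lo="${2:-25}" -v hi="${3:-349}" 'BEGIN {
    r = (w - lo) / (hi - lo)
    if (r > 1) r = 1      # clamp_max(..., 1)
    if (r < 0) r = 0      # clamp_min(..., 0)
    printf "%.1f\n", 100 * r
  }'
}
```

For example, `power_to_percent 187` prints 50.0, because (187 - 25) / (349 - 25) = 0.5. A reading of 10 W clamps to 0.0 and a reading of 400 W clamps to 100.0. The result is a proxy for load, not a measured utilization figure.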

Admin:

  • Panel 1: CPU Usage Over Time

    • PromQL: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
    • Visualizations: Time series
    • Unit: percent (0-100)
  • Panel 2: System Load

    • PromQL 1: node_load1
    • PromQL 2: node_load5
    • PromQL 3: node_load15
    • Visualizations: Time series
    • Unit: none (load average is a dimensionless count, not a percentage)
  • Panel 3: Disk Read/Write

    • PromQL Read: sum(rate(node_disk_read_bytes_total[5m]))
    • PromQL Write: sum(rate(node_disk_written_bytes_total[5m]))
    • Visualizations: Time series
    • Unit: bytes/sec
  • Panel 4: CPU IOWait

    • PromQL: 100 * avg(rate(node_cpu_seconds_total{mode="iowait"}[5m]))
    • Visualizations: Time series
    • Unit: percent (0-100)
  • Panel 5: GPU Power Usage

    • PromQL: DCGM_FI_DEV_POWER_USAGE
    • Visualizations: Time series
    • Unit: Watt (W)
  • Panel 6: GPU SM Clock

    • PromQL: DCGM_FI_DEV_SM_CLOCK
    • Visualizations: Time series
    • Unit: MHz
  • Panel 7: GPU Memory Detail

    • PromQL: 100 * (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) (or use DCGM_FI_DEV_FB_USED alone)
    • Visualizations: Time series
  • Panel 8: GPU Temperature

    • PromQL: DCGM_FI_DEV_GPU_TEMP
    • Visualizations: Time series
    • Unit: °C
  • Panel 9: Host Up

    • PromQL: up
    • Visualizations: Stat (a value of 1 means the scrape target is up)

The finished dashboards

compressed_image-1.png

compressed_image-2.png