Operations Products¶
Operations monitoring has its own raw snapshots, monitoring Zarr, and archived quicklook products.
Raw snapshots¶
- rolling latest snapshot:
/project/aurora/raw/ops_monitor/latest.json - time-stamped JSONL snapshots:
/project/aurora/raw/ops_monitor/ops_monitor_YYYYMMDD.jsonl - observe-only health assessment:
/data/aurora/products/ops_monitor/health/latest_health.json - human-readable daily health report:
/data/aurora/products/ops_monitor/health/health_report_YYYYMMDD.md
These snapshots capture:
- snapshot collection time as both
time_utcandsnapshot_time_utc - source-host disk usage and probe reachability
- resolved
pwd -Ppaths for the filesystem locations that were actually probed - per-stream source recency, with a red operational state if a source has not
produced data in the last 1.5 hours; HATPRO uses a
3 hsource-recency window because it arrives as hourly batches and should only go stale after two missed batches - Aurora Power Supply battery voltage from the latest
DCInverterVoltssample in the power Zarr, scored green above52 V, amber from50-52 V, and red below50 V - Aurora Power Supply battery state of charge from the latest
BatterySOCsample in the power Zarr, scored green at or above50 %, amber from25-50 %, and red below25 % - Aurora Power Supply internal temperature from the latest
InternalTemperaturesample in the power Zarr, scored green below40 C, amber from40-45 C, and red at45 Cor above - local
/project,/data, and/filesystem usage - GWS usage and reachability
- per-stream mirror coverage, lag, and mismatch counts
- prune-gate and product-gate summaries
- systemd health for source sync, processing, and transfer units
- dashboard HTTP health and response time
- dashboard and infrastructure git branch, commit, dirty state, and local ahead/behind counts
The health assessment is deliberately observe-only. It summarizes the raw snapshot into green/amber/red checks, but it does not restart services, delete data, rebuild stores, or modify code.
Operations Zarr¶
Path:
/data/aurora/products/ops_monitor/ops_monitor.zarr
When checked on 2026-05-21:
- dimension:
time - shape:
time=4200 - data variables:
399 - time coverage when checked:
2026-05-09 16:01:00.181057to2026-05-21 20:16:17 - sorted unique
timecoordinate after the2026-05-18rebuild - useful attrs:
instrument = "ops-monitor"source = "ops_monitor_YYYYMMDD.jsonl"title = "Aurora operations monitoring"
The appender can rebuild automatically if the collector schema expands. It also checks the existing store before append; if time samples are out of order or duplicated, the old store is backed up and rebuilt from the raw JSONL snapshots.
Quicklooks¶
Archived operations products live under:
/data/aurora/products/quicklooks/ops_monitor
They currently include:
- summary operations quicklooks
HK_Operations
HK_Operations is intentionally curated diagnostics rather than an exhaustive
plot of every numeric monitoring field.
Live dashboard relationship¶
The top-level Operations Dashboard tab reads the latest snapshot directly, so the live status view can be useful even before enough archived samples exist to generate meaningful historical PNGs.
The live tab also reads the Operations Zarr for cached seven-day trend cards covering worst storage pressure, APS state of charge, APS battery voltage, worst source lag, and worst GWS lag. Those cards are context for operators; the latest snapshot remains the authority for the traffic-light state.
Phase 1 sentinel outputs¶
The collector now acts as the Phase 1 operations sentinel:
- every run writes the raw JSONL record and
latest.json - every run writes
latest_health.json, a compact machine-readable health assessment withoverall_level, check counts, and current observations - every run refreshes
latest_report.mdand the date-stamped daily report - the report explicitly states that no automated healing actions were taken
Email Alerts¶
send_ops_alerts.py evaluates the latest operations snapshot and sends email
alerts to gamb2le@ncas.ac.uk when operational thresholds are crossed.
Alert rules:
- any monitored storage filesystem reaches
80 %used - Aurora Power Supply battery state of charge is at or below
20 % - Aurora Power Supply internal temperature is at or above
45 C - Aurora Power Supply battery voltage from
DCInverterVoltsis below50 V - any stream source is stale for at least
3 h - any stream service-health component remains off/unhealthy for at least
3 h
Alert state and logs:
- state:
/data/aurora/products/ops_monitor/alerts/state.json - event log:
/data/aurora/products/ops_monitor/alerts/alerts.jsonl
The state file records active alerts, first-seen time, last-seen time, and last
sent time so the system does not send an email every five minutes. Alerts send
when they first become active after any hold period, repeat every 12 h while
still active, and send a recovery email after a previously emailed alert clears.
The recommended delivery stack for this headless VM is:
bsd-mailxas the script-facing command-line mail interfacemsmtp/msmtp-mtaas the lightweight outbound SMTP delivery layer
In other words, dashboard scripts use mailx because it is a simple command
line interface, while msmtp does the actual SMTP delivery through a relay. A
full Sendmail MTA is intentionally not required unless the VM later needs to run
a real mail server.
The script prefers mailx when it is available, then a sendmail-compatible
interface, then direct SMTP from OPS_ALERT_SMTP_* environment variables. Set
OPS_ALERT_TRANSPORT to mailx, sendmail, smtp, or auto to force a
specific path. The VM still needs a real SMTP relay configured for msmtp
before real email can be delivered through the recommended mailx path. When
sendmail is the msmtp wrapper, the alert script only treats mail delivery as
configured if an msmtp config exists at /home/aurora/.msmtprc, /etc/msmtprc,
/etc/msmtp/msmtprc, or a path named by OPS_ALERT_MSMTP_CONFIG.
Useful checks: