Alerts & Incidents
Alerts is the default landing page for Operations and the favourite pin in the sidebar. It is where you define what “unhealthy” means for your fleet, watch things fire in real time, acknowledge and resolve incidents, and confirm that notifications actually made it out the door.
What you see
Open Alerts in the sidebar. The page is tabbed:
| Tab | What’s in it |
|---|---|
| Rules | Every alert rule you have defined — source, target, threshold, channel, enabled toggle |
| Events | The live stream of firing events, grouped by rule + target so a flapping host does not drown the list |
| Incidents | Correlated incidents (same server, same severity, same title collapsed into one row) with status and last update |
| Notifications | Every attempted delivery (Email, Slack, Discord, Telegram, Webhook, including the PagerDuty preset) with success / failure and the response body from the receiver |
Each tab has filters at the top for server, severity, source type, and time window.
Alert rules
A rule watches one source and points at one target:
| Source | What it watches |
|---|---|
| Server metric | CPU, RAM, disk (per mount point), load, network, process counts, custom metrics |
| Anomaly | The anomaly detector’s output for a specific server |
| Uptime check | HTTP / TCP / ICMP monitor result |
| Heartbeat | A named heartbeat missed its cadence |
| Backup monitor | A watched backup path is stale, missing, or shrinking |
| Certificate | SSL certificate approaching expiry or invalid |
| Log pattern | A regex hit in tailed log lines |
You can start from a blank rule or pick a template (Disk 90 %, RAM 95 %, backup missed, SSL 14 days, and so on). Every rule has: name, severity (info, warn, crit), evaluation window, dedupe window, and one or more notification channels.
Disk alerts support per-mount-point targeting — you can alert on /var at 85 % without noise from a healthy /.
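To make the rule fields concrete, here is a minimal sketch of how a rule could be modelled in memory. It is not the HostAtlas schema: the `AlertRule` class, its field names, the source label `server_metric`, the server name `web-01`, the channel name `ops-email`, and the "at or above" comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

SEVERITIES = ("info", "warn", "crit")  # severity levels named in the text


@dataclass
class AlertRule:
    """Hypothetical in-memory shape of a rule; not the HostAtlas schema."""
    name: str
    source: str                 # e.g. "server_metric" (label is an assumption)
    target: str                 # the server or monitor the rule points at
    severity: str
    threshold_pct: float
    evaluation_window_s: int
    dedupe_window_s: int
    channels: list[str] = field(default_factory=list)
    mount_point: Optional[str] = None  # only meaningful for disk rules
    enabled: bool = True

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not self.channels:
            raise ValueError("a rule needs at least one notification channel")

    def breached(self, usage_by_mount: dict[str, float]) -> bool:
        """True if the targeted mount point is at or above the threshold.

        Without a mount point, any mount at or above the threshold counts.
        """
        if self.mount_point is None:
            return any(v >= self.threshold_pct for v in usage_by_mount.values())
        return usage_by_mount.get(self.mount_point, 0.0) >= self.threshold_pct


# A rule that targets /var only, as in the example above.
var_rule = AlertRule(
    name="Disk /var 85 %", source="server_metric", target="web-01",
    severity="warn", threshold_pct=85.0, evaluation_window_s=300,
    dedupe_window_s=3600, channels=["ops-email"], mount_point="/var",
)
```

With this shape, `var_rule.breached({"/": 40.0, "/var": 91.0})` is true, while a full `/` next to a healthy `/var` does not trip the rule.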
Events and incidents
An event is a single firing (crossed threshold, missed heartbeat, failed check). Events are grouped in the UI so you see one row per problem, not one row per evaluation cycle.
An incident collects related events for a server into a single record you can:
- Acknowledge — takes ownership and silences reminders while you investigate.
- Resolve — closes the incident and clears the badge on the server row.
- Comment — inline notes appear on the incident timeline.
- Attach a postmortem — once resolved, an incident can carry a structured writeup for retros.
Bulk actions on the Incidents tab let you resolve or delete a filtered selection at once.
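The correlation described above (same server, same severity, same title collapsed into one incident) can be sketched as a simple grouping step. The `Event` class and the `correlate` function are hypothetical names, and the real backend may use additional criteria such as a time window.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    server: str
    severity: str
    title: str
    fired_at: float  # Unix timestamp


def correlate(events: list[Event]) -> dict[tuple[str, str, str], list[Event]]:
    """Group events that share server, severity, and title into one bucket."""
    incidents: dict[tuple[str, str, str], list[Event]] = defaultdict(list)
    # Sort by time so each incident's event list reads as a timeline.
    for event in sorted(events, key=lambda e: e.fired_at):
        incidents[(event.server, event.severity, event.title)].append(event)
    return dict(incidents)


events = [
    Event("web-01", "crit", "Disk /var above 85 %", 100.0),
    Event("web-01", "crit", "Disk /var above 85 %", 115.0),
    Event("db-02", "warn", "Heartbeat missed", 120.0),
]
incidents = correlate(events)
```

Here the two disk events on `web-01` land in one incident and the heartbeat event on `db-02` becomes a second one, which is why a flapping host shows up as a single row.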
Notification channels
Channels are managed under Settings → Notification Channels and picked per rule. Supported types:
- Email — one or many recipients, HTML template.
- Slack — incoming webhook, includes severity colour + Ack link.
- Discord — webhook with embed and severity colour.
- Telegram — bot token + chat ID.
- Generic Webhook — raw JSON POST, HMAC-signed with the channel’s signing secret. Format presets are available for Slack, Discord, Microsoft Teams MessageCards, and PagerDuty Events API v2 (routing key goes in the signing-secret field).
Every delivery attempt lands on the Notifications tab with the receiver’s response — the first thing to check when someone says “I didn’t get the alert.”
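If you receive Generic Webhook deliveries, you will usually want to check the HMAC signature before trusting the payload. The sketch below shows the general technique only. It assumes HMAC-SHA256 with a hex-encoded digest computed over the raw request body; this page does not specify the hash algorithm, the encoding, or the header that carries the signature, so confirm those details before relying on it. The function names and the sample secret are hypothetical.

```python
import hashlib
import hmac


def sign(body: bytes, secret: str) -> str:
    """Hex HMAC-SHA256 of the raw body (algorithm and encoding are assumed)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify(body: bytes, secret: str, received_signature: str) -> bool:
    """Compare the expected and received signatures in constant time."""
    expected = sign(body, secret)
    return hmac.compare_digest(
        expected.encode("utf-8"), received_signature.encode("utf-8")
    )


secret = "example-signing-secret"          # placeholder, not a real secret
body = b'{"severity": "crit", "title": "Disk /var above 85 %"}'
signature = sign(body, secret)             # what a sender would attach
```

Verify against the exact bytes received, before any JSON parsing or re-serialisation, since even a whitespace change produces a different digest.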
How it works
- Metric rules are evaluated on each agent metrics push (typically every 15 seconds). A rule fires when the threshold has been continuously breached for its evaluation window.
- Uptime, heartbeat, backup, and certificate rules are evaluated by the HostAtlas backend on their own cadence.
- Firings are deduplicated — the same event won’t spam you every cycle; you get one notification, then reminders per the rule’s cadence until acknowledged or resolved.
- Maintenance windows suppress notifications for their scope while active. See Maintenance.
- Recovery rules can react to a firing by restarting a service or running a recipe before the operator is paged. See Recovery rules.
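The first and third points above can be sketched as two small state machines: one that fires only after a sustained breach, and one that sends a first notification followed by reminders until someone acknowledges. This is an illustrative model, not the HostAtlas implementation. The class names, the "at or above" comparison, and the fixed reminder interval are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedBreach:
    """Fires once a value has stayed at or above `threshold` for `window_s`."""
    threshold: float
    window_s: float
    breach_started_at: Optional[float] = None

    def observe(self, ts: float, value: float) -> bool:
        if value < self.threshold:
            self.breach_started_at = None  # any healthy sample resets the clock
            return False
        if self.breach_started_at is None:
            self.breach_started_at = ts
        return ts - self.breach_started_at >= self.window_s


@dataclass
class Notifier:
    """Sends the first notification, then a reminder every `reminder_s` seconds."""
    reminder_s: float
    last_sent_at: Optional[float] = None
    acknowledged: bool = False

    def should_notify(self, ts: float, firing: bool) -> bool:
        if not firing:
            # Recovered: forget state so the next firing notifies again.
            self.last_sent_at = None
            self.acknowledged = False
            return False
        if self.acknowledged:
            return False  # acknowledgement silences reminders
        if self.last_sent_at is None or ts - self.last_sent_at >= self.reminder_s:
            self.last_sent_at = ts
            return True
        return False


# Metric pushes every 15 seconds against a 60-second evaluation window.
rule = SustainedBreach(threshold=90.0, window_s=60.0)
notifier = Notifier(reminder_s=30.0)
sent_at = []
for ts, cpu in [(0, 95), (15, 96), (30, 97), (45, 98), (60, 99), (75, 99), (90, 99)]:
    if notifier.should_notify(ts, rule.observe(ts, cpu)):
        sent_at.append(ts)
```

In this run the rule starts firing at the 60-second sample, and notifications go out at 60 and 90 seconds rather than on every push. A single sample below the threshold would reset the evaluation window.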
Related
- Maintenance — silence alerts during planned work.
- Recovery rules — auto-remediate before paging a human.
- Recipes — the scripts recovery rules and automations run.