Last updated July 2, 2026

Recovery rules

Recovery rules let HostAtlas try to fix a known-safe problem before the on-call phone lights up. If nginx dies, restart it. If disk crosses 90 %, run the log-cleanup recipe. If the fix works, the incident closes itself and lands in the audit log; if it doesn’t, the alert escalates normally.

Where to find it

Under Alerts → Recovery in the sidebar. The page lives inside the Automations screen (Recovery is a category of automation), so /recovery-rules redirects to the Monitoring tab of /automations.

What you see

A table of recovery rules, each with:

Trigger — which alert source and target this rule reacts to (for example: “nginx service down on web-01”).
Action — the remediation to attempt.
Enabled toggle.
Last run — timestamp, outcome, and a link into the audit trail.

What you can do

Create a rule and pick:

Trigger — a specific alert rule, or a service on a specific server.
Action from the library:
- Restart service — systemctl restart <name> via the agent (nginx, mysql, postgres, redis, php-fpm, custom).
- Restart container — Docker container by name or ID.
- Run recipe — any Recipe you have defined, with optional parameters.
- Block IP — add a rule to the server firewall (for auto-block on repeated auth failures).
Guardrails:
- Cooldown — minimum interval between attempts, so a broken service does not get restarted 200 times a minute.
- Max attempts per window — after N failures inside the cooldown window, give up and let the alert escalate.
- Only during business hours — optional time gate.
Dry-run mode — the rule matches, the action is logged to the audit trail as if it had run, but nothing is actually executed. Useful for confirming the trigger is scoped correctly before turning it live.

How it works

When a matching alert fires, HostAtlas checks the guardrails. If the cooldown or attempt cap is exceeded, the rule is skipped and the alert follows the normal notification path.
Otherwise the action is dispatched to the agent on the target server. The command runs asynchronously and reports back its exit status.
On success, the underlying alert is auto-resolved and the incident closes with a note pointing at the recovery entry.
On failure, the alert stays open and notifications go out as usual, plus a “recovery attempted, failed” line on the incident timeline.
Every attempt — success, failure, or dry-run — is recorded in the audit log with who, what, when, and the command output.

Alerts & Incidents — the rules that recovery reacts to.
Automations — the broader event-driven automation surface.
Recipes — reusable scripts a recovery rule can invoke.

Was this page helpful?

Report issue ↗