An alert that nobody sees is not an alert. It is just a log entry with better formatting.
The job of downtime alerting is simple: get the right message to the right person quickly enough that they can act before users complain.
The four questions every alerting setup must answer
- Who needs to know?
- Where will they actually notice it?
- When should the system alert versus wait for confirmation?
- What information helps them act immediately?
Pick channels by behavior, not by feature count
Email is good for records. Telegram, Slack, or Discord are often better for immediate visibility. The best channel is the one your operator already checks habitually.
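As a rough illustration of choosing by habit, the sketch below sends each alert to the channel an operator already watches and always adds email as the record. The operator names and the `HABITUAL_CHANNEL` mapping are hypothetical, and no real delivery is attempted here.

```python
# Hypothetical mapping: operator -> the chat channel they already check habitually.
HABITUAL_CHANNEL = {
    "alice": "telegram",
    "bob": "slack",
}


def channels_for(operator: str) -> list[str]:
    """Return delivery channels for one operator, most visible first.

    Email is always included so there is a record; the habitual chat
    channel is added in front when one is known for this operator.
    """
    primary = HABITUAL_CHANNEL.get(operator)
    if primary is None:
        return ["email"]
    return [primary, "email"]
```

An operator with no known habit falls back to email only, which is a prompt to ask them where they actually look.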
Do not alert on every twitch
If every brief, transient failure triggers a notification, people learn to ignore alerts. A confirmation step before paging, such as requiring several consecutive failed checks, can reduce false alarms without hiding real incidents.
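One way to picture a confirmation step is a small gate that stays quiet until a failure has repeated a set number of times in a row. This is a minimal sketch, not a prescribed design: the `ConfirmationGate` class and the threshold of 3 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfirmationGate:
    """Hold back alerts until a failure repeats several times in a row."""

    failures_to_confirm: int = 3  # assumed threshold; tune per service
    consecutive_failures: int = 0
    alerting: bool = False

    def observe(self, check_ok: bool) -> Optional[str]:
        """Feed one check result; return "down", "recovered", or None."""
        if check_ok:
            self.consecutive_failures = 0
            if self.alerting:
                # Only announce recovery if an outage was announced earlier.
                self.alerting = False
                return "recovered"
            return None

        self.consecutive_failures += 1
        if not self.alerting and self.consecutive_failures >= self.failures_to_confirm:
            # Failure is confirmed: alert once, then stay quiet until recovery.
            self.alerting = True
            return "down"
        return None


gate = ConfirmationGate()
results = [True, False, True, False, False, False, False, True]
events = [gate.observe(ok) for ok in results]
# The single failed check early on produces no event; the run of failures
# produces one "down", and the next successful check produces "recovered".
```

The trade-off is detection delay: with a threshold of 3, a real outage is reported two check intervals later than it would be without the gate.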
What a useful alert should include
- Which service failed.
- What changed: down, degraded, or recovered.
- When it happened.
- Enough context to begin diagnosis immediately, such as which check failed and what error it returned.
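The four items above map naturally onto a small record plus a formatter. The sketch below is one possible shape, with field names, message layout, and sample values chosen for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

VALID_STATES = ("down", "degraded", "recovered")


@dataclass
class Alert:
    service: str           # which service failed
    state: str             # what changed: down, degraded, or recovered
    occurred_at: datetime  # when it happened (timezone-aware)
    context: str           # enough detail to begin diagnosis

    def __post_init__(self) -> None:
        if self.state not in VALID_STATES:
            raise ValueError(f"unknown state: {self.state!r}")


def format_alert(alert: Alert) -> str:
    """Render an alert as a short message: headline first, context below."""
    # Normalise to UTC so everyone reads the same timestamp.
    when = alert.occurred_at.astimezone(timezone.utc)
    stamp = when.strftime("%Y-%m-%d %H:%M:%S UTC")
    return f"[{alert.state.upper()}] {alert.service} at {stamp}\n{alert.context}"


# Hypothetical example values.
example = Alert(
    service="checkout-api",
    state="down",
    occurred_at=datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc),
    context="HTTP 503 from /health, 3 consecutive failures",
)
message = format_alert(example)
```

Putting the state and service name first keeps the essentials visible in notification previews, where only the first line is usually shown.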
One rule worth keeping
If the alert does not change what someone does next, it probably should not exist.
Bottom line
Good alerting is not "more notifications". It is faster human response with less noise.