Email - Incident Management

Sample 1: Initial Alert (Sent Immediately When an Incident is Detected)

Subject: URGENT: Service Disruption - Customer Login System
Dear [Stakeholder Name(s)],
We are currently experiencing a major service disruption affecting our customer login system. Customers are unable to access their accounts, and this is impacting our ability to process transactions.
Our engineering team is actively investigating the issue and working to restore service. We will provide an update within the next 30 minutes.
We understand the severity of this issue, and we are working hard to fix it.
Regards,
[Your Name]
[Your Title]

Sample 2: Ongoing Incident Update (Sent During a Prolonged Outage)

Subject: UPDATE: Customer Login System Disruption - Investigation Ongoing
Dear [Stakeholder Name(s)],
This email provides an update on the ongoing disruption to our customer login system.
Current Status: The issue persists. Customers are still unable to access their accounts.
Timeline:
10:00 AM PST: Initial reports of login failures.
10:15 AM PST: Engineering team initiated investigation.
10:30 AM PST: Issue escalated to on-call engineers.
Impact:
All customers are currently unable to log in.
Order processing is impacted.
Root Cause: The root cause is still under investigation.
Next Steps:
The engineering team is continuing to investigate and working on a fix.
We expect to have a fix in place within 2 hours.
We will provide another update within the next hour.
We will keep you informed of our progress.
Regards,
[Your Name]
[Your Title]

Sample 3: Resolution Report (Sent After the Incident is Resolved)

Subject: RESOLVED: Customer Login System Outage - Incident Report
Dear [Stakeholder Name(s)],
The customer login system outage has been resolved. Customers can now log in and access their accounts.
Timeline:
10:00 AM PST: Initial reports of login failures.
10:15 AM PST: Engineering team initiated investigation.
10:30 AM PST: Issue escalated to on-call engineers.
11:00 AM PST: Root cause identified.
1:30 PM PST: Service fully restored.
Root Cause: The outage was caused by a misconfiguration in our load balancer settings following a recent software update.
Impact:
Approximately 100,000 customers were unable to log in for 3.5 hours.
500 orders were not processed.
Estimated revenue loss: $50,000.
Corrective Actions:
Immediate: The configuration error was corrected, and service was restored.
Preventive: We have implemented new automated checks to prevent this type of configuration error. We will also review the change control process to ensure better testing.
Communication Plan:
A banner was displayed on our website to inform customers.
A report will be sent to customers within the next 24 hours.
We will review the root cause analysis in more detail at the next meeting.
Regards,
[Your Name]
[Your Title]
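
The 3.5-hour figure in the Impact section follows from the timeline (10:00 AM to 1:30 PM). A minimal Python sketch for deriving the duration from the timeline entries, assuming both times fall on the same day and in the same time zone; the function name `outage_hours` is hypothetical and not part of any incident tooling:

```python
from datetime import datetime


def outage_hours(start: str, end: str, fmt: str = "%I:%M %p") -> float:
    """Return the elapsed hours between two same-day clock times.

    Assumption: both times are on the same day and in the same time zone,
    so the time zone label (e.g. PST) is left out of the input strings.
    """
    started = datetime.strptime(start, fmt)
    restored = datetime.strptime(end, fmt)
    return (restored - started).total_seconds() / 3600


# First report of login failures to full restoration, taken from the timeline.
print(outage_hours("10:00 AM", "1:30 PM"))  # prints 3.5
```

Computing the figure from the timeline, rather than typing it by hand, helps keep the Impact section consistent with the timestamps reported above it.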

Key Takeaways from These Samples:

Urgency: Lead the subject line with a status prefix such as "URGENT", "UPDATE", or "RESOLVED" so readers can see the state of the incident at a glance.
Conciseness: Get to the point quickly.
Clarity: Use plain language, not technical jargon.
Impact: Clearly state the impact on customers and the business.
Next Steps: Outline what's being done or what will happen next.
Transparency: Be honest about what happened and what's being done.
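
For teams that send these emails from a script, the structure of the samples can be captured in a small template. The following is a rough Python sketch, not a definitive implementation: the class `IncidentUpdate` and its field names are hypothetical, chosen to mirror the subject line and the "Current Status", "Impact", and "Next Steps" sections used in Sample 2.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentUpdate:
    """Hypothetical container for the fields used in the sample emails."""

    status: str          # subject prefix: "URGENT", "UPDATE", or "RESOLVED"
    headline: str        # what is affected
    detail: str          # short qualifier shown after the dash
    current_status: str  # one plain-language sentence on the current state
    impact: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def subject(self) -> str:
        # Same shape as the sample subject lines: "PREFIX: headline - detail".
        return f"{self.status}: {self.headline} - {self.detail}"

    def body(self) -> str:
        # One section per label, each followed by its items, as in Sample 2.
        lines = [f"Current Status: {self.current_status}", "Impact:"]
        lines += self.impact
        lines.append("Next Steps:")
        lines += self.next_steps
        return "\n".join(lines)


update = IncidentUpdate(
    status="UPDATE",
    headline="Customer Login System Disruption",
    detail="Investigation Ongoing",
    current_status="The issue persists.",
    impact=["All customers are currently unable to log in."],
    next_steps=["We will provide another update within the next hour."],
)
print(update.subject())
# prints: UPDATE: Customer Login System Disruption - Investigation Ongoing
print(update.body())
```

Keeping the impact and next steps as required sections of the template makes it harder to send an update that omits them.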
