Assess, Act, Inform, Review

Phase 1: Assess (Understand the Situation)

Identify: What is the issue? What is happening?
Scope: How widespread is the issue? Is it affecting a small group of users, or is it a major outage?
Impact: What is the impact on users, the business, or the system? Is it critical, high, medium, or low?
Severity: Given the scope and impact, how severe is the issue?
Gather Information: What are the key details about the issue?
Resources: What resources (people, tools, systems) are available to address the issue?
Prioritize: Is this the most important thing you and your team should be working on right now?
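
The Scope, Impact, and Severity questions above can be reduced to a small classification rule. The following is a minimal sketch, not part of the framework itself: the names Assessment and classify, and the 50% and 5% thresholds, are illustrative assumptions that each team would replace with its own definitions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Assessment:
    summary: str              # Identify: what is happening
    affected_fraction: float  # Scope: share of users affected, 0.0 to 1.0
    core_function_down: bool  # Impact: is a core function unusable?


def classify(a: Assessment) -> Severity:
    """Map scope and impact to a severity level.

    The thresholds below are assumptions chosen for illustration only.
    """
    if a.core_function_down and a.affected_fraction >= 0.5:
        return Severity.CRITICAL
    if a.core_function_down or a.affected_fraction >= 0.5:
        return Severity.HIGH
    if a.affected_fraction >= 0.05:
        return Severity.MEDIUM
    return Severity.LOW


# Hypothetical example: a core function is down for 10% of users.
print(classify(Assessment("login errors", 0.1, True)).name)  # prints "HIGH"
```

Writing the rule down in advance, in whatever form, means the severity call during an incident is a lookup rather than a debate.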

Phase 2: Act (Take Action to Resolve the Issue)

Contain: Take immediate action to contain the issue and minimize its impact. This could involve:
- Rolling back a recent change.
- Restarting a service.
- Taking a system offline temporarily.
Diagnose: Identify the root cause of the issue.
Resolve: Implement a solution to fix the problem.
Test: Ensure the fix works and doesn't create new problems.
Restore: Bring systems and services back to normal operation.
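
The order of the steps above matters: a fix should pass Test before Restore runs. One way to make that order explicit is a small runner that executes steps in sequence and stops at the first failure. This is only a sketch; run_phase and the step callables are hypothetical names, and real steps would call your own tooling.

```python
from typing import Callable, List, Tuple

# A step is a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_phase(steps: List[Step]) -> List[str]:
    """Run steps in order and return the names of those that succeeded.

    Execution stops at the first step that reports failure, so later
    steps (such as Restore) never run on top of an unverified fix.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break
        completed.append(name)
    return completed
```

If the returned list is shorter than the list of steps, the first missing name is the step that needs attention.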

Phase 3: Inform (Communicate to Stakeholders)

Initial Alert (If Necessary): For significant issues, send a very brief initial alert to leadership stating that an issue exists, that you are working on it, and that more information will follow.
Regular Updates (If Ongoing): If the issue is taking time to resolve, send brief updates at regular intervals (e.g., every 30 minutes, hourly).
Post-Resolution Report: Once the issue is resolved, send a full report that includes:
- What happened.
- The timeline.
- The root cause.
- The impact.
- The corrective actions.
Audience: Who are the key stakeholders that need to be informed?
Channels: How will you inform the key stakeholders, and how often?
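
The five parts of the Post-Resolution Report listed above can be captured in a fixed template so that no part is skipped under time pressure. The sketch below is one possible shape, assuming a plain-text report; the class name PostResolutionReport and its render method are illustrative, not prescribed by the framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PostResolutionReport:
    what_happened: str
    timeline: List[str]            # one entry per event, in order
    root_cause: str
    impact: str
    corrective_actions: List[str]

    def render(self) -> str:
        """Return the report as plain text, one section per report part."""
        lines = [f"What happened: {self.what_happened}", "Timeline:"]
        lines += [f"- {entry}" for entry in self.timeline]
        lines += [
            f"Root cause: {self.root_cause}",
            f"Impact: {self.impact}",
            "Corrective actions:",
        ]
        lines += [f"- {action}" for action in self.corrective_actions]
        return "\n".join(lines)
```

Because every field is required by the dataclass, a report cannot be constructed with the root cause or corrective actions left out.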

Phase 4: Review (Learn and Improve)

Root Cause Analysis: Conduct a more in-depth root cause analysis to understand why the issue occurred.
Lessons Learned: What did you learn from this incident? What went well? What could have gone better?
Preventive Actions: What steps can you take to prevent this type of issue from happening again?
Process Improvement: How can you improve your processes (e.g., testing, monitoring, incident response) to minimize the impact of future issues?
Share Learnings: Share the lessons learned with your team and, if appropriate, with other teams.

Key Principles of This Framework:

Action-Oriented: It prioritizes action over bureaucracy.
Iterative: It recognizes that incident management is an iterative process.
Communication: It emphasizes clear and timely communication.
Continuous Improvement: It emphasizes learning from incidents and making changes to prevent future issues.

Cheat Sheet For Developers