Assess, Act, Inform, Review

Phase 1: Assess (Understand the Situation)

  1. Identify: What is the issue? What is happening?
  2. Scope: How widespread is the issue? Is it affecting a small group of users, or is it a major outage?
  3. Impact: What is the impact on users, the business, or the system? Is it critical, high, medium, or low?
  4. Severity: How severe is the issue?
  5. Gather Information: What are the key details about the issue?
  6. Resources: What resources (people, tools, systems) are available to address the issue?
  7. Prioritize: Is this the most important thing you and your team should be working on right now?
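One way to make the Scope, Impact, and Severity questions concrete is to record the answers and derive a single rating from them. The sketch below is only an illustration: the `Assessment` fields, the four-level ladder taken from the Impact question, and the "more than half of users" threshold are all assumptions, not part of the framework itself.

```python
from dataclasses import dataclass

# Severity ladder taken from the Impact question, least to most severe.
LEVELS = ["low", "medium", "high", "critical"]


@dataclass
class Assessment:
    summary: str               # Identify: what is happening
    users_affected_pct: float  # Scope: share of users affected (0-100)
    impact: str                # Impact: one of LEVELS


def severity(a: Assessment) -> str:
    """Combine scope and impact into one severity rating.

    Assumed rule (hypothetical): start from the impact level and raise it
    one step when more than half of users are affected.
    """
    if a.impact not in LEVELS:
        raise ValueError(f"unknown impact level: {a.impact!r}")
    idx = LEVELS.index(a.impact)
    if a.users_affected_pct > 50:
        idx = min(idx + 1, len(LEVELS) - 1)  # cap at "critical"
    return LEVELS[idx]


if __name__ == "__main__":
    incident = Assessment("Checkout errors", users_affected_pct=60, impact="high")
    print(severity(incident))
```

A team would replace the threshold and the escalation rule with whatever its own severity policy says; the point is only that the assessment answers feed one agreed rating.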

Phase 2: Act (Take Action to Resolve the Issue)

  1. Contain: Take immediate action to contain the issue and minimize its impact. This could involve:
    • Rolling back a recent change.
    • Restarting a service.
    • Taking a system offline temporarily.
  2. Diagnose: Identify the root cause of the issue.
  3. Resolve: Implement a solution to fix the problem.
  4. Test: Ensure the fix works and doesn't create new problems.
  5. Restore: Bring systems and services back to normal operation.

Phase 3: Inform (Communicate to Stakeholders)

  1. Initial Alert (If Necessary): For significant issues, send a very brief initial alert to leadership that an issue exists, you are working on it, and that more information will be coming.
  2. Regular Updates (If Ongoing): If the issue is taking time to resolve, send brief updates at regular intervals (e.g., every 30 minutes, hourly).
  3. Post-Resolution Report: Once the issue is resolved, send a full report that includes:
    • What happened.
    • The timeline.
    • The root cause.
    • The impact.
    • The corrective actions.
  4. Audience: Who are the key stakeholders that need to be informed?
  5. Channels: How will you inform the key stakeholders, and how often?
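The update cadence and the post-resolution report can be sketched in a few lines. This is a minimal illustration under stated assumptions: the names `update_times` and `PostResolutionReport` are hypothetical, the report fields simply mirror the five bullets above, and the plain-text layout is one possible format rather than a prescribed one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


def update_times(start: datetime, interval_minutes: int, count: int) -> List[datetime]:
    """Return the next `count` update times at a fixed interval after `start`."""
    return [start + timedelta(minutes=interval_minutes * i) for i in range(1, count + 1)]


@dataclass
class PostResolutionReport:
    what_happened: str
    timeline: List[str]
    root_cause: str
    impact: str
    corrective_actions: List[str]

    def render(self) -> str:
        """Format the report as plain text, one section per field."""
        lines = ["What happened: " + self.what_happened, "Timeline:"]
        lines += ["  - " + entry for entry in self.timeline]
        lines.append("Root cause: " + self.root_cause)
        lines.append("Impact: " + self.impact)
        lines.append("Corrective actions:")
        lines += ["  - " + action for action in self.corrective_actions]
        return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative values only; not taken from a real incident.
    for t in update_times(datetime(2024, 1, 1, 10, 0), interval_minutes=30, count=3):
        print("Next update due:", t.strftime("%H:%M"))
    report = PostResolutionReport(
        what_happened="Example service returned errors",
        timeline=["10:00 alert received", "10:45 change rolled back"],
        root_cause="Example misconfiguration",
        impact="Some users could not complete requests",
        corrective_actions=["Add a configuration check before deploy"],
    )
    print(report.render())
```

Keeping the report as a fixed structure makes it harder to send one that leaves out the root cause or the corrective actions.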

Phase 4: Review (Learn and Improve)

  1. Root Cause Analysis: Conduct a more in-depth root cause analysis to understand why the issue occurred.
  2. Lessons Learned: What did you learn from this incident? What went well? What could have gone better?
  3. Preventive Actions: What steps can you take to prevent this type of issue from happening again?
  4. Process Improvement: How can you improve your processes (e.g., testing, monitoring, incident response) to minimize the impact of future issues?
  5. Share Learnings: Share the lessons learned with your team and, if appropriate, with other teams.

Key Principles of This Framework:

  • Action-Oriented: It prioritizes action over bureaucracy.
  • Iterative: It treats incident management as a cycle, where the lessons from each review feed into how the next incident is handled.
  • Communication: It emphasizes the importance of clear and timely communication.
  • Continuous Improvement: It emphasizes learning from incidents and making changes to prevent future issues.
