Merikanto

A flute and a sword, my lifelong ambitions; fifteen years of a reckless reputation, all in vain

Incident Management for Operations

This post is a collection of notes from Incident Management for Operations (2017).

Being in operations is like being a goalkeeper on a soccer team.
They only remember the ones that get past you, not all the wonderful saves you made along the way.


1 - Process

  • IRT - Incident Response Team
  • Triage - Sort & assign priority
  • MTTR - Mean Time To Resolution
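As a rough illustration of the MTTR definition above, here is a minimal sketch that averages resolution times. The incident records and the function name `mean_time_to_resolution` are hypothetical, and it assumes the clock starts at detection, which the notes do not specify.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),      # 45 min
    (datetime(2024, 1, 17, 22, 10), datetime(2024, 1, 18, 0, 10)),  # 120 min
    (datetime(2024, 2, 2, 14, 0), datetime(2024, 2, 2, 14, 15)),    # 15 min
]


def mean_time_to_resolution(records):
    """Average of (resolved - detected) across all incidents."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in records), timedelta())
    return total / len(records)


mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```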

IR Evaluation

  • How many incidents occurred per month over the last two years?
  • What reports/data analysis regarding incident response do you have?
  • Are incidents managed and directed in a consistent and efficient manner?
  • Are incident voice communications (e.g., conference bridges) recorded / archived / reviewed?
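The first evaluation question (incidents per month) is easy to answer if incident start dates are exported somewhere. A minimal sketch, assuming a plain list of dates; the data and the name `incidents_per_month` are made up for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical incident start dates, e.g. exported from a ticketing system.
incident_dates = [
    date(2024, 1, 3),
    date(2024, 1, 17),
    date(2024, 2, 2),
    date(2024, 4, 9),
]


def incidents_per_month(dates):
    """Count incidents by (year, month), sorted chronologically."""
    counts = Counter((d.year, d.month) for d in dates)
    return dict(sorted(counts.items()))


monthly = incidents_per_month(incident_dates)
for (year, month), count in monthly.items():
    print(f"{year}-{month:02d}: {count}")
```

Months with no incidents are simply absent from the result; a real report would probably want to fill them in with zeros.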

Time is the most important factor during incident response: Mitigate first, ask questions later.

Focus on fixing the issue first and restoring services.


Assessment

Incident response is a team sport, and leaves no room for ego, attitude, or lack of trust within the team.

1. Predictable

  • Clarity of roles & responsibilities

    If you’re identified as available oncall, it’s not optional.

  • Expected behavior of each person before & after they are assembled for an incident


2. Repeatable

  • Have runbooks & mechanisms for different incident types
  • Respond in the same way every time

3. Optimized

  • Escalation policies: Conditions for escalation
  • Do your resolvers know how to respond & participate, and do they share the same sense of urgency?
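"Conditions for escalation" can be written down explicitly so that every responder applies the same rules. The sketch below is only one possible shape: the severity numbering (1 = most severe), the time thresholds, and the escalation targets are all assumptions, not values from the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    """One condition for escalation. All values are illustrative."""
    sev: int               # applies to incidents at this sev or more severe (lower number)
    unacked_minutes: int   # escalate once the page has gone unacknowledged this long
    escalate_to: str


# Hypothetical policy, checked in order from most to least urgent.
POLICY = [
    EscalationRule(sev=1, unacked_minutes=5, escalate_to="secondary oncall"),
    EscalationRule(sev=2, unacked_minutes=15, escalate_to="secondary oncall"),
    EscalationRule(sev=3, unacked_minutes=60, escalate_to="team lead"),
]


def next_escalation(severity, minutes_unacked, policy=POLICY):
    """Return who to escalate to, or None if no condition is met yet."""
    for rule in policy:
        if severity <= rule.sev and minutes_unacked >= rule.unacked_minutes:
            return rule.escalate_to
    return None


print(next_escalation(severity=1, minutes_unacked=5))   # secondary oncall
print(next_escalation(severity=2, minutes_unacked=10))  # None
```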

4. Clear

  • Anyone in the call knows exactly why they’re being paged, their role in the incident, and when they will be released by the IC

5. Evaluated

  • Conduct AAR regularly and correctly (learning culture to improve operations)

6. Scalable

  • Not dependent on a few people to respond all the time

7. Sustainable

  • Being oncall should be respected in the company, so that it attracts skilled talent


2 - IMS

  • IMS - Incident Management System
  • SitStat - Situational Status (Scribe)

Attitude

  • Responders should be calm & collected, able to think clearly under pressure

    “Fire is not an emergency to the fire department, it’s what we do.”

  • Identify yourself by name & function when you enter the call bridge

  • Dispatch is different from notify: It’s not a request, it’s an order.


Peacetime & Wartime org chart

  • In wartime: Make the best decision in the shortest amount of time

    Note: Aim for the best decision, not merely the quickest one

  • Solving wartime incidents requires a wartime mentality and a supporting org structure

  • Many executives bring peacetime rank to the wartime incident, potentially intimidating other responders.
    Executives may also feel the need to demonstrate leadership, even when they lack IC leadership skills or are not the most technically proficient expert.


IMS

  • Define how resolvers should arrive & interact with IC

  • Any company adopting IMS should conduct IMS training across its workforce

  • Acronyms: Without common terminology, abbreviations / jargon could slow down communications

    Idea: Create acronym / jargon page for Incident Response only.



3 - IC

  • IC - Incident Commander
  • CAN - Conditions, Actions, Needs
  • SA - Situational Awareness

Attention

  • Must arrive at a call bridge with a sense of urgency

    Define clear expectations for being oncall ( “right-now” )

  • IC must be an expert in the process & function of incident command. This is the most important thing.

    Do not act as Tech Leader, even if you have the domain expertise

    Get info from resolvers to understand the issue, and formulate a plan

  • Identify as IC & provide CAN report, when new resolvers join the bridge


Duties

  • Ensure responders / resolvers have clear understanding of the issue, and keep them focused

    Periodic CAN reports

  • Direct incident resolution effort: Build action plan, set clear resolution objectives

    Always have backup plan 📌, also fully discuss tradeoffs.

  • Understand the necessity of high-risk actions, and be ready to build on a failed action

    Command presence: Actions & behaviors should inspire confidence.
    Hesitation from the IC generates doubt, fear & uncertainty.


Planning

  • Regarding the current plan, what are key indicators that the plan is working or not working?
  • What is the trigger point to abandon the initial plan and move to the next one?
  • How will I communicate to the resolvers to abandon the current plan and move to an alternative plan?

CAN report 📌

  • Conditions - current status, Actions - what’s been done, Needs - additional resources / actions needed
  • Distill a message down into its most basic parts, when:
    • New resolvers enter the conversation
    • The IC needs to make periodic briefings / refocus the group (Re-assess the situation)
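The three-part structure of a CAN report maps naturally onto a small record type. The sketch below is a hypothetical representation (the class name `CanReport`, its `render` method, and the example incident are assumptions) that renders the report as three short lines an IC could read out or paste into chat.

```python
from dataclasses import dataclass, field


@dataclass
class CanReport:
    conditions: str                               # current status
    actions: list = field(default_factory=list)   # what's been done
    needs: list = field(default_factory=list)     # additional resources / actions needed

    def render(self):
        """Distill the report into three short lines."""
        return "\n".join([
            f"CONDITIONS: {self.conditions}",
            "ACTIONS: " + ("; ".join(self.actions) or "none yet"),
            "NEEDS: " + ("; ".join(self.needs) or "none"),
        ])


report = CanReport(
    conditions="Checkout API returning errors for some users",
    actions=["Rolled back latest deploy", "Network checking monitoring tools"],
    needs=["Database SME on the bridge"],
)
print(report.render())
```

Keeping the fields this small is deliberate: the point of a CAN report is to brief a newly arrived resolver in seconds, not to replace the timeline.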

Active Listener

  • Listen for tone, inflection, and meaning, not just the actual words that are said.

    Human emotions conveyed in the conversation: confidence, hesitation, fear, uncertainty, etc.

    Those subtle emotions are critical to understanding the meaning behind the words of a solution

  • Listen to how they are saying it. Are they confident, or are they just randomly throwing ideas out?

    Test the certainty & conviction of resolvers. Pick up on the subtle clues.

  • Listen for those who are not contributing, or who are being talked over by stronger voices.


Examples

  • Crisp & direct communication

    “Network, can you go off the bridge, take a look at your monitoring tools, and get back to me in 10 minutes?”

  • Identify resolver by team / function

    “Database, please stand by.”

  • Make assignment: objective, request, action, with specific time frame

  • Planning

    “We will execute plan X. Is anyone aware of any issues with this plan?”

    “While we are waiting for plan X, I want to discuss what our plan Y will be.”

    “Storage, do you support the plan? / Storage, what’s your concern?”

  • Set expectations & urgency: Assigning a time limit to critical tasks will put the recipient under time pressure.

    “I want to hear from everyone, and I’ll take notes on what you come up with. We have 10 minutes, starting now.”


STAR

1. Size up the incident

Situational Awareness

  • Continuous process of human focus & observation of data inputs

  • Make decisions based on verifiable inputs


2. Triage

  • Important for IC to announce current sev level

  • As the sev level increases, so does the involvement & value of the IC

    High sev incidents require strong leadership


3. Action plan

  • Maintain timeline for incident resolution effort

  • Backup plans (Always)

  • Set expectations for desired outcome

    Determine if objections are strong enough to rethink the plan


Confirm support from appropriate resolvers, which literally means the IC asks each one to say out loud,
“I support the plan”. (Reinforces accountability of all resolvers)

  • The IC should get all relevant SMEs to affirmatively support the plan and verbalize the support on the bridge,
    or require the SME to provide a compelling reason for why it is not supported.
  • Critical phase: Pay attention to the tone, inflection & emotions of each resolver’s response.
  • Ask questions: If the IC only asks whether anyone disagrees with a particular plan, polling the resolvers typically takes far less time.

Attention

  • Information is not the same as opinion
  • Ask one question. Get one answer
  • Summarize often (CAN report): Consistent understanding among all participants

4. Review

  • Operational period: frequency for specific actions to be taken

    e.g. CAN report / Comms email every 30 min

  • End state of incident: Return to pre-event conditions, or some adaptive state

    Document & understand all temporary fixes and pulled levers
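The operational period described above (for example, a CAN report or comms email every 30 minutes) amounts to a fixed-interval schedule. A minimal sketch, where the function name `briefing_times` and the start time are assumptions for illustration:

```python
from datetime import datetime, timedelta


def briefing_times(start, period_minutes, count):
    """Return the next `count` briefing times, one per operational period."""
    step = timedelta(minutes=period_minutes)
    return [start + step * i for i in range(1, count + 1)]


incident_start = datetime(2024, 5, 1, 14, 0)
schedule = briefing_times(incident_start, period_minutes=30, count=3)
for t in schedule:
    print(t.strftime("%H:%M"), "CAN report + comms email")
# 14:30, 15:00, 15:30
```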


TIME

What’s a minute worth to your company?

  • Correlate technical issues with business impact (e.g. lost orders)
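One way to put a number on "what's a minute worth" is to multiply order rate, order value, and outage length. This is a deliberately crude sketch: the function name, the parameters, and the example figures are all hypothetical, and it ignores effects such as customers retrying later.

```python
def cost_of_downtime(orders_per_minute, avg_order_value, minutes_down, fraction_lost=1.0):
    """Estimate revenue lost to an outage.

    fraction_lost is the share of orders that fail during the outage
    (1.0 = full outage, 0.25 = a quarter of requests failing).
    """
    return orders_per_minute * avg_order_value * minutes_down * fraction_lost


# Hypothetical numbers: 20 orders/min at 35.0 each, 45-minute full outage.
loss = cost_of_downtime(orders_per_minute=20, avg_order_value=35.0, minutes_down=45)
print(loss)  # 31500.0
```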

1. Tone

  • IC should be conscious of the need to create and maintain a positive and directed work environment
  • Clarity, alignment, attitude:
    Getting the right people to the right place at the right time to make the right decision.

2. Interaction

  • Respectfully, truthfully, ego-free, and with the goal of resolving the incident as a team in the shortest time

  • Unproductive behaviors:

    • Make these observations as part of AAR, in a positive & constructive way
    • Offline conversation with resolvers or their managers
  • Resolver personality type analysis: see Appendix


3. Management

Balance: Resolvers’ personalities & Situational issues (good personalities can also lead to bad situations)

Example situations:

  • Long unproductive conversations
  • Background noise
  • Language & culture challenges
  • Pressure from executives

4. Engagement

False Alarms

To be clear, it’s annoying at times, but it’s the job that comes with accepting the duties of incident response.

An alarm is only a false alarm after the issue has been detected & investigated by the IR team.

  • Dispatch: agreement between the resolvers and the dispatch function that guarantees an incident response

  • If the IC no longer needs the resolver: release them

  • Common errors:

    • Ineffective dispatch procedures
    • Ambiguity about available resources
    • Poor notification technology

Summary

  • IC’s job: bring order & direction to the chaotic nature of an incident

  • STAR & TIME: Important elements of incident response

    Be mindful about human interactions



4 - UC & Scaling Up

  • GL - Group Leader
  • UC - Unified Command
  • OCE - On Call Executives
  • IAP - Incident Action Plan

Span of control

  • 5 - 7 resolvers per person

  • Recognize your limits, and the limits of others

  • Group Leader: group resolvers by function, set rules & timing for resolver engagement
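The span-of-control rule can be applied mechanically: group resolvers by function, and split any group that exceeds the limit so that each Group Leader has a manageable number of people. The sketch below assumes resolvers arrive as (name, function) pairs; the names, the `build_groups` function, and the labelling scheme are hypothetical.

```python
from collections import defaultdict

MAX_SPAN = 7  # upper end of the 5 - 7 range noted above


def build_groups(resolvers, max_span=MAX_SPAN):
    """Group resolvers by function, splitting any group larger than max_span.

    `resolvers` is a list of (name, function) pairs.
    Returns {group_label: [names]}; each group would get its own Group Leader.
    """
    by_function = defaultdict(list)
    for name, function in resolvers:
        by_function[function].append(name)

    groups = {}
    for function, names in by_function.items():
        chunks = [names[i:i + max_span] for i in range(0, len(names), max_span)]
        for n, chunk in enumerate(chunks, start=1):
            # Only number the label when a function had to be split.
            label = function if len(chunks) == 1 else f"{function}-{n}"
            groups[label] = chunk
    return groups


resolvers = [(f"net{i}", "network") for i in range(9)]
resolvers += [("db0", "database"), ("db1", "database")]
groups = build_groups(resolvers)
for label, names in groups.items():
    print(label, len(names))
```

With nine network resolvers, the network function is split into two groups of seven and two, while the two database resolvers stay in one group.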


Transfer of Command

  • IC might be struggling, and a more qualified person is available
  • Handover: provide CAN report to the incoming IC (off the call bridge)

Training

  • Build skills & proficiency: regularly use IMS for less severe incidents (green & yellow box issues)
  • All side communications should be documented in the timeline
  • All resolvers tasked with response should be fully trained prior to taking on the responsibilities

UC & OCE

  • UC makes business impact decisions (strategy), IC drives incident resolution

  • UC: Business impact transcends solving the technical problem

    When each resolution has different business impact: (competing interests at stake)
    Executive engagement is required to make the best business decision

  • Liaison Officer

    • Move between communication channels (UC should be on a separate call bridge from resolvers)
    • Provide updates & briefings on behalf of IC
    • Gather info from those channels to bring back to IC

UC Leader

  • Establish UC call bridge
  • Determine the need for OCE involvement
  • Maintain communications between IC & OCE
  • Set briefing cadence (e.g. CAN report every 30 min)


5 - AAR

  • AAR - After Action Review

  • RCA - Root Cause Analysis

Culture

  • Positive & blameless post-incident evaluation, with honest & in-depth review of IR actions

  • Technology failure is the perfect chance to learn about the operating environment and make improvements

    Foster trust and respect among team members, allow mistakes to be made and learning to take place

  • Look at the contributing factors that may have prevented the resolvers from arriving at a quicker resolution.

  • Debriefing: Description 📌

    Complex events require exploration with an open mind.

    As facilitators of debriefings, if we start questions with why, we typically get an explanation.
    When we start questions with how, what, when, then we get much more data during a debrief.


AAR

  • Technology - What broke (RCA)
  • People - How did people respond to what broke? 📌
  • Successful AAR requires preparation

    • An accurate incident timeline is the basis of AAR
    • Collect resolver rosters & orgs, notes on discussion of possible resolutions
  • AAR is a collection of lessons learned from each incident

    • Help to see a blind spot in the service architecture
    • Perhaps some mistakes were made in detection, which leads to improvement in monitoring
    • Perhaps a junior IC was covering the shift for a senior, and gained valuable experience from the incident
    • Change Management
      For each release / change, there should be an IAP, ready to transition CM to incidents if things go wrong
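Since an accurate incident timeline is the basis of the AAR, it helps to capture entries in a consistent shape during the incident. The sketch below is one hypothetical way a Scribe's log could be modelled; the class names, fields, and example entries are assumptions, not a format the book prescribes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    who: str    # team / function, e.g. "IC", "Network"
    what: str   # action, decision, or side communication


class Timeline:
    def __init__(self):
        self._entries = []

    def record(self, at, who, what):
        self._entries.append(TimelineEntry(at, who, what))

    def ordered(self):
        """Entries sorted by time, regardless of the order they were logged."""
        return sorted(self._entries, key=lambda e: e.at)


timeline = Timeline()
timeline.record(datetime(2024, 5, 1, 14, 20), "IC", "Plan X approved; Storage voiced support")
timeline.record(datetime(2024, 5, 1, 14, 5), "Network", "Reported packet loss on edge routers")
timeline.record(datetime(2024, 5, 1, 14, 0), "Dispatch", "Paged network and database oncall")

for entry in timeline.ordered():
    print(entry.at.strftime("%H:%M"), f"[{entry.who}]", entry.what)
```

Sorting at read time means side communications that are logged late still land in the right place when the timeline is reviewed.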

Evaluation

  • Problem
  • Detection / Monitoring
  • Dispatch
  • Incident response
  • Resolution

Description & Question

1. Problem description

  • What happened?

2. Cause of the incident

  • What caused / contributed to the problem? (Capture what caused a change from uptime to downtime)

3. Timestamp, initial responders & resolvers

  • Were the right people assembled in the right spot, to make the right decisions at the right time?

4. Solution

  • Did the incident responders choose the right solution?
  • Why was a decision made, at the time it was made, with the available info?

5. Localizations & MTTR

  • How long did it take to assemble and solve the problem?

Human Factors

  • Use of IMS framework

  • Focused leadership of IC

    • How did the IR team perform?
  • Process & results of collective problem-solving effort

    • What’s the business impact?
    • What’s the behaviour of those people participating on the call bridge? e.g. Duplication of effort, and no accountability

Talent

Evaluating soft skills
e.g. Not just the Scribe, but also track involvement of IC, TL & resolvers

1. Training

  • Provide company wide training for all resolvers & all teams

2. Accountability

  • Respond with urgency, be accountable for resolution plans

3. Leadership

  • High sev incidents / UC is activated:

    Evaluate if the executives of the company performed well according to their function during the incident

4. Empowerment

  • Joined resolvers should be the right people with the right skills, knowledge, and authority

5. Notification

  • Incidents should be treated as something until proven that they are nothing

6. Trust


Summary

  • To conduct an effective AAR, you first need a team knowledgeable enough to have opinions on the evaluated aspects

  • Always include a person skilled in the IMS framework (i.e. the IC)

  • Key elements of AAR:

    • Incident timeline, relevant communications & participation

    • Foster an open, honest, and blameless culture

    • Identify & implement improvements to the people part of the response as well as the technical problems

      Ensure changes are followed up on and actually get implemented




Appendix

Common resolver personality types and IC tactics


The Awesome Contributor

The perfect resolver.

  • Arrives with a sense of urgency, announces name & function

  • Operationally ready to contribute (e.g. access to monitoring tools)

  • Clear & direct, support decisions with facts

  • Respect IC’s timeline and instructions


The Quiet One

  • Uncomfortable speaking up in a group
  • May be reserved in speech and actions, but can provide invaluable insights and info

IC Tactics

  • Don’t deem them as uninterested / unqualified based on the amount of interaction
  • Ask direct questions by name / function to create entry point for engagement

The Naysayer

  • Often cites past history as justification why a current idea or plan won’t work
  • Finds many reasons why something won’t work but few reasons why it will
  • Often “what ifs” all plans or ideas to the extreme

IC Tactics

  • The Naysayer isn’t always wrong. Beware of discounting the issues they raise, just because they may be difficult to deal with
  • Require specific details / data points when Naysayers throw up obstacles or dissension

The Overbearing One

  • Frequently identified in the peacetime organization as a “know-it-all”
  • May not easily accept being wrong or that others are making contributions that appear to be more useful
  • Puts people on the defensive
  • May be impulsive. Quick to pass judgment and make snap decisions

IC Tactics

  • Stay calm. Don’t engage verbally
  • Beware of very opinionated suggestions that may be more colorful and grandiose than helpful
  • Avoid pointing out directly that the person is wrong
  • May be useful to assign tasks to an Overbearing One that take them off the main communications channel.
    Make the assignment technically challenging but not busy work, and indicate its importance

The Over Explainer

  • Intelligent, competent, confident, and talented
  • Generally doesn’t provide yes or no answers
  • Provides lengthy explanations when asked for solutions

IC Tactics

  • Interrupt if necessary once you have obtained the required info

  • Beware when two Over Explainers are engaged: A journey of unnecessary details

  • Give a timeline prior to asking a question.

    “Can you explain xx to me in a minute or less?”


The Joker

  • Most have high self-esteem and are usually good resolvers
  • Constantly injecting humor or irrelevant comments into the conversation
  • Doesn’t take the urgency seriously

IC Tactics

  • Be direct in reminding the Joker to stay focused
  • The behavior is not toxic, but do not encourage it

The Uncertain Contributor

  • Frequently needs more time to “check one more thing” before taking action

  • The higher the stakes, the higher the level of uncertainty

    Words: “well,” “maybe,” “perhaps,” “it could be,” or “it might.”

IC Tactics

  • Demand specificity and accuracy

  • Phrase questions that only require yes or no answers

  • Rephrase certain questions into a range of certainty

    On a scale of 1 to 5, with 5 being the most certain, what’s the number you’d assign to your suggestion?

  • Get another SME from the same domain to replace or assist


The Gunslinger

  • Talented, knowledgeable, and extremely valuable to the organization. And the Gunslinger knows it
  • Often gets called specifically, because they have built a reputation as key problem-solver and go-to person
  • Drops names, jargon, and obscure facts to solidify their position
  • Gunslingers may be more intimidating than helpful

IC Tactics

  • Prevent the Gunslinger from informally assuming the role of IC

  • Don’t engage technically one on one with the Gunslinger

  • Don’t confuse the confidence of the Gunslinger with the ability to get to the right answer


The Interrupter

  • Routinely cuts off others during conversation
  • Continuously seeks attention. Only wants to talk instead of carrying on a two-way conversation

IC Tactics

  • Ask Interrupters to wait for their turn
  • Need to be firm and assertive with the Interrupter, otherwise IC will get rolled over

The Grenade Thrower

  • May derail a plan or line of thinking after decisions have been made
  • Creates fear, uncertainty, and doubt
  • Plan with “what ifs” to the point that nothing looks like a good idea

IC Tactics

  • Refocus the discussion with a CAN report and stick to the verifiable facts

  • Distinguish between possible and likely. An event may be possible but unlikely.

    On a scale of 0 to 100%, what’s the probability that an asteroid will crash into the earth today and cause massive damage?


The Chicken Little

  • Views every incident as a catastrophe
  • Thinks conservatively when it comes to taking action

IC Tactics

  • Focus on facts and known conditions

The Jumper (to Conclusions)

  • Quick to arrive at a conclusion without fully investigating a situation, idea, thought, etc
  • Likes others to think of them as smart
  • Uses intuition, pattern recognition, past experience
  • Believes the sev of the incident warrants quick decisions

IC Tactics

  • Keep the Jumper focused on fact-based decision-making
  • Remember: Make the best decision in the shortest amount of time, not just a quick decision

The Tunnel Rat

  • Focuses on a single priority
  • Uses strong language of conviction. e.g. “absolutely,” “we must,” “can’t you see it?”
  • Ignores or downplays information that does not support their position

IC Tactics

  • Be factual in briefing
  • Prevent the Tunnel Rat from forming opinions that cannot be verified