This post is a collection of notes from Incident Management for Operations (2017).
Being in operations is like being a goalkeeper on a soccer team.
They only remember the ones that get past you, not all the wonderful saves you made along the way.
1 - Process
- IRT - Incident Response Team
- Triage - Sort & assign priority
- MTTR - Mean Time To Resolution
IR Evaluation
- How many incidents occurred per month over the last two years?
- What reports/data analysis regarding incident response do you have?
- Are incidents managed and directed in a consistent and efficient manner?
- Are incident voice communications (e.g., conference bridges) recorded / archived / reviewed?
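The first evaluation question and the MTTR metric can both be derived from a plain incident log. Below is a minimal sketch, assuming each incident is stored as an (opened, resolved) timestamp pair. The sample records and the function names `incidents_per_month` and `mttr` are illustrative, not from the book.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical incident log: (opened, resolved) timestamps.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),
    (datetime(2024, 1, 20, 22, 10), datetime(2024, 1, 21, 0, 10)),
    (datetime(2024, 2, 7, 14, 0), datetime(2024, 2, 7, 14, 30)),
]


def incidents_per_month(records):
    """Count incidents by the month in which they were opened."""
    return Counter(opened.strftime("%Y-%m") for opened, _ in records)


def mttr(records):
    """Mean Time To Resolution: average of (resolved - opened)."""
    total = sum((resolved - opened for opened, resolved in records), timedelta())
    return total / len(records)


per_month = incidents_per_month(incidents)
mean_resolution = mttr(incidents)
print(dict(per_month), mean_resolution)
```

Tracking these two numbers over the two-year window gives a baseline against which later process changes can be compared.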
Time is the most important factor during incident response: Mitigate first, ask questions later.
Focus on fixing the issue first and restoring services.
Assessment
Incident response is a team sport, and leaves no room for ego, attitude, or lack of trust within the team.
1. Predictable
Clarity of roles & responsibilities
If you’re identified as available oncall, it’s not optional.
Expected behavior of each person before & after they are assembled for an incident
2. Repeatable
- Have runbooks & mechanisms for different incident types
- Respond in the same way every time
3. Optimized
- Escalation policies: Conditions for escalation
- Do your resolvers know how to respond & participate, and do they share the same sense of urgency?
4. Clear
- Anyone in the call knows exactly why they’re being paged, their role in the incident, and when they will be released by the IC
5. Evaluated
- Conduct AAR regularly and correctly (learning culture to improve operations)
6. Scalable
- Not dependent on a few people to respond all the time
7. Sustainable
- Being oncall should be respected in the company, so that it attracts skilled talent
2 - IMS
- IMS - Incident Management System
- SitStat - Situational Status (Scribe)
Attitude
Responders should be calm & collected, able to think clearly under pressure
“Fire is not an emergency to the fire department, it’s what we do.”
Identify yourself by name & function when entering the call bridge
Dispatch is different from notify: It’s not a request, it’s an order.
Peacetime & Wartime org chart
In wartime: Make the best decision in the shortest amount of time
Note: Make the best decision, but not the quickest decision
Solving wartime incidents requires a wartime mentality and a supporting org structure
Many executives bring peacetime rank to the wartime incident, potentially intimidating other responders.
Executives may also feel the need to demonstrate leadership, even when they lack IC leadership skills or are not the most technically proficient expert.
IMS
Define how resolvers should arrive & interact with IC
Any company adopting IMS should conduct IMS training across its workforce
Acronyms: Without common terminology, abbreviations / jargon could slow down communications
Idea: Create acronym / jargon page for Incident Response only.
3 - IC
- IC - Incident Commander
- CAN - Conditions, Actions, Needs
- SA - Situational Awareness
Attention
Must arrive at a call bridge with a sense of urgency
Define clear expectations for being oncall ( “right-now” )
IC must be an expert in the process & function of incident command. This is the most important thing.
Do not act as Tech Leader, even though you have the domain expertise
Get info from resolvers to understand the issue, and formulate a plan
Identify as IC & provide CAN report, when new resolvers join the bridge
Duties
Ensure responders / resolvers have clear understanding of the issue, and keep them focused
Periodical CAN report
Direct incident resolution effort: Build action plan, set clear resolution objectives
Always have backup plan 📌, also fully discuss tradeoffs.
Understand the necessity of high-risk actions, and be ready to build from a failed action
Command presence: Actions & behaviors should inspire confidence.
Hesitation from the IC generates doubt, fear & uncertainty.
Planning
- Regarding the current plan, what are key indicators that the plan is working or not working?
- What is the trigger point to abandon the initial plan and move to the next one?
- How will I communicate to the resolvers to abandon the current plan and move to an alternative plan?
CAN report 📌
- Conditions - current status, Actions - what’s been done, Needs - additional resources / actions needed
- Distill a message down into its most basic parts, when:
- New resolvers enter the conversation
- The IC needs to make periodic briefings / refocus the group (Re-assess the situation)
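A CAN report has a fixed three-part shape, so it lends itself to a small template. The sketch below is one possible representation, assuming a plain-text briefing is the desired output. The class name `CanReport`, its fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CanReport:
    """Conditions, Actions, Needs: the three parts of a briefing."""
    conditions: str                                   # current status
    actions: list[str] = field(default_factory=list)  # what has been done
    needs: list[str] = field(default_factory=list)    # extra resources / actions

    def render(self) -> str:
        """Return the report as three short lines, ready to read aloud."""
        actions = "; ".join(self.actions) or "none yet"
        needs = "; ".join(self.needs) or "none"
        return (
            f"Conditions: {self.conditions}\n"
            f"Actions: {actions}\n"
            f"Needs: {needs}"
        )


report = CanReport(
    conditions="Checkout errors at 40%, sev 2",
    actions=["Rolled back last deploy"],
    needs=["Database on the bridge"],
)
briefing = report.render()
print(briefing)
```

Keeping the structure this rigid forces the message down to its basic parts, which is the point of the report.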
Active Listener
Listen for tone, inflection, and meaning, not just the actual words that are said.
Human emotions conveyed in the conversation: confidence, hesitation, fear, uncertainty, etc.
Those subtle emotions are critical to understanding the meaning behind the words of a solution
How they are saying it. Are they confident or are they just randomly throwing ideas out?
Test the certainty & conviction of resolvers. Pick up on the subtle clues.
Listen for those who are not contributing, or who are being talked over by stronger voices.
Examples
Crisp & direct communication
“Network, can you go off the bridge, take a look at your monitoring tools, and get back to me in 10 minutes?”
Identify resolver by team / function
“Database, please stand by.”
Make assignment: objective, request, action, with specific time frame
Planning
“We will execute plan X. Is anyone aware of any issues with this plan?”
“While we are waiting for plan X, I want to discuss what our plan Y will be.”
“Storage, do you support the plan? / Storage, what’s your concern?”
Set expectations & urgency: Assigning a time limit to critical tasks will put the recipient under time pressure.
“I want to hear from everyone, and I’ll take notes on what you come up with. We have 10 minutes, starting now.”
Star
1. Size up the incident
Situational Awareness
Continuous process of human focus & observation of data inputs
Make decisions based on verifiable inputs
2. Triage
Important for IC to announce current sev level
As the sev level increases, so does the involvement & value of the IC
High sev incidents require strong leadership
3. Action plan
Maintain timeline for incident resolution effort
Backup plans (Always)
Set expectations for desired outcome
Determine if objections are strong enough to rethink the plan
Confirm support from appropriate resolvers, which literally means the IC asks each one to say out loud
“I support the plan“. (Reinforce accountability of all resolvers)
- The IC should get all relevant SMEs to affirmatively support the plan and verbalize the support on the bridge, or require the SME to provide a compelling reason for why it is not supported.
- Critical phase: Pay attention to the tone, inflection & emotions of each resolver’s response.
- Ask questions: If the IC only asks whether anyone disagrees with a particular plan, polling the resolvers typically takes far less time.
Attention
- Information is not the same as opinion
- Ask one question. Get one answer
- Summarize often (CAN report): Consistent understanding among all participants
4. Review
Operational period: frequency for specific actions to be taken
e.g. CAN report / Comms email every 30 min
End state of incident: Return to pre-event conditions, or some adaptive state
Document & understand all temporary fixes and pulled levers
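The operational period mentioned in the Review step amounts to a fixed cadence from the start of the incident. Here is a minimal sketch, assuming the cadence is constant and counted from the incident start; `briefing_times` and the sample start time are hypothetical.

```python
from datetime import datetime, timedelta


def briefing_times(start, period_minutes, count):
    """Return the next `count` briefing times, one per operational period."""
    step = timedelta(minutes=period_minutes)
    return [start + step * i for i in range(1, count + 1)]


incident_start = datetime(2024, 1, 3, 9, 0)
schedule = briefing_times(incident_start, period_minutes=30, count=3)
for due in schedule:
    print(due.strftime("%H:%M"), "CAN report / comms email due")
```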
Time
What’s a minute worth to your company?
- Correlate technical issues with business impact (e.g. lost orders)
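A rough answer to "what's a minute worth" can be estimated from order volume. The sketch below assumes lost orders are the only business impact and that the failure rate is constant for the whole incident. All figures and function names are illustrative.

```python
def lost_orders(orders_per_minute, failure_rate, minutes):
    """Estimate orders lost while a fraction of requests is failing."""
    return orders_per_minute * failure_rate * minutes


def lost_revenue(orders_per_minute, failure_rate, minutes, average_order_value):
    """Translate lost orders into lost revenue."""
    return lost_orders(orders_per_minute, failure_rate, minutes) * average_order_value


# Hypothetical incident: 20 orders/min, half of them failing, for 45 minutes.
orders = lost_orders(20, 0.5, 45)
revenue = lost_revenue(20, 0.5, 45, average_order_value=30.0)
minute_worth = revenue / 45
print(orders, revenue, minute_worth)
```

Even a crude estimate like this helps set the sev level and justify the urgency asked of resolvers.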
1. Tone
- IC should be conscious of the need to create and maintain a positive and directed work environment
- Clarity, alignment, attitude:
Getting the right people to the right place at the right time to make the right decision.
2. Interaction
Respectfully, truthfully, ego-free, and with the goal of resolving the incident as a team in the shortest time
Unproductive behaviors:
- Make these observations as part of AAR, in a positive & constructive way
- Offline conversation with resolvers or their managers
Resolvers personality type analysis: Appendix
3. Management
Balance: Resolvers’ personalities & Situational issues (good personalities can also lead to bad situations)
Example situations:
- Long unproductive conversations
- Background noise
- Language & culture challenges
- Pressure from executives
4. Engagement
False Alarms
To be clear, it’s annoying at times, but it’s the job that comes with accepting the duties of incident response.
An alarm is a false alarm, only after the issue was detected & investigated by the IR team.
Dispatch: agreement between the resolvers and the dispatch function that guarantees an incident response
If the IC no longer needs the resolver: release it
Common errors:
- Ineffective dispatch procedures
- Ambiguity about available resources
- Poor notification technology
Summary
IC’s job: bring order & direction to the chaotic nature of an incident
Star & Time: Important elements of incident response
Be mindful about human interactions
4 - UC & Scaling Up
- GL - Group Leader
- UC - Unified Command
- OCE - On Call Executives
- IAP - Incident Action Plan
Span of control
5 - 7 resolvers per person
Recognize your limits, and the limits of others
Group Leader: group resolvers by functions, set rules & timing for resolvers engagement
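Span of control can be checked mechanically when a bridge grows. The sketch below groups resolvers by function and splits any function larger than the limit, assuming the upper bound of 7 from the notes. The `build_groups` helper and the sample roster are hypothetical.

```python
from collections import defaultdict


def build_groups(resolvers, max_span=7):
    """Group (name, function) pairs by function, at most max_span per group.

    Each returned group would report to one Group Leader.
    """
    by_function = defaultdict(list)
    for name, function in resolvers:
        by_function[function].append(name)

    groups = []
    for function, names in by_function.items():
        # Split oversized functions so no leader exceeds the span of control.
        for i in range(0, len(names), max_span):
            groups.append((function, names[i:i + max_span]))
    return groups


# Hypothetical roster: nine network resolvers and one database resolver.
roster = [(f"net-{i}", "Network") for i in range(9)] + [("db-0", "Database")]
groups = build_groups(roster)
for function, members in groups:
    print(function, len(members))
```

The same limit applies one level up: if the number of groups itself exceeds the span, the IC needs another layer or a narrower set of engaged functions.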
Transfer of Command
- IC might be struggling, and a more qualified person is available
- Handover: provide CAN report to the incoming IC (off the call bridge)
Training
- Build skills & proficiency: regularly use IMS for less severe incidents (green & yellow box issues)
- All side communications should be documented in the timeline
- All resolvers tasked with response should be fully trained prior to taking on the responsibilities
UC & OCE
UC makes business impact decisions (strategy), IC drives incident resolution
UC: Business impact transcends solving the technical problem
When each resolution has different business impact: (competing interests at stake)
Executive engagement is required to make the best business decision
Liaison Officer
- Move between communication channels (UC should be on a separate call bridge from resolvers)
- Provide updates & briefings on behalf of IC
- Gather info from those channels to bring back to IC
UC Leader
- Establish UC call bridge
- Determine the need for OCE involvement
- Maintain communications between IC & OCE
- Set briefing cadence (e.g. CAN report every 30 min)
5 - AAR
AAR - After Action Review
RCA - Root Cause Analysis
Culture
Positive & blameless post-incident evaluation, with honest & in-depth review of IR actions
Technology failure is the perfect chance to learn about operating environment and make improvements
Foster trust and respect among team members, allow mistakes to be made and learning to take place
Look at the contributing factors that may have prevented the resolvers from arriving at a quicker resolution.
Debriefing: Description 📌
Complex events require exploration with an open mind.
As facilitators of debriefings, if we start questions with why, we typically get an explanation.
When we start questions with how, what, when, then we get much more data during a debrief.
AAR
- Technology - What broke (RCA)
- People - How did people respond to what broke? 📌
Successful AAR requires preparation
- An accurate incident timeline is the basis of AAR
- Collect resolver rosters & orgs, notes on discussion of possible resolutions
AAR is a collection of lessons learned from each incident
- Help to see a blind spot in the service architecture
- Perhaps some mistakes were made in detection, which leads to improvement in monitoring
- Perhaps a junior IC was covering the shift for a senior, and gained valuable experience from the incident
- Change Management
For each release / change, there should be an IAP, ready to transition CM to incidents if things go wrong
Evaluation
- Problem
- Detection / Monitoring
- Dispatch
- Incident response
- Resolution
Description | Question |
---|---|
1. Problem description | What happened? |
2. Cause of the incident | What caused/contributed to the problem? (capture what caused a change from uptime to downtime) |
3. Timestamp, initial responders & resolvers | Were the right people assembled in the right spot, to make the right decisions at the right time? |
4. Solution | Did the incident responders choose the right solution? Why was the decision made, at the time it was made, with the available info? |
5. Localizations & MTTR | How long did it take to assemble and solve the problem? |
Human Factors
Use of IMS framework
Focused leadership of IC
- How did the IR team perform?
Process & results of collective problem-solving effort
- What’s the business impact?
- What’s the behaviour of those people participating on the call bridge? e.g. Duplication of effort, and no accountability
Talent
Evaluating soft skills
e.g. Not just the Scribe, but also track involvement of IC, TL & resolvers
1. Training
- Provide company wide training for all resolvers & all teams
2. Accountability
- Respond with urgency, be accountable for resolution plans
3. Leadership
High sev incidents / UC is activated:
Evaluate if the executives of the company performed well according to their function during the incident
4. Empowerment
- Joined resolvers should be the right people with the right skills, knowledge, and authority
5. Notification
- Incidents should be treated as something until proven that they are nothing
6. Trust
Summary
To conduct an effective AAR, you first need a team knowledgeable enough to have opinions on the evaluated aspects
Always include the person skilled in IMS framework (i.e. the IC)
Key elements of AAR:
Incident timeline, relevant communications & participation
Foster an open, honest, and blameless culture
Identify & implement improvements to the people part of the response as well as the technical problems
Ensure changes are followed up on and actually get implemented
Appendix
Common resolver personality types and IC tactics
The Awesome Contributor
The perfect resolver.
Arrive with sense of urgency, announce by name & function
Operationally ready to contribute (e.g. access to monitoring tools)
Clear & direct, support decisions with facts
Respect IC’s timeline and instructions
The Quiet One
- Uncomfortable speaking up in a group
- May be reserved in speech and actions, but can provide invaluable insights and info
IC Tactics
- Don’t deem them as uninterested / unqualified based on the amount of interaction
- Ask direct questions by name / function to create entry point for engagement
The Naysayer
- Often cites past history as justification why a current idea or plan won’t work
- Finds many reasons why something won’t work but few reasons why it will
- Often “what ifs” all plans or ideas to the extreme
IC Tactics
- The Naysayer isn’t always wrong. Beware of discounting the issues they raise, just because they may be difficult to deal with
- Require specific details / data points when Naysayers throw up obstacles or dissension
The Overbearing One
- Frequently identified in the peacetime organization as a “know-it-all”
- May not easily accept being wrong or that others are making contributions that appear to be more useful
- Puts people on the defensive
- May be impulsive. Quick to pass judgment and make snap decisions
IC Tactics
- Stay calm. Don’t engage verbally
- Beware of very opinionated suggestions that may be more colorful and grandiose than helpful
- Avoid pointing out directly that the person is wrong
- May be useful to assign tasks to an Overbearing One that take them off the main communications channel.
Make the assignment technically challenging but not busy work, and indicate its importance
The Over Explainer
- Intelligent, competent, confident, and talented
- Generally doesn’t provide yes or no answers
- Provides lengthy explanations when asked for solutions
IC Tactics
Interrupt if necessary once you have obtained the required info
Beware when two Over Explainers are engaged: A journey of unnecessary details
Give a timeline prior to asking a question.
“Can you explain xx to me in a minute or less?”
The Joker
- Most have high self-esteem and are usually good resolvers
- Constantly injecting humor or irrelevant comments into the conversation
- Doesn’t take the urgency seriously
IC Tactics
- Be direct in reminding the Joker to stay focused
- Even though the behavior is not toxic, do not encourage it
The Uncertain Contributor
Frequently needs more time to “check one more thing” before taking action
The higher the stakes, the higher the level of uncertainty
Words: “well,” “maybe,” “perhaps,” “it could be,” or “it might.”
IC Tactics
Demand specificity and accuracy
Phrase questions that only require yes or no answers
Rephrase certain questions into a range of certainty
On a scale of 1 to 5, with 5 being the most certain, what’s the number you’d assign to your suggestion?
Get another SME from the same domain to replace or assist
The Gunslinger
- Talented, knowledgeable, and extremely valuable to the organization. And the Gunslinger knows it
- Often gets called specifically, because they have built a reputation as key problem-solver and go-to person
- Drop names, jargon, and obscure facts to solidify their position
- Gunslingers may be more intimidating than helpful
IC Tactics
Prevent the Gunslinger from informally assuming the role of IC
Don’t engage technically one on one with the Gunslinger
Don’t confuse the confidence of the Gunslinger with the ability to get to the right answer
The Interrupter
- Routinely cuts off others during conversation
- Continuously seeks attention. Only wants to talk instead of carrying on a two-way conversation
IC Tactics
- Ask Interrupters to wait for their turn
- Need to be firm and assertive with the Interrupter, otherwise IC will get rolled over
The Grenade Thrower
- May derail a plan or line of thinking after decisions have been made
- Creates fear, uncertainty, and doubt
- Plan with “what ifs” to the point that nothing looks like a good idea
IC Tactics
Refocus the discussion with a CAN report and stick to the verifiable facts
Distinguish between possible and likely. An event may be possible but unlikely.
On a scale of 0 to 100%, what’s the probability that an asteroid will crash into the earth today and cause massive damage?
The Chicken Little
- Views every incident as a catastrophe
- Thinks conservatively when it comes to taking action
IC Tactics
- Focus on facts and known conditions
The Jumper (to Conclusions)
- Quick to arrive at a conclusion without fully investigating a situation, idea, thought, etc
- Likes others to think of them as smart
- Use intuition, pattern recognition, past experience
- Believe sev of incident warrants quick decisions
IC Tactics
- Keep the Jumper focused on fact-based decision-making
- Remember: Make the best decision in the shortest amount of time, not to just make quick decisions
The Tunnel Rat
- Focus on a single priority
- Use strong language of conviction. e.g. “absolutely,” “we must,” “can’t you see it?”
- Ignore or downplay information that does not support their position
IC Tactics
- Be factual in briefing
- Prevent the Tunnel Rat from forming opinions that cannot be verified