
Black Swan Scale checklist

Black Swan Scale

An event that comes as a surprise, has a major effect, and is often inappropriately rationalised after the fact with the benefit of hindsight.
…we’ve had enough of these that we can show at least some foresight in providing a basic checklist for discussion.

Regular scale discussions are tracked in meetings with Infrastructure, Operations, Services and leadership teams every two weeks.

Major Scale Event Checklist

Initiator

  • What source of information is driving this event, and what is the context?
  • What is the estimated time frame?
  • Who is running the incident watch?
  • Will this involve external teams?
    • Corporate Tech Services
    • Other BUs

All

What staff constraints do we expect? e.g. PTO, religious observances, or other known unavailability.
What failures were experienced during the last event?

Operations

Who are your stakeholder participants?
Launch info: pending launches

  • How many users will be added?
  • Do we need to communicate with the customer to offer to delay the launch?
  • Customer communication
    • Banner on login?
    • Marketing ad inside the app to alert users to details

Operations/Infrastructure

What external changes, maintenance, or DR work is planned that conflicts with the event?

  • Corporate changes: can we ask for a change minimization window?
    • Who will communicate with the CIO and directors in Corporate Tech?
  • What other business units have major work being executed in the affected time frame?
  • Are there DR efforts for core providers that we need to check in on or ask to delay?

Infrastructure

Who are your stakeholder participants?

Datastores

  • What database constraints are known or ongoing issues?
  • What tuning has happened recently?
  • Are there any edge cases that need to be isolated from other services?

Observability

  • Temperature check
  • Can our logging keep up?
  • Who needs access?
  • Mission-critical logging: can/should other logging be reduced? (see the sketch below)
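
One possible way to approach the "reduce other logging" item above: a minimal sketch, assuming Python services that use the standard logging module (the logger names below are hypothetical examples, not our real loggers). Noisy loggers get raised to WARNING for the event window while mission-critical loggers are left alone.

```python
import logging

# Loggers we treat as mission critical stay at INFO (hypothetical names).
CRITICAL_LOGGERS = ["auth", "payments", "sync"]

# Noisy, non-essential loggers get raised to WARNING for the event window.
NOISY_LOGGERS = ["request.trace", "cache.debug", "metrics.verbose"]

def reduce_logging_for_event():
    for name in CRITICAL_LOGGERS:
        logging.getLogger(name).setLevel(logging.INFO)
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.WARNING)

def restore_logging_after_event():
    # Revert noisy loggers to their normal level once the event is over.
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.INFO)
```

The same idea applies whatever the logging stack actually is: keep the critical channels, dial the rest up a level for the duration of the event.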

Load Balancing

  • Temperature check
  • Hardware failures or pending vendor maintenance?

System Load

  • Any constrained systems?

Network Load

  • Any constrained network links?
    • jConnect - client connectivity
    • Azure to JH data centers
    • Firewall device interfaces

Service Engineers

Who are your stakeholder participants?

  • Internal Service check-in
    • What services are currently underperforming or in duress?
    • What teams are under duress?
  • What internal changes need to be put on hold?
    • Development branches not prepared to hotfix should get to a stable, deployable state.
    • Does the release need to pause for data services or mobile clients?
    • Are there high-visibility development efforts that need to be put on hold for the release?
  • What batch or caching jobs could be paused or run less frequently?
    • When should they be put on hold?
  • What is our highest peak to date? e.g. max users, active concurrent syncs
    • Can we identify how many unique syncs we had during the last event?
  • What rate limits are in place? (see the first sketch after this list)
    • Can those limits be increased immediately or will it require care and feeding?
    • Can we accurately estimate how adding users correlates to increased load?
  • Service integrations with other teams, e.g. IPay, JX, NTBSL, Symx
    • What circuit breakers are an option? (see the second sketch after this list)
    • What is required to implement the circuit breaker?
    • Can we isolate or stand up parallel MDS for Symx and JX to reduce the blast radius of an MDS failure?
  • Vendor Concerns
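
For the rate-limit questions above, a minimal token-bucket sketch, assuming limits are enforced in application code (the class name, rates, and usage lines are hypothetical). The point it illustrates: a limit held in one place like this can be raised immediately; limits baked into many services or vendor gateways need the "care and feeding" noted above.

```python
import time

class TokenBucket:
    """Simple token-bucket limiter; the rate can be raised on the fly during the event."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical usage: raise limits immediately for the event window.
# sync_limiter = TokenBucket(rate_per_sec=200, burst=400)
# sync_limiter.rate, sync_limiter.burst = 400, 800  # "increase immediately"
```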
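
For the circuit-breaker questions, a minimal sketch of what an implementation requires, assuming a simple in-process wrapper around calls to an integration such as IPay or Symx (the class name, thresholds, and usage lines are hypothetical): count consecutive failures, open after a threshold, fail fast while open, and allow a trial call after a cooldown.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, max_failures=5, reset_seconds=30):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # While open, fail fast until the cooldown has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open; skipping downstream call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Hypothetical usage: wrap calls to a downstream integration.
# ipay_breaker = CircuitBreaker(max_failures=5, reset_seconds=30)
# response = ipay_breaker.call(ipay_client.post_transaction, payload)
```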

Open Mic

How can we reduce risk in any area and at any level?
Anything else…anything…Bueller