
Black Swan Scale checklist

Black Swan Scale

An event that comes as a surprise, has a major effect, and is often inappropriately rationalised after the fact with the benefit of hindsight.
…we’ve had enough of these that we can show at least some foresight in providing a basic checklist for discussion.

Regular scale discussions are tracked in meetings with Infrastructure, Operations, Services and leadership teams every two weeks.

Major Scale Event Checklist

Initiator

  • What source of information is driving this event, and what is the context?
  • What is the estimated time frame?
  • Who is running the incident watch?
  • Will this involve external teams?
    • Corporate Tech Services
    • Other BUs

All

What staff constraints do we expect? e.g. PTO, religious observances, or other known unavailability.
What failures were experienced during the last event?

Operations

Who are your stakeholder participants?
Launch info: pending launches

  • How many users will be added?
  • Do we need to communicate with the customer to offer to delay the launch?
  • Customer communication
    • Banner on login?
    • Marketing ad inside the app to alert users to details

Operations/Infrastructure

What external changes, maintenance, or DR work is planned that conflicts with the event?

  • Corporate changes: can we ask for a change minimization window?
    • Who will communicate with the CIO and directors in Corporate Tech?
  • What other business units have major work being executed in the affected time frame?
  • Are there DR efforts for core providers that we need to check in on or ask to delay?

Infrastructure

Who are your stakeholder participants?

Datastores

  • What database constraints are known or ongoing issues?
  • What tuning has happened recently?
  • Are there any edge cases that need to be isolated from other services?

Observability

  • Temperature check
  • Can our logging keep up?
  • Who needs access?
  • Mission-critical logging: can/should other logging be reduced? (see the sketch below)
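
One possible way to approach the "reduce other logging" item above: a minimal sketch, assuming Python services that use the standard logging module (the logger names below are hypothetical examples, not our real loggers). Noisy loggers get raised to WARNING for the event window while mission-critical loggers are left alone.

```python
import logging

# Loggers we treat as mission critical stay at INFO (hypothetical names).
CRITICAL_LOGGERS = ["auth", "payments", "sync"]

# Noisy, non-essential loggers get raised to WARNING for the event window.
NOISY_LOGGERS = ["request.trace", "cache.debug", "metrics.verbose"]

def reduce_logging_for_event():
    for name in CRITICAL_LOGGERS:
        logging.getLogger(name).setLevel(logging.INFO)
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.WARNING)

def restore_logging_after_event():
    # Revert noisy loggers to their normal level once the event is over.
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.INFO)
```

The same idea applies whatever the logging stack actually is: keep the critical channels, dial the rest up a level for the duration of the event.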

Load Balancing

  • Temperature check
  • Hardware failures or pending vendor maintenance?

System Load

  • Any constrained systems?

Network Load

  • Any constrained network links?
    • jConnect - client connectivity
    • Azure to JH data centers
    • Firewall device interfaces

Service Engineers

Who are your stakeholder participants?

  • Internal Service check-in
    • What services are currently underperforming or in duress?
    • What teams are under duress?
  • What internal changes need to be put on hold?
    • Development branches not prepared to hotfix should get to a stable, deployable state.
    • Does the release need to pause for data services or mobile clients?
    • Are there high-visibility development efforts that need to be put on hold for the release?
  • What batch or caching jobs could be paused or run less frequently?
    • When should they be put on hold?
  • What is our highest peak to date? e.g. max users, active concurrent syncs
    • Can we identify how many unique syncs we had during the last event?
  • What rate limits are in place? (see the first sketch after this list)
    • Can those limits be increased immediately or will it require care and feeding?
    • Can we accurately estimate how adding users correlates to increased load?
  • Service integrations with other teams, e.g. IPay, JX, NTBSL, Symx
    • What circuit breakers are an option? (see the second sketch after this list)
    • What is required to implement the circuit breaker?
    • Can we isolate or stand up parallel MDS for Symx and JX to reduce the blast radius of an MDS failure?
  • Vendor Concerns
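
For the rate-limit questions above, a minimal token-bucket sketch, assuming limits are enforced in application code (the class name, rates, and usage lines are hypothetical). The point it illustrates: a limit held in one place like this can be raised immediately; limits baked into many services or vendor gateways need the "care and feeding" noted above.

```python
import time

class TokenBucket:
    """Simple token-bucket limiter; the rate can be raised on the fly during the event."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical usage: raise limits immediately for the event window.
# sync_limiter = TokenBucket(rate_per_sec=200, burst=400)
# sync_limiter.rate, sync_limiter.burst = 400, 800  # "increase immediately"
```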
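
For the circuit-breaker questions, a minimal sketch of what an implementation requires, assuming a simple in-process wrapper around calls to an integration such as IPay or Symx (the class name, thresholds, and usage lines are hypothetical): count consecutive failures, open after a threshold, fail fast while open, and allow a trial call after a cooldown.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, max_failures=5, reset_seconds=30):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # While open, fail fast until the cooldown has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open; skipping downstream call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Hypothetical usage: wrap calls to a downstream integration.
# ipay_breaker = CircuitBreaker(max_failures=5, reset_seconds=30)
# response = ipay_breaker.call(ipay_client.post_transaction, payload)
```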

Open Mic

How can we reduce risk in any area and at any level?
Anything else…anything…Bueller