On-call
The members of Dreamwork rotate through an on-call schedule. Everyone is on the calendar in some role: primary, secondary, tertiary, and so on. On-call shifts last one week and change over on Tuesdays. Primary and secondary firefighters are generally kept off mission-critical or time-sensitive feature work so they can focus on customer issues and maintenance tasks.
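As a rough illustration of the weekly rotation, here is a minimal sketch in Python. The roster names, the `epoch` date, the `backup-N` labels, and the rule that each person moves up one role per week are all assumptions made for the example; the real schedule lives in PagerDuty.

```python
from datetime import date, timedelta

# Hypothetical roster; the real order is whatever PagerDuty says.
MEMBERS = ["ana", "ben", "chris", "dee"]
ROLE_NAMES = ["primary", "secondary", "tertiary"]
TUESDAY = 1  # date.weekday() counts Monday as 0


def shift_start(day: date) -> date:
    """Return the Tuesday on which the shift covering `day` began."""
    return day - timedelta(days=(day.weekday() - TUESDAY) % 7)


def roles_for(day: date, epoch: date = date(2024, 1, 2)) -> dict[str, str]:
    """Map each member to a role for the shift covering `day`.

    `epoch` is an assumed Tuesday on which MEMBERS[0] was primary.
    Each week everyone moves up one slot, so last week's secondary
    becomes this week's primary.
    """
    weeks = (shift_start(day) - epoch).days // 7
    roles = {}
    for offset in range(len(MEMBERS)):
        member = MEMBERS[(weeks + offset) % len(MEMBERS)]
        if offset < len(ROLE_NAMES):
            roles[member] = ROLE_NAMES[offset]
        else:
            # Assumed label for roles past tertiary ("and so on").
            roles[member] = f"backup-{offset + 1}"
    return roles
```

Calling `roles_for(date.today())` returns the assumed role of every member for the current shift; a Monday still belongs to the shift that started the previous Tuesday.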
Incidents
An incident is service downtime, degraded service, or anything else that negatively impacts our services' uptime or the customer experience, whether for a single customer or many. If you notice a problem, post a message in #org-reliability-ff and start an incident. It’s better to be overly cautious. Here is the page that describes how to start an incident.
Starting a rotation
On the day before your rotation (currently Monday), find a clean stopping point for your current work and/or transition the rest to another team member. On Tuesday, the current primary on-call person and the next in line meet to hand off customer issues and preserve context. This typically happens after the daily standup.
Responsibilities
Primary
- Provide primary on-call support through PagerDuty
- Coordinate with other teams to help resolve issues and help provide context
- Monitor these Slack channels:
  - Cross-team production incidents: #org-reliability-ff
  - Deploys: #org-deployments
  - New customer onboarding: #war-room-go-live
- Triage customer issues
- Review Customer Issues at least every 8 business hours, in accordance with the Engineering Triage Process
- Primarily work on active customer issues.
- When not doing the things above, pick a ticket from our project that is not tied to scheduled project work. Focus on issues that will improve your on-call situation or that are easy to drop if a customer issue or a page comes up. It is also fine to pick up work you look forward to doing, to balance out the on-call stress.
Secondary
- Provide backup on-call support through PagerDuty
- Provide backup if primary needs help
- Primarily work on maintenance and documentation issues.
- Respond to Slack questions and requests for help (pings for @dreamwork or messages in #org-dreamwork) so other team members can stay heads down
  - This also helps us spread knowledge around the team by taking turns answering questions and learning about areas of our domain that you may not be familiar with.
- Moderate team meetings (standup, issue sweep, retro) during the week.
- Try to stay current on the larger issues as well as help review PRs that the primary creates.
- May still work on project work, but must switch if the primary needs help
Tertiary and beyond
- Primarily work on project/feature work or technical debt, as decided at the weekly issue sweep.
- Provide backup on-call support through PagerDuty
Incident Analysis
- General incident analysis can be found here: https://github.com/Banno/incident-analysis
- We will list and link to our team's incident analyses here as we get those started.
- Analysis is required for:
  - anything that needs a roll back/forward (hotfix/quick fix)
    - Exception: Check and Documentation services need to be deployed to production for testing. If they encounter an issue and are rolled back during that testing phase, they will not require an incident analysis.
  - incidents we start or get pulled into (that we do work for)
  - incidents where Support updates our external status page
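The criteria above can be read as a simple decision rule. The sketch below is one possible encoding, not an official tool: the `Event` record and its field names are hypothetical, and it assumes the Check/Documentation exception covers only rollbacks during the production testing phase and does not override the other two criteria.

```python
from dataclasses import dataclass

# Services that must be deployed to production for testing (per the exception above).
TESTED_IN_PROD = {"check", "documentation"}


@dataclass
class Event:
    """Hypothetical record of something that happened in production."""
    service: str = ""
    rolled_back: bool = False
    rolled_forward: bool = False          # hotfix / quick fix
    during_prod_testing: bool = False     # happened in the testing phase
    team_did_incident_work: bool = False  # incident we started or were pulled into
    status_page_updated: bool = False     # Support updated the external status page


def needs_analysis(e: Event) -> bool:
    """Return True if the event calls for an incident analysis."""
    if e.team_did_incident_work or e.status_page_updated:
        return True
    if e.rolled_forward:
        return True
    if e.rolled_back:
        # Assumed scope of the exception: rollbacks of Check or
        # Documentation while they are being tested in production.
        exempt = e.during_prod_testing and e.service.lower() in TESTED_IN_PROD
        return not exempt
    return False
```

For example, `needs_analysis(Event(service="Check", rolled_back=True, during_prod_testing=True))` is false under these assumptions, while the same rollback outside the testing phase would need an analysis.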