Pupper escalation policy can be found here
Incidents
An incident is downtime of a service, degraded service, or anything that negatively impacts our services’ uptime or customer experience, whether for a single customer or many. If you notice a problem, post a message in #org-reliability-ff and start an incident. It’s better to be overly cautious. Here is the page that describes how to start an incident.
Issue flow
- The open customer issues with our pupper component can be found here
- Quick, consistent, and clear communication with support is required
- In many instances we will need to work well with other teams and analysts to debug and/or resolve an issue
- Team ownership can be found here
- Analysts can be reached in the #org-analyst Slack room.
- Don’t be shy to pull in subject matter experts or leadership to an issue as needed
Runbook
All symxchange requests failing for a specific institution
- Communicate with support-ff about the issue.*
- Verify the behavior in Kibana by looking at the institution’s symxchange interaction logs.
- Capture the root request error (e.g., connection timeout, connection refused, other).
- Determine whether the issue is on Banno’s side or whether a jsource case needs to be opened.
* Page support if there is no response, via the Slack command /pd-support <message>
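When capturing the root request error, it can help to bucket the error messages copied out of Kibana into the three categories named above. The following is only a minimal sketch: the message patterns are assumptions about what the symxchange interaction logs might contain, and the function names are hypothetical, so adjust them to the real log text.

```python
import re
from collections import Counter

# Hypothetical patterns: the real symxchange log messages may be worded differently.
_PATTERNS = [
    ("connection timeout", re.compile(r"timed? ?out", re.IGNORECASE)),
    ("connection refused", re.compile(r"connection refused", re.IGNORECASE)),
]


def classify_root_error(message: str) -> str:
    """Return the runbook category for a single log message."""
    for label, pattern in _PATTERNS:
        if pattern.search(message):
            return label
    return "other"


def summarize(messages):
    """Count how many messages fall into each category."""
    return dict(Counter(classify_root_error(m) for m in messages))
```

If nearly every message for the institution lands in one bucket, that bucket is a reasonable candidate for the root error to report to support-ff.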
Restarting Symxchange servers
Restarting the Symxchange servers is done through Jenkins. Currently the services are running in Marathon.
- Use this link to open Jenkins
- Select “Build with Parameters” on the left column
- Enter “symxchange-http-server”
- Press the “Build” button
- Repeat for “symxchange-rpc-server”
Once the services move to Kubernetes, you will use this link for Jenkins
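The manual steps above could in principle also be scripted against Jenkins' remote access API, which accepts a POST to a job's `buildWithParameters` endpoint. The sketch below only builds the trigger URLs in restart order. The Jenkins base URL, the job name, and the parameter name are all hypothetical placeholders, since this page does not state them; read the real values from the job's "Build with Parameters" page.

```python
from urllib.parse import quote, urlencode

# Services to restart, in the order given in the steps above.
RESTART_ORDER = ["symxchange-http-server", "symxchange-rpc-server"]


def build_trigger_url(jenkins_base: str, job_name: str, params: dict) -> str:
    """Build the buildWithParameters URL for a parameterized Jenkins job."""
    base = jenkins_base.rstrip("/")
    return f"{base}/job/{quote(job_name, safe='')}/buildWithParameters?{urlencode(params)}"


def restart_urls(jenkins_base: str, job_name: str, param_name: str) -> list:
    """Return one trigger URL per service. All three arguments are assumptions."""
    return [
        build_trigger_url(jenkins_base, job_name, {param_name: service})
        for service in RESTART_ORDER
    ]


if __name__ == "__main__":
    # Placeholder values only; nothing is sent to Jenkins here.
    for url in restart_urls("https://jenkins.example.com", "restart-service", "SERVICE"):
        print(url)
```

Actually triggering a build would require an authenticated POST (for example with a Jenkins user and API token) to each URL, which is deliberately left out here.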
Tooling
Most tooling links can be found here: https://docs.banno.com/infrastructure/urls/
- Kibana
- https://www.elastic.co/products/kibana
- Our current primary use case for Kibana is to view logging.
- Tech talk: https://drive.google.com/file/d/1vkIW9PYZaJuhFjFWVrGmuzTlUb0KGyxt/view?usp=sharing
- Grafana
- https://grafana.com/
- Grafana allows us to visualize our metrics.
- (hint: take off the "-lks" in the URL from infra urls for the new Grafana)
- Marathon
- https://mesosphere.github.io/marathon/
- Marathon allows us to visualize our running instances and configurations for our services
- Mesos
- http://mesos.apache.org/
- A link to Mesos for each service is available in Marathon in the Debug section with a Sandbox that contains the stderr and stdout logs.
- Prometheus
- https://prometheus.io/
- https://infra-az-centralus.banno-production.com/prometheus/ (not in infrastructure urls page)
- Prometheus is used for metrics and alerting.
- Data Services/Fetch Reporting
- This gives us a UI for making queries to aid in debugging customer issues and to look up general data while testing, etc.
- Team Dreamwork owns this service.
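For the Prometheus entry above, a quick check can also be made against Prometheus' HTTP API rather than the web UI. This is a minimal sketch that assumes the API is served under the same `/prometheus` path prefix as the UI link listed above and that the caller is on the VPN. The query `up` is a metric Prometheus generates for every scrape target; any team-specific metric names would need to be looked up.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# UI link from the Tooling list; the API living under the same prefix is an assumption.
PROMETHEUS_BASE = "https://infra-az-centralus.banno-production.com/prometheus"


def instant_query_url(promql: str, base: str = PROMETHEUS_BASE) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def run_instant_query(promql: str, base: str = PROMETHEUS_BASE, timeout: float = 10.0):
    """Run the query and return the result list, raising if Prometheus reports an error."""
    with urlopen(instant_query_url(promql, base), timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body.get('error')}")
    return body["data"]["result"]
```

Calling `run_instant_query("up")` would return one entry per scrape target, each with its labels and current value, provided the endpoint is reachable and needs no extra authentication.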
Responsibilities
Primary - only doing FF work (no feature/project work)
- Acknowledge and resolve pagerduty alerts
- Page other teams when necessary (via Slack /pd-* or the PagerDuty app)
- Check for incoming customer issues at least once a day
- Triage Jira issues: https://banno-jha.atlassian.net/issues/?filter=10843
- see https://docs.banno.com/operations/customer-issues/#engineering-triage-process for more details
- Work on any customer issue related development/fixes, as needed
- Work on any incident related development/fixes, as needed
- Address slack messages to @pupper-ff group
- Address jira messages to @pupper & @pupper-ff
- Monitor #org-pupper room for triage
- Monitor #war-room-go-live for triage
- Monitor #auto-pupper-alerts for alerts
- Monitor #prod-people-reports for reports and mobile-admin triage
- Monitor graphs periodically and after deploys
- Work on non-feature related needs (such as logging, alerts, metrics, outside requests - infra/security, etc), as time allows
Secondary
- Acts as a safety net
- Primary reaches out to secondary when help is needed (whether primary is underwater or will be unavailable for a period of time)
- Update Pagerduty as needed when swaps occur
Requirements
- Pagerduty account
- VPN access
- Be a member of the Banno organization and the Pupper team in GitHub
- General knowledge of how to use Kibana, Marathon, Grafana, Mesos, Prometheus, Data Services Reporting
Incident Analysis
- General incident analysis can be found here: https://github.com/Banno/incident-analysis
- We will list/link to our incident analysis for our team as we get those started.
- We write an incident analysis for:
- anything that needs a roll back/forward (hotfix/quick fix)
- incidents we start or get pulled into (that we do work for)
- incidents where Support updates our external status page