Pupper Firefighter · Banno Docs

Pupper escalation policy can be found here

Incidents

Incidents are downtime of a service, degraded service, or anything that negatively impacts our services' uptime or customer experience, whether for a single customer or many. If you notice a problem, post a message in #org-reliability-ff and start an incident. It's better to be overly cautious. Here is the page that describes how to start an incident.

Issue flow

The open customer issues with our pupper component can be found here
Quick, consistent, and clear communication with support is required
In many instances we will need to work well with other teams and analysts to debug and/or resolve an issue
Team ownership can be found here
Analysts can be reached in the #org-analyst Slack room
Don't hesitate to pull subject matter experts or leadership into an issue as needed

Runbook

All symxchange requests failing for specific institution

Communicate with support-ff about the issue.*
Verify the behavior in Kibana by looking at the institution's symxchange interaction logs.
Capture the root request error (e.g., connection timeout, connection refused, or other).
Determine whether the issue is on Banno's side or whether a jsource case needs to be opened.

* Page support if there is no response, using the Slack command /pd-support *message*
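
When capturing the root request error, it can help to bucket the raw messages copied from Kibana into the categories named in the steps above. The following is a minimal Python sketch, not an existing team tool; the match patterns and the sample messages are assumptions and should be adjusted to the wording that actually appears in the symxchange interaction logs.

```python
import re
from collections import Counter

# Hypothetical patterns: adjust to the error text that really shows up in the logs.
# Order matters; the first matching pattern wins.
ERROR_PATTERNS = [
    ("timeout", re.compile(r"timed? ?out", re.IGNORECASE)),
    ("connection refused", re.compile(r"connection refused", re.IGNORECASE)),
]


def classify_error(message: str) -> str:
    """Return a coarse category for one raw error message."""
    for label, pattern in ERROR_PATTERNS:
        if pattern.search(message):
            return label
    return "other"


def summarize(messages) -> Counter:
    """Count how many messages fall into each category."""
    return Counter(classify_error(m) for m in messages)


if __name__ == "__main__":
    # Sample messages are made up for illustration only.
    sample = [
        "java.net.ConnectException: Connection refused",
        "Connect timed out after 5000ms",
        "HTTP 500 Internal Server Error",
    ]
    for label, count in summarize(sample).items():
        print(f"{label}: {count}")
```

A summary dominated by one category is a quick hint about where to look next, but it does not replace reading the individual log entries.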

Restarting Symxchange servers

Restarting the Symxchange servers is done through Jenkins. Currently the services are running in Marathon.

Use this link to open Jenkins
Select “Build with Parameters” on the left column
Enter “symxchange-http-server”
Press the “Build” button
Repeat for “symxchange-rpc-server”
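
The same restart could in principle be scripted against Jenkins' remote access API, which accepts a POST to a job's buildWithParameters endpoint. The sketch below only builds the requests and does not send them. The Jenkins base URL, the job name `restart-service`, and the parameter name `SERVICE` are all hypothetical placeholders, since this page does not state them; authentication is also left out.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# The two services named in the steps above.
SERVICES = ["symxchange-http-server", "symxchange-rpc-server"]


def build_restart_request(jenkins_url, job_name, service, param_name="SERVICE"):
    """Build (but do not send) a POST to a Jenkins job's buildWithParameters endpoint.

    jenkins_url, job_name, and param_name are assumptions for illustration;
    use the values from the real Jenkins job.
    """
    url = f"{jenkins_url.rstrip('/')}/job/{quote(job_name)}/buildWithParameters"
    body = urlencode({param_name: service}).encode("ascii")
    return Request(url, data=body, method="POST")


if __name__ == "__main__":
    for service in SERVICES:
        req = build_restart_request(
            "https://jenkins.example.com",  # hypothetical host
            "restart-service",              # hypothetical job name
            service,
        )
        print(req.get_method(), req.full_url, req.data.decode())
```

Sending a request would be a call to `urllib.request.urlopen(req)` with whatever credentials the Jenkins instance requires. The manual steps above remain the documented procedure.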

Once the services move to Kubernetes, you will use this link for Jenkins

Tooling

Most tooling links can be found here: https://docs.banno.com/infrastructure/urls/

Kibana
- https://www.elastic.co/products/kibana
- Our current primary use case for Kibana is to view logging.
- Tech talk: https://drive.google.com/file/d/1vkIW9PYZaJuhFjFWVrGmuzTlUb0KGyxt/view?usp=sharing
Grafana
- https://grafana.com/
- Grafana allows us to visualize our metrics.
- (Hint: for the new Grafana, remove the -lks from the URL listed on the infrastructure URLs page.)
Marathon
- https://mesosphere.github.io/marathon/
- Marathon allows us to visualize our running instances and configurations for our services
Mesos
- http://mesos.apache.org/
- A link to Mesos for each service is available in Marathon in the Debug section with a Sandbox that contains the stderr and stdout logs.
Prometheus
- https://prometheus.io/
- https://infra-az-centralus.banno-production.com/prometheus/ (not in infrastructure urls page)
- Prometheus is used for metrics and alerting.
Data Services/Fetch Reporting
- This gives us a UI for running queries to aid in debugging customer issues, looking up general data while testing, etc.
- Team Dreamwork owns this service.
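
For the Prometheus entry above, metrics can also be fetched outside the UI through Prometheus' HTTP API, whose instant-query endpoint is `/api/v1/query`. The sketch below only builds the query URL. It assumes the API is served under the same `/prometheus/` path prefix as the UI link listed above, and the `job` label value in the usage example is a hypothetical placeholder.

```python
from urllib.parse import urlencode

# Base URL taken from the Prometheus entry in the tooling list.
PROMETHEUS_BASE = "https://infra-az-centralus.banno-production.com/prometheus/"


def instant_query_url(expr, base=PROMETHEUS_BASE):
    """Return the URL for a Prometheus instant query of the given PromQL expression.

    Assumes the HTTP API lives under the same path prefix as the UI.
    """
    return f"{base.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    # The job label value is an assumption for illustration only.
    print(instant_query_url('up{job="symxchange-http-server"}'))
```

The resulting URL can be opened over VPN in a browser or fetched with any HTTP client; the response is JSON.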

Responsibilities

Primary - only doing FF work (no feature/project work)

Acknowledge and resolve PagerDuty alerts
Page other teams when necessary (via Slack /pd-* commands or the PagerDuty app)
Check for incoming customer issues at least once a day
Triage Jira issues: https://banno-jha.atlassian.net/issues/?filter=10843
- see https://docs.banno.com/operations/customer-issues/#engineering-triage-process for more details
Work on any customer issue related development/fixes, as needed
Work on any incident related development/fixes, as needed
Address Slack messages to the @pupper-ff group
Address Jira messages to @pupper & @pupper-ff
Monitor #org-pupper room for triage
Monitor #war-room-go-live for triage
Monitor #auto-pupper-alerts for alerts
Monitor #prod-people-reports for reports and mobile-admin triage
Monitor graphs periodically and after deploys
Work on non-feature related needs (such as logging, alerts, metrics, outside requests - infra/security, etc), as time allows

Secondary

Acts as a safety net
Primary reaches out to secondary when help is needed (whether primary is underwater or will be unavailable for a period of time)
Update PagerDuty as needed when swaps occur

Requirements

PagerDuty account
VPN access
Membership in the Banno organization and the Pupper team on GitHub
General knowledge of how to use Kibana, Marathon, Grafana, Mesos, Prometheus, Data Services Reporting

Incident Analysis

General incident analysis can be found here: https://github.com/Banno/incident-analysis
We will list and link our team's incident analyses here as we get them started.
- Short Form
- Long Form
Incident analysis is needed for:
- anything that needs a roll back/forward (hotfix/quick fix)
- incidents we start or get pulled into (that we do work for)
- incidents where Support updates our external status page