Services Engineering Software Deployment and Release Process

The goal of this document is to establish a set of semi-opinionated release processes that match where we are today and to outline the steps forward from there.

Requirements

Services must have success criteria for their deployments. What does a successful release look like? Possible indicators of a successful release:
- Errors consistent with the pre-release state
- SLOs continue to be met
- No alerts firing on Prometheus, Honeycomb, or Datadog metrics
- Rate of logged errors does not increase
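
As an illustration of the last indicator, the sketch below compares the logged-error rate before and after a release. The function names, the request/error counts, and the 10% relative tolerance are assumptions made for the example, not values defined by this process.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that logged an error (0.0 when there was no traffic)."""
    return errors / requests if requests else 0.0


def error_rate_did_not_increase(
    baseline_errors: int,
    baseline_requests: int,
    current_errors: int,
    current_requests: int,
    tolerance: float = 0.10,  # hypothetical: allow 10% relative noise
) -> bool:
    """True when the post-release error rate stays within tolerance of the baseline."""
    baseline = error_rate(baseline_errors, baseline_requests)
    current = error_rate(current_errors, current_requests)
    return current <= baseline * (1 + tolerance)
```

Note that with a baseline of zero errors, any post-release error fails this check. Each service would pick a tolerance and measurement window that suit its own traffic.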

Pre-Merge

Before merging, changes require at least one PR approval. Individual team policies may require more than one.
Updates to internal-api-docs or consumer-api-docs should be written and ready to merge at the same time, documenting any changes, additions, or deletions to existing endpoints. Changes to consumer-api-docs should have sign-off from our client engineering teams before merging. If you do not want others to call a particular endpoint, still include the endpoint in the documentation and call that out, rather than omitting the API documentation altogether.
No more than a single set of related changes (e.g., a feature branch) should be in the pipeline at a time. We have a continuous deployment process, so there is no need to have multiple potential failure points in the pipeline simultaneously. Some services may be granted exceptions, but this is the general rule. If another feature is in the pipeline, wait until it has been released.

Merge

The “release owner” is the person who presses the merge button. The release owner is responsible for the release process that the merge triggers, which concludes once the release is in production and has either been verified to be functioning properly or been rolled back.
The release owner is responsible for evaluating service health throughout the release process and is expected to keep eyes on key indicators the whole time.

Post-Merge

All success criteria must be checked in UAT before deploying to Production.
Release owners are expected to stay engaged and actively pursue approvals. Ping your EM or other approvers so that the release keeps moving continuously toward production.

Approvals

Regulatory compliance requires business/operational awareness of every change released to production. This means a manager must leave an auditable trail showing that they approved the software release.
An approver may not approve any deployment/release that contains changes authored by that approver.
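
The separation rule above can be expressed as a small check. This is a minimal sketch: the function names and the idea of identifying people by username are assumptions for illustration, not part of any existing tooling.

```python
def can_approve(approver: str, change_authors: set[str]) -> bool:
    """An approver is eligible only if none of the release's changes are theirs."""
    return approver not in change_authors


def eligible_approvers(approvers: list[str], change_authors: set[str]) -> list[str]:
    """Filter a list of designated approvers down to those allowed to approve."""
    return [person for person in approvers if can_approve(person, change_authors)]
```

For example, if a release contains changes by "alex" and "sam" (hypothetical usernames), `eligible_approvers(["alex", "dana", "lee"], {"alex", "sam"})` leaves only "dana" and "lee".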

Canary Deployments

Canary Deployments: Canaries will always be used where reasonable for the service's use case, to mitigate the risk of issues that would appear only in a production environment. Canary deployments are integrated into the Kubernetes pipeline. (Pulsar is actively working on expanding this item.)
All service success criteria must be checked in canary before proceeding to a full production deployment. The canary should also be allowed to run long enough to ensure stability. Canary container names end in -canary to identify them (the deployment name will be kube_deployment:service-canary).
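
The gating described in this document (success criteria checked in UAT, then in canary, then in production) can be sketched as follows. The stage names and helper functions are hypothetical, and the sketch assumes a service that uses a canary; it is not the actual pipeline implementation.

```python
from typing import Optional

# Stage order as described in this document, for a service that uses a canary.
STAGES = ("uat", "canary", "production")


def next_stage(current: str, criteria_passed: bool) -> Optional[str]:
    """Return the stage a release may advance to, or None if it may not advance."""
    if not criteria_passed:
        return None  # hold here (or roll back) until the success criteria pass
    index = STAGES.index(current)
    return STAGES[index + 1] if index + 1 < len(STAGES) else None


def canary_name(service: str) -> str:
    """Canary names carry a -canary suffix, e.g. service -> service-canary."""
    return f"{service}-canary"
```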
Canary Variations: Certain teams and code bases have unique circumstances that cause variations in how they execute canary deployments. Please read the specific deployment instructions below before deploying these services:
- Node-API-Gateway (NAG)

Maintenance Windows for Deployments

A deployment or release that carries any amount of risk that cannot be mitigated with a canary or other risk-mitigating mechanism must be deployed in a maintenance window.
Banno has regularly occurring maintenance windows (these are documented here and in the Operational Engineering Google calendar):
- Weekly on Wednesdays from midnight to 2:00 am Central.
- Monthly on the second Tuesday from 11:00 pm to 2:00 am Central.
If the weekly or monthly maintenance windows are not suitable for the change, or the change needs to be released sooner, an ad-hoc window can be scheduled by coordinating with Abby. That window should be added to the Operational Engineering calendar.
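
To check whether a given time falls inside one of the recurring windows above, a sketch like the following could be used. It makes several assumptions: the input is a naive datetime already expressed in Central wall-clock time, window end times are exclusive, and the monthly window runs past midnight into Wednesday morning. Ad-hoc windows are not covered.

```python
from datetime import date, datetime, time


def is_second_tuesday(day: date) -> bool:
    """The second Tuesday of a month always falls on day 8 through 14."""
    return day.weekday() == 1 and 8 <= day.day <= 14


def in_maintenance_window(central: datetime) -> bool:
    """central: naive wall-clock time, assumed to already be in Central time."""
    clock = central.time()
    # Weekly window: Wednesday from midnight up to (not including) 2:00 am.
    if central.weekday() == 2 and clock < time(2, 0):
        return True
    # Monthly window: second Tuesday from 11:00 pm onward. The part after
    # midnight lands on Wednesday 00:00-02:00, which the weekly rule covers.
    return is_second_tuesday(central.date()) and clock >= time(23, 0)
```

For instance, 11:30 pm on Tuesday, January 9, 2024 (the second Tuesday of that month) is inside the monthly window, while 11:30 pm on January 2, 2024 is not.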

In-Production

All service success criteria must be checked in production. The release owner is expected to know where the service's logs, metrics, and Honeycomb data live and to check them.
The release owner is expected to be fully engaged for a minimum of 30 minutes after the release to production or until the deployment is adequately validated to be functioning properly for all users and all FIs.

FAQ

What about changes that are only intended to be released to UAT and not to production?
- These should be coded behind feature flags that are disabled by default. That way the code can be fully released to production, and the changes can be enabled later by flipping the feature flag.
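
A minimal sketch of the disabled-by-default pattern is shown below. The `FeatureFlags` class, the flag name, and the per-environment overrides are hypothetical; the actual feature flag mechanism a service uses may look quite different.

```python
from typing import Mapping, Optional


class FeatureFlags:
    """Flags are off unless an environment explicitly turns them on."""

    def __init__(self, overrides: Optional[Mapping[str, bool]] = None) -> None:
        self._overrides = dict(overrides or {})

    def is_enabled(self, name: str) -> bool:
        # Unknown or unset flags default to disabled.
        return self._overrides.get(name, False)


def handle_request(flags: FeatureFlags) -> str:
    # "new-transfer-flow" is a made-up flag name for illustration.
    if flags.is_enabled("new-transfer-flow"):
        return "new behavior"
    return "existing behavior"


production = FeatureFlags()                       # nothing enabled
uat = FeatureFlags({"new-transfer-flow": True})   # enabled for UAT only
```

With this setup the same code ships to both environments, and `handle_request(production)` keeps returning the existing behavior until the flag is flipped.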
What regulatory compliance aspects do we need to be aware of?
- Technical Approval: Code changes and technical implementations must be reviewed by at least one other person to verify that the changes are reasonable and appropriate (to the best of one’s ability; nobody is perfect). The pull request review counts as a technical approval.
- Business/Operational Approval: At least one non-technical person must be aware of the changes being made to the production environment. A reply from an approver containing “Approved” in the body, sent in response to the email to the approvers group, counts as business/operational approval. The approvers group consists of engineering managers and other leadership in the Digital organization.
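
The business/operational approval rule could be checked along these lines. This is only a sketch under stated assumptions: replies are represented as (sender, body) pairs, the approvers group is a set of addresses, and the match on “Approved” is a case-sensitive substring test. None of this reflects an existing tool.

```python
from typing import Iterable, Set, Tuple


def has_business_approval(
    replies: Iterable[Tuple[str, str]],  # hypothetical (sender, body) pairs
    approvers: Set[str],
) -> bool:
    """True if any reply from a member of the approvers group contains "Approved"."""
    return any(
        sender in approvers and "Approved" in body
        for sender, body in replies
    )
```

A plain substring test would also match a reply such as "Not Approved", so a real check would need stricter parsing or a human reading the reply.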
What to do if you can’t get business/operational approval?
- If you cannot get business/operational approval, either ping your EM directly (your EM is already designated as an approver) or reach out to @Approvers in Slack with the subject line of the email for the release in question.
What to do if you need to release in an incident, after hours or when business/operational approvers are not available?
- During an incident, you can release to production without first getting explicit business/operational approval. Those approvals can be given retroactively, after the deployment/release.
Do we really need to be fully engaged with eyes on screen for a full 30 minutes after a release to production?
- For now, yes, but we expect to loosen these expectations as we improve our deployment and release reliability and stability.