Active-Active
When considering architectures for our systems, we're looking to solve specific problems: our regulatory burdens (such as patch management and uptime), disaster recovery, business continuity, and build vs. buy. An active-active system posture solves many of these problems at their core.
Why are we talking about active-active so much, and how does it affect our goals as a business? Active-active is a system posture in which our systems run in multiple GCP regions, each able to take traffic and work independently of the others. For example, if we take all of the services we run today in US-East and run them simultaneously in US-Central with a load balancer in front, we're now running in two separate regions: US-Central can work without US-East, and vice versa. For systems that need read-after-write consistency, this only holds if we have external consistency in our database layer.
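A minimal sketch of the posture described above. The region labels, the health-check shape, and the routing function are hypothetical stand-ins for the real load balancer; the point is that either region can serve alone, so losing one region is a routing decision, not a failover procedure:

```python
# Hypothetical two-region active-active routing sketch.
REGIONS = ["us-east", "us-central"]

def healthy_regions(health):
    """Return the regions currently able to serve traffic."""
    return [r for r in REGIONS if health.get(r, False)]

def route(request_id, health):
    """Pick a serving region; each region works independently of the other."""
    candidates = healthy_regions(health)
    if not candidates:
        raise RuntimeError("no healthy region available")
    # Spread traffic across all healthy regions while both are up.
    return candidates[request_id % len(candidates)]

# Both regions up: traffic is spread across them.
print(route(0, {"us-east": True, "us-central": True}))   # us-east
print(route(1, {"us-east": True, "us-central": True}))   # us-central
# us-east lost: everything lands on us-central with no failover steps.
print(route(2, {"us-east": False, "us-central": True}))  # us-central
```

In the real system the load balancer plays the role of `route`, but the property is the same: removing one region from the healthy set leaves a working system.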
Disaster Recovery (DR) in a traditional data-center model requires many changes: firewalls, databases, load balancing, routing tables, DNS, and more. Regulators measure our uptime against a requirement much stricter than the 99.5% in our SLA. Jack Henry is required to exercise its alternate data centers or regions at least once per year. We translate RTO (Recovery Time Objective) as the amount of time it takes for us to come back online. Our 99.5% SLA allows roughly 44 hours of downtime per year, which means zonally redundant architectures are fine; a requirement of less than 24 hours pushes us to regional redundancy. Banno has a 4-hour RTO. RTO is not the only constraint we're under: there is also RPO (Recovery Point Objective), which is how much data we can afford to lose. Banno's RPO requirement is 1 hour. Replicating our PostgreSQL databases is time consuming, which makes the 1-hour RPO a continuing challenge to meet, and we'd like to leave ourselves room for the errors and mishaps that will happen during a disaster recovery.
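As a sanity check on these budgets, an availability percentage translates into yearly allowed downtime like so (a quick sketch):

```python
# Back-of-the-envelope downtime budget implied by an availability SLA.
HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_budget_hours(availability_pct):
    """Maximum allowed downtime per year at a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

print(downtime_budget_hours(99.5))    # ~43.8 hours/year
print(downtime_budget_hours(99.999))  # ~0.09 hours (~5.3 minutes)/year
```

At 99.5% the budget is roughly 44 hours a year, comfortably above a 24-hour requirement; at five nines it shrinks to minutes, which is why the database layer's SLA matters so much.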
If the primary region becomes suddenly unavailable, our replication lag is our data loss window (RPO). If a region-wide failure occurs while we're active-active, we direct traffic to the other region, which is already serving, and scale it up. This cuts our downtime tremendously and moves us from a disaster recovery posture to a disaster avoidance posture. It's worth remembering that maintenance downtime also counts against us, both for our regulatory needs and from our customers' perspective. Making DR faster and easier (or, preferably, unnecessary) is table stakes given how much database-related downtime affects us: database maintenance currently accounts for the majority of our downtime.
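To make the disaster-avoidance claim concrete, a toy comparison of the two timelines. Every step and duration here is a hypothetical illustration, not a measured value; the structural point is that active-active removes most of the steps entirely:

```python
# Hypothetical, illustrative step timings (minutes) -- not measured values.
traditional_dr_steps = {
    "detect and declare disaster": 30,
    "update firewalls and routing tables": 45,
    "promote database replica": 60,
    "repoint DNS / load balancing": 30,
    "verify and reopen traffic": 45,
}

active_active_steps = {
    "detect regional failure": 5,
    "drain traffic to the healthy region": 5,
    "scale up the surviving region": 20,
}

print(sum(traditional_dr_steps.values()))  # 210 minutes
print(sum(active_active_steps.values()))   # 30 minutes
```

Even with generous assumptions for the traditional path, most of its wall-clock time comes from steps the active-active posture never has to perform.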
In GCP we have a few database choices that can perform reads and writes across regions with consistency, but only one, Cloud Spanner, offers externally consistent multi-region replication. Most of the managed solutions fulfill our regulatory requirement for patching. RTO and RPO for Spanner are theoretically zero, with uptime defined by Google as five nines (99.999%). This gives us a healthy starting point for our own reliability, since we can only be as good as what we're standing on. In the event of a widespread regional failure we don't need to fail over to another region in the traditional sense; we just send traffic to the other region, which is already up and already consistent.
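One way to see the "only as good as what we're standing on" point: for serially dependent layers, availabilities roughly multiply (a simplifying assumption that treats failures as independent). The 99.9% application figure below is a hypothetical stand-in for our own services:

```python
# Composite availability of serially dependent layers (rough model:
# independent failures, so availabilities multiply).
def composite_availability(*layers_pct):
    a = 1.0
    for pct in layers_pct:
        a *= pct / 100
    return a * 100

SPANNER = 99.999  # Google's published five-nines figure
APP = 99.9        # hypothetical availability of our own services

print(round(composite_availability(APP, SPANNER), 4))  # 99.899
```

The database's five nines barely dent the result; the combined figure is dominated by the weakest layer, which is exactly why a strong database SLA is a floor to build on rather than a guarantee.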