Overview
The servers that our automation tooling will provision in Lenexa fall into a few groups: Mesos servers, application servers, and databases. The one exception is the HBase / Hadoop servers, which do not run atop vSphere / VMWare; they are bare metal HP servers that we have set up specifically for that purpose.
Mesos
Mesos is a platform for viewing your datacenter as a cluster of resources. Instead of the common model of provisioning specific applications on specific servers, the Mesos approach is to submit jobs to be run wherever Mesos has space within the datacenter. Mesos places each application on one of the servers with available resources, and the application is then located via service discovery. Our load balancers work in tandem with service discovery to route requests to the correct application.
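The idea of running a job wherever there is space can be sketched in a few lines of Python. This is only an illustration of the concept, using a simple first-fit rule; it is not how Mesos itself is implemented, and the server names, job name, and resource figures are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Server:
    name: str
    free_cpus: float
    free_mem_mb: int
    jobs: List[str] = field(default_factory=list)


@dataclass
class Job:
    name: str
    cpus: float
    mem_mb: int


def place(job: Job, servers: List[Server]) -> Optional[Server]:
    """Put the job on the first server with enough free CPU and memory."""
    for server in servers:
        if server.free_cpus >= job.cpus and server.free_mem_mb >= job.mem_mb:
            # Reserve the resources so later jobs see the reduced capacity.
            server.free_cpus -= job.cpus
            server.free_mem_mb -= job.mem_mb
            server.jobs.append(job.name)
            return server
    return None  # no server currently has room for this job


# Hypothetical two-node cluster; node-a is too small for the job.
servers = [Server("node-a", 1.0, 1024), Server("node-b", 4.0, 8192)]
chosen = place(Job("api", cpus=2.0, mem_mb=2048), servers)
print(chosen.name if chosen else "unplaced")  # prints "node-b"
```

The caller never names a server; it only states what the job needs, which is the key difference from the traditional model described in the next section.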
Application Servers
There remains a population of servers and applications that have not yet been ported into the Mesos model. These servers host many of our applications, which have been provisioned via automation and can be re-created as needed. This is the more traditional model of placing applications on specific servers for specific purposes. These services will be transitioned into the Mesos model as soon as we can get to them.
Databases
Databases at Banno that hold customer data currently fall into the traditional provisioning model: specific databases running on specific servers at specific IP addresses. We can provide a listing of the names of these servers that will be running in the Lenexa datacenter upon completion of the migration.
Automation and Server Uniformity
One characteristic of the automation platform we have written is that every server is provisioned atop the same security-hardened Ubuntu 14.04 Server LTS image. We call that image the “stemcell”. Each server is then mutated into its specialty at the right time in the provisioning process. This ensures that when we create environments from scratch, the underlying systems have the same security policies applied to them, across the board. The ideal way to keep these up to date is to delete and recreate servers atop a new stemcell image as security requirements change. In lieu of that ideal, identical patches to the stemcell can be applied via automation tooling to keep everything uniformly up to date.
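As a rough sketch of that update policy, the following Python splits out-of-date servers into those that can be deleted and recreated (the preferred path) and those that would be patched in place. The server names, the version labels such as "v7", and the notion of a "recreatable" set are all assumptions made for the example; they do not describe our actual tooling.

```python
def plan_updates(server_stemcells, current, recreatable):
    """Decide how to bring each stale server up to the current stemcell.

    server_stemcells: dict of server name -> stemcell version it was built from
    current:          the newest stemcell version
    recreatable:      set of server names that can safely be deleted and rebuilt
    """
    plan = {"recreate": [], "patch": []}
    for name, version in sorted(server_stemcells.items()):
        if version == current:
            continue  # already on the newest image, nothing to do
        action = "recreate" if name in recreatable else "patch"
        plan[action].append(name)
    return plan


# Hypothetical fleet: app-1 is current, app-2 and db-1 are one version behind.
fleet = {"app-1": "v7", "app-2": "v6", "db-1": "v6"}
plan = plan_updates(fleet, current="v7", recreatable={"app-1", "app-2"})
print(plan)  # {'recreate': ['app-2'], 'patch': ['db-1']}
```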
Disaster Recovery
From the beginning of this process, we have written the automation tooling in such a way that the entire production environment can be provisioned offsite in a disaster recovery situation via the same tooling that creates and maintains the existing environment. This means that instead of doing tabletop exercises of our disaster recovery, we can actually recreate an entire skeleton environment in Amazon Web Services that is ready to have the most recent data populated within it, and ready to accept production traffic.
If any more detail needs to be added to the answer to this question, or if you truly need an inventory of servers in the way that they will eventually stabilize in Lenexa, please just let me know and we will write that up to the best of our ability.
Layers of Abstraction in the Banno Lenexa Datacenter
At Banno we have a reasonably sophisticated way of deploying our applications within Docker containers that find one another via service discovery. It is worth describing each layer of abstraction, from the hardware all the way up to how the containers are provisioned and initialized.
- Bare Metal in Racks in Lenexa – This gear was installed and is maintained by the ETS (Enterprise Technology Services) group.
- vSphere / VMWare – This was installed by the ETS group and runs atop the bare metal gear in Lenexa.
- Virtual Machines – With Terraform (http://www.terraform.io/) we call the vSphere API to spin up new virtual machines on the VMWare installation in Lenexa. These virtual machines come and go as needed. We do not rely on any individual virtual machine being permanent. By assuming that any virtual machine may lock up or disappear at any moment, we promote awareness that our system should be resilient to individual node failure.
- Mesos – After Terraform provisions all of the virtual machines with the newest “stemcell” image, it sets up a fresh installation of Mesos on the correct virtual machines. Mesos has a Master that communicates with Agents on all of the virtual machines. The Master keeps a record of what resources are free and where.
- Marathon – Next, Terraform provisions the Mesos Framework, Marathon, which determines which long-running processes should be running in the datacenter. It interacts with the Mesos Master to ensure that whatever applications we tell Marathon to supervise stay online somewhere in the datacenter. In the event of virtual machine failure, Marathon and the Mesos Master conspire to determine where the failed applications should be placed.
- Docker & Sidecar – All of the applications that are initialized via Marathon on virtual machines with the Mesos agent installed are wrapped up in immutable Docker containers. These containers contain one or two processes, in general. If it is a Java-based process then it has service discovery built into the process and can find where its dependencies are in the datacenter. If it is not a Java-based process then we run a tiny process in the container called the “sidecar”. That process informs the running application where its dependencies are. In the event of failure, the Docker container may be started up on a different virtual machine, where service discovery will help the system find the application and help the application find what it depends upon.
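The failure handling described in the Marathon and Docker layers above can be pictured with a small Python sketch. It is a simplified model rather than Marathon's or Mesos's actual logic: one dictionary stands in for both the scheduler's state and the service discovery registry, failed applications are spread round-robin over the surviving hosts, and the application and host names are invented for the example.

```python
def reschedule(placements, healthy_hosts):
    """Move every app whose host has failed onto a healthy host.

    placements:    dict of app name -> host name; updated in place, so it
                   also serves as the service discovery registry in this sketch
    healthy_hosts: set of host names that are still alive
    Returns the list of apps that were moved.
    """
    if not healthy_hosts:
        raise RuntimeError("no healthy hosts left to place applications on")
    hosts = sorted(healthy_hosts)
    moved = []
    for app, host in sorted(placements.items()):
        if host not in healthy_hosts:
            # Spread the displaced apps round-robin across the surviving hosts.
            placements[app] = hosts[len(moved) % len(hosts)]
            moved.append(app)
    return moved


def lookup(placements, app):
    """Service discovery: where is this app running right now?"""
    return placements[app]


# Hypothetical state before vm-2 disappears and a fresh vm-3 joins.
placements = {"web": "vm-1", "worker": "vm-2", "reports": "vm-2"}
moved = reschedule(placements, healthy_hosts={"vm-1", "vm-3"})
print(moved)                         # ['reports', 'worker']
print(lookup(placements, "worker"))  # vm-3
```

Because callers always go through the lookup rather than a fixed address, an application that moves to a different virtual machine remains reachable without any change on the caller's side.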
Separate from the VMWare / Mesos installation is a set of bare metal HP servers that ETS installed in the racks but that Banno sets up and maintains. These servers house our Hadoop installation, which is an important datastore for many of our products in the Banno platform.