You’ve decided to deploy CloudForms to manage your hybrid cloud environment – fantastic! This article discusses a few architectural options and considerations to weigh before you deploy your new region.
Consideration 1: use zones to separate your appliances within a region
A zone is a collection of CloudForms appliances logically grouped together within the same region. A region is a collection of CloudForms appliances that all share the same Virtual Management Database (VMDB).
Some roles for your appliances are zone-aware and can only be active on a single appliance within a zone. Others are region-aware and will only be active on a single appliance in an entire region.
As an example: the Provider Inventory role is zone-aware, and so there can only be one appliance performing this role in a zone. If you have more than one provider in a zone, then one appliance performs all the inventory refresh work for all of the providers. For small deployments that’s not so bad – for big deployments (thousands of instances) it will almost certainly overload your appliance, causing refresh times to blow out considerably.
Rule: one provider per zone, maximum. This allows you to spread the work of provider refresh tasks, as well as other load-heavy tasks such as capacity and utilisation (C&U) collection and processing work. As one provider grows, you can scale the appliances in the zone to match.
Rule: split your user-facing appliances into their own zone. User-facing appliances need to be isolated from zone and region-level tasks so that they can solely serve client requests. Place these into their own zone and behind a load balancer. Don’t forget to pump up the number of UI workers and Web Services workers – the default is one of each out of the box; that won’t be enough for a large userbase.
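The two rules above are easy to check mechanically before you build anything. The following is a minimal sketch, not part of CloudForms: the `Zone` class, the `check_layout` function and the zone and provider names are all hypothetical, and the layout is described by hand rather than read from an appliance.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Zone:
    """A planned zone (hypothetical model, not a CloudForms object)."""
    name: str
    providers: List[str] = field(default_factory=list)
    user_facing: bool = False


def check_layout(zones: List[Zone]) -> List[str]:
    """Return the rule violations found in a proposed region layout."""
    problems = []
    for zone in zones:
        # Rule: one provider per zone, maximum.
        if len(zone.providers) > 1:
            problems.append(
                f"{zone.name}: {len(zone.providers)} providers share one zone")
        # Rule: user-facing appliances live in their own zone.
        if zone.user_facing and zone.providers:
            problems.append(
                f"{zone.name}: user-facing zone also hosts a provider")
    if not any(zone.user_facing for zone in zones):
        problems.append("no dedicated user-facing zone")
    return problems


# Example plan with made-up names; the "cloud" zone breaks the first rule.
layout = [
    Zone("ui", user_facing=True),
    Zone("vmware", providers=["vcenter-prod"]),
    Zone("cloud", providers=["aws-east", "azure-west"]),
]
for problem in check_layout(layout):
    print(problem)
```

Splitting the two cloud providers into separate zones would make the check come back clean.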
Consideration 2: multi-region; is it worth it?
Multi-region is typically deployed when the managed environment becomes so large that dozens of appliances are required across multiple zones, creating an ever-growing number of connections to the primary VMDB and therefore increasing the load on it. In essence you’re deploying multiple databases and spreading the load by splitting the appliances into dedicated regions.
With a multi-region design you can deploy a global region that provides a mostly read-only view of all of the subordinate regions beneath it. The subordinate regions will replicate some of their data using pglogical up to the global region where it becomes visible.
You don’t get the same level of management tools from the global region, as there’s only a limited set of non-read-only actions you can perform – most functionality will require you to log in to the subordinate region. If you aren’t particularly concerned about service deployment across regions, and are happy with a primarily view-only experience, multi-region could be viable for you.
On the other hand, if you want the full set of functionality available across all of your providers you need to stick with a single region. If you’ve got the hardware, you can ensure your primary and standby databases are powered with the raw CPUs and RAM necessary to support the load.
My vote? If you’ve got the hardware to power the VMDB, stick with single region. You get the full suite of capabilities of CloudForms without the additional complexity of running several sub-regions.
Consideration 3: primary database, standby database, and failover
CloudForms operates with a primary/standby failover model, where the primary PostgreSQL database uses streaming replication to ship its write-ahead logs to one or more standby servers. These standby servers are read-only. If the primary fails, one of the standbys will promote itself to the position of new primary and start accepting writes from clients. This is a so-called hot standby.
The problem is that the CloudForms process (evmserverd) cannot automatically detect and self-heal from a primary failover event.
CloudForms appliances therefore include a daemon monitor called evm-failover-monitor whose sole purpose is to monitor the primary database for failover events. If one is detected the daemon shuts down the CloudForms evmserverd process running on that node, updates the Rails database configuration file, then starts the evmserverd process again.
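To make that sequence concrete, here is a rough sketch of the detect, stop, reconfigure, start loop described above. It is not the actual evm-failover-monitor code: the function name and every callable passed in are hypothetical stand-ins, and the 60-second poll interval is simply an assumed default for the sketch.

```python
import time


def monitor_failover(get_primary, stop_evm, write_db_config, start_evm,
                     known_primary, poll_seconds=60, max_polls=None,
                     sleep=time.sleep):
    """Poll for a new primary; on a change, stop, reconfigure and restart.

    All callables are hypothetical hooks supplied by the caller.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        current = get_primary()
        if current is not None and current != known_primary:
            stop_evm()                # evmserverd goes down on this node
            write_db_config(current)  # point the database config at the new primary
            start_evm()               # evmserverd comes back up
            known_primary = current
        polls += 1
        sleep(poll_seconds)
    return known_primary


# Dry run with fakes: the primary moves from db1 to db2 on the second poll.
events = []
answers = iter(["db1", "db2", "db2"])
final = monitor_failover(
    get_primary=lambda: next(answers),
    stop_evm=lambda: events.append("stop evmserverd"),
    write_db_config=lambda host: events.append(f"point config at {host}"),
    start_evm=lambda: events.append("start evmserverd"),
    known_primary="db1",
    max_polls=3,
    sleep=lambda seconds: None,  # skip real waiting in the dry run
)
print(final, events)
```

The point to notice is that the service is fully stopped between detection and restart, and that this happens independently on every appliance in the region.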
Here’s the problem: this process is slow. Out of the box it’s 60 seconds before the primary database role is taken up by a standby (configurable, but not documented – see repmgr.conf). Then it’s another 60 seconds before the evm-failover-monitor starts checking for a new primary. Then your entire region goes down because every evmserverd process is shut off so a Ruby-on-Rails YAML config file can be updated, before being started up again.
Bottom line: without further adjustment to your deployment, an out-of-the-box deployment will result in downtime for your entire region during a failover event.
To make matters worse, each appliance upon startup attempts to obtain an exclusive lock on the VMDB for the purposes of seeding. Only one appliance can hold this lock at a time, meaning all other appliances have to form a very long queue. The net result is that your region can take a long time to recover from a failover event – don’t expect change out of 20-30 minutes if you have dozens of appliances.
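A back-of-the-envelope model shows how those numbers add up. This is only a sketch: the two 60-second delays come from the defaults described above, but the per-appliance seeding time is an illustrative assumption, not a measured value, so substitute figures from your own environment.

```python
def estimated_recovery_seconds(appliances, seed_seconds_each,
                               promote_delay=60, monitor_delay=60):
    """Rough worst-case recovery time after a primary failure.

    Seeding is serialized by the exclusive lock, so its cost grows
    linearly with the number of appliances.
    """
    return promote_delay + monitor_delay + appliances * seed_seconds_each


# Assumed figures: 36 appliances, each holding the seeding lock for 40 seconds.
total = estimated_recovery_seconds(appliances=36, seed_seconds_each=40)
print(f"{total} seconds, or about {total / 60:.0f} minutes")
```

With these assumed figures the queue for the seeding lock accounts for 24 of the 26 minutes, which is why the delay grows with the size of the region rather than with the failover timers.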
So what can be done?
Your major options here are load balancing to your database nodes or establishing a virtual IP for failover purposes. I’m going to talk about both of these in another post.
The database is the core of your CloudForms deployment and so is going to factor heavily into the architecture you choose.
Regardless of how you deploy your database you should always split your providers into their own zones, so that dedicated appliances can handle loads and tasks for a single provider. Move your user-facing appliances into their own zone as well and turn off all non-essential roles so they can focus solely on serving user requests.
So there you go! A few considerations for deploying a CloudForms region. Keep an eye out for my next post where I’ll discuss using haproxy/keepalived to reduce the impact of database failover in a CloudForms region.