Out of the box, CloudForms comes with the ability to deploy PostgreSQL appliances that can be configured into a primary/standby relationship. If the primary fails, the standby takes over automatically.
Your non-database appliances are hardcoded to reference the primary via its IP address. Unfortunately, when the primary fails over to a standby this IP changes, but your appliances aren’t immediately aware of it. A watchdog service running on each appliance keeps an eye on the database and identifies when the primary has failed over. After a set period of time the watchdog updates the hardcoded database IP to the new primary and then restarts your evmserverd process to make the change take effect.
This occurs on every non-database appliance and so a primary failover event means an unavoidable outage across your entire region. Not good. But what if we could at least reduce the outage duration, perhaps by avoiding the restart of your main CloudForms service?
This post discusses one technique that doesn’t require CloudForms service restarts – use a virtual IP for your database. This VIP will live on whichever database node is the current primary and move when the role of primary fails over. With no more need to restart your CloudForms services, recovery time from failover events is substantially reduced.
Show me the playbook!
Here is a playbook that will retrofit an existing region to use a virtual IP for the database. You need to provide it with the virtual IP to use, but after that it will:
- Install keepalived on all of your database nodes.
- Template a keepalived.conf file that deploys a virtual router.
- Configure firewalld rules on the hosts to permit VRRP traffic.
- Deploy a keepalived custom check script that verifies that pgsql is running on the host and is the current primary. If so the script returns 0 and keepalived assigns the VIP. If not, it returns 1.
- Note: this check script was taken from here and then modified to also consider being a standby as a ‘failure’. We only want keepalived to add the VIP if the database is alive and the primary.
- Update the database configuration of all of your region appliances to refer to the new VIP.
- Restart evmserverd across your region to have the above change take effect. That’s an outage – sorry!
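As a rough illustration of the keepalived.conf the playbook might template out – the interface name, virtual router ID, VIP and script path below are invented for the example, not taken from the playbook:

```
vrrp_script check_pgsql_primary {
    script "/usr/local/bin/check_pgsql_primary.sh"   # custom check described above
    interval 2     # run the check every 2 seconds
    fall 2         # require 2 consecutive failures before marking FAULT
    rise 2         # require 2 consecutive successes before recovering
}

vrrp_instance VI_DB {
    state BACKUP            # start as BACKUP and let the check script decide
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.0.2.100/24      # the database VIP
    }
    track_script {
        check_pgsql_primary
    }
}
```

Because the tracked script only succeeds on the current primary, the VIP follows the primary role rather than any particular host.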
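The logic of the custom check script in the list above can be sketched like this – the connection details are assumptions for a typical appliance, and `classify` is a hypothetical helper name, not something from the playbook:

```shell
#!/bin/bash
# Sketch of a keepalived check script: succeed only when the local
# PostgreSQL instance is up AND is the current primary (not a standby).

PSQL="psql -U postgres -h 127.0.0.1 -p 5432 -At"

# Map the output of pg_is_in_recovery() to a keepalived exit code:
# "f" means we are the primary (0); "t" (standby) or no answer (down) is 1.
classify() {
  case "$1" in
    f) echo 0 ;;   # primary: keepalived keeps/claims the VIP
    *) echo 1 ;;   # standby or unreachable: no VIP here
  esac
}

state="$($PSQL -c 'SELECT pg_is_in_recovery();' 2>/dev/null || true)"
rc="$(classify "$state")"
# In the real check script, finish with: exit "$rc"
```

Treating “standby” the same as “down” is the key modification mentioned above: keepalived should only ever assign the VIP to a live primary.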
Things to be aware of
- When configuring a standby, do not configure it to point to the VIP for replication. Always point it at the true IP address of the primary. The playbook takes care of this for you.
- Split-brain? VRRP is not a consensus protocol – in other words, a split-brain situation where the VIP promotes on both appliances is possible if connectivity to the primary is lost and connectivity between VRRP instances is lost also. In this situation you would also have a true split-brain situation for your pgsql database, so the VIP is the least of your concerns here 🙂
- Can’t see VRRP traffic! VRRP is a multicast protocol by default and not all environments permit multicast traffic without further configuration (if at all). If this is you, set the variable vrrp_use_unicast=True for the playbook. This will configure your virtual router to use unicast.
- You’ll know that multicast traffic isn’t permitted if you can’t see VRRP packets arriving on the interfaces of all of your database nodes (tcpdump -nevvi ethX vrrp). You will also see all of your database nodes add the same VIP as they can’t communicate with each other and so decide they will carry the VIP.
- Seamless? You will still have unavoidable errors on your appliances during this failover event – this will manifest as error messages in the UI and stack traces in the logs. This will be brief.
- Worker impacts? How processes that were running before the outage will cope – think Automation – is an unknown, but this is no different from what would happen if a failover event occurred without these modifications.
- What about haproxy? You have other options – if you don’t want to run keepalived and float a VIP between your DB appliances, consider adding a haproxy load balancer with a custom check script that only proxies to a backend if it is the current primary. See the article I linked above for a demonstration. Configure your non-VMDB appliances to connect to the haproxy instance instead (you can use the playbook above – just pass --skip-tags keepalived to have the VIP configured as the database address on your non-VMDB appliances without installing keepalived on your DB nodes).
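If you go the haproxy route, a minimal sketch of such a configuration might look like the following – the addresses and script path are invented, and using haproxy’s external-check mechanism to run a primary-detection script is one way to do it, not necessarily what the linked article does:

```
global
    external-check          # permit external agent checks

defaults
    mode tcp
    timeout connect 5s
    timeout client 30m
    timeout server 30m

listen postgres
    bind *:5432
    option external-check
    # Same idea as the keepalived check: succeed only on the current primary
    external-check command /usr/local/bin/check_pgsql_primary.sh
    server db1 192.0.2.11:5432 check inter 3s fall 2 rise 2
    server db2 192.0.2.12:5432 check inter 3s fall 2 rise 2
```

Only the backend whose check reports “primary” stays up, so haproxy always proxies to the current primary without any VIP floating between DB nodes.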
Post-script: why is the appliance restart a problem?
When the main CloudForms process (evmserverd) starts it takes an exclusive lock on the database. This is for the purpose of seeding tables with records when the region is created, or between version changes when new default records might be created.
Every appliance goes through this process during startup. While one appliance holds the exclusive lock it cannot be held by any other appliance, so we end up with a queue of appliances waiting during startup to hold the exclusive lock and perform their seeding tasks.
With a small handful of appliances this doesn’t take long, but at a few seconds per appliance, multiplied by a few dozen appliances in a larger region (say 5 seconds × 30 appliances – already 2.5 minutes of serialized startup), the time for full recovery adds up, quickly.