Applying TLS Everywhere to an existing OpenStack 13 (Queens) cloud

TLS Everywhere was introduced in the Queens cycle to provide TLS across essentially all communication paths within OpenStack: not just the public endpoints, which have had TLS support for a while, but also the internal endpoints, admin endpoints, the RabbitMQ bus and Galera replication/connections.

Unfortunately, out of the box you cannot apply the TLS everywhere environment files on an existing OSP13 cloud and expect it to just work. The TLS everywhere feature in Queens, and indeed Rocky, is based on the assumption that you are deploying a fresh cloud.

After some work over the last few days with a few colleagues, there’s a solution for applying TLS Everywhere retrospectively to an OSP13 deployment. But be warned: it’s messy.

The Problems

To avoid an essay, I’ll summarise the problems that will appear if you attempt to run a stack update and simply add the TLS-everywhere environment files:

  • Novajoin requires server metadata that is not present on the Nova server instances on the undercloud. This metadata includes the details of which services and networks live on which host, along with the crucial “ipa_enroll = True” property. It is only created when the EnableInternalTLS parameter is set to true.
    • Solution: easy, include the environment files 🙂 (see the example deploy command after this list).
  • The required hosts and services will not be present in FreeIPA. Novajoin creates hosts, services and so on in response to metadata queries, which do not happen during a stack update, only on stack create (via the config drive).
    • Solution: you can query the metadata service after the fact and trigger Novajoin enrolment. This has to be done during the stack update, as the server metadata mentioned above needs to be in place for Novajoin to work.
  • The galera, haproxy, rabbitmq and redis Pacemaker bundles will not be updated to bind mount the new certificates.
    • Solution: covered below. This requires additional hieradata overrides and specific patches from upstream puppet-pacemaker.
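
For reference, the stack update ends up looking something like the command below. The tripleo-heat-templates environment file paths are the usual ones for Queens but may differ slightly on your release, and the local environment file name is just an example:

# Sketch of the stack update, assuming the usual Queens tripleo-heat-templates
# layout. Add all of the environment files you already deploy with (omitted
# here); the last -e is an example name for the file carrying the
# ExtraConfigPre registrations shown later in this post.
openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-everywhere-endpoints-dns.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-internal-tls-certmonger.yaml \
  -e /home/stack/overcloud/templates/internal-tls/tls-everywhere.yaml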

Triggering Novajoin via ExtraConfigPre

Internal certificates are obtained by Certmonger requesting them directly from FreeIPA. On server creation, Novajoin takes care of creating the necessary hosts and services in FreeIPA and makes a one-time password (OTP) available to the instance via metadata. The server can use this OTP to enrol in IPA and download its keytab, enabling Certmonger to request certificates for the various services.

Novajoin delivers a registration script to the instance via static vendordata. This script first checks whether the join metadata is present on the config drive (which it will be on a fresh deploy with Novajoin), then checks whether it is available from the metadata endpoint (169.254.169.254).
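
If you want to sanity check what the metadata service will hand out before wiring anything into Heat, you can query the vendordata endpoint by hand from an overcloud node. This is the same URL the modified script below polls; the join data should carry the hostname, krb_realm and ipaotp keys:

# Run on an overcloud node: fetch the Novajoin vendordata from the metadata
# service and pretty-print it (python2 is what ships on OSP13 nodes).
curl -s http://169.254.169.254/openstack/2016-10-06/vendor_data2.json | python -m json.tool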

I pre-deploy this script onto all hosts under /tmp, then execute it via an ExtraConfigPre resource. The script is modified slightly to skip searching for data on the config drive – we need it to hit the metadata service directly. This resource looks like so:

heat_template_version: 2014-10-16

description: >
  Apply deep_compare patches and call setup-ipa-client.sh

parameters:
  server:
    type: string
  DeployIdentifier:
    type: string
    default: ''
    description: >
      Setting this to a unique value will re-run any deployment tasks which
      perform configuration on a Heat stack-update.

resources:
  TlsEverywhereExtraConfigPre:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      config: |
        #!/bin/bash
        # execute the setup-ipa script
        # this calls novajoin for host registration
        /tmp/setup-ipa-client.sh

  TlsEverywhereExtraDeploymentPre:
    type: OS::Heat::SoftwareDeployment
    properties:
      config: {get_resource: TlsEverywhereExtraConfigPre}
      server: { get_param: server }
      actions: ['CREATE','UPDATE']
      input_values:
        deploy_identifier: {get_param: DeployIdentifier}

outputs:
  deploy_stdout:
    description: Deployment reference, used to trigger pre-deploy on changes
    value: {get_attr: [TlsEverywhereExtraDeploymentPre, deploy_stdout]}

The associated resource registries are:

resource_registry:
    OS::TripleO::ControllerExtraConfigPre: /home/stack/overcloud/templates/internal-tls/tls_everywhere_extraconfig_pre.yaml
    OS::TripleO::ComputeExtraConfigPre: /home/stack/overcloud/templates/internal-tls/tls_everywhere_extraconfig_pre.yaml

In this case I have used the [ROLE]ExtraConfigPre hooks, as my deployment-wide ExtraConfigPre hook is already taken by my RHSM registration resource.
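
One way to pre-deploy the script itself is to push it out from the undercloud over SSH. A rough sketch, assuming the standard heat-admin user and that the modified script is saved locally as setup-ipa-client.sh:

# From the undercloud: copy the modified script to /tmp on every overcloud node.
source ~/stackrc
for ip in $(openstack server list -f value -c Networks | sed 's/ctlplane=//'); do
  scp setup-ipa-client.sh heat-admin@${ip}:/tmp/setup-ipa-client.sh
  ssh heat-admin@${ip} 'chmod +x /tmp/setup-ipa-client.sh'
done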

The modified script is shown below. It’s basically identical to the static version, except that the config drive checks have been removed so it targets the metadata service directly. We’ve also added a check for the case where, for whatever reason, /etc/ipa/ca.crt is a directory rather than a file, which we’ve seen once or twice (we can’t pin down why yet):

# MODIFIED
#
# check if /etc/ipa/ca.crt is a directory
# remove it if so
if [ -d "/etc/ipa/ca.crt" ]; then
  rmdir /etc/ipa/ca.crt
fi

# MODIFIED
#
# Config drive metadata function removed.
function get_metadata_network {
  # Get metadata over the network
  data=$(timeout 300 /bin/bash -c 'data=""; while [ -z "$data" ]; do sleep $[ ( $RANDOM % 10 )  + 1 ]s; data=`curl -s http://169.254.169.254/openstack/2016-10-06/vendor_data2.json 2>/dev/null`; done; echo $data')

  if [[ $? != 0 ]] ; then
    echo "Unable to retrieve metadata from metadata service."
    return 1
  fi
}

# MODIFIED
#
# Skip checking the config drive.
if ! get_metadata_network; then
  echo "FATAL: No metadata available"
  exit 1
fi

# Get the instance hostname out of the metadata
fqdn=`echo $data | python -c 'import json,sys;obj=json.load(sys.stdin);print obj.get("join", {}).get("hostname", "")'`

if [ -z "$fqdn" ]; then
  echo "Unable to determine hostname"
  exit 1
fi

realm=`echo $data | python -c 'import json,sys;obj=json.load(sys.stdin);print obj.get("join", {}).get("krb_realm", "")'`
otp=`echo $data | python -c 'import json,sys;obj=json.load(sys.stdin);print obj.get("join", {}).get("ipaotp", "")'`

# MODIFIED
#
# `hostname -f` returns the FQDN, so ipa-client-install will not change the hostname
# `hostname` will continue to return the short name. Certmonger will fail to retrieve certificates
# when `hostname` is returning the short name, as it will fail to find an appropriate principal in the keytab.
#
# Changed this from `hostname -f` to `hostname` to force the update.
hostname=`/bin/hostname`

# run ipa-client-install
OPTS="-U -w $otp"
if [ "$hostname" != "$fqdn" ]; then
  OPTS="$OPTS --hostname $fqdn"
fi
if [ -n "$realm" ]; then
  OPTS="$OPTS --realm=$realm"
fi

ipa-client-install $OPTS
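
Once the update has run, it’s worth confirming on each node that enrolment actually happened. A quick check, assuming the standard locations: the host keytab should contain principals for the node’s FQDN, and certmonger should be tracking its certificate requests (ideally in the MONITORING state):

# On an overcloud node, after the stack update:
# list the principals in the host keytab...
sudo klist -kt /etc/krb5.keytab
# ...and summarise the certificate requests certmonger is tracking
sudo getcert list | grep -E 'Request ID|status|subject'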

Necessary hieradata overrides

You will require the following overrides set in your hieradata. This only impacts the Pacemaker nodes, so in this case I have added them to ControllerExtraConfig:

    ControllerExtraConfig:
      pacemaker::resource::bundle::deep_compare: true
      pacemaker::resource::ip::deep_compare: true
      pacemaker::resource::ocf::deep_compare: true
      pacemaker::resource::remote::deep_compare: true

These trigger a deep comparison of Pacemaker resources, which is needed to ensure the pacemaker bundles are re-created when the storage maps are updated with the TLS certificates.

These are set by default from this commit onwards, however.
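
For reference, the overrides can simply be carried in a small environment file passed with -e on the stack update; the filename here is just an example:

# deep-compare.yaml (example name) - include with -e on the stack update
parameter_defaults:
  ControllerExtraConfig:
    pacemaker::resource::bundle::deep_compare: true
    pacemaker::resource::ip::deep_compare: true
    pacemaker::resource::ocf::deep_compare: true
    pacemaker::resource::remote::deep_compare: true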

Necessary upstream patches for deep comparison

The hieradata overrides are needed to trigger deep comparison, but the comparison logic is buggy with certain kinds of resources. With the above settings RabbitMQ will restart successfully, but Galera, Redis and haproxy will not.

This upstream commit needs to be applied on your Pacemaker nodes to ensure the resources are rebuilt and restarted correctly.

To make this work I turned this commit into a plain diff and applied it as part of ExtraConfigPre, with a quick check so that the deployment doesn’t fail outright if the patch doesn’t apply. It’s ugly, it’s failure-prone and I am not fond of this method, but it works as a proof of concept and does result in the bundles being updated correctly.
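
For what it’s worth, the patch step inside the ExtraConfigPre script looks roughly like the sketch below. The patch filename is an assumption (the upstream commit exported as a plain diff), /usr/share/openstack-puppet/modules is where the puppet modules normally live on the overcloud nodes, and the dry run is one way to implement the “quick check” so an already-patched node doesn’t abort the deployment:

# Rough sketch: patch puppet-pacemaker in place with the deep_compare fix.
cd /usr/share/openstack-puppet/modules/pacemaker
if patch -p1 --dry-run --silent < /tmp/puppet-pacemaker-deep-compare.patch; then
  patch -p1 < /tmp/puppet-pacemaker-deep-compare.patch
else
  echo "puppet-pacemaker patch already applied or does not apply cleanly; skipping"
fi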

It would be better if the fix were backported to the Queens branch and made available as an update to the puppet-pacemaker package. In the absence of that, it may be simpler to deploy a patched version of the changed files as Overcloud deployment artifacts.

The ugly

While this process works, the control plane is not in a healthy state during the rollout. Specifically, you will see the following:

  • Non-Pacemaker services will start failing to connect to AMQP, Galera and Redis until they are restarted in Step 3 or 4. The Pacemaker-managed services start listening with TLS from Step 2, and although the paunch-managed containers have their configuration updated in Step 1, they aren’t restarted until after Step 2.
  • Pacemaker bundles are restarted when configuration files are changed (Step 1), even though the certificates those changed configuration files reference are not yet bound into the containers. As a result, your Pacemaker resources will start failing from Step 1 until we get to Step 2 and the resources themselves have their storage maps updated. It’s pretty ugly, but the update will get through it (the sketch after this list shows one way to watch this from a controller).
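
If you want to watch this happening, keeping an eye on the bundle state from one of the controllers during the update makes the failure-and-recovery cycle fairly obvious:

# On a controller, while the stack update runs: watch the bundles flap and
# (eventually) recover once their storage maps carry the certificates.
watch -n 10 "sudo pcs status | grep -A3 -E 'galera-bundle|rabbitmq-bundle|redis-bundle|haproxy-bundle'"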

In short: your control plane is toast during this procedure, but it will come good by the end after everything has been updated and restarted.

I have been experimenting with deliberately bringing the control plane down for the duration of this procedure, specifically everything other than Pacemaker and the associated bundles.

The result

After the rollout completes, here’s what the endpoints look like:

(overcloud) [stack@undercloud overcloud]$ openstack endpoint list -c 'Service Name' -c 'Interface' -c 'URL' 
+--------------+-----------+--------------------------------------------------------------------+
| Service Name | Interface | URL                                                                |
+--------------+-----------+--------------------------------------------------------------------+
| cinderv3     | internal  | https://cloud.internalapi.os1.home.ajg.id.au:8776/v3/%(tenant_id)s |
| neutron      | internal  | https://cloud.internalapi.os1.home.ajg.id.au:9696                  |
| keystone     | public    | https://cloud.os1.home.ajg.id.au:13000                             |
| neutron      | public    | https://cloud.os1.home.ajg.id.au:13696                             |
| nova         | internal  | https://cloud.internalapi.os1.home.ajg.id.au:8774/v2.1             |
| cinder       | internal  | https://cloud.internalapi.os1.home.ajg.id.au:8776/v1/%(tenant_id)s |
| nova         | admin     | https://cloud.internalapi.os1.home.ajg.id.au:8774/v2.1             |
| glance       | internal  | https://cloud.internalapi.os1.home.ajg.id.au:9292                  |
| cinderv3     | public    | https://cloud.os1.home.ajg.id.au:13776/v3/%(tenant_id)s            |
| keystone     | internal  | https://cloud.internalapi.os1.home.ajg.id.au:5000                  |
| placement    | public    | https://cloud.os1.home.ajg.id.au:13778/placement                   |
| glance       | admin     | https://cloud.internalapi.os1.home.ajg.id.au:9292                  |
| cinder       | public    | https://cloud.os1.home.ajg.id.au:13776/v1/%(tenant_id)s            |
| keystone     | admin     | https://cloud.ctlplane.os1.home.ajg.id.au:35357                    |
| cinderv2     | internal  | https://cloud.internalapi.os1.home.ajg.id.au:8776/v2/%(tenant_id)s |
| nova         | public    | https://cloud.os1.home.ajg.id.au:13774/v2.1                        |
| cinderv3     | admin     | https://cloud.internalapi.os1.home.ajg.id.au:8776/v3/%(tenant_id)s |
| cinderv2     | public    | https://cloud.os1.home.ajg.id.au:13776/v2/%(tenant_id)s            |
| cinderv2     | admin     | https://cloud.internalapi.os1.home.ajg.id.au:8776/v2/%(tenant_id)s |
| cinder       | admin     | https://cloud.internalapi.os1.home.ajg.id.au:8776/v1/%(tenant_id)s |
| placement    | internal  | https://cloud.internalapi.os1.home.ajg.id.au:8778/placement        |
| placement    | admin     | https://cloud.internalapi.os1.home.ajg.id.au:8778/placement        |
| glance       | public    | https://cloud.os1.home.ajg.id.au:13292                             |
| neutron      | admin     | https://cloud.internalapi.os1.home.ajg.id.au:9696                  |
+--------------+-----------+--------------------------------------------------------------------+
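
A quick spot check against one of the internal endpoints confirms the services are now presenting certificates issued by the FreeIPA CA (the CA certificate is dropped into /etc/ipa/ca.crt on enrolled hosts):

# From an IPA-enrolled host: verify the Glance internal endpoint serves a
# certificate that validates against the FreeIPA CA.
curl -v --cacert /etc/ipa/ca.crt https://cloud.internalapi.os1.home.ajg.id.au:9292/ 2>&1 | grep -iE 'subject|issuer|verify'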

Unresolved issues

Currently, Nova is complaining with the following exception when attempting to connect to Galera:

2019-02-24 13:16:34.987 18 ERROR nova.context [req-02512700-6ef0-4849-9300-2c938d1fdcd5 4ae46f1ff35d4cbd8dccb7bb9ca2df1e f0a344315e6f45e2ad13f6b94cfd3065 - default default] Error gathering result from cell 00000000-0000-0000-0000-000000000000: CertificateError: hostname u'172.16.2.15' doesn't match either of 'cloud.internalapi.os1.home.ajg.id.au', 'os1-controller-2.internalapi.os1.home.ajg.id.au'
2019-02-24 13:16:34.987 18 ERROR nova.context Traceback (most recent call last):
2019-02-24 13:16:34.987 18 ERROR nova.context   File "/usr/lib/python2.7/site-packages/nova/context.py", line 438, in gather_result
2019-02-24 13:16:34.987 18 ERROR nova.context     result = fn(cctxt, *args, **kwargs)
2019-02-24 13:16:34.987 18 ERROR nova.context   File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper
2019-02-24 13:16:34.987 18 ERROR nova.context     result = fn(cls, context, *args, **kwargs)
2019-02-24 13:16:34.987 18 ERROR nova.context   File "/usr/lib/python2.7/site-packages/nova/objects/instance.py", line 1513, in get_counts
2019-02-24 13:16:34.987 18 ERROR nova.context     return cls._get_counts_in_db(context, project_id, user_id=user_id)
2019-02-24 13:16:34.987 18 ERROR nova.context   File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 254, in wrapped
2019-02-24 13:16:34.987 18 ERROR nova.context     with ctxt_mgr.reader.using(context):
2019-02-24 13:16:34.987 18 ERROR nova.context   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
... snip ...

2019-02-24 13:16:34.987 18 ERROR nova.context   File "/usr/lib64/python2.7/ssl.py", line 267, in match_hostname
2019-02-24 13:16:34.987 18 ERROR nova.context     % (hostname, ', '.join(map(repr, dnsnames))))
2019-02-24 13:16:34.987 18 ERROR nova.context CertificateError: hostname u'172.16.2.15' doesn't match either of 'cloud.internalapi.os1.home.ajg.id.au', 'os1-controller-2.internalapi.os1.home.ajg.id.au'

This doesn’t make sense at first, as the configuration file specifically defines the host to connect to as cloud.internalapi.os1.home.ajg.id.au. Somewhere in there it seems to be resolving the FQDN back to the internal API VIP, then complaining because the VIP is, rightly, not among the names presented by the certificate. The reverse record for that IP does exist. I need to do more digging into why it’s using the IP address instead of the hostname… perhaps something in that stack is resolving the FQDN.
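
One way to see exactly which names the database certificate carries is to look at the certificate Certmonger requested for Galera on a controller. The path below is the location TripleO normally uses for the MySQL certificate, so treat it as an assumption:

# On a controller: show the SANs on the Galera/MySQL certificate.
sudo openssl x509 -in /etc/pki/tls/certs/mysql.crt -noout -text | grep -A1 'Subject Alternative Name'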

Edit: found it. The offending IP is held in the nova_api.cell_mappings table for cell0. It looks like this isn’t updated as part of a TLS Everywhere deploy. One manual update later, with the following statement:

MariaDB [nova_api]> select name, database_connection from cell_mappings;                                
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| name    | database_connection                                                                                                                                                   |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| cell0   | mysql+pymysql://nova:kY3szuuwGvRpCYrCqWh9yqXsz@172.16.2.15/nova_cell0?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo |
| default | mysql+pymysql://nova:kY3szuuwGvRpCYrCqWh9yqXsz@cloud.internalapi.os1.home.ajg.id.au/nova?read_default_group=tripleo&read_default_file=/etc/my.cnf.d/tripleo.cnf       |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
MariaDB [nova_api]> update cell_mappings set database_connection=REPLACE(database_connection, '172.16.2.15','cloud.internalapi.os1.home.ajg.id.au');
Query OK, 1 row affected (0.02 sec)
Rows matched: 2  Changed: 1  Warnings: 0

And the error disappears.
