In the last post on this topic I discussed a technique for separating your configuration from your code in CloudForms’ Automation Engine. The technique relied on creating class overrides in a higher-priority domain, allowing attributes on those overriding classes to be passed down.
It works but it’s…clunky, and it requires a duplicate of every class in a higher domain. Not the most scalable technique!
Below I present two other methods – one that uses $evm.instance_get, and another that loads the config directly onto the root object.
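As a taste of the second approach, here’s a minimal sketch of loading configuration onto the root object. Outside the Automate engine there’s no $evm, so plain hashes stand in for what would really come from $evm.instance_get and for $evm.root itself, and the attribute names (sso_fqdn, dns_server) are illustrative assumptions:

```ruby
# Plain hashes stand in for the config instance (really fetched with
# $evm.instance_get('/Configuration/...')) and the root object ($evm.root).
# Attribute names here are illustrative assumptions, not real settings.
def load_config_onto_root(root, config)
  # Copy each config attribute onto the root object, but never clobber a
  # value a caller has already set on root.
  config.each { |key, value| root[key] = value unless root.key?(key) }
  root
end

config = { 'sso_fqdn' => 'sso.example.com', 'dns_server' => '10.0.0.1' }
root   = { 'dns_server' => '10.0.0.2' } # pretend a caller set this already

load_config_onto_root(root, config)
# root now carries 'sso_fqdn' from config, and keeps its own 'dns_server'
```

Inside the engine the same merge would run against $evm.root, so every downstream method sees the environment’s settings without knowing where they came from.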
Read more “Configuration domains in CloudForms/ManageIQ – part deux”
The CloudForms/ManageIQ automation engine gives us great power to create classes and methods that we can execute in response to various actions (e.g. button presses, events in the environment, REST API calls, etc).
Using Ruby and Ansible we can execute custom code to customise and enhance the CloudForms experience for our customers and users – we can add buttons to interfaces, override provisioning workflows, override approval workflows, and so on.
Normally, we’ll have at least two environments – a Quality Assurance environment, and a Production environment. We don’t want those two to mix, and quite often key settings will be different between the two environments. For example, the FQDNs and credentials used to access remote services (e.g. Single Sign On, corporate DNS, etc) may differ between QA and Production. But the underlying logic in the code remains the same – it’s just environment-specific configuration that changes.
To get to a full Continuous Integration/Continuous Deployment model we need to be able to promote Automate code between environments cleanly. Having to run a find/replace across our Automate code to swap QA settings for Production settings is not clean.
So how do we keep our common automation logic from intermixing with our environment-specific settings and configuration?
The answer is a configuration domain. Read more “Configuration Domains with CloudForms/ManageIQ”
CloudForms ships as an appliance to minimise the deployment and configuration work required. While this removes a substantial amount of complexity – the appliance ships with all the packages and configuration needed to get a working system in a very short time – it isn’t entirely free of human intervention.
At a minimum you will need to:
- Set hostname and network configuration, particularly if you wish to use a static IP address.
- Create a new Virtual Management Database (VMDB) and associated Region, or join an existing one.
- Configure encryption keys, particularly if you are joining an existing region.
- Set up external authentication via IPA, if your deployment method calls for it.
- Start the EVM server processes.
These steps can all be performed using the appliance console that ships with the appliance. Unfortunately, this menu-based interface doesn’t lend itself to automation (unless you want to get your hands dirty with expect).
If you’ve got one or two appliances, that’s not a big impost. But if you’ve got 5? 10? Then we start to look at Ansible and think “I wonder if I could automate this?”
Turns out, you can!
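The core of the idea is that the appliance also ships a non-interactive sibling of the console, appliance_console_cli, which Ansible can drive. A sketch of one task from such a playbook – flag names and values here are illustrative and vary between versions, so check `appliance_console_cli --help` on your appliance, and this is not the full playbook from the post:

```yaml
# Illustrative sketch: drive appliance_console_cli from Ansible.
# Flags and paths are assumptions – verify against your appliance version.
- hosts: cfme_appliances
  become: true
  tasks:
    - name: Create the encryption key and an internal VMDB in region 0
      command: >
        appliance_console_cli --key --internal --region 0
        --dbdisk /dev/vdb --password "{{ vmdb_password }}"
      args:
        creates: /var/www/miq/vmdb/certs/v2_key   # skip if already configured

    - name: Start the EVM server processes
      service:
        name: evmserverd
        state: started
        enabled: true
```

The `creates:` guard makes the database task idempotent, so the play can be re-run safely across a whole inventory of appliances.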
Read more “Automation of CloudForms appliance setup with Ansible”
Interesting! I didn’t see it when I wrote the post last night, but 18 days ago a commit went into ManageIQ that specifically addresses the issue of proportional set size (see here). Notably, they draw the same conclusions I did about increased PSS and the lifetime of the main server process, so it’s nice to have some confirmation I wasn’t off on a tangent!
In a nutshell, ManageIQ will now use the unique set size (USS) to calculate a worker’s memory consumption when deciding whether to kill it. USS (introduced at the same time as PSS – see the Wikipedia article) guarantees that the returned value is physical memory owned exclusively by the process, and will not include shared memory.
Using this new method, workers will no longer be ‘penalised’ for having a large PSS (as occurs under a leaky miqserver process). The calculation will reflect the actual memory usage of the worker, hopefully making worker kills and restarts less frequent.
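For the curious, both metrics can be pulled apart by hand from /proc/&lt;pid&gt;/smaps on Linux: USS is the sum of the Private_Clean and Private_Dirty fields, while PSS is the sum of the Pss fields. A small sketch – this illustrates the metrics themselves, not ManageIQ’s actual implementation:

```ruby
# Derive USS and PSS from the text of /proc/<pid>/smaps (Linux).
# USS = Private_Clean + Private_Dirty across all mappings: memory owned
# solely by the process. PSS additionally includes a proportional share
# of shared pages, which is why a leaky shared parent inflates it.
def memory_from_smaps(smaps_text)
  uss = pss = 0
  smaps_text.each_line do |line|
    case line
    when /\APss:\s+(\d+) kB/                      then pss += $1.to_i
    when /\APrivate_(?:Clean|Dirty):\s+(\d+) kB/  then uss += $1.to_i
    end
  end
  { uss_kb: uss, pss_kb: pss }
end

# Live usage on a Linux box:
# puts memory_from_smaps(File.read("/proc/#{Process.pid}/smaps"))
```

Shared_* fields are deliberately ignored, which is exactly the property the new worker-memory check relies on.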
See also bugzilla 1479356.
At work we support an installation of multiple CloudForms appliances, each customised for a particular set of roles (e.g. web services, user interface, EMS operations, etc). It’s not uncommon to see alerts for “evm_worker_memory_exceeded” in the log files, followed by more log data showing the worker process being terminated and re-created.
The longer an appliance lives, the more frequently this tends to occur. To try and understand why, I took a dig through the ManageIQ (upstream CloudForms) code base to see what’s going on. How can a small worker process possibly chew up 500MB of RAM so quickly?
Below is my theory.
Read more “ManageIQ workers, proportional set size (PSS) and workers out of memory”