Building resilient state machines with CloudForms/ManageIQ

State Machines are a powerful feature of CloudForms automation. If you are unfamiliar with the concept, a state machine in the context of CloudForms automation is a series of steps that are executed sequentially by the Automation Engine.

In particular, state machines give us:

  • The ability to retry steps if they fail.
  • The ability to jump between steps by name, or to skip the immediately following step.
  • The ability to execute code on entry to or exit from a step, or if the step returns an error.
  • The ability to store state variables in earlier steps that can be referenced in later steps.
  • Since CloudForms 4.6, the ability to execute Ansible playbooks as steps in a state machine.

State machines are already used heavily for the provisioning workflows that ship with CloudForms out-of-the-box. If you’d like to know more about creating state machines, have a look at Mastering Automation in CloudForms and ManageIQ, available on the customer portal.

State machines are undoubtedly powerful – when they work start to finish.

What happens if a state machine fails before it’s complete?

The problem

For example, let’s assume my state machine provisions an OpenStack tenancy and starts after the user submits a service request. It needs to do the following:

Step 1: Create the new OpenStack project.

Step 2: Create a private network within the project.

Step 3: Create a router that enables the project to reach the outside world.

Step 4: Assign the user as a member of the project.

Step 5: E-mail the customer when the environment is ready to go.

Let’s assume the state machine fails at step 3 – we’ve got errors in automate.log and the user’s request is now showing as “Failed”.

Now what? We now have a half-constructed OpenStack project that:

  • Exists in OpenStack (good!)
  • Has a private network attached (good!)
  • Has no router, and so instances on the private network can’t reach the outside world (not good).

Clearly, this is a problem. To remedy this we could:

  1. Undo steps 1 and 2 manually? That might work in this case, assuming the underlying problem is solved and the state machine failed early enough. But what if unpicking those steps requires a substantial amount of manual, error-prone effort? What if there are a dozen steps prior to the failed one?
  2. Manually complete the remaining steps? Also possible, but what if there are a dozen steps to go? Do we really want to do all of that by hand (the whole point was to automate, after all!)? What if we need to do this for a dozen requests, each of which failed for the same reason?
  3. Re-submit the request, to start the state machine again? Possibly, but will our automation code handle the case where it attempts to create a new project that already exists in the environment?

This is a contrived example but it illustrates a broader issue: we need to build a state machine that is resilient to failure.

Let’s look at some options!

Technique 1: Ansible!

Embedded Ansible has been part of CloudForms since 4.5, and has only got better with the release of 4.6. In particular, CF4.6 enables us to execute playbooks as steps in our state machines, offering the tantalising prospect of mixing Ruby and Ansible.

Use Ruby for what Ruby does well and use Ansible for what Ansible does well. In particular, Ansible does idempotency very well, as long as you use modules that support idempotency (well-written ones will). An idempotent operation is one that you can call multiple times with the same inputs and end up in the same state as if you had called it once: repeat calls make no further changes. That makes it a very useful property for a state machine step that we might want to retry (see below).

In my hypothetical example above, I could use Ruby and make multiple calls to the Fog gem to execute OpenStack API queries. That’s entirely valid and would absolutely work. I would have to build in the idempotency logic myself though, which makes my methods longer, more complex, and introduces more chances for bugs.
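
To give a feel for the idempotency logic I’d have to write myself in Ruby, here’s a minimal sketch of a “look first, create only if missing” step. FakeCloud and ensure_project are hypothetical names made up for illustration; FakeCloud is an in-memory stand-in so the sketch runs on its own, and a real method would make the equivalent OpenStack API calls instead.

```ruby
# Hypothetical in-memory stand-in for a cloud API, used only to make the
# sketch runnable. A real method would call the OpenStack API instead.
class FakeCloud
  attr_reader :projects, :create_calls

  def initialize
    @projects = {}
    @create_calls = 0
  end

  def find_project(name)
    @projects[name]
  end

  # Like many real APIs, creating the same project twice is an error.
  def create_project(name)
    raise "project #{name} already exists" if @projects.key?(name)

    @create_calls += 1
    @projects[name] = { name: name, id: @projects.size + 1 }
  end
end

# Idempotent "ensure" step: look first, create only if missing.
def ensure_project(cloud, name)
  cloud.find_project(name) || cloud.create_project(name)
end

cloud  = FakeCloud.new
first  = ensure_project(cloud, 'customer-a')
second = ensure_project(cloud, 'customer-a') # a retry or a re-submitted request
```

The second call finds the existing project and returns it without calling create_project again, so a retried step or a re-submitted request leaves the environment unchanged. Every step that creates something needs the same treatment, which is where the extra length and complexity comes from.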

Or I could write a playbook with the required steps and use the os_* modules that already ship with Ansible, and call it as one of my states in my state machine (after a guard check, of course). As an added bonus I get a very readable YAML syntax, rather than Ruby code.

No contest!

Technique 2: The Guard Check

Guard checks ensure that everything is ready to go before executing the following steps of the state machine, or before executing a specific step. If the checks don’t pass, the state machine won’t run because it will almost certainly fail if it does.

For example, your guard check could be the very first step of a state machine and it could:

  1. Ensure an API for a remote service (e.g. IPA, ServiceNow, InfoBlox, Single Sign On…) is accessible.
  2. Ensure your credentials are still valid for that API (perhaps e-mail an administrator if they are not!).
  3. Ensure the input from the service dialog, and any other parameters, is valid and what you expect to receive, massaging it if necessary (e.g. trimming whitespace).

The idea behind a guard check is to fail fast and fail early. The sooner we stop our state machine, the fewer things we need to fix in our environment.

Some options to implement a guard check:

  1. Have the very first step of your state machine perform the checks, and do not run the rest of the state machine if they fail.
  2. Perform the check before each step using the On Enter method, and either trigger a step retry or fail if the required conditions aren’t met.
  3. Add logic to your service dialog that does not allow the request to proceed if the user inputs are not valid.
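
As an illustration of the input-checking part of a guard check, here’s a minimal sketch in plain Ruby. The field names (dialog_project_name, dialog_owner_email), the formats and the check_inputs helper are all assumptions made up for this example; substitute whatever your own service dialog provides.

```ruby
# Hypothetical dialog field names and validation rules, for illustration only.
PROJECT_NAME_FORMAT = /\A[a-z][a-z0-9-]{2,30}\z/
EMAIL_FORMAT        = /\A[^@\s]+@[^@\s]+\z/

# Returns [cleaned_inputs, errors]. An empty errors array means the
# guard check passed.
def check_inputs(raw)
  # Massage first: turn every value into a string and trim whitespace.
  cleaned = raw.each_with_object({}) do |(key, value), out|
    out[key] = value.to_s.strip
  end

  errors = []
  errors << 'invalid project name' unless cleaned['dialog_project_name'].to_s =~ PROJECT_NAME_FORMAT
  errors << 'invalid owner e-mail' unless cleaned['dialog_owner_email'].to_s =~ EMAIL_FORMAT
  [cleaned, errors]
end

good_cleaned, good_errors = check_inputs(
  'dialog_project_name' => '  customer-a ',
  'dialog_owner_email'  => 'user@example.com'
)
_bad_cleaned, bad_errors = check_inputs(
  'dialog_project_name' => 'Bad Name!',
  'dialog_owner_email'  => ''
)
```

In a real automate method the raw values would come from $evm.root rather than a literal hash, and a non-empty errors array would be the signal to log each problem and stop the state machine before any later step runs.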

Technique 3: Retries

If the problem is transient, for example an API service is temporarily unavailable, then consider using a retry rather than failing outright. Retry the step after an appropriate delay – if the issue has been resolved, the step will succeed on the next attempt.

If the issue isn’t resolved, what happens next is up to you: you could retry continuously until it succeeds or retry a maximum number of times and then fail.

You can trigger a retry of a step by setting the value of $evm.root['ae_result'] to 'retry' and exiting the method:

$evm.root['ae_result'] = 'retry'
$evm.root['ae_retry_interval'] = '1.minute'

You can check how many retries have been performed by checking the value of $evm.root['ae_state_retries'].

Consider also implementing a back-off function, so that each time you need to retry the same method the retry interval gets longer and longer. Here’s some code that implements a basic exponential retry function, with a minimum retry interval of 60 seconds and a maximum interval of 1 hour:

retries = $evm.root['ae_state_retries'].to_i

# With the exponential retry interval logic below, 20 retries add up to
# just over 9 hours of waiting (assuming the retry counter starts at 0).
# After that we fail this state machine.
max_retries = 20

if retries < max_retries
  # A very basic exponential backoff, clamped to a minimum of
  # 60 seconds and a maximum of 3600 seconds (1 hour).
  retry_interval = (2**retries).clamp(60, 3600)
  $evm.root['ae_retry_interval'] = "#{retry_interval}.seconds"
  $evm.root['ae_result'] = 'retry'
else
  $evm.log(:warn, "State machine step failed after #{retries} retries!")
  exit MIQ_ABORT
end
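
To see how long a back-off schedule like this keeps retrying before it gives up, the intervals can simply be added up. This is a small sketch in plain Ruby, assuming ae_state_retries starts at 0 on the first attempt and using the same rule as above (2**retries, clamped to between 60 and 3600 seconds):

```ruby
max_retries = 20

# One interval per retry, for retry counts 0 through 19.
intervals = (0...max_retries).map { |retries| (2**retries).clamp(60, 3600) }

total_seconds = intervals.sum

puts "first intervals: #{intervals.first(8).inspect}"
puts "total wait: #{total_seconds} seconds (~#{(total_seconds / 3600.0).round(1)} hours)"
```

Under those assumptions the first six intervals sit at the 60 second floor, the next six double from 64 up to 2048 seconds, and the last eight are capped at 3600 seconds. The total comes to 33192 seconds, or roughly 9.2 hours. Changing max_retries or the clamp limits and re-running the sum is a quick way to size the schedule for your own environment.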

 

Conclusion

With some careful thought and architecting we can build state machines that survive failures of their component steps. The key things to remember are:

  1. Never do in Ruby something you can do in Ansible. Use Ruby for what Ruby does well (guard checks, complex branching, skipping state machine steps) and use Ansible for what Ansible does well (a library of excellent modules and great readability). Remember you can execute Embedded Ansible playbooks as State Machine steps from CloudForms 4.6 onwards.
  2. Use guard checks to verify early on (e.g. at the start of the machine) that your access to external systems is still valid. Make service dialogs work for you by verifying user input before submission.
  3. Use retries to retry failed steps, especially if the problem is one that might be transient, such as unavailability of an external service. If you don’t want to keep fruitlessly hitting a remote service, consider a back-off function to gradually increase the delay between retry attempts.

Good luck!
