Debug and Benchmark Ansible Playbook Performance

Debugging and benchmarking can be heavy topics. Frankly, there are a lot of options when it comes to analyzing playbook performance. My goal will be to make this as simple as possible.

Playbook Debugging and Connection Logging

Ansible allows you to enable debug logging at a number of levels. Most importantly, at both the playbook run level, and the network connection level. Having that info will become invaluable with your future troubleshooting efforts.

From the Ansible CLI, you can set these as environment variables before running a playbook:

export ANSIBLE_DEBUG=true
export ANSIBLE_LOG_PATH=/tmp/ansible-debug.log
ansible-playbook … -v

-v will show increased logging around individual plays/tasks
-vvv will show Ansible playbook execution run logs
-vvvv will show SSH and TCP connection logging

And from within AWX/Tower, you can set the logging/debugging verbosity for each playbook job template:
tower job template verbosity

If standard debug options aren’t enough…if you want to truly see everything that’s happening with Ansible, say no more! You can keep remote log files and enable persistent log messages. And with great power comes great responsibility.

export ANSIBLE_KEEP_REMOTE_FILES=true
export ANSIBLE_PERSISTENT_LOG_MESSAGES=True

Both of these are super insecure. They will give you everything that Ansible does and sees — the deepest view of exactly what’s happening on a remote device. And you’ll expose vars/logs as they’re unencrypted and read. So do be careful!

On the network and non-OS side, I often enable ansible_persistent_log_messages to see netconf responses, system calls, and other such things from my network inventories.

2021-03-01 20:35:23,127 p=26577 u=ec2-user n=ansible | jsonrpc response: {"jsonrpc": "2.0", "id": "9c4d684c-d252-4b6d-b624-dff71b20e0d3", "result": [["vvvv", "loaded netconf plugin default from path /home/ec2-user/ansible/ansible_venv/lib/python3.7/site-packages/ansible/plugins/netconf/default.py for network_os default"], ["log", "network_os is set to default"], ["warning", "Persistent connection logging is enabled for lab1-idf1-acc-sw01. This will log ALL interactions to /home/ec2-user/ansible/ansible_debug.log and WILL NOT redact sensitive configuration like passwords. USE WITH CAUTION!"]]}

And then it comes to a traditional OS, setting ansible_keep_remote_files will allow you to see the equivalent levels of process events, system calls, and whatnot that have happened on the remote system.

Ansible and Network Debugging:
https://docs.ansible.com/ansible/latest/dev_guide/debugging.html
https://docs.ansible.com/ansible/latest/network/user_guide/network_debug_troubleshooting.html

Scale and Performance Testing

Now let’s talk benchmarking! We can easily get ourselves deep into the weeds here. So let’s start simple.

Simply noting Ansible CLI or Tower Job run times is the best place to start. Additionally, we can look at timing for individual tasks. To aid us in the process, there are a number of plugins that can be enabled both in Ansible.

Ansible is built around many types of plugins, one of which is callback plugins. Callback plugins allow you to add some very interesting capabilities to Ansible, such as making your computer read the playbook as it runs. Ansible ships with a number of callback plugins that are ready to use out-of-the-box — you simply have to enable them in your ansible.cfg. Add a comma separated list of callback plugins to callback_whitelist in your ansible.cfg.

The particular callback plugin that will help with performance tuning our playbooks is called profile_tasks. It prints out a detailed breakdown of task execution times, sorted from longest to shortest, as well as a running timer during play execution. Speaking of, timer is another useful callback plugin that prints total execution time, similar to time but with more friendly output.

Ultimately, let’s start with these. Edit your ansible.cfg file to enable these callback plugins:

callback_whitelist = profile_tasks, timer

With the profile_tasks and timer callback plugins enabled, run your playbook again and you’ll see more output. For example, here’s a profile of fact collection tasks on a single Cisco inventory host:

ansible-playbook facts.yml --ask-vault-pass -e "survey_hosts=cisco-ios"

ansible_facts : collect output from ios device ------------ 1.94s
ansible_facts : include cisco-ios tasks ------------------- 0.50s
ansible_facts : set config_lines fact --------------------- 0.26s
ansible_facts : set version fact -------------------------- 0.07s
ansible_facts : set management interface name fact -------- 0.07s
ansible_facts : set model number -------------------------- 0.07s
ansible_facts : set config fact --------------------------- 0.07s

And a profile of `change password` on a host:

ansible-playbook config_localpw.yml -e "survey_hosts=cisco-ios"

config_localpw : Update line passwords --------------------- 4.66s
ansible_facts : collect output from ios device ------------- 5.06s
ansible_facts : include cisco-ios tasks -------------------- 0.51s
config_localpw : Update line passwords --------------------- 0.34s
config_localpw : Update enable and username config lines --- 0.33s
config_localpw : debug ------------------------------------- 0.23s
config_localpw : Update enable and username config lines --- 0.22s
config_localpw : Update terminal server username doorbell -- 0.22s
config_localpw : Update line passwords --------------------- 0.20s
config_localpw : Update terminal server username doorbell -- 0.20s
config_localpw : Update terminal server username doorbell -- 0.19s
config_localpw : debug ------------------------------------- 0.19s
config_localpw : Identify if it has a modem ---------------- 0.14s
config_localpw : set_fact - Modem slot 2 ------------------- 0.11s
config_localpw : set_fact - Modem slot 1 ------------------- 0.10s
config_localpw : Update terminal server username doorbell -- 0.10s
config_localpw : set_fact - Modem slot 3 ------------------- 0.10s
config_localpw : Update line passwords --------------------- 0.10s
config_localpw : Update enable and username config lines --- 0.09s
config_localpw : Update enable and username config lines --- 0.09s

Automation will be unique to every organization, and it’s important to regularly track performance benchmarks as your roles evolves. Beyond the obvious benefit of being able to accurately estimate your automation run times, you can determine where improvements can be made while proactively monitoring for faulty code/logic that will inevitably slip through peer reviews.

Profiling tasks/roles:
https://docs.ansible.com/ansible/latest/plugins/callback/profile_tasks.html
https://docs.ansible.com/ansible/latest/plugins/callback/profile_roles.html

All plugins:
https://docs.ansible.com/ansible/latest/plugins/callback.html

Ansible CLI Process Monitoring

I use dstat when I’m trying to benchmark Ansible CLI performance, and it’s the first place I start when I need to troubleshoot performance.

dstat is an is an extremely versatile replacement for vmstat, iostat and ifstat — it’s an all-in-one stat collection tool. It will gather CPU, process, memory, I/O, network, etc… I use it to log/monitor any and all system stats related to Ansible/Tower.

dstat -tcmslrnpyi --fs --socket --unix --top-oom
   -t  timestamp
   -c total CPU usage
   -m memory usage
   -s swap space
   -l load average
   -r disk I/O
   -n network I/O
   -p processes
   -y linux system stats
   -i interrupts
   --fs filesystem open files and inodes
   --socket network sockets
   --unix unix sockets--top-oom watch for OOM process

Note: In RHEL8+, dstat is supplied by the pcp-system-tools package.

Scaling Ansible and AWX/Tower with Network and Cloud Inventories

This topic is covered more in-depth in my Red Hat Summit talk on Managing 15,000 Network Devices.

Quick primer: Ansible is a CLI orchestration application that is written in Python and that operates over SSH and HTTPS. AWX (downstream, unsupported) and Tower (upstream, supported) are the suite of UI/API, job scheduler, and security gateway functionalities around Ansible.

Ansible and AWX/Tower operate and function somewhat differently when configuring network, cloud, and generic platform endpoints, versus when performing traditional OS management or targeting APIs. The differentiator between Ansible’s connectivity is, quite frankly, OS and applications — things that can run Python — versus everything else that cannot run Python.

Ansible with Operating Systems

When Ansible runs against an OS like Linux and Windows, the remote hosts receive a tarball of python programs/plugins, Operating System, or API commands via SSH or HTTPS. The remote hosts unpack and runs these playbooks, while APIs receive a sequence of URLs. In either case, both types of OS and API configurations returns the results to Ansible/Tower. In the case of OS’ like Linux and Windows, these hosts process their own data and state changes, and then return the results to Ansible/Tower.

As an example with a Linux host, a standard playbook to enable and configure the host logging service would be initiated by Ansible/Tower, and would then run entirely on the remote host. Upon completion, only task results and state changes are sent back to Ansible. With OS automation, Tower orchestrates changes and processes data.

Ansible with Network and Cloud Devices

Network and cloud devices, on the other hand,  don’t perform their own data processing, and are often sending nonstop command output back to Ansible. In this case, all data processing is performed locally on Ansible or AWX/Tower nodes.

Rather than being able to rely on remote devices to do their own work, Ansible handles all data processing as it’s received from network cloud devices. This will have drastic, and potentially catastrophic, implications when running playbooks at scale against network/cloud inventories.

Ansible Networking at Scale — Things to Consider

In the pursuit of scaling Ansible and AWX/Tower to manage network and cloud devices, we must consider a number of factors that will directly impact playbook and job performance:

Frequency/extent of orchestrating/scheduling device changes
With any large inventory, there comes a balancing act between scheduling frequent or large-scale configuration changes, while avoiding physical resource contention. At a high level, this can be as simple as benchmarking job run times with Tower resource loads, and setting job template forks accordingly. This will become critical in future development. More on that later.

Device configuration size
Most network automation roles will be utilizing Ansible Facts derived from inventory vars and device configs. By looking at the raw device config sizes, such as the text output from show run all, we can establish a rough estimate of per-host memory usage during large jobs.

Inventory sizes and devices families, e.g. IOS, NXOS, XR
Depending on overall inventory size, and the likelihood of significant inventory metadata, it’s critical to ensure that inventories are broken into multiple smaller groups — group sizes of 500 or less are preferable, while it’s highly recommended to limit max group sizes to 5,000 or less.

It’s important to note that device types/families perform noticeably faster/slower than others. IOS, for instance, is often 3-4 faster than NXOS.

Making Use of Ansible Facts
Ansible can collect device “facts” — useful variables about remote hosts — that can be used in playbooks. These facts can be cached in Tower, as well. The combination of using network facts and fact caching can allow you to poll existing data rather than parsing real-time commands.

Effectively using facts, and the fact cache, will significantly increase Ansible/Tower job speed, while reducing overall processing loads.

Development methodology
When creating new automation roles, it’s imperative that you establish solid standards and development practices. Ideally, you want to outright avoid potentially significant processing and execution times that plague novice developers.

Start with simple, stepping through your automation workflow task-by-task, and understand the logical progression of tasks/changes.. Ansible is a wonderfully simple tool, but it’s easy to overcomplicate code with faulty or overly-complex logic.

And be careful with numerous role dependencies, and dependency recursion using layer-upon-layer of ’include’ and ’import’. If you’re traversing more than 3-4 levels per role, then it’s time to break out that automation/logic into smaller chunks. Otherwise, a large role running against a large inventory can run OOM simply from attempting to load the same million dependencies per host.

Easier said than done, of course. There’s a lot here, and to some extent, this all comes with time. Write, play, break, and learn!