How Do You Start Your Network Automation Adoption Journey?

A lot of people are lucky to see the inner workings of a single tech giant or Fortune 100 company. Red Hat has the pleasure of working with basically all of them. And I think that’s what I love most about being a consultant and architect at Red Hat — I get to speak to, and work with, so many different people and groups from all of the biggest companies all over the world.

I think it’s endlessly fascinating to hear about what everyone else out there is doing. And nowadays, I talk to all sorts of people about network automation. This is the definition of a dream job!

After years spent building out massive network automation projects with tens of thousands, and even hundreds of thousands, of devices, I can say that device management is easier than ever. Ansible has evolved staggeringly quickly, and in a lot of ways, the actual device configuration is a problem that’s quickly being solved.

More than ever, the question I most often get is simply: how do I get started with a network automation project? Nowadays, the biggest hurdle is often just wrapping your head around the options, the ideas, and the different ways of doing things.

How do you even begin thinking about going from a small lab with a dozen devices… to some gigantic production network with thousands upon thousands of devices that you’re supposed to just… “manage and automate”?

From my perspective as an infrastructure architect, this is one of the best ways that I’ve found for people to begin managing their network in a practical way:

1. You should start your automation journey by gathering facts from everything on your network.

For example, this is my network fact role that I have been using for years. Anytime I come across a new network OS somewhere, I add it to the list:

Anyway, this role will give you parsed configs for everything Cisco/Arista/JunOS, and the raw config (show run all) for every other device it encounters:

Here’s more about how/why I do fact collection first:

2. Once you have fact collection running, then you’re ready to begin Ansible state/config management!

Building playbooks has never been quicker and easier. As of 2.9, Ansible’s network resource modules let us do state management rather than just config management. Identify variables, build templates, and the modules do the rest. This is the easiest way to build and implement backups/restores, too.

The resource modules will determine which commands need to be sent and in which order, whether things need to be removed first, etc… The things that once took a year can be done in days or weeks now…
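To make the "which commands, in which order" idea concrete, here is a minimal Python sketch of a state diff. It is a toy illustration only, not how Ansible's resource modules are implemented: the function name plan_vlan_commands, the VLAN data, and the generated CLI lines are all assumptions made up for this example.

```python
def plan_vlan_commands(current, desired):
    """Return CLI-style commands that move `current` VLAN state to `desired`.

    Both arguments map a VLAN id to its name. Removals are emitted before
    additions and renames, mirroring the "remove things first" ordering idea.
    """
    commands = []
    # VLANs on the device that are absent from the desired state go first.
    for vlan_id in sorted(set(current) - set(desired)):
        commands.append(f"no vlan {vlan_id}")
    # VLANs that are new, or whose name differs, are configured next.
    for vlan_id in sorted(desired):
        if current.get(vlan_id) != desired[vlan_id]:
            commands.append(f"vlan {vlan_id}")
            commands.append(f" name {desired[vlan_id]}")
    return commands


# Hypothetical device state versus the state we want.
current = {10: "users", 20: "voice", 30: "legacy"}
desired = {10: "users", 20: "voip", 40: "guest"}
print(plan_vlan_commands(current, desired))
```

Running the same function with identical current and desired state returns an empty list, which is the idempotent behavior you want from state management: nothing to change means nothing gets sent.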

3. Next, use your network facts to build or enhance your CMDB!

Establishing a CMDB is the prerequisite to doing anything with Ansible long-term. With Tower as the API/UI around Ansible, I prefer pairing it with Elasticsearch [ELK] stacks to create a full-tilt CMDB and search engine combo.

This is all done through Ansible Facts and Tower logging — nothing else required. I gather facts against everything on the network, and I use playbooks to search Elasticsearch for that data from whatever time I’m interested in, so I can compare/diff or retrieve specific configs to be used as backups/restores.

Keep in mind that Tower itself is often not the best place to be doing heavy searching and log/job analysis. In general, we recommend you offload search and analytics to an external service. And at large scale — and certainly at high volume — facts and logging are the gateway to a big data project.
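The compare/diff step mentioned above can be sketched with nothing more than Python's standard difflib module. This assumes you have already retrieved two config snapshots as plain text from wherever you store them; the function name diff_configs and the sample configs are hypothetical, and the retrieval from Elasticsearch is deliberately left out.

```python
import difflib


def diff_configs(old_config, new_config, old_label="before", new_label="after"):
    """Return unified-diff lines describing how a device config changed."""
    old_lines = old_config.splitlines()
    new_lines = new_config.splitlines()
    # lineterm="" keeps the header lines free of trailing newlines,
    # since splitlines() already removed them from the config lines.
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=old_label, tofile=new_label, lineterm=""
        )
    )


# Hypothetical snapshots of the same device from two different days.
yesterday = "hostname core-sw1\nntp server 10.0.0.1\nlogging host 10.0.0.50\n"
today = "hostname core-sw1\nntp server 10.0.0.2\nlogging host 10.0.0.50\n"

for line in diff_configs(yesterday, today, "yesterday", "today"):
    print(line)
```

An empty result means the two snapshots are identical, which makes this handy as a quick "did anything change?" check before deciding whether a restore is needed.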

4. And now that we have all of these basic functions in place, it’s time to begin scale and performance testing. Part of this involves setting up a development and testing framework. The rest of it is purely an exercise in establishing standards that allow people to efficiently learn how to work with, and create content using, Ansible and Git.

Everything I’ve covered so far can be stood up and configured with basic functionality rather quickly. And at the very least, these specific tools and technologies will all scale with us as quickly as we can develop things to use them.

This all takes time to build out to full scale in a large network, but it’s a tried-and-true, practical way to begin your network automation adoption.

This framework — and the fundamental objective of knowing what’s running on your network at any given time — has been implemented with tremendous success in every network infrastructure I’ve worked on.

The day one results are immediate, and the foundation for all of this can be built in the time it takes to do a POC. The fact collection and logging that we’re doing through Tower and ELK both lend themselves well to a quick implementation and gradual scale-up to running against massive inventories.

Scaling Ansible and AWX/Tower with Network and Cloud Inventories

This topic is covered more in-depth in my Red Hat Summit talk on Managing 15,000 Network Devices.

Quick primer: Ansible is a CLI orchestration application that is written in Python and operates over SSH and HTTPS. AWX (upstream, unsupported) and Tower (downstream, supported) are the suite of UI/API, job scheduler, and security gateway functionalities around Ansible.

Ansible and AWX/Tower behave somewhat differently when configuring network, cloud, and generic platform endpoints than when performing traditional OS management or targeting APIs. The differentiator in how Ansible connects is, quite frankly, operating systems and applications (things that can run Python) versus everything else that cannot.

Ansible with Operating Systems

When Ansible runs against an OS like Linux or Windows, the remote hosts receive a tarball of Python programs/plugins or operating system commands via SSH or HTTPS, while APIs receive a sequence of URLs. The remote hosts unpack and run that payload themselves. In either case, the OS or API endpoint returns its results to Ansible/Tower. In the case of OSes like Linux and Windows, these hosts process their own data and state changes before returning those results.

As an example with a Linux host, a standard playbook to enable and configure the host logging service would be initiated by Ansible/Tower, and would then run entirely on the remote host. Upon completion, only task results and state changes are sent back to Ansible. With OS automation, Tower orchestrates the changes, and the remote hosts process the data.

Ansible with Network and Cloud Devices

Network and cloud devices, on the other hand, don’t perform their own data processing, and are often sending nonstop command output back to Ansible. In this case, all data processing is performed locally on Ansible or AWX/Tower nodes.

Rather than being able to rely on remote devices to do their own work, Ansible handles all data processing as it’s received from network and cloud devices. This will have drastic, and potentially catastrophic, implications when running playbooks at scale against network/cloud inventories.

Ansible Networking at Scale — Things to Consider

In the pursuit of scaling Ansible and AWX/Tower to manage network and cloud devices, we must consider a number of factors that will directly impact playbook and job performance:

Frequency/extent of orchestrating/scheduling device changes
With any large inventory, there comes a balancing act between scheduling frequent or large-scale configuration changes and avoiding physical resource contention. At a high level, this can be as simple as benchmarking job run times against Tower resource loads, and setting job template forks accordingly. This will become critical in future development. More on that later.

Device configuration size
Most network automation roles will be utilizing Ansible Facts derived from inventory vars and device configs. By looking at the raw device config sizes, such as the text output from show run all, we can establish a rough estimate of per-host memory usage during large jobs.
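As a back-of-the-envelope illustration of that estimate, here is a small Python sketch. Everything in it is an assumption: the function name estimate_job_memory_mb is made up, and the overhead_factor multiplier is a placeholder that you would need to replace with a number measured on your own Ansible/Tower nodes. It only shows the shape of the arithmetic, taking the largest configs as the worst case for however many hosts run in parallel.

```python
def estimate_job_memory_mb(config_sizes_bytes, forks, overhead_factor=10.0):
    """Roughly estimate peak job memory in MiB.

    config_sizes_bytes: raw config size per host (e.g. length of show run all).
    forks: how many hosts are processed at the same time.
    overhead_factor: ASSUMED multiplier for parsing and fact storage overhead;
        this default is a placeholder, not a measured value.
    """
    if forks < 1:
        raise ValueError("forks must be at least 1")
    # Worst case: the largest configs all happen to be in flight together.
    largest = sorted(config_sizes_bytes, reverse=True)[:forks]
    return sum(largest) * overhead_factor / (1024 * 1024)


# Hypothetical raw config sizes, in bytes, for four devices.
config_sizes = [350_000, 1_200_000, 800_000, 2_500_000]
print(round(estimate_job_memory_mb(config_sizes, forks=2), 1), "MiB (rough)")
```

The useful part is less the number itself and more the habit: measure a few real jobs, adjust the multiplier until the estimate matches what you observe, and then use it to sanity-check fork settings before a large run.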

Inventory sizes and device families, e.g. IOS, NXOS, XR
Depending on overall inventory size, and the likelihood of significant inventory metadata, it’s critical to ensure that inventories are broken into multiple smaller groups — group sizes of 500 or less are preferable, while it’s highly recommended to limit max group sizes to 5,000 or less.

It’s important to note that some device types/families perform noticeably faster or slower than others. IOS, for instance, is often 3-4 times faster than NXOS.
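Splitting a big inventory into smaller groups by device family can be sketched in a few lines of Python. This is just one possible approach under stated assumptions: the function name build_groups, the group naming scheme (family plus a counter), and the sample hostnames are all hypothetical, and the 500-host default simply reflects the preferred group size mentioned above.

```python
from collections import defaultdict


def build_groups(hosts, max_size=500):
    """Split (hostname, family) pairs into per-family groups of at most max_size."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    by_family = defaultdict(list)
    for hostname, family in hosts:
        by_family[family].append(hostname)

    groups = {}
    for family in sorted(by_family):
        members = sorted(by_family[family])
        # Slice each family into consecutive chunks and number them from 001.
        for index, start in enumerate(range(0, len(members), max_size), start=1):
            groups[f"{family}_{index:03d}"] = members[start:start + max_size]
    return groups


# Hypothetical inventory: 1,200 IOS devices and 300 NXOS devices.
hosts = [(f"ios-{i:04d}", "ios") for i in range(1200)]
hosts += [(f"nxos-{i:04d}", "nxos") for i in range(300)]

for name, members in build_groups(hosts).items():
    print(name, len(members))
```

Keeping families in separate groups also makes it easier to tune forks and schedules per family, since the faster and slower platforms no longer share a job.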

Making Use of Ansible Facts
Ansible can collect device “facts” — useful variables about remote hosts — that can be used in playbooks. These facts can be cached in Tower, as well. The combination of using network facts and fact caching can allow you to poll existing data rather than parsing real-time commands.

Effectively using facts, and the fact cache, will significantly increase Ansible/Tower job speed, while reducing overall processing loads.
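To show why cached facts save so much work, here is a tiny Python sketch of the general idea: serve stored facts while they are fresh, and only go back to the device when they expire. It is a conceptual illustration, not how Tower's fact cache is implemented. The FactCache class, the slow_collect stand-in, and the one-hour timeout are all assumptions for this example.

```python
import time


class FactCache:
    """Serve cached facts per host until they are older than ttl_seconds."""

    def __init__(self, collect, ttl_seconds=3600, clock=time.monotonic):
        self._collect = collect      # function that really polls a device
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # host -> (timestamp, facts)

    def get(self, host):
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # fresh enough: no device round trip
        facts = self._collect(host)  # expired or missing: poll the device
        self._entries[host] = (now, facts)
        return facts


poll_count = 0


def slow_collect(host):
    """Hypothetical stand-in for running and parsing show commands."""
    global poll_count
    poll_count += 1
    return {"hostname": host, "os": "ios"}


cache = FactCache(slow_collect)
cache.get("core-sw1")
cache.get("core-sw1")
print("device polls:", poll_count)
```

Two lookups, one device poll. Multiply that saving across thousands of hosts and several roles per job, and the reduction in processing load adds up fast.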

Development methodology
When creating new automation roles, it’s imperative that you establish solid standards and development practices. Ideally, you want to outright avoid potentially significant processing and execution times that plague novice developers.

Start simple, stepping through your automation workflow task-by-task, and understand the logical progression of tasks/changes. Ansible is a wonderfully simple tool, but it’s easy to overcomplicate code with faulty or overly-complex logic.

And be careful with numerous role dependencies, and dependency recursion using layer-upon-layer of ’include’ and ’import’. If you’re traversing more than 3-4 levels per role, then it’s time to break out that automation/logic into smaller chunks. Otherwise, a large role running against a large inventory can run OOM simply from attempting to load the same million dependencies per host.
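One way to keep an eye on nesting is to measure it. The Python sketch below assumes you have already extracted a map of which role pulls in which others (how you build that map from your roles is not shown); the function name dependency_depth and the sample role names are hypothetical. It reports the deepest chain and refuses to loop forever on a circular dependency.

```python
def dependency_depth(role, dependencies, _path=()):
    """Return the length of the deepest include/import chain starting at role.

    dependencies maps a role name to the list of roles it pulls in.
    The starting role itself counts as level 1.
    """
    if role in _path:
        chain = " -> ".join(_path + (role,))
        raise ValueError(f"dependency cycle: {chain}")
    children = dependencies.get(role, [])
    if not children:
        return 1
    return 1 + max(
        dependency_depth(child, dependencies, _path + (role,))
        for child in children
    )


# Hypothetical role layout with one long chain under "network".
deps = {
    "site": ["base", "network"],
    "network": ["facts"],
    "facts": ["parsers"],
    "parsers": ["filters"],
}

depth = dependency_depth("site", deps)
if depth > 4:
    print(f"site nests {depth} levels deep; consider splitting it up")
```

A check like this fits nicely into a pre-merge test, so deep chains get caught in review rather than in the middle of a job against a large inventory.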

Easier said than done, of course. There’s a lot here, and to some extent, this all comes with time. Write, play, break, and learn!