The Tale of the Network Automator: From Knight Errant to Engineer by James Kelly

Podcast on YouTube

Perchance the temptation of technology and for certain the pangs of toil have persuaded some networkers to stray from the path of the CLI and *IE certification race.

Swinging APIs, they’ve heard, is the promised land, and wielding shiny tools like Ansible can not only confirm their chivalry but also chase away any lament of rote configuration changes and misfortunes of fat fingers.

“Tonight, Sancho, we drink from the cup of automation. And we will be safely protected from any insurgent application issues rising to the heights of our virtuous network.”

Chapter 1: Initiation

And so once upon a time, after the ecstasy of donning the automation helmet and downloading a few horses, networkers looking for a rite of passage into the world of automation looked bravely in the mirror and said to themselves, “what is there to do that’s dangerous around here?”

The way many networkers made haste to an automation quest was ostensibly with all the strategy of the ad hoc adventures of a knight errant, looking to prove their worth.

These trailblazers were defined by the natural signposts in their workflows, by vendor toolkits, and by tools imported from the faraway land of sysadmin. And so, often under the tutelage of nothing but their own trial and error, the networker knights-errant learned to conquer a couple of the workflows that they were presented with, slaying manual steps one by one.

Chapter 2: Calamity

Shortly thereafter, while riding proudly toward a well-deserved weekend, the knights were so enchanted with their new weaponry and so pleased with themselves (their heads, swollen with victories, may have tightened their helmets) that their attention and vision were impaired. And it was while fantasizing of their next rout that the slings and arrows started flying.

The actual assault is not important: a sudden unreachability, an impetuous line-of-business change request, a cloud VPC meltdown. It was as unexpected as it was inevitable. For some of the knights the affliction would uncover a weakness in armor. For others, it may have been the sting of the automation sword that was erratically swung and cut the wrong way, this time automating and thus swiftly amplifying an accident.

Chapter 3: Dedication

For some beginners, the setback of an automation blunder was so vexing that they turned back to peasant life. Many, however, after finding their feet again and after some commiseration with knight companions, found the strength to press on, carrying the lesson of their lapse forward so as not to repeat that unforgettable venomous mistake.

In knighthood, the networkers aspired to be so valiant and so adroit that they got the notion that if they’re doing it right, they won’t have difficulties. But difficulties, it turns out, are precisely the best circumstances for learning.

Ergo, the knight automators that persisted and persevered learned to be magnanimous with themselves first and foremost. To uphold future reliability, they resolved to automate around the woes that wounded them. In this way, an indiscretion, once defeated with automation, would never weasel its way back to a second sour encounter.

Chapter 4: Trials

Again and again in the hardship of automating their accomplishments, the networker knights became more astute and attuned to their environment. They began forging their own armaments and virtual squires to do their bidding. They also became less foolish and blunt on their escapades, taking what was useful from other toolsmiths and intrepid automators with software engineering skills.

In their training, the knights’ tenacity grew used to absorbing the lessons of failure, but they grew tired of the disrepute and thorns of trial and error.

As fortune would have it, one day they encountered software sages that spoke of staged replica battles. “But what is the use of repeating the past if we have integrated its teachings?” the knights inquired. “Not the past,” explained a sage. “We stage in preparation for any future change and crusade, with many possibilities accounted for and tested.” Testing and practice ahead of affronting a matter: it sounded ingenious, and so the knights too started building training grounds to rehearse their affairs.

Through testing, stressing and staging, and then automating and strengthening that preparation work too, the knights became known for their fastidious preparation and their resulting dependability in conquering new projects and changes in production, now with less havoc than was sorely familiar from the past. Soon they became so mighty that their automation could take on more incessant action and circumstances of increasing variety.

Chapter 5: Consummation

Becoming even more zealous about automating trials not to error, the knights decided to consummate their erudition and experience by taking new identities as knight engineers.

See, each knight’s struggles and the many stories shared at the inns gradually made them less wandering and more disciplined and deliberate about their path. Their practice became more ritual and rigorous, and their automating more rewarding. Through their evolving wisdom, they had transformed from errant to engineer, worthy of an order or iron ring.

In due course, through tournaments and their own track record, the knight engineers decided they could not be passionate, proud engineers unless they could also measure and show their success. After all, they vowed reliability to others. And so, they decided to build public displays for every pledge of reliability that would maintain accurate assessments of their results. Constructed with even more automation, these displays objectively told of the current conduct and past performance of the systems in their lands.

Ultimately, so well-known these reliability signs of service became, that the ardent automation engineers are today called network reliability engineers.

Coda: The Canons of NRE

There are two discernable standards to which the knights held.

First, reliability was their utmost value. They determined that, at the base of the hierarchy of needs for their service, if the network was not measurably reliable in the ways they promised, nothing else mattered. Instead of trading off reliability against other values like agility, efficiency and so on, they found these boons to be incidental because they first solved for reliability with automation.

Second, engineering is the best road to take as an automator. The processes and skills of software engineering guide far better than getting consumed with happened-upon technology, tools and APIs. Networking workflows are the battleground in which to practice automating and a brilliant place to start ad hoc, but ultimately the rigors of software engineering will provide strategy and structure for the tactics and tools.

For more on the culture, skills, processes, behaviors and common technologies of network reliability engineer (NRE) roles and teams, look to the famous older cousins: site reliability engineers (SRE). Check out the free online SRE book and my past blogs.

Dedication

Thanks to the NREs, SREs and network automator forebears for your inspiration, enlightenment and advice in putting together the 5-step journey to automated NetOps. With that guide, may the path of forthcoming automators be smoother and more straightforward than your own story.

Cliff Notes Translation

For those unfamiliar with the famous humor and satire of the book Don Quixote, the above story and style may require some explanation. However, I suspect that those, even the slightest bit attentive to the network automation space, can see the parallel between this metaphor and real life.

Here are the Cliff Notes of the stereotypical story:

Chapter 1: Not looking before they leap into automation, many people progress in a rather ad hoc manner, turning tribal knowledge and apparent workflows into an opportunity to learn while aggregating manual tasks. This is mostly governed by the tools they know or those put in front of them, which may or may not be the best tools for the jobs at hand.

Chapter 2: Many people get stung by automation in one way or another. Especially with config management automation, one small issue could easily get propagated to a massive blast radius. Change workflows aren’t the safest place to learn, but they are the most infamous for some reason. Automating directly in production when learning and without testing is also a catastrophe waiting to happen.

Chapter 3: Continuous improvement and learning are always the foundation of good automation. People learn that small changes and sometimes small mistakes are where the lessons lie to help discover what to automate next. This can be taken all the way to NRE concepts like chaos engineering, where failure is induced on purpose.

Chapter 4: The rigors of software engineering like test-driven engineering, continuous integration and delivery (CICD) and automated deployment are critical to success and to moving quickly yet safely, instead of trading off reliability for velocity. Automation is learned in pre-production environments like labs or virtual labs. Over time, engineers also create well-tested in-production continuous response.

Chapter 5: Instead of relying on only general monitoring tools, NREs, like SREs, engineer what matters most: service-level indicators to objectively measure success of higher-order service goals and promises. With the help of data, they manage their way to truly measured continuous improvement.

image credit blitzmaerker/pixabay

Introducing NRE Labs by James Kelly

“The 80s called and they want their CLI back.” Until recently, that was basically the beckoning call of the network automation and programmability plot. Like a broken record, the replayed attention on tools and APIs talked of new NetOps technology, but it didn’t paint a picture of the promised land. It didn’t provide a map and it didn’t prepare anyone professionally.

Today, as Network Reliability Engineer (NRE) roles pave the way from CLIs to SLIs and other higher-order SRE-inspired methods and metrics, the picture of automated NetOps is crystallizing. Juniper Networks has elaborated the 5-step map of the journey. And now, with the launch of NRE Labs, Juniper is introducing the virtual training camp to support the trip to the promised land. It’s open (open source and open for use), and it’s by and for network engineers.

People are the greatest asset or predicament to any transformation like automating NetOps, so the focus here is on democratizing NRE learning for all.

An Automation Dojo in the Browser

For most network engineers, automating has either been an ad hoc adventure, or the barrier to entry has stopped engineers from starting. And without the right roots, asking for automation from network engineers is like asking for pears from a pine tree. Hands-on time is needed to form the software engineering skills and experience needed to architect and automate.

Providing a cheap, quick and easy solution, NRE Labs is free and easily accessible. It includes live terminal access to one’s own network devices and Linux systems, right in the browser. Each learning lab topic is pieced apart into lessons that are, in turn, composed of short lab steps that each take only a couple of minutes. This is important because it eliminates those overwhelming obstacles to getting started with hands-on learning.

  • It removes the risk of learning by trial and error in production

  • It doesn’t require a sign-up and doesn’t have a classroom or a teacher

  • It doesn't require a long download or physical or virtual lab setup times

  • It doesn’t demand expertise or painful piecing together of all the background infrastructure

  • It demands zero prerequisite knowledge of tools and programming

Without any major impediments, anybody can jump right in and click through a few lessons right in their web browser.

The main NRE Labs runtime sponsored by Juniper Networks is available at: https://labs.networkreliability.engineering

Learn by Doing

Education that isn’t applied is quickly forgotten. That’s why the topics and lessons in NRE Labs are organized into real-life NetOps contexts and workflows. NRE Labs is also built with useful systems and tools that are widely available and applicable to network engineers today.

NRE Labs was created to be accessible to anyone, so it starts from scratch with a learning topic for basics, but users can move around lessons as they see fit. Other topics include troubleshooting, configuration, testing and verification. In the future, additional lessons will take these topics deeper and additional topics will take learning broader. This includes covering the spectrum of NetOps workflows; structuring automation and infrastructure as code with gitOps rigor; evolving troubleshooting into proactive testing; orchestrating testing, delivery and deployment with a pipeline; and using telemetry and analytics for monitoring and measuring service-level indicators and automating network remediation and regulation.

Open for Contributions

NRE Labs needs to be open to democratize learning across different networking environments, but it’s also open so that it can keep pace with, and scale to, the many innovators who are automating their networks and have something to share.

The NRE Labs back-end infrastructure and front-end lessons are all open source as the NRE Learning Antidote project and code repository on GitHub. The project’s documentation explores some background on the Kubernetes-based Antidote infrastructure and explains how to run a standalone instance of NRE Labs to develop, test and contribute improvements and lessons.

For the express documentation, Juniper has open sourced the NRE Labs project summary as a poster.

While NRE Labs is live today, it’s in tech preview and we’re looking for quick iterative progress over delayed perfection. We welcome contributions! Other than curriculum development, some of the back-end work revolves around hardening and scaling the infrastructure and shortening start times of network nodes. We’d also greatly appreciate front-end work such as mobile optimizations. Suggestions are welcome and are already flowing in. Use the project’s issues tracker on GitHub to find work or share feedback.

To be Continued

Notice that NRE Labs is learning created by and for network engineers and free of external marketing. You can use it anonymously, but we hope you’ll tell us what you think.

Beyond learning new NRE and automation skills, NRE Labs aims to spark new, open conversations to listen to or chime in on. Follow @nrelabs for updates on new content, improvements and more. We also created an #nre_labs channel in the Slack spaces for the Juniper Engineering Network (EngNet) (Join here) and the vendor-neutral space run by Network to Code (Join here).

Please spread this day-one news and follow the blogs as lessons continue to roll out.

See you in the community,

@mierdin, @cloudtoad and @jameskellynet from the NRE Learning Team

This blog was originally published at https://forums.juniper.net/t5/Enterprise-Cloud-and/Introducing-NRE-Labs/ba-p/381850

5 Steps to Automated NetOps by James Kelly

Podcast on YouTube

In Juniper Networks’ anthology of 5-step frameworks, we take a different turn. Instead of focusing on a network domain vertical like the 5-steps for data center, campus, WAN and branch, we are focused horizontally across all domains on network automation. This 5-step can apply to any place in the network, and be overlaid like a transparency, for example, over the data center 5-step.

Not 5 Steps to Network Automation

Sometimes you climb the ladder only to find it's standing against the wrong wall. 

In the pursuit of network automation, a multi-decade long affair, the narrative and advancements mostly revolved around programmability which gave way to NFV and SDN. Despite those developments, network automation seems to have ricocheted back to center. We’ve realized that the average NetOps job has practically been sealed in a time capsule compared to the evolution of software engineering and related DevOps and SRE movements.

That’s not to say network automation hasn’t gone anywhere, but progress has largely been technological in the inner-workings of products: we have slightly more autonomous systems; we have abstracted and elevated systems across more network surface area; and we’ve created more APIs, making systems more automatable. Alas, all this one-sided network automation does not an automated network make.

What has failed to change? The forlorn customer and NetOps opportunity for automation.

The handoff from vendor to customer is still, on average, very siloed and impetuous. NetOps catches what comes over the proverbial Dev-Ops wall and then has to run it. Then starts the same old crucible of some inaugural architecting, some less-agreeable administration and then hapless eons of daily toil and troubleshooting, trying to uphold availability. And we cannot forget our “friend,” IT gravity: pulling down issue triaging and blame fastest to the lowest common denominator, the network.

In brooding over that experience, surely NetOps itself is where the emphasis on automation is needed most, to evolve from automatable to automated. And the metamorphosis process cannot consider automation in the vacuum of technology alone, but rather must pay particular attention to ameliorating people and processes.

Where to Begin? 

Transforming people and process, it turns out, is hard, but luckily there are bright spots to replicate. The DevOps movement copied the lessons of the manufacturing industry to change the way software engineering was done, and now the most successful NetOps teams are essentially copying DevOps.

It also turns out that network engineers don’t fancy being called developers because developers and associated app teams are often the ones dropping the headaches, falling with that IT gravity, down upon network engineers’ heads. While most don’t mind the term DevOps or DevNetOps, implying “developer” may induce ire and make network engineers want to duck for cover. Moreover, DevOps is a fairly amorphous set of principles, so the leading NetOps teams have drawn inspiration from site reliability engineering (SRE), a prescriptive implementation of DevOps, and dubbed their transformational job network reliability engineering (NRE).

This 5-step framework to automating NetOps is a journey to a more self-driving network, but most of all, a journey of engineering reliability and simplicity. The journey stars upskilled network reliability engineers capable of some coding and wielding the tools of automation to manage the service-level goals and indicators of reliability.

Think of the framework as a map. As you orient yourself and direct your path, you’ll see progress is seldom a straight line, and it won’t begin in the same place for everyone. In all likelihood, most networkers are at Step 1, manual ops, riding the pine in the automation game and gingerly operating their networks by the ITIL book. But we’re convinced that engineers are dedicated lifelong learners and their stagnation at Step 1 is not so much from hesitation, but rather because they’re busy firefighting and the network automation narrative has not addressed them directly until the rise of NRE.

The importance of taking the first step cannot be overstated, yet it has also historically been daunting and difficult for engineers without a software engineering background. This is why Juniper has just launched EngNet, NRE Labs, ATOM, free trials, hosted trials, labs, training and services to ease the first small steps to automating. 

Reaching for Step 2, once you scientifically dissect some NetOps workflows and then re-engineer what were manual tasks with some coding and tools, it’s a virtual gateway and virtuous cycle to more automating. Finding, sharing and using these tools, you also buy yourself more time to automate, partaking in less toil.

STEPS IN DETAIL

Step 1 - Manual Ops

Manual ops are actually very useful for teaching how things work and fit together, but for tasks that are arduous, lengthy and especially repetitive, network engineers need to begin to document their tribal knowledge and workflows and assess the ROI of automating them.

To move to Step 2:

  • Adopt an automator’s mindset. Be a builder and a technologist, not a technician

  • Take documented workflows and automate them. At this stage it can be any ad hoc workflow to cut one’s teeth coding and using new tools for speed, scale and consistency

  • In addition to using the CLI documentation, explore the API documentation for systems

  • Find tools that already exist and dissect them. And build those that are customized and contextual to NetOps workflows.

  • Realize the value of abstractions and SDN so that the re-creation of automation at the box-to-box or lower levels does not have to occur unnecessarily where proven systems exist. Automate on top of them.

Step 2 - Automate Workflows

In Step 2, you take documented workflows or their pseudocode and start automating small wins. The biggest payoff is in repetitive troubleshooting workflows, which are in fact an early form of testing and verification that will be useful in Step 3. Read-only troubleshooting workflows are a safer bet than re-configuration, re-deploy or read-write workflows. Automating changes during maintenance windows mitigates risk, but ultimately maintenance windows are an IT anti-pattern to avoid, and changes are best handled with the reliability of a pipeline introduced in later steps.
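As a small illustration of a read-only troubleshooting workflow, here is a minimal Python sketch. It assumes a made-up, three-column interface status table (the `SAMPLE_OUTPUT` text and the `find_unexpectedly_down` helper are hypothetical, not any vendor's real output or API) and only reads and reports, changing nothing on the device.

```python
from typing import List

# Hypothetical sample of interface status output; real device output and
# column layout will differ by vendor and platform.
SAMPLE_OUTPUT = """\
Interface   Admin  Link
ge-0/0/0    up     up
ge-0/0/1    up     down
ge-0/0/2    down   down
"""


def find_unexpectedly_down(output: str) -> List[str]:
    """Return interfaces that are admin-up but link-down."""
    problems = []
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        name, admin, link = fields
        if admin == "up" and link == "down":
            problems.append(name)
    return problems


if __name__ == "__main__":
    for name in find_unexpectedly_down(SAMPLE_OUTPUT):
        print(f"check cabling or optics on {name}")
```

Because the check is read-only, it is a low-risk first win, and the same function can later be reused as a verification test in Step 3.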

To move to Step 3:

  • Progress beyond ad hoc automating. Begin to practice as-code and “GitOps” developer-like behaviors. Code means codifying, not necessarily programming. Use SCM workflows and a versioned source of truth for all artifacts, configurations and creations.

  • Treat configuration as declarative and codified, not distributed and perpetually drifting. Review its changes, just as you review programmed automated workflows.

  • Begin to think proactively of how to eliminate mistakes and manual triggers with both testing and sensors.

  • Take the “then that” tasks that were automated and aggregated in Step 2 but are still manually triggered, and start triggering them automatically. Thus, begin automating the “if this” to trigger the “then that.”

  • Use APIs and data from systems like Juniper AppFormix or other telemetry collectors and analytics systems in: 1. observability and decision making, moving to NRE service-level indicator tooling; 2. proactive testing instead of relying solely on reactive troubleshooting; and 3. automating “if this” sensors.
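To make the idea of a versioned source of truth versus drifting configuration a little more concrete, here is a minimal Python sketch using the standard library's `difflib`. The two configuration snippets and the `config_drift` helper are illustrative assumptions; in practice the intended text would come from your repository and the running text from the device.

```python
import difflib
from typing import List

# Illustrative snippets only: "intended" stands in for the reviewed,
# versioned source of truth, "running" for what a device reports.
INTENDED = "set system host-name edge1\nset system ntp server 192.0.2.10\n"
RUNNING = "set system host-name edge1\nset system ntp server 192.0.2.99\n"


def config_drift(intended: str, running: str) -> List[str]:
    """Return unified-diff lines; an empty list means no drift."""
    return list(
        difflib.unified_diff(
            intended.splitlines(),
            running.splitlines(),
            fromfile="intended (source of truth)",
            tofile="running (device)",
            lineterm="",
        )
    )


if __name__ == "__main__":
    drift = config_drift(INTENDED, RUNNING)
    print("\n".join(drift) if drift else "no drift")
```

A non-empty diff is exactly the kind of signal that can later feed an "if this" sensor instead of waiting for someone to notice the drift by hand.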

Step 3 - Automated Triggers and Networks as Code

Beyond provisioning, scripting and programming languages, at Step 3 you’re learning GitOps, version control and code reviewing. You’re embracing infrastructure as code and thinking about automating troubleshooting as testing and proactive verification. Test-driven network automation is inspired by test-driven development (TDD). It’s not sufficient to simply run scripts and fix problems later; instead, you must build holistic tests that protect from failures.

Beyond proactive tests, we can be proactive about triggering some automated actions where event-driven frameworks will help. And proactive triggering requires building or using sensors. Sensors are sometimes based on telemetry and analytics systems that are also useful for providing or building service-level indicators.
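The "if this, then that" pattern can be sketched in a few lines of Python. This is only a toy under stated assumptions: the `Rule` class, the `evaluate` function, the metric name `input_errors_per_min` and the threshold of 100 are all hypothetical, and a real event-driven framework would add queuing, retries and audit logging.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical event shape: metric name -> latest value from a sensor.
Event = Dict[str, float]


@dataclass
class Rule:
    name: str
    if_this: Callable[[Event], bool]   # the sensor condition
    then_that: Callable[[Event], str]  # the automated action


def evaluate(rules: List[Rule], event: Event) -> List[str]:
    """Run the action of every rule whose condition matches the event."""
    return [rule.then_that(event) for rule in rules if rule.if_this(event)]


RULES = [
    Rule(
        name="high-error-rate",
        # The threshold is an arbitrary example value.
        if_this=lambda e: e.get("input_errors_per_min", 0) > 100,
        # A real action would call a Step-2 automated workflow; here it
        # only describes what would be done.
        then_that=lambda e: "open ticket and collect interface diagnostics",
    ),
]

if __name__ == "__main__":
    print(evaluate(RULES, {"input_errors_per_min": 250.0}))
```

Keeping the condition and the action as separate callables mirrors the text: the Step-2 workflow stays the "then that," and only the trigger is new.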

To move to Step 4:

  • Adopt a QA and testing mindset in making all changes, automating not only consistency, but accuracy as well.

  • Testing processes are inserted in between “as-code” and deployment on a pipeline. Congruent to software engineers using a DevOps pipeline, we could optionally call this a DevNetOps pipeline or networking CICD pipeline, like Juniper’s NITA framework.

  • Move toward expediting more frequent deployments without maintenance-window woes because of higher confidence in automated change testing.

Step 4 - Continuous Processes and Pipeline

Here, the technology and automation runtime takes on a new axis of pre-production instead of only in-production. Step 4 adds a CICD pipeline for running automated testing.

Continuous integration (CI) allows code changes to be integrated at any time. For example, these could include programming changes or a configuration change. Reliable changes are made possible thanks to automated testing. The automated merging of sometimes concurrent changes into a safely tested main line, together with building the artifacts necessary for deployment, is continuous delivery (CD).
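As one hedged example of the automated testing a CI stage might run on a proposed change, the Python sketch below checks an address plan for overlapping subnets before anything is merged. The plan format (name to CIDR string) and the `overlapping_subnets` and `ci_gate` helpers are assumptions made for illustration; the `ipaddress` module is part of the Python standard library.

```python
import ipaddress
from typing import Dict, List, Tuple


def overlapping_subnets(plan: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs of named subnets in the plan that overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


def ci_gate(plan: Dict[str, str]) -> bool:
    """Pass the pipeline stage only if no subnets overlap."""
    return not overlapping_subnets(plan)


if __name__ == "__main__":
    # Hypothetical proposed change: vlan30 collides with vlan10.
    proposed = {
        "vlan10": "10.0.10.0/24",
        "vlan20": "10.0.20.0/24",
        "vlan30": "10.0.10.128/25",
    }
    print("merge allowed" if ci_gate(proposed) else overlapping_subnets(proposed))
```

In a pipeline, a check like this would run on every proposed change, so a failing result blocks the merge rather than surfacing as an outage later.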

Automating the deployment itself is also wise (here’s a customer example) and reaches toward continuous deployment (also CD). And even actual continuous deployment still involves manual judgements. Truly deploying any time, especially following the immutable infrastructure pattern, can cause controlled, isolated outages that require architecting and automating around the outage to preserve availability and not drop traffic. In microservice-crafted software, deployment patterns like blue-green, canary or rolling upgrades are more readily possible, but networks are not traditionally designed and architected for such things, although today some SDN systems are, and redundant or sliced hardware systems are closer to enabling it. 
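The canary and rolling patterns mentioned above can be sketched as a loop that deploys in small batches and halts as soon as a batch looks unhealthy. This is a simplified Python illustration: `rolling_deploy`, `apply_change` and `is_healthy` are hypothetical names, and real deployments would also need rollback and traffic draining, which are omitted here.

```python
from typing import Callable, List, Sequence


def rolling_deploy(
    devices: Sequence[str],
    batch_size: int,
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy batch by batch and stop after the first unhealthy batch.

    Returns the devices that were changed, so the blast radius of a bad
    change is limited to the batches attempted so far.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    changed: List[str] = []
    for start in range(0, len(devices), batch_size):
        batch = devices[start:start + batch_size]
        for device in batch:
            apply_change(device)
            changed.append(device)
        if not all(is_healthy(device) for device in batch):
            break  # halt the rollout; remaining devices are untouched
    return changed


if __name__ == "__main__":
    fleet = ["leaf1", "leaf2", "leaf3", "leaf4"]
    # A batch size of 1 makes the first device a canary.
    done = rolling_deploy(fleet, 1, lambda d: None, lambda d: d != "leaf2")
    print(done)
```

The batch size is the dial between speed and risk: smaller batches limit how far an automated mistake can spread before the health check stops it.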

Beyond CICD, continuous response (CR) extends the event-driven if-this then-that from Step 3. Also CR acts mostly in production instead of in the phases of pre-production and deployment. CR with machine learning, deep learning and big-data analytics can be used for observability and automated regulation of networking systems to achieve optimization and efficiency far closer to the edge of the envelope than what a human would manage. See the Juniper blogs on self-driving networks for more on this concept.

To move to Step 5:

  • Evolve tooling and thinking to NRE / SRE concepts

  • Operations culture, observability and planning are data driven

  • Seek to understand system efficiency, effectiveness and satisfaction to customers (e.g. the up-stack IT organization or an SP’s actual customers)

  • Use chaos engineering and experimentation to understand system boundaries, limits and dependencies to optimize and plan for capacity and what-if scenarios.

Step 5 - Engineering Outcomes

While Step 5 is the last step, it’s still one of continual learning and growth. This allows quick and safe iteration on the network and fine tuning of processes to focus on higher-order reliability metrics and other goals. Don’t stop at network uptime—dive deeper and continuously improve the ability to respond to issues and change.

The network ceases to be the center of the universe in this step and an NRE specialized in networking will manage reliability with error budgets, toil budgets and service-level indicators (SLIs) like any other SRE. They do this for themselves with service-level objectives (SLOs) and for their dependents with service-level agreements (SLAs). They consider their reliability dependencies; for example, they may have reliability dependencies on software running on infrastructure outside of their control.  
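For readers new to these terms, here is a minimal Python sketch of how an availability SLI and an error budget relate to an SLO. The probe counts and the 99.9% objective are made-up example numbers, and the function names are hypothetical; real SLIs are usually defined per service and measured from telemetry.

```python
def availability_sli(good_probes: int, total_probes: int) -> float:
    """SLI: the fraction of probes that succeeded."""
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    return good_probes / total_probes


def error_budget_remaining(slo: float, good_probes: int, total_probes: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be between 0 and 1, exclusive")
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    allowed_bad = (1.0 - slo) * total_probes  # failures the SLO tolerates
    actual_bad = total_probes - good_probes
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    slo = 0.999  # example objective: 99.9% of probes succeed
    good, total = 99_950, 100_000
    print(f"SLI: {availability_sli(good, total):.4%}")
    print(f"budget remaining: {error_budget_remaining(slo, good, total):.0%}")
```

When the remaining budget is healthy, there is room to ship changes faster; when it runs out, the sensible move is to slow down and invest in reliability.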

An NRE in this step has a world view in layers of separate concerns and understands their place in the stack.

With agreements, automation and tradeoffs, reliability is a goal to be managed, not necessarily maximized. Speed, agility, efficiency and other successes are incidental for the NRE, who holds reliability and availability as prerequisites to other useful economies.

Flip through our 5-step framework slides to learn more. Technologists seem mostly gripped by the 5-step tool landscape slide, but progress is less about what you use and more about how you use it.

Please leave a comment below about your journey and lessons, and finally, thanks for sharing these ideas, which have long been missing in the automation discourse.

image credit 9114 Images/pixabay

This blog was also published at https://forums.juniper.net/t5/Enterprise-Cloud-and/5-Steps-to-Automated-NetOps/ba-p/366048