The idea of applying artificial intelligence and machine learning to more rapidly and accurately resolve IT incidents and manage alerts has been gaining steam in the past year.
While AIOps, as it’s frequently called, has spawned an entirely new market of startups, many enterprise IT leaders are playing a cautious hand so far – and for good reason.
The risks are real: if an AIOps tool goes wrong out of the gate, trust among both IT staff and executives diminishes.
That’s why it’s critical to establish a workflow for success with any implementation before you even unbox a tool.
I’ve evaluated and deployed AIOps systems in large-scale enterprises and have developed enterprise software for IT operations, giving me unique perspectives on how to successfully deploy advanced technologies in large organizations.
Below, I’ve outlined a seven-step deployment process that runs from planning through configuration and testing to launch.
Step 1: Plan
Start small: choose one or two use cases or workloads that need improvement and are owned by a team that is open to change.
Next, assess what skills you have and where you may need outside help or training, predominantly in data science and automation but also in DevOps and continuous integration. IT Ops personnel will need to understand enough about how machine learning analytics work that, when they turn control over to the system, they can audit how well the automated control is doing its job.
Determine the needed workflow changes to the IT operations process for the selected use case. For instance, if you are applying the AI system to manage alerts for the e-commerce site, what happens when correlated alerts involve multiple teams?
Understand data requirements for your use cases. If the data is not native to the tool or platform, then there may not be sufficient context.
As a result, you may need to supplement the data from a CMDB or alternate source.
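As a rough illustration of supplementing alert data from a CMDB, the sketch below joins incoming alerts to a CMDB extract by hostname and flags alerts that have no matching record. The `CMDB` dictionary, the field names (`service`, `owner_team`, `environment`) and the `enrich_alert` helper are all assumptions made up for this example; a real CMDB export or API will look different.

```python
# Hypothetical CMDB extract keyed by hostname (illustrative data only).
CMDB = {
    "web-01": {"service": "e-commerce", "owner_team": "web-platform", "environment": "prod"},
    "db-01": {"service": "e-commerce", "owner_team": "database", "environment": "prod"},
}


def enrich_alert(alert: dict, cmdb: dict) -> dict:
    """Return a copy of the alert with CMDB context attached when a record exists."""
    record = cmdb.get(alert.get("host"))
    enriched = dict(alert)
    enriched["context"] = dict(record) if record else None
    # Flag gaps so someone can decide whether the CMDB needs updating.
    enriched["context_missing"] = record is None
    return enriched


alerts = [
    {"host": "web-01", "message": "HTTP 5xx rate high"},
    {"host": "cache-07", "message": "memory pressure"},
]
enriched = [enrich_alert(a, CMDB) for a in alerts]
for e in enriched:
    print(e["host"], "missing context" if e["context_missing"] else e["context"]["owner_team"])
```

Counting how many alerts come back with `context_missing` set is one simple way to gauge, during planning, whether the tool will have enough context for the chosen use case.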
Set goals for the results you expect from your first projects. These could include a reduction in alert noise, a decline in support ticket volume, or faster incident resolution. Along with the goals, establish a process for measuring and sharing results.
Develop a training plan for the system’s machine learning model. Set expectations early on as to what the trainers should expect in the early weeks of the implementation when the model is immature and lessons are being learned.
Step 2: Socialize
Now that you’ve prepped all who need to know about your plans, it’s time to take the next step in awareness by bringing the user community onboard. People may worry that their jobs will eventually go away or change for the worse, with a new fancy AI machine in play. Help people understand how the system works, benefits to the business and IT employees, and how it will change their current workflows. Find evangelists/power users in your organization to help spread the word and help train others when needed.
Step 3: Understand
The understanding phase is when you can really dig deep into the AIOps system’s capabilities and best practices. Here’s what to focus on:
Learn how the AIOps system works and its data requirements. For example, if you are applying AI to alert correlation, you may need to include topology mapping to validate that the relationship exists between individual alerts.
What use cases and problems are being solved by the system? Common ones include anomaly detection, event correlation, and ticket routing but others may include notifications and alert suppression. Focus on those which are likely to deliver the fastest results, won’t have negative effects on operations and can provide quick wins for your team.
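To make the topology-mapping point about alert correlation concrete, here is a minimal sketch of how a proposed correlation between two alerts could be checked against a topology map. It assumes the topology is available as a simple adjacency mapping and treats two hosts as related when they are within a small number of hops. The `TOPOLOGY` data, the function names and the two-hop default are hypothetical and not taken from any particular AIOps product.

```python
from collections import deque

# Hypothetical topology: each host maps to the hosts it is directly connected to.
TOPOLOGY = {
    "web-01": {"app-01"},
    "app-01": {"web-01", "db-01"},
    "db-01": {"app-01"},
    "mail-01": set(),
}


def hops_between(topology, start, goal, max_hops):
    """Breadth-first search; return the hop count, or None if not reachable within max_hops."""
    if start not in topology or goal not in topology:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in topology.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def topology_supports(topology, host_a, host_b, max_hops=2):
    """True when the topology map shows a relationship between the two alerting hosts."""
    return hops_between(topology, host_a, host_b, max_hops) is not None


print(topology_supports(TOPOLOGY, "web-01", "db-01"))   # related via app-01
print(topology_supports(TOPOLOGY, "web-01", "mail-01"))  # no path in the map
```

A check like this does not replace the tool's own correlation logic; it only gives operators an independent way to ask whether a suggested grouping is plausible given what the topology map says.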
Step 4: Setup & Observe
Now it’s time to configure the system based on the use cases you have selected. If days or weeks of configuration are required before the machine learning can work, then the solution’s viability needs to be questioned.
IT operators should be able to see how the algorithms interact with the data and deliver suggestions, guidance, and analysis. It’s important that the software makes its actions transparent by showing how it reached each conclusion and which data sets it used.
Step 5: Recommend
Part of the power of AIOps is the ability to quickly and efficiently handle routine, predictable events. In such cases, defined by IT Ops, the system is configured to take over and apply a fix.
This saves time, ensures standard responses to known issues (such as VM/server utilization thresholds and patch updates) and also may prevent the development of more serious, cascading issues. In a safe environment–such as a sandbox or on a non-critical workload–allow the system to automate these routine tasks and then monitor the results.
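One way to picture this kind of sandboxed automation is a small rule table that maps utilization thresholds to standard responses, run in a dry-run mode that only records what it would have done. The sketch below is an assumption-laden illustration: the `Rule` class, the metric names, the thresholds and the runbook names (`scale_out_vm`, `purge_temp_files`) are invented for the example, and the `execute` callback stands in for whatever mechanism your environment uses to apply a fix.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str       # key to look up in the incoming event
    threshold: float  # trigger when the value is at or above this
    action: str       # name of a hypothetical runbook


# Illustrative rules defined by IT Ops; the numbers are placeholders.
RULES = [
    Rule("cpu_percent", 90.0, "scale_out_vm"),
    Rule("disk_percent", 85.0, "purge_temp_files"),
]


def plan_actions(event, rules, execute=None):
    """Match an event against the rules.

    With execute=None this is a dry run: actions are recorded but nothing is applied.
    Otherwise execute(host, action) is called for each matching rule.
    """
    planned = []
    for rule in rules:
        value = event.get(rule.metric)
        if value is None or value < rule.threshold:
            continue
        executed = False
        if execute is not None:
            execute(event["host"], rule.action)
            executed = True
        planned.append({"host": event["host"], "action": rule.action,
                        "value": value, "executed": executed})
    return planned


event = {"host": "vm-12", "cpu_percent": 95.0, "disk_percent": 40.0}
for step in plan_actions(event, RULES):  # dry run
    print(step)
```

Keeping the dry-run log and comparing it with what operators actually did for the same events is one way to monitor results before letting the system apply fixes on its own.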
Step 6: Deploy
Once you are comfortable with the results from your testing and pilots, it is time to turn the system on in production. Run the system in testing mode for at least a couple of weeks to confirm that the outputs are accurate and that users are happy with the recommendations.
Step 7: Review & Refine
After a few weeks in production, review the results against your original goals. For instance, if the goal was alert noise reduction, how much has the noise actually dropped?
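A review like this can be reduced to simple arithmetic: compare a baseline measured before deployment with the current figure and check the percentage improvement against the target. The sketch below assumes metrics where lower is better (alert volume, mean time to resolve). The metric names, targets and numbers are made-up placeholders, not real results.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from the baseline value to the current value."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (before - after) * 100.0 / before


def review(goals: dict, baseline: dict, current: dict) -> dict:
    """Compare achieved reductions with target reductions for each metric."""
    report = {}
    for metric, target_pct in goals.items():
        achieved = percent_reduction(baseline[metric], current[metric])
        report[metric] = {
            "achieved_pct": round(achieved, 1),
            "target_pct": target_pct,
            "met": achieved >= target_pct,
        }
    return report


# Placeholder figures for illustration only.
goals = {"weekly_alerts": 50.0, "mttr_minutes": 50.0}
baseline = {"weekly_alerts": 12000, "mttr_minutes": 90}
current = {"weekly_alerts": 3000, "mttr_minutes": 54}

for metric, row in review(goals, baseline, current).items():
    print(metric, row)
```

With these placeholder inputs the alert-volume goal is met and the resolution-time goal is not, which is the kind of split result that tells you where to refine or retrain next.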
Aside from checking against the specific metrics goals that you set out in the beginning, conduct a qualitative survey of users to learn about their challenges and what benefits they are seeing so far. Then you can refine the configuration, retrain the model if needed, or select new use cases.
Artificial intelligence is a constantly evolving discipline and technology. It may seem risky because of its complexity, but if you break it down into a simplified planning and deployment process, you will see success.
Written by Ciaran Byrne, VP of Product Strategy at OpsRamp