Get the most from your AIOps: 4 tricks from the trenches

David Linthicum Chief Cloud Strategy Officer, Deloitte Consulting
 

The use of AI to manage complex tasks such as system management and monitoring is an old idea made new again as the tech has gotten cheaper. Today's AI systems also store operational data efficiently, and that data in turn becomes their training data.

AI is now a force multiplier for IT operations, and that changes the game in terms of how you should approach CloudOps using AIOps technology.

While AIOps cannot deliver miracles, it does have the ability to adjust and expand to vastly different operations problem domains. It could become the centralized command-and-control interface. AIOps can also leverage an always-learning brain to automate today's manual operations tasks, and thus avoid an outage. 

Here are four new ways to think about the use of AIOps tools, plus what works and what doesn't—and tips from teams that deal with ops daily as they learn to leverage AIOps to address most operational issues.

1. Leverage existing knowledge bases from similar ops domains

Savvy CloudOps subject-matter experts don't start from scratch with AIOps, but instead bump up their AI power with a pre-populated knowledge base of experiences. 

Today, this sharing is done mostly intra-company. People share systems training data and actual knowledge bases with other ops teams within the company that may operate many of the same systems and application patterns. The freshly deployed AIOps tool shows up on day one with knowledge relevant to its applications. This knowledge could include how to diagnose ops data coming from the storage systems or how to use existing experiences, such as a known problem with I/O that will need to be fixed proactively before there is a catastrophic failure. 

Without this pre-existing knowledge, the AIOps tool would have to learn what leads up to a failure and how that failure can be detected by monitoring key data points from the storage system. This learning from scratch may take months, versus a minimal amount of time to install a pre-trained knowledge base that will help avoid known troubles. 
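As a rough illustration of what "showing up on day one with knowledge" could look like, here is a minimal Python sketch of a shared knowledge base that already encodes a known storage I/O problem. Everything in it is an assumption made for illustration: the `KnownIssue` type, the metric names `io_wait_pct` and `disk_queue_depth`, and the thresholds are hypothetical and do not come from any particular AIOps product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class KnownIssue:
    """One entry in a shared knowledge base (hypothetical shape)."""
    name: str
    matches: Callable[[Dict[str, float]], bool]  # predicate over current metrics
    action: str                                  # recommended proactive fix


# Entries another ops team already learned the hard way.
# Metric names and thresholds are illustrative assumptions.
SHARED_KB: List[KnownIssue] = [
    KnownIssue(
        name="storage-io-degradation",
        matches=lambda m: m.get("io_wait_pct", 0) > 30
        and m.get("disk_queue_depth", 0) > 8,
        action="inspect storage controller before failure",
    ),
]


def diagnose(metrics: Dict[str, float],
             kb: List[KnownIssue] = SHARED_KB) -> List[Tuple[str, str]]:
    """Return (issue name, action) for every known issue the metrics match."""
    return [(issue.name, issue.action) for issue in kb if issue.matches(metrics)]
```

A freshly installed tool that imports `SHARED_KB` can flag the I/O pattern immediately with `diagnose({"io_wait_pct": 42.0, "disk_queue_depth": 12})`, rather than waiting months to observe enough failures to learn it.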

More knowledge bases on the way

An inter-company knowledge exchange is the next step on this path. Many companies rightly consider their knowledge bases proprietary data and a competitive differentiator, so don't bother knocking on your competitor's door with a request. Instead, the key here will be AIOps vendors' ability to provide knowledge bases that are purpose-built for the specific types of systems to be operated. 

These vendor knowledge bases are in the works today. It seems likely they will follow patterns similar to the GICS (Global Industry Classification Standard) sectors that include financials, healthcare, industrials, etc., that further break down to 24 industry groups, 69 industries, and 158 sub-industries. 

If AIOps vendors follow historical trends in IT, development will start with knowledge bases that can apply to many customers in many fields, such as accounting, and then work down to the sub-industries of customers that supply the vendor with the most revenue. 

You should expect the availability of vendor-supplied knowledge bases for huge retailers such as Walmart, Walgreens, and Costco, with their large corporate offices and big IT budgets, before they become available to small wholesalers that supply specialty goods. Work with your vendor, but also get creative in your search for potential knowledge bases for your specific sector. 

2. Integrate with other key ops systems

A recent data breach at a large company cost $10 million in direct lost revenue and caused a PR nightmare that will continue for years. The post-mortem determined that, leading up to the breach, an ops system reported CPU saturation on a group of servers. The cause of the saturation turned out to be random and automated logon attempts that were used to break into the system.

Because the ops tools didn’t talk to the SecOps tools or even to the GovOps tools, the breach attempts went undiscovered. The ops problem with saturation of CPUs should have raised the alarm that it was a likely security problem as well.

The good news is that most of today's AIOps tools provide these types of integrations. The bad news is that many IT departments don't leverage their AIOps tools to provide true observability into management and operations of all aspects of their systems: governance, security, performance, configuration management, and basic operations. 

Core insider trick: Leverage AIOps as a receiver of data from other systems, such as SecOps and GovOps, and as an active communicator with those systems. This is the kind of 1+1+1=6 value proposition that will keep your company out of the news cycle, and you in a job.  
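The breach scenario above suggests a simple cross-tool check: when an ops alert for CPU saturation coincides with a burst of failed logins on the same host, treat it as a possible security event. The Python sketch below shows one way that correlation might look. The `OpsAlert` and `SecEvent` types, the ten-minute window, and the threshold of 50 events are all assumptions chosen for illustration, not values from any real tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class OpsAlert:
    host: str
    metric: str      # e.g. "cpu_saturation" (hypothetical metric name)
    at: datetime


@dataclass(frozen=True)
class SecEvent:
    host: str
    kind: str        # e.g. "failed_login" (hypothetical event kind)
    at: datetime


def correlate(ops_alerts: Iterable[OpsAlert],
              sec_events: Iterable[SecEvent],
              window: timedelta = timedelta(minutes=10),
              min_events: int = 50) -> List[Tuple[str, int]]:
    """Flag hosts where CPU saturation coincides with many failed logins.

    Returns (host, failed_login_count) for each CPU alert that has at least
    `min_events` failed logins on the same host within `window` of the alert.
    """
    events = list(sec_events)
    flagged = []
    for alert in ops_alerts:
        if alert.metric != "cpu_saturation":
            continue
        burst = [
            e for e in events
            if e.host == alert.host
            and e.kind == "failed_login"
            and abs(e.at - alert.at) <= window
        ]
        if len(burst) >= min_events:
            flagged.append((alert.host, len(burst)))
    return flagged
```

In practice the flagged hosts would be pushed back to the SecOps tool, which is the "active communicator" half of the trick; the sketch only covers the receiving and correlating half.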

3. Integrate with proactive ticketing systems to perform pre-crisis maintenance

Most AIOps tools support ticketing systems. And tools that support IT service management (ITSM) and IT operations management help manage all systems maintenance and repair activities. 

Simply put, ITSM is how IT teams manage the end-to-end delivery of IT services to customers. This covers all processes and activities, including the automatic creation of tickets to have automated and human-driven processes to fix both common and uncommon problems.

So how do you leverage an automated ticket creation and management system to lower the risk of operations and outages? It's a creative way to leverage humans to be more proactive by using an AI engine to define processes, fixes, and other ops activities.

For example, certain systems on a factory floor are in a hostile environment, in terms of temperature flux and vibration. Normal maintenance of the systems needs to occur frequently, which includes updating and upgrading power supplies and checking on the condition of physical disk platters. The alternative is factory system outages and millions in lost revenue each hour. 

These systems may have static maintenance schedules, much like those you find in the owner's manual of a car or truck. But over the course of months or years, these schedules can change, leading to different times that a human must intervene to run routine maintenance and checks on these systems. 

A better idea, and another trick, is for the AIOps system to manage the tickets, and thus the humans as well. The ticket system has a much better view of the overall behavior of the systems and can speed up or slow down maintenance schedules as data comes in from those servers or devices. 
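To make the idea of speeding up or slowing down a maintenance schedule concrete, here is a minimal Python sketch that scales a static interval using incoming telemetry and emits a ticket with the adjusted due date. The telemetry fields, thresholds, and scaling factors are hypothetical assumptions; a real AIOps engine would learn these relationships from data rather than hard-code them.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Telemetry:
    """Readings since the last service (field names are assumptions)."""
    max_temp_c: float
    vibration_rms_g: float
    reallocated_sectors: int


def next_interval_days(base_days: int, t: Telemetry) -> int:
    """Scale the static interval from the manual by observed stress."""
    factor = 1.0
    if t.max_temp_c > 45:
        factor *= 0.75          # heat stress: service sooner
    if t.vibration_rms_g > 0.5:
        factor *= 0.75          # heavy vibration: service sooner
    if t.reallocated_sectors > 0:
        factor *= 0.5           # disk already degrading: service much sooner
    # Only stretch the interval when no stress signal fired at all
    # and conditions are clearly benign.
    if factor == 1.0 and t.max_temp_c < 30 and t.vibration_rms_g < 0.1:
        factor = 1.25
    return max(1, round(base_days * factor))


def maintenance_ticket(device: str, last_service: date,
                       base_days: int, t: Telemetry) -> dict:
    """Build a ticket whose due date reflects the adjusted interval."""
    due = last_service + timedelta(days=next_interval_days(base_days, t))
    return {"device": device, "due": due, "summary": "Preventive maintenance"}
```

With a 96-day base interval, a hot, vibrating machine with reallocated disk sectors would be scheduled after 27 days, while a machine in benign conditions would be pushed out to 120 days.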

Proactive ticketing helps avoid catastrophic failures, and many companies should build it into their planning. The ticketing system prioritizes all activities, automated or not, so that preventive maintenance happens ahead of failure, which significantly reduces downtime. 

4. Centralize logging, analysis, and knowledge-based learning

It's pointless to manage a set of cloud brands using tools that are purpose-built for specific clouds such as Oracle, AWS, Google, and Microsoft. So you have two choices. 

First is holistic management of all cloud and non-cloud assets, where there is centralized command and control. The second choice is siloed management of different cloud and non-cloud platforms that use complex and confusing ops procedures, and fragmented use of heterogeneous technologies. 

Take, for example, three cloud brands that use three different cloud-native ops tools to manage each cloud. This configuration requires three different skill sets to correctly operate each tool. Also, you'll have observability only within each cloud using the single, siloed tools. 

There is no way to deal with dependencies or instances when systems span platforms and/or cloud brands. In other words, this example won't scale.

For cloud and non-cloud platforms, you want a single AIOps tool to manage heterogeneous resources. The trick to doing this is to focus on what's above and between the clouds—common services, in other words. 

Don't force each cloud application and dataset to be managed by only cloud-native technology on those specific clouds. (This is not only a trick, but the current trend.)

This approach solves a couple of problems:

  • You normalize operations by using a single management and monitoring platform, such as AIOps. While this may not manage all systems, not even all cloud-based systems, the idea is to move in the direction of centralization of operations as quickly as possible. This reduces costs and lowers risk in the long run. However, investments must be made in AIOps technology and in establishing a culture of ops automation.  
  • The heterogeneous platforms become application and data platforms as a group of systems. For instance, if the idea is to move to a multi-cloud to leverage different cloud services for different requirements, then you may have a loosely coupled application that spans cloud brands, such as AWS and Google. If you're considering a monolithic and siloed approach to ops management, using whatever native tool the cloud vendor provides, you'll have more risk and cost because you're running two or more ops tools that may have conflicts, and you'll need aligned skill sets as well.
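One way to picture the layer "above and between the clouds" is a set of thin adapters that translate each platform's events into a single schema, so that one tool can reason about a resource regardless of where it runs. The Python sketch below assumes two made-up sources, `cloud_a` and `cloud_b`, with invented payload shapes; real provider payloads differ, and nothing here reflects an actual vendor API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """Common schema shared by every platform (hypothetical)."""
    source: str
    resource: str
    severity: str   # "info", "warn", or "critical"
    message: str


# Adapters for two invented payload shapes; real payloads will differ.
def from_cloud_a(raw: dict) -> Event:
    return Event("cloud_a", raw["resourceId"], raw["level"].lower(), raw["msg"])


def from_cloud_b(raw: dict) -> Event:
    severity = {0: "info", 1: "warn", 2: "critical"}[raw["sev"]]
    return Event("cloud_b", raw["target"], severity, raw["text"])


ADAPTERS: Dict[str, Callable[[dict], Event]] = {
    "cloud_a": from_cloud_a,
    "cloud_b": from_cloud_b,
}


def normalize(source: str, raw: dict) -> Event:
    """Translate a platform-specific record into the common schema."""
    return ADAPTERS[source](raw)


def critical_by_resource(events: Iterable[Event]) -> Dict[str, List[str]]:
    """Group critical events by resource, listing which platforms reported them."""
    grouped: Dict[str, List[str]] = {}
    for e in events:
        if e.severity == "critical":
            grouped.setdefault(e.resource, []).append(e.source)
    return grouped
```

Once events share a schema, a cross-platform view such as `critical_by_resource` becomes a few lines of code, which is exactly the kind of dependency tracking that siloed, per-cloud tools cannot provide.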

There's no magic bullet

Despite the tricks I've shared here, there is no magic solution to more effectively leverage AIOps technology, or any technology, for that matter. However, the potential is to build operations around an all-knowing and all-seeing platform that can ultimately automate most humans out of the equation. That is when the overall reliability of IT operations will increase. 

As you move further along in your journey, read my recommendations in "8 things to consider before buying an AIOps tool."
