Any textbook or guide on organizational performance management will tell you that you can’t improve what you don’t measure. The whole idea of key performance indicators (KPIs) is that they provide a concise measure of performance based on accurate data.
This is true for all areas of the enterprise, whether it’s sales, manufacturing, procurement, or IT.
IT organizations with even the most basic computerized ticketing systems have a wealth of data they can use to measure their performance. However, not many take full advantage of this treasure trove, and of those that do, not all measure the right things.
In this article, we describe what’s meant by “after-incident reporting” and how doing it properly can improve IT team performance, most importantly by resolving incidents faster.
What Is After-Incident IT Reporting?
After-incident IT reporting falls into two broad categories, depending on whether you’re looking at the “forest” or individual “trees”:
- Individual incident analysis, or post-incident review, focuses on a specific incident so that stakeholders can review how it was handled, what went right, and what could have been improved, as well as investigate possible root causes.
- Summary analysis looks at performance measurements across a larger set of incidents to identify averages for different categories and trends over time.
Let’s look at these categories in more detail.
Post-Incident Review
Many IT organizations deal with hundreds of incident tickets daily, so holding a post-incident review session for every ticket wouldn’t make sense. For organizations that have post-incident reviews at all, most require them only for major (severity 1) incidents. These reviews should examine several aspects of the incident response and resolution, including:
- How soon after symptoms were observed (or a monitoring system sent an alert) the incident ticket was opened, and whether that delay was within expectations or much too long
- How long it took to identify the issue as a major incident
- The initial steps taken to diagnose and contain the issue, and whether they were appropriate or went down the wrong path
- How long it took to convene a conference call and get the right people on the line and, if it took too long, what the cause was
- How long it took to engage external service providers, if applicable
- How long it took to identify and implement a solution to restore system functionality
- Any bottlenecks or communication issues that were observed
Some aspects should be captured in the ticket data, whereas other elements are more anecdotal. Both types of information are essential. For example, if the team repeatedly has trouble engaging a specific external service provider (something that might not be captured in a single ticket), it’s a sign that it might be time to shop for a new service provider.
An essential part of a post-incident review is looking at similar previous incidents to see if the same communication issues or root causes are repeatedly encountered. These can point to underlying problems that can be addressed and prevent future incidents or reduce their severity.
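The timing questions in the review list above can often be answered straight from ticket timestamps. The following is a minimal sketch, assuming the ticket export carries ISO 8601 timestamps under hypothetical field names such as `detected_at` and `declared_major_at`; real ticketing systems will name and format these fields differently, and the example ticket is made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical field names: (start field, end field, label for the interval).
MILESTONES = [
    ("detected_at", "opened_at", "time_to_open"),
    ("opened_at", "declared_major_at", "time_to_declare_major"),
    ("declared_major_at", "call_convened_at", "time_to_convene_call"),
    ("opened_at", "resolved_at", "time_to_resolve"),
]


def review_timeline(ticket: dict) -> dict[str, Optional[timedelta]]:
    """Return the elapsed time between pairs of milestones in one ticket.

    A missing timestamp yields None so the gap shows up in the review
    instead of being silently skipped.
    """
    result: dict[str, Optional[timedelta]] = {}
    for start_key, end_key, label in MILESTONES:
        start, end = ticket.get(start_key), ticket.get(end_key)
        if start is None or end is None:
            result[label] = None
        else:
            result[label] = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return result


# Made-up ticket for illustration only; note that no call time was recorded.
example_ticket = {
    "detected_at": "2024-03-05T09:00:00",
    "opened_at": "2024-03-05T09:12:00",
    "declared_major_at": "2024-03-05T09:40:00",
    "resolved_at": "2024-03-05T13:10:00",
}

timeline = review_timeline(example_ticket)
for label, elapsed in timeline.items():
    print(f"{label}: {elapsed if elapsed is not None else 'not recorded'}")
```

Reporting a missing milestone as "not recorded" is a deliberate choice in this sketch: a timestamp nobody captured is itself a finding worth raising in the review.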
Summary Analysis
Summary analysis can cover a broader swath of incidents: not just major incidents but also those of lesser severity. By analyzing the data collected in the ticketing system, you can answer questions such as:
- Which IT systems and applications have the most incidents?
- What are the causes of SLA breaches and escalations? Are they related to specific systems, certain IT staff members, overall ticket load, or other factors?
- Are there spikes in incident reports after significant changes and updates to systems and applications? If so, your change management program may need to be improved.
- What are the trends over time for any of these measurements? Are we improving as a team, both overall and in specific areas?
Most of this information is captured in the ticketing system (if it’s being used correctly). Some observations should be correlated with other data to understand the underlying problems fully.
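As a rough illustration of how the questions above map onto ticket data, here is a minimal sketch using only the Python standard library. The ticket and change-log records, their field names (`system`, `sla_breached`, `deployed_at`), and the three-day window are all assumptions made for the example, not the schema of any particular ticketing product.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Made-up tickets; field names are assumptions, not a real ticketing schema.
tickets = [
    {"system": "email", "opened_at": "2024-01-10T08:00:00",
     "resolved_at": "2024-01-10T12:00:00", "sla_breached": False},
    {"system": "email", "opened_at": "2024-01-22T09:00:00",
     "resolved_at": "2024-01-22T19:00:00", "sla_breached": True},
    {"system": "vpn", "opened_at": "2024-02-03T10:00:00",
     "resolved_at": "2024-02-03T12:00:00", "sla_breached": False},
    {"system": "email", "opened_at": "2024-02-15T14:00:00",
     "resolved_at": "2024-02-15T18:00:00", "sla_breached": False},
]
# Made-up change log: which system was changed, and when.
changes = [{"system": "email", "deployed_at": "2024-01-21T22:00:00"}]


def incidents_per_system(tickets):
    """Which systems have the most incidents?"""
    return Counter(t["system"] for t in tickets)


def sla_breach_rate(tickets):
    """Fraction of each system's tickets that breached the SLA."""
    totals, breaches = Counter(), Counter()
    for t in tickets:
        totals[t["system"]] += 1
        breaches[t["system"]] += int(t["sla_breached"])
    return {system: breaches[system] / totals[system] for system in totals}


def monthly_mean_resolution_hours(tickets):
    """Trend over time: mean hours to resolve, grouped by month opened."""
    by_month = defaultdict(list)
    for t in tickets:
        opened = datetime.fromisoformat(t["opened_at"])
        resolved = datetime.fromisoformat(t["resolved_at"])
        by_month[opened.strftime("%Y-%m")].append(
            (resolved - opened).total_seconds() / 3600
        )
    return {month: mean(hours) for month, hours in sorted(by_month.items())}


def incidents_after_changes(tickets, changes, window=timedelta(days=3)):
    """Count tickets opened on the changed system within `window` of each change."""
    counts = []
    for change in changes:
        deployed = datetime.fromisoformat(change["deployed_at"])
        n = sum(
            1 for t in tickets
            if t["system"] == change["system"]
            and deployed <= datetime.fromisoformat(t["opened_at"]) <= deployed + window
        )
        counts.append((change["system"], change["deployed_at"], n))
    return counts


print(incidents_per_system(tickets).most_common())
print(sla_breach_rate(tickets))
print(monthly_mean_resolution_hours(tickets))
print(incidents_after_changes(tickets, changes))
```

The post-change count is only a raw number. To call it a spike, it would need to be compared against the normal incident rate for the same system over a similar window.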
Benefits of Proper After-Incident Reporting
Proper after-incident reporting takes time and effort, but it’s well worth it. IT organizations that perform after-incident reporting can realize numerous benefits, such as:
- The ability to identify common root causes, communications bottlenecks, and other factors that cause IT incidents, increase their severity, and increase the response or resolution time
- Visibility into factors that affect the overall team’s performance or that of individual team members
- The ability to pinpoint the systems and applications that have the most incidents, which could indicate configuration issues, performance tuning issues, or hardware in need of repair or replacement
Furthermore, the team’s experience with major incidents can inform the organization’s disaster recovery and business continuity planning. Knowing where the likely communications bottlenecks are enables the team to include mitigation steps in the disaster recovery plan to eliminate or reduce the impact of these bottlenecks.
And, of course, measuring the right things points you to the areas where you should investigate performance issues further, determine their root causes, and address them. The result will be faster resolution of IT incidents.
Proper after-incident reporting is a critical way to improve IT incident response, but it’s not the only way. Other methods include rule-based ticket prioritization, simplifying communication channels, and more. For more ways to improve your team’s performance, download our white paper, “8 Solutions to Resolve IT Incidents Faster.”