Today’s is something of a heretical post, especially for the believers in ITIL. In the world of OSS, we look to build in layers of resiliency and not layers of redundancy.
The following diagram and the subsequent text in italics describe a typical ITIL process. Both are taken from https://www.computereconomics.com/article.cfm?id=1074
The sequence of events as shown in Figure 1 is as follows:
- At TIME = 0, an External Event is detected by the Incident Management process. This could be as simple as a customer calling to say that service is unavailable or it could be an automated alert from a system monitoring device. The incident owner logs and classifies this as incident i2. Then, the incident owner tries to match i2 to known errors, work-arounds, or temporary fixes, but cannot find a match in the database.
- At TIME = 1, the incident owner dispatches a problem request to the Problem Management process anticipating a work-around, temporary fix, or other assistance. In doing so, the incident owner has prompted the creation of Problem p2.
- At TIME = 2, the problem owner of p2 returns the expected temporary fix to the incident owner of i2. Note that both i2 and p2 are active and exist simultaneously. The incident owner for i2 applies the temporary fix.
- In this case, the work-around requires a change request. So, at TIME = 3, the incident owner for i2 initiates change request, c2.
- The change request c2 is applied successfully, and at TIME = 4, c2 is closed. Note that for a while i2, p2 and c2 all exist simultaneously.
- Because c2 was successful, the incident owner for i2 can now confirm that the incident is resolved. At TIME = 5, i2 is closed. However, p2 remains active while the problem owner searches for a permanent fix. The problem owner for p2 would be responsible for implementing the permanent fix and initiating any necessary change requests.
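The overlapping lifetimes in that sequence are easier to see as a small timeline. The following is only a rough sketch of the sequence above: the `Ticket` class and the `active_tickets` helper are hypothetical names invented for illustration, and the open/close times are read straight off the TIME steps in the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ticket:
    """Hypothetical record of one ticket's lifetime on the TIME axis."""
    name: str
    kind: str
    opened: int
    closed: Optional[int] = None  # None means the ticket is still open

    def active_at(self, t: int) -> bool:
        # A ticket counts as active from the step it opens until the step it closes.
        return self.opened <= t and (self.closed is None or t < self.closed)


# Open/close times taken from the sequence described above.
TICKETS = [
    Ticket("i2", "incident", opened=0, closed=5),
    Ticket("p2", "problem", opened=1),            # stays open pending a permanent fix
    Ticket("c2", "change", opened=3, closed=4),
]


def active_tickets(t: int) -> list[str]:
    """Return the names of all tickets active at time step t."""
    return [ticket.name for ticket in TICKETS if ticket.active_at(t)]


if __name__ == "__main__":
    for t in range(6):
        print(f"TIME = {t}: {active_tickets(t)}")
```

Run as a script, this shows three tickets in flight at TIME = 3 for a single underlying event, and p2 still open after the incident itself has been closed.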
But I look at it slightly differently. At their root, why do Incident Management, Problem Management and Change Management exist? They’re all just mechanisms for resolving a system* health issue. If we detect an event (or a set of events) and fix it straight away, we don’t have to expend all the effort of flicking tickets around.
Thinking within the T2R paradigm of trouble -> incident -> problem -> change -> resolve holds us back. If we can skip the middle steps and immediately associate a resolution with the event(s), we become a whole lot more efficient. If we can immediately relate a trigger to a reaction, we can also get rid of the intermediate ticket flickers and the processing cycle time.
So, the NOC of the future surely requires us to build a trigger -> reaction recommendation engine (and body of knowledge). That’s a more powerful tool to supply to our NOC operators than incidents, problems and change requests. (Easier to write about than to actually solve, though, of course.)
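To make the trigger -> reaction idea a little more concrete, here is a deliberately naive sketch. Everything in it is an assumption made for illustration: the `ReactionRecommender` class, the event names and the reaction names are all hypothetical, and ranking by historical success rate is just one simple way such a body of knowledge might be queried. A real engine would need far richer matching than an exact lookup on a set of event names.

```python
from collections import defaultdict


class ReactionRecommender:
    """Hypothetical body of knowledge mapping triggers to past reactions."""

    def __init__(self):
        # trigger -> reaction -> [successes, attempts]
        self._stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def record(self, trigger: frozenset, reaction: str, resolved: bool) -> None:
        """Store the outcome of applying a reaction to a trigger."""
        stats = self._stats[trigger][reaction]
        stats[1] += 1
        if resolved:
            stats[0] += 1

    def recommend(self, trigger: frozenset):
        """Return the reaction with the best success rate, or None if unseen."""
        reactions = self._stats.get(trigger)
        if not reactions:
            return None
        # Rank by success rate, then by number of attempts as a tie-breaker.
        return max(
            reactions,
            key=lambda r: (reactions[r][0] / reactions[r][1], reactions[r][1]),
        )


# Illustrative usage with made-up event and reaction names.
kb = ReactionRecommender()
trigger = frozenset({"LINK_DOWN", "BGP_PEER_LOST"})
kb.record(trigger, "reseat_optic", resolved=False)
kb.record(trigger, "failover_to_backup_path", resolved=True)
kb.record(trigger, "failover_to_backup_path", resolved=True)

recommendation = kb.recommend(trigger)
unknown = kb.recommend(frozenset({"FAN_FAILURE"}))
```

The trigger is a set of event names rather than a single event, so that a cluster of related alarms can map to one reaction. When the engine has never seen a trigger it returns `None`, which is the point at which an operator (or the traditional ticket flow) would still need to step in.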