“From watching ESPN, I’d learned about the power of information bombardment. ESPN strafes its viewers with an almost hysterical amount of data and details. Scrolling boxes. Panels. Bars. Graphics. Multi-angle camera perspectives. When exposed to a surfeit of data, men tend to feel more masculine and in command. Do most men bother to decipher these boxes, panels, bars and graphics? No – but that’s not really the point.”
Martin Lindstrom, in his book, “Small Data.”
I’ve just finished reading Small Data, a fascinating book that espouses forensic analysis of the lives of users (ie small data) rather than using big data methods to identify market opportunities. I like the idea of applying both approaches to our OSS products. After all, we need to make them more intuitive, endearing and ultimately, effective.
The quote above struck a chord in particular. Our OSS GUIs (user interfaces) can tend towards the ESPN model, can’t they? The following paraphrasing doesn’t seem completely at odds with most of the OSS that we interact with – “[the OSS] strafes its viewers with an almost hysterical amount of data and details.”
And if what Lindstrom says is an accurate psychological analysis, does it mean:
The OSS GUIs we’re designing help make their developers “feel more masculine and in command” or
Our OSS operators “feel more masculine and in command” or
Intriguingly, does the feeling of being more masculine and in command actually help or hinder their effectiveness?
I find it fascinating that:
Our OSS/BSS form a multi billion dollar industry
Our OSS/BSS are the beating heart of the telecoms industry, being wholly responsible for operationalising the network assets that so much capital is invested in
So little effort is invested in making the human-to-OSS interface far more effective than it is today
I keep hearing operators bemoan the complexities and challenges of wrangling their OSS, yet only hear “more functionality” being mentioned by vendors, never “better usability”
Maybe the last point comes from me being something of a rarity. Almost every one of the thousands of people I know in OSS either works for the vendor/supplier or the customer/operator. Conversely, I’ve represented both sides of the fence and often even sit in the middle acting as a conduit between buyers and sellers. Or am I just being a bit precious? Do you also spot this incongruence on a regular basis?
Whether you’re buy-side or sell-side, would you love to make your OSS more effective? Let us know and we can discuss some of the optimisation techniques that might work for you.
We’re going to look into assurance models of the past versus the changing assurance demands that are appearing these days. The diagrams below are highly stylised for discussion purposes so they’re unlikely to reflect actual implementations, but we’ll get to that.
Under the old model, the heart of the OSS/BSS was the database (almost exclusively a relational database). It would gather data, via probes/MDDs/collectors, from the network under management (Note: I’ve shown the sources as devices, but they could equally be EMS/NMS). The mediation device drivers (MDDs) would take feeds from the network and homogenise them to be suitable for very precise loading into tables in the relational databases.
This data came in the form of alarms/events, performance counters, syslogs and not much else. It could come in all sorts of common (eg SNMP) or obscure forms / protocols. Some would come via near-real-time notifications, but a lot was polled at cycles such as 5 or 15 mins.
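To make the old model a little more concrete, here’s a minimal sketch of what one of those MDDs effectively did: take a vendor-specific trap/event and homogenise it into a fixed relational schema. It’s purely illustrative – the table, field names and severity mapping are hypothetical, and the trap arrives here as a pre-parsed dictionary rather than via a real SNMP stack.

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative relational target for homogenised fault data (schema is hypothetical)
conn = sqlite3.connect("oss_faults.db")
conn.execute("""CREATE TABLE IF NOT EXISTS alarm (
    device_id TEXT, alarm_type TEXT, severity INTEGER, raised_at TEXT)""")

def normalise_trap(raw_trap: dict) -> tuple:
    """Map a vendor-specific trap/event into the fixed relational schema."""
    severity_map = {"critical": 1, "major": 2, "minor": 3, "warning": 4}
    return (
        raw_trap.get("sysName", "unknown"),
        raw_trap.get("trapOID", "unknown"),
        severity_map.get(raw_trap.get("severity", "").lower(), 5),
        datetime.now(timezone.utc).isoformat(),
    )

# Example feed item (in reality this would arrive via SNMP traps, syslog, polling, etc)
trap = {"sysName": "core-rtr-01", "trapOID": "linkDown", "severity": "major"}
conn.execute("INSERT INTO alarm VALUES (?, ?, ?, ?)", normalise_trap(trap))
conn.commit()
```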
Then the OSS/BSS applications (eg Assurance, Inventory, etc) would consume data from the database and write other data back to the database. There were some automations, such as hard-coded suppression rules, etc.
The automations were rarely closed-loop (ie to actually resolve the assurance challenge). There were also software assistants such as trendlines and threshold alerts to help capacity planners.
There was little overlap into security assurance – at most, an occasional notification that a device configuration varied from a golden config, or indirect indicators through performance graphs / thresholds.
But so many aspects of the old world have been changing within our networks and IT systems. The active network, the resilience mechanisms, the level of virtualisation, the release management methods, containerisation, microservices, etc. The list goes on and the demands have become more complex, but also far more dynamic.
Let’s start with the data sources this time, because this impacts our choice of data storage mechanism. We still receive data from the active network devices (and EMS/NMS), but we now also draw data from other sources. They might be internal sources from IT, security, etc, but could also be external sources like social indicators. The 4 Vs of data between old and new models have fundamentally changed:
Volume – we’re seeing far more data
Variety – the sources are increasing and the structure of data is no longer as homogenised as it once was (in fact unstructured data is now commonplace)
Velocity – we’re receiving incoming data at any number of different velocities, often at far higher frequency than the 15 minute poll cycles of the past
Veracity (or trustworthiness) – our systems of old relied on highly dependable data due to its relational nature and could easily descend into a data death spiral if data quality deteriorated. Now we accept data with questionable integrity and need to work around it
Again the data storage mechanism is at the heart of the solution. In this case it’s a (probably) unstructured data lake rather than a relational database, because of the 4 Vs above. The data it holds must still be stored in a way that allows cross-referencing with other data sets (ie the role of the indexer), but it doesn’t need to be as homogenised as in a relational database.
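As a minimal sketch of the contrast, the ingestion step below appends raw records as-is (no fixed schema) and keeps only a lightweight index for cross-referencing. The file name, index keys and record shapes are all assumptions for illustration.

```python
import json
from collections import defaultdict

LAKE_PATH = "events.jsonl"            # append-only "data lake" of raw records (hypothetical file)
index = defaultdict(list)             # lightweight cross-reference index: (key, value) -> record ids
record_count = 0

def ingest(record: dict, source: str) -> int:
    """Store the record as-is; only pull out whatever identifiers we recognise for indexing."""
    global record_count
    record_id = record_count
    record_count += 1
    with open(LAKE_PATH, "a") as lake:
        lake.write(json.dumps({"_id": record_id, "_source": source, **record}) + "\n")
    # Index on any identifier present, without demanding a fixed schema
    for key in ("device", "service_id", "customer"):
        if key in record:
            index[(key, record[key])].append(record_id)
    return record_id

ingest({"device": "core-rtr-01", "ifInErrors": 42}, source="snmp-poller")
ingest({"service_id": "CUST-1234", "text": "my internet is down again"}, source="social-feed")
print(dict(index))
```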
The 4 Vs also fundamentally change the way we have to make use of the data. It surpasses our ability to process it in a manual or semi-manual way (where semi-manual implies the traditional rules-based automations like suppression, root-cause analysis, etc). We have no choice but to increase dependency on machine-driven tools as automations need to become:
More closed-loop in nature – that is, to not just consolidate and create a ticket, but also to automate the resolution and ticket closure (a minimal closed-loop sketch follows this list)
More abundant – doing even more of the mundane, recurring tasks like auto-sizing resources (particularly virtual environments), restarting services, scheduling services, log clean-up, etc
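Here’s a minimal, hypothetical sketch of that closed-loop idea: detect, remediate, verify, then close the ticket rather than just raising it. The remediation action, health probe and ticketing interface are all placeholders, not a real vendor API.

```python
import time

def restart_service(host: str, service: str) -> bool:
    """Placeholder remediation action (eg via SSH, Ansible or an orchestrator in a real system)."""
    print(f"restarting {service} on {host}")
    return True

def service_healthy(host: str, service: str) -> bool:
    """Placeholder health probe to verify the remediation worked."""
    return True

def close_loop(alarm: dict, ticket_api):
    """Raise a ticket, attempt automated resolution, then close or escalate."""
    ticket = ticket_api.create(summary=f"{alarm['service']} down on {alarm['host']}")
    if restart_service(alarm["host"], alarm["service"]):
        time.sleep(5)                      # allow the service to settle before re-checking
        if service_healthy(alarm["host"], alarm["service"]):
            ticket_api.close(ticket, note="auto-remediated by closed-loop automation")
            return
    ticket_api.escalate(ticket)            # fall back to a human if remediation failed
```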
To be honest, we probably passed the manual/semi-manual tipping point many years ago. In the meantime we’ve done the best we could, eagerly waiting for machine tools like ML (Machine Learning) and AI (Artificial Intelligence) to catch up and help out. This is still fairly nascent, but AIOps tools are becoming increasingly prevalent.
The exciting thing is that once we start harnessing the potential of these machine tools, our AIOps should allow us to ask far more than just the network-health questions of past models. They could allow us to ask marketing / cost / capacity / profitability / security questions like:
Where do I urgently need to increase capacity in the network (and can you automatically just make this happen – a more “just in time” provisioning of resources rather than planning months ahead)
Where could I re-position capacity around the network to reduce costs, improve performance, improve customer experience, meet unmet demand
Where should sales/marketing teams focus their efforts to best service unmet demand (could be based on current network or in sequence with network build-out that’s due to become ready-for-service)
Where are the areas of un-met demand compared with our current network footprint
With an available budget of $x, what ratio of maintenance, replacement and expansion is it best spent on, and where
How do we better understand profitability vectors in the network compared to just the more crude revenue metrics (note that profitability vectors could include service density, amount of maintenance on the supporting infrastructure, customer interactions, churn, etc on a geographic or similar basis)
Where (and how) can we progressively automate a more intent or policy-driven auto-remediation of the network (if we don’t already have a consistent approach to config management)
What policies can we tweak to get better performance from the network on a more real-time basis (eg tweaking QoS policies based on current traffic in various parts of the network)
Can we look at past maintenance trends to automatically set routine maintenance schedules that are customised by device, region, device type, loads, etc rather than using a one-size-fits-all maintenance schedule
Can I determine, on a real-time basis, what services are using which resources to get a true service impact estimate in a dynamic, packet-switched network environment
What configurations (or misconfigurations) in the network pose security vulnerability threats (a simple drift-check sketch follows this list)
If a configuration change is identified, can it be automatically audited and reported on (and perhaps even quarantined) before then being authorised (manually or automatically?)
What anomalies are we seeing that could represent security events
Can we utilise end-to-end constructs such as network services, customer services, product lifecycle, device lifecycle, application performance (as well as the traditional network performance) to enhance context and correlation
And so many more that can’t be as easily accommodated by traditional assurance tools
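As a taste of how simple some of these can be to start with, here’s a minimal sketch of the configuration drift check mentioned above, assuming we can fetch a device’s running config as plain text and compare it against a stored golden config. The fetch functions are placeholders only.

```python
import difflib

def golden_config(device_type: str) -> str:
    """Placeholder golden config (in practice, from a config management repository)."""
    return "hostname TEMPLATE\nssh version 2\nno ip http server\n"

def fetch_running_config(device: str) -> str:
    """Placeholder: in practice this would come via NETCONF, SSH or the EMS."""
    return "hostname core-rtr-01\nssh version 2\nip http server\n"

def config_drift(device: str, device_type: str) -> list:
    """Return the lines that differ from the golden config (candidate vulnerabilities)."""
    diff = difflib.unified_diff(
        golden_config(device_type).splitlines(),
        fetch_running_config(device).splitlines(),
        lineterm="",
    )
    return [line for line in diff
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]

drift = config_drift("core-rtr-01", "edge-router")
if drift:
    print("Config drift detected, flag for audit / quarantine:", drift)
```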
Seems this post from last week has triggered some really interesting debate – Is your service assurance really service assurance?? (Part 5). It was a post that looked into collecting end-to-end service metrics rather than our traditional method of collecting network device events/metrics and trying to reverse-engineer them into a service-level perspective.
Thought I’d give you an update. I’m thinking along the following lines, but admit that I don’t have it all worked out by any means yet:
We need the concept of a span, like OpenTelemetry uses between microservices (in a way, it’s like the nearest-neighbour hop that each packet is getting pushed to).
Note that for us a span is on a service-by-service basis between nodes, not just a network link-by-link basis between nodes
We need to be able to measure the real-time metrics of the performance of each span as well as any events/faults impacting them
One challenge (one of probably many) is how to avoid flooding the data/management planes. Possibly a telemetry beacon at each node that’s aggregating performance/events of each packet passed for each service (a rough sketch of such a beacon follows this list)?? But what aggregation-window / cache-size to use? Still too impossibly huge to process except with ridiculously low sampling rates??
By chaining the spans we get a real-time, end-to-end trace of services and the performance (and real-time snapshot of service-by-service resource usage in a packet-switched network)
How to efficiently get the beacon data to a centralised logging/management point? Send beacons via management plane? Send via data plane? Take an approach similar to Netflow / IPFIX-style protocols?
How to store data for a short period (ie for real-time analysis/reporting) as well as for long periods. Due to volumes, we’d have to apply aging policies to the data, but it would still be valuable for the purpose of mid and long-term SLA, network health, optimisation, capacity management, etc
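For what it’s worth, here’s a rough sketch of what such a per-node telemetry beacon might look like, assuming each node can observe a (service, next-hop) pair for the packets it forwards and aggregates counters over a fixed window. All names and the windowing approach are assumptions, not a worked-out design.

```python
import time
from collections import defaultdict

class SpanBeacon:
    """Aggregates per-service, per-span packet counters at one node over a time window."""
    def __init__(self, node_id: str, window_secs: int = 10):
        self.node_id = node_id
        self.window_secs = window_secs
        self.window_start = time.time()
        self.counters = defaultdict(lambda: {"packets": 0, "bytes": 0, "drops": 0, "latency_sum": 0.0})

    def observe(self, service_id: str, next_hop: str, size: int, latency_ms: float, dropped: bool = False):
        key = (service_id, next_hop)        # a "span" = one service between this node and the next
        c = self.counters[key]
        c["packets"] += 1
        c["bytes"] += size
        c["latency_sum"] += latency_ms
        c["drops"] += int(dropped)

    def flush(self):
        """Emit one aggregated record per span and reset the window."""
        records = []
        for (service_id, next_hop), c in self.counters.items():
            records.append({
                "node": self.node_id, "service": service_id, "next_hop": next_hop,
                "window_start": self.window_start,
                "packets": c["packets"], "drops": c["drops"],
                "avg_latency_ms": c["latency_sum"] / max(c["packets"], 1),
            })
        self.counters.clear()
        self.window_start = time.time()
        return records

beacon = SpanBeacon("core-rtr-01")
beacon.observe("CUST-1234", next_hop="agg-sw-07", size=1400, latency_ms=0.8)
print(beacon.flush())
```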
As you can see, there are still so many wide-open questions about the feasibility of the concept. But getting feedback from multiple very clever people who read this blog is definitely helping! Thank you!!
“There’s the famous quote that if you want to understand how animals live, you don’t go to the zoo, you go to the jungle. The Future Lab has really pioneered that within Lego, and it hasn’t been a theoretical exercise. It’s been a real design-thinking approach to innovation, which we’ve learned an awful lot from.”
Jorgen Vig Knudstorp.
This quote prompted me to ask the question – how many times during OSS implementations had I sought to understand user behaviour at the zoo versus the jungle?
By that, how many times had I simply spoken with the user’s representative on the project team rather than directly with end users? What about the less obvious personas as discussed in this earlier post about user personas? Had I visited the jungles where internal stakeholders (project sponsors, executives, data consumers, etc) or external stakeholders (end-customers, regulatory bodies, etc) go about their daily lives?
I can truthfully, but regretfully, say I’ve spent far more time at OSS zoos than in jungles. This is something I need to redress.
But, at least I can claim to have spent most time in customer-facing roles.
Too many of the product development teams I’ve worked closely with don’t even visit OSS zoos, let alone jungles, in any given year. They never get close to observing real customers in their native environments.
I also just stumbled upon OpenTelemetry, an open source project designed to capture traces / metrics / logs from apps / microservices. It intrigued me because just as you have the concept of traces / metrics / logs for apps, you similarly have traces / metrics / logs for networks.
In the network world, we’re good at getting metrics / logs / events, but not very good at getting trace data (ie end-to-end service chains) as described earlier in this blog series. And if we can’t monitor traces, we can’t easily interpret a customer’s experience whilst they’re using their network service. We currently do “service assurance” by reverse-engineering logs / events, which seems a bit backward to me.
Take a closer look at the OpenTelemetry link above, which provides an overview of how their team is going to gather application telemetry. With increasing software-ification of our networks (eg SDN / NFV) and the use of microservices / NaaS / APIs in our management stacks, could this actually be our path to the holy grail of service assurance (ie capturing trace data – network service telemetry)?? Is it data plane? Is it control / management plane? Is it something in between?
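To make the idea a little more tangible, here’s a tiny illustration using OpenTelemetry’s Python SDK, but re-purposed so that each span represents a hop in a network service chain rather than a call between microservices. The span and attribute names are my own invention, not any standard.

```python
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("network.service.assurance")

# One trace = one end-to-end service path; each nested span = one hop (a "network span")
with tracer.start_as_current_span("service:CUST-1234") as service_span:
    service_span.set_attribute("service.type", "business-internet")
    with tracer.start_as_current_span("hop:cpe->access-node") as hop:
        hop.set_attribute("latency.ms", 1.2)
        hop.set_attribute("packet.loss.pct", 0.0)
    with tracer.start_as_current_span("hop:access-node->core-rtr-01") as hop:
        hop.set_attribute("latency.ms", 3.4)
        hop.set_attribute("packet.loss.pct", 0.1)
```

In this framing, the trace context that links the hops together is what gives us the end-to-end service view we currently have to reverse-engineer from nodal data.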
Note: The “active measurements” approach described in part 3 is slightly compromised in its current form, which is why I’m so intrigued by the potential of extending the concepts of OpenTelemetry into our software / virtual networks.
I’d really love your take on this one because I’m sure there are many elements to this that I haven’t thought through yet. Please leave your thoughts on the viability of the approach.
Interestingly, I also just stumbled upon OpenTelemetry, an open source project designed to capture traces / metrics / logs from apps / microservices. It intrigued me because it introduces the concept of telemetry on spans (not just application nodes). Tomorrow’s article will explore how the concept of spans / traces / metrics / logs for apps might provide insight into the challenge we face getting true end-to-end metrics from our networks (as opposed to the easy to come by nodal metrics).
In the network world, we’re good at getting nodal metrics / logs / events, but not very good at getting trace data (ie end-to-end service chains, or an aggregation of spans in OpenTelemetry nomenclature). And if we can’t monitor traces, we can’t easily interpret a customer’s experience whilst they’re using their network service. We currently do “service assurance” by reverse-engineering nodal logs / events, which seems a bit backward to me.
Table 4 (from the Netrounds white paper link above) provides a view of the most common AI/ML techniques used. Classification and Clustering are useful techniques for alarm / event “optimisation” (filter, group, correlate and prioritise alarms). That is, to effectively minimise the number of alarms / events a NOC operator needs to look at. In effect, traditional data collection allows AI / ML to remove the noise, but still leaves the problem to be solved manually (ie network assurance, not service assurance).
They’re helping optimise network / resource problems, but not solving the more important service-related problems, as articulated in Table 5 below (again from Netrounds).
If we can directly collect trace data (ie the “active measurements” described in yesterday’s post), we have the data to answer specific questions (which better aligns with our narrow AI technologies of today). To paraphrase questions in the Netrounds white paper, we can ask:
Has the digital service been properly activated
What service level is currently being experienced by customers (and are SLAs being met)
Is there an outage or degradation of end to end service chains (established over multi-domain, hybrid and multi-layered networks)
Does feedback need to be applied (eg via an orchestration solution) to heal the network
Today we’ll discuss the approach/es to overcome the constraints described in yesterday’s post.
As shown via the inserted blue row in Table 6 below (source: Netrounds), a proposed solution is to use active measurements that reflect the end-to-end user experience.
The blue row only talks about the real-time monitoring of “synthetic user traffic” in the table below. However, there are at least two other active measurement techniques that I can think of:
We can monitor real user traffic if we use port-mirroring techniques
We can also apply techniques such as TR-069 to collect real-time customer service metadata
Note: There are strengths and weaknesses of each of the three approaches, but we won’t dive into that here. Maybe another time.
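To illustrate the synthetic-traffic flavour at its simplest, the probe below exercises the thing the customer actually pays for (here, an HTTP fetch) and records the KPI directly rather than inferring it from device counters. The URL and sample count are placeholders; a real active-measurement tool does far more.

```python
import time
import urllib.request

def probe_http(url: str, timeout: float = 5.0) -> dict:
    """One synthetic transaction: measure what the customer would actually experience."""
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    return {"url": url, "success": ok, "response_ms": (time.time() - start) * 1000}

samples = [probe_http("https://example.com/") for _ in range(5)]
availability = sum(s["success"] for s in samples) / len(samples)
avg_ms = sum(s["response_ms"] for s in samples) / len(samples)
print(f"availability={availability:.0%}, avg response={avg_ms:.0f} ms")   # service-level KPIs
```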
You may recall in yesterday’s post that we couldn’t readily ask service-related questions of our traditional systems or data. Excitingly though, active measurement solutions do allow us to ask more customer-centric questions, like those shown in the orange box below. We can start to collect metrics that do relate directly to what the customer is paying for (eg real data throughput rates on a storage backup service). We can start to monitor real SLA metrics, not just proxy / vanity metrics (like device up-time).
Interestingly, I’ve only had the opportunity to use one vendor’s active measurement solutions so far (one synthetic transaction tool and one port-mirror tool). [The vendor is not Netrounds, I should add. I haven’t seen Netrounds’ solution yet, just their insightful white paper]. Figure 3 actually does a great job of articulating why the other vendor’s UI (user interface) and APIs are currently lacking.
Whilst they do collect active metrics, the UI doesn’t allow the user to easily ask important service health questions of the data like in the orange box. Instead, the user has to dig around in all the metrics and make their own inferences. Similarly the APIs don’t allow for the identification of events (eg threshold crossing) or automatic push of notifications to external systems.
This leaves a gap in our ability to apply self-healing (automated resolution) and resolution-prior-to-failure (prediction) algorithms as discussed in yesterday’s post. Excitingly, it can collect service-centric data. It just can’t close the loop with it yet!
Below are three insightful tables from the Netrounds white paper:
Table 1 looks at the typical components (systems) that service assurance is comprised of. But more interestingly, it looks at the types of questions / challenges each traditional system is designed to resolve. You’ll have noticed that none of them directly answer any service quality questions (except perhaps inventory systems, which can be prone to having sketchy associations between services and the resources they utilise).
Table 2 takes a more data-centric approach. This becomes important when we look at the big picture here – ensuring reliable and effective delivery of customer services. Infrastructure failures are a fact of life, so improved service assurance models of the future will depend on automated and predictive methods… which rely on algorithms that need data. Again, we notice an absence of service-related data sets here (apart from Inventory again). You can see the constraints of the traditional data collection approach can’t you?
Table 3 instead looks at the goals of an ideal service-centric assurance solution. The traditional systems / data are convenient but clearly don’t align well to those goals. They’re constrained by what has been presented in tables 1 and 2. Even the highly touted panaceas of AI and ML are likely to struggle against those constraints.
What if we instead start with Table 3’s assurance of customer services in mind and work our way back? Or even more precisely, what if we start with an objective of perfect availability and performance of every customer service?
That might imply self-healing (automated resolution) and resolution prior to failure (prediction) as well as resilience. But let’s first give our algorithms (and dare I say it, AI/ML techniques) a better chance of success.
Working back – What must the data look like? What should the systems look like? What questions should these new systems be answering?
I just came across an interesting white paper from the Netrounds team titled, “Reimagining Service Assurance in the Digital Service Provider Era.” You can find a copy here. It’s well worth a read, so much so that I’ll unpack a few of the concepts it contains in a series of articles this week.
It rightly points out that, “Alarms and fault management are what most people think of when hearing the term service assurance. Classical service assurance systems do fall into this category, as they collect indicators from network devices (such as traps, syslog messages and telemetry data) and try to pinpoint faulty devices and interfaces that need fixing.”
This takes us into the rabbit-hole of what exactly is a service (a rabbit-hole that this article partly covers). But let’s put that aside for a moment and consider a service as being an end-to-end “thing” that a customer uses (and pays for, and therefore assumes will behave as “they” expect).
To borrow again from Netrounds, “… we must be able to measure and report on service KPIs in order to accurately measure network service quality from the end user, or customer, perspective. The KPIs should correspond to the service that the customer is paying for. For example, internet access services should measure network KPIs like loss, latency, jitter, and DNS and HTTP response times; a storage backup service should measure data throughput rate; IPTV should measure video frame loss, video buffer underrun events and channel zapping time; and VoIP should measure Mean Opinion Score (MOS).”
There’s just one problem with traditional assurance measuring techniques (eg traps, syslog messages). They are only an indirect proxy for the customer’s experience (and expectations) with the service they’re paying for. Traditional techniques just report on the links in the chain rather than the integrity of the entire length of chain. We have to look at each broken link and attempt to determine whether the chain’s integrity is actually impaired (considering the “meshing” that protects modern service chains). And if there is impairment, to then determine whose chain is impacted, in what way, and what priority needs to be given to its repair.
If we’re being completely honest, the customer doesn’t care about the chain links, or even their MOS score, only that they couldn’t understand what the person at the other end of the VoIP line was trying to communicate with them.
Exacerbating this further, increasing dependency on cloud and virtualised resources means that there are more chain links that fall outside our domain of visibility.
So, this thing that we’ve called service assurance for the last few decades might actually be a misnomer. We’ve definitely been monitoring the health of network devices and infrastructure (the links), but we tend to only be able to manage services (the chain) through reverse-engineering – by inference, brute force and wizardry.
Is there another way? Let’s dig further in tomorrow’s post.
We sometimes attack OSS/BSS planning at a quite transactional level. For example, think about the process of gathering detailed requirements at the start of a project. They tend to be detailed and transactional, don’t they? This type of requirement gathering is more like the WHAT and HOW rings in Simon Sinek’s Golden Circle.
Just curious, do you have a persona map that shows all of the different user groups that interact with your OSS/BSS?
More importantly, do you deeply understand WHY they interact with your OSS/BSS? Not just on a transaction-by-transaction level, but in the deeper context of how the organisation functions? Perhaps even on a psychological level?
If you do, you’re in a great position to apply the 10:10:10 mapping rule. That is, to describe how you’re adding value to each user group 10 minutes from now, 10 days from now and 10 months from now…
The mapping table could describe the current tense (ie how your OSS/BSS is currently adding value), or serve as a planning mechanism for a future tense (ie how your OSS/BSS can add value in the future).
This mapping table can act as a guide for the evolution of your solution.
I should also point out that the diagram above only shows a sample of the internal personas that directly interact with your OSS/BSS. But I’d encourage you to look further. There are other personas that have direct and indirect engagement with your OSS/BSS. These include internal stakeholders like project sponsors, executives, data consumers, etc. They also include external stakeholders such as end-customers, regulatory bodies, etc.
If you need assistance to unlock your current state through persona mapping, real process mapping, etc and then planning out your target-state, Passionate About OSS would be delighted to help.
The diagram below attempts to demonstrate the concept visually, in the form of three important sliders.
When it comes to the technical delivery, it makes sense that most of the responsibility falls upon the supplier. They obviously have the greater know-how from building and implementing their own products. However, and despite what some clients expect, you’ll notice that the slider isn’t all the way to the left. The client can’t just “throw the hand grenade over the fence” and expect the supplier to build the solution in isolation. The client needs to be involved to ensure the solution is configured to their unique requirements. This covers factors such as network types, service types, process models, naming conventions, personas supported, integrations, approvals, etc.
Unfortunately, organisational change is an afterthought far too often on OSS projects. Not only that, but the client often expects the supplier to handle that too. They expect the slider to fall far to the left too. In my opinion, this is completely unrealistic. In most cases, the supplier simply doesn’t have the knowledge of, or influence over, the individuals within the client’s organisation. That’s why the middle slider falls mostly towards the right-hand (client) side. Not all the way though because the supplier will have suggestions / input / training based on learnings from past implementations. BTW. The link above also describes an important perspective shift to help the org change aspect of OSS transformation.
And lastly, the success of a project relies on strength of relationship throughout, but also far beyond, the initial implementation. You’d expect that most OSS implementations will have a useful life of many years. Due to the complexity of OSS transformations, clients want to stay with the same supplier for long periods because they don’t want to endure a change-out. Like any relationships, trust plays an important role. The relationship clearly has to be beneficial to both parties. Unfortunately, three factors often doom OSS relationships from the outset.
Firstly, the sliders above show my unbiased perspective of the weight of responsibility on a generic OSS project. If each party has a vastly different expectation of slider positioning, then the project can be off to a difficult (but all-too-common) start.
Secondly, the nature of the vendor selection process can also gnaw away at trust quite quickly. The client wants an as-low-as-possible cost in the contract (obviously). The supplier wants to win the bid, so they keep costs as low as possible, often hoping to make up the difference through the inevitable variations that happen on these complex projects.
And thirdly, the complexity of these projects means challenges almost always arise, which can result in cynicism being hurled across the fence by both parties.
You may be wondering why the third slider isn’t perfectly centred between both. You may claim that significant responsibility for humility, fairness and forgiveness lies with each participant to ensure a long-lasting, trusted relationship. I’d agree with you on that, but I’d also argue that the supplier carries slightly more responsibility as they (usually) hold a slight balance of power. They know the client doesn’t want to endure another OSS change-out project any time soon, so the client generally has more to lose from a relationship breakdown. Unfortunately, I’ve seen this leveraged by vendors too many times.
Do you agree/disagree with these observations? I’d love to hear your thoughts.
Oh, and if you ever need an independent third party to help set the right balance of expectations across these sliders on your project, you’re welcome to call upon Passionate About OSS to assist.
Over the many OSS implementation projects I’ve worked on, UI/UX (user interface / user experience) has been an afterthought (if even thought about at all). I know there are OSS UI/UX experts out there (I’ve met a handful), but none have ever been assigned to the projects I’ve worked on unfortunately. UI has always just been the domain of the developer. If the functionality worked (even if in a highly convoluted way), then the developer would move on to the next epic. The UI was almost never re-visited unless the functionality proved to be almost unusable.
So the question becomes, how do we observe, measure and trial UI/UX effectiveness?
Have you ever tried running a heat-mapping analysis over your OSS to show actual user behaviour?
Given that almost all OSS are now browser-based, there are plenty of heat-map tools available. They give results that might look something like this (but can also provide more granularity of analysis too):
Whereas these tools are generally used by web developers to improve retention and conversion rates on their pages (ie customers buying, clicking through on a banner ad, calls to action, etc), we’ll use them in a different way within our OSS. We’ll instead be looking for efficiency of movement, an indicator of whether the design of our page is intuitive or not. Are the operators of your OSS clicking in the right places (menus, titles, buttons, links, etc) to achieve certain outcomes?
I’d be particularly interested in comparing heat-maps of new operators (eg if you’ve installed a sand-pit environment at a client site for the first time and let the client’s operators loose) versus experienced teams. Depending on the OSS application you’re analysing, you may even be interested in observing different behaviours across different devices (eg desktops, phones, tablets).
There’s generally a LOT of functionality available within each OSS. Are we optimising the experience for the functionality that matters most? For web-page designers, that might mean ensuring all the most important information is “above-the-fold” (ie can be seen on the page without having to scroll down – noting that the “fold” will be different across different devices/resolutions). If they want a user to click on the “buy now” button, then they *may* want that to appear above the fold, ensuring the prospective buyer doesn’t have to go searching for it.
In the case of an OSS, you don’t want to hide the most important functionality under layers of menus. And don’t forget that different personas (eg designers, admins, execs, help-desk, NOC, etc) are likely to have different definitions of “important functionality.” You may want to optimise important UI elements for each different persona (if your OSS allows for that level of configurability).
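If your heat-mapping tool doesn’t support that kind of persona-by-persona slicing, a rough home-grown comparison is possible by binning exported click coordinates into a coarse grid and diffing the grids. A minimal sketch, assuming the clicks have already been exported as (x, y) pixel coordinates:

```python
import numpy as np

def click_heatmap(clicks, width=1920, height=1080, cell=40):
    """Bin (x, y) click coordinates into a coarse grid for comparison between user groups."""
    grid = np.zeros((height // cell, width // cell))
    for x, y in clicks:
        grid[min(y // cell, grid.shape[0] - 1), min(x // cell, grid.shape[1] - 1)] += 1
    return grid

new_ops = click_heatmap([(120, 80), (130, 85), (900, 600)])      # hypothetical exported clicks
experienced = click_heatmap([(120, 80), (125, 82), (118, 78)])
diff = new_ops - experienced                                     # where do behaviours diverge?
print(np.unravel_index(np.abs(diff).argmax(), diff.shape))       # grid cell of biggest difference
```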
I’m not endorsing Smartlook, but if you’d like to read more about heat-mapping techniques, you can start here.
OSS implementations / transformations are always challenging. Stakeholders seem to easily get their heads around the fact that there will be technical challenges (even if they / we can’t always get our heads around the actual changes initially).
When a supplier is charged with doing an OSS implementation, the client (perhaps rightly) expects the supplier to lead the technical implementation and guide the client through any challenges. It’s the, “Over to you!” client mentality at times.
However, it’s the change management challenges that are often overlooked and/or underestimated (by client and supplier alike). It’s far less realistic for a client to delegate these activities and challenges to the supplier. The supplier simply doesn’t have the reach or influence within the client’s organisation (unless they’re long-term trusted partners). Just doing a 2 week training course at the end of the implementation rarely works.
Now, if you do represent the client, change management starts all the way back at the start of the project – from the time we start to gather current and desired future state, including process and persona mappings.
At that time we can put ourselves in the shoes of each person impacted by OSS change and consider, “If your current normal is exactly what you need, then different isn’t worth exploring” (a Seth Godin quote).
How many times have you heard about operators bypassing their sophisticated new OSS and reverting back to their old spreadsheets (thus keeping an offline store of data that would be valuable to be stored in the OSS)?
Interestingly though, if you approach those same people before the OSS implementation and ask them whether their as-is spreadsheet model gives them exactly what they need, you will undoubtedly get some great insights (either yes it is and here’s why… or no it’s not because…).
You have a stronger position of influence with these operators if you involve them and listen pre-implementation than if you enforce change afterwards.
To again quote Seth, they’re not always, “hesitant about this new idea because it’s a risky, problematic, defective idea… [but] because it’s simply different than [they’re] used to.”
I was speaking with a friend today about an old OSS assurance product that is undergoing a refresh and investment after years of stagnation.
He indicated that it was to come with about 20 out of the box adaptors for data collection. I found that interesting because it was replacing a product that probably had in excess of 100 adaptors. Seemed like a major backward step… until my friend pointed out the types of adaptor in this new product iteration – Splunk, AWS, etc.
Our OSS no longer collect data directly from the network. We have web-scaled processes sucking everything out of the network / EMS, aggregating it and transforming / indexing / storing it. Then, like any other IT application, our OSS just collect what we need from a data set that has already been consolidated and homogenised.
I don’t know why I’d never thought about it like this before (ie building an architecture that doesn’t even consider connecting to the multitude of network / device / EMS types). In doing so, we lose the direct connection to the source, but we also reduce our integration tax load (directly to the OSS at least).
This is the third part of a series describing a really exciting analysis I’ve just finished.
Part 1 described how we can turn simple log files into a Sankey diagram that shows real-life process flows (not just a theoretical diagram drawn by BAs and SMEs), like below:
Part 2 described how the logs are broken down into a design tree and how we can assign weightings to each branch based on the data stored in the logs, as below:
I’ve already had lots of great feedback in relation to the Part 1 blog, especially from people who’ve had challenges capturing as-is process. The feedback has been greatly appreciated so I’m looking forward to helping them draw up their flow-charts on the way to helping optimise their process flows.
But that’s just the starting point. Today’s post is where things get really exciting (for me at least). Today we build on part 2 and not just record weightings, but use them to assist future decisions.
We can use the decision tree to “predict forward” and help operators / algorithms make optimal decisions whilst working towards process completion. We can use a feedback loop to steer an operator (or application) down the most optimal branches of the tree (and/or avoid the fall-out variants).
This allows us to create a closed-loop, self-optimising, Decision Support System (DSS), as follows:
Using log data alone, we can perform decision optimisation based on “likelihood of success” or “time to complete” as per the weightings table. If supplemented with additional data, the weightings table could also allow decisions to be optimised by “cost to complete” or many other factors.
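As a minimal sketch of that “predict forward” step, assume the weightings table has already been reduced to per-transition success rates and average durations; the structure and numbers below are illustrative only.

```python
# Hypothetical weightings derived from the log analysis: (current_state, next_state) -> stats
weightings = {
    ("DP1", "DP2"): {"count": 5200, "success_rate": 0.97, "avg_mins": 1.0},
    ("DP1", "DP3"): {"count": 800,  "success_rate": 0.99, "avg_mins": 4.5},
}

def recommend_next(current_state: str, optimise_for: str = "success_rate"):
    """Suggest the next decision-point branch based on historical weightings."""
    candidates = {nxt: stats for (cur, nxt), stats in weightings.items() if cur == current_state}
    if not candidates:
        return None
    if optimise_for == "avg_mins":                      # lower is better for time to complete
        return min(candidates, key=lambda n: candidates[n]["avg_mins"])
    return max(candidates, key=lambda n: candidates[n][optimise_for])

print(recommend_next("DP1"))                 # optimise by likelihood of success -> DP3
print(recommend_next("DP1", "avg_mins"))     # optimise by time to complete -> DP2
```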
The model has the potential to be used in “real-time” mode, using the constant stream of process logs to continually refine and adapt. For example:
If the long-term average of a process path is 1 minute, but there’s currently a problem and that path is failing, then another path (one that is otherwise slightly less optimised over the long-term) could be used until the first path is repaired
An operator happens to choose a new, more optimal path than has ever been identified previously (the delta function in the diagram). It then sets a new benchmark and informs the new approach via the DSS (Darwinian selection)
If you’re wondering how the DSS could be implemented, I can envisage a few ways:
Using existing RPA (Robotic Process Automation) tools [which are particularly relevant if the workflow box in the diagram above crosses multiple different applications (not just a single monolithic OSS/BSS)]
Providing a feedback path into the functionality of the OSS/BSS and its GUI
Via notifications (eg email, Slack, etc) to operators
Via a simple, more manual process like flow diagrams, work instructions, scorecards or similar
This visualisation is exciting because it shows how your processes are actually flowing (or not), as opposed to the theoretical process diagrams that are laboriously created by BAs in conjunction with SMEs. It also shows which branches in the flow are actually being utilised and where inefficiencies are appearing (and are therefore optimisation targets).
Some people have wondered how simple activity logs can be used to show the Sankey diagrams. Hopefully the diagram below helps to describe this. You scan the log data looking for variants / patterns of flows and overlay those onto a map of decision states (DPs). In the diagram above, there are only 3 DPs, but 303 different variants (sounds implausible, but there are many variants that do multiple loops through the 3 states and are therefore considered to be a different variant).
The numbers / weightings you see on the Sankey diagram are the number* of instances (of a single flow type) that have transitioned between two DPs / states.
* Note that this is not the same as the count value that appears in the Weightings table. We’ll get to that in tomorrow’s post when we describe how to use the weightings data for decision support.
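For anyone wanting to try this on their own logs, here’s a minimal sketch of that scanning step, assuming the log has already been reduced to an ordered list of decision states per flow instance. The counted transitions become the link weights on the Sankey, and the tuple of states identifies each variant.

```python
from collections import Counter, defaultdict

# Hypothetical log extract: flow instance id -> ordered list of decision states (DPs)
flows = {
    "ORD-001": ["DP1", "DP2", "DP3"],
    "ORD-002": ["DP1", "DP2", "DP2", "DP3"],     # a loop through DP2 makes this a distinct variant
    "ORD-003": ["DP1", "DP3"],                   # a skipped step (potential governance issue)
}

transitions = Counter()
variants = defaultdict(int)
for instance, states in flows.items():
    variants[tuple(states)] += 1                              # distinct path variants
    for a, b in zip(states, states[1:]):
        transitions[(a, b)] += 1                              # Sankey link weights

print(transitions)      # eg Counter({('DP1', 'DP2'): 2, ('DP2', 'DP3'): 2, ...})
print(len(variants), "variants")
```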
In your travels, I don’t suppose you’ve ever come across anyone having challenges to capture and/or optimise their as-is OSS/BSS process flows? Once or twice?? 🙂
Well I’ve just completed an analysis that I’m really excited about. It’s something I’ve been thinking about for some time, but have just finished proving on the weekend. I thought it might have relevance to you too. It quickly helps to visualise as-is process and identify areas to optimise.
The method takes activity logs (eg from OSS, ITIL, WFM, SAP or similar) and turns them into a process diagram (a Sankey diagram) like below with real instance volumes. Much better than a theoretical process map designed by BAs and SMEs don’t you think?? And much faster and more accurate too!!
A theoretical process map might just show a sequence of 3 steps, but the diagram above has used actual logs to show what’s really occurring. It highlights governance issues (skipped steps) and inefficiencies (ie the various loops) in the process too. Perfect for process improvement.
But more excitingly, it proves a path towards real-time “predict-forward” decision support without having to get into the complexities of AI. More has been included in the analysis!
If this is of interest to you, let me know and I’ll be happy to walk you through the full analysis. Or if you want to know how your real as-is processes perform, I’d be happy to help turn your logs into visuals like the one above.
PS1. You might think you need a lot of fields to prepare the diagrams above. The good news is the only mandatory fields would be something like:
Flow type – eg Order type, project type or similar (only required if the extract contains multiple flow types mixed together. The diagram above represents just one flow type)
Flow instance identifier – eg Order number, project number or similar (the diagram above was based on data that had around 600,000 flow instances)
Activity identifier – eg Activity name (as per the 3 states in the diagram above), recorded against each flow instance. Note that these will ideally come from an enumerated list (ie a finite pick-list)
Timestamps – Start/end timestamp on each activity instance
If the log contains other details such as the name of the operator who completed each activity, that can help add richness, but it’s not mandatory.
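And to show how little is needed to get from those fields to a picture, here’s a small sketch that turns transition counts into a Sankey diagram (using Plotly here, though any Sankey library would do). The counts are a made-up miniature of a real extract.

```python
# pip install plotly
import plotly.graph_objects as go

# Illustrative transition counts derived from the activity logs
transitions = {("DP1", "DP2"): 450_000, ("DP2", "DP3"): 430_000, ("DP1", "DP3"): 120_000}

labels = sorted({state for pair in transitions for state in pair})
idx = {label: i for i, label in enumerate(labels)}

fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[idx[a] for a, _ in transitions],
        target=[idx[b] for _, b in transitions],
        value=list(transitions.values()),
    ),
))
fig.show()
```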
PS2. The main objective of the analysis was to test concepts raised in the following blog posts:
“Whatever is well conceived is clearly said,
And the words to say it flow with ease.”
I’d like to hijack this quote and re-direct it towards architectures. Could we equally state that a well conceived architecture can be clearly understood? Some modern OSS/IT frameworks that I’ve seen recently are hugely complex. The question I’ve had to ponder is whether they’re necessarily complex. As the aphorism states, “Everything should be made as simple as possible, but not simpler.”
Just take in the complexity of this triptych I prepared to overlay SDN, NFV and MANO frameworks.
Yet this is only a basic model. It doesn’t consider networks with a blend of PNF and VNF (Physical and Virtual Network Functions). It doesn’t consider closed loop assurance. It doesn’t consider other automations, or omni-channel, or etc, etc.
Yesterday’s post raised an interesting concept from Tom Nolle that as our solutions become more complex, our ability to make a basic assessment of value becomes more strained. And by implication, we often need to upskill a team before even being able to assess the value of a proposed project.
It seems to me that we need simpler architectures to be able to generate persuasive business cases. But it poses the question, do they need to be complex or are our solutions just not well enough conceived yet?
To borrow a story from Wikiquote, “Richard Feynman, the late Nobel Laureate in physics, was once asked by a Caltech faculty member to explain why spin one-half particles obey Fermi Dirac statistics. Rising to the challenge, he said, “I’ll prepare a freshman lecture on it.” But a few days later he told the faculty member, “You know, I couldn’t do it. I couldn’t reduce it to the freshman level. That means we really don’t understand it.“
“…as technology gets more complicated, it becomes more difficult for buyers to acquire the skills needed to make even a basic assessment of value. Without such an assessment, it’s hard to get a project going, and in particular hard to get one going the right way.”
Have you noticed that over the last few years, OSS choice has proliferated, making project assessment more challenging? Previously, the COTS (Commercial Off-the-Shelf) product solution dominated. That was already a challenge because there are hundreds to choose from (there are around 400 on our vendors page alone). But that’s just the tip of the iceberg.
We now also have choices to make across factors such as:
Building OSS tools with open-source projects
An increasing amount of in-house development (as opposed to COTS implementations by the product’s vendors)
Smaller niche products that need additional integration
An increase in the number of “standards” that are seeking to solve traditional OSS/BSS problems (eg ONAP, ETSI’s ZSM, TM Forum’s ODA, etc, etc)
Revolutions from the IT world such as cloud, containerisation, virtualisation, etc
As Tom indicates in the quote above, the diversity of skills required to make these decisions is broadening. Broadening to the point where you generally need a large team to have suitable skills coverage to make even a basic assessment of value.
At Passionate About OSS, we’re seeking to address this in the following ways:
We have two development projects underway (more news to come)
One to simplify the vendor / product selection process
One to assist with up-skilling on open-source and IT tools to build modern OSS
In addition to existing pages / blogs, we’re assembling more content about “standards” evolution, which should appear on this blog in coming days
Use our “Finding an Expert” tool to match experts to requirements
And of course there are the variety of consultancy services we offer ranging from strategy, roadmap, project business case and vendor selection through to resource identification and implementation. Leave us a message on our contact page if you’d like to discuss more
Over the years in OSS, I’ve spent a lot of my time helping companies create their OSS / BSS strategies and roadmaps. Sometimes clients come from the buy side (eg carriers, utilities, enterprise), other times clients come from the sell side (eg vendors, integrators). There’s one factor that seems to be most commonly raised by these clients, and it comes from both sides.
What is that one factor? Well, we’ll come back to what that factor is a little later, but let’s cover some background first.
OSS / BSS covers a fairly broad estate of functionality:
If you’re from the buy-side, you need to manage both to build a full-function OSS/BSS suite. If you’re from the sell-side, you’re either forced into dealing with both (reactive) or sometimes you can choose to develop those to bring a more complete offering to market (proactive).
You will have noticed that both are double-ended. Integrations bring two applications / functions together. Relationships bring two organisations together.
This two-ended concept means there’s always a “far-side” that’s outside your control. It’s in our nature to worry about what’s outside our control. We tend to want to put controls around what we can’t control. Not only that, but it’s incumbent on us as organisation planners to put mitigation strategies in place.
Which brings us back to the one factor that is raised by clients on most occasions – substitution – how do we minimise our exposure to lock-in with an OSS product / service partner(s) if our partnership deteriorates?
Well, here are some thoughts:
Design your own architecture with product / partner substitution in mind (and regularly review your substitution plan because products are always evolving)
Develop multiple integrations so that you always have active equivalency. This is easier for sell-side “reactives” because their different customers will have different products to integrate to (eg an OSS vendor that is able to integrate with four different ITSM tools because they have different customers with each of those variants)
Enhance your own offerings so that you no longer require the partnership, but can do it yourself
Invest in your partnerships to ensure they don’t deteriorate. This is the OSS marriage analogy where ongoing mutual benefits encourage the relationship to continue.