Going to the OSS zoo

“There’s the famous quote that if you want to understand how animals live, you don’t go to the zoo, you go to the jungle. The Future Lab has really pioneered that within Lego, and it hasn’t been a theoretical exercise. It’s been a real design-thinking approach to innovation, which we’ve learned an awful lot from.”
Jorgen Vig Knudstorp.

This quote prompted me to ask the question – how many times during OSS implementations had I sought to understand user behaviour at the zoo versus the jungle?

By that, I mean: how many times had I simply spoken with the user’s representative on the project team rather than directly with end users? What about the less obvious personas as discussed in this earlier post about user personas? Had I visited the jungles where internal stakeholders like project sponsors, executives, data consumers, etc., or external stakeholders such as end-customers, regulatory bodies, etc., go about their daily lives?

I can truthfully, but regretfully, say I’ve spent far more time at OSS zoos than in jungles. This is something I need to redress.

But, at least I can claim to have spent most of my time in customer-facing roles.

Too many of the product development teams I’ve worked closely with don’t even visit OSS zoos, let alone jungles, in any given year. They never get close to observing real customers in their native environments.

 

Is your service assurance really service assurance?? (Part 4)

Yesterday’s post introduced the concept of active measurements as the better method for monitoring and assuring customer services.

Like the rest of this series, it borrowed from an interesting white paper from the Netrounds team titled, “Reimagining Service Assurance in the Digital Service Provider Era.”

Interestingly, I also just stumbled upon OpenTelemetry, an open source project designed to capture traces / metrics / logs from apps / microservices. It intrigued me because it introduces the concept of telemetry on spans (not just application nodes). Tomorrow’s article will explore how the concept of spans / traces / metrics / logs for apps might provide insight into the challenge we face getting true end-to-end metrics from our networks (as opposed to the easy to come by nodal metrics).

In the network world, we’re good at getting nodal metrics / logs / events, but not very good at getting trace data (ie end-to-end service chains, or an aggregation of spans in OpenTelemetry nomenclature). And if we can’t monitor traces, we can’t easily interpret a customer’s experience whilst they’re using their network service. We currently do “service assurance” by reverse-engineering nodal logs / events, which seems a bit backward to me.
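To make the span / trace terminology more tangible, here’s a minimal sketch using the OpenTelemetry Python SDK (opentelemetry-api / opentelemetry-sdk). The span names and attributes are purely hypothetical stand-ins for hops in a service chain, not anything prescribed by the project:

```python
# Minimal sketch of OpenTelemetry spans forming a trace, using the Python SDK.
# The span names below are hypothetical hops in a service chain, for illustration only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire up a tracer that prints finished spans to the console
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("service-chain-demo")

# A parent span (the end-to-end "trace") with child spans (the per-hop "spans")
with tracer.start_as_current_span("customer-service-request") as parent:
    parent.set_attribute("customer.id", "CUST-001")      # hypothetical attribute
    with tracer.start_as_current_span("access-network-hop"):
        pass  # e.g. time spent traversing the access domain
    with tracer.start_as_current_span("core-network-hop"):
        pass  # e.g. time spent traversing the core domain
```

The parent span is the closest analogue to the end-to-end “trace” we struggle to capture in the network world; each child span is one nodal contribution to it.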

Table 4 (from the Netrounds white paper link above) provides a view of the most common AI/ML techniques used. Classification and Clustering are useful techniques for alarm / event “optimisation” (filtering, grouping, correlating and prioritising alarms). That is, to effectively minimise the number of alarms / events a NOC operator needs to look at. In effect, traditional data collection allows AI / ML to remove the noise, but still leaves the problem to be solved manually (ie network assurance, not service assurance).

They’re helping optimise network / resource problems, but not solving the more important service-related problems, as articulated in Table 5 below (again from Netrounds).

If we can directly collect trace data (ie the “active measurements” described in yesterday’s post), we have the data to answer specific questions (which better aligns with our narrow AI technologies of today). To paraphrase questions in the Netrounds white paper, we can ask:

  • Has the digital service been properly activated?
  • What service level is currently being experienced by customers (and are SLAs being met)?
  • Is there an outage or degradation of end-to-end service chains (established over multi-domain, hybrid and multi-layered networks)?
  • Does feedback need to be applied (eg via an orchestration solution) to heal the network?

PS. Since we spoke about the AI / ML techniques of Classification and Clustering above, you might want to revisit an earlier post that discusses a contrarian approach to root-cause analysis that could use them too – Auto-releasing chaos monkeys to harden your network (CT/IR).

Is your service assurance really service assurance?? (Part 3)

Yep, this is the third part, so that might suggest that there were two lead-up articles prior to this one. Well, you’d be right:

  • The first proposed that most of what we refer to as “service assurance” is really only “network infrastructure” assurance.
  • The second then looked at the constraints we face in trying to reverse-engineer “network infrastructure” assurance into data that will allow us to assure customer services.

I should also point out that both posts, like today’s, were inspired by an interesting white paper from the Netrounds team titled, “Reimagining Service Assurance in the Digital Service Provider Era.”

Today we’ll discuss the approach/es to overcome the constraints described in yesterday’s post.

As shown via the inserted blue row in Table 6 below (source: Netrounds), a proposed solution is to use active measurements that reflect the end-to-end user experience.

The blue row in the table below only talks about the real-time monitoring of “synthetic user traffic.” However, there are at least two other active measurement techniques that I can think of:

  • We can monitor real user traffic if we use port-mirroring techniques
  • We can also apply techniques such as TR-069 to collect real-time customer service meta data

Note: There are strengths and weaknesses of each of the three approaches, but we won’t dive into that here. Maybe another time.
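To illustrate the synthetic-traffic flavour of active measurement, here’s a minimal sketch of a probe written in Python using only the standard library. The target URL and SLA threshold are hypothetical placeholders, and a production-grade tool would obviously do far more (scheduling, loss / jitter measurement, reporting, etc):

```python
# Minimal sketch of a "synthetic user traffic" active measurement:
# periodically time an HTTP fetch the way an end user would experience it.
# The target URL and SLA threshold below are hypothetical placeholders.
import time
import urllib.request

TARGET_URL = "https://example.com/"   # hypothetical service endpoint
SLA_THRESHOLD_MS = 500                # hypothetical SLA threshold

def probe_once(url: str) -> float:
    """Return the end-to-end response time for one synthetic request, in ms."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()               # pull the full payload, like a real user would
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    for _ in range(5):                # a tiny sample; a real probe runs continuously
        elapsed_ms = probe_once(TARGET_URL)
        status = "OK" if elapsed_ms <= SLA_THRESHOLD_MS else "SLA BREACH"
        print(f"{elapsed_ms:.1f} ms  {status}")
        time.sleep(1)
```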

You may recall in yesterday’s post that we couldn’t readily ask service-related questions of our traditional systems or data. Excitingly though, active measurement solutions do allow us to ask more customer-centric questions, like those shown in the orange box below. We can start to collect metrics that do relate directly to what the customer is paying for (eg real data throughput rates on a storage backup service). We can start to monitor real SLA metrics, not just proxy / vanity metrics (like device up-time).

Interestingly, I’ve only had the opportunity to use one vendor’s active measurement solutions so far (one synthetic transaction tool and one port-mirror tool). [The vendor is not Netrounds, I should add. I haven’t seen Netrounds’ solution yet, just their insightful white paper]. Figure 3 actually does a great job of articulating why that vendor’s UI (user interface) and APIs are currently lacking.

Whilst they do collect active metrics, the UI doesn’t allow the user to easily ask important service health questions of the data, like those in the orange box. Instead, the user has to dig around in all the metrics and make their own inferences. Similarly, the APIs don’t allow for the identification of events (eg threshold crossings) or the automatic push of notifications to external systems.

This leaves a gap in our ability to apply self-healing (automated resolution) and resolution prior to failure (prediction) algorithms, as discussed in yesterday’s post. Excitingly, it can collect service-centric data. It just can’t close the loop with it yet!
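As a rough illustration of the missing “close the loop” piece, the sketch below evaluates a collected service metric against an SLA threshold and pushes a threshold-crossing event to an external system. The webhook URL, metric names and threshold are all hypothetical:

```python
# Minimal sketch of "closing the loop": detect a threshold crossing in
# service-centric metrics and push a notification to an external system.
# The webhook URL, metric names and threshold are all hypothetical.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.example.com/assurance"   # hypothetical receiver
LATENCY_SLA_MS = 150                                  # hypothetical SLA

def notify(event: dict) -> None:
    """POST a threshold-crossing event to an external orchestrator / notifier."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

def check_sample(service_id: str, latency_ms: float) -> None:
    if latency_ms > LATENCY_SLA_MS:
        notify({
            "service_id": service_id,
            "metric": "latency_ms",
            "value": latency_ms,
            "threshold": LATENCY_SLA_MS,
            "action_hint": "trigger-healing-workflow",   # eg via an orchestration solution
        })

check_sample("SVC-1234", 210.0)   # would raise a notification
```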

More on the data tomorrow!

Is your service assurance really service assurance?? (Part 2)

In yesterday’s article, we asked whether what many know as service assurance can rightfully be called service assurance. Yesterday’s post, like today’s, was inspired by an interesting white paper from the Netrounds team titled, “Reimagining Service Assurance in the Digital Service Provider Era.”

Below are three insightful tables from the Netrounds white paper:

Table 1 looks at the typical components (systems) that service assurance comprises. But more interestingly, it looks at the types of questions / challenges each traditional system is designed to resolve. You’ll notice that none of them directly answer any service quality questions (except perhaps inventory systems, which can be prone to having sketchy associations between services and the resources they utilise).

Table 2 takes a more data-centric approach. This becomes important when we look at the big picture here – ensuring reliable and effective delivery of customer services. Infrastructure failures are a fact of life, so improved service assurance models of the future will depend on automated and predictive methods… which rely on algorithms that need data. Again, we notice an absence of service-related data sets here (apart from Inventory again). You can see the constraints of the traditional data collection approach, can’t you?

Table 3 instead looks at the goals of an ideal service-centric assurance solution. The traditional systems / data are convenient but clearly don’t align well to those goals. They’re constrained by what has been presented in tables 1 and 2. Even the highly touted panaceas of AI and ML are likely to struggle against those constraints.

What if we instead start with Table 3’s assurance of customer services in mind and work our way back? Or even more precisely, what if we start with an objective of perfect availability and performance of every customer service?

That might imply self-healing (automated resolution) and resolution prior to failure (prediction) as well as resilience. But let’s first give our algorithms (and dare I say it, AI/ML techniques) a better chance of success.

Working back – What must the data look like? What should the systems look like? What questions should these new systems be answering?

More tomorrow.

Is your service assurance really service assurance??

I just came across an interesting white paper from the Netrounds team titled, “Reimagining Service Assurance in the Digital Service Provider Era.” You can find a copy here. It’s well worth a read, so much so that I’ll unpack a few of the concepts it contains in a series of articles this week.

It rightly points out that, “Alarms and fault management are what most people think of when hearing the term service assurance. Classical service assurance systems do fall into this category, as they collect indicators from network devices (such as traps, syslog messages and telemetry data) and try to pinpoint faulty devices and interfaces that need fixing.”

This takes us into the rabbit-hole of what exactly is a service (a rabbit-hole that this article partly covers). But let’s put that aside for a moment and consider a service as being an end-to-end “thing” that a customer uses (and pays for, and therefore assumes will behave as “they” expect).

To borrow again from Netrounds, “… we must be able to measure and report on service KPIs in order to accurately measure network service quality from the end user, or customer, perspective. The KPIs should correspond to the service that the customer is paying for. For example, internet access services should measure network KPIs like loss, latency, jitter, and DNS and HTTP response times; a storage backup service should measure data throughput rate; IPTV should measure video frame loss, video buffer underrun events and channel zapping time; and VoIP should measure Mean Opinion Score (MOS).”
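To make that list a little more concrete, here’s a minimal sketch of how those service-centric KPI definitions might be modelled. The service types and KPI names are lifted directly from the quote above; the structure itself is just illustrative:

```python
# Minimal sketch of service-centric KPI definitions, lifted from the examples
# in the Netrounds quote above. A service-assurance layer could use a mapping
# like this to decide what to actively measure for each product type.
SERVICE_KPIS = {
    "internet_access": ["loss", "latency", "jitter", "dns_response_time", "http_response_time"],
    "storage_backup":  ["data_throughput_rate"],
    "iptv":            ["video_frame_loss", "video_buffer_underrun_events", "channel_zapping_time"],
    "voip":            ["mean_opinion_score"],
}

def kpis_for(service_type: str) -> list[str]:
    """Return the KPIs that reflect what the customer is actually paying for."""
    return SERVICE_KPIS.get(service_type, [])

print(kpis_for("iptv"))
```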

There’s just one problem with traditional assurance measuring techniques (eg traps, syslog messages). They are only an indirect proxy for the customer’s experience (and expectations) with the service they’re paying for. Traditional techniques just report on the links in the chain rather than the integrity of the entire length of the chain. We have to look at each broken link and attempt to determine whether the chain’s integrity is actually impaired (considering the “meshing” that protects modern service chains). And if there is impairment, we then have to determine whose chain is impacted, in what way, and what priority needs to be given to its repair.

If we’re being completely honest, the customer doesn’t care about the chain links, or even their MOS score, only that they couldn’t understand what the person at the other end of the VoIP line was trying to communicate with them.

Exacerbating this further, increasing dependency on cloud and virtualised resources means that there are more chain links that fall outside our domain of visibility.

So, this thing that we’ve called service assurance for the last few decades might actually be a misnomer. We’ve definitely been monitoring the health of network devices and infrastructure (the links), but we tend to only be able to manage services (the chain) through reverse-engineering – by inference, brute force and wizardry.

Is there another way? Let’s dig further in tomorrow’s post.

Three OSS project responsibility sliders

Last week we shared an article that talked about the different expectations from suppliers and clients when undertaking an OSS implementation project.

The diagram below attempts to demonstrate the concept visually, in the form of three important sliders.

OSS Responsibility Sliders

When it comes to the technical delivery, it makes sense that most of the responsibility falls upon the supplier. They obviously have the greater know-how from building and implementing their own products. However, and despite what some clients expect, you’ll notice that the slider isn’t all the way to the left. The client can’t just “throw the hand grenade over the fence” and expect the supplier to build the solution in isolation. The client needs to be involved to ensure the solution is configured to their unique requirements. This covers factors such as network types, service types, process models, naming conventions, personas supported, integrations, approvals, etc.

Unfortunately, organisational change is an afterthought far too often on OSS projects. Not only that, but the client often expects the supplier to handle that too. They expect the slider to fall far to the left too. In my opinion, this is completely unrealistic. In most cases, the supplier simply doesn’t have the knowledge of, or influence over, the individuals within the client’s organisation. That’s why the middle slider falls mostly towards the right-hand (client) side. Not all the way though because the supplier will have suggestions / input / training based on learnings from past implementations. BTW. The link above also describes an important perspective shift to help the org change aspect of OSS transformation.

And lastly, the success of a project relies on the strength of the relationship throughout, but also far beyond, the initial implementation. You’d expect that most OSS implementations will have a useful life of many years. Due to the complexity of OSS transformations, clients want to stay with the same supplier for long periods because they don’t want to endure a change-out. As in any relationship, trust plays an important role. The relationship clearly has to be beneficial to both parties. Unfortunately, three factors often doom OSS relationships from the outset.

Firstly, the sliders above show my unbiased perspective of the weight of responsibility on a generic OSS project. If each party has a vastly different expectation of slider positioning, then the project can be off to a difficult (but all-too-common) start.

Secondly, the nature of the vendor selection process can also gnaw away at trust quite quickly. The client wants an as-low-as-possible cost in the contract (obviously). The supplier wants to win the bid, so they keep costs as low as possible, often hoping to make up the difference through the inevitable variations that happen on these complex projects.

And thirdly, the complexity of these projects means challenges almost always arise, which can cause cynicism to be hurled across the fence by both parties.

You may be wondering why the third slider isn’t perfectly centred between both. You may claim that significant responsibility for humility, fairness and forgiveness lies with each participant to ensure a long-lasting, trusted relationship. I’d agree with you on that, but I’d also argue that the supplier carries slightly more responsibility as they (usually) hold a slight balance of power. They know the client doesn’t want to endure another OSS change-out project any time soon, so the client generally has more to lose from a relationship breakdown. Unfortunately, I’ve seen this leveraged by vendors too many times.

Do you agree/disagree with these observations? I’d love to hear your thoughts.

Oh, and if you ever need an independent third party to help set the right balance of expectations across these sliders on your project, you’re welcome to call upon Passionate About OSS to assist.

This OSS is different to what I’m used to

OSS implementations / transformations are always challenging. Stakeholders seem to easily get their heads around the fact that there will be technical challenges (even if they / we can’t always get our heads around the actual changes initially).

When a supplier is charged with doing an OSS implementation, the client (perhaps rightly) expects the supplier to lead the technical implementation and guide the client through any challenges. It’s the, “Over to you!” client mentality at times.

However, it’s the change management challenges that are often overlooked and/or underestimated (by client and supplier alike). It’s far less realistic for a client to delegate these activities and challenges to the supplier. The supplier simply doesn’t have the reach or influence within the client’s organisation (unless they’re long-term trusted partners). Just doing a two-week training course at the end of the implementation rarely works.

Now, if you do represent the client, change management starts all the way back at the start of the project – from the time we start to gather current and desired future state, including process and persona mappings.

At that time we can put ourselves in the shoes of each person impacted by OSS change and consider, “If your current normal is exactly what you need, then different isn’t worth exploring” (a Seth Godin quote).

How many times have you heard about operators bypassing their sophisticated new OSS and reverting to their old spreadsheets (thus keeping an offline store of data that would be valuable if stored in the OSS)?

Interestingly though, if you approach those same people before the OSS implementation and ask them whether their as-is spreadsheet model gives them exactly what they need, you will undoubtedly get some great insights (either “yes it is, and here’s why…” or “no it’s not, because…”).

You have a stronger position of influence with these operators if you involve them and listen pre-implementation than if you enforce change afterwards.

To again quote Seth, they’re not always, “hesitant about this new idea because it’s a risky, problematic, defective idea… [but] because it’s simply different than [they’re] used to.”

A modern twist on OSS architecture

I was speaking with a friend today about an old OSS assurance product that is undergoing a refresh and investment after years of stagnation.

He indicated that it was to come with about 20 out of the box adaptors for data collection. I found that interesting because it was replacing a product that probably had in excess of 100 adaptors. Seemed like a major backward step… until my friend pointed out the types of adaptor in this new product iteration – Splunk, AWS, etc.

Of course!!

Our OSS no longer collect data directly from the network. We have web-scaled processes sucking everything out of the network / EMS, aggregating it and transforming / indexing / storing it. Then, like any other IT application, our OSS just collect what we need from a data set that has already been consolidated and homogenised.

I don’t know why I’d never thought about it like this before (ie building an architecture that doesn’t even consider connecting to the multitude of network / device / EMS types). In doing so, we lose the direct connection to the source, but we also reduce our integration tax load (directly to the OSS at least).
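As a rough sketch of what that looks like in practice, the snippet below consumes from an already-aggregated event stream rather than integrating with each network / device / EMS type directly. It assumes a Kafka-style bus and the kafka-python client; the broker address, topic and schema are hypothetical:

```python
# Minimal sketch of collecting from an aggregated / homogenised event stream
# (a Kafka-style bus in this example, via the kafka-python client) rather than
# integrating directly with each network / device / EMS type.
# Broker address, topic name and event schema are hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "network-events",                              # hypothetical consolidated topic
    bootstrap_servers="aggregator.example.com:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # The OSS only ever sees the homogenised schema, not vendor-specific formats
    print(event.get("device"), event.get("severity"), event.get("summary"))
```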

I’m really excited by a just-finished OSS analysis (part 3)

This is the third part of a series describing a really exciting analysis I’ve just finished.

Part 1 described how we can turn simple log files into a Sankey diagram that shows real-life process flows (not just a theoretical diagram drawn by BAs and SMEs), like below:

Part 2 described how the logs are broken down into a design tree and how we can assign weightings to each branch based on the data stored in the logs, as below:
OSS Decision Tree Analysis

I’ve already had lots of great feedback in relation to the Part 1 blog, especially from people who’ve had challenges capturing as-is process. The feedback has been greatly appreciated so I’m looking forward to helping them draw up their flow-charts on the way to helping optimise their process flows.

But that’s just the starting point. Today’s post is where things get really exciting (for me at least). Today we build on part 2 and not just record weightings, but use them to assist future decisions.

We can use the decision tree to “predict forward” and help operators / algorithms make optimal decisions whilst working towards process completion. We can use a feedback loop to steer an operator (or application) down the most optimal branches of the tree (and/or avoid the fall-out variants).

This allows us to create a closed-loop, self-optimising, Decision Support System (DSS), as follows:

Note: Diagram sourced from https://passionateaboutoss.com/closing-the-loop-to-make-better-decisions, where further explanation is provided

Using log data alone, we can perform decision optimisation based on “likelihood of success” or “time to complete” as per the weightings table. If supplemented with additional data, the weightings table could also allow decisions to be optimised by “cost to complete” or many other factors.
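Here’s a minimal sketch of that “predict forward” idea. The weightings are hypothetical, but they show how a DSS could recommend the next branch based on either likelihood of success or time to complete:

```python
# Minimal sketch of "predicting forward" from a weightings table: given the
# current decision point, score the candidate next branches and recommend the
# best one. The weightings below are hypothetical illustrations only.
WEIGHTINGS = {
    # (from_state, to_state): {"success_rate": ..., "avg_minutes": ...}
    ("DP1", "DP2"): {"success_rate": 0.97, "avg_minutes": 1.0},
    ("DP1", "DP3"): {"success_rate": 0.99, "avg_minutes": 4.5},
}

def recommend_next(current_state: str, optimise_for: str = "success_rate") -> str:
    """Pick the next branch with the best weighting for the chosen criterion."""
    candidates = {to: w for (frm, to), w in WEIGHTINGS.items() if frm == current_state}
    if optimise_for == "avg_minutes":
        return min(candidates, key=lambda to: candidates[to]["avg_minutes"])
    return max(candidates, key=lambda to: candidates[to]["success_rate"])

print(recommend_next("DP1"))                              # -> DP3 (highest success rate)
print(recommend_next("DP1", optimise_for="avg_minutes"))  # -> DP2 (fastest path)
```

Swap in “cost to complete” or any other column of the weightings table and the same selection logic applies.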

The model has the potential to be used in “real-time” mode, using the constant stream of process logs to continually refine and adapt. For example:

  • If the long-term average of a process path is 1 minute, but there’s currently a problem and that path is failing, then another path (one that is otherwise slightly less optimised over the long term) could be used until the first path is repaired
  • An operator happens to choose a new, more optimal path than has ever been identified previously (the delta function in the diagram). This sets a new benchmark and the new approach is fed back via the DSS (Darwinian selection)

If you’re wondering how the DSS could be implemented, I can envisage a few ways:

  1. Using existing RPA (Robotic Process Automation) tools [which are particularly relevant if the workflow box in the diagram above crosses multiple different applications (not just a single monolithic OSS/BSS)]
  2. Providing a feedback path into the functionality of the OSS/BSS and its GUI
  3. Via notifications (eg email, Slack, etc) to operators
  4. Via a simple, more manual process like flow diagrams, work instructions, scorecards or similar
  5. You can probably envisage other methods

I’m really excited by a just-finished OSS analysis (part 2)

As the title suggests, this is the second part in a series describing a process flow visualisation, optimisation and decision support methodology that uses simple log data as input.

Yesterday’s post, part 1 in the series, showed the visualisation aspect in the form of a Sankey flow diagram.

This visualisation is exciting because it shows how your processes are actually flowing (or not), as opposed to the theoretical process diagrams that are laboriously created by BAs in conjunction with SMEs. It also shows which branches in the flow are actually being utilised and where inefficiencies are appearing (and are therefore optimisation targets).

Some people have wondered how simple activity logs can be used to produce the Sankey diagrams. Hopefully the diagram below helps to describe this. You scan the log data looking for variants / patterns of flows and overlay those onto a map of decision states (DPs). In the diagram above, there are only 3 DPs, but 303 different variants (which sounds implausible, but there are many variants that do multiple loops through the 3 states and are therefore each counted as a different variant).

OSS Decision Tree Analysis

The numbers / weightings you see on the Sankey diagram are the number* of instances (of a single flow type) that have transitioned between two DPs / states.

* Note that this is not the same as the count value that appears in the Weightings table. We’ll get to that in tomorrow’s post when we describe how to use the weightings data for decision support.
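For those wondering what the scan of log data actually looks like, here’s a minimal sketch (with hypothetical log records) of turning raw activity logs into flow variants and DP-to-DP transition counts – the transition counts being the link weightings shown on the Sankey diagram:

```python
# Minimal sketch of turning raw activity logs into flow variants and
# DP-to-DP transition counts (the numbers shown on a Sankey diagram).
# The log records below are hypothetical.
from collections import Counter

# Each record: (flow_instance_id, decision_point), already ordered by time
log_records = [
    ("order-001", "DP1"), ("order-001", "DP2"), ("order-001", "DP3"),
    ("order-002", "DP1"), ("order-002", "DP2"), ("order-002", "DP2"),
    ("order-002", "DP3"),
]

# Group the states visited by each flow instance to identify its variant
flows: dict[str, list[str]] = {}
for flow_id, state in log_records:
    flows.setdefault(flow_id, []).append(state)

variants = Counter(tuple(path) for path in flows.values())
transitions = Counter(
    (path[i], path[i + 1]) for path in flows.values() for i in range(len(path) - 1)
)

print(variants)      # each distinct path (including loops) is a separate variant
print(transitions)   # e.g. ("DP1", "DP2"): 2 -> the weighting on that Sankey link
```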

Are modern OSS architectures well conceived?

“Whatever is well conceived is clearly said,
And the words to say it flow with ease.”
Nicolas Boileau-Despréaux.

I’d like to hijack this quote and re-direct it towards architectures. Could we equally state that a well conceived architecture can be clearly understood? Some modern OSS/IT frameworks that I’ve seen recently are hugely complex. The question I’ve had to ponder is whether they need to be so complex. As the aphorism states, “Everything should be made as simple as possible, but not simpler.”

Just take in the complexity of this triptych I prepared to overlay SDN, NFV and MANO frameworks.

Yet this is only a basic model. It doesn’t consider networks with a blend of PNF and VNF (Physical and Virtual Network Functions). It doesn’t consider closed loop assurance. It doesn’t consider other automations, or omni-channel, etc, etc.

Yesterday’s post raised an interesting concept from Tom Nolle that as our solutions become more complex, our ability to make a basic assessment of value becomes more strained. And by implication, we often need to upskill a team before even being able to assess the value of a proposed project.

It seems to me that we need simpler architectures to be able to generate persuasive business cases. But it poses the question, do they need to be complex or are our solutions just not well enough conceived yet?

To borrow a story from Wikiquote, “Richard Feynman, the late Nobel Laureate in physics, was once asked by a Caltech faculty member to explain why spin one-half particles obey Fermi Dirac statistics. Rising to the challenge, he said, “I’ll prepare a freshman lecture on it.” But a few days later he told the faculty member, “You know, I couldn’t do it. I couldn’t reduce it to the freshman level. That means we really don’t understand it.””

Making a basic assessment of OSS value

“…as technology gets more complicated, it becomes more difficult for buyers to acquire the skills needed to make even a basic assessment of value. Without such an assessment, it’s hard to get a project going, and in particular hard to get one going the right way.”
Tom Nolle.

Have you noticed that over the last few years, OSS choice has proliferated, making project assessment more challenging? Previously, the COTS (Commercial Off-the-Shelf) product solution dominated. That was already a challenge because there are hundreds to choose from (there are around 400 on our vendors page alone). But that’s just the tip of the iceberg.

We now also have choices to make across factors such as:

  • Building OSS tools with open-source projects
  • An increasing amount of in-house development (as opposed to COTS implementations by the product’s vendors)
  • Smaller niche products that need additional integration
  • An increase in the number of “standards” that are seeking to solve traditional OSS/BSS problems (eg ONAP, ETSI’s ZSM, TM Forum’s ODA, etc, etc)
  • Revolutions from the IT world such as cloud, containerisation, virtualisation, etc

As Tom indicates in the quote above, the diversity of skills required to make these decisions is broadening. Broadening to the point where you generally need a large team to have suitable skills coverage to make even a basic assessment of value.

At Passionate About OSS, we’re seeking to address this in the following ways:

  • We have two development projects underway (more news to come)
    • One to simplify the vendor / product selection process
    • One to assist with up-skilling on open-source and IT tools to build modern OSS
  • In addition to existing pages / blogs, we’re assembling more content about “standards” evolution, which should appear on this blog in coming days
  • You can also use our “Finding an Expert” tool to match experts to requirements
  • And of course there’s the variety of consultancy services we offer, ranging from strategy, roadmap, project business case and vendor selection through to resource identification and implementation. Leave us a message on our contact page if you’d like to discuss further

The 3 states of OSS consciousness

The last four posts have discussed how our OSS/BSS need to cope with different modes of working to perform effectively. We started off with the thread of “group flow,” where multiple different users of our tools can work cohesively. Then we talked about how flow requires a lack of interruptions, yet many of the roles using our OSS actually need constant availability (ie to be constantly interrupted).

From a user experience (UI/UX) perspective, we need an awareness of the state the operator/s needs to be in to perform each step of an end-to-end process, be it:

  • Deep think or flow mode – where the operator needs uninterrupted time to resolve a complex and/or complicated activity (eg a design activity)
  • Constant availability mode – where the operator needs to quickly respond to the needs of others and therefore needs a stream of notifications / interruptions (eg network fault resolutions)
  • Group flow mode – where a group of operators need to collaborate effectively and cohesively to resolve a complex and/or complicated activity (eg resolve a cross-domain fault situation)

This is a strong argument for every OSS/BSS supplier to have UI/UX experts on their team. Yet most leave their UI/UX with their coders. They tend to take the perspective that if the function can be performed, it’s time to move on to building the next function. That was the same argument used by all MP3 player suppliers before the iPod came along with its beautiful form and function and dominated the market.

Interestingly, modern architectural principles potentially make UI/UX design more challenging. With old, monolithic OSS/BSS, you at least had more control over end-to-end workflows (I’m not suggesting we should go back to the monoliths BTW). These days, you need to accommodate the unique nuances / inconsistencies of third-party modules like APIs / microservices.

As Evan Linwood incisively identified, “I guess we live in the age of cloud based API providers, theoretically enabling loads of pre-canned integration patterns, but these may not be ideal for a large service provider… Definitely if the underlying availability isn’t there, but could also occur through things like schema mismanagement across multiple providers? (Which might actually be an argument for better management / B/OSS, rather than against the use of microservices!)”

Am I convincing any of you to hire more UI/UX resources? Or convincing you to register for UI/UX as your next training course instead of learning a ninth programming language?

Put simply, we need your assistance to take our OSS from this…
Old MP3 player

To this…
iPod

OSS work practices that are repulsive

“I believe in the principle that deep work and constant availability are repulsive concepts (in the magnetic sense).”
Tyler Mumford, in comment 2 to this post.

This blogging thing really amazes me at times. I’m regularly left shocked at the serendipitous connections that form when writing posts. Take today’s post. I did a web search looking for the thread of an idea that had no relation at all to yesterday’s post. But of the millions of possible authors that could’ve come up in the search, the article I read first was by Cal Newport. The same Cal Newport as quoted in yesterday’s post. The two articles weren’t even from the same domain (BBC.com vs calnewport.com).

Not only that but the quote above from Tyler Mumford, in serendipitous response to Cal’s article, perfectly articulates what I was struggling to describe to close out yesterday’s post. Deep work and constant availability are indeed repulsive (ie mutually exclusive). Yet both exist within the activities performed using our OSS!!

Think about that for a moment.

There are some tasks that require constant availability (think about the NOC operators who have to respond urgently to any degradation in their network’s health).
There are other tasks that require deep work (think about the NOC operators who have to identify the root-cause of a really gnarly and catastrophic fault).

But the OSS user interfaces we build do little to separate them. The processes we design don’t consider their repulsiveness. Even the way we resource our OSS implementation projects suffers from this magnetic repulsion.

As an OSS implementer, I’ve always found it interesting that clients struggle to provide suitable expertise to steer the build, to ensure it’s configured precisely for their needs. I often quote the old parable of “you get back what you put in.” I still believe the saying, but there’s more to it than that.

An OSS implementation team needs significant input from the most knowledgeable end-users. They provide the local context, the tribal knowledge. But the most knowledgeable end-users are also the most valuable at performing BAU (business as usual) tasks [assuming you’re transforming an OSS whilst still maintaining an existing network]. But I’ve rarely seen a client get the balance right between providing expertise to the “build” and “run” streams in parallel. Even rarer have I seen a client expert who can quickly task-switch between build and run activities. It seems to be much more effective if client expert/s can be seconded to work on the OSS project team with few BAU activities. Tyler’s quote above helps to explain why.

Build mode requires deep work, for the most part (eg coding, process design, solution architecture, data mapping, etc). Run mode tends to require constant availability, with a few key exceptions (eg network designs, root-cause identification, etc). The two require separation.

So perhaps the parable should be, “you get back what you put in and separate out.” 🙂

Stealing Fire for OSS (part 2)

Yesterday’s post talked about the difference between “flow state” and “office state” in relation to OSS delivery. It referenced a book I’m currently reading called Stealing Fire.

The post mainly focused on how the interruptions of “office state” actually inhibit our productivity, learning and ability to think laterally on our OSS. But that got me thinking that perhaps flow doesn’t just relate to OSS project delivery. It also relates to post-implementation use of the OSS we implement.

If we think about the various personas who use an OSS (such as NOC operators, designers, order entry operators, capacity planners, etc), do our user interfaces and workflows assist or inhibit them to get into the zone? More importantly, if those personas need to work collaboratively with others, do we facilitate them getting into “group flow?”

Stealing Fire suggests that it costs around $500k to train each Navy SEAL and around $4.25m to train each elite SEAL (DEVGRU). It also describes how this level of training allows DEVGRU units to quickly get into group flow and function together almost as if choreographed, even in high-pressure / high-noise environments.

Contrast this with collaborative activities within our OSS. We use tickets, emails, Slack notifications, work order activity lists, etc to collaborate. It seems to me that these are the precise instruments that prevent us from getting into flow individually. I assume it’s the same collectively. I can’t think back to any end-to-end OSS workflows that seem highly choreographed or seamlessly effective.

Think about it. If you experience significant rates of process fall-out / error, then it would seem to indicate an OSS that’s not conducive to group flow. Ditto for lengthy O2A (order to activate) or T2R (trouble to resolve) times. Ditto for bringing new products to market.

I’d love to hear your thoughts. Has any OSS environment you’ve worked in facilitated group flow? If so, was it the people and/or the tools? Alternatively, have the OSS you’ve used inhibited group flow?

PS. Stealing Fire details how organisations such as Google and DARPA are investing heavily in flow research. They can obviously see the pay-off from those investments (or potential pay-offs). We seem to barely even invest in UI/UX experts to assist with the designs of our OSS products and workflows.

Lightning strikes in OSS

Operators have developed many unique understandings of what impacts the health of their networks.

For example, mobile operators know that they have faster maintenance cycles in coastal areas than they do in warm, dry areas (yes, due to rust). Other operators have a high percentage of faults that are power-related. Others are impacted by failures caused by lightning strikes.

Near-real-time weather pattern and lightning strike data is now readily accessible, potentially for use by our OSS.

I was just speaking with one such operator last week who said, “We looked at it [using lightning strike data] but we ended up jumping at shadows most of the time. We actually started… looking for DSLAM alarms which will show us clumps of power failures and strikes, then we investigate those clumps and determine a cause. Sometimes we send out a single truck to collect artifacts, photos of lightning damage to cables, etc.”
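To make that “clumping” idea concrete, here’s a minimal sketch (with hypothetical alarm records and window size) that groups DSLAM power alarms occurring close together in time – a crude proxy for a storm cell or regional power event:

```python
# Minimal sketch of "clumping" DSLAM power alarms: group alarms that occur
# within a short window of each other, as a cheap proxy for a storm cell or
# regional power event. Alarm timestamps and the window size are hypothetical.
from datetime import datetime, timedelta

alarms = [  # (timestamp, dslam_id) -- hypothetical power-failure alarms, time-ordered
    (datetime(2019, 1, 10, 14, 2), "DSLAM-A"),
    (datetime(2019, 1, 10, 14, 4), "DSLAM-B"),
    (datetime(2019, 1, 10, 14, 5), "DSLAM-C"),
    (datetime(2019, 1, 10, 18, 40), "DSLAM-D"),
]

WINDOW = timedelta(minutes=15)

clumps, current = [], [alarms[0]]
for prev, nxt in zip(alarms, alarms[1:]):
    if nxt[0] - prev[0] <= WINDOW:
        current.append(nxt)          # same clump: close enough in time
    else:
        clumps.append(current)       # gap too large: start a new clump
        current = [nxt]
clumps.append(current)

for clump in clumps:
    print(len(clump), "alarms:", [dslam for _, dslam in clump])
```

A real implementation would clump on geography as well as time, but the principle of investigating clumps (rather than raw strike data) is the same.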

That discussion got me wondering about what other lateral approaches are used by operators to assure their networks. For example:

  1. What external data sources do you use (eg meteorology, lightning strike, power feed data from power suppliers or sensors, sensor networks, etc)?
  2. Do you use them in proactive or reactive mode (eg to diagnose a fault or to use engineering techniques to prevent faults)?
  3. Have you built algorithms (eg root-cause, predictive maintenance, etc) to utilise your external data sources?
  4. If so, do those algorithms help establish automated closed-loop detect and respond cycles?
  5. By measuring and managing, has it created quantifiable improvements in your network health?

I’d love to hear about your clever and unique insight-generation ideas. Or even the ideas you’ve proposed that haven’t been built yet.

What if most OSS/BSS are overkill? Planning a simpler version

You may recall a recent article that provided a discussion around the demarcation between OSS and BSS, which included the following graph:

Note that this mapping is just my interpretation of the demarcation, not a definitive guide. It’s definitely open to differing opinions (ie religious wars).

Many of you will be familiar with the framework that the mapping is overlaid onto – TM Forum’s TAM (The Application Map). Version R17.5.1 in this case. It is as close as we get to a standard mapping of OSS/BSS functionality modules. I find it to be a really useful guide, so today’s article is going to call on the TAM again.

As you would’ve noticed in the diagram above, there are many, many modules that make up the complete OSS/BSS estate. And note that the diagram only includes Level 2 mapping. The TAM recommendation gets a lot more granular than this. This level of granularity can be really important for large, complex telcos.

For the OSS/BSS that support smaller telcos, network providers or utilities, this might be overkill. Similarly, there are OSS/BSS vendors that want to cover all or large parts of the entire estate for these types of customers. But as you’d expect, they don’t want to provide the same depth of functionality coverage that the big telcos might need.

As such, I thought I’d provide the cut-down TAM mapping below for those who want a less complex OSS/BSS suite.

It’s a really subjective mapping because each telco, provider or vendor will have their own perspective on mandatory features or modules. Hopefully it provides a useful starting point for planning a low complexity OSS/BSS.

Then what high-level functionality goes into these building blocks? That’s possibly even more subjective, but here are some hints:

OSS change…. but not too much… oh no…..

Let me start today with a question:
Does your future OSS/BSS need to be drastically different to what it is today?

Please leave me a comment below, answering yes or no.

I’m going to take a guess that most OSS/BSS experts will answer yes to this question, that our future OSS/BSS will change significantly. It’s the reason I wrote the OSS Call for Innovation manifesto some time back. As great as our OSS/BSS are, there’s still so much need for improvement.

But big improvement needs big change. And big change is scary, as Tom Nolle points out:
“IT vendors, like most vendors, recognize that too much revolution doesn’t sell. You have to creep up on change, get buyers disconnected from the comfortable past and then get them to face not the ultimate future but a future that’s not too frightening.”

Do you feel like we’re already in the midst of a revolution? Cloud computing, web-scaling and virtualisation (of IT and networks) have been partly responsible for it. Agile and continuous integration/delivery models too.

The following diagram shows a “from the moon” level view of how I approach (almost) any new project.

The key to Tom’s quote above is in step 2. Just how far, or how ambitious, into the future are you projecting your required change? Do you even know what that future will look like? After all, the environment we’re operating within is changing so fast. That’s why Tom is suggesting that for many of us, step 2 is just a “creep up on it change.” The gap is essentially small.

The “creep up on it change” means just adding a few new relatively meaningless features at the end of the long tail of functionality. That’s because we’ve already had the most meaningful functionality in our OSS/BSS for decades (eg customer management, product / catalog management, service management, service activation, network / service health management, inventory / resource management, partner management, workforce management, etc). We’ve had the functionality, but that doesn’t mean we’ve perfected the cost or process efficiency of using it.

So let’s say we look at step 2 with a slightly different mindset. Let’s say we don’t try to add any new functionality. We lock that down to what we already have. Instead we do re-factoring and try to pull the efficiency levers, which means changes to:

  1. Platforms (eg cloud computing, web-scaling and virtualisation as well as associated management applications)
  2. Methodologies (eg Agile, DevOps, CI/CD, noting of course that they’re more than just methodologies, but also come with tools, etc)
  3. Process (eg User Experience / User Interfaces [UX/UI], supply chain, business process re-invention, machine-led automations, etc)

It’s harder for most people to visualise what the Step 2 Future State looks like. And if it’s harder to envisage Step 2, how do we then move onto Steps 3 and 4 with confidence?

This is the challenge for OSS/BSS vendors, suppliers, integrators and implementers. How do we, “get buyers disconnected from the comfortable past and then get them to face not the ultimate future but a future that’s not too frightening?” And I should point out that it’s not just buyers we need to get disconnected from the comfortable past, but ourselves, myself definitely included.

In an OSS, what are O2A, T2R, U2C, P2O and DBA?

Let’s start with the last one first – DBA.

In the context of OSS/BSS, DBA has multiple meanings but I think the most relevant is Death By Acronym (don’t worry all you Database Administrators out there, I haven’t forgotten about you). Our industry is awash with TLAs (Three-Letter Acronyms) that lead to DBA.

Having said that, today’s article is about four that are commonly used in relation to end to end workflows through our OSS/BSS stacks. They often traverse different products, possibly even multiple different vendors’ products. They are as follows:

  • P2O – Prospect to Order – This workflow operates across the boundary between the customer and the customer-facing staff at the service provider. It allows staff to check what products can be offered to a customer. This includes service qualification (SQ) and feasibility checks, followed by the design, assign and reserve steps for resources.
  • O2A – Order to Activate – This workflow includes all activities to manage customer services across entire life-cycles. That is, not just the initial activation of a service, but in-flight changes during activation and post-activation changes as well
  • U2C – Usage to Cash – This workflow allows customers or staff to evaluate the usage or consumption of a service (or services) that has already been activated for a customer
  • T2R – Trouble to Resolve – This “workflow” is more like a bundle of workflows that relate to assuring the health of the services (and the network that carries them). They can be categorised as reactive (ie a customer triggers a resolution workflow by flagging an issue to the service provider) or proactive (ie the service provider identifies an issue, degradation or potential issue and triggers a resolution workflow internally)

If you’re interested in seeing how these workflows relate to the TM Forum APIs and specifically to NaaS (Network as a Service) designs, there’s a great document (TMF 909A v1.5) that can be found at the provided link. It shows the sub-elements (and associated APIs) that each of these workflows rely on.

PS. I recently read a vendor document that described additional flows:- I2I (Idea to Implementation – service onboarding, through a catalog presumably), P2P (Plan to Production – resource provisioning) and O2S (Order to Service). There’s also C2M (Concept to Market), L2C (Lead to Cash) and I’m sure I’m forgetting a number of others. Are there any additional TLAs that I should be listing here to describe end-to-end workflows?

Cool new feature – An OSS masquerading as…

I spent some time with a client going through their OSS/BSS yesterday. They’re an Australian telco with a primarily home-grown, browser-based OSS/BSS. One of its features was something I’ve never seen in an OSS/BSS before. But really quite subtle and cool.

They have four tiers of users:

  1. Super-admins (the carrier’s in-house admins),
  2. Standard (their in-house users),
  3. Partners (they use many channel partners to sell their services),
  4. Customer (the end-users of the carrier’s services).

All users have access to the same OSS/BSS, but just with different levels of functionality / visibility, of course.

Anyway, the feature that I thought was really cool was that the super-admins have access to what they call the masquerade function. It allows them to masquerade as any other user on the system without having to log out / log in to other accounts. This allows them to see exactly what each user is seeing and experience exactly what they’re experiencing (notwithstanding any platform or network access differences such as different browsers, response times, etc).
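In case you’re wondering how such a feature might hang together, here’s a minimal sketch of the masquerade concept – not the client’s actual implementation – where a super-admin session temporarily adopts another user’s identity and permission tier while the audit trail still records who is really acting:

```python
# Minimal sketch of a "masquerade" capability: a super-admin session adopts
# another user's identity / permission tier without logging out, while the
# audit trail still records who is really acting. Illustrative only,
# not the client's actual implementation.
from dataclasses import dataclass, field

@dataclass
class Session:
    real_user: str
    real_role: str                        # "super-admin", "standard", "partner" or "customer"
    masquerading_as: str | None = None
    effective_role: str = field(default="", init=False)

    def __post_init__(self):
        self.effective_role = self.real_role

    def masquerade(self, target_user: str, target_role: str) -> None:
        if self.real_role != "super-admin":
            raise PermissionError("only super-admins may masquerade")
        self.masquerading_as = target_user
        self.effective_role = target_role  # the UI now renders with this role

    def drop_masquerade(self) -> None:
        self.masquerading_as = None
        self.effective_role = self.real_role

session = Session(real_user="alice", real_role="super-admin")
session.masquerade("bob", "partner")
print(session.effective_role, "| audit as:", session.real_user)  # partner | audit as: alice
```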

This is clearly helpful for issue resolution, but I feel it’s even more helpful for design, feature release and testing across different personas.

In my experience at least, OSS/BSS builders tend to focus on a primary persona (eg the end-user) and can overlook multi-persona design and testing. The masquerade function can make this task easier.