What are OSS “platform wrapper” roadblocks?

OSS can be cumbersome at times. Making change can be difficult. We tend to build layers of protections around them and the networks we manage. I get that. Change can be risky (although the protections are often implemented because the OSS and/or network platforms might not be as robust as they could be).

Contrast this with the OSS we want to create. We want to create a platform for rapid innovation, the platform that helps us and our clients generate opportunities and advantages.

For us to build a platform that allows our customers (and their customers) to revolutionise their markets, we might have to consider whether the protective layers around our OSS that are stymying change. Things like firewall burns, change review boards, documentation, approvals, politics, individuals with a reticence to change, etc.

For example, Netflix takes a contrarian, whitelist approach to access by its engineers rather than a blacklist. It assumes that its engineers are professional enough to only use the tools that they need to get their tasks done. They enable their engineers to use commonly off-limits functionality such as adding their own DNS records (ie to support the stand-up of new infrastructure). But they also take a use-it-or-lose-it approach, monitoring the tools that the engineer uses and rescinding access to tools they haven’t used within 90 days. But if they do need access again, it’s as simple as a message on Slack to reinstate it.

This is just one small example of streamlining the platform wrapper. There are probably a million others.

When working on OSS projects as the integrator / installer, I’ve seen many of these “platform wrapper” roadblocks. I’m sure you have too. If you see them as the installer, chances are the ops team you hand over to will also experience these roadblocks.

Question though. Do you flag these platform wrapper roadblocks for improvement, or do you treat them as non-platform and therefore just live with them?

Only do the OSS that only you can do

A friend of mine has a great saying, “only do what only you can do.”

Do you think that this holds true for the companies undergoing digital transformation? Banks are now IT companies. Insurers are IT companies. Car manufacturers are now IT companies. Telcos are, well, some are IT companies.

We’ve spoken before about the skill transformations that need to happen within telcos if they’re to become IT companies. Some are actively helping their workforce to become more developer-centric. Some of the big telcos that I’ve been assisting in the last few years are embarking on bold Agile-led IT transformations. They’re cutting more of their own code and managing their own IT developments.

That’s exciting news for all of us in OSS. Even if it loses the name OSS in future, telcos will still need software that efficiently operationalises their networks. We have the overlapping skills in software, networks, business and operations.

But I wonder about the longevity of the in-house approach unless we come focus clearly on the first quote above. If all development is brought in-house, we end up with a lot of duplication across the industry. I’m not really sure that it makes sense doing all the heavy-lifting of all custom OSS tools when the heavy-lifting has already been done elsewhere.

It’s the old ebb and flow between in-house and outsourced OSS.

In my very humble opinion, it’s not just a choice between in-house and outsourced that matters. The more important decisions are around choosing to only develop the tools in-house that only you can do (ie the strategic differentiators).

A single glass of pain or single pane of glass??

Is your OSS a single pane of glass, or a single glass of pain?

You can tell I’m being a little flippant here. People often (perhaps idealistically) talk about OSS as being the single pane of glass (SPOG) to manage a network.

I say “idealistically” for a couple of reasons:

  1. There are usually many personas who interact with an OSS, each with vastly different user interface (UI) needs
  2. There is usually more than one OSS product in a client’s OSS suite, often from different vendors, with varying levels of integration

Where a single pane of glass can be a true ambition is as a consolidated health-status dashboard / portal, Invariably, this portal is used by executive / leader / manager personas who want to quickly see a single-screen health status that covers all networks and/or parts of the OSS suite. When things go wrong, this portal becomes the single glass of pain.

These single panes tend to be heavily customised for each organisation as every one has a unique set of metrics-that-matter. For those designing these panes, the key is to not just include vanity metrics, but to show information that the leader can action.

But the interesting perspective here is whether the single glass of pain is even relevant within your organisation’s culture. It’s just my opinion, but I prefer for coal-face workers to be empowered to make rapid recovery actions rather than requiring direction from up high in the org-chart. Coal-face workers generally have different tools with UIs that *should* help them monitor, manage and repair super-efficiently.

To get back to the “idealistic” comment above, each OSS UI needs to be fit-for-purpose for each unique persona (eg designers, product owners, network operations, etc). To me this implies that there is no single pane of glass…

I should caveat that by citing the example of an OSS search interface, something I’ve yet to see in OSS… although that’s just a front end to dozens of persona-specific panes of glass.

An OSS without the shackles of topology

It’s been nearly two decades since I designed my first root-cause analysis (RCA) rule. It was completely reliant on network topology – more specifically, it relied on a network hierarchy to determine which alarms could be suppressed.

I had a really interesting discussion today with some colleagues who are using much more modern RCA techniques. I was somewhat surprised, but not surprised at all in hindsight, that their Machine Learning engine doesn’t even use topology data. It just looks at events and tries to identify patterns.

That’s a really interesting insight that hadn’t dawned on me before. But it’s an exciting one because it effectively unshackles our fault management tools from data quality perfection in our inventory / asset databases. It also possibly lessens the need for integrations that share topological data.

Equally interesting, the ML engine had identified over 4,000 patterns, but only a dozen had been codified and put into use so far. In other words, the machine was learning, but humans still needed to get involved in the process to confirm that the machine had learned correctly.

Makes me wonder whether the ML pre-seeding technique we discussed in an earlier post might actually be useful for confirmations at a greater scale than the team had achieved with 12 of 4000+ to date.

The standard approach is to let the ML loose and identify patterns. This is the reactive approach. The ML reacts to the alarms that are pushed up from the network. It looks at alarms and determines what the root cause is based on historical data. A human then has to check that the root cause is correct by reverse engineering the alarm stream (just like a network operator used to do before RCA tools came along) and comparing. If the comparison is successful, the person then approves this pattern.

My proposed alternate approach is the proactive method. If we proactively trigger a fault (e.g. pull a patch lead, take a port down, etc), we start from a position of already knowing what the root cause is. This has three benefits:
1. We can check if the ML’s interpretation of root cause is right
2. We’ve proactively seeded the ML’s data with this root cause example
3. We categorically know what the root cause is, unlike the reactive mode which only assumes the operator has correctly diagnosed the root cause

Then we just have to figure out a whole bunch of proactive failures to test safely. Where to start? Well, I’d speak with the NOC operators to find out what their most common root causes are and look to trigger those events.

More tomorrow on intentionally triggering failures in production systems.

I sent you an OSS helicopter

There’s a fable of a man stuck in a flood. Convinced that God is going to save him, he says no to a passing canoe, boat, and helicopter that offer to help. He dies, and in heaven asks God why He didn’t save him. God says, “I sent you a canoe, a boat, and a helicopter!”
We all have vivid imaginations. We get a goal in our mind and picture the path so clearly. Then it’s hard to stop focusing on that vivid image, to see what else could work.
New technologies make old things easier, and new things possible. That’s why you need to re-evaluate your old dreams to see if new means have come along
.”
Derek Sivers
, here.

In the past, we could make OSS platform decisions with reasonable confidence that our choices would remain viable for many years. For example, in the 1990s if we decided to build our OSS around a particular brand of relational database then it probably remained a valid choice until after 2010.

But today, there are so many more platforms to choose from, not to mention the technologies that underpin them. And it’s not just the choices currently available but the speed with which new technologies are disrupting the existing tech. In the 1990s, it was a safe bet to use AutoCAD for outside plant visualisation without the risk of heavy re-tooling within a short timeframe.

If making the same decision today, the choices are far less clear-cut. And the risk that your choice will be obsolete within a year or two has skyrocketed.

With the proliferation of open-source projects, the decision has become harder again. That means the skill-base required to service each project has also spread thinner. In turn, decisions for big investments like OSS projects are based more on the critical mass of developers than the functionality available today. If many organisations and individuals have bought into a particular project, you’re more likely to get your new features developed than from a better open-source project that has less community buy-in.

We end up with two ends of a continuum to choose between. We can either chase every new bright shiny object and re-factor for each, or we can plan a course of action and stick to it even if it becomes increasingly obsoleted over time. The reality is that we probably fit somewhere between the two ends of the spectrum.

To be brutally honest I don’t have a solution to this conundrum. The closest technique I can suggest is to design your solution with modularity in mind, as opposed to the monolithic OSS of the past. That’s the small-grid OSS architecture model. It’s easier to replace one building than an entire city.

Life-cycles of key platforms are likely to now be a few years at best (rather than decades if starting in the 1990s). Hence, we need to limit complexity (as per the triple-constraint of OSS) and functionality to support the most high-value objectives.

I’m sure you face the same conundrums on a regular basis. Please leave a comment below to tell us how you overcome them.

Mythical OSS beasts – feature removal releases

Life can be improved by adding, or by subtracting. The world pushes us to add, because that benefits them. But the secret is to focus on subtracting…

No amount of adding will get me where I want to be. The adding mindset is deeply ingrained. It’s easy to think I need something else. It’s hard to look instead at what to remove.

The least successful people I know run in conflicting directions, drawn to distractions, say yes to almost everything, and are chained to emotional obstacles.

The most successful people I know have a narrow focus, protect against time-wasters, say no to almost everything, and have let go of old limiting beliefs.”
Derek Sivers, here.

I’m really curious here. Have you ever heard of an OSS product team removing a feature? Nope?? Me either!

I’ve seen products re-factored, resulting in changes to features. I’ve also seen products obsoleted and their replacements not offer all of the same features. But what about a version upgrade to an existing OSS product that has features subtracted? That never happens does it?? The adding mindset is deeply ingrained.

So let’s say we do want to go on a subtraction drive and remove some of the clutter from our OSS. I know plenty of OSS GUIs where subtraction is desperately needed BTW! But how do we know what to remove?

I have no data to back this up, but I would guess that almost every OSS would have certain functions that are not used, by any of their customers, in a whole year. That functionality was probably built for a specific use-case for a specific customer that no longer has relevance. Perhaps for a service type that is no longer desired by the market or a network type that will never be used again.

Question is, does your OSS have profiling instrumentation that allows you to measure what functionality is and isn’t used across your whole client base?

Can your products team readily produce a usage profile graph like the following that shows a list of functions (x-axis) by the number of times each function is used (y-axis) in a given time window? Per client? Across all clients?
Long-tail of OSS functionality use

Leave us a comment below if you’ve ever seen this type of profiling instrumentation (not for code optimisation, but for identifying client utilisation levels) and/or systematic feature subtraction initiatives.

BTW. I should make the distinction that just because a function hasn’t been used in a while, doesn’t mean it should automatically be removed. Some functionality (eg data loaders) might be rarely used, but important to retain.

The use of drones by OSS

The last few days have been all about organisational structuring to support OSS and digital transformations. Today we take a different tack – a more technical diversion – onto how drones might be relevant to the field of OSS.

A friend recently asked for help to look into the use of drones in his archaeological business. This got me to thinking about how they might apply in cross-over with OSS.

I know they’re already used to perform really accurate 3D cable route / corridor surveying. Much cooler than the old surveyor diagrams on A1 sheets from the old days. Apparently experts in the field can even tell if there’s rock in the surveyed area by looking at the vegetation patterns, heat and LIDAR scans.

But my main area of interest is in the physical inventory. With accurate geo-tagging available on drones and the ability to GPS correct the data, it seems like a really useful technique for getting outside plant (OSP) data into OSS inventory systems. Or geo-correcting data for brownfields assets.
Drone-based cable corridor surveys
Have you heard of drone-based OSP asset identification and mapping data being fed into inventory systems yet? I haven’t, but it seems like the logical next step. Do you know anyone who has started to dabble in this type of work? If you do, please send me a note as I’d love to be introduced.

Once loaded into the inventory system, with 3d geo-location, we then have the ability to visualise the OSP data with augmented reality solutions.

And other applications for drone technology?

OSS orgitecture

So far this week we’ve been focusing on ways to improve the OSS transformation process. Monday provided 7 models for achieving startup-like efficiency for larger OSS transformations. Tuesday provided suggestions for speeding up the transition from OSS PoC to getting the solution into production, specifically strategies for absorbing an OSS PoC into production.

Both of these posts talk about the speed of getting things done outside the bureaucracy of big operators, big networks and big OSS. Today, as the post title suggests, we’re going to look at orgitecture – how re-designing the structure and culture of an organisation can help streamline digital transformations.

Do you agree with the premise that smaller entities (eg Agile autonomous groups, partners, consultants, etc) can get OSS tasks done more efficiently when operating at arms-length of the larger entity (eg the carrier)? I believe that this is a first principle of physics at play.

If you’ve worked under this arms-length arrangement in the past, you’ll also know that at some point those delivery outcomes need to get integrated back into the big entity. It’s what we referred to yesterday as absorption, where the level of integration effort falls on a continuum between minimally absorbed to fully absorbed.

OSS orgitecture is the re-architecture of the people, processes, culture and org structure to better allow for the absorption process. In the past, all the safety-checks (eg security, approvals, ops handover, etc) were designed on the assumption that internal teams were doing the work. They’re not always a great fit, especially when it comes to documentation review and approval.

For example, I have a belief that the effectiveness of documentation review and approval is inversely proportional to the number of reviewers (in most, but not all cases). Unfortunately, when an external entity is delivering, there tends to be inherently less trust than if an internal entity was delivering. As such, the safety-checks increase.

Another example is when the large organisation uses Agile delivery models, but use supply partners to deliver scope of works. The partners are able to assign effort in a sequential / waterfall manner, but can be delayed by only getting timeslices of attention from client’s staff (ie resources are available according to Agile sprint planning).

Security and cutover planning mechanisms such as Change Review Boards (CRB) have also been designed around old internal delivery models. They also need to be reconsidered to facilitate a pipeline of externally-implemented change.

Perhaps the biggest orgitecture factor is in getting multiple internal business units to work together effectively. In the old world we needed all the business units to reach consensus for a new product to come to market. Sales/Marketing/Products had to work with OSS/IT and Networks. Each of these units tend to have vastly different cultures and different cadences for getting their tasks done. Delivering a new product was as much an organisational challenge as it was a technical challenge and often took months. Those times-to-market are not feasible in a world of software where competitive advantages are fleeting. External entities can potentially help or hinder these timeframes. Careful design of small autonomous teams have the potential to improve abstraction at the interlocks, but culture remains the potential roadblock.

I’m excited by the opportunity for OSS delivery improvement coming from leveraging the gig economy. But if big OSS transformations are to make use of these efficiency gains, then we may also need to consider culture and process refinement as part of the change management.

Speeding up your OSS transition from PoC to PROD

In yesterday’s article, we discussed 7 models for achieving startup-like efficiency on large OSS transformations.

One popular approach is to build a proof-of-concept or sandpit quickly on cloud hosting or in lab environments. It’s fast for a number of reasons including reduced number of approvals, faster activation of infrastructure, reduced safety checks (eg security, privacy, etc), minimised integration with legacy systems and many other reasons. The cloud hosting business model is thriving for all of these reasons.

However, it’s one thing to speed up development of an OSS PoC and another entirely to speed up deployment to a PROD environment. As soon as you wish to absorb the PoC-proven solution back into PROD, all the items listed above (eg security sign-offs) come back into play. Something that took days/weeks to stand up in PoC now takes months to productionise.

Have you noticed that the safety checks currently being used were often defined for the old world? They often aren’t designed with transition from cloud to PROD in mind. Similarly, the culture of design cross-checks and approvals can also be re-framed (especially when the end-to-end solution crosses multiple different business units). Lastly, and way outside my locus of competence, is in re-visiting security / privacy / deployment / etc models to facilitate easier transition.

One consideration to make is just how much absorption is required. For example, there are examples of services being delivered to the large entity’s subscribers by a smaller, external entity. The large entity then just “clips-the-ticket,” gaining a revenue stream with limited involvement. But the more common (and much more challenging) absorption model is for the partner to fold the solution back into the large entity’s full OSS/BSS stack.

So let’s consider your opportunity in terms of the absorption continuum that ranges between:

clip-the-ticket (minimally absorbed) <-----------|-----------> folded-in (fully absorbed)

Perhaps it’s feasible for your opportunity to fit somewhere in between (partially absorbed)? Perhaps part of that answer resides in the cloud model you decide to use (public, private, hybrid, cloud-managed private cloud) as well as the partnership model?

Modularity and reduced complexity (eg integrations) are also a factor to consider (as always).

I haven’t seen an ideal response to the absorption challenge yet, but I believe the solution lies in re-framing corporate culture and technology stacks. We’ll look at that in more detail tomorrow.

How about you? Have you or your organisation managed to speed up your transition from PoC to PROD? What techniques have you found to be successful?

Seven OSS transformation efficiency models

Do you work in a large organisation? Have you also worked in smaller organisations?
Where have you felt more efficient?

I’ve been lucky enough to work on some massive OSS transformations for large T1 telcos. But I’ve always noticed the inefficiency of working on these projects when embedded inside the bureaucracy of the beast. With all of the documentation, sign-offs, meetings, politics, gaining consensus, budget allocations, etc it can sometimes feel so inefficient. On some past projects, I’ve felt I can accomplish more in a day outside than a week or more inside the beast.

This makes sense when applying the fundamental law of physics F = M x a to OSS projects. In other words, the greater the mass (of the organisation), the more force must be applied to reach a given acceleration (ie to effect a change).

It’s one of the reasons I love working within a small entity (Passionate About OSS), but into big entities (the big telcos and utilities). It’s also why I strongly believe that the big entities need to better leverage smaller working groups to facilitate big OSS change. Not just OSS transformation, but any project where the size of the culture and technology stack are prohibitive.

Here are a few ways you can use to bring a start-up’s efficiency to a big OSS transformation:

  1. Agile methodologies – If done well, Agile can be great at breaking transformations down into smaller, more manageable pieces. The art is in designing small autonomous teams / responsibilities and breakdown of work to minimise dependencies
  2. Partnerships – Using smaller, external partners to deliver outcomes (eg product builds or service offerings) that can be absorbed into the big organisation. There are varying levels of absorption here – from an external, “clip-the-ticket” offering to offerings that are fully absorbed into the large entity’s OSS/BSS stack
  3. Consultancies – Similar to partnerships, but using smaller teams to implement professional services
  4. Spin-out / spin-in teams – Separating small teams of experts out from the bureaucracy of the large organisation so that they can achieve rapid progress
  5. Smart contracts / RFPs – I love the potential for smart contracts to automate the offer of small chunks of work to trusted partners to bid upon and then deliver upon
  6. Externalised Proofs of Concept (PoC) – One of the big challenges in implementing for large organisations is all of the safety checks that slow progress. Many, such as security and privacy mechanisms, are completely justified for a production network. But when a concept needs to be proved, such as user journeys, product integrations, sand-pit environments, etc, then cloud-based PoCs can be brilliant
  7. Alternate brands – Have you also noticed that some of the tier-1 telcos have been spinning out low-cost and/or niche brands with much leaner OSS/BSS stacks, offerings and related culture lately? It’s a clever business model on many levels. Combined with the strangler fig transformation approach, this might just represent a pathway for the big brand to shed many of their OSS/BSS legacy constraints

Can you think of other models that I’ve missed?

The key to these strategies is not so much the carve-out, the process of getting small teams to do tasks efficiently, but the absorb-in process. For example, how to absorb a cloud-based PoC back into the PROD network, where all safety checks (eg security, privacy, operations acceptance, etc) still need to be performed. More on that in tomorrow’s post.

How to bring your art and your science to your OSS

In the last two posts, we’ve discussed repeatability within the field of OSS implementation – paint-by-numbers vs artisans and then resilience vs precision in delivery practices.

Now I’d like you to have a think about how those posts overlay onto this quote by Karl Popper:
Non-reproducible single occurrences are of no significance to science.”

Every OSS implementation is different. That means that every one is a non-reproducible single occurrence. But if we bring this mindset into our OSS implementations, it means we’re bringing artisinal rather than scientific method to the project.

I’m all for bringing more art, more creativity, more resilience into our OSS projects.

I’m also advocating more science though too. More repeatability. More precision. Whilst every OSS project may be different at a macro level, there are a lot of similarities in the micro-elements. There tends to be similarities in sequences of activities if you pay close attention to the rhythms of your projects. Perhaps our products can use techniques to spot and leverage similarities too.

In other words, bring your art and your science to your OSS. Please leave a comment below. I’d love to hear the techniques you use to achieve this.

The Mona Lisa of OSS

All OSS rely on workflows to make key outcomes happen. Outcomes like activating a customer order, resolving a fault, billing customers, etc. These workflows often touch multiple OSS/BSS products and/or functional capabilities. There’s not always a single-best-way to achieve an outcome.

If you’re responsible for your organisation’s workflows do you want to build a paint-by-numbers approach where each process is repeatable?
Or do you want the bespoke paintings, which could unintentionally lead to a range in quality from Leonardo’s Mona Lisa to my 3 year old’s finger painting?

Apart from new starters, who thrive on a paint-by-numbers approach at first, every person who uses an OSS wants to feel like an accomplished artisan. They want to have the freedom to get stuff done with their own unique brush-strokes. They certainly don’t want to follow a standard, pre-defined pattern day-in and day-out. That would be so boring and demoralising. I don’t blame them. I’d be exactly the same.

This is perhaps why some organisations don’t have documented workflows, or at least they only have loosely defined ones. It’s just too hard to capture all the possibilities on one swim-lane chart.

I’m all for having artisans on the team who are able to handle the rarer situations (eg process fall-outs) with bespoke processes. But bespoke processes should never be the norm. Continual improvement thrives on a strong level of repeatability.

To me, bespoke workflows are not necessarily an indication of a team of free spirited artists that need to be regimented, but of processes with too many variants. Click on this link to find recommendations for reducing the level of bespoke processes in your organisation.

Are processes bespoke or paint-by-numbers in your organisation?

BTW. We’ll take a slightly different perspective on workflow repeatability in tomorrow’s post.

Can OSS/BSS assist CX? We’re barely touching the surface

Have you ever experienced an epic customer experience (CX) fail when dealing a network service operator, like the one I described yesterday?

In that example, the OSS/BSS, and possibly the associated people / process, had a direct impact on poor customer experience. Admittedly, that 7 truck-roll experience was a number of years ago now.

We have fewer excuses these days. Smart phones and network connected devices allow us to get OSS/BSS data into the field in ways we previously couldn’t. There’s no need for printed job lists, design packs and the like. Our OSS/BSS can leverage these connected devices to give far better decision intelligence in real time.

If we look to the logistics industry, we can see how parcel tracking technologies help to automatically provide status / progress to parcel recipients. We can see how recipients can also modify their availability, which automatically adjusts logistics delivery sequencing / scheduling.

This has multiple benefits for the logistics company:

  • It increases first time delivery rates
  • Improves the ability to automatically notify customers (eg email, SMS, chatbots)
  • Decreases customer enquiries / complaints
  • Decreases the amount of time the truck drivers need to spend communicating back to base and with clients
  • But most importantly, it improves the customer experience

Logistics is an interesting challenge for our OSS/BSS due to the sheer volume of customer interaction events handled each day.

But it’s another area that excites me even more, where CX is improved through improved data quality:

  • It’s the ability for field workers to interact with OSS/BSS data in real-time
  • To see the design packs
  • To compare with field situations
  • To update the data where there is inconsistency.

Even more excitingly, to introduce augmented reality to assist with decision intelligence for field work crews:

  • To provide an overlay of what fibres need to be spliced together
  • To show exactly which port a patch-lead needs to connect to
  • To show where an underground cable route goes
  • To show where a cable runs through trayway in a data centre
  • etc, etc

We’re barely touching the surface of how our OSS/BSS can assist with CX.

The 7 truck-roll fail

In yesterday’s post we talked about the cost of quality. We talked about examples of primary, secondary and tertiary costs of bad data quality (DQ). We also highlighted that the tertiary costs, including the damage to brand reputation, can be one of the biggest factors.

I often cite an example where it took 7 truck rolls to connect a service to my house a few years ago. This provider was unable to provide an estimate of when their field staff would arrive each day, so it meant I needed to take a full day off work on each of those 7 occasions.

The primary cost factors are fairly obvious, for me, for the provider and for my employer at the time. On the direct costs alone, it would’ve taken many months, if not years, for the provider to recoup their install costs. Most of it attributable to the OSS/BSS and associated processes.

Many of those 7 truck rolls were a direct result of having bad or incomplete data:

  • They didn’t record that it was a two storey house (and therefore needed a crew with “working at heights” certification and gear)
  • They didn’t record that the install was at a back room at the house (and therefore needed a higher-skilled crew to perform the work)
  • The existing service was installed underground, but they had no records of the route (they went back to the designs and installed a completely different access technology because replicating the existing service was just too complex)

Customer Experience (CX), aka brand damage, is the greatest of all cost of quality factors when you consider studies such as those mentioned below.

A dissatisfied customer will tell 9-15 people about their experience. Around 13% of dissatisfied customers tell more than 20 people.”
White House Office of Consumer Affairs
(according to customerthink.com).

Through this page alone, I’ve told a lot more than 20 (although I haven’t mentioned the provider’s name, so perhaps it doesn’t count! 🙂  ).

But the point is that my 7 truck-roll example above could’ve been avoided if the provider’s OSS/BSS gave better information to their field workers (or perhaps enforced that the field workers populated useful data).

We’ll talk a little more tomorrow about modern Field Services tools and how our OSS/BSS can impact CX in a much more positive way.

Calculating the cost of quality

This week of posts has followed the theme of the cost of quality. Data quality that is.

But how do you calculate the cost of bad data quality?

Yesterday’s post mentioned starting with PNI (Physical Network Inventory). PNI is the cables, splices / joints, patch panels, ducts, pits, etc. This data doesn’t tend to have a programmable interface to electronically reconcile with. This makes it prone to errors of many types – mistakes in manual entry, reconfigurations that are never documented, assets that are lost or stolen, assets that are damaged or degraded, etc.

Some costs resulting from poor PNI data quality (DQ) can be considered primary costs. This includes SLA breaches caused by an inability to identify a fault within an SLA window due to incorrect / incomplete / indecipherable design data. These costs are the most obvious and easy to calculate because they result in SLA penalties. If a network operator misses a few of these with tier 1 clients then this is the disaster referred to yesterday.

But the true cost of quality is in the ripple-out effects. The secondary costs. These include the many factors that result in unnecessary truck rolls. With truck rolls come extra costs including contractor costs, delayed revenues, design rework costs, etc.

Other secondary effects include:

  • Downstream data maintenance in systems that rely on PNI data
  • Code in downstream systems that caters for poor data quality, which in turn increases the costs of complexity such as:
    • Additional testing
    • Additional fixing
    • Additional curation
  • Delays in the ability to get new products to market
  • Ability to accurately price products (due to variation in real costs caused by extra complexity)
  • Reduced impact of automations (due to increased variants)
  • Potential to impact Machine Learning / Artificial Intelligence engines, which rely on reliable and consistent data at scale
  • etc

There are probably more sophisticated ways to calculate the cost of quality across all these factors and more, but in most cases I just use a simple multiplier:

  • Number of instances of DQ events (eg number of additional truck rolls); times by
  • A rule-of-thumb cost impact of each event (eg the cost of each additional truck roll)

Sometimes the rules-of-thumb are challenging to estimate, so I tend to err on the side of conservatism. I figure that even if the rules-of-thumb aren’t perfectly accurate, at least they produce a real cost estimate rather than just anecdotal evidence.

And more importantly, the tertiary and less tangible costs of brand damage (also known as Customer Experience or CX or reputation damage). We’ll talk a little more about that tomorrow.

 

Where an absence of OSS data can still provide insights

The diagram below has some parallels with OSS. The story however is a little long before it gets to the OSS part, so please bear with me.

null

The diagram shows analysis the US Navy performed during WWII on where planes were being shot. The theory was that they should be reinforcing the areas that received the most hits (ie the wing-tips, central part of the body, etc as per the diagram).

Abraham Wald, a statistician, had a completely different perspective. His rationale was that the hits would be more uniform and the blank sections on the diagram above represent the areas that needed reinforcement. Why? Well, the planes that were hit there never made it home for analysis.

In OSS, this is akin to the device or EMS that has failed and is unable to send any telemetry data. No alarms are appearing in our alarm lists for those devices / EMS because they’re no longer capable of sending any.

That’s why we use heartbeat mechanisms to confirm a system is alive (and capable of sending alarms). These come in the form of pollers, pingers or just watching for other signs of life such as network traffic of any form.

In what other areas of OSS can an absence of data demonstrate an insight?

The OSS Minimum Feature Set is Not The Goal

This minimum feature set (sometimes called the “minimum viable product”) causes lots of confusion. Founders act like the “minimum” part is the goal. Or worse, that every potential customer should want it. In the real world not every customer is going to get overly excited about your minimum feature set. Only a special subset of customers will and what gets them breathing heavy is the long-term vision for your product.

The reality is that the minimum feature set is 1) a tactic to reduce wasted engineering hours (code left on the floor) and 2) to get the product in the hands of early visionary customers as soon as possible.

You’re selling the vision and delivering the minimum feature set to visionaries not everyone.”
Steve Blank here.

A recent blog series discussed the use of pilots as an OSS transformation and augmentation change agent.
I have the need for OSS speed
Re-framing an OSS replacement strategy
OSS transformation is hard. What can we learn from open source?

Note that you can replace the term pilot in these posts with MVP – Minimum Viable Product.

The attraction in getting an MVP / pilot version of your OSS into the hands of users is familiarity and momentum. The solution becomes more tangible and therefore needs less documentation (eg architecture, designs, requirement gathering, etc) to describe foreign concepts to customers. The downside of the MVP / pilot is that not every customer will “get overly excited about your minimum feature set.”

As Steve says, “Only a special subset of customers will and what gets them breathing heavy is the long-term vision for your product.” The challenge for all of us in OSS is articulating the long-term vision and making it compelling…. and not just leaving the product in its pilot state (we’ve all seen this happen haven’t we?)

We’ll provide an example of a long-term vision tomorrow.

PS. I should also highlight that the maximum feature set also isn’t the goal either.

The layers of ITIL redundancy

Today’s is something of a heretical post, especially for the believers in ITIL. In the world of OSS, we look to build in layers of resiliency and not layers of redundancy.

The following diagram and subsequent text in italics describes a typical ITIL process and is all taken from https://www.computereconomics.com/article.cfm?id=1074

Example of relationship between ITIL incidents, problems, and changes

The sequence of events as shown in Figure 1 is as follows:

  • At TIME = 0, an External Event is detected by the Incident Management process. This could be as simple as a customer calling to say that service is unavailable or it could be an automated alert from a system monitoring device.The incident owner logs and classifies this as incident i2. Then, the incident owner tries to match i2 to known errors, work-arounds, or temporary fixes, but cannot find a match in the database.
  • At TIME = 1, the incident owner dispatches a problem request to the Problem Management process anticipating a work-around, temporary fix, or other assistance. In doing so, the incident owner has prompted the creation of Problem p2.
  • At TIME = 2, the problem owner of p2 returns the expected temporary fix to the incident owner of i2.  Note that both i2 and p2 are active and exist simultaneously. The incident owner for i2 applies the temporary fix.
  • In this case, the work-around requires a change request.  So, at Time = 3, the incident owner for i2 initiates change request, c2.
  • The change request c2 is applied successfully, and at TIME = 4, c2 is closed. Note that for a while i2, p2 and c2 all exist simultaneously.
  • Because c2 was successful, the incident owner for i2 can now confirm that the incident is resolved. At TIME = 5, i2 is closed. However, p2 remains active while the problem owner searches for a permanent fix. The problem owner for p2 would be responsible for implementing the permanent fix and initiating any necessary change requests.

But I look at it slightly differently. At their root, why do Incident Management, Problem Management and Change Management exist? They’re all just mechanisms for resolving a system* health issue. If we detect an event/s and fix it, we don’t have to expend all the effort of flicking tickets around.

Thinking within the T2R paradigm of trouble -> incident -> problem -> change -> resolve holds us back. If we can skip the middle steps and immediately associate a resolution with the event/s, we get a whole lot more efficient. If we can immediately relate a trigger with a reaction, we can also get rid of the intermediate ticket flickers and the processing cycle time.

So, the NOC of the future surely requires us to build a trigger -> reaction recommendation engine (and body of knowledge). That’s a more powerful tool to supply to our NOC operators than incidents, problems and change requests. (Easier to write about than to actually solve though of course)

OSS transformation is hard. What can we learn from open source?

Have you noticed an increasing presence of open-source tools in your OSS recently? Have you also noticed that open-source is helping to trigger transformation? Have you thought about why that might be?

Some might rightly argue that it is the cost factor. You could also claim that they tend to help resolve specific, but common, problems. They’re smaller and modular.

I’d argue that the reason relates to our most recent two blog posts. They’re fast to install and they’re easy to run in parallel for comparison purposes.

If you’re designing an OSS can you introduce the same concepts? Your OSS might be for internal purposes or to sell to market. Either way, if you make it fast to build and easy to use, you have a greater chance of triggering transformation.

If you have a behemoth OSS to “sell,” transformation persuasion is harder. The customer needs to rally more resources (funds, people, time) just to compare with what they already have. If you have a behemoth on your hands, you need to try even harder to be faster, easier and more modular.

I have the need for OSS speed

You already know that speed is important for OSS users. They / we don’t want to wait for minutes for the OSS to respond to a simple query. That’s obvious right? The bleeding obvious.

But that’s not what today’s post is about. So then, what is it about?

Actually, it follows on from yesterday’s post about re-framing of OSS transformation.  If a parallel pilot OSS can be stood up in weeks then it helps persuasion. If the OSS is also fast for operators to learn, then it helps persuasion.  Why is that important? How can speed help with persuasion?

Put simply:

  • It takes x months of uncertainty out of the evaluators’ lives
  • It takes x months of parallel processing out of the evaluators’ lives
  • It also takes x months of task-switching out of the evaluators’ lives
  • Given x months of their lives back, customers will be more easily persuaded

It also helps with the parallel bake-off if your pilot OSS shows a speed improvement.

Whether we’re the buyer or seller in an OSS pilot, it’s incumbent upon us to increase speed.

You may ask how. Many ways, but I’d start with a mass-simplification exercise.