Orders down, faults up

As mentioned in last week's post about Service and Resource Availability, I tend to think of OSS workflows in terms of an “orders down, faults up” flow direction. That means customers (services) at the top and the network (resources) at the bottom of the (TMN) pyramid.

I also think of inventory (yellow) as the point where Assurance / Faults (blue) and Fulfillment / Orders (purple) collide and enrich each other, as per the diagram below.


These are highly generic examples, but let’s take a closer look:

Assurance flow (blue) – an alarm or event in the network (NEL/NE layer) pushes up through the stack to the OSS as a fault. The inventory (network / service) helps to enrich the fault with additional information (eg device name, location, connectivity, correlation with other faults) to help resolve it, either manually by operators or automatically by algorithms. It also helps to associate the fault in the network/resource with the customer/s using those resources, which allows notifications to be issued to customers. Note that this simple flow doesn’t cover cases such as an incident (ie when a customer notices a problem and calls it in before the OSS has been able to issue a notification).
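To make the enrichment step concrete, here's a minimal sketch of the idea. All names (the inventory structure, field names, identifiers) are invented for illustration, not any particular product's schema:

```python
# Hypothetical sketch of inventory-based fault enrichment (all names invented).
# A raw alarm arrives with only a resource identifier; the inventory adds
# location and customer context so operators (or algorithms) can act on it.

INVENTORY = {
    "NE-0042": {
        "device_name": "MEL-EDGE-07",
        "location": "Melbourne POP 3, Rack 12",
        "services": ["SVC-1001", "SVC-1002"],  # linking keys to customer services
    },
}

def enrich_fault(alarm: dict) -> dict:
    """Enrich a raw alarm with inventory attributes and impacted services."""
    record = INVENTORY.get(alarm["resource_id"], {})
    return {
        **alarm,
        "device_name": record.get("device_name", "unknown"),
        "location": record.get("location", "unknown"),
        "impacted_services": record.get("services", []),  # drives customer notifications
    }

raw = {"resource_id": "NE-0042", "severity": "critical", "event": "LOS"}
enriched = enrich_fault(raw)
print(enriched["impacted_services"])  # services whose customers may need notifying
```

The same lookup serves both halves of the flow: the device/location attributes help resolve the fault, while the service linking keys drive the customer notifications mentioned above.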

Fulfillment flow (purple) – a customer places an order (BML/BSS layer or above) and it pushes down through the stack, including changes in the network (NEL/NE layer). Once all the appropriate network changes have been made, the order is ready for use by the customer. Once again, inventory plays an important part, associating customer / service identifiers with suitable resources from the available resource pool. Generally though, the (customer-facing) service orders won’t have the technology-specific details required to actually update the network configurations. That’s where the inventory often helps to fill in the knowledge gaps and send technology-specific commands down into the network. [See Friday’s post for more information about CFS and RFS definitions and mappings]

Inventory flows (yellow) – an inventory is only relevant if the BSS and network / resource layers don’t hold enough information for the blue and purple flows to be fully processed. The enriching information stored by inventory must come from somewhere. Some of it comes from Discovery (usually an automated process of collecting from the network or other sources), or via Manual / Scripted Input (eg physical network designs including patch cables and splicing). Some data (eg splices) just can’t be collected automatically, as it relates to passive equipment that has no programmatic interface. This data simply has to be created manually.
Arguably the more important inventory data, though, is the record of mappings made from customers (services) to network (resources). Inventory solutions are often where these linking keys / relationships are recorded.
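Those linking keys come into existence at allocation time. Here's a toy sketch of the reservation pattern, under the assumption of a simple port pool (all names and structures are hypothetical):

```python
# Hypothetical sketch of inventory resource allocation (all names invented).
# The inventory reserves an available resource for a service and records the
# linking key that later supports impact analysis and customer notifications.

PORT_POOL = {
    "PORT-A1": None,        # None => available
    "PORT-A2": None,
    "PORT-A3": "SVC-0900",  # already assigned to another service
}

def assign_resource(service_id: str) -> str:
    """Reserve the first available port and tag it with the service ID."""
    for port, owner in PORT_POOL.items():
        if owner is None:
            PORT_POOL[port] = service_id  # the linking key is recorded here
            return port
    raise RuntimeError("resource pool exhausted - trigger capacity planning")

def services_on(port: str) -> list[str]:
    """Reverse lookup: which service/s consume this resource?"""
    owner = PORT_POOL.get(port)
    return [owner] if owner else []

port = assign_resource("SVC-1001")
print(port, services_on(port))
```

The forward lookup (service to resource) supports fulfillment; the reverse lookup (resource to service) is what assurance flows lean on.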

These flows also tend to indicate the direction of data mastery. Whilst the network itself is the source of truth, fulfillment flows will start at the BSS, and customer / service / order data will tend to be mastered there before orders are pushed into the network as provisioning commands. For assurance flows, the network will tend to be the master source of data, but with enrichment along its path northbound.

Just keep in mind that there are many exceptions to these examples. Data and processes can flow in many different ways. The diagram above is just useful for helping newcomers to understand some conceptual processes and data models / flows.

Differences between CFS and RFS

Further to yesterday’s post about Service and Resource Availability, I received some questions about how to discern between CFS (Customer Facing Services) and RFS (Resource Facing Services).

I thought the following description, sourced from TM Forum’s GB999 (Service overview sec 1.1.3), might help clarify the differences:

  • “This enables us to model a wide variety of services in a common class hierarchy while differentiating between Services that are obtained as a Product by a Customer versus those that aren’t. As we will see, a CustomerFacingService is one that is obtained as a Product by a Customer. Therefore, the Customer may have specific control over this Service via its associated Product. In contrast, the Customer never knows explicitly which ResourceFacingServices are being used to support a CustomerFacingService. More importantly, the Customer shouldn’t have to know which ResourceFacingServices are being used, since the Customer hasn’t explicitly obtained them.”
  • CFS are associated with resource technology neutral services i.e. they describe general capabilities and have attributes that are general across many technologies e.g. throughput, latency, SLA / loss rate, availability.
  • CFS and RFS typically have different lifecycles, CFS are related to customer and product changes and RFS to technology changes.
  • RFS are associated with resource technology specific services i.e. they have attributes that predominately relate to a specific technology.
  • RFS typically do some of the following:
    1. Map between the native protocols used to expose management of resources e.g. Netconf, YANG, SNMP, etc.
    2. Provide some integrated approach to provisioning and assuring RFS that span multiple technical domains (e.g. slices across RAN and Core).
    3. They may be Operator, SP, ISV or Supplier provided.

Further important notes:

  • Composition of subordinate CFSs to support the CFSs exposed by Production capabilities (iterative composition pattern). These subordinate CFSs may be from other Operational Domains both within the same operator or acquired from third party operators as happens with wholesale interconnect.
  • Mapping of CFS to internal Resource Facing Services (RFS) that abstract into services the resource defined by Suppliers’ Technical Domains whose boundaries are defined by technology and supplier choices. This mapping links the boundary decisions of Operational Domains to the Technical Domain boundary decisions of suppliers.
  • RFSs can be atomic or composite to include other RFS (iterative composition pattern). This is a decision taken by the Operations / Integrator composing or creating RFSs based on deployment needs.
  • In a Service Oriented Architecture any exposed services can be consumed by any other service.
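The CFS/RFS split described above can be sketched as a simple class hierarchy. This is a rough illustration based on the GB999 wording, not the normative TM Forum SID model; all attribute names are invented:

```python
from dataclasses import dataclass, field

# Rough sketch of the CFS/RFS split (illustrative only, not the normative SID model).

@dataclass
class Service:
    name: str

@dataclass
class ResourceFacingService(Service):
    # Technology-specific attributes; the customer never sees these directly
    technology: str = ""
    protocol: str = ""   # e.g. Netconf, SNMP

@dataclass
class CustomerFacingService(Service):
    # Technology-neutral attributes, obtained as a Product by a Customer
    throughput_mbps: int = 0
    availability_pct: float = 99.9
    # Iterative composition: a CFS is supported by one or more RFS
    supported_by: list[ResourceFacingService] = field(default_factory=list)

rfs = ResourceFacingService("DSL access", technology="DSL", protocol="SNMP")
cfs = CustomerFacingService("Internet Access", throughput_mbps=100, supported_by=[rfs])
print([s.technology for s in cfs.supported_by])
```

Note how the customer-visible attributes (throughput, availability) sit on the CFS, while the technology and protocol details sit on the RFS, mirroring the lifecycle split described in the bullets above.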

How is OSS/BSS service and resource availability supposed to work?

The brilliant question above was asked by a contact who is about to embark on a large OSS/BSS transformation. That’s certainly a challenging question to start the new year with!

The following was provided for a little more context:

  • We have a manually maintained table for each address where we can store which services are available—ie. DSL up to 5 Mbps or Fiber Data 300 Mbps

  • This manual information has no data-level connection to the actual plant used to serve the address 

  • In a “perfect world”, how does this work?

  • Where is the data stored? Ex: Does a geospatial network inventory store this data, then the BSS requests it as needed?

  • How does a typical OSS tie together physical network and equipment to products and offerings?

  • How is it typically stored? How is it accessed?

  • Sort of related to the address, we have “Facility” records that include things like the Electronics (Card, slot, port, shelf, etc) and some important “hops” along the way 

  • Right now if a tech makes changes to physical plant, we have to manually update our mapping (if the path changes), spreadsheets (if fiber assignment changes) or paper records (if copper pair assignments change). Additionally, we might need to update the Facilities database

  • It doesn’t “use” its “awareness” of our plant or network equipment to do anything except during service orders where certain products are tied to provisioning features—ie. a callerID billing code on an order causes a command to be issued to the switch to add that feature.

  • There is no visibility into network status. How does this normally work?

  • I feel like I’m missing a fundamental reference point because I’ve never seen an actual working example of “Orders down, faults up”, just manually maintained records that sort of single-directionally “integrate” to network devices but only in the context of what was ordered, not in the context of what is available and what the real-time status is.

Wow! Where do we start? Certainly not an easy or obvious question by any means. In fact it’s one of the trickier of all OSS/BSS challenges.

In the old days of OSS/BSS, services tended to be circuit-oriented and there was a direct allocation of logical / physical resources to each customer service. You received an order, you created a “customer circuit” for the order, you reserved suitable / available resources in your inventory to assign to the circuit, then issued work order activities to implement the circuit. When the work order activities were complete, the circuit was ready for service.

The utilised resources in your inventory system/s were tagged with the circuit ID or service ID and therefore not available to other services. This association also allowed Service Impact Analysis (SIA) to be performed. In the background, you had to reconcile the real resources available in the network with what was being shown in your inventory solution. Relationships were traceable down through all layers of the TMN stack (as below). Status of the resources (eg a Network Element had failed) could also be associated with the inventory solution because alarms / events had linking keys to all the devices, cards, ports, logical ports, etc in inventory.
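That circuit-era traceability can be sketched as a walk over linking keys. This is a hypothetical containment model, not any particular inventory product's schema:

```python
# Hypothetical sketch of Service Impact Analysis over inventory linking keys.
# Each layer references the layer below it; a failed element is traced
# upward to the circuits (and hence customers) that depend on it.

CONTAINMENT = {
    # child -> parent relationships down the stack
    "PORT-7/1": "CARD-7",
    "CARD-7": "NE-0042",
}
CIRCUITS = {
    "CIRCUIT-555": {"ports": ["PORT-7/1"], "customer": "ACME Corp"},
    "CIRCUIT-556": {"ports": ["PORT-9/3"], "customer": "Globex"},
}

def contains(parent: str, child: str) -> bool:
    """True if `child` sits (directly or indirectly) under `parent`."""
    while child in CONTAINMENT:
        if CONTAINMENT[child] == parent:
            return True
        child = CONTAINMENT[child]
    return child == parent

def impacted_circuits(failed_element: str) -> list[str]:
    """Which circuits consume resources under the failed element?"""
    return [cid for cid, c in CIRCUITS.items()
            if any(contains(failed_element, p) or p == failed_element
                   for p in c["ports"])]

print(impacted_circuits("NE-0042"))  # circuits (customers) to alert
```

The whole technique hinges on those linking keys being complete and reconciled against the real network, which is exactly the maintenance burden described above.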

To an extent, it’s still possible to do this for the access/edge of the network. For example, from the Customer Premises Equipment (CPE) / Network Termination Device (NTD) to the telco’s access device (eg DSLAM or similar). But from that point deeper into the core of the telco network, it’s usually a dynamic allocation of resources (eg packet-switched, routed signal paths).

With modern virtualised and packet-switched networks, dynamic allocation makes it harder to directly associate actual resources with customer services at any point in time. See this earlier post for more detail on the diagram below.

Instead, we now just ask the OSS to push orders into the virtualisation cloud and expect the virtualisation managers to ensure reliable availability of resources. We’ve lost visibility inside the cloud.

So this poses the question about whether we even need visibility now. There are three main states to consider:

  1. At Service Initiation – What resources are available to assign to a service? As long as capacity planning is doing its job and keeping the available resource pool full, we just assume there will be sufficient resources and let the virtualisation manager do its thing
  2. After Service is Commissioned – What resources are assigned to the service at the current point in time? If the virtualisation manager and network are doing their highly available, highly resilient thing, then do we want to know?
  3. During an Outage – What services are impacted by resources that are degraded or not available? As operators, we definitely want to know what needs to be fixed and which customers need to be alerted.

So, let’s now get into a more “modern orchestration and abstraction” approach to associating customer services with resources. I’ve seen it done many different ways but let’s use the diagram below as a reference point (you might have to view in expanded form):

CFS RFS orchestration

Here are a few thoughts that might help:

  • As mentioned by the contact, “orders down, faults up,” is a mindset I tend to start with too. Unfortunately, data flows often have to be custom-designed as they’re constrained by the available systems, organisation structures, data quality improvement models, preferred orchestration model, etc
  • You may have heard of CFS (Customer Facing Service) and RFS (Resource Facing Service) constructs? They’re building blocks that are often used by operators to design product offerings for customers (and then design the IT systems that support them). They’re shown as blue ovals as they’re defined in the Service Catalog (CFS shown as north-facing and RFS as south-facing)
  • CFS are services tied to a product/offering. RFS are services linked with resources
  • To simplify, I think of CFS like a customer order form (ie what fields and options are available for the customer) and RFS being the technical interfaces to the network (eg APIs into the Domain Managers and possibly NMS/EMS/VIM)
  • Examples of CFS might be VPN, Internet Access, Transport, Security, Mobility, Video, etc
    Examples of RFS might be DSL, DOCSIS, BGP (border gateway protocol), DNS, etc
    See conceptual model from Oracle here:
  • Now, let’s think of how to create this model in two halves:
      • One is design-time – that’s where you design the CFS and/or RFS service definitions, as well as designing the orchestration plan (OP) (aka provisioning plan). The OP is the workflow of activities required to activate a CFS type. This could be as simple as one CFS consuming an RFS stub with a few basic parameters mapped (eg CallingID). Others can be very complex flows if there are multiple service variants and additional data that needs to be gathered from East-West systems (eg request for next available patch-port from physical network inventory [PNI]). Some of the orchestration steps might be automated / system-driven, whilst others might be manual work order activities that need to be done by field workforce.
        Note that the “Logging and Test” box at the left is just to test your design-time configurations prior to processing a full run-time order
      • The other is run-time – that’s where the Orchestrator runs the OP to drive instance-by-instance implementation of a service (including consumption of actual resources). That is, an instantiation of one customer order through the orchestration workflow you created during design-time 
  • A CFS can map parameters from one or more RFS (there can even be hierarchical consumption of multiple RFS and CFS in some situations, but that will just confuse the situation)
  • You can also loosely think of CFS as being part of the BSS and RFS as being part of the OSS, with the service orchestration usually being a grey area in the middle
  • Now to the question about where the data is stored:
    • Design-time – CFS building block constructs are generally stored in a BSS or service catalog. Orchestration plans are often also part of modern catalogs, but could also fall within your BSS or OSS depending on your specific stack
    • Run-time (ie for each individual order) – The customer order details (eg speeds, configurations, etc) are generally stored in “the BSS.” The orchestration plan for each order then drives data flows. This is where things get very specific to individual stacks. The OP can request resource availability via east-west systems (eg inventory [LNI or PNI], DNS, address databases, WFM, billing code database, etc) and/or via southbound interfaces (eg NMS/EMS/Infrastructure-Manager APIs) to gather whatever information is required
    • Distributed or Centralised data – There’s no specific place where all data is collected. Some of the systems (eg PNI/LNI) above will have their own data repositories, whilst others will pull from a centralised data store or the network or other infrastructure via NMS/EMS/VIM
    • Data master – in theory the network (eg NMS/EMS/NE) should be the most accurate store of information, hence the best place to get data from (and your best visibility of the current state of the network). Unfortunately, the NMS/EMS/NE often won’t have all the info you need to drive the orchestration plan. For example, if you don’t already have a cable to the requesting customer’s address, then the orchestration plan will have to include an action/s for a designer to use PNI/geospatial data to find the nearest infrastructure (eg joint/pedestal) to run a new cable from, then go through all the physical build activities, before sending the required data back to the orchestration plan. Since the physical network (eg cables, joints, etc) almost never has a programmatic interface, it will require manual effort and manual data entry. Similarly, the NMS/EMS/VIM might not be able to tell us exactly what resource the service is consuming at any point in time
    • For Specific product offerings – There are so many different possibilities here that it’s hard to answer all the possible data flows / models. The orchestration plan within the Business Orchestration (aka Cross-domain Orchestration) layer is responsible for driving flows. It may have to perform service provisioning, network orchestration and infrastructure control. 
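To make the design-time / run-time split above more concrete, here's a heavily simplified sketch. All step names and structures are invented, not any particular orchestrator's API:

```python
# Hypothetical sketch of an orchestration plan (OP). Design-time defines the
# workflow for a CFS type; run-time executes it once per customer order.

# Design-time: the OP for a hypothetical "Internet Access" CFS, as an ordered
# list of steps. Some steps are automated, others become manual work order
# activities for the field workforce.
INTERNET_ACCESS_OP = [
    {"step": "reserve_port", "mode": "auto"},      # eg east-west query to PNI
    {"step": "run_new_cable", "mode": "manual"},   # field workforce activity
    {"step": "configure_dslam", "mode": "auto"},   # southbound NMS/EMS API call
    {"step": "activate_billing", "mode": "auto"},
]

def run_order(order_id: str, plan: list[dict]) -> list[str]:
    """Run-time: instantiate the plan for one order, returning an audit trail."""
    trail = []
    for s in plan:
        if s["mode"] == "auto":
            trail.append(f"{order_id}: executed {s['step']}")
        else:
            trail.append(f"{order_id}: work order issued for {s['step']}")
    return trail

for line in run_order("ORD-2001", INTERNET_ACCESS_OP):
    print(line)
```

The design-time artefact (the plan) is authored once per CFS type; the run-time function is invoked once per customer order, which is the instantiation described above.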

This is far less concise than I hoped. 

If you have a simpler way of answering the question and/or can point us to a better description, we’d love to hear from you!

What’s in your OSS for me?

May I ask you a question? Do the senior executives at your organisation ever USE your OSS/BSS?

I’d love to hear your answer.

My guess is that few, if any, do. Not directly anyway. They may depend on reports whose data comes from our OSS, but is that all?

Execs are ultimately responsible for signing off large budget allocations (in CAPEX and OPEX) for our OSS. But if they don’t see any tangible benefits, do the execs just see OSS as cost centres? And cost centres tend to become targets for cost reduction right?

Building on last week’s OSS Scoreboard Analogy, the senior execs are the head coaches of the team. They don’t need the transactional data our OSS are brilliant at collating (eg every network device’s health metrics). They need insights at a corporate objective level.

How can we increase the executives’ “what’s in it for me?” ranking of the OSS/BSS we implement? We can start by considering OSS design through the lens of senior executive responsibilities:

  • Strategy / objective development
  • Strategy execution (planning and ongoing management to targets)
  • Clear communication of priorities and goals
  • Optimising productivity
  • Risk management / mitigation
  • Optimising capital allocation
  • Team development

And they are busy, so they need concise, actionable information.

Do we deliver functionality that helps with any of those responsibilities? Rarely!

Could we? Definitely!

Should we? Again, I’d love to hear your thoughts!


The Ineffective OSS Scoreboard Analogy

Imagine for a moment that you’re the coach of a sporting team. You train your team and provide them with a strategy for the game. You send them out onto the court and let them play.

The scoreboard gives you all of the stats about each player. Their points, blocks, tackles, heart-rate, distance covered, errors, etc. But it doesn’t show the total score for each team or the time remaining in the game. 

That’s exactly what most OSS reports and dashboards are like! You receive all of the transactional data (eg alarms, truck-rolls, device performance metrics, etc), but not how you’re collectively tracking towards team objectives (eg growth targets, risk reduction, etc). 

Yes, you could infer whether the team is doing well by reverse engineering the transactional data. Yes, you could then apply strategies against those inferences in the hope that it has a positive impact. But that’s a whole lot of messing around in the chaos of the coach’s box with the scores close (you assume) and the game nearing the end (possibly). You don’t really know when the optimal time is to switch your best players back into the game.

As coach with funding available, would you be asking your support team to give you more transactional tools / data or the objective-based insights?

Does this analogy help articulate the message from the previous two posts (Wed and Thurs)?

PS. What if you wanted to build a coach-bot to replace yourself in the near future? Are you going to build automations that close the feedback loop against transactional data or are you going to be providing feedback that pulls many levers to optimise team objectives?

One big requirement category most OSS can’t meet

We talked yesterday about a range of OSS products that are more outcome-driven than our typically transactional OSS tools. There aren’t many of them around at this stage. I refer to them as “data bridge” products.
Our typical OSS tools help manage transactions (eg alarms, customer service activations, etc). They’re generally not so great at (directly) managing objectives such as:
  • Sign up an extra 50,000 customers along the new Southern network corridor this month
  • Optimise allocation of our $10M capital budget to improve average attainable speeds by 20% this financial year
  • Achieve 5% revenue growth in Q3
  • Reduce truck rolls by 10% in the next 6 months
  • Optimal management of the many factors that contribute to churn, thus reducing churn risk by 7% by next March
We provide tools to activate the extra 50,000 customers. We also provide reports / dashboards that visualise the numbers of activations. But we don’t tend to include the tools to manage ongoing modelling and option analysis to meet key objectives. Objectives that are generally quantitative and tied to time, cost, etc and possibly locations/regions. 
These objectives are often really difficult to model and have multiple inputs. Managing to them requires data that’s changing on a daily basis (or potentially even more often – think of how a single missed truck-roll ripples out through re-calculation of optimal workforce allocation).
That requires:
  • Access to data feeds from multiple sources (eg existing OSS, BSS and other sources like data lakes)
  • Near real-time data sets (or at least streaming or regularly updating data feeds)
  • An ability to quickly prepare and compare options (data modelling, possibly using machine-based learning algorithms)
  • Advanced visualisations (by geography, time, budget drawdown and any graph types you can think of)
  • Flexibility in what can be visualised and how it’s presented
  • Methods for delivering closed-loop feedback to optimise towards the objectives (eg RPA)
  • The ability to manage many different transaction-based levers (eg parallel project activities, field workforce allocations, etc) that contribute to rolled-up objectives / targets
You can see why I refer to this as a “data bridge” product, right? I figure that it sits above all other data sources and provides the management bridge across them all.
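As a toy illustration of the closed-loop, objective-level idea (all names, numbers and thresholds are entirely hypothetical):

```python
# Toy sketch of "data bridge" objective tracking (hypothetical names/numbers).
# Transactional feeds (activations per day) are rolled up against a target,
# so the gap can drive closed-loop actions rather than just dashboards.

TARGET = {"objective": "new activations, Southern corridor", "goal": 50_000}

daily_activations = [1600, 1750, 1500, 1820]  # eg streamed from the BSS

def progress(target: dict, actuals: list[int],
             days_elapsed: int, days_total: int = 30) -> dict:
    """Roll up transactional counts into objective-level insight."""
    achieved = sum(actuals)
    required_rate = (target["goal"] - achieved) / max(days_total - days_elapsed, 1)
    return {
        "achieved": achieved,
        "required_daily_rate": required_rate,  # feeds option analysis / RPA levers
        "on_track": achieved / days_elapsed >= target["goal"] / days_total,
    }

print(progress(TARGET, daily_activations, days_elapsed=4))
```

A real data bridge product would obviously do far more (multi-source feeds, option modelling, visualisation), but the roll-up from transactions to an objective gap is the core move.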
PS. If you want to know the name of the existing products that fit into the “data bridge” category, please leave us a message.

Do you want funding on an OSS project?

OSS tend to be very technical and transactional in nature. For example, a critical alarm happens, so we have to coordinate remedial actions as soon as possible. Or, a new customer has requested service so we have to coordinate the workforce to implement certain tasks in the physical and logical/virtual world. When you spend so much of your time solving transactional / tactical problems, you tend to think in a transactional / tactical way.
You can even see that in OSS product designs. They’ve been designed for personas who solve transactional problems (eg alarms, activations, etc). That’s important. It’s the coal-face that gets stuff done.
But who funds OSS projects? Are their personas thinking at a tactical level? Perhaps, but I suspect not on a full-time basis. Their thoughts might dive to a tactical level when there are outages or poor performance, but they’ll tend to be thinking more about strategy, risk mitigation and efficiency if/when they can get out of the tactical distractions.
Do our OSS meet project sponsor needs? Do our OSS provide functionality that helps manage strategy, risk and efficiency? Well, they can assist with reports and dashboards. But do reports and dashboards inspire sponsors enough to invest millions? Could sponsors rightly ask, “I’m spending money, but what’s in it for me?”
What if we tasked our product teams to think in terms of business objectives instead of transactions? The objectives may include rolled-up transaction-based data and other metrics of course. But traditional metrics and activities are just a means to an end.
You’re probably thinking that there’s no way you can retrofit “objective design” into products that were designed years ago with transactions in mind. You’d be completely correct in most cases. So what’s the solution if you don’t have retrofit control over your products?
Well, there’s a class of OSS products that I refer to as being “the data bridge.” I’ll dive into more detail on these currently rare products tomorrow.

An OSS checksum

Yesterday’s post discussed two waves of decisions stemming from our increasing obsession with data collection.

“…the first wave had [arisen] because we’d almost all prefer to make data-driven decisions (ie decisions based on “proof”) rather than “gut-feel” decisions.

We’re increasingly seeing a second wave come through – to use data not just to identify trends and guide our decisions, but to drive automated actions.”

Unfortunately, the second wave has an even greater need for data correctness / quality than we’ve experienced before.

The first wave allowed for human intervention after the collection of data. That meant human logic could be applied to any unexpected anomalies that appeared.

With the second wave, we don’t have that luxury. It’s all processed by the automation. Even learning algorithms struggle with “dirty data.” Therefore, the data needs to be perfect and the automation’s algorithm needs to flawlessly cope with all expected and unexpected data sets.

Our OSS have always had a dependence on data quality so we’ve responded with sophisticated ways of reconciling and maintaining data. But the human logic buffer afforded a “less than perfect” starting point, as long as we sought to get ever-closer to the “perfection” asymptote.

Does wave 2 require us to solve the problem from a fundamentally different starting point? We have to assume perfection akin to a checksum of correctness.

Perfection isn’t something I’m very qualified at, so I’m open to hearing your ideas. 😉


Riffing with your OSS

Data collection and data science is becoming big business. Not just in telco – our OSS have always been one of the biggest data gatherers around – but across all sectors that are increasingly digitising (should I just say, “all sectors” because they’re all digitising?).

Why do you think we’re so keen to collect so much data?

I’m assuming that the first wave had mainly been because we’d almost all prefer to make data-driven decisions (ie decisions based on “proof”) rather than “gut-feel” decisions.

We’re increasingly seeing a second wave come through – to use data not just to identify trends and guide our decisions, but to drive automated actions.

I wonder whether this has the potential to buffer us from making key insights / observations about the business, especially senior leaders who don’t have the time to “science” their data? Have teams already cleansed, manipulated, aggregated and presented data, thus stripping out all the nuances before senior leaders ever even see it?

I regretfully don’t get to “play” with data as much as I used to. I say regretfully because looking at raw data sets often gives you the opportunity to identify trends, outliers, anomalies and patterns that might otherwise remain hidden. Raw data also gives you the opportunity to riff off it – to observe and then ask different questions of the data.

How about you? Do you still get the opportunity to observe and hypothesise using raw OSS/BSS data? Or do you make your decisions using data that’s already been sanitised (eg executive dashboards / reports)?


OSS diamonds are forever (part 2)

Wednesday’s post discussed how OPEX is forever, just like the slogan for diamonds.
As discussed, some aspects of Operational Expenses are well known when kicking off a new OSS project (eg annual OSS license / support costs). Others can slip through the cracks – what I referred to as OPEX leakage (eg third-party software, ongoing maintenance of software customisations).
OPEX leakage might be an unfair phrase. If there’s a clear line of sight from the expenses to a profitable return, then it’s not leakage. If costs (of data, re-work, cloud services, applications, etc) are proliferating with no clear benefit, then the term “leakage” is probably fair.
I’ve seen examples of Agile and cloud implementation strategies where leakage has occurred. And even the supposedly “cheap” open-source strategies have led to surprises. OPEX leakage has caused project teams to scramble as their financial year progressed and budgets were unexpectedly being exceeded.
Oh, and one other observation to share that you may’ve seen examples of, particularly if you’ve worked on OSS in large organisations – Having OPEX incurred by one business unit but the benefit derived by different business units. This can cause significant problems for the people responsible for divisional budgets, even if it’s good for the business as a whole. 
Let me explain by example: An operations delivery team needs extra logging capability, so they stand up a new open-source tool. They make customisations so that log data can be collected for all of their network types. All log data is then sent to the organisation’s cloud instance. The operations delivery team now owns the lifecycle maintenance costs. However, the costs of cloud (compute and storage) and data lake licensing have now escalated, but Operations doesn’t foot that bill. They’ve just handed that “forever” budgetary burden to another business unit.
The opposite can also be true. The costs of build and maintain might be borne by IT or ops, but the benefits in revenue or CX (customer experience) are gladly accepted by business-facing units.
Both types of project could give significant whole-of-company benefit. But the unit doing the funding will tend to choose less effective projects if it means their own business unit will derive the benefit (especially if individuals’ bonuses are tied to those results).
OSS can be powerful tools, giving and receiving benefit from many different business units. However, the more OPEX-centric OSS projects that we see today are introducing new challenges to get funded and then supported across their whole life-cycle.
PS. Just like diamonds bought at retail prices, there’s a risk that the financials won’t look so great a year after purchase. If that’s the case, you may have to seek justification on intangible benefits.  😉
PS2. Check out Robert’s insightful comment to the initial post, including the following question, “I wonder how many OSS procurements are justified on the basis of reducing the Opex only *of the current OSS*, rather than reducing the cost of achieving what the original OSS was created to do? The former is much easier to procure (but may have less benefit to the business). The latter is harder (more difficult analysis to do and change to manage, but payoff potentially much larger).”

Diamonds are Forever and so is OSS OPEX

Sourced from: www.couponraja.in

I sometimes wonder whether OPEX is underestimated when considering OSS investments, or at least some facets (sorry, awful pun there!) of it.

Cost-out (aka head-count reduction) seems to be the most prominent OSS business case justification lever. So that’s clearly not underestimated. And the move to cloud is also an OPEX play in most cases, so it’s front of mind during the procurement process too. I’m nought for two so far! Hopefully the next examples are a little more persuasive!

Large transformation projects tend to have a focus on the up-front cost of the project, rightly so. There’s also an awareness of ongoing license costs (usually 20-25% of OSS software list price per annum). Less apparent costs can be found in the exclusions / omissions. This is where third-party OPEX costs (eg database licenses, virtualisation, compute / storage, etc) can be (not) found.

That’s why you should definitely consider preparing a TCO (Total Cost of Ownership) model that includes CAPEX and OPEX that’s normalised across all options when making a buying decision.
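To make the idea concrete, here’s a minimal sketch of a normalised TCO comparison. All vendor names, dollar figures, the discount rate and the evaluation window below are purely hypothetical, for illustration only:

```python
# Illustrative TCO sketch: normalise up-front CAPEX and recurring OPEX
# over a common evaluation window so options can be compared like-for-like.
# All figures and names are hypothetical.

def total_cost_of_ownership(capex, annual_opex, years=5, discount_rate=0.08):
    """Net present cost of an option over the evaluation window."""
    npv_opex = sum(annual_opex / (1 + discount_rate) ** y
                   for y in range(1, years + 1))
    return capex + npv_opex

options = {
    "Vendor A (on-prem)": total_cost_of_ownership(capex=2_000_000, annual_opex=450_000),
    "Vendor B (SaaS)":    total_cost_of_ownership(capex=300_000,   annual_opex=900_000),
}

# Rank cheapest-first over the normalised window
for name, tco in sorted(options.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${tco:,.0f} over 5 years")
```

Even a toy model like this makes the third-party OPEX exclusions visible: they simply become extra terms in `annual_opex` rather than a surprise in year two.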

But the more subtle OPEX leakage occurs through customisation. The more customisation from “off-the-shelf” capability, the greater the variation from baseline, the larger the ongoing costs of maintenance and upgrade. This is not just on proprietary / commercial software, but open-source products as well.

And choosing Agile almost implies ongoing customisation. One of the things about Agile is it keeps adding stuff (apps, data, functions, processes, code, etc) via OPEX. It’s stack-ranked, so it’s always the most important stuff (in theory). But because it’s incremental, it tends to be less closely scrutinised than during a CAPEX / procurement event. Unless carefully monitored, there’s a greater chance for OPEX leakage to occur.

And as we know, OPEX, like diamonds, is forever (ie the costs reappear year after year).

A billion dollar bid

A few years ago I was lucky enough to be invited to lead a bid. I say lucky because the partner organisations are two of the most iconic firms in the tech industry. The bid was for bleeding-edge work, potentially worth well over a billion dollars. I was a little surprised to be honest. I mean, two tech titans, with many very, very clever people, much cleverer than me. Why would they need to look outside and engage me?

As it turned out, the answer became clear within the first few meetings. And whilst the project had little to do with OSS, it certainly had (has) parallels in the world of OSS.

Both of the organisations were highly siloed. Each product / capability silo had immense talent and immense depth to it. Our combined team had many PhDs who could discuss their own silo for hours, but could only point me in the general direction of what plugged into their products. 

Clearly, I was engaged to figure out the required end-to-end solution for the customer and then how to bolt the two sets of silos into that solution framework.

The same is true when looking for OSS solution gaps, in my experience at least. If you look into a domain or a product, the functionality / capability is usually quite well defined, understood and supported. For example, alarm / event managers are invariably very good at managing alarm / event lists.

If you’re going to find gaps, they’re more likely to be found in the end-to-end solution – in the handoffs, responsibility demarcation points, interfaces and processes that cross between silos. That’s why external consultancies can prove valuable for large organisations. They generally look into the cross-domain solution performance.

As you’d already know, the end-to-end solution is a combination of people, process and technology. Even so, as the “manager of managers,” I’m not sure our OSS tech is solving this problem as well as it could. Is there even a “glue” product that’s missing from our OSS/BSS stack?

Sure, we have some tools that fit this purpose – workflow engines, messaging buses, orchestration engines, data lakes, etc. Yet I still feel there’s an opportunity to do it far better. And the opportunity probably extends far beyond just OSS and into the broader IT industry.

What have you done to help solve this problem on your OSS suites?

PS. If you’re wondering what happened to the bid. Well, the team was excited to have made the shortlist of 3, but then the behemoths decided to withdraw from the race. Turns out that winning the bid could’ve jeopardised the even bigger supply contracts they already had with the client. Boggles the mind to think there were bigger contracts already in play!!


Inventory Management re-states its case

In a post last week we posed the question of whether Inventory Management still retains relevance. There are certainly use cases where it remains unquestionably needed. But perhaps other use cases are no longer required, relics of old-school processes and data flows.
If you have an extensive OSP (Outside Plant) network, you have almost no option but to store all this passive infrastructure in an Inventory Management solution. You don’t have the option of having an EMS (Element Management System) console / API to tell you the current design/location/status of the network. 
In the modern world of ubiquitous connection and overlay / virtual networks, Inventory Management might be less essential than it once was. For service qualification, provisioning and perhaps even capacity planning, everything you need to know is available on demand from the EMS/s. The network is a more correct version of the network inventory than any external repository (ie Inventory Management) can hope to be, even if you have great success with synchronisation.
But I have a couple of other new-age use-cases to share with you where Inventory Management still retains relevance.
One is for connectivity (okay, so this isn’t exactly a new-age use-case, but the scenario I’m about to describe is). If we have a modern overlay / virtual network, anything that stays within a domain is likely to be better served by its EMS equivalent. Especially since connectivity is no longer as simple as physical connections or nearest neighbours with advanced routing protocols. But anything that goes cross-domain and/or off-net needs a mechanism to correlate, coordinate and connect. That’s the role the Inventory Manager is able to fill (conceptually).
The other is for digital twinning. OSS (including Inventory Management) was the “original twin.” It was an offline mimic of the production network. But I cite Inventory Management as having a new-age requirement for the digital twin. I increasingly foresee the need for predictive scenarios to be modelled outside the production network (ie in the twin!). We want to try failure / degradation scenarios. We want to optimise our allocation of capital. We want to simulate and optimise customer experience under different network states and loads. We’re beginning to see the compute power that’s able to drive these scenarios (and more) at scale.
Is it possible to handle these without an Inventory Manager (or equivalent)?

When OSS experts are wrong

“When experts are wrong, it’s often because they’re experts on an earlier version of the world.”
Paul Graham.
OSS experts are often wrong. Not only because of the “earlier version of the world” paradigm mentioned above, but also the “parallel worlds” paradigm that’s not explicitly mentioned. That is, they may be experts on one organisation’s OSS (possibly from spending years working on it), but have relatively little transferable expertise on other OSS.
It would be nice if the OSS world view never changed and we could just get more and more expert at it, approaching an asymptote of expertise. Alas, it’s never going to be like that. Instead, we experience a world that’s changing across some of our most fundamental building blocks.
“We are the sum total of our experiences.”
B.J. Neblett.
My earliest forays into OSS had a heavy focus on inventory. The tie-in between services, logical and physical inventory (and all use-cases around it) was probably core to me becoming passionate about OSS. I might even go as far as saying I’m “an Inventory guy.”
Those early forays occurred when there was a scarcity mindset in network resources. You provisioned what you needed and only expanded capacity within tight CAPEX envelopes. Managing inventory and optimising revenue using these scarce resources was important. We did that with the help of Inventory Management (IM) tools. Even end-users had a mindset of resource scarcity. 
But the world has changed. We now operate with a cloud-inspired abundance mindset. We over-provision physical resources so that we can just spin up logical / virtual resources whenever we wish. We have meshed, packet-switched networks rather than nailed-up circuits. Generally speaking, cost per resource has fallen dramatically, so we now get much higher port density, compute capacity and bits per dollar. Customers of the cloud generation assume an abundance of capacity that’s available even in small consumption-based increments. In many parts of the world we can also assume ubiquitous connectivity.
So, as “an inventory guy,” I have to question whether the scarcity to abundance transformation might even fundamentally change my world-view on inventory management. Do I even need an inventory management solution or should I just ask the network for resources when I want to turn on new customers and assume the capacity team has ensured there’s surplus to call upon?
Is the enormous expense we allocate to building and reconciling a digital twin of the network (ie the data gathered and used by Inventory Management) justified? Could we circumvent many of the fallouts (and a multitude of other problems) that occur because the inventory data doesn’t accurately reflect the real network?
For example, in the old days I always loved how much easier it was to provision a customer’s mobile / cellular or IN (Intelligent Network) service than a fixed-line service. It was easier because fixed-line service needed a whole lot more inventory allocation and reservation logic and process. Mobile / IN services didn’t rely on inventory, only an availability of capacity (mostly). Perhaps the day has almost come where all services are that easy to provision?
Yes, we continue to need asset management and capacity planning. Yes, we still need inventory management for physical plant that has no programmatic interface (eg cables, patch-panels, joints, etc). Yes, we still need to carefully control the capacity build-out to CAPEX to revenue balance (even more so now in a lower-profitability operator environment). But do many of the other traditional Inventory Management and resource provisioning use cases go away in a world of abundance?


I’d love to hear your opinions, especially from all you other “inventory guys” (and gals)!! Are your world-views, expertise and experiences changing along these lines too or does the world remain unchanged from your viewing point?
Hat tip to Garry for the seed of this post!

Google’s Circular Economy in OSS

OSS wear many hats and help many different functions within an organisation. One function that OSS assists might be surprising to some people – the CFO / Accounting function.

The traditional service provider business model tends to be CAPEX-heavy, with significant investment required on physical infrastructure. Since assets need to be depreciated and life-cycle managed, Accountants have an interest in the infrastructure that our OSS manage via Inventory Management (IM) tools.

I’ve been lucky enough to work with many network operators and see vastly different asset management approaches used by CFOs. These strategies have ranged from fastidious replacement of equipment as soon as depreciation cycles have expired through to building networks using refurbished equipment that has already passed manufacturer End-of-Life dates. These strategies fundamentally affect the business models of these operators.

Given that telecommunications operator revenues are trending lower globally, I feel it’s incumbent on us to use our OSS to deliver positive outcomes to global business models. 

With this in mind, I found this article entitled, “Circular Economy at Work in Google Data Centers,” to be quite interesting. It cites, “Google’s circular approach to optimizing end of life of servers based on Total Cost of Ownership (TCO) principles have resulted in hundreds of millions per year in cost avoidance.”

Google Asset Lifecycle

Asset lifecycle management is not your typical focus area for OSS experts, but an area where we can help add significant value for our customers!

Some operators use dedicated asset management tools such as SAP. Others use OSS IM tools. Others reconcile between both. There’s no single right answer.

For a deeper dive into ideas where our OSS can help in asset lifecycle (which Google describes as its Circular Economy and seems to manage using its ReSOLVE tool), I really recommend reviewing the article link above.

If you need to develop such a tool using machine learning models, reach out to us and we’ll point you towards some tools equivalent to ReSOLVE to augment your OSS.

Another OSS “forehead-slap” moment!

I don’t know about you, but I find this industry of ours has a remarkable ability to keep us humble. Barely a day goes by when I don’t have to slap my forehead and say, “uhhh…. of course!” (or perhaps, “D’oh!!”)

I had one such instance yesterday. I couldn’t figure out why a client’s telemetry / performance-management suite needed an inventory ingestion interface. Can you think of a reason (you probably can)?

My mind had followed the line of thinking that it was for reconciling with traditional inventory systems or perhaps some sort of topology reckoning. It’s far more rudimentary than that. 

Have you figured out what it might be used for yet?


For example, if device names (hostnames) attached to the metrics aren’t human-readable, it’s simple: just enrich the data with the human-readable alternate name. If you don’t know what device type is generating sub-sets of metrics, no problem: just enrich the data.
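A toy sketch of what that enrichment might look like, assuming a simple in-memory inventory lookup keyed on hostname (all hostnames, friendly names and field names below are made up):

```python
# Minimal performance-data enrichment sketch: attach human-readable
# attributes from an inventory lookup to each raw metric record.
# All names and fields are hypothetical.

inventory = {
    "syd-pe-01":  {"friendly_name": "Sydney PE Router 1",
                   "device_type": "PE router"},
    "mel-agg-07": {"friendly_name": "Melbourne Aggregation 7",
                   "device_type": "aggregation switch"},
}

def enrich(metric_record):
    """Return a copy of the record with inventory attributes attached."""
    details = inventory.get(metric_record["hostname"], {})
    return {**metric_record,
            "friendly_name": details.get("friendly_name", "unknown"),
            "device_type": details.get("device_type", "unknown")}

raw = {"hostname": "syd-pe-01", "metric": "if_util_pct", "value": 87.2}
print(enrich(raw))
```

In a real stack the `inventory` dict would be the ingested inventory feed, but the principle is the same: a cheap keyed lookup at ingestion time.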

I’d heard of enrichment of alarms/events of course, but hadn’t followed that line of thinking for performance management before. Does your performance management stack allow you to enrich its data sets?

Seems obvious in hindsight! Smacked down again!!

I’d love to hear any anecdotes you have where OSS gave you a “forehead slap” moment.

Over 30 Autonomous Networking User Stories

The following is a set of user stories I’ve provided to TM Forum to help with their current Autonomous Networking initiative.

They’re just an initial discussion point for others to riff off. We’d love to get your comments, additions and recommended refinements too.

As a Head of Network Operations, I want to Automatically maintain the health of my network (within expected tolerances if necessary) So that Customer service quality is kept to an optimal level with little or no human intervention
As a Head of Network Operations, I want to Ensure the overall solution is designed with machine-led automations as a guiding principle So that Human intervention can not be easily engineered into the systems/processes
As a Head of Network Operations, I want to Automatically identify any failures of resources or services within the entire network So that All relevant data can be collected, logged, codified and earmarked for effective remedial action without human interaction
As a Head of Network Operations, I want to Automatically identify any degradation of resource or service performance within the network So that All relevant data can be collected, logged, codified and earmarked for effective remedial action without human interaction
As a Head of Network Operations, I want to Map each codified data set (for failure or degradation cases) to a remedial action plan So that Remedial activities can be initiated without human interaction
As a Head of Network Operations, I want to Identify which remedial activities can be initiated via a programmatic interface and which activities require manual involvement such as a truck roll So that Even manual activities can be automatically initiated
As a Head of Network Operations, I want to Ensure that automations are able to resolve all known failure / degradation scenarios So that Activities can be initiated for any failure or degradation and be automatically resolved through to closure (with little or no human intervention)
As a Head of Network Operations, I want to Ensure there is sufficient network resilience So that Any failure or degradation can be automatically bypassed (temporarily or permanently)
As a Head of Network Operations, I want to Ensure there is sufficient resilience within all support systems So that Any failure or degradation can be automatically bypassed (temporarily or permanently) to ensure customer service is maintained
As a Head of Network Operations, I want to Ensure that operator initiated changes (eg planned maintenance, software upgrades, etc) automatically generate change tracking, documentation and logging So that The change can be monitored (by systems and humans where necessary) to ensure there is minimal or no impact to customer services, but also to ensure resolution data is consistently recorded
As a Head of Network Operations, I want to Ensure that customer initiated changes (eg by raising an incident) automatically generate change tracking, documentation and logging So that The change can be monitored (by systems and humans where necessary) to ensure the incident is closed expediently, but also to ensure resolution data is consistently recorded
As a Head of Network Operations, I want to Initiate planned outages with or without triggering automated remedial activities So that The change agents can decide to use automations or not and ensure automations don’t adversely affect the activities that are scheduled for the planned outage window
As a Head of Network Operations, I want to Ensure that if an unplanned outage does occur, impacted customers are automatically notified (on first instance and via a communications sequence if necessary throughout the outage window) So that Customer experience can be managed as best possible
As a Head of Network Operations, I want to Ensure that if an unplanned outage does occur without a remedial action being triggered, a post-mortem analysis is initiated So that Automations can be revised to cope with this previously unhandled outage scenario
As a Head of Network Operations, I want to Ensure that even previously un-seen new fail scenarios can be handled by remedial automations So that Customer service quality is kept to an optimal level with little or no human intervention
As a Head of Network Operations, I want to Automatically monitor the effects of remedial actions So that Remedial automations don’t trigger race conditions that result in further degradation and/or downstream impacts
As a Head of Network Operations, I want to Be able to manually override any automations by following a documented sequence of events So that If a race condition is inadvertently triggered by an automation, it can be negated quickly and effectively before causing further degradation
As a Head of Network Operations, I want to Intentionally trigger network/service outages and/or degradations, including cascaded scenarios on a scheduled and/or randomised basis So that The resilience of the network and systems can be thoroughly tested (and improved if necessary)
As a Head of Network Operations, I want to Intentionally trigger network/service outages and/or degradations, including cascaded scenarios on an ad-hoc basis So that The resilience of the network and systems can be thoroughly tested (and improved if necessary)
As a Head of Network Operations, I want to Perform scheduled compliance checks on the network So that Expected configurations and policies are in place across the network
As a Head of Network Operations, I want to Automatically generate scheduled reports relating to the effectiveness of the network, services and automations So that The overall solution health (including automations) can be monitored
As a Head of Network Operations, I want to Automatically generate dashboards (in near-real-time) relating to the effectiveness of the network, services and automations So that The overall solution health (including automations) can be monitored
As a Head of Network Operations, I want to Ensure that automations are able to extend across all domains within the solution So that Remedial actions aren’t constrained by system hand-offs
As a Head of Network Operations, I want to Ensure configuration backups are performed automatically on all relevant systems (eg EMS, OSS, etc) So that A recent good solution configuration can be stored as protection in case automations fail and corrupt configurations within the system
As a Head of Network Operations, I want to Ensure configuration restores are performed and tested automatically on all relevant systems (eg EMS, OSS, etc) So that A recent good solution configuration can be reverted to in case automations fail and corrupt configurations within the system
As a Head of Network Operations, I want to Ensure automations are able to manage the entire service lifecycle (add, modify/upgrade, suspend, restore, delete) So that Customer services can evolve to meet client expectations with little or no human intervention
As a Head of Network Operations, I want to Have a design and architecture that uses intent-based and/or policy-based actions So that The complexity of automations is minimised (eg automations don’t need to consider custom rules for different device makes/models, etc)
As a Head of Network Operations, I want to Ensure as many components of the solution (eg EMS, OSS, customer portals, etc) have programmatic interfaces (even if manual activities are required in back-end processes) So that Automations can initiate remedial actions in near real time
As a Head of Network Operations, I want to Ensure all components and data flows within the solution are securely hardened (eg encryption of data in motion and at rest) So that The power of the autonomous platform can not be leveraged for nefarious purposes
As a Head of Network Operations, I want to Ensure that all required metrics can be automatically sourced from the network / systems in as near real time as feasible / useful So that Automations have the full set of data they need to initiate remedial actions and it is as up-to-date as possible for precise decision-making
As a Head of Network Operations, I want to Use the power of learning machines So that The sophistication and speed of remedial response is faster, more accurate and more reliable than if manual interaction were used
As a Head of Network Operations, I want to Record actual event patterns and replay scenarios offline So that Event clusters and response patterns can be thoroughly tested as part of the certification process prior to being released into production environments
As a Head of Network Operations, I want to Capture metrics that can be cross-referenced against event patterns and remedial actions So that Regressions and/or refinements can improve existing automations (ie continuous retraining of the model)
As a Head of Network Operations, I want to Be able to seed a knowledge base with relevant event/action data, whether the pattern source is from Production, an offline environment, a digital twin environment or other production-like environments So that The database is able to identify real scenarios, even if scenarios are intentionally initiated, but could potentially cause network degradation to a production environment
As a Head of Network Operations, I want to Ensure that programmatic interfaces also allow for revert / rollback capabilities So that Remedial actions that aren’t beneficial can be rolled back to the previous state; OR other remedial actions are performed, allowing the automation to revert to original configuration / state
As a Head of Network Operations, I want to Be able to initiate circuit breakers to override any automations So that If a race condition is inadvertently triggered by an automation, it can be negated quickly and effectively before causing further degradation
As a Head of Network Operations, I want to Manually or automatically generate response-plans (ie documented sequences of activities) for any remedial actions fed back into the system So that Internal (eg quality control) or external (eg regulatory) bodies can review “best-practice” remedial activities at any point in time
As a Head of Network Operations, I want to Intentionally trigger catastrophic network failures (in non-prod environments) So that We can trial many remedial actions until we find an optimal solution to seed the knowledge base with

H-OSS-ton, we have a problem

You’ve all probably seen this scene from the Tom Hanks movie, Apollo 13, right? But you’re probably wondering what it has to do with OSS?

Well, this scene came to mind when I was preparing a list of user stories required to facilitate Autonomous Networking.

More specifically, to the use-case where we want the Autonomous Network to quickly recover (as best it can) from unplanned catastrophic network failures.

Of course we don’t want catastrophic network failures in production environments, but if one does occur, we’d prefer that our learning machines already have some idea on how to respond to any unlikely situation. We don’t want them to be learning response mechanisms after a production event.

But similarly, we don’t want to trigger massive outages on production just to build up a knowledge base of possible cause-effect groupings. That would be ridiculous.

That’s where the Apollo 13 analogy comes into play:

  • The engineers on the ground (ie the non-prod environment) were tasked with finding a solution to the problem (as they said, “fitting a square peg in a round hole”)
  • The parts the Engineers were given matched the parts available in the spacecraft (ie non-prod and prod weren’t an exact match, but enough of a replica to be useful)
  • The Engineers were able to trial many combinations using the available parts until they found a workable resolution to the problem (even if it relied heavily on duct tape!)
  • Once the workable solution was found, it was codified (as a procedure manual) and transferred to the spacecraft (ie migrating seed data from non-prod to prod)

If I were responsible for building an Autonomous Network, I’d want to dream up as many failure scenarios as I could, initiate them in non-prod and then duct-tape* solutions together for them all… and then attempt to pre-seed those learnings into production.

* By “duct-tape” I mean letting the learning machine attempt to find optimal solutions by trialing different combinations of automated / programmatic and manual interventions.

We use time-stamping in OSS, but what about geo-stamping?

A slightly left-field thought dawned on me the other day and I’d like to hear your thoughts on it.

We all know that almost all telemetry coming out of our networks is time-stamped. Events, syslogs, metrics, etc. That makes perfect sense because we look for time-based ripple-out effects when trying to diagnose issues.

But does it therefore also make sense to geo-stamp telemetry data too? Just as time-based ripple-out is common, so too are geographic / topological (eg nearest-neighbour and/or power-source) ripple-out effects.

If you want to present telemetry data as a geo/topo overlay, you currently have to enrich the telemetry data set first. Typically that means identifying the device name that’s generating the data and then doing a query on huge inventory databases to find the location and connectivity that corresponds to that device.

It’s usually not a complex query, but consider how much processing power must go into enriching at the enormous scale of telemetry records.

For stationary devices (eg core routers), it might seem a bit absurd adding a fixed geo-code (which has to be manually entered into the device once) to every telemetry record, but it seems computationally far more efficient than data lookups (please correct me if I’m wrong here!). For devices that move around (eg routers on planes), hopefully they already have GPS sensors to provide geo-stamp data.
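A rough sketch of the trade-off described above, with hypothetical hostnames and coordinates. The lookup approach pays a query cost on every record, whereas the geo-stamp is configured once at the source and simply travels with the data:

```python
# Sketch of the two approaches to locating telemetry. Names, fields
# and coordinates are hypothetical.

# Option 1: enrichment by lookup - a location query per record,
# performed centrally at ingestion time.
location_db = {"syd-pe-01": (-33.8688, 151.2093)}

def enrich_with_location(record):
    record["geo"] = location_db.get(record["hostname"])
    return record

# Option 2: geo-stamp at source - the device embeds its (fixed, or
# GPS-derived for mobile devices) coordinates in every record it emits.
def emit_geo_stamped(hostname, metric, value, geo):
    return {"hostname": hostname, "metric": metric,
            "value": value, "geo": geo}

rec = enrich_with_location({"hostname": "syd-pe-01",
                            "metric": "cpu_pct", "value": 41})
print(rec["geo"])
```

Either way the downstream geo/topo overlay consumes the same `geo` field; the question is only where the computational cost is paid.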

What do you think? Am I stating a problem that has already been solved and/or is not worth solving? Or does it have merit?

The Autonomous Network / OSS Clock

In yesterday’s post, we talked about what needs to happen for a network operator to build an autonomous network. Many of the factors extended beyond the direct control of the OSS stack. We also looked at the difference between designing network autonomy for an existing OSS versus a ground-up build of an autonomous network.

We mostly looked at the ground-up build yesterday (at the expense of legacy augmentation).

So let’s take a slightly closer look at legacy automation. Like any legacy situation, you need to first understand current state. I’ve heard colleagues discuss the level of maturity of an existing network operations stack in terms of a single metric.

However, I feel that this might miss some of the nuances of the situation. For example, different activities are likely to be at different levels of maturity. Hence, the attempt at benchmarking the current situation on the OSS or Autonomous Networking clock below.

OSS Autonomy Clock

Sample activities are shown in grey boxes to demonstrate the concept (I haven’t yet invested enough time into what the actual breakdown of activities might be).

  • Midnight is no monitoring capability
  • 3AM is Reactive Mode (ie reacting to data presented by the network / systems)
  • 6AM is Predictive Mode (ie using historical learnings to identify future situations)
  • 9AM is Prescriptive / Pre-cognitive Mode (ie using historical learnings, or pre-cognitive capabilities to identify what to do next)
  • Mid-day is Autonomous Networking (ie to close the loop and implement / control actions that respond to current situations automatically)
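One possible way to codify the clock above as a per-activity benchmark, rather than a single whole-of-stack metric. The activities and their clock positions below are placeholders to show the idea, not a proposed taxonomy:

```python
# The autonomy clock as a per-activity maturity benchmark.
# Activity names and positions are illustrative placeholders.

AUTONOMY_CLOCK = {
    0:  "No monitoring capability",      # midnight
    3:  "Reactive",                      # 3AM
    6:  "Predictive",                    # 6AM
    9:  "Prescriptive / Pre-cognitive",  # 9AM
    12: "Autonomous (closed loop)",      # mid-day
}

# Each activity is benchmarked independently, rather than
# summarising the whole stack with one number.
activity_maturity = {
    "Fault detection":      6,
    "Capacity planning":    3,
    "Service provisioning": 9,
}

for activity, hour in activity_maturity.items():
    print(f"{activity}: {hour} o'clock ({AUTONOMY_CLOCK[hour]})")
```

This makes the nuance explicit: a stack might be prescriptive for provisioning while still only reactive for capacity planning.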

As always, I’d love to hear your thoughts!