Inability to serve the market (eg offerings, capacity, etc)
Inability to operate network assets profitably
In that article, we looked closely at a human factor and how current trends of open-source, Agile and microservices might actually exacerbate it. In yesterday’s article we looked at market-serving factors for us to investigate and monitor.
But let’s look at point 3 today. The profitability factors we could consider that reduce the chances of the big boss getting fired are:
Ability to see revenues in near-real-time (revenues are relatively easy to collect, so we use these numbers a lot. Much harder are profitability measures because of the shared allocation of fixed costs)
Ability to see cost breakdown (particularly which parts of the technical solution are most costly, such as what device types / topologies are failing most often)
Ability to measure profitability by product type, customer, etc
Are there more profitable or cost-effective solutions available
Is there greater profitability that could be unlocked by simplification
Inability to serve the market (eg offerings, capacity, etc)
Inability to operate network assets profitably
In that article, we looked closely at a human factor and how current trends of open-source, Agile and microservices might actually exacerbate it. In yesterday’s article we looked at the broader set of catastrophic failure factors for us to investigate and monitor.
But let’s look at some of the broader examples under point 2 today. The market-serving factors we could consider that reduce the chances of the big boss getting fired are:
Immediate visibility of key metrics by boss and execs (what are the metrics that matter, eg customer numbers, ARPU, churn, regulatory, media hot-buttons, network health, etc)
Response to “voice of customer” (including customer feedback, public perception, etc)
Human resources (incl up-skill for new tech, etc)
Ability to implement quickly / efficiently
Ability to handle change (to network topology, devices/vendors, business products, systems, etc)
Measuring end-to-end user experience, not just “nodal” monitoring
Scalability / Capacity (ability to serve customer demand now and into a foreseeable future)
Not sure whether you have a clear answer to that question – either the thought is enticing (you want the CEO to get fired), unthinkable (you don’t want the CEO fired) or somewhere in between. You’d invariably get different answers from different employees within most organisations (you can’t please all of the people all of the time).
Today’s post also has no singular situation, so it poses a number of hypothetical questions. But let’s follow the thread of the question from a purely technical perspective (ie discounting inappropriate personal decisions, incompetence and the like).
If we look at the technical factors that could get the big boss fired, they’re probably only limited to:
Inability to serve the market (eg offerings, capacity, etc)
Inability to operate network assets profitably
Let’s start with catastrophic failures, ones where entire networks and/or product lines go down for long periods (as opposed to branches of networks becoming faulty or intermittently failing). We’ve had quite a few catastrophic failures in my region in the last few years.
Interestingly, we’re designing networks and systems that should be more reliable day-by-day. In all likelihood they probably are. But with increased system complexity and programmed automation, I wonder whether we’re actually increasing the likelihood of catastrophic failures. That is, examples of cascading failures that are too complex for operators to contain.
Anecdotal evidence surrounding the failures mentioned above in my area suggest that a skills-gap after many retirements / redundancies has been partly to blame. The replacements haven’t been as well equipped as their predecessors to prevent the runaway train of catastrophic failure.
Do you think a similar risk awaits us with the current trend towards build-it-yourself OSS? We have all these custom-built, one-of-a-kind, complex systems being built currently. What happens in a few years time when their designers / implementers leave? Yes, their replacements might be able to interrogate code repositories, but will they know what changes to rapidly make when patterns of degradation start to appear? Will they have any chance of inherently understanding the logic behind these systems like those who created them? Will they be able to stop the runaway train/s?
There are many pros and cons of the build vs buy argument, but at least there’s a level of repeatability and modularity with off the shelf solutions.
I wonder if the CEOs of today/tomorrow are planning for sophisticated up-skilling of replacements in future when they decide to build in-house today? Or will they just hand the replacements the keys and wish them luck? The latter approach keeps costs down, but it just might get the CEO fired!
We’ll take a closer look at the other two factors (“serve the market” and “profitability”) in the next few days.
I’m currently reading a book entitled, “Jony Ive. The genius behind Apple’s greatest products.”
I’d like to share a paragraph with you from it (and probably expect a few more in coming days):
“…Apple’s internal culture heavily favored the engineers within the product groups. The design process was engineering driven. In the early days of Frog Design, the engineers had bent over backward to help implement the design team’s ambitions, but now the power had shifted. The different engineering groups gave their products in development to Brunner’s group, who were expected to merely “skin” them.
Brunner wanted to shift the power from engineering to design. He started thinking strategically… The idea was to get ahead of the engineering groups and start to make Apple more of a design-driven company rather than a marketing or engineering one.”
That’s an unbelievably insightful conclusion Robert Brunner made. If he wanted to turn Apple into a design-driven company, then he’d have to prepare design concepts that looked further into the future than where the engineers were up to. Products like the iPod and iPad are testimony that Brunner’s strategy worked.
We face the same situation in OSS today. The power of product development tends to lie with engineering, ie the developers. I have huge admiration for the very clever and very talented engineers who create amazing products for us to use, buuutttttt…….
I just have one reservation – is there a single OSS company that is design-driven? A single one that’s making intuitive, effective, beautiful experiences for their users? Of course engineering holds power over design in OSS – how many OSS vendors even have a dedicated design department???
Let me give a comparison (albeit a slightly unfair one). Both of my children were reasonably adept at navigating their way around our iPad (for multiple use cases) by the age of three. What would the equivalent “intuition age” be for navigating our OSS?
If you’re a product manager, have you ever tried it? Have you ever considered benchmarking it (or an equivalent usability metric) and seeing what you could do to improve it for your OSS products?
When I first started the Passionate About OSS site / blog many years ago, I was lucky to get a handful of views per day. It’s grown by many multiples since then, fortunately.
The launch of The Blue Book OSS/BSS Vendor Directory generated some exciting metrics yesterday. The directory alone came within 5 pageviews of the highest count we’ve ever seen on PassionateAboutOSS.com (and PAOSS is up to nearly 2,500 posts now). That total appeared in only a 14-hour window because we didn’t go live with The Directory or metric collection until ~10am local time! The graphs are indicating that we should easily exceed PAOSS’s best ever count today.
If you were one of the many viewers who popped in from all around the world to look at The Directory, thank you! If you have any suggested improvements, we’d love to hear from you as we’re sure to be making many further tweaks in coming days/months.
But the most interesting fact about the launch yesterday was that a job posting appeared on UpWork to scrape all the data we’ve presented. On our very first day!! In fact a gentleman in the US reviewed bids and awarded the UpWork job all within about 14 hours of go-live.
That’s positive news because it means that at least one person must’ve thought the data was useful. 🙂
It provides a comprehensive directory of over 400 suppliers that produce OSS, BSS and/or related network management tools. Company details, product details and functionality classifications are included.
Every network operator has a unique set of needs from their operational software – software that includes OSS (Operational Support Systems), BSS (Business Support Systems), NMS (Network Management Systems) and the many other related tools.
To service those many and varied needs, a large number of different products have been created by some very clever developers. But it’s a highly fragmented market. There are literally hundreds of product options out there and they all have different capabilities.
If you’re a typical buyer, how many of those products are you familiar with? Five? Ten? Fifty? How do you know whether the best-fit product or supplier is within the list you already know? Perhaps the best-fit is actually amongst the hundreds of other products and suppliers you’re not familiar with yet. How much time do you have to research each one and distill down to a short-list of possible candidates to service your specific needs? Where do you start? Lots of web searches? There has to be an easier way.
What if you’re a seller? These products tend to have lengthy life-cycles once they’ve been installed so it might be years before a prospect actually enters the buying phase. Yet there are so many prospects out there at different phases of their buying windows. There are bound to be some live ones at any time that suit your capabilities. The challenge for you as a supplier is how to make those prospects aware of you. You don’t have the time to establish trusted relationships with hundreds, perhaps even thousands, of buyers across the globe (or maybe just within your region/s). Wouldn’t you love to be presented with qualified prospects who are in (or nearing) their buying window?
Well we at Passionate About OSS have created The Blue Book OSS/BSS Vendor Directory to simplify the task of bringing buyers and sellers together. With over 400 suppliers listed (and climbing), we provide a single, comprehensive repository for searching, matching and connecting. The tools allow you to do it yourself, or we can help you using the approaches we’ve developed, used and refined over the years.
Now just click on “Directory” to start your journey of searching, matching and connecting (and updating your listing if you’re a supplier).
We’ve spoken at length about TM Forum’s, “Time to kill the RFP? Reinventing IT procurement for the 2020s,” report so far this week. We’ve also spoken about the feeling that the OSS/BSS RFP (Request For Proposal) still has relevance in some situations… as long as it’s more of a lighter-touch than most. We’ve spoken about a more pragmatic approach that aims to find best available fit (for key objectives through stages of filtering) rather than perfect fit (for all requirements through detailed analyses). And I should note that “best available fit” includes measurement against these three contrarian procurement KPIs ahead of the traditional ones.
Yesterday’s post discussed how we get to a short list with minimal involvement of buyers and sellers, with the promise that we’d discuss the detailed analysis stage today.
It’s where we do use an RFP, but with thought given to the many pain-points cited so brilliantly by Mark Newman and team in the abovementioned TM Forum report.
The RFP provides the mechanism to firm up pricing and architecture, but is also closely tied to a PoC (Proof of Concept) demonstration. The RFP helps to prioritise the order in which PoCs are performed. PoCs tend to be very time consuming for buyer and seller. So if there’s a clear leader from the paper studies so far, then they will demonstrate first.
If there’s not a clear difference, or if the prime candidate’s demonstration identified significant gaps, then additional PoCs are run.
Next steps are to form the more detailed designs, commercials / contracts and ratify that the business case still holds up.
In yesterday’s post, I also promised to share our “starting-point” procurement methodology. I say starting point because each buyer situation is different and we tend to customise it to each buyer’s needs. It’s useful for starting discussions.
The overall methodology diagram is shown below:
A few key notes here:
The process looks much heavier than it really is… if you use traditional procurement processes as an indicator
We have existing templates for all the activities marked in yellow
The activity marked in blue partially represents the project we’re getting really excited to introduce to you tomorrow
Having to get into significant discussions with vendors (yet)
Gathering all your stakeholders together to prepare a detailed list of requirements
We’ll call this “the long list,” which might consist of 5-20 suppliers. We use this evaluation technique (which we’ll share more about on Monday) to ensure we’ve looked at the broad market of suppliers rather than just the few the buyer already knows.
The next step we follow helps us to get to a much smaller list, which we’ll call “the short list.”
For this, we do need to contact vendors (the long list) and we do need to prepare a list of requirements to add to the objectives and key workflows we’ve previously identified. The requirements won’t need to be detailed, but will still probably number into the 100s – some from our pick-list, others customised to each client’s needs.
Then we engage in what we refer to as an EOI (Expression of Interest) phase. Our EOIs are not just a generic market capability analysis like many buyers conduct. Ours seek indicative vendor compliance (to objectives and requirements) and indicative pricing based on the dimensions we supply. We’ve refined this model over the years to make it quite quick and (relatively) easy for vendors to respond to.
Using compliance to measure suitability and indicative pricing to plug in to our long-term TCO (Total Cost of Ownership) model, the long list usually becomes a clear short list of 1-5 very quickly.
Now we can get into detailed discussions with a very small number of best-fit suppliers without having wasted much time of buyer or seller.
You may have noticed that we’ve run a series of posts about OSS/BSS procurement, and about the RFP process by association.
One of the first steps in the traditional procurement process is preparing a strategy and detailed set of requirements.
As TM Forum’s, “Time to kill the RFP? Reinventing IT procurement for the 2020s,” report describes:
“Before an RFP can be issued, the CSP’s IT or network team must produce a document detailing the strategy for implementing a technology or delivering a service, which is a lengthy process because of the number of stakeholders involved and the need to describe requirements in a way that satisfies them all.”
The problem with most requirements documents, the ones I’ve seen at least, is that they tend to get down into a deep, deep level of detail. And when it’s down in that level of detail, contrasting opinions from different stakeholders can make it really difficult to reach agreement. Have you ever been in a room with many high-value (and high cost) stakeholders spending days debating the semantics (and wording) of requirements? Every stakeholder group needs a say and needs to be heard.
The theory is that you need a great level of detail to evaluate supplier offerings for best-fit. Well, maybe, but not in the initial stages.
First things first – I seek to find out what’s really important for the organisation. That rarely comes from a detailed requirements spreadsheet, but by determining the things that are done most often and/or add the most value to the buyer’s organisation. I use persona mapping, long-tail and perhaps whale-curve mapping approaches to determine this.
Persona mapping means identifying all the groups within the buyer’s organisation that need to interact with the OSS/BSS (current and proposed). Then sitting with each group to determine what they need to achieve, who they need to interact with and what their workflows look like. That also gives a chance for all groups to be heard.
From this, we can collaboratively determine some high-level evaluation criteria, maybe only 15-20 to start with. You’d be surprised at how quickly this 15-20 criteria can help with initial supplier filtering.
Armed with the initial 15-20 evaluation criteria and the project we’re getting excited to launch on Monday, we can get to a relevant list of possible suppliers quite quickly. It allows us to do a broad market search to compile a list of suppliers, not just from the 5-10 suppliers the buyer already knows about, but from the 400+ suppliers/products available on the market. And we don’t even have to ask the suppliers to fill out any lengthy requirement response spreadsheets / forms yet.
We’ll continue the discussion over the next two days. We’ll also share our procurement methodology pack on Sunday.
There’s no doubt the current stereotypical RFP approach to procurement is broken. It needs to be done differently. That’s why we have been doing it differently with customers for years now (another hint regarding a project we’re getting excited to announce this Monday).
The TM Forum report is really powerful and well worth a read. There are a few additional (and somewhat random) thoughts that go through my head when considering the death of the RFP:
The TM Forum report is primarily coming at the problem from the perspective of a carrier that is constantly steering the development of its own systems, as implied through this quote, “The fundamental problem with the RFP process is that in a fast-paced technology environment, where cloud and software are fast becoming preferred options, it is difficult for CSPs to describe in lengthy, written documents what they want and need. The processes are simply too complex and cumbersome to support modern, Agile methods of working.”
That perspective is particularly applicable for some buyers, ones that have committed to having significant developer resources available to build exactly what they want. That could be in the form of in-house developers, contract developers, long-term panel arrangements with suppliers or similar
Others, perhaps such as utilities, enterprise and some telcos want to focus on their core business and delegate OSS/BSS configuration and customisation to third-parties.
Some of those rely on COTS (commercial off the shelf) software to leverage the benefits of innovation, cost and development time that have been spread across multiple customers. Their budgets simply don’t allow for custom-built solutions
COTS, be it on-prem through to cloud service models, are almost never going to be a perfect fit for a buyer’s needs. They’re designed to generically suit many buyers, so a certain amount of bloat becomes part of the trade-off
In recent weeks, I’ve seen two entirely in-house developed OSS/BSS. They fit their organisations like a glove and there’s almost no bloat at all. In fact it would be almost impossible for a COTS solution to replace what they’ve built. In both cases it’s taken a decade of ongoing development to get to that position. Most buyers don’t have that amount of time to get it right though unfortunately
Commercial realities imply a pragmatic approach is taken to procurement – which product/s provide default capability that best aligns with the buyer’s most important objectives.
RFPs often get bogged down at the far right-hand side of the long-tail of requirements (where impact tends to be negligible), or in trying to completely re-sculpt the solution to be the perfect fit (that it’s unlikely to ever be)
In my experience at least, the best-fit (not perfect fit) solution, or very short list of solutions, usually becomes apparent fairly quickly [we’ll share more about how we do that tomorrow]. It’s then just a case of testing objectives, assumptions and gaps (eg via a proof-of-concept) and getting to a mutually beneficial commercial agreement
As one respondent in the TM Forum report put it, “The RFP glorifies the process, not the outcome.” A healthy dose of outcome-driven pragmatism helps to reduce glorification of the RFP process
With so much fragmentation in the OSS/BSS market already (there are over 400 in our vendor directory), that means the talent pool of creators is thinly spread. Many of those 400 have duplicated functionality, which isn’t great for the industry’s overall progress. Custom development for each different buyer spreads the talent pool even further… unless buyers can get economies of development scale through shared platforms like ONAP
In summary, I love the concept of avoiding massive procurement events. I still can’t help but think the RFP still fits in there somewhere for many buyers… as long as we ensure we glorify the outcomes and de-emphasise the process. It’s just that we use RFPs like a primitive instrument and inflict blunt-force trauma, rather than using surgical precision.
Earlier this year, the TM Forum published a really insightful report called, “Time to kill the RFP? Reinventing IT procurement for the 2020s.” There are so many layers to the OSS/BSS procurement discussion and Mark Newman and team have done a fantastic job of capturing them. We’ll expand on a few of those layers in a series of posts this week.
For example, section 2 articulates the typical RFI / RFP / RFQ approach. It’s clear to see why the typical approach is flawed. Yesterday’s post pondered whether procurement events are flawed from the initial KPIs that are set by buyers. Today we’ll take a look at the process that follows.
Two quotes from the TM Forum report frame some of the challenges with RFPs from buyer and seller viewpoints respectively: QUOTE 1 (Buyer-side) – “CSPs normally distribute RFPs to a group of three to eight suppliers. These are most likely existing suppliers, previous vendors or companies the CSP is aware of through its own technology scouting. Suppliers are likely to include systems integrators who rely on other vendors to fulfill elements of the contract, and CSPs tend to invite bidders offering a range of options.
For example, they may invite a supplier that is likely to offer a good price, one that is a ‘safe’, low-risk option, and the incumbent supplier, which in many cases the CSP is looking to replace.
The document itself is likely to be several hundred pages long, a large portion of it comprising details of technology requirements, with suppliers asked to specify whether they comply with each requirement.”
The question I’d ask about this process is how does the CSP choose 3-8 out of the 400+ vendors that supply the OSS/BSS market? Does their “own technology scouting” adequately discount the hundreds of others that could potentially be best-fit for their needs?
QUOTE 2 (Seller-side) – “We were holed up in our hotel for a month working feverishly on different aspects of the bid. We had 15 people there in total, and we were asked to come in for meetings with five different teams. The meetings go on and on, and you really have no idea when they’re going to finish.”
Let’s do the sums on this situation. 15 people x 25 days x $1500 per day (a round figure that includes accommodation, meals, etc) = $562,500. That’s over half a million dollars just for the seller-side of the post-RFP evaluation phase. Now let’s say there were 4 sellers going through this. [Just a small aside here – reading between the lines, do you suspect the buyer was taking the seller on a journey into the minutiae or focusing on what will move the needle for them? Re-read that through the lens of yesterday’s contrasting KPI perspectives]
You can see exactly why Mark has proposed that it’s, “Time to kill the RFP,” at least in its traditional form. These two quotes lobby hard for the death penalty. More on that tomorrow!
Also note that another hint was contained above in the lead-up to a project launch on Monday that we’re really excited about.
You may’ve noticed that things have been a little quiet on this blog in recent weeks. We’ve been working on a big new project that we’ll be launching here on PAOSS on Monday. We can’t reveal what this project is just yet, but we can let you in on a little hint. It aims to help overcome one of the biggest problem areas faced by those in the comms network space.
Further clues will be revealed in this week’s series of posts.
The industry we work in is worth tens of billions of dollars annually. We rely on that investment to fund the OSS/BSS projects (and ops/maintenance tasks) that keeps many thousands of us busy. Obviously those funds get distributed by project sponsors in the buyers’ organisations. For many of the big projects, sponsors are obliged to involve the organisation’s procurement team.
That’s a fairly obvious path. But I often wonder whether the next step on that path is full of contradictions and flaws.
Do you agree with me that the 3 KPIs sponsors expect from their procurement teams are:
Negotiate the lowest price
Eliminate as many risks as possible
Create a contract to manage the project by
If procurement achieves these 3 things, sponsors will generally be delighted. High-fives for the buyers that screw the vendor prices right down. Seems pretty obvious right? So where’s the contradiction? Well, let’s look at these same 3 KPIs from a different perspective – a more seller-centric perspective:
I want to win the project, so I’ll set a really low price, perhaps even loss-leader. However, our company can’t survive if our projects lose money, so I’ll be actively generating variations throughout the project
Every project of this complexity has inherent risks, so if my buyer is “eliminating” risks, they’re actually just pushing risks onto me. So I’ll use any mechanisms I can to push risks back on my buyer to even the balance again
We all know that complex projects throw up unexpected situations that contracts can’t predict (except with catch-all statements that aim to push all risk onto sellers). We also both know that if we manage the project by contractual clauses and interpretations, then we’re already doomed to fail (or are already failing by the time we start to manage by contract clauses)
My 3 contrarian KPIs to request from procurement are:
Build relationships / trust – build a framework and environment that facilitates a mutually beneficial, long-lasting buyer/seller relationship (ie procurement gets judged on partnership length ahead of cost reduction)
Develop a team – build a framework and environment that allows the buyer-seller collective to overcome risks and issues (ie mutual risk mitigation rather than independent risk deflection)
Establish clear and shared objectives – ensure both parties are completely clear on how the project will make the buyer’s organisation successful. Then both constantly evolve to deliver benefits that outweigh costs (ie focus on the objectives rather than clauses – don’t sweat the small stuff (or purely technical stuff))
Yes, I know they’re idealistic and probably unrealistic. Just saying that the current KPI model tends to introduce flaws from the outset.
“From watching ESPN, I’d learned about the power of information bombardment. ESPN strafes its viewers with an almost hysterical amount of data and details. Scrolling boxes. Panels. Bars. Graphics. Multi-angle camera perspectives. When exposed to a surfeit of data, men tend to feel more masculine and in command. Do most men bother to decipher these boxes, panels, bars and graphics? No – but that’s not really the point.”
Martin Lindstrom, in his book, “Small Data.”
I’ve just finished reading Small Data, a fascinating book that espouses forensic analysis of the lives of users (ie small data) rather than using big data methods to identify market opportunities. I like the idea of applying both approaches to our OSS products. After all, we need to make them more intuitive, endearing and ultimately, effective.
The quote above struck a chord in particular. Our OSS GUIs (user interfaces) can tend towards the ESPN model can’t they? The following paraphrasing doesn’t seem completely at odds with most of the OSS that we interact with – “[the OSS] strafes its viewers with an almost hysterical amount of data and details.”
And if what Lindstrom says is an accurate psychological analysis, does it mean:
The OSS GUIs we’re designing help make their developers “feel more masculine and in command” or
Our OSS operators “feel more masculine and in command” or
Intriguingly, does the feeling of being more masculine and in command actually help or hinder their effectiveness?
I find it fascinating that:
Our OSS/BSS form a multi billion dollar industry
Our OSS/BSS are the beating heart of the telecoms industry, being wholly responsible for operationalising the network assets that so much capital is invested in
So little effort is invested in making the human to OSS interface far more effective than they are today
I keep hearing operators bemoan the complexities and challenges of wrangling their OSS, yet only hear “more functionality” being mentioned by vendors, never “better usability”
Maybe the last point comes from me being something of a rarity. Almost every one of the thousands of people I know in OSS either works for the vendor/supplier or the customer/operator. Conversely, I’ve represented both sides of the fence and often even sit in the middle acting as a conduit between buyers and sellers. Or am I just being a bit precious? Do you also spot the incongruence of point D on a regular basis?
Whether you’re buy-side or sell-side, would you love to make your OSS more effective? Let us know and we can discuss some of the optimisation techniques that might work for you.
We’re going to look into assurance models of the past versus the changing assurance demands that are appearing these days. The diagrams below are highly stylised for discussion purposes so they’re unlikely to reflect actual implementations, but we’ll get to that.
Under the old model, the heart of the OSS/BSS was the database (almost exclusively a relational database). It would gather data, via probes/MDDs/collectors, from the network under management (Note: I’ve shown the sources as devices, but they could equally be EMS/NMS). The mediation device drivers (MDDs) would take feeds from the network and homogenise them to be suitable for very precise loading into tables in the relational databases.
This data came in the form of alarms/events, performance counters, syslogs and not much else. It could come in all sorts of common (eg SNMP) or obscure forms / protocols. Some would come via near-real-time notifications, but a lot was polled at cycles such as 5 or 15 mins.
Then the OSS/BSS applications (eg Assurance, Inventory, etc) would consume data from the database and write other data back to the database. There were some automations, such as hard-coded suppression rules, etc.
The automations were rarely closed-loop (ie to actually resolve the assurance challenge). There were also software assistants such as trendlines and threshold alerts to help capacity planners.
There was little overlap into security assurance – occasionally there might have even been a notification of device configuration varying from a golden config or indirect indicators through performance graphs / thresholds.
But so many aspects of the old world have been changing within our networks and IT systems. The active network, the resilience mechanisms, the level of virtualisation, the release management methods, containerisation, microservices, etc. The list goes on and the demands have become more complex, but also far more dynamic.
Let’s start with the data sources this time, because this impacts our choice of data storage mechanism. We still receive data from the active network devices (and EMS/NMS), but we also now source data from other sources. They might be internal sources from IT, security, etc, but could also be external sources like social indicators. The 4 Vs of data between old and new models have fundamentally changed:
Volume – we’re seeing far more data
Variety – the sources are increasing and the structure of data is no longer as homogenised as it once was (in fact unstructured data is now commonplace)
Velocity – we’re receiving incoming data at any number of different velocities, often at far higher frequency than the 15 minute poll cycles of the past
Veracity (or trustworthiness) – our systems of old relied on highly dependable data due to its relational nature and could easily become a data death spiral if data quality deteriorated. Now we accept data with questionable integrity and need to work around it
Again the data storage mechanism is at the heart of the solution. In this case it’s a (probably) unstructured data lake rather than a relational database because of the 4 Vs above. The data that it stores must still be stored in a way that allows cross-referencing to happen with other data sets (ie the role of the indexer), but not as homogenised as a relational database.
The 4 Vs also fundamentally change the way we have to make use of the data. It surpasses our ability to process in a manual or semi-manual way (where semi-manual implies the traditional rules-based automations like suppression, root-cause analysis, etc). We have no choice but to increase dependency on machine-driven tools as automations need to become:
More closed-loop in nature – that is, to not just consolidate and create a ticket, but also to automate the resolution and ticket closure
More abundant – doing even more of the mundane, recurring tasks like auto-sizing resources (particularly virtual environments), restarting services, scheduling services, log clean-up, etc
To be honest, we probably passed the manual/semi-manual tipping point many years ago. In the meantime we’ve done as best we could, eagerly waiting until the machine tools like ML (Machine Learning) and AI (Artificial Intelligence) could catch up and help out. This is still fairly nascent, but AIOps tools are becoming increasingly prevalent.
The exciting thing is that once we start harnessing the potential of these machine tools, our AIOps should allow us to ask far more than just network health questions like past models. They could allow us to ask marketing / cost / capacity / profitability / security questions like:
Where do I urgently need to increase capacity in the network (and can you automatically just make this happen – a more “just in time” provisioning of resources rather than planning months ahead)
Where could I re-position capacity around the network to reduce costs, improve performance, improve customer experience, meet unmet demand
Where should sales/marketing teams focus their efforts to best service unmet demand (could be based on current network or in sequence with network build-out that’s due to become ready-for-service)
Where are the areas of un-met demand compared with our current network footprint
With an available budget of $x, is it best spent on which ratio of maintenance, replacement, expansion and where
How do we better understand profitability vectors in the network compared to just the more crude revenue metrics (note that profitability vectors could include service density, amount of maintenance on the supporting infrastructure, customer interactions, churn, etc on a geographic or similar basis)
Where (and how) can we progressively automate a more intent or policy-driven auto-remediation of the network (if we don’t already have a consistent approach to config management)
What policies can we tweak to get better performance from the network on a more real-time basis (eg tweaking QoS policies based on current traffic in various parts of the network)
Can we look at past maintenance trends to automatically set routine maintenance schedules that are customised by device, region, device type, loads, etc rather than using a one-size-fits-all maintenance schedule
Can I determine, on a real-time basis, what services are using which resources to get a true service impact estimate in a dynamic, packet-switched network environment
What configurations (or misconfigurations) in the network pose security vulnerability threats
If a configuration change is identified, can it be automatically audited and reported on (and perhaps even quarantined) before then being authorised (manually or automatically?)
What anomalies are we seeing that could represent security events
Can we utilise end-to-end constructs such as network services, customer services, product lifecycle, device lifecycle, application performance (as well as the traditional network performance) to enhance context and correlation
And so many more that can’t be as easily accommodated by traditional assurance tools
Seems this post from last week has triggered some really interesting debate – Is your service assurance really service assurance?? (Part 5). It was a post that looked into collecting end-to-end service metrics rather than our traditional method of collecting network device events/metrics and trying to reverse-engineer to form a service-level perspective.
Thought I’d give you an update. I’m thinking along the following lines, but admit that I don’t have it all worked out by any means yet:
We need to concept of span like OpenTelemetry does between microservices (in a way, it’s like nearest-neighbour of where each packet is getting pushed).
Note that for us a span is on a service-by-service basis between nodes, not just a network link-by-link basis between nodes
We need to be able to measure the real-time metrics of the performance of each span as well as any events/faults impacting them
One challenge (one of probably many) is how to avoid flooding the data/management planes. Possibly a telemetry beacon at each node that’s aggregating performance/events of each packet passed for each service?? But what aggregation-window / cache-size to use? Still too impossibly huge to process except with ridiculously low sampling rates??
By chaining the spans we get a real-time, end-to-end trace of services and the performance (and real-time snapshot of service-by-service resource usage in a packet-switched network)
How to efficiently get the beacon data to a centralised logging/management point? Send beacons via management plane? Send via data plane? Take an approach similar to Netflow / IPFIX-style protocols?
How to store data for a short period (ie for real-time analysis/reporting) as well as for long periods. Due to volumes, we’d have to apply aging policies to the data, but it would still be valuable for the purpose of mid and long-term SLA, network health, optimisation, capacity management, etc
As you can see, there are still so many wide-open questions about the feasibility of the concept. But getting feedback from multiple very clever people who read this blog is definitely helping! Thank you!!
“There’s the famous quote that if you want to understand how animals live, you don’t go to the zoo, you go to the jungle. The Future Lab has really pioneered that within Lego, and it hasn’t been a theoretical exercise. It’s been a real design-thinking approach to innovation, which we’ve learned an awful lot from.”
Jorgen Vig Knudstorp.
This quote prompted me to ask the question – how many times during OSS implementations had I sought to understand user behaviour at the zoo versus the jungle?
By that, how many times had I simply spoken with the user’s representative on the project team rather than directly with end users? What about the less obvious personas as discussed in this earlier post about user personas? Had I visited the jungles where internal stakeholders like project sponsors, executives, data consumers, etc. or external stakeholders such as end-customers, regulatory bodies, etc go about their daily lives?
I can truthfully, but regretfully, say I’ve spent far more time at OSS zoos than in jungles. This is something I need to redress.
But, at least I can claim to have spent most time in customer-facing roles.
Too many of the product development teams I’ve worked closely with don’t even visit OSS zoos let alone jungles in any given year. They never get close to observing real customers in their native environments.
I also just stumbled upon OpenTelemetry, an open source project designed to capture traces / metrics / logs from apps / microservices. It intrigued me because just as you have the concept of traces / metrics / logs for apps, you similarly have traces / metrics / logs for networks.
In the network world, we’re good at getting metrics / logs / events, but not very good at getting trace data (ie end-to-end service chains) as described earlier in this blog series. And if we can’t monitor traces, we can’t easily interpret a customer’s experience whilst they’re using their network service. We currently do “service assurance” by reverse-engineering logs / events, which seems a bit backward to me.
Take a closer look at the OpenTelemetry link above, which provides an overview of how their team is going to gather application telemetry. With increasing software-ification of our networks (eg SDN / NFV) and the use of microservices / NaaS / APIs in our management stacks, could this actually be our path to the holy grail of service assurance (ie capturing trace data – network service telemetry)?? Is it data plane? Is it control / management plane? Is it something in between?
Note: The “active measurements” approach described in part 3 is slightly compromised in current form, which is why I’m so intrigued by the potential of extending the concepts of OpenTelemetry into our software / virtual networks.
I’d really love your take on this one because I’m sure there are many elements to this that I haven’t thought through yet. Please leave your thoughts on the viability of the approach.
Interestingly, I also just stumbled upon OpenTelemetry, an open source project designed to capture traces / metrics / logs from apps / microservices. It intrigued me because it introduces the concept of telemetry on spans (not just application nodes). Tomorrow’s article will explore how the concept of spans / traces / metrics / logs for apps might provide insight into the challenge we face getting true end-to-end metrics from our networks (as opposed to the easy to come by nodal metrics).
In the network world, we’re good at getting nodal metrics / logs / events, but not very good at getting trace data (ie end-to-end service chains, or an aggregation of spans in OpenTelemetry nomenclature). And if we can’t monitor traces, we can’t easily interpret a customer’s experience whilst they’re using their network service. We currently do “service assurance” by reverse-engineering nodal logs / events, which seems a bit backward to me.
Table 4 (from the Netrounds white paper link above) provides a view of the most common AI/ML techniques used. Classification and Clustering are useful techniques for alarm / event “optimisation,” (filter, group, correlate and prioritize alarms). That is, to effectively minimise the number of alarms / events a NOC operator needs to look at. In effect, traditional data collection allows AI / ML to remove the noise, but still leaves the problem to be solved manually (ie network assurance, not service assurance)
They’re helping optimise network / resource problems, but not solving the more important service-related problems, as articulated in Table 5 below (again from Netrounds).
If we can directly collect trace data (ie the “active measurements” described in yesterday’s post), we have the data to answer specific questions (which better aligns with our narrow AI technologies of today). To paraphrase questions in the Netrounds white paper, we can ask:
Has the digital service been properly activated.
What service level is currently being experienced by customers (and are SLAs being met)
Is there an outage or degradation of end to end service chains (established over multi-domain, hybrid and multi-layered networks)
Does feedback need to be applied (eg via an orchestration solution) to heal the network
Today we’ll discuss the approach/es to overcome the constraints described in yesterday’s post.
As shown via the inserted blue row in Table 6 below (source: Netrounds), a proposed solution is to use active measurements that reflect the end-to-end user experience.
The blue row only talks about the real-time monitoring of “synthetic user traffic,” in the table below. However, there are at least two other active measurement techniques that I can think of:
We can monitor real user traffic if we use port-mirroring techniques
We can also apply techniques such as TR-069 to collect real-time customer service meta data
Note: There are strengths and weaknesses of each of the three approaches, but we won’t dive into that here. Maybe another time.
You may recall in yesterday’s post that we couldn’t readily ask service-related questions of our traditional systems or data. Excitingly though, active measurement solutions do allow us to ask more customer-centric questions, like those shown in the orange box below. We can start to collect metrics that do relate directly to what the customer is paying for (eg real data throughput rates on a storage backup service). We can start to monitor real SLA metrics, not just proxy / vanity metrics (like device up-time).
Interestingly, I’ve only had the opportunity to use one vendor’s active measurement solutions so far (one synthetic transaction tool and one port-mirror tool). [The vendor is not Netrounds’ I should add. I haven’t seen Netround’s solution yet, just their insightful white paper]. Figure 3 actually does a great job of articulating why the other vendor’s UI (user interface) and APIs are currently lacking.
Whilst they do collect active metrics, the UI doesn’t allow the user to easily ask important service health questions of the data like in the orange box. Instead, the user has to dig around in all the metrics and make their own inferences. Similarly the APIs don’t allow for the identification of events (eg threshold crossing) or automatic push of notifications to external systems.
This leaves a gap in our ability to apply self-healing (automated resolution) and resolution prior to failure (prediction) algorithms like discussed in yesterday’s post. Excitingly, it can collect service-centric data. It just can’t close the loop with it yet!