Telco is a Circus with Thousands of Balls in the Air

Are you frustrated by the complexity of your OSS/BSS factory? With hundreds or even thousands of systems all trying to juggle thousands of activities without dropping the ball, it can feel like organised chaos.

But what if we’re looking at it all wrong, both in the way we’ve always designed our legacy application architectures and in how we’re now planning to use AI within them?

So many jugglers juggling

Let’s start with a metaphor.

I’m sure you’re already familiar with the TM Forum TAM (a slightly out-of-date version is shown below), or its more modern counterpart, the ODA Component Library.

They show around 75 different application building blocks that make up an OSS/BSS stack. Some of our biggest telcos have over 1,000 applications in their suite.

At any point in time, each of these applications is handling a variety of activities. What if we thought of each application as a juggler, keeping multiple balls in the air at once?

Now, each juggler doesn’t just keep juggling the same balls. It needs to do its thing, then pass the balls on to the next step in the end-to-end workflow.

The problem we face is that the jugglers don’t always pass off the balls in a consistent sequence. It’s quite possible that, depending on the conditions, they could pass balls off to a number of different jugglers.

Our process maps ensure that there’s some semblance of coordination of who passes to whom, but amongst the organised chaos, balls are regularly dropped.

Each juggler is really only focussing on the balls they have in the air. They’re not always aware of the new balls that are being tossed over to them, nor are they paying much attention to whether the adjacent juggler catches any of the balls they’ve handed off.

Clearly, the full TAM is too complex to visualise as a circus of jugglers juggling, but if we use the Simplified TAM map, then we get a picture that looks like this:

Walk the floor of any large telco and you will see the big top in full swing.

The different coloured balls equate to orders, activations, config pushes, alarms, tickets, changes, reconciliations, assurance checks, billing adjustments, usage mediations, customer updates. You get the idea!

Each of these is a ball in flight. Each system, team, or vendor platform is a juggler.

What keeps the lights on, with a steady stream of customers coming to watch this circus, is the skillful rhythm of throws and catches between jugglers:

CRM to order management
Order management to orchestration
Orchestration to domain controllers
Controllers to network elements
Network events back to assurance
Then into incident and problem management,
and so it goes

The problem is that we’ve designed our standardised architectural models around the jugglers – the applications. This means that every single telco around the world has a different architectural model. No two jugglers are ever the same. Even if one juggler is cloned, it still throws a different pattern at each different circus (telco).

What if, instead, we architected solutions around juggle chains that are more or less the same at process levels L2 or L3, to reduce the number of variants?

Let me give you an example. I was recently helping a client to develop a process and data model to support QR codes / barcodes in their warehouse. It’s really just an asset management lifecycle task, but the end-to-end chain follows a process something like:

Perform a network / capacity design [Network Inventory]
Prepare the Bill of Materials (BoM) [Procurement System]
Place order with supplier [ERP]
Supplier ships equipment to warehouse [Supplier’s system]
Warehouse accepts consignment [Warehouse / Spares]
Scans barcodes to log equipment in system [Card Reader -> Network Inventory]
Equipment is included in a site build request [Design Pack, Work Order, Workforce Management System]
And so on throughout the asset’s full lifecycle from commissioning to maintenance to decommissioning / scrapping
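One way to picture designing around a juggle chain, rather than around each juggler, is to write the chain down as an explicit template and check each ball's actual path against it. The Python sketch below does that for the first steps of the warehouse chain above. The step names, the `CHAIN` list and the `conforms` function are hypothetical illustrations, not part of any TM Forum model or product, and the sketch assumes a strictly linear chain with no legitimate branches.

```python
# Hypothetical template for the warehouse asset juggle chain described above.
# Each entry is (step name, owning system); names are illustrative only.
CHAIN = [
    ("design", "Network Inventory"),
    ("prepare_bom", "Procurement System"),
    ("place_order", "ERP"),
    ("ship", "Supplier's system"),
    ("accept_consignment", "Warehouse / Spares"),
    ("scan_barcodes", "Network Inventory"),
    ("site_build_request", "Workforce Management System"),
]

# Position of each step in the chain, for quick lookups.
STEP_ORDER = {step: i for i, (step, _owner) in enumerate(CHAIN)}


def conforms(observed_steps):
    """Return (ok, reason) for one ball's observed list of step names.

    The path conforms only if every step is known to the template and the
    ball moves forward one step at a time (repeats and skips are reported).
    """
    last = -1
    for step in observed_steps:
        if step not in STEP_ORDER:
            return False, f"unknown step: {step}"
        idx = STEP_ORDER[step]
        if idx <= last:
            return False, f"out-of-order step: {step}"
        if idx != last + 1:
            return False, f"skipped step: {CHAIN[last + 1][0]}"
        last = idx
    return True, "ok"


print(conforms(["design", "prepare_bom", "place_order"]))  # (True, 'ok')
print(conforms(["design", "place_order"]))  # (False, 'skipped step: prepare_bom')
```

A real chain will have legitimate branches, so a production template would more likely be a graph than a list. The point is that each variant becomes a deliberate design choice rather than an accident.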

We don’t design a system to manage a juggle chain, so we end up with infinite variants and different jugglers that throw in different ways.

Also, since the ball travels through a variety of different systems [in square brackets above], it doesn’t have a single system-wide identifier to track progress against. There’s no single tracking mechanism through these different systems, so it’s easy for a ball to be dropped, especially at a boundary between jugglers.

Most estates cannot answer a simple question in the moment: where exactly is this specific ball right now, who is supposed to catch it next, by when, and with which evidence of success?
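Answering that question becomes mechanical once every hand-off is recorded with a shared ID, a due-by time and evidence of the catch. Here's a minimal sketch assuming a hypothetical `HandOff` record (all field and function names are invented for illustration); it finds the latest throw for a ball and reports the current holder, the next catcher, the deadline and whether the ball is already late.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class HandOff:
    """One throw from one juggler (system) to the next. Hypothetical schema."""
    passport: str                         # shared ID that travels with the ball
    thrower: str
    catcher: str
    thrown_at: datetime
    due_by: datetime                      # promised catch window
    caught_at: Optional[datetime] = None  # evidence of a successful catch


def where_is(passport, handoffs, now):
    """Who holds the ball, who should catch it next, by when, and is it late?"""
    trail = sorted((h for h in handoffs if h.passport == passport),
                   key=lambda h: h.thrown_at)
    if not trail:
        return None  # no record at all: the ball never got a passport
    last = trail[-1]
    if last.caught_at is not None:
        # Latest throw was caught; the catcher holds the ball until its next throw.
        return {"holder": last.catcher, "next": None, "due_by": None, "late": False}
    # Latest throw is still in flight; the thrower was the last confirmed holder.
    return {"holder": last.thrower, "next": last.catcher,
            "due_by": last.due_by, "late": now > last.due_by}


t0 = datetime(2025, 1, 1, 9, 0)
log = [
    HandOff("P-1", "CRM", "Order Mgmt", t0,
            t0 + timedelta(minutes=5), t0 + timedelta(minutes=2)),
    HandOff("P-1", "Order Mgmt", "Orchestration", t0 + timedelta(minutes=3),
            t0 + timedelta(minutes=10)),
]
status = where_is("P-1", log, now=t0 + timedelta(minutes=15))
print(status["holder"], "->", status["next"], "late:", status["late"])
# Order Mgmt -> Orchestration late: True
```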

Let’s look at this 3-step process chain below. It should be quite simple to follow the bouncing ball through only 3 jugglers, right? Would you believe that it had 313 flow variants because of the issues highlighted? Three hundred and thirteen unique flow sequences!

In this process chain, each juggler claimed that it was handling a large volume of balls at high speed, so the problem couldn’t be with them. It must be somewhere else in the “factory.” Unfortunately, there was no repeatability or predictability at all amongst those 313 variants because nobody was tracking the effectiveness of balls getting from BEGIN to END.
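A toy calculation shows how every juggler can look busy while most balls never reach END. The sketch assumes a hypothetical event log of (case ID, activity) rows in time order, with BEGIN and END markers; the data is invented, not taken from the 313-variant example.

```python
from collections import Counter

# Hypothetical event log: (case id, activity), already in time order per case.
EVENTS = [
    ("c1", "BEGIN"), ("c1", "A"), ("c1", "B"), ("c1", "C"), ("c1", "END"),
    ("c2", "BEGIN"), ("c2", "A"), ("c2", "A"), ("c2", "B"),
    ("c3", "BEGIN"), ("c3", "A"), ("c3", "C"), ("c3", "B"), ("c3", "A"),
]


def juggler_volume(events):
    """Per-activity volume: roughly what each juggler reports about itself."""
    return Counter(act for _case, act in events if act not in ("BEGIN", "END"))


def begin_to_end_rate(events):
    """Share of cases that started (BEGIN) and also reached END."""
    started = {case for case, act in events if act == "BEGIN"}
    finished = {case for case, act in events if act == "END"}
    return len(started & finished) / len(started) if started else 0.0


print(juggler_volume(EVENTS))     # Counter({'A': 5, 'B': 3, 'C': 2}): everyone looks busy
print(begin_to_end_rate(EVENTS))  # ~0.33: only one ball in three reached END
```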

AI will Fix It!

Many people see AI as the panacea for solving all types of problems within the OSS world. We’re seeing all manner of AI wrappers around old OSS products.

But here’s another observation for you. We’re mostly using them in the wrong way.

LLMs are actually really weak jugglers. But they are strong ring masters.

By nature, LLMs are non-deterministic. That means their outputs aren’t reliably repeatable: the same request can be handled differently each time. They’re not great at throwing balls exactly the same way, every time, without deviation. Yet we’re mostly trying to use them as automation tools. As the jugglers.

Conversely, they are really good at coordinating – reading across disparate sources, summarising context, surfacing contradictions, finding unexpected pattern matches, and proposing next best actions.

Clearly, we should consider using them as ring masters, not jugglers.

We have the opportunity to put AI to work on the coordination tasks that our OSS/BSS estates do so badly today:

Stitching context across tools so the next juggler knows the next ball is coming
Tracking balls as they move across the ring and back
Managing the effective outcomes, not just of one ball, but of all the balls as an overall intent / objective measure
Detecting dropped balls and picking them back up
Explaining where each ball is so far in its journey and predicting the likelihood of success and likely time of completion
Routing around problems to preserve the original intent and overall circus performance

That way the model is an informed ring master – spotting, guiding, prioritising, escalating – not a rogue juggler inventing new tricks mid-act.

An Ontology. My Circus for an Ontology!

The mistake many programmes make is trying to embed intelligence into every juggler. You end up with a series of clever magic tricks and no reliable act.

Instead, we could consider adding an ontology above the existing stack that observes, correlates, and steers without displacing legacy systems that already know how to throw their particular balls.

What the ontology centralises is:

Shared context that explains intent, priority, and status across the whole act
Policies for routing, escalation, and ownership that apply consistently across boundaries
Playbooks for how to recover when a ball is late, missing, or dropped

What the ontology offers are reusable services:

Correlation for keeping each ball alive across all tools
Routing and escalation that place the ball with the right or best hands based on the state of all the juggling jugglers
Reconciliation that proposes safe ways to realign divergent states
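Reconciliation that proposes, rather than applies, changes can start very simply. This sketch assumes two hypothetical systems exposing their view of each ball's state keyed by a shared passport ID, plus a hand-written table of which state pairs count as consistent; anything else becomes a proposal for review. All names, states and mappings are illustrative assumptions.

```python
# Hypothetical views of each ball's state, keyed by a shared passport ID.
ORDER_MGMT = {"PP-1": "completed", "PP-2": "in_progress", "PP-3": "cancelled"}
INVENTORY = {"PP-1": "active", "PP-2": "active", "PP-4": "active"}

# Hypothetical table of (order state, inventory state) pairs that agree.
CONSISTENT = {
    ("completed", "active"),
    ("in_progress", "reserved"),
    ("cancelled", "released"),
}


def reconcile(a, b, a_name="Order Mgmt", b_name="Inventory"):
    """Propose (never apply) realignment for balls whose states disagree."""
    proposals = []
    for pp in sorted(set(a) | set(b)):
        sa, sb = a.get(pp), b.get(pp)
        if sa is None:
            proposals.append((pp, f"unknown to {a_name}: raise for review"))
        elif sb is None:
            proposals.append((pp, f"unknown to {b_name}: raise for review"))
        elif (sa, sb) not in CONSISTENT:
            proposals.append((pp, f"{a_name} says {sa}, {b_name} says {sb}: "
                                  "route to reconciler, hold other automations"))
    return proposals


for passport, proposal in reconcile(ORDER_MGMT, INVENTORY):
    print(passport, proposal)
```

Returning proposals instead of writing back to either system is the design choice that keeps two automations from fighting over the same record.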

Critically, the ontology is likely to be graph-native. Flows in telco are effectively just dependency graphs in disguise. Orders depend on service models that depend on resource topologies that depend on platform health, etc. Coordination is stronger when the ring master can see the graph and compute impact, not just react to isolated events.
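The "compute impact" idea can be illustrated with a small dependency graph. This minimal sketch assumes a hypothetical graph in which each key maps to the things that depend on it (platform to resource to service to order); a breadth-first walk from a failing node returns everything downstream that the ring master should worry about. Node names are invented for illustration.

```python
from collections import deque

# Hypothetical edges: key -> things that depend on it.
DEPENDENTS = {
    "platform:dc-east": ["resource:olt-17", "resource:router-4"],
    "resource:olt-17": ["service:fttp-123"],
    "resource:router-4": ["service:vpn-9", "service:fttp-123"],
    "service:fttp-123": ["order:ORD-1001"],
    "service:vpn-9": ["order:ORD-2002", "order:ORD-2003"],
}


def impacted(start):
    """Breadth-first walk returning every node downstream of `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for dep in DEPENDENTS.get(node, []):
            if dep not in seen:  # shared dependents are visited only once
                seen.add(dep)
                queue.append(dep)
    return seen


orders_at_risk = sorted(n for n in impacted("platform:dc-east") if n.startswith("order:"))
print(orders_at_risk)  # ['order:ORD-1001', 'order:ORD-2002', 'order:ORD-2003']
```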

Shared IDs to enable cross-stack tracing

You cannot coordinate what you cannot name or identify. Every ball needs a passport that travels with it. In practice, most telcos already have plenty of IDs, but those identifiers are lost at the hand-offs between systems.
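One way to picture a passport is a small registry that binds each system's local identifier to a single shared ID. The sketch below is a hypothetical, in-memory illustration (the class, its methods and the `PP-` prefix are invented); in a real estate this mapping would need to live in a durable, shared store.

```python
import uuid


class PassportRegistry:
    """Maps (system, local_id) pairs to one shared passport ID (hypothetical)."""

    def __init__(self):
        self._by_local = {}     # (system, local_id) -> passport
        self._by_passport = {}  # passport -> list of (system, local_id)

    def issue(self, system, local_id):
        """Issue a new passport at the entry point of a flow."""
        passport = f"PP-{uuid.uuid4().hex[:12]}"
        self._bind(passport, system, local_id)
        return passport

    def link(self, passport, system, local_id):
        """Record that a downstream system now knows the ball by its own ID."""
        if passport not in self._by_passport:
            raise KeyError(f"unknown passport {passport}")
        self._bind(passport, system, local_id)

    def _bind(self, passport, system, local_id):
        key = (system, local_id)
        existing = self._by_local.get(key)
        if existing is not None and existing != passport:
            raise ValueError(f"{key} already bound to {existing}")
        if existing is None:
            self._by_local[key] = passport
            self._by_passport.setdefault(passport, []).append(key)

    def resolve(self, system, local_id):
        """Find the passport for a system-local ID, or None if it was lost."""
        return self._by_local.get((system, local_id))

    def aliases(self, passport):
        """All (system, local_id) names this ball has been known by."""
        return list(self._by_passport.get(passport, []))


reg = PassportRegistry()
pp = reg.issue("CRM", "CASE-881")
reg.link(pp, "Order Mgmt", "SO-4471")
reg.link(pp, "Orchestration", "WF-0932")
print(reg.resolve("Orchestration", "WF-0932") == pp)  # True
print(reg.aliases(pp))
```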

We need to think of agents in terms of roles. Coordination is a team sport. Clear, specialised roles make it legible and governable:

Detector agents watch boundaries for promised hand-offs that are never caught, or don’t arrive by their due-by window
Prioritiser agents score impact based on customer tier, order value, criticality of the affected service, jeopardy indicators, etc., so the highest-priority balls move first
Router agents assign the next owner using awareness of next step, capacity, and proximity policies that adapt to real-time conditions
Reconciler agents propose actions when two systems disagree about the state of the same ball, avoiding duelling automations or manual flip-flops
Librarian agents curate runbooks and playbooks from every resolved case so the estate learns and optimises continuously
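To show how narrow each role can be, here's a sketch of the first two roles: a detector that flags promised hand-offs not caught by their due-by time, and a prioritiser that ranks those drops. The `Promise` schema, the tier weights and the scoring formula are hypothetical assumptions, not a recommended scoring model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical weights for the prioritiser; tune to your own policy.
TIER_WEIGHT = {"gold": 3.0, "silver": 2.0, "bronze": 1.0}


@dataclass
class Promise:
    """A hand-off one juggler promised to another (hypothetical schema)."""
    passport: str
    catcher: str
    due_by: datetime
    customer_tier: str
    order_value: float
    caught_at: Optional[datetime] = None


def detect_drops(promises, now):
    """Detector role: promises not yet caught and already past their due-by time."""
    return [p for p in promises if p.caught_at is None and now > p.due_by]


def priority(p, now):
    """Prioritiser role: simple score from tier, order value and hours overdue."""
    hours_late = (now - p.due_by).total_seconds() / 3600
    return TIER_WEIGHT.get(p.customer_tier, 1.0) * (1 + p.order_value / 1000) + hours_late


def triage(promises, now):
    """Dropped balls, highest priority first."""
    return sorted(detect_drops(promises, now), key=lambda p: priority(p, now), reverse=True)


now = datetime(2025, 1, 1, 12, 0)
queue = [
    Promise("P-1", "Orchestration", datetime(2025, 1, 1, 11, 0), "bronze", 50.0),
    Promise("P-2", "Field Ops", datetime(2025, 1, 1, 11, 30), "gold", 2000.0),
    Promise("P-3", "Billing", datetime(2025, 1, 1, 13, 0), "gold", 900.0),  # not due yet
]
print([p.passport for p in triage(queue, now)])  # ['P-2', 'P-1']
```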

Measure success at two levels: each pass and the whole circus

The temptation is to measure single steps (like the 3-step debacle above) and stop there, or to jump straight to end-to-end KPIs. Resist both. Certainly start by measuring the health of each pass, but then roll up to optimise for the performance of the whole circus.

Pass-level signals

Pass efficiency – the share of throws where the expected next move happens within its window, with visibility of who is late when it does not
Rescue speed – median time from the moment a ball is classified as stuck to the moment it is moving again, tracked by boundary and by product
Orphan rate – count and age of transactions with no active owner or no next step, a strong leading indicator of customer pain
SLA integrity – every breach tied to the API or contract that failed, not just to a system name
Jeopardy indicators – early warnings of pass-level breaches before they happen
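The first three pass-level signals are straightforward to compute once passes are recorded. This minimal sketch assumes a hypothetical `Pass` record with times in minutes; it computes pass efficiency, median rescue speed and orphans (count only; age is left out), and groups any metric by boundary. Grouping by product would work the same way with an extra field.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Pass:
    """One throw across a boundary (hypothetical schema, times in minutes)."""
    boundary: str                      # e.g. "CRM->OM"
    on_time: bool                      # expected next move happened within its window
    stuck_at: Optional[float] = None   # when it was classified as stuck
    moving_at: Optional[float] = None  # when it was moving again
    owner: Optional[str] = None        # active owner; None means orphaned


def pass_efficiency(passes):
    """Share of throws whose expected next move happened within its window."""
    return sum(p.on_time for p in passes) / len(passes) if passes else 0.0


def rescue_speed(passes):
    """Median minutes from 'classified as stuck' to 'moving again'."""
    durations = [p.moving_at - p.stuck_at for p in passes
                 if p.stuck_at is not None and p.moving_at is not None]
    return median(durations) if durations else None


def orphans(passes):
    """Passes with no active owner."""
    return [p for p in passes if p.owner is None]


def by_boundary(passes, metric):
    """Apply any of the metrics above separately to each boundary."""
    groups = {}
    for p in passes:
        groups.setdefault(p.boundary, []).append(p)
    return {b: metric(ps) for b, ps in groups.items()}


passes = [
    Pass("CRM->OM", True, owner="OM"),
    Pass("CRM->OM", False, stuck_at=10, moving_at=40, owner="OM"),
    Pass("OM->ORCH", False, stuck_at=5, moving_at=15, owner="ORCH"),
    Pass("OM->ORCH", False, stuck_at=20),  # stuck and nobody owns it
]
print(pass_efficiency(passes), rescue_speed(passes), len(orphans(passes)))  # 0.25 20.0 1
print(by_boundary(passes, pass_efficiency))  # {'CRM->OM': 0.5, 'OM->ORCH': 0.0}
```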

Map-level signals

Traceability – percentage of flows across all jugglers (vendors / tools) that arrive within SLAs / SLOs
Coordination tax – average hops and durations per flow type, hopefully trending down as the coordination processes continually improve
Noise-to-signal – reduction in alerts or notifications that do not correspond to a real throw, catch, drop, or expected next move
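The map-level signals roll the same kind of data up to whole flows. This sketch assumes a hypothetical `Flow` record per completed journey and computes the percentage of flows arriving within SLO (as Traceability is defined above) and the coordination tax as average hops and average duration per flow type. The figures in the example are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Flow:
    """One end-to-end ball journey (hypothetical schema)."""
    flow_type: str     # e.g. "new-connect", "fault"
    hops: int          # number of juggler-to-juggler hand-offs
    duration_h: float  # BEGIN-to-END elapsed hours
    slo_h: float       # target for this flow type


def within_slo_pct(flows):
    """Percentage of flows that reached END inside their SLO."""
    if not flows:
        return 0.0
    return 100.0 * sum(f.duration_h <= f.slo_h for f in flows) / len(flows)


def coordination_tax(flows):
    """Average hops and average duration (hours) per flow type."""
    acc = defaultdict(lambda: [0, 0.0, 0])  # total hops, total hours, count
    for f in flows:
        a = acc[f.flow_type]
        a[0] += f.hops
        a[1] += f.duration_h
        a[2] += 1
    return {t: (h / n, d / n) for t, (h, d, n) in acc.items()}


flows = [
    Flow("new-connect", 6, 40.0, 48.0),
    Flow("new-connect", 9, 70.0, 48.0),
    Flow("fault", 4, 3.0, 4.0),
    Flow("fault", 4, 5.0, 4.0),
]
print(within_slo_pct(flows))    # 50.0
print(coordination_tax(flows))  # {'new-connect': (7.5, 55.0), 'fault': (4.0, 4.0)}
```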

Success in this model is tangible on the ground. Those improvements are visible to customers and front-line teams, not only in executive dashboards. Because we’re tracking the health of the entire system, there’s less chance of gaming the system, as in the 3-step example above.

Putting it together – a practical first move

If the above feels like a multi-year journey, start small. Pick one high-volume, high-friction flow. For most operators that is either a bread-and-butter order journey or a repeated incident class.

Quantify the real flow variants by visualising them from log data, as in the example below:

The SMEs thought they had about 15-20 steps in the flow above. Log data revealed 116 activity types (process states) and 1,300+ flow variants!
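Counting variants from log data is the core of what process-mining tools report. This toy sketch assumes the logs have already been stitched into one list of (case ID, timestamp, activity) rows; it orders each case's events by time and counts the distinct activity sequences. The data is invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical stitched event log: (case_id, timestamp, activity).
LOG = [
    ("o1", 1, "Receive"), ("o1", 2, "Validate"), ("o1", 3, "Activate"),
    ("o2", 1, "Receive"), ("o2", 3, "Activate"), ("o2", 2, "Validate"),
    ("o3", 1, "Receive"), ("o3", 2, "Validate"), ("o3", 3, "Validate"), ("o3", 4, "Activate"),
    ("o4", 1, "Receive"), ("o4", 2, "Activate"),
]


def variants(log):
    """Group events by case, order each case by time, count distinct sequences."""
    cases = defaultdict(list)
    for case, ts, activity in log:
        cases[case].append((ts, activity))
    return Counter(tuple(a for _ts, a in sorted(events)) for events in cases.values())


v = variants(LOG)
activity_types = {a for _case, _ts, a in LOG}
print(f"{len(activity_types)} activity types, {len(v)} variants")
for seq, n in v.most_common():
    print(n, " -> ".join(seq))
```

Note that case o2 arrives out of order in the raw log but still counts as the happy path once sorted by timestamp, which is why stitching logs with consistent clocks matters.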

Sometimes this requires different system logs to be stitched together.

Once you understand what you’re working with (current state), you can then get a better feel for what needs fixing (if anything):

Issue passports at the entry point and make sure they survive every hand-off in the chosen flow
Stand up the detector and escalation roles first so drops surface quickly with evidence, then add prioritiser and router where it hurts most
Create a simple trace view that shows throws, catches, the promised next move, and the current timer for lateness
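The passport check and trace view from the steps above could start as something this simple. The `Hop` record and both functions are hypothetical; the sketch flags hand-offs where the passport was lost and prints one line per throw showing whether it was caught, caught late, still in flight, or overdue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Hop:
    """One hand-off in the chosen flow (hypothetical schema)."""
    passport: Optional[str]  # None means the passport was lost at this hand-off
    thrower: str
    catcher: str
    thrown_at: datetime
    due_by: datetime
    caught_at: Optional[datetime] = None


def lost_passports(hops):
    """Hand-offs where the passport did not survive."""
    return [h for h in hops if not h.passport]


def trace_view(hops, now):
    """One line per throw: who threw to whom, catch status and lateness timer."""
    lines = []
    for h in sorted(hops, key=lambda hop: hop.thrown_at):
        if h.caught_at is not None:
            status = "caught" + (" LATE" if h.caught_at > h.due_by else "")
        else:
            mins = int((h.due_by - now).total_seconds() // 60)
            status = f"in flight, {mins} min to due" if mins >= 0 else f"DROPPED? {-mins} min overdue"
        lines.append(f"{h.passport or '??'}  {h.thrower} -> {h.catcher}  {status}")
    return "\n".join(lines)


t = datetime(2025, 1, 1, 9, 0)
hops = [
    Hop("PP-1", "CRM", "Order Mgmt", t, t + timedelta(minutes=5), t + timedelta(minutes=2)),
    Hop("PP-1", "Order Mgmt", "Orchestration", t + timedelta(minutes=3),
        t + timedelta(minutes=10), t + timedelta(minutes=12)),
    Hop(None, "Orchestration", "Domain Controller", t + timedelta(minutes=13),
        t + timedelta(minutes=20)),
]
print(trace_view(hops, now=t + timedelta(minutes=30)))
print(len(lost_passports(hops)), "hand-off(s) lost the passport")
```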

You will hopefully see an immediate lift in metrics for the chosen flow. That early impact builds the political capital you need to extend the approach to other flows (juggle chains), and to embed the ring-master approach.

Speaking of juggle chains, we plan to create and share a set of standardised workflows that should be applicable to any telco, to help create an architecture template with fewer variants than the juggler-centric systems we build on today. It will be the next iteration of the 50+ process maps that you can freely download here, or by clicking on the image below:

November 11, 2025
Ryan

If you found this article useful or valuable, subscribe (in the top-right corner of this page) and share. Let's spread the word and inspire more people to become passionate about OSS. Ryan is Passionate About OSS and has dedicated the last two decades to sharing his passion for OSS with the world. He is a founder, author, blogger, Engineer, connector and inquisitive learner about OSS and managing networks. To find out a little about his back-story and why he's so Passionate About OSS, click on the About Page. To connect with Ryan and the PAOSS team, click on the Contact page.
