Optus - ‘Trouble at mill’

Optus disaster a lesson in transparency for communications networks

 

November 2023

Now, is it just me, or did I note an element of ‘look at the monkey’ in Optus’ response to the recent nationwide network outage, which affected in the order of 10 million users (subscribers)? An estimated 400,000 businesses stalled for up to 13 hours as the network slowly relearned how to communicate.

 

Businesses were unable to operate, transport networks were frozen, 000 emergency callers were isolated, my colleague’s 15-year-old daughter was unable to post a really cool dance sequence on TikTok; the list is endless. The cost to the economy has been estimated by government analysts at somewhere north of $1.5B, with some pegging it as high as $2B+ based on extrapolated GDP data, where one day of the Australian economy is valued at A$6.16B.
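To put those estimates in proportion, here is a rough back-of-envelope sketch using only the figures quoted above. The assumption that economic output is spread evenly across 24 hours is mine, and it is a crude one.

```python
# Figures quoted in the article.
DAILY_GDP_AUD = 6.16e9   # value of one day of the Australian economy
OUTAGE_HOURS = 13        # longest outage duration quoted


def share_of_daily_gdp(cost_aud: float) -> float:
    """Return a cost estimate as a fraction of one day's economic output."""
    return cost_aud / DAILY_GDP_AUD


def pro_rata_gdp(hours: float) -> float:
    """GDP produced in `hours`, ASSUMING output is spread evenly over 24 h."""
    return DAILY_GDP_AUD * hours / 24


if __name__ == "__main__":
    for cost in (1.5e9, 2.0e9):
        print(f"A${cost / 1e9:.1f}B = {share_of_daily_gdp(cost):.0%} of one day's GDP")
    print(f"GDP over {OUTAGE_HOURS} h: A${pro_rata_gdp(OUTAGE_HOURS) / 1e9:.2f}B")
```

On those numbers the $1.5B to $2B estimates work out to roughly a quarter to a third of one day of the economy.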

 

This outage cost those using the network in a raft of ways: credibility, money, safety, security and just plain old ‘pain in the arse’. But we seem to have lost track of the ‘non-human’ network inhabitants. These subterranean entities keep a watchful eye on some very valuable state and privately owned assets: the energy grid, the water distribution systems, the buses and trains we catch, and mining, oil and gas assets. They are the cogs that keep the engine chugging along, but you would rarely pay attention to them until something goes wrong. And what makes this such an important issue? It’s the fact that these assets are mostly in remote, or at least not easy to access, locations.

 

So, what is it that these spooky inhabitants are watching over and what are they doing lurking about on my Facebook vessel?

 

How does industry make use of the phone network?

 

Utilities and industry have for many years used the public cellular phone network as a convenient and reasonably cost-effective way of ‘keeping an eye’ on their disparate assets. So, we are not alone in the land of networking. These non-human inhabitants are sending constant and timely telemetry data back to some great big control rooms with loads of screens, sometimes with flashing red lights indicating that there’s a problem.

Across Australia on 8 November 2023 some of these operators would have seen a full screen of red flashing lights, literally an Ah F%^$^& moment.

 

The key here is visibility, and it seems to have been relegated to the minor league as a network priority.

 

The operators, both state and private, have a lot invested in their networks and they are seriously reliant on rock solid telecommunications to give them visibility of instantaneous changes. A good example is an autonomous iron ore train many hundreds of kilometres from its operations control room. By definition, this is a driverless train, right? This train is 2,400 m long and carries tens of thousands of tonnes of iron ore from the pit to the port at approximately 80 km/h along a ~300 km track, with another train following behind and one in front. This iron ore is worth a fortune (just ask the ABC’s Alan Kohler, he tells us every night where iron ore prices sit). These trains run continuously, not just 9-5, and if by some tragic press of an icon or control button the network stops, then the control room monitoring this very valuable asset loses visibility of it.
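A quick sketch of what ‘losing visibility’ means in distance terms, using only the approximate speed and track length quoted above. The function names are mine, purely for illustration.

```python
SPEED_KMH = 80   # approximate train speed quoted in the article
TRACK_KM = 300   # approximate track length quoted in the article


def metres_travelled(speed_kmh: float, seconds: float) -> float:
    """Distance in metres covered at a constant speed over `seconds`."""
    return speed_kmh * 1000 / 3600 * seconds


def transit_hours(track_km: float, speed_kmh: float) -> float:
    """Hours needed to run the full track at a constant speed."""
    return track_km / speed_kmh


if __name__ == "__main__":
    print(f"Blind for 60 s: {metres_travelled(SPEED_KMH, 60):.0f} m travelled")
    print(f"Pit to port: {transit_hours(TRACK_KM, SPEED_KMH):.2f} h")
```

At 80 km/h the train covers a bit over 1.3 km for every minute the control room cannot see it.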

 

 

The folks at Rail Express published a useful explainer on the FMG autonomous iron ore rail system in Australia’s remote North West. Now, there is no suggestion that FMG was impacted by this outage, but it does serve as an indicator of how large the financial stakes are should a nationwide telecommunications outage reach even the most remote of industrial operations. If FMG’s network did become invisible as a result of this type of issue, then there is a safety imperative for the trains to stop (and they do).

 

The rules for some industries are simple – if you lose visibility of your asset then it must be shut down.
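That rule is often implemented as some form of heartbeat watchdog. The sketch below is a minimal illustration only: the class name, the timeout value and the injectable clock are my assumptions, not a description of how any particular operator does it.

```python
import time


class VisibilityWatchdog:
    """Flags an asset for shutdown when telemetry has been silent too long.

    Hypothetical illustration: real safety systems are far more involved.
    """

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = clock()

    def heartbeat(self) -> None:
        """Call whenever a telemetry message arrives from the asset."""
        self._last_seen = self._clock()

    def must_stop(self) -> bool:
        """True once the silence has exceeded the allowed timeout."""
        return self._clock() - self._last_seen > self.timeout_s


if __name__ == "__main__":
    watchdog = VisibilityWatchdog(timeout_s=5.0)  # 5 s is an arbitrary example
    watchdog.heartbeat()
    print("Stop the asset?", watchdog.must_stop())
```

The point of the design is that silence itself is the trigger: nobody has to send a ‘stop’ command over a network that is no longer there.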

 

The science behind remote controlled assets

 

Clearly all of the remote controlled assets have alternative communications paths, or primary private communications networks, back to their respective control rooms, whether they be ore trains, oil pumps, reservoir valves or grid distribution assets. Don’t they?

 

Even so, the core connectivity, that is, the backhauling of data from the asset concentration point to the control room, is almost invariably a telecommunications carrier network. These telcos offer ‘five nines’ (99.999%) availability, meaning they have belt and braces network resilience. This connectivity is via a host of centralised core routers; in the case of Optus these core routers are manufactured by Cisco.
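For a sense of scale, here is what five nines allows in a year, set against a 13 hour outage. This is plain arithmetic; whether a five nines target formally applied to the affected services is an assumption on my part.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability: float) -> float:
    """Downtime per year permitted by an availability target (e.g. 0.99999)."""
    return (1 - availability) * MINUTES_PER_YEAR


def years_of_budget(outage_hours: float, availability: float) -> float:
    """How many years of downtime allowance a single outage uses up."""
    return outage_hours * 60 / allowed_downtime_minutes(availability)


if __name__ == "__main__":
    print(f"Five nines allows {allowed_downtime_minutes(0.99999):.2f} min/year")
    print(f"A 13 h outage uses {years_of_budget(13, 0.99999):.0f} years of budget")
```

Five nines works out to a little over five minutes of downtime a year, so a single 13 hour outage burns through well over a century of allowance.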

 

The science goes like this: my train in the Pilbara wants to communicate with a control room in, say, Perth. As we are all becoming more and more aware, every device we communicate with (iPhone, TV, fridge (if it’s smart), remote control children’s toy) has an IP address.

IP stands for Internet Protocol, and the address (like a postal address) is unique (in a manner of speaking) on the internet, the network that carries the world wide web - you know, www.

The IP address of the train is ‘mapped’ through the network, which could be private or public, and connected with the IP address of the Control Centre server (big computer).
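The ‘in a manner of speaking’ caveat is worth a small illustration. Some address ranges are reserved for private networks and get reused all over the world, so only public addresses are globally unique. The sketch below uses Python’s standard `ipaddress` module; the two addresses are arbitrary examples and have nothing to do with any real train or control room.

```python
import ipaddress


def describe(address: str) -> str:
    """Say whether an IP address is private (reusable) or globally routable."""
    ip = ipaddress.ip_address(address)
    if ip.is_private:
        return f"{address}: private, only unique inside its own network"
    return f"{address}: public, unique across the internet"


if __name__ == "__main__":
    print(describe("10.20.30.40"))  # example address from the 10.0.0.0/8 private range
    print(describe("8.8.8.8"))      # example of a well-known public address
```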

Now, on its merry way the ‘data’, information critical to the operation of the locomotive (position, load, speed), transits devices called routers. These are like check-in stations along the way, and they calculate the fastest path (route .. see where I’m going?) for data to go from, in our case, the train to the control room. The route is learned by the router and the data is then forwarded to the next nearest check-in station, and so it goes: a virtual path or route is defined for each bit of data going from the train to the control room.
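The ‘calculate the fastest path’ step can be sketched with Dijkstra’s shortest-path algorithm, which is the basis of link-state routing protocols such as OSPF. The topology, router names and latencies below are invented for illustration, and I am not suggesting this is the protocol or layout any carrier actually uses.

```python
import heapq


def fastest_path(links, source, target):
    """Dijkstra's algorithm over a dict of {node: {neighbour: latency_ms}}."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbour, latency in links.get(node, {}).items():
            new_cost = cost + latency
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))
    if target not in best:
        return None, float("inf")  # no route at all
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], best[target]


# Hypothetical topology: latencies in milliseconds.
LINKS = {
    "train": {"router_a": 5, "router_b": 9},
    "router_a": {"router_c": 20},
    "router_b": {"router_c": 4},
    "router_c": {"control_room": 3},
}

if __name__ == "__main__":
    print(fastest_path(LINKS, "train", "control_room"))
```

Note that the nearest first hop (router_a) is not on the fastest overall route, which is why each router needs a picture of the whole path and not just its next-door neighbour.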

 

Why virtual, you ask? Well, if they were physical there would be bits of wire hanging out of every conceivable subscriber to the network, including yours truly. So we call these paths virtual as there is no wire, well, in the middle at least. This is an oversimplification of what a virtual circuit is.

 

What really went wrong with Optus

 

Back to Optus. In the press, Optus’ parent company Singtel has released a statement confirming that the cause of this major outage was 90 routers basically becoming amnesic; in other words, they forgot all of the path (or virtual path) information. It’s a staggering number, and you’d be right to question how this can be. How can the entire protected IP core router network all simultaneously stop? The reason being given is that a software upgrade to the network went spectacularly wrong but, says Optus, we’ve learned our lesson and it won’t happen again (maybe).
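To see why an amnesic router is a dead router, here is a toy forwarding table. It is a simplified sketch of longest-prefix matching, not a model of the actual Optus routers or of what happened inside them; the prefixes and next-hop names are made up.

```python
import ipaddress


class ForwardingTable:
    """Toy routing table: maps destination prefixes to a next hop."""

    def __init__(self):
        self._routes = {}  # ip_network -> next hop name

    def learn(self, prefix: str, next_hop: str) -> None:
        self._routes[ipaddress.ip_network(prefix)] = next_hop

    def forget_everything(self) -> None:
        """The 'amnesia' case: every learned route is lost at once."""
        self._routes.clear()

    def next_hop(self, destination: str):
        """Return the next hop for the most specific matching prefix, or None."""
        address = ipaddress.ip_address(destination)
        matches = [net for net in self._routes if address in net]
        if not matches:
            return None  # no route: the packet cannot be forwarded
        return self._routes[max(matches, key=lambda net: net.prefixlen)]


if __name__ == "__main__":
    table = ForwardingTable()
    table.learn("10.0.0.0/8", "core-1")     # hypothetical routes
    table.learn("10.50.0.0/16", "core-2")
    print(table.next_hop("10.50.1.20"))
    table.forget_everything()
    print(table.next_hop("10.50.1.20"))
```

With the table empty, every lookup comes back with nothing, and until the routes are relearned the data has nowhere to go.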

 

Firstly, I find it very hard to believe that the ‘new’ software was not pretested. Having worked inside Optus for nine years with their Integration Engineering division, I know that all network upgrades are pretested. So why the diversion, the “Hey, look at the monkey”? Likely because it sounds technical and most folk would just say, wow, I just don’t understand this stuff.

 

Secondly, upgrading the network core routers (and let’s face it, when the entire country goes down it’s definitely a core failure): why commence the upgrade sequence at 4am, one hour or so before the east coast starts getting moving? Bad process? Nah, I don’t believe it. So what caused it? Human error? Well, that would explain the statement in a Senate Estimates Inquiry grilling that the shortcoming was identified and wouldn’t happen again (as the person being grilled is now no longer with Optus). By the way, fast forward to September 2025: it happened again, with devastating consequences. For now, back to the past…

 

I guess we’ll never know, unless you were a fly on the wall of the bedroom of Singtel boss Yuen Kuan Moon at 2am Singapore time as he was being briefed on what was actually going on. Suffice to say it was serious and potentially repeatable.

 

For us poor subscribers of the Optus mobile or fixed line network there’s not much we can do. Optus owns the network, and they can use whatever technology they see as the best fit for the purpose of upload/download, voice and video services. I’d like to know what Sydney Trains is going to do with their free 200GB of data. That would be per service, but still, what will they do with it?

You Do Have a Choice

Let’s face it: if you (heavy industry) are using the public cellular network for managing your critical assets, you are pretty much flying blind. The telcos own the network and make a load of money out of running those networks. They primarily cater for the average punter downloading X, Facebook, Insta and YouTube, and therein lies the secret code: download. The public networks are skewed toward downloading, not uploading.

Good question: how much data do these machines/assets generate? Well, if we take a big yellow mining truck (aka an autonomous truck), it generates log files of >10GB per outing, which need to be uploaded on the go, and the truck has between 10 and 12 outings per day. Yes, that’s a lot of data.
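Turning those truck figures into an uplink requirement is simple arithmetic. The sketch below treats 10GB per outing as a floor (the article says more than 10GB), uses decimal gigabytes, and assumes the upload is spread evenly across 24 hours, which is the kindest possible case for the network.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_gb(gb_per_outing: float, outings_per_day: int) -> float:
    """Total log data one truck generates in a day, in gigabytes."""
    return gb_per_outing * outings_per_day


def average_uplink_mbps(gb_per_day: float) -> float:
    """Sustained uplink rate (megabits/s) needed to upload a day's data in a day."""
    megabits = gb_per_day * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    return megabits / SECONDS_PER_DAY


if __name__ == "__main__":
    for outings in (10, 12):
        volume = daily_volume_gb(10, outings)
        print(f"{outings} outings: {volume:.0f} GB/day, "
              f"{average_uplink_mbps(volume):.1f} Mbit/s sustained uplink")
```

That is 100 to 120GB a day, or roughly 9 to 11 Mbit/s of continuous uplink, per truck, before you add a second truck.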

For at least the last 12 years big industry has been owning its own networks, well, sort of leasing/owning. By and large they have opted for private LTE (4G), either using infrastructure owned by a carrier (for example Telstra) or building their own infrastructure using spectrum leased to them for private use (all in very remote locations). Rio Tinto went down the road of building their very own pLTE network. Sunk cost? A lot; Nokia don’t sell their networks cheap. Capability to expand? Could do, but at a huge cost, so as mines spread farther or deeper the base stations can’t connect. Hmm, a problem, when you have already paid a load for the spectrum usage and for the equipment.

More commonly though, Telstra will build, very kindly, a network node in the vicinity of the mine and the miners can use it: just buy a SIM contract and you’re sweet. But Telstra own the network and can ‘maintain’ it as they like (see previous story), and will primarily treat it like a mobile phone network: heavy on downlink, very light on uplink, and so very restrictive for industrial ‘heavy on uplink’ usage. The answer is to have a system you own and operate, and whose up/down symmetry you can therefore change, at a cost which can be considered almost insignificant compared to the total cost of the above two options. Use a 5G standardised pLTE system which operates in the ISM band. Such a beast exists.
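A rough sketch of why the up/down split matters so much. Every number here is hypothetical: the 100 Mbit/s cell capacity and the two splits are invented for illustration, and the per-truck demand of about 11.1 Mbit/s is simply 12 outings of 10GB averaged over a day.

```python
import math

CELL_CAPACITY_MBPS = 100   # HYPOTHETICAL total capacity of one cell
TRUCK_UPLINK_MBPS = 11.1   # ~120 GB/day averaged over 24 h (12 outings x 10 GB)


def uplink_capacity_mbps(total_mbps: float, uplink_percent: int) -> float:
    """Capacity available for uplink given the share of airtime given to uplink."""
    return total_mbps * uplink_percent / 100


def trucks_supported(total_mbps: float, uplink_percent: int) -> int:
    """Whole number of trucks whose average uplink demand fits in the cell."""
    return math.floor(uplink_capacity_mbps(total_mbps, uplink_percent) / TRUCK_UPLINK_MBPS)


if __name__ == "__main__":
    # Hypothetical splits: a phone-style downlink-heavy cell vs an uplink-heavy one.
    for label, uplink_percent in (("downlink-heavy", 20), ("uplink-heavy", 70)):
        print(f"{label} ({uplink_percent}% uplink): "
              f"{trucks_supported(CELL_CAPACITY_MBPS, uplink_percent)} truck(s)")
```

Same cell, same spectrum, same hypothetical capacity: flip the symmetry and the number of trucks it can serve goes from one to six.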