The recent Optus outage – what happened, how it was fixed, and how to protect yourself in the future

29 Nov 2023, by Slade Baylis

Everyone across Australia is likely already aware of the issues that affected the Optus network earlier this month.  They were hard to miss, as Optus is Australia’s second-largest telecommunications company and the outage affected their entire landline, mobile, and internet networks.  In total, it’s estimated that around 10 million Optus customers and 400,000 businesses were affected - and that’s the low end of its actual impact, as these figures don’t include the people trying to communicate with them!

The total number of affected customers alone is astonishing, but it’s the other crucial services that were impacted that are of real concern.  The affected services ranged from public health and public transport systems through to government services.  Perhaps worst of all, hundreds of triple zero calls failed to connect during the outage [1] - that’s right, Optus landlines and mobiles were unable to connect to the 000 emergency line unless the connection occurred through either Telstra or Vodafone infrastructure.

So, with the outage now far enough in the rear-view for an accurate review, we thought we’d put together this summary of what happened and how it was fixed, as well as go into detail on what our customers can do to protect themselves from similar issues in the future.

What happened at Optus?

In the hours and days following the outage, information on what caused it was scarce, and mostly consisted of speculation pieced together from what little had been released.  One of the first theories on the day was a cyber-attack against Optus’ infrastructure – it wouldn’t have been the first large-scale cyber-attack against Optus in recent times, as the company had already been hit by a data breach back in 2022 [2].  However, on the day, Kelly Bayer Rosmarin (the now former CEO of Optus) announced that she did not believe the outage was the result of a cyber-attack, stating that this was highly unlikely due to their systems being very stable.

Since then, more information has been released, with the ultimate cause of the nationwide outage determined to be “changes to routing information from an international peering network” run by Optus’ Singaporean parent company, Singtel.  Those routing changes “propagated through multiple layers” of the Optus network, and “exceeded preset safety levels on key routers” which couldn’t handle them.  This resulted in “those routers disconnecting from the Optus IP Core network to protect themselves”, and thus caused the outage that millions around Australia experienced.  To make this more understandable, we’ll need to quickly touch on “peering” - what it is and how it works - and we promise not to get too far into the weeds!

The internet, at a basic level, can be thought of as a lot of different networks connected together.  When you access content from across the world, or even just across the street, the data needs to travel over multiple different networks before it finds its way to you.  Peering is a technique used to reduce the number of networks that data needs to travel across, by directly connecting different networks together so they can exchange traffic – this allows data to be sent more directly, rather than following the longer routes it would otherwise need to take.
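To make the idea a little more concrete, here is a minimal sketch in Python that counts how many network-to-network hops traffic needs with and without a direct peering link.  The network names (`isp_a`, `transit_1`, and so on) are made up purely for illustration, and real internet routing weighs far more than hop count, so treat this as a toy model rather than how any real provider works.

```python
from collections import deque


def hops(links, start, goal):
    """Return the fewest network-to-network hops from start to goal, or None."""
    # Build an undirected adjacency map from the list of links.
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Breadth-first search finds the shortest path in an unweighted graph.
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        network, distance = queue.popleft()
        if network == goal:
            return distance
        for nxt in neighbours.get(network, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None  # no path between the two networks


# Hypothetical networks: without peering, traffic crosses two transit providers.
transit_only = [
    ("isp_a", "transit_1"),
    ("transit_1", "transit_2"),
    ("transit_2", "isp_b"),
]
# Adding a direct peering link between the two ISPs.
with_peering = transit_only + [("isp_a", "isp_b")]

print(hops(transit_only, "isp_a", "isp_b"))  # 3
print(hops(with_peering, "isp_a", "isp_b"))  # 1
```

In this toy example the peering link cuts the journey from three hops to one, which is the whole appeal of peering.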

With this in mind, the “changes to routing information from an international peering network” mentioned in the explanation above simply refers to changes in how network traffic was instructed to be “routed” through the Optus network via those peering agreements.
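For those who like to see things in code, below is a toy Python model of the behaviour described in the Optus statement: a router that disconnects itself once it is sent more routes than a preset safety level allows.  The `ToyRouter` class, the limit of five routes, and the route names are all assumptions invented for this illustration.  It is not Optus’ actual configuration, and real routers handle this in a far more sophisticated way.

```python
class ToyRouter:
    """A toy router that leaves the network if it is sent too many routes."""

    def __init__(self, name, max_routes):
        self.name = name
        self.max_routes = max_routes  # hypothetical preset safety level
        self.routes = set()
        self.connected = True

    def receive_routes(self, new_routes):
        """Accept a routing update, or disconnect if it would exceed the limit."""
        if not self.connected:
            return  # a disconnected router ignores updates until it is reset
        candidate = self.routes | set(new_routes)
        if len(candidate) > self.max_routes:
            # Self-protection: drop all routes and leave the network.
            self.routes.clear()
            self.connected = False
        else:
            self.routes = candidate

    def reset(self):
        """Model an engineer clearing the routes and reconnecting the router."""
        self.routes.clear()
        self.connected = True


router = ToyRouter("core-1", max_routes=5)

router.receive_routes(["route-1", "route-2", "route-3"])
print(router.connected)  # True: three routes is within the limit

router.receive_routes([f"route-{n}" for n in range(4, 10)])
print(router.connected)  # False: nine routes exceeds the limit of five

router.reset()
print(router.connected)  # True: back online after a manual reset
```

Notice that once the toy router disconnects, it stays offline until someone resets it, so a safeguard meant to protect a single device can take a whole network down if many devices trip it at once.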

Why did it take so long to diagnose and how did they fix it?

The outage started at 4:05am, and it wasn’t until 4pm that Optus declared it had officially ended, with more than 99.72 percent of its network having been restored at that time.  One of the main reasons it took so long to both diagnose and resolve the issue was the sheer scope of the outage.

When a problem occurs, its symptoms usually provide a major insight into the potential causes.  As an example, let’s look at a hypothetical issue with sending an email - if you’re unable to send an email from your computer, but the same email can be sent from your mobile, then the fault likely lies with the computer’s own setup (a password issue, for example) rather than with the email service itself.  As you can see, the limited scope of the problem actually helps to identify what caused it.

Back to the Optus outage, one of the biggest reasons for the delay was that it affected everything – it wasn’t just their landlines, or their mobiles, or their internet services - it was all the services they provided.  This opens the door to a whole range of potential causes, with each one needing to be investigated and ruled out.  This is evident in the reporting by IT News [3] that in the “first six hours or so, the engineers pursued six different possible explanations for the large-scale outage”.  These explanations included whether their own overnight updates were the cause, whether it was a DDoS attack, a network authentication issue, or an issue with their upstream content delivery network (CDN) provider.

When it was discovered that the routing information was the culprit, resolving the issue required engineers to set to work on “resetting and clearing routing connectivity on network elements which had disconnected themselves from the network, physically rebooting and reconnecting some network elements to restore connectivity”.  With some of the affected equipment needing to be physically rebooted in person, it’s no wonder that the issue took as much time, if not longer, to resolve as it took to diagnose.

What can you do to protect yourself from this in the future?

Whilst Optus has announced that it has “since made changes to its network to address the issue so it does not occur again”, one of the key things the outage highlights is the need for redundancy in telecommunications.  So many businesses - especially ones that run critical services that people rely on - found themselves with no internet when the Optus outage occurred, simply because they hadn’t implemented a proper fallback.

Having fallback telecommunications usually isn’t hard to achieve, with the main hurdle being working out whether the cost of implementing it is justified.  However, even for smaller organisations with tighter budgets, simple redundancy can be achieved by using portable internet dongles, which provide 4G and 5G mobile internet connections as a backup to your main landline connection.  These 4G/5G dongles can be connected directly to your networking infrastructure to act as a rudimentary internet backup.
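As a rough illustration of the failover logic involved, here is a minimal Python sketch that checks a list of connections in priority order and picks the first healthy one.  The link names, the IP addresses (taken from ranges reserved for documentation), and the choice of `1.1.1.1` on port 53 as a test target are all assumptions made for this example.  Many business-grade routers can handle this kind of failover for you, so this is only a sketch of the idea and not a recommended setup.

```python
import socket


def link_is_up(source_ip, host="1.1.1.1", port=53, timeout=3.0):
    """Try a TCP connection bound to source_ip; return True if it succeeds.

    Note: binding to a source address does not by itself guarantee the
    traffic leaves via that interface; that depends on the routing rules
    of the machine this runs on.
    """
    try:
        with socket.create_connection(
            (host, port), timeout=timeout, source_address=(source_ip, 0)
        ):
            return True
    except OSError:
        return False


def choose_link(links, check=link_is_up):
    """Return the name of the first healthy link in priority order, or None."""
    for name, source_ip in links:
        if check(source_ip):
            return name
    return None


# Hypothetical links, listed from most to least preferred.
links = [
    ("primary-fibre", "192.0.2.10"),
    ("backup-4g-dongle", "198.51.100.10"),
]

# Simulate the primary connection being down, so no real network is needed.
def primary_is_down(ip):
    return ip != "192.0.2.10"

print(choose_link(links, check=primary_is_down))  # backup-4g-dongle
```

The health check is passed in as a parameter so the selection logic can be tried out without touching a real network, as in the simulated outage above.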

It’s due to issues like this that we recommend all organisations implement some form of backup for their internet connection, which we’re always available to help with.  Here at Micron21 we utilise our own dark fibre, use services from Telstra, Optus, and NBN, and peer with over 1900 networks globally - but not everyone needs to go to these lengths!

That said, the redundancy and resiliency of your telecommunications is just one aspect of your IT infrastructure – it’s also very important to consider the other areas of your business or business systems that could be improved by adding redundancy.  For those interested, in our article “How to make sure your business can survive adversity”, we go into more detail about developing Business Continuity and Disaster Recovery plans to make sure you’re protected.

Have any questions about the Optus outage or looking to improve your telecommunications?

If you have any other questions about the Optus outage that we haven’t covered here, let us know!  We’re more than happy to answer any questions you have. 

In addition, if you’re looking to have someone analyse how your systems are set up, as well as come up with a plan for improving your telecommunications, we can help with that too! 

Feel free to email us at sales@micron21.com or call us on 1300 769 972 (Option #1)

Sources

1, 9 News, “Hundreds of triple zero calls failed during Optus outage, CEO reveals while dodging questions over future”, <https://www.9news.com.au/national/optus-ceo-kelly-bayer-rosmarin-to-face-senate-over-network-crash/d8786be0-7b81-420e-8b2c-3c87bd991670>

2, Micron21, “Optus, Medibank, and now Harcourts – If they can be breached, what can you do to prevent it?”, <https://www.micron21.com/blog/optus-medibank-and-now-harcourts-if-they-can-be-breached-what-can-you-do-to-prevent-it>

3, IT News, “Optus outage blamed on edge router default settings”, <https://www.itnews.com.au/news/optus-outage-blamed-on-edge-router-default-settings-602442>
