Failure mode testing is the testing of a component or system to see how it behaves under failure conditions. This test entails disconnecting our production network from the outside world to verify that our DR network automatically takes over in that event (i.e., a scenario similar to the problem encountered on 1 Feb, so we can confirm that the same outage cannot recur).
As one of the final steps following our network outage on 1 Feb, and to ensure it never happens again, we will be undertaking failure mode testing on the night of the 7th-8th of March.
One key exacerbating factor in last month's outage was our inability to activate our DR (Disaster Recovery) network in a timely fashion. The DR network is effectively a "mirror" of the production network that we keep on warm standby at all times and that can take over from the production network in the event of a major outage. The ISP that the DR network was connected through was unable to activate it to receive traffic as a stand-in for the production network. We have since connected the DR network to a new provider that is able to handle this.
We anticipate two outages of between 90 seconds and 5 minutes each during the scheduled period, which has been chosen because it is well outside business hours in every market our clients operate in.
We expect to run this test every 3 months to ensure that any drift in the capabilities of our upstream providers, or any issues with hardware, software, or configuration, are caught well before they can cause an unplanned outage.
Mar 2, 15:24 AEDT