Business continuity, disaster recovery, and operations management are three phrases that come up often when talking about redundancy. The discussion usually centres on the more visible IT infrastructure, such as servers, switches, routers, racks, cabling, and redundant site locations. However, we can sometimes undervalue the importance of what backs this infrastructure: the physical data centre infrastructure, namely its electrical and mechanical systems. Regardless of whether you have or are considering an on-site, cloud, or colocation data centre, it is paramount to assess and understand the level of redundancy you need to meet your availability and functionality requirements.
Consideration 1 – Single Point of Failure
One of the most relatable considerations when talking about redundancy is single points of failure. While every effort should be made to minimise single points of failure, sometimes they cannot be eliminated. Where you can’t eliminate them, you need to know where they exist and ensure that the required maintenance and monitoring are in place to minimise the risk of failure. For example, all your systems and infrastructure could be fully redundant, but if your power supply passes through a single static transfer switch (STS), then you should validate that the STS can be maintained easily and is maintained regularly.
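One simple way to locate single points of failure is to list every supply path to a load and look for components that appear on all of them. The sketch below assumes a hypothetical topology; the component names (`utility_a`, `sts_1` and so on) and the function `single_points_of_failure` are illustrative only and do not describe any particular site.

```python
def single_points_of_failure(paths):
    """Return the components that appear on every supply path to a load.

    paths: list of paths, each a list of component names from source to load.
    """
    if not paths:
        return set()
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)  # keep only components shared by all paths so far
    return common


# Hypothetical example: two utility feeds and two UPSs, but one shared STS and PDU.
paths_to_rack = [
    ["utility_a", "ups_a", "sts_1", "pdu_1"],
    ["utility_b", "ups_b", "sts_1", "pdu_1"],
]

spofs = single_points_of_failure(paths_to_rack)
print(sorted(spofs))
```

Anything this returns is a candidate for either a design change or closer maintenance and monitoring.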
Consideration 2 – Maintenance
Following from the above, there should be considerations for maintenance of infrastructure and equipment. Inevitably, equipment does fail, regardless of measures put in place to minimise this. Ultimately, every piece of equipment needs to be replaced once it reaches end of life. To maximise the uptime of equipment, it needs to be suitably maintained. This starts at the design phase to ensure that these facilities exist, and continues at the operational level to ensure that the appropriate procedures are in place. For example, ensuring that an external wraparound bypass is in place allows for the entire UPS to be shut down for maintenance without removing power from the IT load. The facilities team should also be aware of the operational procedures to ensure seamless operation.
Consideration 3 – Mechanical and Electrical Dependencies
If you have a redundant mechanical system (chillers, CRACs, CRAHs, cooling towers etc) but fail to consider the power that is supplying it, it may cause a major disruption or outage at your data centre. If the power goes out, the mechanical system will go down. Interdependencies between systems should be considered at all stages.
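The interdependency described above can be checked with a small what-if test: for each power feed, see whether enough cooling units would keep running if that feed alone were lost. This is a minimal sketch under assumed inputs; the unit names, switchboard names and the `units_required` figure are hypothetical.

```python
def cooling_survives_feed_loss(unit_feeds, units_required):
    """For each power feed, report whether enough cooling units stay powered
    when that feed alone is lost.

    unit_feeds: dict mapping cooling unit name -> name of the feed supplying it.
    units_required: number of units needed to carry the heat load.
    """
    report = {}
    for lost_feed in sorted(set(unit_feeds.values())):
        still_running = [u for u, feed in unit_feeds.items() if feed != lost_feed]
        report[lost_feed] = len(still_running) >= units_required
    return report


# Hypothetical example: three CRAC units (N+1 for a two-unit load),
# but two of them share the same switchboard.
unit_feeds = {"crac_1": "board_a", "crac_2": "board_a", "crac_3": "board_b"}
report = cooling_survives_feed_loss(unit_feeds, units_required=2)
print(report)
```

In this example the mechanical system is redundant on paper, yet the loss of `board_a` leaves only one unit running, so the electrical distribution undermines the mechanical redundancy.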
Consideration 4 – IT Load and Cooling During an Electrical Failure
In the event of an electrical failure, generators tend to take between 1 and 2 minutes to start. This is where UPSs come in: they provide power to the IT load (servers, routers etc) for that gap of up to 2 minutes. But what happens if the generator doesn’t start? The UPS will continue to support the IT load for its period of battery autonomy, but how long should that be? You may think the obvious answer is to increase the autonomy time on the UPS, but what if your cooling depends on the same generators that have failed to start? There is a limited amount of time that an IT load can continue to run without cooling before it enters thermal overload. If your cooling cannot handle the IT load in this scenario, then the increased battery autonomy may be wasted: your IT load will not only have failed but may also have suffered irreparable damage.
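The trade-off above can be put into rough numbers. The sketch below is deliberately simplified: it assumes a constant IT load, ignores inverter losses and battery discharge characteristics, and takes the time to thermal overload as an input that would come from your own thermal assessment. All figures and function names are hypothetical.

```python
def ups_autonomy_minutes(usable_battery_kwh, it_load_kw):
    """Approximate UPS autonomy, assuming a constant load and no conversion losses."""
    return usable_battery_kwh / it_load_kw * 60


def effective_ride_through_minutes(autonomy_min, thermal_limit_min):
    """Time the IT load can actually keep running with no generator and no cooling:
    whichever runs out first, battery energy or thermal headroom."""
    return min(autonomy_min, thermal_limit_min)


# Hypothetical figures: 50 kWh of usable battery, a 100 kW IT load, and an
# assumed 8 minutes before the room reaches thermal overload without cooling.
autonomy = ups_autonomy_minutes(usable_battery_kwh=50, it_load_kw=100)
ride_through = effective_ride_through_minutes(autonomy, thermal_limit_min=8)
print(autonomy, ride_through)
```

With these assumed figures the battery could carry the load for about 30 minutes, but the usable ride-through is capped at 8 minutes by the loss of cooling, so extra battery capacity alone would not help.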
Consideration 5 – Monitoring
If there is no load management system or facility in place to monitor IT loads (such as server and rack distribution, heat load and power consumption), you may not be aware of potential overload scenarios during a system failure. Although your data centre may operate without issues in a business-as-usual situation, it may not maintain the level of redundancy it was designed for during an unexpected event.
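A monitoring system can make this kind of hidden overload visible by testing the measured load against the capacity that would remain after a failure. The following is a minimal sketch, assuming a set of parallel units (for example UPS modules) and a worst case in which the largest unit fails; the capacities and loads are hypothetical.

```python
def survives_single_failure(unit_capacities_kw, measured_load_kw):
    """Return True if the measured load still fits after losing the largest unit."""
    if not unit_capacities_kw:
        return False
    remaining_kw = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return measured_load_kw <= remaining_kw


# Hypothetical example: three 100 kW modules designed as N+1 for a 200 kW load.
modules_kw = [100, 100, 100]

print(survives_single_failure(modules_kw, measured_load_kw=180))  # within design
print(survives_single_failure(modules_kw, measured_load_kw=230))  # load has crept up
```

At 230 kW the installation still runs normally on all three modules, but it has quietly lost its redundancy: the failure of any one module would overload the other two.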
The above only very briefly touches on data centre infrastructure and the complicated nature of its electrical and mechanical components. Every data centre and its redundancy and business requirements are different, and it is important to understand all aspects when designing a redundancy strategy.
About the Author
Emily Sclater is a Consultant in the Australian practice of CS Technology, where she specialises in application migration strategy, including the disaster recovery requirements of applications and how these align with an organisation’s business continuity.
This article was originally posted on LinkedIn on 22 October 2020.