Sometimes it’s easy to get wrapped up in the more complicated things and forget about the basics. Something as simple as making sure the power to your equipment is fully resilient can be overlooked, but if it fails you can find yourself dealing with an outage you could have avoided.
Balancing the load across a redundant pair of PDUs (power distribution units) within a cabinet is one of those things that is easy to forget.
Typically a server rack/cabinet will have two independent power feeds; we’ll call these A and B. For your equipment to be properly resilient, in at least an N+1 arrangement, you need one PSU of each server/host plugged into the A side PDU and the other PSU plugged into the B side PDU. That way the loss of a PSU or PDU doesn’t affect the operation of the server/host or equipment.
Just as important is the power load on each feed. Let’s say you have two 16 Amp supplies, so 16 Amps to the A side PDU and 16 Amps to the B side PDU. It’s important to remember you don’t have 32 Amps you can use; it’s at most 16 Amps in total across both PDUs, and in reality you’ll not want to be drawing much over 14 Amps to be fully safe.
I’m using Amps because a PDU will typically show you the current (in Amps), and current is what the breakers respond to when they trip on an overload.
But back to the explanation, and using two examples:
10 Servers, 10 Amps
Let’s say you have 10 servers in the rack, each drawing around 1 Amp in normal operation.
So the overall load of the rack is 10 x 1 Amp, so 10 Amps. You have two 16 Amp PDUs and all the equipment is connected in a redundant fashion. Depending on how the PSUs in the servers work, they might be in an Active/Passive configuration, whereby each server pulls its full 1 Amp through just one PSU, i.e. from one PDU, while the other side carries no load. Or they may be in an Active/Active configuration, whereby both PSUs share the load at about 0.5 Amps each.
Now let’s say that the power feed to the A side PDU fails. In this case the B side feed must be able to provide the full 10 Amps on its own to keep the equipment working. In this example we’re fine: the 10 Amp draw is below the 16 Amp maximum, and in normal operation the servers are probably pulling about 5 Amps per PDU.
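To make the point about PSU modes concrete, here is a rough sketch in Python. The function names and the mode strings are made up for illustration, and the Active/Passive case assumes the worst split, with every active PSU on the A side. It shows that however the load is shared in normal operation, the surviving PDU ends up carrying the whole rack.

```python
def normal_load_per_pdu(servers, amps_per_server, mode):
    """Return (a_side_amps, b_side_amps) while both feeds are healthy."""
    total = servers * amps_per_server
    if mode == "active/active":
        # Both PSUs in each server share the draw roughly equally.
        return (total / 2, total / 2)
    if mode == "active/passive":
        # Assumption: worst case, every active PSU is on the A side.
        return (total, 0.0)
    raise ValueError(f"unknown PSU mode: {mode}")


def load_after_feed_loss(a_side_amps, b_side_amps):
    """The surviving PDU has to carry everything both sides carried."""
    return a_side_amps + b_side_amps


for mode in ("active/active", "active/passive"):
    a, b = normal_load_per_pdu(10, 1.0, mode)
    print(mode, "normal:", a, b, "after a feed loss:", load_after_feed_loss(a, b))
```

Either way the number to plan around is the combined 10 Amps, not the 5 Amps you might see on each PDU display.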
10 Big Servers, 20 Amps
To illustrate how things can go wrong, consider this. We have 10 servers in the rack, but this time each draws 2 Amps in normal operation.
So if the load is shared, the A side is pulling about 10 Amps and the B side is pulling about 10 Amps. All is good, right? Well, yes and no. Yes, everything is fine as long as both PDUs are supplied with power and the load is spread across both of them.
Now what happens when, say, the A side PDU fails? Suddenly the full 20 Amps needs to be supplied by the B side PDU alone, but that is limited to 16 Amps. So what happens?
Well, the 16 Amp breaker will trip and all the servers will lose power. Not a great situation to be in.
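The two examples boil down to one comparison, sketched below in Python. The helper name is hypothetical, and the check is deliberately simplistic: it treats the breaker as a hard 16 Amp limit and ignores real breaker trip curves and inrush current.

```python
BREAKER_AMPS = 16.0  # per-feed rating used in both examples


def survives_feed_loss(servers, amps_per_server, breaker_amps=BREAKER_AMPS):
    """True if one PDU alone can carry the whole rack's load."""
    total_amps = servers * amps_per_server
    return total_amps <= breaker_amps


print(survives_feed_loss(10, 1.0))  # True: 10 Amps fits on one 16 Amp feed
print(survives_feed_loss(10, 2.0))  # False: 20 Amps would trip the breaker
```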
So how can you get around this problem?
There are a few ways. One is to calculate the best and worst case power consumption of your equipment and ensure your power infrastructure can take it.
The other way (and the two aren’t mutually exclusive) is to monitor the load on the PDUs, with thresholds and alerts set so that you can take action if the load enters a range where you would be at risk should a PDU supply fail.
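As a rough idea of what such a check could look like, here is a minimal Python sketch. It is not the script linked below. The function name and the default thresholds (warn at 14 Amps, critical at 16 Amps combined) are assumptions taken from the figures used earlier, and reading the current from the PDUs, for example over SNMP, is left out. The return codes follow the usual Nagios plugin convention of 0 for OK, 1 for WARNING and 2 for CRITICAL.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin return codes


def classify_pair_load(a_amps, b_amps, warn_amps=14.0, crit_amps=16.0):
    """Judge the combined load, since one PDU may have to carry all of it."""
    combined = a_amps + b_amps
    if combined >= crit_amps:
        return CRITICAL, f"CRITICAL - combined load {combined:.1f}A would overload a single feed"
    if combined >= warn_amps:
        return WARNING, f"WARNING - combined load {combined:.1f}A is close to the single-feed limit"
    return OK, f"OK - combined load {combined:.1f}A"


# Hypothetical readings from the A side and B side PDUs.
code, message = classify_pair_load(7.5, 7.0)
print(code, message)
```

The key design choice is alerting on the sum of the two PDUs rather than on each one individually, because each PDU on its own can look comfortably under its limit while the pair is already past the point where a failover would work.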
An example script can be found here: https://exchange.nagios.org/directory/Plugins/Hardware/UPS/APC/APC-PDU-Pair-Load-Monitoring/details