Many people know that our corporate HQ was an interesting place, network-wise. Like so many great hideouts (for example: the Bat-Cave, the Millennium Falcon, DS9, Warehouse 13 or the TARDIS), our secret lair had a history and a will of its own and a sense of Endless Wonder. And interesting failure modes to match.
I joked that the building would get lonely and throw temper tantrums when I got too far away. There was every kind of AC failure you could ever imagine, in ways that you could never monitor for. Wires would arc and drip slag on the floor when you stared at them wrong. There was a UPS for the server rooms that dated back to the Reagan administration. There was a Japanese lady in the back with a metal shop that could make swords in the event of a zombie apocalypse.
In particular, our network stack was interesting.
It was a couple of Cisco GSRs (routers the size of dishwashers, but from a bygone era) which had fallen out of a lab at Cisco. I believe they ran on Magic and Willpower, because it sure wasn’t “supported code”. 40G of dark fiber via CWDM back to the data center. And at the center of it, a 1U ProCurve switch, to which every other switch connected.
This was all before we moved out, of course.
Now, those who know me well know that I had abdominal surgery in early 2016. This little adventure occurred a week after recovery from that, when I still had a wound-vac attached to me. (A device about the size of a lunchbox that uses negative pressure to close up deep wounds. They truly work miracles.)
I got a call that one of our customers in the back room couldn’t ping their kit. We drive over, and the switch is up, but not pinging. I reboot it, but still can’t reach it. From my phone, I try to traceroute, and realize I can’t reach the core switch, or any other switches.
In our main room, I find the core switch (which every other switch links up to) with every LED powered on. This is a failure mode known as “xmas tree mode”. I reboot it. It still xmas-trees.
It is the only switch we have onsite that can terminate the 10G link coming out of the router and hand it off to the 1G uplinks that the other switches have. The dead core switch is at the top of a tight run of racks, where the only way to move everything connected to it is to remove it and put another in its place. And copy the config onto it. Which in turn requires the network to be up. Or a serial cable.
Thus, to get the building back up, I also need to log in to the GSR and switch all our VLANs (50 or so of them) over to 1G links.
I am not at this time fluent in the nuances of IOS-XR.
I call our network admin, who works in the Midwest, and explain the issue. I explain that he will not be able to log in, but should email me the commands I will need to run. There are a lot of commands, because I basically need to replay the configuration for most of the router’s interfaces into a different block.
The router cannot reach its tacplus server (which is in the building), and thus refuses to let me log in. Which means that just to get into it, I need to reboot it and catch it at the right point. The router’s serial ports are on a console server that can only be reached by plugging into it with a keyboard and mouse. (I can’t SSH to it with the network down.) My only internet is an iPad with an LTE connection. It does not have a serial cable on it. I have no copy and paste.
And remember, I am carrying a wound-vac. If that thing detaches or shuts down for more than half an hour, I need to go to the hospital, because the foam that’s packed into me will no longer have suction, and the wound will start to reopen.
This is the part of the story we call foreshadowing.
So…I start loosening the rack screws, and slowly pulling the switch out, with another at the ready, in the hands of my assistant.
In order to get the switch fully clear of the rack and the other cables, I needed to tilt it. And that’s when about 8 ounces of water poured out. So now I was standing at the top of the rack with all our most precious network equipment, with my sleeve barely holding the water that I didn’t want hitting any of the fiddly bits.
Apparently, water had found a leak in the ceiling, and rolled a natural 20 on the worst possible place to go. It ran along conduits and ceiling tiles, and rode down an antenna cable for our NTP GPS antenna right into the rack, where it found a home.
“Of course,” I muttered, but the only solution really was to get everything moved over. I kept going.
Then, the alarm on my wound-vac went off. “Canister full. Replace immediately.”
Do I need to tell you, gentle reader, where that replacement canister was? Here’s a hint: I didn’t have one with me. I had a houseguest who could run one over, but in the meantime, the wound-vac would continue beeping and blaring every three minutes before it would shut itself down. Which, again, would be bad.
I had a junior admin with me. They wanted to know how to help. “Order pizza. It’s going to be a long night.”
I mean, like every great sci-fi movie (The Martian, Apollo 13, Sully, Hackers, Sneakers), things went dark for a while, and we worked the problem, and fixed things, to a point, and then we fixed things even more. This blog is not about detailing every single technical issue and how to solve it. It is, instead, here to tell a fanciful tale of “what else could go wrong?”.
The short version was: this was work that could only be done by one person, at one keyboard, and it took hours. Every sysadmin I know has had nights like these. Maybe not all of them while recovering from surgery. Sitting in a loud server room with a ceiling you just discovered a leak in.
That’s not quite the end of the story, though. At the end of it all, I decided I was in the mood for a cup of hot chocolate. And like most Silicon Valley office kitchens, ours had an espresso machine.
I fill a milk pitcher and turn on the steam, wound-vac happily doing its job again. About four minutes into the froth cycle, as the milk is just starting to form a good head of foam, I hear a bang like a gunshot, and the espresso machine blows a seal, showering me in hot foam.
I lost it at that point. I was yelling, I was pissed. I was covered in chocolate milk. (I wasn’t hurt, thankfully.) I just did a snoopy-cry and headed back to my office and flopped on the couch. “I’m ready to leave now.”
The day’s lesson turned out to be this one: