TELCO FAIL

Last Thursday afternoon, at approximately 2:25pm, there was a loud sucking sound that can only be heard by network engineers conditioned to expect bad, ugly things to happen at inopportune times, and all upstream connectivity to our corporate office died.

*Ka-phoot*

Predictably, IT was immediately assisted by many, many helpful people wandering by our area, sending emails, making phone calls, or stopping us in the hall to ask if we knew that the network was down. Usually in these situations the first couple of people get a good explanation of what we think the problem is, and what an ETA might be. After the 10th person, however, my responses tend to devolve a bit and I either end up giving a curt one-word answer, or feigning shock and amazement.

I should explain here how our network is architected: we have our IP provider, SIP trunks, point-to-point circuits, VPN endpoints, and all of our external-facing servers in a very robust telecom hotel (The Westin Building, for those keeping score) in downtown Seattle. From there, we move everything over our DS3 to our corporate headquarters not far from Seattle. We also have many other dedicated circuits, IPsec tunnels, and assorted ballyhoo to other locations around the world, but for discussion here just keep in mind the three locations I’ve described.

So the DS3 that is our lifeline was down. It was after hours in our Canadian location, so with any luck nobody would notice all night (they use a lot of critical services across another DS3, but that also routes through Seattle first). Additionally, it was a particularly nice day in Seattle (rare) and a lot of people were already out of the office when this link went down. Hopefully we could file a trouble ticket and get this resolved fairly quickly.
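
For the topology-minded, here’s a toy reachability check over the three sites just described. The site names and link list are my own shorthand rather than a real inventory, and it assumes, as implied above, that the Canadian office reaches its critical services at headquarters by way of Seattle:

```python
from collections import deque

# Toy model of the three sites described above. Link names are shorthand,
# not anything from an actual circuit inventory.
links_up = {
    ("seattle", "hq"): False,      # the DS3 that just died
    ("canada", "seattle"): True,   # Canada's DS3 also lands in Seattle first
}

def neighbors(site: str):
    """Yield sites reachable from `site` over links that are currently up."""
    for (a, b), up in links_up.items():
        if not up:
            continue
        if a == site:
            yield b
        elif b == site:
            yield a

def reachable(src: str, dst: str) -> bool:
    """Breadth-first search over whatever links are currently up."""
    seen, queue = {src}, deque([src])
    while queue:
        for nxt in neighbors(queue.popleft()):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(reachable("canada", "hq"))  # False: Canada quietly loses HQ services too
```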

Within just a few minutes of filing said trouble ticket, I had a representative of the provisioning telecom on the line who said that, yes, they saw a problem and would be dispatching technicians. There were some other calls following that, but the short version is that by 5:30pm “everything was fixed” according to the telecom, and would we please verify so they could close the ticket. Unfortunately, the problem was not fixed.

Now the fun began. To appease the telecom representative, I accepted the possibility that my DS3 controller card had coincidentally died, or had locked the circuit, or some other bit of weird pseudo-engineer guessing. This meant I had to drive to our data center in Seattle, in rush-hour traffic, to personally kick the offending router in the teeth.

After an hour or so of typically nasty Seattle rush-hour traffic, I arrived at the datacenter and began testing. Our DS3 controller was showing AIP on the line, so more technicians were dispatched to find the offending problem. Meanwhile, I wandered over to the Icon Grill to get some dinner and an après-ski beverage or two.

Fast forward a few hours and the AIP condition on the DS3 controller was gone, but I now had an interface status of “up up (looped)”, which is less than ideal, shall we say. I decided at this point to cut my losses and head home and possibly get some sleep while the telecom engineers and their cohort tried to figure out how this might be my fault.
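
As an aside, “up up (looped)” is exactly the sort of condition worth alerting on rather than discovering by hand. Here’s a minimal sketch of how you might flag it, assuming you already have the text of a Cisco-style “show interfaces” dump in hand; how you collect that dump is up to you, and the interface name below is just a placeholder:

```python
import re

# Scan a Cisco-style "show interfaces" text dump and flag anything whose
# line protocol reports a loopback, e.g.
#   "Serial1/0 is up, line protocol is up (looped)"
LOOP_PATTERN = re.compile(
    r"^(?P<intf>\S+) is (?P<status>[\w\s]+), line protocol is (?P<proto>[\w\s()]+)$",
    re.MULTILINE,
)

def find_looped_interfaces(show_interfaces_output: str) -> list[str]:
    """Return the names of interfaces whose line protocol reports '(looped)'."""
    looped = []
    for match in LOOP_PATTERN.finditer(show_interfaces_output):
        if "looped" in match.group("proto"):
            looped.append(match.group("intf"))
    return looped

if __name__ == "__main__":
    sample = "Serial1/0 is up, line protocol is up (looped)\n"
    print(find_looped_interfaces(sample))  # ['Serial1/0']
```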

With some three hours of sleep, I woke up at 5am and started looking at all of my emails, listening to all of my voicemails, and generally cursing anyone within earshot (mostly the cats, as my wife was still asleep). At this point I got on a conference bridge with the President of the telecom broker we use, and together we managed to drag in a rep from the provisioning company who could then drag in as many engineers as needed to get the problem solved. Not, however, before I was rather pointedly told by said provisioning woman that I would have to pay for all of this, since the problem was “obviously with my equipment, since her software showed no loops in the circuit.”

Once the engineers started hooking up testers to the circuit (physically, this time) they could see a loop, but at the Seattle side (the side reporting the loop). Another engineer saw a loop on the headquarters side, and still a third saw no loop at all. As it turns out, the circuit was provisioned by company “A”, who then handed off to company “B”, and finally to company “C”, who terminated the circuit at the demarcation point at our headquarters. All for less than 20 miles, but I digress. Finally we all agreed to have Company “C” come onsite, interrupt the circuit physically at the demarcation equipment, and look back down the link to see what he could see. As a precaution at this point, and tired of being blamed for ridiculous things, my staff and I physically powered down our routers on either side of the link. Since the loop stayed, that was the last time I had anyone point the finger my way. Small miracles and all of that.

Once the rep from Company “C” got onsite and interrupted the circuit for tests, he was still seeing “all green” down the line. Since the other engineers monitoring were still seeing a loop, they asked him to shut down the circuit. He did, and they still saw a loop. This was one of those “Aha” moments for all of us, except the engineer from Company “C”, who just couldn’t figure out what the problem might be. All of us suspected that the loop was between the Fujitsu OC-3 box at our demarc and the upstream OC-48 Fujitsu mux a couple of miles away, and we finally convinced this guy to go check out the OC-48. Sure enough, a few minutes after he left, our circuit came back on again. And we all rejoiced, and ate Robin’s minstrels.

At the end of the day, we ended up with just short of 24 hours of downtime (23 hours and 5 minutes, to be exact) for a DS3 from a major telecom provider that everyone here would recognize. So what was the problem, and the solution? Any telecom guys want to stop reading here and take a guess?

As it turns out, the original cause of our link going down was this same engineer pulling the circuit by mistake. When the trouble ticket was originally filed, he rushed out and “fixed” his mistake. But what he hadn’t noticed the first time were two critical things:

(1) The circuit had failed over to the protect pair. DS3 circuits use one pair of fiber for the normally used (or working) circuit, and a separate fiber pair for the fail-over (or protect) circuit (see the sketch below).

(2) The protect pair at the OC-3 box at the demarcation point hadn’t ever been installed.
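
To make those two points concrete, here’s a toy sketch of the failover logic. It is not any vendor’s actual protection-switching implementation, just the reasoning: the working pair carries traffic, the protect pair is the fallback, and a fallback that was never installed means the “failover” lands on a pair that can’t carry anything:

```python
from dataclasses import dataclass

# Toy model of the working/protect pair arrangement described above.
# Illustrative only; real protection switching happens in the transport gear.

@dataclass
class FiberPair:
    name: str
    installed: bool = True
    healthy: bool = True

    def carries_traffic(self) -> bool:
        return self.installed and self.healthy

def circuit_is_up(working: FiberPair, protect: FiberPair) -> bool:
    """The circuit stays up if the working pair or, failing that,
    the protect pair can actually carry traffic."""
    if working.carries_traffic():
        return True
    return protect.carries_traffic()  # fail over to the protect pair

# Roughly what happened here: the working pair was yanked by mistake,
# and the protect pair at the demarc had never been installed.
working = FiberPair("working", healthy=False)
protect = FiberPair("protect", installed=False)
print(circuit_is_up(working, protect))  # False -> hard down, despite "redundancy"
```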

For lessons learned here, the main thing that comes to me is that we absolutely have to find a way to get true redundancy on this link, even if it means connecting our own strings between tin cans. I should explain, by the way, that redundancy to this headquarters building is very difficult due to location: the last-mile provider is the same no matter who we use to provision the circuit. In addition, with one major fiber loop in the area, even if we could get redundancy on the last mile we would still be at the mercy of that loop. We are, after this incident, looking at a fixed line-of-sight (LoS) wireless option that has recently become available. Apparently we can get at least 20 Mb/s, although I haven’t heard any claims on the latency, so we’ll see.
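
Once that wireless link materializes, the latency question is easy enough to answer empirically. Here’s a rough sketch of the kind of probe I have in mind; the target host and port are placeholders, not anything from our actual setup:

```python
import socket
import statistics
import time

def tcp_connect_rtts(host: str, port: int, samples: int = 10) -> list[float]:
    """Measure TCP connect round-trip times (in ms) as a crude latency probe."""
    rtts = []
    for _ in range(samples):
        start = time.monotonic()
        with socket.create_connection((host, port), timeout=5):
            pass  # connection established; close immediately
        rtts.append((time.monotonic() - start) * 1000.0)
        time.sleep(0.5)  # don't hammer the target
    return rtts

if __name__ == "__main__":
    # Placeholder target; point this at something on the far side of the new link.
    rtts = tcp_connect_rtts("example.com", 443)
    print(f"min/avg/max: {min(rtts):.1f} / {statistics.mean(rtts):.1f} / {max(rtts):.1f} ms")
```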

I’m also shocked and appalled that three major telecoms, all working in concert, took almost a full day to run this problem to ground. I’m probably naive, but I expect more. The only saving grace in all of this is the level of professionalism and support I received from the telecom brokers we use. They were absolutely on top of this from the beginning, shepherded the whole process along, and even facilitated communications between the players with their own conference bridge for the better part of a day. If anyone needs any telecom services brokered (anywhere in the world, I’m told), contact Rick Crabbe at Threshold Communications.

With this summation done, my venting complete, and everything right with the world, I’m off for a beverage.