At approximately 1:58 pm PST last Thursday, the two edge ASA 5510 units at our corporate headquarters dropped off the network.  At the time I was in a different office up in Quebec, Canada, so I delegated the job of working the problem with TAC and bringing the units back online to one of the other engineers.  That process took much longer than expected, and I won’t bore you with the details.  What I will bore you with, however, are a few observations I have now that we have more time and experience working with Cisco’s ASA product line:

  • The ASA has some sort of systemic, though exceedingly rare, problem on 8.3(x) and newer code.
  • Said problem causes the units to reboot and take out the system flash (disk0:) but not the user flash (disk1:).
  • The flash appears to be erased, but in fact it is the MBR that is gone, not the data (we used a hardware forensic disk analysis unit to verify this).
  • Cisco doesn’t have enough data points yet to even acknowledge this is an issue.  I don’t believe they’re “hiding” a problem; I just don’t think enough people have hit the particular set of circumstances that causes this and then reported it back to Cisco.
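
For anyone who wants to check a card of their own for the same symptom, here is a minimal Python sketch of the kind of check involved. It is not what our forensic unit does internally; it only assumes you have a raw image of the flash card (taken with dd or similar). The function names and the image path are made up for illustration, and treating all-0x00 or all-0xFF sectors as "blank" is my assumption about what an erased card reads back as. It looks for the standard MBR boot signature (0x55 0xAA at bytes 510-511) plus at least one used partition entry, then counts how many later sectors still hold data.

```python
SECTOR_SIZE = 512
MBR_SIGNATURE = b"\x55\xaa"   # standard boot signature at bytes 510-511
PARTITION_TABLE_OFFSET = 446  # four 16-byte entries follow this offset


def mbr_looks_valid(sector0: bytes) -> bool:
    """True if sector 0 has the boot signature and a used partition entry."""
    if len(sector0) < SECTOR_SIZE:
        return False
    if sector0[510:512] != MBR_SIGNATURE:
        return False
    # Byte 4 of each entry is the partition type; 0x00 means "unused".
    return any(
        sector0[PARTITION_TABLE_OFFSET + 16 * i + 4] != 0 for i in range(4)
    )


def count_nonblank_sectors(image, skip: int = 1) -> int:
    """Count sectors after the first `skip` that are not all 0x00 or all 0xFF.

    Assumption: an erased or never-written sector reads back as a single
    repeated byte of 0x00 or 0xFF.
    """
    image.seek(skip * SECTOR_SIZE)
    count = 0
    while True:
        sector = image.read(SECTOR_SIZE)
        if not sector:
            break
        if set(sector) not in ({0x00}, {0xFF}):
            count += 1
    return count


def diagnose(image) -> str:
    """Classify a raw card image opened in binary mode."""
    image.seek(0)
    mbr_ok = mbr_looks_valid(image.read(SECTOR_SIZE))
    has_data = count_nonblank_sectors(image) > 0
    if mbr_ok:
        return "mbr-intact"
    return "mbr-missing-data-present" if has_data else "blank"


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_mbr.py disk0.img
    with open(sys.argv[1], "rb") as img:
        print(diagnose(img))
```

A result of "mbr-missing-data-present" matches what we saw: the card looks empty to the ASA, but the data behind sector 0 is still there.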

My own suspicions about the root cause are below, though I’d welcome any additional thoughts from anyone with experience in this area.  I should also point out that at least two other people have told me they experienced this exact problem.

  • The behavior and crash lead me to believe that the ASA experiences, at the point of failure, the equivalent of a Windows “BSOD”.  This would point to either the memory or the motherboard itself, as these are the primary hardware-based causes of this type of crash in any system.  Most other crashes can be recovered from and produce data.
  • The ASA accesses the flash on initial load, but then runs from memory.  The flash cards in these units had trashed MBRs, which leads me to believe that the ASA was touching the MBR at the time of the crash, and that is inconsistent with what I know about how the ASA is supposed to operate.  It’s possible it was just accessing the flash to write a crash dump and crashed partway through.  That makes some sense to me.
  • All failures I have experienced and heard of from others have at least a couple of things in common:  They are all on 8.3(x) code.  They were all user-upgraded to support 8.3(x); this code required a memory and flash upgrade, so you had to buy the upgrades from Cisco and field-install them yourself.  The units were also all manufactured immediately following the Cisco manufacturing slowdown in 2008/2009, when lead times were running into the several-month range.  This makes me a bit suspicious that quality control on either the memory or the units themselves could be to blame.  I’ve tried to verify with revision numbers, etc., but I haven’t been able to gather enough data from “out there” to settle on this as a cause.

I hope this helps someone out there, and I truly am interested in getting more information from anyone who has it.  Cisco is taking our units back, but pulling them aside before refurbishment so that their engineers can dissect them.  If I find anything out from that, I’ll post the findings here.

The configuration and build-out of the ASA 5510 units is as follows:

  • 1 GB of memory, 512 MB of system flash, 256 MB of user flash.  IPS module, Security Plus, Botnet filter, AnyConnect Essentials, Mobile, etc. licenses.  Actually, just about every license is on board; at this point these units are maxed out on everything.  Utilization is still at a reasonable level.
  • Configuration includes multiple IPsec site-to-site VPNs, SSL VPN for all Mac, Linux, Windows, iPad and iPhone clients, sub-interfaces, stateful failover, both IPv4 and IPv6, OSPF with static redistribution, and full IPS functionality.