Behind the Scenes: Bugs and Networking Equipment

June 8, 2011 — netequalizer

If you relied only on conspiracy theories to explain the origin of software bugs, they would likely leave little trust in the vendors and manufacturers providing your technology. In general, the more skeptical theories chalk software bugs up to a few nefarious, and easily preventable, causes:

Corporate greed and the failure to effectively allocate resources
Poor engineering
Companies deliberately withholding fixes in an effort to sell upgrades and future support

Although I’ve certainly seen evidence of these policies many times over my 25-year career, the following case studies are more typical for understanding how a bug actually gets into a software release. It’s not necessarily the conspiracy it might initially seem.

My most memorable system failure took place back in the early 1990s. I was the system engineer responsible for the underlying UNIX operating system and Redundant Disk Drives (RAID) on the Audix Voice Messaging system. This was before the days of widespread e-mail use. I worked for AT&T Bell Labs at the time, and AT&T had a reputation of both high price and high reliability. Our customers, almost all Fortune 500 companies, used their voice mail extensively to catalog and archive voice messages. Customers such as John Hancock paid a premium for redundancy on their voice message storage. If there were any field-related problems, the buck stopped in my engineering lab.

For testing purposes, I had several racks of Audix (trade mark) systems and simulators combined with various stacks of disk drives in RAID configurations. We ran these systems for hours, constantly recording voice messages. To stress the RAID storage, we would periodically pull the power on a running disk drive. We would also smash them with a hammer while running. Despite the deliberate destruction of running disk drives, in every test scenario the RAID system worked flawlessly. We never lost a voice mail message in our laboratory.

However, about six months after a major release, I got a call from our support team. John Hancock had a system failure and lost every last one of their corporate voice mails. (AT&T had advised backing data up to tape, but John Hancock had decided not to utilize that facility because of their RAID investment. Remember, this was in the 1990s and does not reflect John Hancock current policies.)

The root cause analysis took several weeks of work with the RAID vendor, myself and some of the key UNIX developers sequestered in a lab in Santa Clara, California. After numerous brainstorm sessions, we were able to re-create the problem. It seemed the John Hancock disk drive had suffered what’s called a parity error.

A parity error can develop if a problem occurs when reading and writing data to the drive. When the problem emerges, the drives try to recover, but in the meantime the redundant drives read and write the bad data. As the attempts at auto recovery within the disk drive go on (sometimes for several minutes), all of the redundant drives have their copies of the data damaged beyond repair. In the case of John Hancock, when the system finally locked up, the voice message indices were useless. Unfortunately, very little could have been done on the vendor or manufacturing end to prevent this.

More recently, when APconnections released a new version of our NetEqualizer, despite extensive testing over a period of months including a new simulation lab, we had to release a patch to clean up some lingering problems with VLAN tags. It turned out the problem was with a bug in the Linux kernel, a kernel that normally gets better with time.

So what happened? Why did we not find this VLAN tag bug before the release? Well, first off, the VLAN tagging facility in the kernel had been stable for years. (The Linux kernel had been released as stable by Kernel.org.) We also had a reliable regression test for new releases that made sure it was not broken. However, our regression test only simulated the actual tag passing through the kernel. This made it much easier to test, and considering our bandwidth shaper software only affected the packets after the tag was in place, there was no logical reason to test a stable feature of the Linux kernel. To retest stable kernel features would not have been economically viable considering these circumstances.

This logic is common during pre-market testing. Rather than test everything, vendors use a regression test for stable components of their system and only rigorously test new features. A regression test is a subset of scenarios and is the only practical way to make sure features unrelated to those being changed do not break when a new release comes out. Think of it this way: Does your mechanic do a crash test when replacing the car battery to see if the airbags still deploy? This analogy may seem silly, but as a product developer, you must be pragmatic about what you test. There are almost infinite variations on a mature product and to retest all of them is not possible.

Therefore, in reality, most developers want nothing more than to release a flawless product. Yet, despite a developer’s best intentions, not every stone can be turned during pre-market testing. This, however, shouldn’t deter a developer from striving for perfection — both before a release as well as when the occasional bugs appear in the field.

Posted in Commentary and Editorials. Tags: audix, bandwidth control, RAID. Leave a Comment »

NetEqualizer News Blog

Follow NetEq Blog

Categories

Keyword Search

Pages

Recent Posts