Killer software: 4 lessons from the deadly 737 MAX crashes

It’s been widely reported that Boeing’s decision to use a flight control software fix known as MCAS in its 737 MAX planes was one of the key factors that led to two crashes that killed 346 people.

But the cultural and organizational issues associated with the design of complex systems contributed just as much to the tragedies, according to Gregory Travis, a veteran instrument-rated pilot and career software engineer.

“The MCAS system was just unbelievably deficient, but it was the culture at Boeing that allowed this to happen…,” he said in a recent interview with FierceElectronics.

Travis has written and spoken extensively on the topic and the lessons to be learned from it for an audience of engineers. (He is scheduled to deliver a keynote address on the topic at Sensors Expo in San Jose on June 23.)

His mission? To make sure that this never happens again, a potentially tall order, he added, given that software is becoming a larger component of nearly all systems, from the most basic consumer products to safety-critical systems that control autonomous vehicles.

The Boeing tragedies, Travis believes, can serve as a powerful example of how things can go wrong, beginning with what he says was the company’s software fix for a basic design snafu.

Early concern over MCAS software

After the deadly crashes on Oct. 29, 2018 and March 10, 2019, agencies in the U.S., Indonesia and Ethiopia opened inquiries into the cause, with suspicion falling quickly onto the Boeing flight control system known as MCAS (Maneuvering Characteristics Augmentation System).

Last October, Indonesian officials cited nine factors in the Lion Air crash of 2018. They largely blamed MCAS, noting it was “not a failsafe design and did not include redundancy.” 

According to the Indonesian findings, MCAS relied on only one sensor, which had a fault, and the flight crews had not been well trained in how to use the system. The investigators also noted that there was no cockpit warning light; the Lion Air pilots were unable to determine their true airspeed and altitude as the plane oscillated for nearly 10 minutes; and every time the pilots pulled up from a dive, MCAS pushed the nose back down.

Ethiopian investigators are also focusing on MCAS.

Travis believes that MCAS was essentially Boeing’s software fix for an airframe design snafu, the details of which he outlined in an IEEE Spectrum article as well as a more recent FAQ.

Boeing’s 737 models have sat close to the ground for decades, but the company eventually decided that, in order to stay competitive with Airbus planes, larger engines were needed, Travis said. Instead of lengthening the landing gear to give the engines clearance or making other alterations, Boeing moved the engines farther forward, well in front of the wing. In turn, Boeing found that when pilots applied power to the engines, the aircraft would pitch up, or raise its nose, Travis noted.

“Instead of going back to the drawing board and getting the airframe hardware right, Boeing relied on MCAS,” Travis said. “Boeing’s solution to its hardware problem was software.” A hardware fix would have been costly, delaying delivery of new planes to customers and potentially incurring certification delays lasting more than a year.

Reliance on a single sensor

Travis said that perhaps the biggest failing of MCAS was that it relied on only one of the two angle-of-attack sensors located on either side of the plane, not both. “Those sensors fail all the time when they get hit by a bird or freeze, and engineers decided to use only one of them, which is mind-boggling,” he said.

He noted that in the Ethiopian Airlines crash in March 2019, a faulty angle-of-attack sensor went from 12 degrees to 70 degrees in less than a second, but MCAS trusted that reading when making a pitch adjustment instead of comparing it with the reading from the angle-of-attack sensor on the other side of the aircraft.

“The MCAS software didn’t have any basic sanity checks to detect that the data was bad,” Travis said. What’s even more astounding, he added, is that Boeing is still trying to fix MCAS while all of its 737 MAX planes are grounded around the globe. That includes 737 MAX planes used by American and Southwest in the U.S. “I don’t understand why Boeing is hell-bent on fixing MCAS as opposed to retreating and taking another tack.”
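To make the missing check concrete, here is a minimal sketch, in Python, of the kind of sanity check Travis is describing: rejecting a reading whose rate of change is physically implausible and cross-checking it against the vane on the other side of the aircraft. The function name and threshold values are illustrative assumptions, not Boeing’s actual MCAS logic.

```python
# Illustrative sketch only: thresholds and structure are assumptions,
# not Boeing’s actual MCAS logic.

MAX_PLAUSIBLE_RATE_DEG_PER_S = 20.0   # assumed limit on how fast AoA can realistically change
MAX_CROSS_CHECK_DIFF_DEG = 5.5        # assumed allowable left/right disagreement

def aoa_reading_is_trustworthy(reading_deg, other_side_deg,
                               previous_reading_deg, dt_s):
    """Return True only if an angle-of-attack reading passes basic sanity checks."""
    # Reject physically implausible jumps (e.g., 12 degrees to 70 degrees in under a second).
    rate_deg_per_s = abs(reading_deg - previous_reading_deg) / dt_s
    if rate_deg_per_s > MAX_PLAUSIBLE_RATE_DEG_PER_S:
        return False
    # Cross-check against the sensor on the other side of the aircraft.
    if abs(reading_deg - other_side_deg) > MAX_CROSS_CHECK_DIFF_DEG:
        return False
    return True

# A jump like the one in the Ethiopian Airlines accident would be rejected.
print(aoa_reading_is_trustworthy(70.0, 14.0, 12.0, 1.0))  # False
```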

Boeing's response

For its part, Boeing has begun taking steps to return the 737 MAX to service safely.

Boeing has made major management changes since the crashes, including installing a new CEO, David Calhoun, in December 2019 and naming a new CIO, Susan Doniz, in February. Notably, she serves as the vice chair of the Digital Transformation Advisory Council of the International Air Transport Association.   

The company has also separated the chairman and CEO roles to sharpen its focus on safety.

“First and foremost, our primary focus continues to be returning the 737 MAX to service safely,” Calhoun said on a recent earnings call. He said the Federal Aviation Administration and global regulators will determine the timeline for certification and return to service “and we remain fully committed to supporting this process… We’ll get it done, and we’ll get it done the right way.”

With regards to MCAS, Boeing is flight-testing an updated version and had conducted 1,157 test flights totaling 2,175 hours as of Feb. 24, a Boeing spokesman told FierceElectronics. He said additional layers of protection have been added: MCAS now compares data from both angle-of-attack sensors before activating and will respond only if data from both sensors agree. MCAS will also activate only a single time and will never provide more input than the pilot can counteract using the control column alone.
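For readers who think in code, the sketch below restates those three protections as simple gating logic. It is a hedged illustration only; the class name, tolerance, and trim units are assumptions rather than Boeing’s implementation.

```python
# Illustrative sketch of the three protections described above. All names,
# thresholds, and units are assumptions for illustration, not Boeing’s design.

AGREEMENT_TOLERANCE_DEG = 5.5   # assumed allowable left/right AoA difference
MAX_TRIM_COMMAND = 1.0          # assumed cap the pilot can counter with the column

class UpdatedMcasSketch:
    def __init__(self):
        self.has_activated = False

    def nose_down_trim_command(self, left_aoa_deg, right_aoa_deg, requested_trim):
        """Return a nose-down trim command only if every protection is satisfied."""
        # Protection 1: activate only if both angle-of-attack sensors agree.
        if abs(left_aoa_deg - right_aoa_deg) > AGREEMENT_TOLERANCE_DEG:
            return 0.0
        # Protection 2: activate only a single time.
        if self.has_activated:
            return 0.0
        self.has_activated = True
        # Protection 3: never command more than the pilot can counteract.
        return min(requested_trim, MAX_TRIM_COMMAND)
```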

Boeing has also posted on its website a discussion of its own investigations into the MAX’s angle-of-attack indicator and AOA Disagree alert, noting that the alert will be implemented as a standard, standalone feature before the MAX returns to service. AOA Disagree is a software-based information feature designed to alert flight crews when data from the left and right angle-of-attack sensors disagree; originally, it was not standard.
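Conceptually, the AOA Disagree alert is the simplest of these checks: flag a difference between the two vanes for the crew. The sketch below is a hypothetical illustration; the 10-degree threshold and function name are assumptions, not Boeing’s specification.

```python
# Hypothetical illustration of an AOA-Disagree-style alert; the threshold is
# an assumption, not Boeing’s specification.

DISAGREE_THRESHOLD_DEG = 10.0

def aoa_disagree(left_aoa_deg: float, right_aoa_deg: float) -> bool:
    """Return True when the crew should see an AOA DISAGREE indication."""
    return abs(left_aoa_deg - right_aoa_deg) > DISAGREE_THRESHOLD_DEG

print(aoa_disagree(70.0, 14.0))  # True -> show the alert to the crew
```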

Yes, Boeing has a new CEO and CIO, but the future of the 737 MAX and even the future of Boeing remain in question.

Lessons (hopefully) learned

No matter how the Boeing story plays out, Travis says that there are lessons from the case of the 737 MAX that can help other organizations designing complex systems avoid making similar mistakes in the future:

  1. Keep software and systems in complex machines as simple as possible, but not too simple.  Engineers call this the Goldilocks approach, when things are “just right.”
  2. Don’t impose software on an intractable hardware problem. MCAS didn’t fundamentally change the way the 737 MAX would fly. Keeping the engines further back on the wing would have.
  3. Remember redundancy. Do not rely on data readings from just one angle of attack sensor or any single data input.
  4. Communicate with empathy. Boeing executives and engineers seem to have failed on that point.

The lesson about the lack of empathy is a particularly important one, Travis said.

“What has gripped me the whole time with the Boeing experience is how was it possible that something so manifestly deadly ever saw the light of day?” he said. “What happened to allow that to happen and not point out just how deadly the system was? Anybody who knew anything about planes would have known that relying on data from a single sensor was taking a big risk. There was some kind of huge breakdown in the way communications occurred,” he said.
