We live in a universe filled with Fragile Systems
How ChatGPT views the nature of Fragile Software Systems

In 2011 Marc Andreessen wrote, “In short, software is eating the world.” But are the digital systems as robust as the analog systems they are replacing?

Four years ago, the day after the Republican National Convention, Southwest Airlines experienced a major software outage.

Today, the day after the 2024 Republican National Convention, flights are delayed worldwide, people cannot access some financial institutions, and service organizations cannot deliver services due to the Azure outage impacting secured VDI workstations. A retired colleague called me to share that his local doctor could not cancel an appointment, and that he could not have his car inspected in Massachusetts, due to this outage. Even local radio stations have been experiencing issues inserting bumper music and advertising, and providing guests and hosts with segment and show clocks, due to an outage on a system known as WideOrbit.

Consider the worldwide economic impacts of these outages in terms of lost billing, excess costs related to delayed or canceled flights, and even a smaller commission check for the salesperson who sells advertisements at the local radio station. The cause? Outages related to an update to CrowdStrike on Windows systems and an outage in Microsoft’s Azure cloud.

If your organization doesn’t have a relationship with QA Consultants today, then you very likely have unmanaged software risk that is making its way into production. You can change that with an email to [email protected].

Airlines for America, a trade group for the airline industry, estimates that a delayed flight costs an airline just over $100 USD per minute (https://www.airlines.org/dataset/u-s-passenger-carrier-delay-costs/). As of this writing, some 2,800 flights have been canceled, which is roughly double the usual daily total in the USA, and 27,000 airline flights have been delayed. It doesn’t take long for these types of delays to run into the hundreds of millions of dollars in costs. At roughly $100 per minute, an average 20-minute delay across 27,000 flights translates to $54 million in additional costs.
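The delay-cost arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function name `estimated_delay_cost` is hypothetical, the per-minute figure is rounded down to $100, and the 40- and 60-minute scenarios are assumptions added to show how quickly the total grows, not reported data.

```python
# Back-of-the-envelope estimate of airline delay costs.
# Assumption: the per-minute cost is rounded down to $100 from the
# "just over $100" figure attributed to Airlines for America.
COST_PER_DELAY_MINUTE_USD = 100


def estimated_delay_cost(flights, avg_delay_minutes,
                         cost_per_minute=COST_PER_DELAY_MINUTE_USD):
    """Return the total delay cost in USD for a number of delayed flights."""
    return flights * avg_delay_minutes * cost_per_minute


if __name__ == "__main__":
    delayed_flights = 27_000  # delayed-flight count quoted in the article
    # 20 minutes is the article's example; 40 and 60 are hypothetical.
    for minutes in (20, 40, 60):
        cost = estimated_delay_cost(delayed_flights, minutes)
        print(f"{minutes}-minute average delay: ${cost / 1_000_000:,.0f} million")
```

With a 20-minute average delay this reproduces the $54 million figure; each additional 20 minutes of average delay adds another $54 million under the same assumptions.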

To translate this into human terms, look at the FlightAware Misery Map, which shows the number of flights delayed at each airport: https://www.flightaware.com/miserymap/

We can even look back at the Colonial Pipeline episode in 2021. Even with digital pipeline control offline, there should have been enough people in the loop to assume manual control of the pumping stations and deliver fuel to the market with minimal interruptions. Was there insufficient labor and working knowledge of running the system manually left in the company? We may never know the answer to that question, even though we understand the frailty of the digital solution.

I have maintained for a while that we live in a universe surrounded by fragile software. As CrowdStrike has demonstrated, we are only one software patch away from disaster on a scale not possible before in human history because of our high dependence on software-defined systems. How confident are you that your development teams and quality assurance organizations are up to finding these issues before they become a part of your or your customers’ operational systems?

Let’s be blunt: For many organizations, development and quality assurance are costs that financial managers seek to minimize, often accepting proposals driven by price without recognizing that the unhandled risk is being transferred elsewhere. Check a box. Pay a manager a bonus based on how far spending came in below the goal rate set at the beginning of the year. Today, the risk from buying on price transfers into production systems. Fragile software may be cheap to produce, but it is anything but cheap once it gets to production.

This cost vs. quality issue is an old problem. John Ruskin (1819-1900) summed up this problem in the 1800s by coining the “Common law of business balance.”

The common law of business balance prohibits paying a little and getting a lot. It can’t be done. ... If you deal with the lowest bidder, it is well to add something for the risk you run, and if you do that, you will have enough to pay for something better.

How many software development organizations do you know that set aside extra contingency funds when accepting the lowest bid?

These situations are preventable. The people, the processes, and the skills that are a part of your quality assurance organization are key. Yes, it might be easier and faster for your development team to promote code to production without the pesky delays related to quality. How is that working out for you? Today, that isn’t going so well for the large number of people who have to bear the costs so that a developer can have an easier, cheaper existence.

I want to challenge you and your organization. The organization I work for, QA Consultants, can help your organization significantly enhance its QA practices to minimize, not amplify, the risks you face in production (and potentially the greater risks at stake post-production, including legal, reputational, and financial risk, and potentially even risk to human life). Email me at [email protected]. Let’s discuss how we can transform your production risks into success stories, so you don’t become the next headline in your industry.

Kranthi Paidi

Principal Cloud Engineer @ ResMed | Performance & Resilience | Cloud Infrastructure, AWS Certified, Datadog

1mo

The article doesn’t talk about how MSFT and Crowdstrike could have prevented the problem without the third party. If the last patch had never caused an issue, the lack of testing in “some cases” would have gone unnoticed. I say some cases because I do not even think that MSFT and Crowdstrike don’t have a QA team with them. If they did not have a QA, this latest problem wouldn’t be the first catastrophic one. Catastrophic bugs like these happen every day in every firm. The blast radius of those bugs is minimal in most cases and far-reaching in cases like MSFT. The question we should all be asking is how do we test what we test and why don't we test what we don't test. By the way, the correlation between RNC and the catastrophic events is nice! 😊

Seth Eliot

AWS Cloud expert and Resiliency Specialist | ex-AWS | ex-Amazon | ex-Microsoft | AWS Certified Solutions Architect – Professional

2mo

Well, my intolerance of AI imagery is undiminished.... if that helps 😂🤣😂
