System Availability Is About More Than Just Operations

  • Mateusz Myszka

We know that application and infrastructure incidents cause losses, but beyond the operational and financial side, other dimensions are usually ignored: reputational, organizational, and supply-chain. Losses in these ‘hidden’ areas can be even more troubling for many organizations.

Our work today is marked by increasing reliance on cloud infrastructure and growing complexity in infrastructure and application architectures. All of that places more pressure on DevOps, SRE, and development teams.

These trends have many facets that go well beyond operations, and it is becoming clear that reality won’t let us keep doing things the old way. The speed of change, fierce market competition, and rising costs demand more technology in operational practices and higher productivity from developers, SREs, and DevOps teams.

In such a scenario, it would be hard to argue against using more automation in operations. Let’s explore in this article why this is the case and what can be done.

Reliability is getting harder

Recent studies show that hourly losses during downtime grow 6.5% per year, on average, as reliance on cloud systems keeps increasing.

Preventing losses from downtime is also becoming more difficult. Interdependence between IT ecosystems, reliance on third-party APIs, and distributed architectures are all more widespread now than in the past.

Complex technologies and infrastructures, once only accessible to very large organizations, are now affordable to every one of us, regardless of the size of our organization. Although small and mid-sized businesses experience outage losses on a smaller scale, the complexity of incidents and remediation needed to keep Mean Time To Recovery (MTTR) down is closer to that experienced by large corporations than ever before.

Business and financial losses are often cited as an example of how system availability correlates with financial results. One estimate, based on a survey into how impatient internet users are, “calculated that a page load slowdown of just one second could cost $1.6 billion in sales [to Amazon] each year.” Although such estimates are arguably contrived to some extent, it is fairly sensible to assume that application performance and sales are positively correlated: the faster and more reliable the system, the higher the revenue.

As we said, the scale of losses depends on the organization size and may vary considerably among companies and industries as well:

Company size     Average cost of downtime
Small            $8,580 per hour
Medium           $215,637 per hour
Large            $686,250 per hour
All companies    $163,674 per hour

Table: Average cost per hour of downtime broken down by company size

Large costs associated with system downtime are not a modern phenomenon, though. In 1998, IBM published a similar report on hourly outage costs, broken down by industry. Figures ranged from $89,000 per hour in the transportation industry to $6.5 million per hour in the financial sector.

Organizational impacts

Failures with widespread damage can subject us and the entire engineering team to high levels of stress, especially when remediation tasks have to be performed manually while the whole organization is looking to us for a solution. When incident causes are rooted in code changes, the stress also reaches the development team.

This can instill a fear of experimentation and innovation in both the product and operational teams. Higher turnover in the technical teams is another potential result of not having a well-crafted plan to mitigate application downtime. Organizations that fall into this situation risk being outpaced by their competitors sooner or later.

High levels of stress can affect the technical team’s satisfaction on the job. The number one driving force behind engineer productivity is job happiness. Multiple studies have shown that happier engineers are better at analytical problem solving and critical thinking. Positive emotions on the job are also correlated with shorter issue-fixing times. There is no doubt that job satisfaction plays a major role in the volume and quality of engineers’ output, and system outages, especially when they recur frequently, drive happiness down.

Supply-chain effects

It is easy for us engineers to get used to working with a feeling of isolation. Some of us even enjoy it. In reality, though, our organizations have to engage with an increasingly interconnected world, and everything we do can affect those connections positively or negatively.

We must be able to identify how downtime can propagate and escalate in the organization’s supply chain, for example. Although businesses usually rely on insurance to protect themselves against material losses, reputational and trust repercussions among customers and partners can be devastating, similarly to the intra-organizational impacts discussed above.

Downtime costs related to supply-chain interconnections are a two-sided coin. We need to consider both how external dependencies might fail and how internal systems will behave when they do. One approach to preventing supply-chain incidents from escalating into internal failures is building “supply chain redundancy.”

If a third-party, managed database goes down, for example, a redundant storage system could be used. Unfortunately, sometimes it is prohibitively expensive or operationally unfeasible to duplicate resources, but that doesn’t mean we can ignore the issue.

Having two completely different databases in full sync for redundancy may not be a good solution. In such cases, read requests may still fail during downtime, but write requests could be held in a queue service, for instance, and applied once the database is back, reducing the number of failed responses for such requests.
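The queueing idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production design: `FlakyDatabase`, `BufferedWriter`, and the in-memory `deque` that stands in for a real queue service are all hypothetical names introduced here.

```python
from collections import deque


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) managed database while it is down."""


class FlakyDatabase:
    """Stand-in for a third-party managed database that can go down."""

    def __init__(self):
        self.available = True
        self.rows = []

    def write(self, row):
        if not self.available:
            raise DatabaseUnavailable("managed database is down")
        self.rows.append(row)


class BufferedWriter:
    """Accepts writes during an outage by parking them in a queue."""

    def __init__(self, db):
        self.db = db
        self.pending = deque()  # assumption: stands in for a queue service

    def write(self, row):
        """Return 'stored' on success, 'queued' if the database is down."""
        try:
            self.flush()  # drain older writes first to preserve ordering
            self.db.write(row)
            return "stored"
        except DatabaseUnavailable:
            self.pending.append(row)
            return "queued"

    def flush(self):
        """Replay queued writes in order; stops at the first failure."""
        while self.pending:
            self.db.write(self.pending[0])
            self.pending.popleft()


db = FlakyDatabase()
writer = BufferedWriter(db)

first = writer.write({"id": 1})   # database up: written directly
db.available = False
second = writer.write({"id": 2})  # database down: queued, not rejected
db.available = True
third = writer.write({"id": 3})   # recovery: backlog drains, then new write
print(first, second, third, db.rows)
```

The caller gets an accepted response in all three cases, which is the point of the pattern. A real setup would also need a durable queue and a way to handle writes that fail for reasons other than downtime.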

Correlation between downtime costs and duration

Focusing on a lower MTTR (Mean Time To Recovery) rather than a higher MTBF (Mean Time Between Failures) leads, in most cases, to higher availability. In a cluster with multiple nodes, a failure in one or a few nodes can go almost unnoticed if they recover fast enough (low MTTR).
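To make the relationship concrete, a common steady-state approximation (an assumption brought in here, not a formula from the studies cited in this article) is availability = MTBF / (MTBF + MTTR). The sketch below uses made-up figures to compare the two levers.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only: one failure every 500 hours, 2 hours to recover.
baseline = availability(mtbf_hours=500, mttr_hours=2)
mtbf_doubled = availability(mtbf_hours=1000, mttr_hours=2)
mttr_halved = availability(mtbf_hours=500, mttr_hours=1)

for label, value in [
    ("baseline", baseline),
    ("MTBF doubled", mtbf_doubled),
    ("MTTR halved", mttr_halved),
]:
    print(f"{label}: {value:.4%}")
```

Under this approximation, halving MTTR buys the same availability as doubling MTBF, so the choice comes down to which of the two is easier and cheaper to move.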

In some cases, downtime costs and duration have a linear, proportional relationship: the longer it takes to remediate a failure, the higher the losses. For these scenarios, which seem to be the majority, a strategy focused on lowering MTTR is solid.

There are particular cases, though, in which downtime costs depend more on how often outages happen than on how long they last. This usually happens when each outage carries a high fixed cost (e.g., the system controls an industrial machine that is very expensive to restart, or an outage could endanger human lives). In situations like these, a low MTTR may not be as beneficial as a higher MTBF. MTTR does not become irrelevant, but the balance shifts in favor of dedicating more resources to optimizing MTBF.
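The two situations can be compared with a simple cost model. This is only a sketch: the function name and every figure below are hypothetical, chosen to show how a fixed cost per outage shifts the balance.

```python
def annual_downtime_cost(outages_per_year, hours_per_outage,
                         cost_per_hour, fixed_cost_per_outage=0.0):
    """Yearly loss = number of outages x (fixed cost + time-based cost)."""
    per_outage = fixed_cost_per_outage + hours_per_outage * cost_per_hour
    return outages_per_year * per_outage


# Case 1: purely time-proportional losses (no fixed cost per outage).
linear_base = annual_downtime_cost(12, 2.0, 10_000)
linear_faster = annual_downtime_cost(12, 1.0, 10_000)  # MTTR halved
linear_rarer = annual_downtime_cost(6, 2.0, 10_000)    # frequency halved

# Case 2: each outage also costs a fixed 500,000 (e.g. a machine restart).
fixed_base = annual_downtime_cost(12, 2.0, 10_000, 500_000)
fixed_faster = annual_downtime_cost(12, 1.0, 10_000, 500_000)
fixed_rarer = annual_downtime_cost(6, 2.0, 10_000, 500_000)

print("linear:", linear_base, linear_faster, linear_rarer)
print("fixed: ", fixed_base, fixed_faster, fixed_rarer)
```

In the first case, recovering twice as fast saves as much as failing half as often. In the second, faster recovery barely moves the total, while halving the number of outages halves it.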

Operational automation: there’s no way around it

The vast majority of remediation actions for handling failure incidents are repetitive, non-creative tasks that have to be carried out as quickly and precisely as possible. The decision-making process behind remediation actions is commonly objective, rational, and quantifiable.

Good candidates for automation, we must agree.
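As a rough illustration of what “objective and quantifiable” can mean in practice, a remediation decision can often be written as a small rule table. Everything in this sketch is hypothetical: the metric names, the thresholds, and the action names are assumptions for the example, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile response time


def choose_remediation(sample: HealthSample) -> str:
    """Map measured symptoms to a remediation action (illustrative rules)."""
    if sample.error_rate >= 0.50:
        return "rollback_last_deploy"
    if sample.error_rate >= 0.05:
        return "restart_service"
    if sample.p95_latency_ms >= 2000:
        return "scale_out"
    return "no_action"


print(choose_remediation(HealthSample(error_rate=0.62, p95_latency_ms=900)))
print(choose_remediation(HealthSample(error_rate=0.08, p95_latency_ms=900)))
print(choose_remediation(HealthSample(error_rate=0.01, p95_latency_ms=3500)))
print(choose_remediation(HealthSample(error_rate=0.01, p95_latency_ms=300)))
```

Once the rule is explicit like this, a machine can apply it within seconds of the alert and apply it the same way every time.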

Until recently, recipes for reducing MTTR relied on heavy investments in training or on growing the operational team. Nowadays, though, automation platforms have simplified the implementation of auto-remediation to such a degree that, in many cases, DevOps expertise is not even an absolute requirement.

Gartner offers a model for quantifying the relationship between system availability and required investments. The authors first define a minimum cost corresponding to a “standard IT service”. Starting from this baseline cost, they derive estimates for how much investment is needed to achieve certain levels of availability, which we summarize in the following table:

System availability level           Cost multiplier
Highly Available (99.3%)            2.15 X
Continuously Available (99.81%)     6.45 X
Multisite Continuous Availability   8.6 X

Table: cost multiplier to achieve different availability levels, based on a “standard IT service” cost “X”

As discussed before, availability depends heavily on MTTR, which in turn depends largely on how quickly faulty systems can be remediated. Increasing remediation speed used to be extremely expensive, as the table shows: the higher we climb the availability ladder, the higher the cost multiplier. The relationship between availability and cost isn’t linear, and each additional fraction of a percentage point of availability can demand heavy investment.
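To put the percentages from the table into perspective, they can be converted into downtime allowed per year. This is a small sketch that assumes a 365-day year; the multisite level is left out because the table gives no percentage for it.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

# (availability in percent, cost multiplier) as listed in the table above
levels = {
    "Highly Available": (99.3, 2.15),
    "Continuously Available": (99.81, 6.45),
}


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a system may be down at a given availability level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for name, (percent, multiplier) in levels.items():
    hours = downtime_hours_per_year(percent)
    print(f"{name}: {percent}% -> about {hours:.1f} h/year at {multiplier} X cost")
```

Going from 99.3% to 99.81% cuts the yearly downtime budget from roughly 61 hours to roughly 17 hours, while the cost multiplier triples from 2.15 to 6.45.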

A low-code automation platform can provide better abstraction, speed up the process of automating tasks, promote more code reuse, flatten the learning curve, and improve knowledge transfer across teams. Achieving high availability doesn’t have to be prohibitively expensive anymore.

In Kaholo, for example, automation pipelines are built in a visual dashboard, similar to creating a flowchart. But instead of static elements in a diagram, the blocks represent real resources and can execute actions on those resources. These building blocks can represent metrics or logs ingested from APM services like AppDynamics, NewRelic or DataDog; cloud resources such as a virtual server, a Docker container, or a Kubernetes cluster; or SaaS services like a CRM or product analytics.

Kaholo supports automation logic ranging from the simple to the most advanced. For example, it is possible to create pipelines for event-driven workloads, recurring and scheduled jobs, asynchronous and concurrent pipeline branches, and much more. Several architectural design patterns can be easily implemented, such as finite-state machines or fan-out/fan-in pipelines.
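For readers unfamiliar with the fan-out/fan-in pattern, here is a minimal sketch in plain Python. It does not use or represent Kaholo’s interface: `check_node`, the node names, and the “-bad” suffix convention are hypothetical placeholders for real health checks.

```python
import asyncio


async def check_node(name: str) -> tuple[str, bool]:
    """Pretend health check; a real one would call the node over the network."""
    await asyncio.sleep(0)  # placeholder for I/O
    return name, not name.endswith("-bad")


async def find_unhealthy(nodes: list[str]) -> list[str]:
    # Fan-out: start one check per node and run them concurrently.
    results = await asyncio.gather(*(check_node(n) for n in nodes))
    # Fan-in: combine the individual results into a single answer.
    return [name for name, healthy in results if not healthy]


unhealthy = asyncio.run(find_unhealthy(["web-1", "web-2-bad", "db-1"]))
print(unhealthy)
```

The branches run independently and the pipeline continues only once all of them have reported back, which is the same shape a remediation pipeline takes when it checks many resources before deciding what to do.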

If you would like to read more on this subject, we recently published a whitepaper about strategies to reduce MTTR, increase availability, and use a low-code approach to automating remediation processes. Download it here.

Check out Kaholo’s sandbox environment where you can build automation in minutes!

