Introduction
“Although alert fatigue is blamed for high override rates in contemporary clinical decision support systems, the concept of alert fatigue is poorly defined. We tested hypotheses arising from two possible alert fatigue mechanisms: (A) cognitive overload associated with amount of work, complexity of work, and effort distinguishing informative from uninformative alerts, and (B) desensitization from repeated exposure to the same alert over time.”
Ancker, Jessica S., et al. “Effects of workload, work complexity, and repeated alerts on alert fatigue in a clinical decision support system.” BMC Medical Informatics and Decision Making, vol. 17, no. 1, 2017.
My name is Andrew Morris, and I’m the founder of GreyNoise, a company devoted to understanding the internet and making security professionals more efficient. I’ve probably had a thousand conversations with Security Operations Center (SOC) analysts over the past five years. These professionals come from many different walks of life and a diverse array of technical backgrounds and experiences, but they all have something in common: they know that false positives are the bane of their jobs, and that alert fatigue sucks.
The excerpt above is from a medical journal focused on drug alerts in a hospital, not a cybersecurity publication. What’s strangely refreshing about seeing these issues in industries outside of cybersecurity is being reminded that alert fatigue has numerous and challenging causes. The reality is that alert fatigue occurs across a broad range of industries and situations, from healthcare facilities to construction sites and manufacturing plants to oil rigs, subway trains, air traffic control towers, and nuclear plants.
I think there may be some lessons we can learn from these other industries. For example, while there are well over 200 warning and caution situations for Boeing aircraft pilots, the company has carefully prioritized its alert system to reduce distraction and keep pilots focused on the most important issues to keep the plane in the air during emergencies.
Many cybersecurity companies cannot say the same. Often these security vendors will oversimplify the issue and claim to solve alert fatigue, but frequently make it worse. The good news is that these false-positive and alert fatigue problems are neither novel nor unique to our industry.
In this article, I’ll cover what I believe are the main contributing factors to alert fatigue for cybersecurity practitioners, why alert fatigue sucks, and what we can do about it.
Contributing Factors
Alarm fatigue or alert fatigue occurs when one is exposed to a large number of frequent alarms (alerts) and consequently becomes desensitized to them. Desensitization can lead to longer response times or missing important alarms.
https://en.wikipedia.org/wiki/Alarm_fatigue
Technical Causes of Alert Fatigue
Overmatched, misleading or outdated indicator telemetry
Low-fidelity alerts are the most obvious and common contributor to alert fatigue. This results in over-alerting on events with a low probability of being malicious, or matching on activity that is actually benign.
One good example of this is low-quality IP block lists – these lists identify “known-bad IP addresses,” which should be blocked by a firewall or other filtering mechanism. Unfortunately, these lists are often under-curated or completely uncurated output from dynamic malware sandboxes.
Here’s an example of how a “known-good” IP address can get onto a “known-bad” list: A malicious binary being detonated in a sandbox attempts to check for an Internet connection by pinging Google’s public DNS server (8.8.8.8). This connection attempt might get mischaracterized as command-and-control communications, with the IP address incorrectly added to the known-bad list. These lists are then bought and sold by security vendors and bundled with security products that incorrectly label traffic to or from these IP addresses as “malicious.”
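One way to catch this kind of mistake is to check a block list against known-good infrastructure before loading it. The following is a minimal sketch, not a description of how any vendor curates its lists: the KNOWN_GOOD_NETWORKS allowlist, the split_blocklist function, and the sample sandbox output (which uses documentation-range addresses) are all illustrative assumptions.

```python
import ipaddress

# Hypothetical allowlist of widely used benign infrastructure.
# A real list would be much longer and would need regular maintenance.
KNOWN_GOOD_NETWORKS = [
    ipaddress.ip_network("8.8.8.8/32"),  # Google Public DNS
    ipaddress.ip_network("8.8.4.4/32"),  # Google Public DNS (secondary)
]


def split_blocklist(raw_entries):
    """Split entries into (keep, suspect), where suspect entries match the allowlist."""
    keep, suspect = [], []
    for entry in raw_entries:
        addr = ipaddress.ip_address(entry.strip())
        if any(addr in net for net in KNOWN_GOOD_NETWORKS):
            suspect.append(str(addr))
        else:
            keep.append(str(addr))
    return keep, suspect


# Hypothetical uncurated sandbox output: two made-up addresses plus 8.8.8.8.
sandbox_output = ["203.0.113.7", "8.8.8.8", "198.51.100.23"]
keep, suspect = split_blocklist(sandbox_output)
print("block:", keep)
print("review before blocking:", suspect)
```

Entries that land in the suspect list are not proven benign; they are simply candidates for a human to review before the list goes into a firewall.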
Low-fidelity alerts can also be generated when a reputable source releases technical indicators that can be misleading without additional context. Take, for instance, the data accompanying the United States Cybersecurity and Infrastructure Security Agency (CISA)’s otherwise excellent 2016 Grizzly Steppe report. The CSV/STIX files contained a list of 876 IP addresses, including 44 Tor exit nodes and four Yahoo mail servers, which, if loaded blindly into a security product, would raise alerts every time the organization’s network attempted to route an email to a Yahoo email address. As Kevin Poulsen noted in his Daily Beast article calling out the authors of the report, “Yahoo servers, the Tor network, and other targets of the DHS list generate reams of legitimate traffic, and an alarm system that’s always ringing is no alarm system at all.”
Another type of low-fidelity alert is the overmatch or over-sensitive heuristic, as seen below:
Alert: “Attack detected from remote IP address 1.2.3.4: IP address detected attempting to brute-force RDP service.”
Reality: A user came back from vacation and got their password wrong three times.
Alert: “Ransomware detected on WIN-FILESERVER-01.”
Reality: The file server ran a scheduled backup job.
Alert: “TLS downgrade attack detected by remote IP address: 5.6.7.8.”
Reality: A user with a very old web browser attempted to use the website.
It can be challenging for security engineering teams to construct correlation and alerting rules that accurately identify attacks without triggering false positives due to overly sensitive criteria.
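To make the RDP example concrete, here is a rough sketch comparing an over-sensitive rule with one that asks for more evidence. Everything in it is an assumption for illustration: the LoginEvent type, the function names, and especially the thresholds, which would need tuning against real traffic. It also assumes the events passed in all come from a single source IP.

```python
from dataclasses import dataclass


@dataclass
class LoginEvent:
    timestamp: float  # seconds since some reference point
    source_ip: str
    username: str
    success: bool


def naive_rule(events, max_failures=3):
    """Over-sensitive: fires on any three failed logins."""
    return sum(1 for e in events if not e.success) >= max_failures


def tuned_rule(events, min_failures=20, min_usernames=5, window_seconds=300):
    """Fires only on many failures across several usernames in a short window."""
    failures = sorted((e for e in events if not e.success), key=lambda e: e.timestamp)
    start = 0
    for end in range(len(failures)):
        # Shrink the window until it spans at most window_seconds.
        while failures[end].timestamp - failures[start].timestamp > window_seconds:
            start += 1
        window = failures[start:end + 1]
        usernames = {e.username for e in window}
        if len(window) >= min_failures and len(usernames) >= min_usernames:
            return True
    return False


# A user back from vacation: three wrong passwords, then a successful login.
vacation = [
    LoginEvent(0.0, "1.2.3.4", "alice", False),
    LoginEvent(10.0, "1.2.3.4", "alice", False),
    LoginEvent(20.0, "1.2.3.4", "alice", False),
    LoginEvent(30.0, "1.2.3.4", "alice", True),
]

# A hypothetical password-guessing run: 25 failures against 25 usernames.
guessing = [LoginEvent(float(i), "1.2.3.4", f"user{i}", False) for i in range(25)]

print("vacation:", naive_rule(vacation), tuned_rule(vacation))
print("guessing:", naive_rule(guessing), tuned_rule(guessing))
```

The tuned rule trades sensitivity for fidelity, so a slow, careful attacker could stay under its thresholds. That trade-off is exactly what makes these rules hard to write.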
Legitimate computer programs do weird things
Before I founded GreyNoise, I worked on the research and development team at Endgame, an endpoint security company later acquired by Elastic. One of the most illuminating realizations I had while working on that product was just how many software applications are programmed to do malware-y looking things. I discovered that tons of popular software applications were shipped with unsigned binaries and kernel drivers, or sketchy-looking software packers and crypters.
These are all examples of a type of supply chain integrity risk, but unlike SolarWinds, which shipped compromised software, these companies are delivering software built using sloppy or negligent software components.
Another discovery I made during my time at Endgame was how common it is for antivirus software to inject code into other processes. In a vacuum, this behavior should (and would) raise all kinds of alerts to a host-based security product. However, upon investigation by an analyst, this was often determined to be expected application behavior: a false positive.
Poor security product UX
For all the talent that security product companies employ in the fields of operating systems, programming, networking, and systems architecture, they often lack skills in user-experience and design. This results in security products often piling on dozens—or even hundreds—of duplicate alert notifications, leaving the user with no choice but to manually click through and dismiss each one. If we think back to the Boeing aviation example at the beginning of this article, security product UIs are often the equivalent of trying to accept 100 alert popup boxes while landing a plane in a strong crosswind at night in a rainstorm. We need to do a better job with human factors and user experience.
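As a small illustration of what less click-through could look like, the sketch below collapses duplicate alerts into one summary row per rule and host. The alert fields ("rule" and "host"), the collapse_alerts function, and the sample data are hypothetical, not taken from any particular product.

```python
from collections import Counter


def collapse_alerts(alerts):
    """Return one summary per (rule, host) pair with a count, largest group first."""
    counts = Counter((a["rule"], a["host"]) for a in alerts)
    return [
        {"rule": rule, "host": host, "count": n}
        for (rule, host), n in counts.most_common()
    ]


# Hypothetical queue: 100 identical notifications plus one distinct alert.
alerts = [{"rule": "Unsigned driver loaded", "host": "WIN-01"}] * 100
alerts.append({"rule": "Ransomware detected", "host": "WIN-FILESERVER-01"})

summary = collapse_alerts(alerts)
for row in summary:
    print(row["count"], row["rule"], row["host"])
```

With this grouping, the analyst sees two rows instead of 101 popups and can dismiss or escalate each group once.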
Expected network behavior is a moving target
Anomaly detection is a strategy commonly used to identify “badness” in a network. The theory is to establish a baseline of expected network and host behavior, then investigate any unplanned deviations from this baseline. While this strategy makes sense conceptually, corporate networks are filled with users who install all kinds of software products and connect all kinds of devices. Even when hosts are completely locked down and the ability to install software packages is strictly controlled, the IP addresses and domain names with which software regularly communicates fluctuate so frequently that it’s nearly impossible to establish any meaningful or consistent baseline.
There are entire families of security products that employ anomaly detection-based alerting with the promise of “unmatched insight” but often deliver mixed or poor results. This toil ultimately rolls downhill to the analysts, who either open an investigation for every noisy alert or numb themselves to the alerts generated by these products and ignore them. As a matter of fact, a recent survey by Critical Start found that 49% of analysts turn off high-volume alerting features when there are too many alerts to process.
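A toy sketch shows why a destination baseline goes stale so quickly. It treats any destination missing from the baseline as an anomaly, which is a deliberately simple stand-in for what real products do. The new_destinations function and all addresses (taken from documentation ranges) are made up for illustration.

```python
def new_destinations(baseline, observed):
    """Return destinations seen today that were absent from the baseline."""
    return set(observed) - set(baseline)


# Hypothetical baseline of destinations a locked-down host normally talks to.
baseline = {"192.0.2.10", "198.51.100.20", "203.0.113.30"}

# Hypothetical later observations: the same software, but its providers
# have rotated some of their IP addresses.
day_1 = {"192.0.2.10", "198.51.100.20", "198.51.100.21"}
day_2 = {"192.0.2.10", "198.51.100.21", "198.51.100.22", "203.0.113.31"}

for label, observed in [("day 1", day_1), ("day 2", day_2)]:
    anomalies = new_destinations(baseline, observed)
    print(label, "anomalies:", len(anomalies), "of", len(observed), "destinations")
```

Nothing about the host's behavior changed in this example, yet most of its traffic on the second day would be flagged.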
Home networks are now corporate networks
The pandemic has resulted in a “new normal” of everyone working from home and accessing the corporate network remotely. Before the pandemic, some organizations were able to protect themselves by aggressively inspecting north-south traffic coming in and out of the network on the assumption that all intra-company traffic was inside the perimeter and “safe.” Today, however, the entire workforce is outside the perimeter, and aggressive inspection tends to generate alert storms and lots of false positives. If this perimeter-only security model wasn’t dead already, the pandemic has certainly killed it.
Cyberattacks are easier to automate
A decade ago, successfully exploiting a computer system involved a lot of work. The attacker had to profile the target computer system, go through a painstaking process to select the appropriate exploit for the system, account for things like software version, operating system, processor architecture and firewall rules, and evade host- and system-based security products.
In 2020, there are countless automated exploitation and phishing frameworks, both open source and commercial. As a result, exploiting vulnerable systems is now cheaper and easier, and requires less operator skill.
Activity formerly considered malicious is being executed at internet-wide scale by security companies
“Attack Surface Management,” a cybersecurity sub-industry, identifies vulnerabilities in their customers’ Internet-facing systems and alerts them of such. This is a good thing, not a bad thing, but the issue is not what these companies do—it’s how they do it.
Most Attack Surface Management companies constantly scan the entire internet to identify systems with known vulnerabilities and organize the returned data by vulnerability and network owner. In previous years, an unknown remote system checking for vulnerabilities on a network perimeter was a powerful indicator of an oncoming attack. Now, alerts raised from this activity provide less actionable value to analysts and happen more frequently as more of these companies enter the market.
The internet is really noisy
Hundreds of thousands of devices, malicious and benign, are constantly scanning, crawling, probing, and attacking every single routable IP address on the entire internet for various reasons. The more benign use cases include indexing web content for search engines, searching for malware command-and-control infrastructure, the above-mentioned Attack Surface Management activity, and other internet-scale research. The malicious use cases are similar: take a reliable, common, easy-to-exploit vulnerability, attempt to exploit every single vulnerable host on the entire internet, then inspect the successfully compromised hosts to find accesses to interesting organizations.
At GreyNoise, we refer to the constant barrage of Internet-wide scan and attack traffic that every routable host on the internet sees as “Internet Noise.” This phenomenon causes a significant amount of pointless alerts on internet-facing systems, forcing security analysts to constantly ask “is everyone on the internet seeing this, or just us?” At the end of the day, there’s a lot of this noise: over the past 90 days, GreyNoise has analyzed almost three million IP addresses opportunistically scanning the internet, with 60% identified as benign or unknown, and only 40% identified as malicious.
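The “everyone or just us?” question can be sketched as a simple lookup. This is a minimal illustration that assumes a locally held table of internet-wide scanner IPs. The INTERNET_NOISE table, its classifications, the triage function, and its return labels are all hypothetical, and the sketch does not represent any real product's API.

```python
# Hypothetical local snapshot of IPs observed scanning the whole internet.
# Addresses come from documentation ranges; classifications are made up.
INTERNET_NOISE = {
    "198.51.100.7": "benign",     # e.g. a search engine or research scanner
    "203.0.113.99": "malicious",  # e.g. an opportunistic exploitation bot
}


def triage(alert_source_ip):
    """Label an alert by whether its source is known background noise."""
    classification = INTERNET_NOISE.get(alert_source_ip)
    if classification is None:
        # Not seen scanning everyone, so this may be aimed at us specifically.
        return "investigate: possibly targeted"
    if classification == "benign":
        return "deprioritize: benign internet-wide scanner"
    return "opportunistic: everyone on the internet is seeing this"


for ip in ["198.51.100.7", "203.0.113.99", "192.0.2.55"]:
    print(ip, "->", triage(ip))
```

An opportunistic label does not mean the traffic is harmless. It only tells the analyst that the activity is not unique to their network, which changes how the alert should be prioritized.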
Non-Technical Causes of Alert Fatigue
Fear sells
An unfortunate reality of human psychology is that we fear things that we do not understand, and there is absolutely no shortage of scary things we do not understand in cybersecurity. It could be a recently discovered zero-day threat, or a state-sponsored hacker group operating from the shadows, or the latest zillion-dollar breach that leaked 100 million customer records. It could even be the news article written about the security operations center that protects municipal government computers from millions of cyberattacks each day. Sales and marketing teams working at emerging cybersecurity product companies know that fear is a strong motivator, and they exploit it to sell products that constantly remind users how good a job they’re doing.
And nothing justifies a million-dollar product renewal quite like security “eye candy,” whether it’s a slick web interface containing a red circle with an ever-incrementing number showing the amount of detected and blocked threats, or a 3D rotating globe showing “suspicious” traffic flying in to attack targets from many different geographies. The more red that appears in the UI, the scarier the environment, and the more you need their solution. Despite the fact that these numbers often serve as “vanity metrics” to justify product purchases and renewals, many of these alerts also require further review and investigation by the already overworked and exhausted security operations team.
The stakes are high
Analysts are under enormous pressure to identify cyberattacks targeting their organization, and stop them before they turn into breaches. They know they are the last line of defense against cyber threats, and there are numerous stories about SOC analysts being fired for missing alerts that turn into data breaches.
In this environment, analysts are always worried about what they missed or failed to notice in the logs, or whether they’ve tuned their environment to the point where they can no longer see all of the alerts (yikes!). It’s not surprising that analysts’ worry about missing an incident has increased. A recent survey by FireEye called this “Fear of Missing Incidents” (FOMI). It found that three in four analysts are worried about missing incidents, and one in four worry “a lot” about missing incidents. The same goes for their supervisors – more than six percent of security managers reported losing sleep due to fear of missing incidents.
Is it any wonder that security analysts exhibit serious alert fatigue and burnout, and that SOCs have extremely high turnover rates?
Everything is a single pane of glass
Security product companies love touting a “single pane of glass” for complete situational awareness. This is a noble undertaking, but the problem is that most security products are really only good at a few core use cases and then trend towards mediocrity as they bolt on more features. At some point, when an organization has surpassed twenty “single panes of glass,” the problem has become worse.
More security products are devoted to “preventing the bad thing” than “making the day to day more efficient”
There are countless security products that generate new alerts and few security products that curate, deconflict or reduce existing alerts. There are almost no companies devoted to reducing drag for Security Operations teams. Too many products measure their value by their customers’ ability to alert on or prevent something bad, and not by making existing, day-to-day security operations faster and more efficient.
Product pricing models are attached to alert/event volume
Like any company, security product vendors are profit-driven. Many product companies are heavily investor-backed and have large revenue expectations. As such, Business Development and Sales teams often price products with scaling or tiered pricing models based on usage-oriented metrics like gigabytes of data ingested or number of alerts raised. The idea is that, as customers adopt and find success with these products, they will naturally increase usage, and the vendor will see organic revenue growth as a result.
This pricing strategy is often necessary when the cost of goods sold increases with heavier usage, like when a server needs additional disk storage or processing power to continue providing service to the customer.
But an unfortunate side effect of this pricing approach is that it creates an artificial vested interest in raising as many alerts or storing as much data as possible. And it reduces the incentive to build the capabilities for the customer to filter and reduce this “noisy” data or these tactically useless alerts.
If the vendor’s bottom line depends on as much data being presented to the user as possible, then they have little incentive to create intelligent filtering options. As a result, these products will continue to firehose analysts, further perpetuating alert fatigue.
False positives drive tremendous duplication of effort
Every day, something weird happens on a corporate network and some security product raises an alert to a security analyst. The alert is investigated for some non-zero amount of time, is determined to be a false positive caused by some legitimate application functionality, and is dismissed. The information on the incident is logged somewhere deep within a ticketing system and the analyst moves on.
The implications of this are significant. This single security product (or threat intelligence feed) raises the same time-consuming false-positive alert on every corporate network where it is deployed around the world when it sees this legitimate application functionality. Depending on the application, the duplication of effort could be quite staggering. For example, for a security solution deployed across 1,000 organizations, an event generated from unknown network communications that turns out to be a new Office 365 IP address could generate 500 or more false positives. If each takes 5 minutes to resolve, that adds up to roughly 2,500 minutes, or about a full 40-hour work week of analyst effort.
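The arithmetic behind that estimate is easy to check. The two input figures come from the example above. The 40-hour work week is my own assumption for converting hours into weeks.

```python
false_positives = 500      # from the example above
minutes_each = 5           # from the example above
hours_per_work_week = 40   # assumption: one analyst, standard work week

total_minutes = false_positives * minutes_each
total_hours = total_minutes / 60
work_weeks = total_hours / hours_per_work_week

print(f"{total_minutes} minutes = {total_hours:.1f} hours = {work_weeks:.2f} work weeks")
```

This covers a single benign event. Every new cloud IP range, software update, or vendor change that trips the same product repeats the cost.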
Nobody collaborates on false positives
Traditional threat intelligence vendors only share information about known malicious software. Intelligence sharing organizations like Information Sharing and Analysis Centers (ISACs), mailing lists, and trust groups have a similar focus. None of these sources of threat intelligence focus on sharing information related to confirmed false-positive results, which would aid others in quickly resolving unnecessary alerts. Put another way: there are entire groups devoted to reducing the effectiveness of a specific piece of malware or threat actor between disparate organizations. However, no group supports identifying cases when a benign piece of software raises a false positive in a security product.
Security products are still chosen by the executive, not the user
This isn’t unusual. It is a vestige of the old days. Technology executives maintain relationships with vendors, resellers and distributors. They go to a new company and buy the products they are used to and with which they’ve had positive experiences.
Technologies like Slack, Dropbox, Datadog, and other user-first technology product companies disrupted and dominated their markets quickly because they allowed enterprise prospects to use their products for free. They won over these prospects with superior usability and functionality, allowing users to be more efficient. While many technology segments have adopted this “product-led” revolution, it hasn’t happened in security yet, so many practitioners are stuck using products they find inefficient and clunky.
Why You Should Care
The pain of alert fatigue can manifest in several ways:
- Death (or burnout) by a thousand cuts, leading to stress and high turnover
- Lack of financial return to the organization
- Compromises or breaches missed by the security team
There is a “death spiral” pattern to the problem of alert fatigue: at its first level, analysts spend more and more time reviewing and investigating alerts that provide diminishing value to the organization. Additional security products or feeds are purchased that generate more “noise” and false positives, increasing the pressure on analysts. The increased volume of alerts from noisy security products causes the SOC to need a larger team, with the SOC manager trying to grow a highly skilled team of experts while many of them are overwhelmed, burned out, and at risk of leaving.
From the financial side of things, analyst hours spent investigating pointless alerts are a complete waste of security budget. The time and money spent on noisy alerts and false positives is often badly needed in other areas of the security organization to support new tools and resources. Security executives face a difficult challenge in cost-justifying the investment of good analysts being fed bad data.
And worst of all, alert fatigue contributes to missed threats and data breaches. In terms of human factors, alert fatigue can create a negative mindset that leads to rushing, frustration, inattention, or complacency. As I noted earlier, almost 50% of analysts who are overwhelmed will simply turn off the noisy alert sources. All of this contributes to an environment where threats are more easily able to sneak through an organization’s defenses.
What can we do about it?
The analyst
Get to “No” faster. To some extent, analysts are the victim of the security infrastructure in their SOC. The part of the equation they control is their ability to triage alerts quickly and effectively. So from a pragmatic viewpoint, find ways to use analyst expertise and time as effectively as possible. In particular, find tools and resources that help you to rule out alerts as fast as possible.
The SOC manager
Tune your alerts. There is significant positive ROI value to investing in tuning, diverting, and reducing your alerts. Tune your alerts to reduce over-alerting. Leverage your Purple Team to assist and validate your alert “sensitivity.” Focus on the critical TTPs of threat actors your organization faces, and audit your attack surface and automatically filter out what doesn’t matter. These kinds of actions can take a tremendous load off your analyst teams and help them focus on the things that do matter.
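One practical starting point for tuning is to measure which rules produce the most false positives. The sketch below assumes closed tickets can be exported as (rule, verdict) pairs. The false_positive_rates function, the rule names, and the ticket data are hypothetical, and real ticketing systems will need their own mapping.

```python
from collections import defaultdict


def false_positive_rates(closed_tickets):
    """Return (rule, false-positive rate) pairs, noisiest rule first."""
    totals = defaultdict(int)
    false_pos = defaultdict(int)
    for rule, verdict in closed_tickets:
        totals[rule] += 1
        if verdict == "false_positive":
            false_pos[rule] += 1
    rates = {rule: false_pos[rule] / totals[rule] for rule in totals}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ticket history exported from a ticketing system.
closed_tickets = (
    [("rdp_bruteforce", "false_positive")] * 9
    + [("rdp_bruteforce", "true_positive")] * 1
    + [("ransomware_behavior", "false_positive")] * 1
    + [("ransomware_behavior", "true_positive")] * 3
)

ranking = false_positive_rates(closed_tickets)
for rule, rate in ranking:
    print(f"{rule}: {rate:.0%} false positives")
```

Rules at the top of this ranking are the first candidates for tuning, and raw alert volume is worth weighing alongside the rate.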
The CISO
More is not always better. Analysts are scarce, valuable resources. They should be used to investigate the toughest, most sophisticated threats, so use the proper criteria for evaluating potential products and intelligence feeds, and make sure you understand the potential negatives (false positives, over-alerting) as well as the positives. Be skeptical when you hear about a single pane of glass. And focus on automation to resolve as many of the “noise” alerts as possible.
Security vendors
Focus on the user experience. Security product companies need to accept the reality that they cannot solve all of their users’ security problems unilaterally, and think about the overall analyst experience. Part of this includes treating integrations as first-class citizens, and deprioritizing dashboards. If everything is a single pane of glass, nothing is a single pane of glass—this is no different than the adage that “if everyone is in charge, then no one is in charge.” Many important lessons can be learned from others who have addressed UI/UX issues associated with alert fatigue, such as healthcare and aviation.
The industry
More innovation is needed. The cybersecurity industry is filled with some of the smartest people in the world, but lately we’ve been bringing a knife to a gunfight. The bad guys are scaling their attacks tremendously via automation, dark marketplaces, and advanced technologies like artificial intelligence and machine learning. The good guys have been spending all their time in a painfully fragmented and broken security environment, with all of it focused on identifying the signal, and none on reducing the noise. This has left analysts struggling to manually muscle through overwhelming volumes of alerts. We need some of security’s best and brightest to turn their amazing brains to the problem of reducing the noise in the system, and drive innovation that helps analysts focus on what matters the most.
Conclusion
Primary care clinicians became less likely to accept alerts as they received more of them, particularly as they received more repeated (and therefore probably uninformative) alerts.
– Ancker, et al.
Our current approach to security alerts, requiring analysts to process ever-growing volumes, just doesn’t scale, and security analysts are paying the price with alert fatigue, burnout, and high turnover. I’ve identified a number of the drivers of this problem, and our next job is to figure out how to solve it. One great area to start is to figure out how other industries have improved their approach, with aviation being a good potential model. With some of these insights in mind, we can figure out how to do better in our security efforts by doing less.
Andrew Morris
Founder of GreyNoise