EDITORIAL | May 6, 2020

A Reverse Engineer’s Perspective on the Boeing 787 ‘51 days’ Airworthiness Directive

Several weeks ago, international regulators announced that they were ordering Boeing 787 operators to completely shut down the plane’s electrical power whenever it had been running for 51 days without interruption.1 The FAA published an airworthiness directive elaborating on the issue, and I was curious to see what kind of details were in this document.

While I eventually discovered that there wasn’t much information in the FAA directive, there was just enough to put me on track to search for the root cause of the issue. This blog post will leverage the interesting bits of information in the FAA directive to gain knowledge about some avionics concepts.

First, we need to introduce the parts of the 787’s architecture and systems that are relevant to this issue. The FAA directive explicitly uses acronyms, such as CDN or CCS, that need to be defined before moving to root cause analysis.

What is the Common Core System (CCS)?

As opposed to the federated avionics architectures, which make use of distributed avionics functions that are packaged as self-contained units, Integrated Modular Avionics (IMA)2 architectures employ a high-integrity, partitioned environment that hosts multiple avionics functions of different criticalities on a shared computing platform. Boeing engineers went one step further and developed the Common Core System (CCS) for the 787, a further enhancement based on an open IMA avionics technology.

Essentially the CCS is a hardware/software platform that provides computing, communication, and input-output (I/O) services for implementing real-time embedded systems, known as hosted functions.

Multiple hosted functions share the platform resources within a virtual system environment enforced by partitioning mechanisms that are implemented as part of the platform design, relying on a VxWorks 6533,4 OS.5

This virtual system partitioning environment guarantees that hosted functions are isolated from each other, allowing highly critical applications to coexist with applications of lower integrity levels. Remember that international regulations define five levels of failure conditions, categorized by their effects on the aircraft, crew, and passengers:

Level A–Catastrophic

Failure may cause multiple fatalities, usually with loss of the airplane.

Level B–Hazardous

Failure has a large negative impact on safety or performance, reduces the ability of the crew to operate the aircraft due to physical distress or a higher workload, or causes serious or fatal injuries among the passengers.

Level C–Major

Failure significantly reduces the safety margin or significantly increases crew workload. May result in passenger discomfort (or even minor injuries).

Level D–Minor

Failure slightly reduces the safety margin or slightly increases crew workload. Examples might include causing passenger inconvenience or a routine flight plan change.

Level E–No Effect

Failure has no impact on safety, aircraft operation, or crew workload.

Software approved to levels A, B, or C requires strong certification involving formal processes for verification and traceability.

As a result, DO-178B6 Level-A software may coexist on the same physical shared resource with a Level-E application.

Figure 1. VxWorks 653 Architecture7,A

Ideally, the applications cannot interfere with each other, regardless of faults that may occur within the hosted functions or the platform resources. Resource allocations are predetermined and communicated to the platform components via loadable configuration files, usually in either XML or proprietary binary formats.

Within the CCS we can find the following major components:

  • General Processing Modules (GPMs) to support functional processing needs
  • Remote Data Concentrators (RDCs) to support system analog signals, analog discrete signals, and serial digital interfaces (CAN bus8, A4299, etc.)
  • Avionics Full Duplex (A664-P7) Switched Ethernet10 network for communication between platform elements

These elements can be packaged as Line Replaceable Units (LRUs)11 or in module or card form, which can then be grouped within cabinets or integrated LRUs. As a result, the CCS is made up of:

  • Two (2) Common Computing Resource (CCR) cabinets
  • The Common Data Network (CDN)
  • 21 RDCs
Figure 2. CCS Architecture A

Common Computing Resource Cabinets

Each CCR cabinet has:

  • Two (2) Power Conditioning Modules (PCMs)
  • Eight (8) General Processing Modules (GPMs)
  • Two (2) ARINC 664-P7 network Cabinet Switches (ACSs)
  • Two (2) Fiber Optic Translator Modules (FOXs)
  • Two (2) Graphic Generators (part of the Display and Alert Crew System)
Figure 3. Boeing 787 CCR Cabinet12, A

Each GPM is an independent computing platform that hosts airplane systems’ operational software and provides the hosted applications a partitioned environment based on the ARINC 653 standard. Each GPM has the same hardware and core operating system.

The GPMs in these CCR cabinets run hosted functions such as Remote Power Distribution System (RPDS), Generator/Bus Power Control Unit (GCU/BPCU),13 Circuit Breaker Indication and Control, Landing Gear Indication and Control, Thrust Management Function, and Flight Management Function.

Common Data Network

The CDN is a high-integrity IEEE 802.3 Ethernet network utilizing IP addressing and related transport protocols (UDP). As an A664-P7 compliant network, it also implements deterministic timing and redundancy management protocols. The CDN uses both fiber optic cable and copper wire and moves system information between the various airplane systems connected to it, either directly or through ACSs, FOXs, or RDCs.

Figure 4. 787 Network Architecture A

The CDN is comprised of the network End System (ES) hosted in each connecting end node and multiple network switches.

End System

Within the context of an avionics network, as defined in the A664-P7 specification, we find that:

The main function of the End System (ES) is to provide services, which guarantee a secure and reliable data exchange to the partition software.

Essentially, the ES is assuming the role of a Network Interface Controller (NIC), capable of maintaining communication ports (queuing, sampling, or SAP) for messages written and read by multiple hosted applications. This is performed by exchanging Ethernet frames through a Virtual Link (VL), which is a conceptual communication object that defines a logical unidirectional connection from one source to one or more destination ES. The traffic flow in the VL is shaped not to exceed a configured Bandwidth Allocation Gap (BAG), which represents the minimum time interval between the first bits of two consecutive frames.
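To make the BAG mechanism concrete, here is a minimal sketch of per-VL traffic shaping in Python. The class name, the 8 ms BAG, and the millisecond timeline are illustrative assumptions, not values from a real 787 configuration:

```python
# Minimal sketch of Virtual Link traffic shaping, assuming a simplified
# model of the A664-P7 Bandwidth Allocation Gap (BAG) check. Names and
# timing values are illustrative, not from a real CDN configuration.

class VirtualLinkShaper:
    def __init__(self, bag_ms):
        self.bag_ms = bag_ms      # minimum gap between first bits of two frames
        self.last_tx_ms = None    # time of the last transmitted frame

    def try_transmit(self, now_ms):
        """Return True if a frame may be sent on this VL at time now_ms."""
        if self.last_tx_ms is not None and (now_ms - self.last_tx_ms) < self.bag_ms:
            return False          # sending now would violate the configured BAG
        self.last_tx_ms = now_ms
        return True

vl = VirtualLinkShaper(bag_ms=8)  # hypothetical 8 ms BAG
sent = [t for t in (0, 2, 8, 12, 16) if vl.try_transmit(t)]  # frames at 2 and 12 are held back
```

In a real ES the shaping is enforced in hardware per VL, but the invariant is the same: no two consecutive frames on a VL may start closer together than the configured BAG.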

Figure 5. ES Communications in the CDN

The ES operating in the CDN (also in the switches) is physically separated from the host processor, interfacing through a PCI Bus. From a high-level perspective, it is comprised of:

  • One (1) custom ASIC
  • Two (2) COTS Ethernet PHY transceivers
  • Two (2) serial configuration memories
  • RAM
Figure 6. High-level Overview of End System Board

The ES can be configured from the host through a proprietary API. This configuration data has been previously generated using a DO-178B Level-A tool (ESBIN) and then stored in a custom file (es_config.bin).

The ES in a CDN switch implements much the same functionality except for some addressing and redundancy operations.

Remote Data Concentrators

There are 21 RDCs in the CCS.

Figure 7. Remote Data ConcentratorsA

These RDCs provide the interface between airplane systems that do not have the ability to support A664-P7 in the CDN.

The RDCs convert these signals to ARINC 664 data and vice versa, thus effectively acting as a gateway for a variety of analog devices, such as sensors or valves, ARINC 429 buses, and CAN subnets.

From an A664-P7 perspective, these RDCs map:

  • Analog signals to parameters
  • A429 to communication ports
  • CAN bus to both parameters and communication ports

As a result, the high-level architecture would be as follows.

Figure 8. High-level Overview of CCS

To better illustrate this architecture as a whole, we can oversimplify one of the hosted functions to see how all the pieces work together.

The landing gear control software runs in one of the CCRs, hosted in a GPM’s partition. This hosted function partition receives gear lever up/down data, as well as gear and gear door position data, from one of the 21 RDCs via the CDN. Then, depending on the signals received, the landing gear control software may issue gear-sequencing commands to the proper RDC via the CDN. The RDC can then transfer the specific signal to the actuators that, for example, energize the control valves to retract and extend the landing gear or open and close the gear doors.

Root Cause Analysis

The FAA’s directive is scarce in technical details. It only contains a high-level description of the issue; however, it provides the reader with some key facts that can help with root cause analysis:

The FAA has received a report indicating that the stale-data monitoring function of CCS may be lost when continuously powered on for 51 days. This could lead to undetected or unannunciated loss of CDN message age validation, combined with a CDN switch failure. The CDN handles all the flight-critical data (including airspeed, altitude, attitude, and engine operation), and several potentially catastrophic failure scenarios can result from this situation. Potential consequences include:

  • Display of misleading primary attitude data for both pilots.
  • Display of misleading altitude on both pilots’ primary flight displays (PFDs).
  • Display of misleading airspeed data on both pilots’ PFDs, without annunciation of failure, coupled with the loss of stall warning, or over-speed warning.
  • Display of misleading engine operating indications on both engines.

The potential loss of the stale-data monitoring function of the CCS when continuously powered on for 51 days, if not addressed, could result in erroneous flight-critical data being routed and displayed as valid data, which could reduce the ability of the flight crew to maintain the safe flight and landing of the airplane.

I will be carefully analyzing every single sentence.

The FAA has received a report indicating that the stale-data monitoring function of CCS may be lost when continuously powered on for 51 days.

Back in 2015, the FAA issued a similar directive,14 although in that case the underlying problem was described a bit more explicitly.

We have been advised by Boeing of an issue identified during laboratory testing.

The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power.

So, basically, we can probably assume the situation is much the same: Boeing identified this current issue during laboratory testing.

The 2015 FAA directive also explicitly mentioned that Boeing was working on a software patch to fix the issue; however, there is no mention of any upcoming patch in this current directive. As we will see later on, this makes sense if the vulnerability is hardware-related.

Once again, the mention of “51 days” initially points towards some kind of overflow in a counter.
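As a quick sanity check on the overflow hypothesis, the 2015 figure fits a plain counter overflow neatly: a signed 32-bit counter ticking in hundredths of a second wraps after roughly 248.5 days. A sketch of that arithmetic (the 100 Hz tick rate is my assumption, chosen because it matches the 248-day figure; it is not a documented GCU parameter):

```python
# Back-of-the-envelope check of the "counter overflow" hypothesis.
# The 100 Hz tick rate is an assumption that happens to fit the
# 248-day figure from the 2015 GCU directive.

SECONDS_PER_DAY = 86_400

def days_to_overflow(max_count, tick_hz):
    """Days of continuous operation before a counter reaches max_count."""
    return max_count / tick_hz / SECONDS_PER_DAY

# Signed 32-bit counter of centiseconds -> matches the 2015 GCU issue
gcu_days = days_to_overflow(2**31, 100)   # ~248.55 days
```

The same reasoning, applied in reverse to the 51-day figure, is what drives the analysis that follows.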

This could lead to undetected or unannunciated loss of CDN message age validation, combined with a CDN switch failure.

This sentence tells us a lot about the nature of the issue. First, any catastrophic error in the CDN that goes undetected or ‘unannunciated’ in the 787 is highly unexpected, although it’s not entirely clear to me whether both the loss of CDN message age validation and the CDN switch failure go undetected, or just the first issue. Both maintenance engineers and pilots have the ability to check the status of the CCS switches and CDN LRUs through the maintenance pages in the Flight Deck. Also, any significant fault will be centralized, logged, and processed via the Central Maintenance Computing Function (CMCF).

Figure 9. CCS Status in the Flight Deck15

Also, pilots can reset both left and right CCR on the overhead panel; however, as the FAA directive states, a complete power shutdown is required, so we can assume a CCR reset doesn’t solve the problem. This means the issue is located deep in the hardware of a component that is present not only in the CCR, but also in other parts of the CDN.

Figure 10. CCR Reset Buttons16

So we have that:

  • The CDN loses the ability to perform age validation.
  • The CDN switches fail.

Let’s narrow down the list of potential suspects by analyzing how data communication integrity is enforced in the CDN.

Integrity Checking in the CDN

Bear in mind that the CCS is an asynchronous system where each partition not only controls when its data is produced but also decouples this operation from the network interface. At the MAC level, the A664-P7 spec mandates that the output interfaces need to keep transmitting, regardless of the PHY status, in order to prevent error propagation or re-transmission of old frames. Still, in an AFDX avionics network the order matters, so when the transmitting partition produces certain data, the receiving partition expects to collect that data in the same order.

In addition, the CCS operates following a redundancy policy having two different channels (‘A’ and ‘B’), although it is theoretically possible to configure them to operate independently.

Figure 11. Frame Processing Logic

In order to fulfill these requirements, the ES adds a Sequence Number (SN) after the AFDX payload when transmitting frames. The SN is 0 immediately after an ES reset and then cycles through 1–255. Redundant frames received at the ES follow a ‘first valid wins’ policy. Please note that in addition to ordinal integrity, there is a procedure to detect truly redundant frames, where a configured constant (Skew Max) limits the valid time window for two potentially redundant frames.
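The ‘first valid wins’ policy and the Skew Max window can be sketched as follows. The frame tuples, the 5 ms Skew Max, and the function name are illustrative assumptions; the real ES logic lives in hardware:

```python
# Sketch of 'first valid wins' redundancy management with a Skew Max
# window, assuming simplified frames of (sn, channel, rx_time_ms).
# The 5 ms window is illustrative, not a real CDN configuration value.

SKEW_MAX_MS = 5   # hypothetical maximum skew between channels A and B

def accept_frames(frames):
    """Deliver the first valid copy of each SN; drop redundant copies
    arriving within SKEW_MAX_MS of the accepted one."""
    delivered = []
    first_rx = {}                       # sn -> rx time of first accepted copy
    for sn, channel, rx_ms in frames:
        prev = first_rx.get(sn)
        if prev is not None and rx_ms - prev <= SKEW_MAX_MS:
            continue                    # redundant copy inside the skew window
        first_rx[sn] = rx_ms
        delivered.append((sn, channel))
    return delivered

# SN 7 arrives first on channel A; its copy on B 2 ms later is dropped.
out = accept_frames([(7, 'A', 0), (7, 'B', 2), (8, 'B', 10), (8, 'A', 12)])
```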

Figure 12. Regular AFDX Frame

This logic is common to all AFDX ES and I don’t think this functionality is where the actual flaw lies, as any problem would be more dependent on the amount of traffic flowing through the CDN rather than a specific time period. However, interestingly, there is something in the ES’ integrity checking and redundancy management that makes the 787 a little bit special: Boeing’s proprietary Error Detection Encoding (EDE) protocol.

EDE Protocol: A Promising Suspect

The EDE protocol works at the VL level to add an additional layer of end-to-end integrity in the CDN.

When a VL is enabled with EDE, which is mandated by Boeing for critical and essential data, the transmitting ES encapsulates the payload with an EDE header and footer.

Figure 13. EDE Wrapped Frame

The EDE header and footer include the following fields:

  • SN: A 2-byte sequence number bound to a specific COM port. This value is incremented for each frame that is transmitted.
  • Timestamp: A 6-byte value that holds the time when the message was transmitted, using the local clock domain of the transmitting ES.
  • CRC X and CRC Y: These CRCs are calculated over the EDE Source ID (a 32-bit value known only to the transmitting and receiving ES in a VL), the EDE timestamp, and the payload.

The EDE timestamp is relative to the transmitting ES’ clock domain, so the CCS needs a way to centralize and keep track of all the local time references so that any age validation can be performed accurately. This task is cyclically performed by the Time Management function, which maintains a table of relative offsets capturing the relationships between the time references of each ES present in the CDN. This is possible thanks to a request/response protocol in which the Time Agent in each ES is periodically queried by the Time Managers.

The resulting table of offsets is then broadcast to each ES through the Time Offset message so an ES can perform EDE age verification when receiving data from another ES. Obviously, the EDE Time Management packets required to calculate and propagate these offset tables are not subject to EDE age verification. 
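Here is a sketch of how a receiving ES might translate a remote EDE timestamp using an entry from the offset table; the tick units, offset values, and function name are assumptions for illustration only. Note how a transmit timestamp near the wrap-around value can produce a negative, nonsensical age:

```python
# Sketch of age computation from the broadcast offset table, assuming a
# 6-byte timestamp counted in local clock ticks and made-up offsets.
# A transmit timestamp near wrap-around can yield a negative age.

TS_MODULUS = 0x8000_0000_0000   # the 6-byte EDE timestamp wraps here

def message_age(rx_local_ts, tx_remote_ts, offset):
    """Age in local ticks: receive time minus the transmit time
    translated into the receiver's clock domain via the offset table."""
    tx_in_local = (tx_remote_ts + offset) % TS_MODULUS
    return rx_local_ts - tx_in_local

# Normal case: small positive age
age_ok = message_age(rx_local_ts=1_000_500, tx_remote_ts=1_000_000, offset=0)

# Transmitter close to wrap-around: the translated timestamp wraps past
# zero, so the message appears to arrive "before" it was sent
age_wrapped = message_age(rx_local_ts=100, tx_remote_ts=TS_MODULUS - 50, offset=200)
```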

Age verification in the CDN, in the context of the EDE protocol, relies on the consistency of these offset tables. So, what would happen if, for any reason, this fails? It is difficult to say without having access to a 787 (currently accepting donations) 😉 but I will try my best.

There are several possible scenarios:

  • The ES did not receive the offsets table.
    The message is forwarded to the EDE Redundancy Manager but a flag is set to indicate its age cannot be validated.
  • The age is greater than the maximum configured age.
    The message is directly discarded.
  • The age is less than the maximum configured age.
    This is the expected case. The message is forwarded to the EDE Redundancy Manager, eventually reaching the COM port.
  • The age is inconsistent.
    For some reason, the message seems to have an age that makes no sense. For example, let’s imagine that the timestamp set by the transmitting ES is close to its wrap-around value. After performing the required calculation, the receiving ES obtains a timestamp that has already wrapped around, so it would look like the message had been received before it was actually sent. The message is accepted but handled as if its age were unknown.
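The four outcomes above can be condensed into a small decision function. The function name, millisecond units, and return convention are my assumptions; the real logic is implemented in the ES ASIC and is not public:

```python
# Sketch of the four EDE age-verification outcomes described above.
# Names, units, and the (accept, age_known) convention are assumptions.

def ede_age_check(age_ms, max_age_ms, have_offsets):
    """Return (accept, age_known) for a received EDE-wrapped message."""
    if not have_offsets:
        # No offsets table received: forward, flagging age as unvalidated
        return True, False
    if age_ms < 0:
        # Inconsistent age (e.g. wrapped timestamp): accept, age unknown
        return True, False
    if age_ms > max_age_ms:
        # Stale message: discard
        return False, False
    # Expected case: forward to the EDE Redundancy Manager
    return True, True

fresh = ede_age_check(age_ms=10, max_age_ms=100, have_offsets=True)
stale = ede_age_check(age_ms=500, max_age_ms=100, have_offsets=True)
```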

Bearing in mind that this functionality is implemented in the ASIC and the timestamp should be derived from a counter, I think the whole issue may be around this logic. 

The key question is: How does the 51-day period fit in this scenario? Ok, let me present my theory.

A Potential Explanation

The 6-byte EDE timestamp is the key to making sure everything goes smoothly in the CDN. The most significant bit in this timestamp is set to 0 by definition, so ideally we have 0x7FFFFFFFFFFF as the maximum coherent value for the EDE timestamp.

The ES receives the data from the hosted application through PCI, running at 33 MHz, so it would be reasonable to implement a counter at a similar clock frequency, letting the ASIC use that clock reference to timestamp ready-to-go messages. So let’s assume the counter is ideally operating at 33 MHz and the timestamp is somehow derived from that counter, also taking into account different parameters, such as delays and latencies due to moving data across the different interfaces (RMII, PCI, etc.).

By calculating the frequency at which an ideal counter (starting at 0) should be operating in order to wrap around the EDE timestamp (0x800000000000) after 51 days, we obtain ~32 MHz. That’s pretty close to our assumption.
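The arithmetic behind that estimate, for anyone who wants to check it:

```python
# Clock rate at which a counter starting at 0 would reach the EDE
# timestamp wrap-around value (0x800000000000 = 2**47) after 51 days.

WRAP = 0x8000_0000_0000          # one past the maximum coherent timestamp
SECONDS_51_DAYS = 51 * 86_400    # 4,406,400 seconds

freq_hz = WRAP / SECONDS_51_DAYS  # ~31.94 MHz, close to the 33 MHz PCI clock
```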

The CDN handles all the flight-critical data (including airspeed, altitude, attitude, and engine operation), and several potentially catastrophic failure scenarios can result from this situation.

We previously introduced the DO-178B certification levels where level A corresponds to a catastrophic failure, which prevents continued safe flight or landing.

Potential consequences include:

  • Display of misleading primary attitude data for both pilots.
  • Display of misleading altitude on both pilots’ primary flight displays (PFDs).
  • Display of misleading airspeed data on both pilots’ PFDs, without annunciation of failure, coupled with the loss of stall warning, or over-speed warning.
  • Display of misleading engine operating indications on both engines.

The consequences covered in the FAA document seem to be strictly related to the scenario where pilots can no longer trust their instruments, a problem that in past incidents has led to tragic consequences.

In a Boeing 787, all this data is handled by the Display Crew Alert System (DCAS). This system provides the pilots with all the audio, tactile, or visual indications that are necessary for the safe operation of the airplane, as you can see in the following image. 

Figure 14. DCAS includes Multiple Displays

The potential loss of the stale-data monitoring function of the CCS when continuously powered on for 51 days, if not addressed, could result in erroneous flight-critical data being routed and displayed as valid data, which could reduce the ability of the flight crew to maintain the safe flight and landing of the airplane.

We can read this last paragraph as a summary of what has been elaborated in this blog post.

Conclusion

Aviation security research is a complicated field, not only because of the secrecy that surrounds these technologies but also the commercial and corporate barriers that prevent access to the required equipment. Despite all these challenges, I think that any effort to promote this kind of research always pays off.

The timing is also interesting, as this flaw is coming to light almost a year after we reported our research to Boeing. Boeing acknowledged that they set up a fully functional aircraft and a laboratory to assess our claims (which involved the CDN), so I guess there is a chance that, maybe, follow-up research on their part identified this issue. In general terms, this would be a good side-effect of any security research, which is all about fostering the appropriate level of trust in the devices and organizations people depend upon.

Do not take what I have presented here as the real root cause of the problem that Boeing detected. I may be right, but it’s just as likely that I’m wrong, and this was an exercise intended to satisfy my curiosity. Hopefully, you have learned something new and enjoyed reading about the topic. The more thoughtful people there are carefully scrutinizing critical systems, the better those systems will be in the long-term. That’s what this is all about.

For additional reading, please refer to the white paper of my original research, which was released during Black Hat 2019.


[A] IOActive White Paper: Arm IDA and Cross Check: Reversing the 787’s Core Network
[1] https://www.federalregister.gov/documents/2020/03/23/2020-06092/airworthiness-directives-the-boeing-company-airplanes
[2] https://www.aviationtoday.com/2007/02/01/integrated-modular-avionics-less-is-more/
[3] https://en.wikipedia.org/wiki/ARINC_653
[4] https://www.windriver.com/products/product-overviews/vxworks-653-product-overview-multi-core/vxworks-653-product-overview-multi-core.pdf
[5] https://www.windriver.com/customers/customer-success/aerospace-defense/boeing/ (404 link broken)
[6] https://en.wikipedia.org/wiki/DO-178B
[7] http://www.artist-embedded.org/docs/Events/2007/IMA/Slides/ARTIST2_IMA_WindRiver_Wilson.pdf
[8] https://en.wikipedia.org/wiki/CAN_bus
[9] https://en.wikipedia.org/wiki/ARINC_429
[10] https://pdfs.semanticscholar.org/5db4/b539ed7bdec182448ac8d7219db12a8bbc12.pdf
[11] https://en.wikipedia.org/wiki/Line-replaceable_unit
[12] https://bioage.typepad.com/.a/6a00d8341c4fbe53ef0162fbf813b6970d
[13], [14] https://s3.amazonaws.com/public-inspection.federalregister.gov/2015-10066.pdf
[15], [16] https://www.instagram.com/787guide/


EDITORIAL | April 13, 2020

Mismatch? CVSS, Vulnerability Management, and Organizational Risk

I’ll never forget a meeting I attended where a security engineer demanded IT remediate each of the 30,000 vulnerabilities he had discovered. I know that he wasn’t just dumping an unvetted pile of vulnerabilities on IT; he’d done his best to weed out false-positive results, other errors, and misses before presenting the findings. These were real issues, ranked using the Common Vulnerability Scoring System (CVSS). There can be no doubt that in that huge (and overwhelming) pile were some serious threats to the organization and its digital assets.

The reaction of the IT attendees did not surprise me, nor did the security engineer’s subsequent reaction. It didn’t go well. Presented with that much work, IT refused, describing their already fully loaded plans and referring the security engineer to the CIO. In other words, “Security, take your vulnerabilities and be gone. We have other work to do.”

I’ve seen this same dynamic play out over and over again. Faced with 72,000 unqualified static analysis findings, the application team addressed none of them. Given 130,000 issues across an organization’s entire (scannable) infrastructure, the operations team’s first reaction is to do nothing.

The foregoing real-world numbers are overwhelming. As one senior architect told me, “It took me an average of 15 minutes to figure out whether each of the five findings was a real issue. None of them was. In order to work through this set of findings, the effort will take about six person-weeks. We don’t have a resource to dedicate to this for six weeks, especially at a high false-positive rate. Much of it will be wasted effort.”

At the same time, we intuitively know that somewhere in those piles of vulnerabilities are issues that will be exploited and whose exploitation will cause serious harm: we know there’s organizational risk in the pile. But how do we find the dangerous needle in the haystack of vulnerability findings?

Knee-jerk security responses don’t help. I cannot count the number of times a security person has flatly stated, “Just patch it.” As though patching is the simplest thing in the world.

Applying a security patch may be simple for a single application; it’s not quite so straightforward when faced with thousands of potential issues across thousands of software components. This is especially true as the potential disruption from unexpected side-effects must be considered when introducing new software (patches) into complex systems whose failure might have disastrous consequences.

As far as I’ve been able to see, few organizations cope well with tens of thousands of issues, each demanding a fix. Plus, a continuing flow of new issues discovered each day adds to the work queue.

Ultimately, managing cybersecurity risk is a business decision just as managing any other organizational or operational risk is for an organization. When these issues are viewed from different, sometimes adversarial, technical silos, it is not surprising that a consensus understanding of organizational risk management priorities does not coalesce.

Up until recently, industry practice has been to prioritize issues based upon CVSS base score. However, research from 2014 indicates that using the CVSS base score may be no better than “choosing at random.”1 Maybe that’s why even organizations with fairly mature security and operations functions continue to be compromised through unpatched vulnerabilities.

Perhaps we’re fixing the wrong issues. Are there attributes that will help find the most likely exploits?

If we rely solely on CVSS, especially published CVSS base scores, then yes, we will prioritize many issues that will rarely, perhaps never, be exploited by real attackers. The ground-breaking 2014 academic analysis by Allodi and Massacci2 found that CVSS base scores have been a poor predictor of exploitation. Their results have since been validated by some vendor-funded studies.3 CVSS has certainly proven useful for potential severity ratings, but using it as a predictor, or worse, as a risk calculation, seems to be a mistake, despite the prevalence of the practice.

If not CVSS, then how can we identify the issues most likely to be exploited? Allodi and Massacci found that the addition of an exploit to an Exploit Kit dramatically increases the likelihood of use “in the wild,” that is, by real-world attackers. Their second strong predictor is Dark Web chatter about a vulnerability and its exploitation. When these two activities happen in tandem, one should fix the issue, and soon. This aligns well with the intuitive insights of today’s security practitioners who focus on threat intelligence and assessing their security posture through adversary emulation assessments such as red team exercises.

That should be easy, right? Unfortunately, processing Dark Web chatter proves non-trivial. Commercial products4 might not provide quite the right information, meaning that users must craft their own searches. Search capabilities in these products vary dramatically from full regular expressions to simple keyword searches. Buyer beware.

However, a recent announcement may signal the path forward. The Cyentia Institute and Kenna Security announced the release of their Exploit Prediction Scoring System (EPSS)5 and the research from which EPSS was built. Kenna Security supplies the data from which the EPSS calculator works. EPSS employs more predictors than the two primary ones named by Allodi and Massacci; please see the EPSS research6 to learn more. EPSS may be vulnerability management’s better mousetrap.

EPSS includes the CVSS severity score. But it offers an entirely different dimension into the potential for active vulnerability misuse by real attackers. Don’t mistake CVSS for EPSS. They deliver very different facets of the vulnerability picture. Severity is our best guess as to how bad successful exploitation might be in a normalized, generalized case; CVSS lacks context, and that missing context is often glaring. In comparison, EPSS attempts to tell us which vulnerabilities attackers will try, producing a percentage prediction of how likely exploitation will be at the time of calculation.

In the research (please see endnotes), exploitation of high-severity issues is actually much rarer than the misuse of low and medium issues. That may come as a surprise. One reason for the preference for low and medium issues might be the ease of crafting exploits. Plus, attackers hesitate to use issues that require significant setup and preconditions. Instead, they routinely string together issues that in isolation aren’t all that impactful. But taken as a set of steps, the “kill chain,” several low and medium issues can lead to full compromise. A quick survey through a few of MITRE’s ATT&CK Threat Groups7 demonstrates how techniques are used to generate a kill chain.

When we rely upon CVSS severity as our priority, we fix the issues that in the most generalized case might cause the most damage, scheduling the lower severities for some later date. This is precisely the problem predictive analysis addresses: identify those issues in which attackers are interested, and prioritize those. It turns out that quite often, some low and medium severity issues are the ones to worry about.
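To illustrate how the two orderings diverge, here is a toy triage example; the CVE IDs and scores are entirely made up, and ‘epss’ stands in for any exploitation-likelihood feed (EPSS, exploit-kit inclusion, Dark Web chatter):

```python
# Illustrative sketch of severity-only vs. likelihood-aware triage.
# All identifiers and scores below are fabricated for the example.

vulns = [
    {"id": "CVE-A", "cvss": 9.8, "epss": 0.02},  # severe but rarely exploited
    {"id": "CVE-B", "cvss": 5.4, "epss": 0.81},  # medium severity, actively used
    {"id": "CVE-C", "cvss": 7.5, "epss": 0.40},
]

by_severity = sorted(vulns, key=lambda v: v["cvss"], reverse=True)
by_likelihood = sorted(vulns, key=lambda v: v["epss"], reverse=True)

top_sev = by_severity[0]["id"]       # severity-first triage picks CVE-A
top_likely = by_likelihood[0]["id"]  # likelihood-first triage picks CVE-B
```

The medium-severity issue that attackers are actually using jumps to the top of the likelihood-aware queue, which is exactly the reordering predictive scoring is meant to produce.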

Remove attacker leverage by patching some kill-chain steps, and we raise the cost of, or even prevent, chained attacks. But we can only do that if we know which issues, irrespective of their potential severity, attackers are considering. EPSS, and predictive models in general, may offer users a way to sift attacker-preferred issues from the chaff of overwhelming vulnerability queues.

I must warn readers that there are problems with EPSS. Today, all one can get is a single, point-in-time predictive score through a web browser interface. One-at-a-time scoring isn’t how vulnerability management must work if it is to scale and provide just-in-time information. Unless a score is high enough to act upon when calculated, the quantity to watch is a score’s increase over time. Each vulnerability’s score needs to be monitored in order to identify issues that exceed the organization’s risk tolerance. Going to a website and checking tens of thousands of issues one at a time isn’t workable.

If EPSS is going to be of use, there must be some automation for organizations to periodically check scores. The threat landscape is dynamic, so any solution must be equally dynamic. I hope that Cyentia and Kenna Security will provide a service or API through which organizations can monitor predictive score changes over time, and at scale.

EPSS is tightly coupled to the management of vulnerabilities. It would be a major error to apply EPSS, or any vulnerability misuse prediction method, to other aspects of organizational risk management. As always, every organization needs a robust and thorough understanding of its risk tolerances, must dedicate skilled people to managing risk, and must adopt a rigorous and proven risk-scoring mechanism, for instance, The Open Group standard Factor Analysis of Information Risk (FAIR)8.

Importantly, EPSS will not supersede human risk analysis. EPSS, and CVSS as well, are adjuncts to human analysis, not replacements. Well-resourced attackers appear to be using more so-called zero-day vulnerabilities9, that is, vulnerabilities unknown before use and not yet fixed. To confront zero-days, we must rely on our threat intelligence gathering and contextual risk analysis. Human threat modeling continues to be one of the best techniques for assessing the potential danger from the unexpected appearance of a possible threat vector.

The Cyentia researchers indicated to me that Kenna Security owns the data used by EPSS. I attempted to contact someone at Kenna Security multiple times for this article, but Kenna Security has, unfortunately, not responded.

IOActive offers a full range of security consulting services, including vulnerability management, risk assessment, software security, and threat modeling.

Hopefully, this post helps your organization deal with its unmitigated vulnerability queue and better translate it into definable organizational and operational risks. Effective vulnerability management has the potential to free up resources that can be applied to other aspects of a robust cyber-risk program.

Cheers,
/brook s.e. Schoenfield
Master Security Architect
Director of Advisory Services


[1] Allodi, Luca & Massacci, Fabio. (2014). Comparing Vulnerability Severity and Exploits Using Case-Control Studies. ACM Transactions on Information and System Security. 17. 1-20. 10.1145/2630069. Thanks to Luis Servin (@lfservin) for the reference to this academic paper.

[2] http://seconomicsproject.eu/sites/default/files/seconomics/public/content-files/downloads/Comparing Vulnerabilities and Exploits using case-control studies.pdf

[3] NopSec, Inc’s 2016 and 2018 State of Vulnerability Risk Management Reports: https://www.nopsec.com/

[4] There is an open-source, public Dark Web search engine, DarkSearch.io. DarkSearch doesn’t offer full regular expressions, but it does offer several keyword and grouping enhancements.

[5] https://www.kennaresearch.com/tools/epss-calculator/

[6] Prioritization to Prediction, Cyentia Institute, and Kenna Security: https://www.kennasecurity.com/prioritization-to-prediction-report/images/Prioritization_to_Prediction.pdf

[7] https://mitre-attack.github.io/attack-navigator/enterprise/

[8] https://www.opengroup.org/forum/security-forum-0/risk-management

[9] Please see https://www.fireeye.com/blog/threat-research/2020/04/zero-day-exploitation-demonstrates-access-to-money-not-skill.html

EDITORIAL | April 2, 2020

10 Laws of Disclosure

In my 20+ years working in cyber security, I’ve reported more than 1,000 vulnerabilities to a wide variety of companies, most found by our team at IOActive, as well as some found by me. In reporting these vulnerabilities to many different vendors, the responses (or lack thereof) I’ve gotten have varied widely, depending on vendor security maturity. Whenever I think I have seen everything related to vulnerability disclosure, I have new experiences – usually bad ones – but in general, I keep seeing the same problems over and over again.

I’ve decided it would be a good idea to write down some Laws of Disclosure in order to help companies that are not yet mature improve their vulnerability disclosure processes.

Law 1: The vulnerability reporter is always right

It doesn’t matter if the vulnerability reporter is gross, stupid, or insults you; they have zero-day findings on your technology, so you’d better say “please” and “yes” to everything you can. It’s less complicated to deal with someone you don’t like than with 0days in the wild hurting your business.

Law 2: Have an easy-to-find and simple way to report vulnerabilities

It shouldn’t take more than a few seconds of browsing your website to find out how to report a vulnerability. Make it as easy and simple as possible; otherwise, you’ll learn about the vulnerability on the news.

Law 3: Your rules and procedures are not important

Some vulnerability reporters don’t care about your rules and procedures for reporting; they don’t want your bounty or compensation. They don’t have to follow your rules; they just want the vulnerability reported and fixed.

Law 4: Keep the vulnerability reporter up to date

Never keep the vulnerability reporter in the dark. Instantly acknowledge when you receive a vulnerability report, and then keep the finder posted about your actions and plans.

Law 5: Don’t play dirty

Never try to trick the reporter in any way to buy time or avoid public disclosure. Sooner or later the reporter will find out and 0day you. Time is never on your side, so use it wisely.

Law 6: Compensate

The vulnerability reporter is working for free for you, so always compensate them in some way, like a bounty or at least public acknowledgement and thanks.

Law 7: Forget NDAs and threats

The vulnerability reporter is not part of your company and doesn’t care about your lawyers. The vulnerability must always be fixed and then published, not hidden.

Law 8: Put the right people in place

Your people handling vulnerability reports should have the right knowledge and proper training. Never put lawyers or marketing people in charge of vulnerability disclosure; vulnerability finders don’t want to hear BS from them.

Law 9: Coordinate

Properly coordinate the release dates of your fix and the vulnerability advisory publication. You don’t want your customers exposed for one second.

Law 10: Always publish

Don’t sweep vulnerabilities under the carpet with silent fixes that never tell your customers how and why they should update. If you do, the vulnerability reporter will make sure your customers know about it, and they won’t be happy when they find out.

These Laws are based on my own experience, but if I’ve missed something, feel free to share your own experience and help contribute to a better vulnerability disclosure process. Also, if you ever need help with disclosures yourself, let me know via Twitter DM or email. I’ll be happy to help.

ADVISORIES | March 23, 2020

GE Reason S20 Industrial Managed Ethernet Switch Multiple Vulnerabilities

The S20 Ethernet Switch is a device manufactured by GE Grid Solutions that is deployed in industrial environments. This device is part of ICS/SCADA architectures.

Stored XSS flaws can result in a large number of possible exploitation scenarios. With most XSS flaws, the entirety of the JavaScript language is available to the malicious user.

ADVISORIES | March 6, 2020

pppd Vulnerable to Buffer Overflow Due to a Flaw in EAP Packet Processing (CVE-2020-8597)

Due to a flaw in the Extensible Authentication Protocol (EAP) packet processing in the Point-to-Point Protocol Daemon (pppd), an unauthenticated remote attacker may be able to cause a stack buffer overflow, which may allow arbitrary code execution on the target system.

This vulnerability is due to an error in validating the size of the input before copying the supplied data into memory. Because the size validation is incorrect, arbitrary data can be copied into memory, causing memory corruption and possibly leading to the execution of unwanted code.

EDITORIAL | February 13, 2020

Do You Blindly Trust LoRaWAN Networks for IoT?

Do you blindly trust that your IoT devices are being secured by the encryption methods employed by LoRaWAN? If so, you’re not alone. Long Range Wide Area Networking (LoRaWAN) is a protocol designed to allow low-power devices to communicate with Internet-connected applications over long-range wireless connections. It’s being adopted by major organizations across the world because of its promising capabilities. For example, a single gateway (antenna) can cover an entire city, hundreds of square miles.

With more than 100 million LoRaWAN-connected devices in use across the globe, many cellular carriers are racing to join in by offering nationwide LoRa coverage as a service at a low price: on average, a tenth of the cost of LTE-based services. However, equipment vendors, service providers, and the end users implementing the technology alike are paying little attention to security pitfalls, and are instead spreading a false sense of security.

Our New Research

In exploring the LoRaWAN protocol, we found major cyber security problems in the adoption of this technology. LoRaWAN is advertised as having “built-in encryption,” which may lead users to believe it is secure by default. When talking about the networks that are used across the globe to transmit data to and from IoT devices in smart cities, industrial settings, smart homes, smart utilities, vehicles, and healthcare, we can’t afford to blindly trust LoRaWAN and ignore cyber security. Last week, IOActive presented these LoRaWAN cyber security problems at The Things Conference in Amsterdam, and it grabbed the attention and interest of conference attendees.

The Root of the Risk

The root of the risk lies in the keys used for encrypting communications between devices, gateways, and network servers, which are often poorly protected and easily obtainable. Basically, the keys are everywhere, making the encryption almost useless. This leaves networks vulnerable to malicious hackers who could compromise the confidentiality and integrity of the data flowing to and from connected devices.

For example, if malicious hackers want to launch a Denial of Service attack, once they have the encryption keys, they can access the network and disrupt communications between connected devices and the network server, meaning companies can’t receive any data.

Alternatively, attackers could intercept communications and replace real sensor or meter readings with false data. Hackers could exploit this to damage industrial equipment, potentially halting operations and putting company infrastructure at risk.

These are just two examples of how attackers can leverage LoRaWAN to execute malicious attacks, but the list goes on. From preventing utility firms from taking smart meter readings, to stopping logistics companies from tracking vehicles, to blocking industrial control processes from receiving sensor readings, if we unwittingly trust flawed technology, we will pay the price.

What Now?

Currently, there isn’t a way for an organization to know if a LoRaWAN implementation is being, or has been, hacked, or if an encryption key has been compromised. Furthermore, there have been no tools to audit, penetration test, or hack LoRaWAN implementations. Standing in the gap, IOActive has released a set of useful tools, the LoRaWAN Auditing Framework, which allows organizations to audit and penetration test their infrastructure, detect possible attacks, and eliminate or reduce the impact of an attack. Our goal is to ensure LoRaWAN is deployed securely.

Resources

IOActive LoRaWAN Networks Susceptible to Hacking: Common Cyber Security Problems, How to Detect and Prevent Them (whitepaper)
IOActive LoRaWAN Auditing Framework tools

WHITEPAPER | February 10, 2020

LoRaWAN Networks Susceptible to Hacking: Common Cyber Security Problems, How to Detect and Prevent Them

LoRaWAN is fast becoming the most popular wireless, low-power WAN protocol. It is used around the world for smart cities, industrial IoT, smart homes, etc., with millions of devices already connected.

The LoRaWAN protocol is advertised as having “built-in encryption” making it “secure by default.” As a result, users are blindly trusting LoRaWAN networks and not paying attention to cyber security; however, implementation issues and weaknesses can make these networks easy to hack.

Currently, cyber security vulnerabilities in LoRaWAN networks are not well known, and there are no existing tools for testing LoRaWAN networks or for detecting cyber attacks, which makes LoRaWAN deployments an easy target for attackers.

In this paper, we describe LoRaWAN network cyber security vulnerabilities and possible cyber attacks, and provide useful techniques for detecting them with the help of our open-source tools.