The scale of data theft is staggering. In 2018, data breaches compromised 450 million records, while 2019 has already brought to light the biggest data breach in history, with nearly 773 million passwords and email addresses stolen from thousands of sources and uploaded to one database.

Current cyber defense tactics simply aren’t enough; a new model of defense is needed. In research published recently in Future Generation Computer Systems, my co-researchers and I propose a framework that harnesses the power of machine learning to accurately predict attacks and identify perpetrators.


Outdated Tactics

The current manual security models are quickly becoming obsolete for a number of reasons. For one, there is simply too much data for human analysts to sift through by hand. Hail-a-TAXII, a repository of Open Source Cyber Threat Intelligence feeds, provides more than one million threat indicators. IBM X-Force reports thousands of new malware samples weekly. Verizon’s Data Breach Investigations Report details millions of incidents. These are just a few of the many data sources analysts have at their fingertips.

Another problem is that current cyber threat intelligence (CTI) tactics look only at low-level indicators, small attack signatures such as IP addresses, domain names and file hashes. Low-level indicators are easy for companies to block by plugging them into firewalls and security devices. Unfortunately, they’re also easy for hackers to change. Using only low-level indicators to stop a cyberattack is a little like trying to prevent thieves from robbing your home by reinforcing only one window. The thief will just find another window.
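The brittleness of low-level indicators can be made concrete with a minimal sketch. Blocking a known-bad IP address or file hash amounts to a simple set lookup, and the attacker defeats it by rotating to a fresh address or recompiling the malware. The indicator values below are illustrative placeholders (reserved documentation addresses), not real threat data.

```python
# Minimal sketch: why indicator blocklists are brittle.
# The blocklist entries are illustrative, not real indicators.
KNOWN_BAD_IPS = {"203.0.113.7", "198.51.100.23"}        # RFC 5737 example addresses
KNOWN_BAD_HASHES = {"d41d8cd98f00b204e9800998ecf8427e"}  # placeholder hash

def is_blocked(ip: str, file_hash: str) -> bool:
    """Return True if either low-level indicator matches the blocklist."""
    return ip in KNOWN_BAD_IPS or file_hash in KNOWN_BAD_HASHES

# A known indicator is caught, but a trivially rotated IP slips through:
print(is_blocked("203.0.113.7", "abc"))   # True  -- indicator on the blocklist
print(is_blocked("203.0.113.99", "abc"))  # False -- attacker moved one address over
```

The attacker’s cost of evading this check is near zero, which is exactly the weakness the article describes.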

The glut of data and preoccupation with low-level indicators contribute to a serious lag in identifying threats. The median time for an organization to determine it is under attack is 46 days. Attacks can go undetected for much longer: the massive data breach at Equifax in 2017, involving nearly 150 million pieces of personal data, went undetected for 76 days.

Relying on low-level indicators simply doesn’t make sense given what we now know about hackers: They use common patterns of attack that can be identified by looking at high-level indicators, otherwise known as Tactics, Techniques and Procedures (TTPs).

Examples of tactics common to certain threat groups involving the compromise of victims’ credentials include:

  • exploiting the victim’s remote access tools and the network’s endpoint management platforms (threat group TG-1314).
  • employing key loggers and publicly available credential-dumper toolkits (TG-3390).
  • spear phishing with shortened URL links pointing to malicious websites (TG-4127, which targets government and military networks for espionage and cyber warfare).
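Because each group tends to reuse a characteristic set of tactics, correlating the TTPs observed in an incident against known group profiles can point toward a likely perpetrator. The sketch below is a toy illustration of that idea; the group names echo those above, but the technique labels and profiles are invented, not real ATT&CK data.

```python
# Toy illustration: rank threat groups by overlap between their known
# TTP profile and the TTPs observed in an incident.
# Profiles are invented for illustration, loosely based on the text above.
GROUP_TTPS = {
    "TG-1314": {"remote-access-tool-abuse", "endpoint-mgmt-abuse"},
    "TG-3390": {"keylogging", "credential-dumping"},
    "TG-4127": {"spearphishing-link", "url-shortening"},
}

def rank_groups(observed: set) -> list:
    """Rank groups by the fraction of their known TTPs seen in the incident."""
    scores = {
        group: len(observed & ttps) / len(ttps)
        for group, ttps in GROUP_TTPS.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# An incident showing keylogging plus credential dumping points at TG-3390:
print(rank_groups({"keylogging", "credential-dumping"}))
```

Real attribution is far messier, since groups share tooling and evolve their tactics, but the set-overlap intuition is the starting point for the statistical methods described later.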

Typically, hackers will specialize in one attack tactic and gradually evolve the tactic over time. Consider what’s happened with RAM scrapers: malware that enters servers and combs through the memory to find a distinctive code pattern, such as a credit card’s 16 digits. A RAM scraper was behind the 2013 Target data breach that compromised 40 million credit cards, as well as the 2018 Marriott and Hyatt breaches and many others in between.

While the tactic has remained the same, what has changed is how the malware transfers data to the attacker, advancing from FTP to web protocols and finally to encrypting the information and exfiltrating it automatically, no longer reliant on a human to copy and transfer the data. Fifty different families of RAM scrapers for stealing personal data currently exist.

The cyber intelligence community already maintains databases detailing high-level indicators. More than 130 adversary technique documents exist. As of late 2018, there were 45 known threat actors and 123 known software tools included in the ATT&CK taxonomy, a globally accessible knowledge base of adversary tactics and techniques based on real-world observations. The ATT&CK taxonomy shows that the number of TTPs used in threat incidents ranges from one to 34, with an average of six.

If so much is known about TTPs, why aren’t analysts relying on them for cyber defense? Again, the problem is the massive amount of data. Manually searching for correlated TTPs is tedious, error-prone and nearly impossible at scale. Compounding the problem, there is no commonly used vocabulary for describing attacks and attack tactics. TTPs are mostly reported as unstructured textual descriptions, which makes it difficult to correlate attack incidents of the same threat group based on similar TTPs, thanks to synonyms and polysemous words. The same style of attack can be labeled one thing in one database and something completely different in another.
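The vocabulary problem can be sketched as a normalization step: before incidents from different feeds can be correlated, each free-text description must be mapped to one canonical technique label. The alias table below is entirely invented for illustration; in practice this mapping is what a taxonomy like ATT&CK, plus semantic text analysis, provides.

```python
# Illustrative sketch of TTP vocabulary normalization.
# The alias table is invented; a real system would map descriptions to
# canonical technique identifiers from a taxonomy such as ATT&CK.
ALIASES = {
    "password dumping": "credential-dumping",
    "credential harvesting": "credential-dumping",
    "dumping account hashes": "credential-dumping",
    "phishing email with link": "spearphishing-link",
    "malicious shortened url": "spearphishing-link",
}

def normalize(description: str) -> str:
    """Map a free-text TTP description to a canonical technique label."""
    return ALIASES.get(description.lower().strip(), "unknown")

# Two differently worded reports resolve to the same underlying technique:
print(normalize("Password dumping"))       # credential-dumping
print(normalize("credential harvesting"))  # credential-dumping
```

Only after this kind of canonicalization can two incident reports written by different vendors be recognized as describing the same attack tactic.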


Building a New Framework

The framework we propose in Future Generation Computer Systems is based on our knowledge of TTPs and the problems plaguing the cyber intelligence community: too much data and no automated way to rely on more effective high-level indicators.

The framework creates a network of Threats, TTPs and Detection (TTD) mechanisms. To accomplish this, data was collected from related cyber breach incidents and reliable threat sources in the public domain.

In total, more than 327 unstructured documents from about two dozen sources were used. Although machines will likely one day be able to deal with all the nuances of human language, we’re not there yet. This means the data had to be curated and semantically correlated before machines could analyze it; for this common vocabulary, we used the ATT&CK taxonomy.

Next comes threat prediction and detection. To determine the most probable threat family from the TTPs detected in a network, we apply a probabilistic machine-learning analysis using belief networks between threats and TTPs. This approach has a few advantages: it can outperform sophisticated classification methods, it treats all predictor attributes independently, and it can handle large datasets with incomplete data. We tested various machine learning techniques and found deep learning neural networks to be the most successful.
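To give a flavor of the probabilistic idea, here is a minimal naive-Bayes-style scorer that treats each TTP as an independent piece of evidence for a threat family, which is one common simplification of the belief-network approach described above. The tiny training set is invented for illustration; the actual framework was trained on hundreds of curated incident reports.

```python
# Minimal sketch of naive-Bayes-style threat-family prediction from TTPs.
# The incident data is invented; it is not the paper's training set.
import math

INCIDENTS = [  # (threat family, TTPs observed in that incident)
    ("ram-scraper", {"memory-scraping", "ftp-exfil"}),
    ("ram-scraper", {"memory-scraping", "encrypted-exfil"}),
    ("spy-group",   {"spearphishing-link", "keylogging"}),
]
TTPS = sorted({t for _, ttps in INCIDENTS for t in ttps})

def predict(observed: set) -> str:
    """Return the threat family with the highest naive-Bayes log-score."""
    best, best_score = None, -math.inf
    for fam in {f for f, _ in INCIDENTS}:
        fam_incidents = [ttps for f, ttps in INCIDENTS if f == fam]
        score = math.log(len(fam_incidents) / len(INCIDENTS))  # prior
        for ttp in TTPS:
            seen = sum(ttp in ttps for ttps in fam_incidents)
            p = (seen + 1) / (len(fam_incidents) + 2)  # Laplace smoothing
            # Each TTP contributes independent evidence for or against fam:
            score += math.log(p if ttp in observed else 1 - p)
        if score > best_score:
            best, best_score = fam, score
    return best

print(predict({"memory-scraping", "encrypted-exfil"}))  # ram-scraper
```

The independence assumption is what lets the method scale to large, incomplete datasets; more expressive models (including the deep neural networks the paper found most successful) relax it at the cost of needing more data.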

To test the framework, we built a benchmark dataset from data available in the ATT&CK taxonomy. Threat artifacts were collected based on threat incidents reported by a variety of sources, including IBM X-Force, Symantec, FireEye and CrowdStrike.

These were plugged into the system, and an automated investigation was conducted to uncover attacks. The framework achieved about 92 percent accuracy with a low false-positive rate. Unlike human analysts, the system is fast: the average detection time for a data breach incident was just 0.15 seconds.
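For readers unfamiliar with how a figure like "92 percent accuracy" is computed, the evaluation boils down to comparing predicted threat families against ground-truth labels on a held-out benchmark. The labels below are invented placeholders, not results from the paper’s benchmark.

```python
# Sketch of the evaluation metric: fraction of incidents whose threat
# family was predicted correctly. Labels here are invented examples.
def accuracy(predicted: list, actual: list) -> float:
    """Return the fraction of matching (predicted, actual) label pairs."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

print(accuracy(
    ["ram-scraper", "spy-group", "ram-scraper", "spy-group"],  # predictions
    ["ram-scraper", "spy-group", "spy-group",   "spy-group"],  # ground truth
))  # 0.75 -- three of four incidents classified correctly
```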

With cyberattacks, it’s no longer a matter of if but when. It’s past time for organizations to step up their cyber threat intelligence efforts. Attackers spend many years planning and developing their tools to break into systems.

By targeting the heart of their methods, rather than inconsequential details, we can disrupt their standard way of doing things and perhaps set them back many years in terms of the ease and sophistication with which they steal data. In addition, accurate identification of perpetrators can be a significant help when it comes to prosecuting cyber attackers.