The Role of Machine Learning in Cybersecurity
René Descartes, philosopher and mathematician, wrote: “Mathematics is a more powerful instrument of knowledge than any other that has been bequeathed to us by human agency.”
The problem is that enterprise security personnel are defending a castle riddled with holes, filled with secret passageways and protected by ineffective barriers. These weak points are a consequence of anemic security software, inferior hardware and backdoors planted by malicious insiders. The result is a galling acceptance that the attackers are winning as they continue to evolve in complexity. Part of that evolution involves the employment of evasion techniques designed to bypass existing security. Detecting these advanced threats after they execute is hard enough. Proactive prevention has eluded us.
The Human Factor
To keep up with modern attackers, security needs to evolve alongside them – without relying on human intervention. That’s where math and machine learning have the advantage. If files can be classified as “benign” or “malicious” based on mathematical risk factors, a machine can be taught to make that characterization in real time.
A math and machine learning approach to security can fundamentally change the way we understand, categorize and control the execution of every file. Industries such as healthcare, insurance and high-frequency trading have long applied the principles of machine learning to analyze enormous quantities of business data, driving autonomous decision making. The core of such an approach is a massively scalable data-processing ‘brain’ capable of applying highly tuned mathematical models to enormous amounts of data.
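To make the idea of classifying files from mathematical risk factors more concrete, here is a minimal sketch, not a description of any real product. Everything in it is an assumption made for illustration: the two features (byte entropy and the share of printable characters), the function names, the tiny logistic regression model and the synthetic training samples. In particular, treating high-entropy data as "malicious" is only a stand-in so the toy model has something to learn.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII characters."""
    if not data:
        return 0.0
    return sum(32 <= b < 127 for b in data) / len(data)


def features(data: bytes) -> list:
    """Hypothetical two-number 'risk factor' vector, each value scaled to 0..1."""
    return [byte_entropy(data) / 8.0, printable_ratio(data)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=2000, learning_rate=0.5):
    """Fit a logistic regression model by stochastic gradient descent."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def risk_score(data: bytes, weights, bias) -> float:
    """Estimated probability (0..1) that `data` belongs to the 'malicious' class."""
    x = features(data)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Synthetic training set (toy data only): label 0 = "benign", 1 = "malicious".
benign = [b"The quick brown fox jumps over the lazy dog. " * 20,
          b"def main():\n    print('hello world')\n" * 30]
malicious = [bytes(range(256)) * 4,
             bytes((i * 7) % 251 for i in range(2000))]
weights, bias = train([features(d) for d in benign + malicious], [0, 0, 1, 1])

# Score two inputs the model has not seen before.
unseen_text = b"Meeting notes: review the quarterly budget. " * 10
unseen_blob = bytes((i * 13 + 5) % 256 for i in range(512))
print(f"text-like file:    {risk_score(unseen_text, weights, bias):.3f}")
print(f"high-entropy file: {risk_score(unseen_blob, weights, bias):.3f}")
```

The model never sees the raw file, only the numbers derived from it. A realistic system would presumably use far more features and far more training data than this sketch.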
What is Machine Learning?
“Machine learning, a branch of artificial intelligence, involves the construction and study of systems that can learn from data ... The core of machine learning deals with representation and generalization. Representation of data instances are part of all machine learning systems. Generalization is the property that the system will perform well on unseen data instances; the conditions under which this can be guaranteed are a key object of study in the subfield of computational learning theory.” —Wikipedia
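Generalization, as described in the definition above, is commonly checked by holding some data back during training and measuring accuracy on it afterwards. The sketch below illustrates that idea under loud assumptions: the two-dimensional feature vectors are randomly generated toy data rather than measurements of real files, the "benign" and "malicious" cluster centers are invented, and the nearest-centroid model and all function names are chosen only for brevity.

```python
import random


def make_samples(rng, center, label, n):
    """Draw n synthetic 2-D feature vectors around `center` (toy data, not real files)."""
    return [((rng.gauss(center[0], 0.05), rng.gauss(center[1], 0.05)), label)
            for _ in range(n)]


def fit_centroids(samples):
    """Learn one representative point per class: the mean of its feature vectors."""
    sums = {}
    for (x, y), label in samples:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids, point):
    """Assign `point` to the class whose centroid is closest."""
    return min(centroids,
               key=lambda label: (point[0] - centroids[label][0]) ** 2
                               + (point[1] - centroids[label][1]) ** 2)


def accuracy(centroids, samples):
    """Share of samples whose predicted label matches the true label."""
    return sum(predict(centroids, p) == label for p, label in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
data = (make_samples(rng, (0.3, 0.3), "benign", 100)
        + make_samples(rng, (0.7, 0.7), "malicious", 100))
rng.shuffle(data)

# The model is fitted on the first 150 samples and never sees the last 50.
training, held_out = data[:150], data[150:]
centroids = fit_centroids(training)
held_out_accuracy = accuracy(centroids, held_out)
print(f"accuracy on unseen samples: {held_out_accuracy:.2f}")
```

Accuracy on the held-out samples, not on the training samples, is the figure that says something about how the model may behave on files it has never encountered.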
Over time, billions of files have been created – both malicious and benign. As file creation has evolved, patterns have emerged that reflect how specific types of files are constructed. Variability and anomalies exist, but in general the process by which software is built is reasonably consistent.
The patterns become even more consistent across development shops such as Microsoft®, Adobe® and other large software vendors. That consistency increases as one looks at development processes used by specific developers and attackers. The challenge lies in identifying patterns, understanding how they manifest and recognizing what consistent patterns tell us about the nature of these files.
Math vs. Malware
The magnitude of the data involved, the tendency toward bias and the number of computations required render humans incapable of leveraging this data to determine whether a file is malicious. Most security companies still rely on humans to make these determinations, hiring large teams to examine millions of files and separate the “good” from the “bad”. Humans have neither the brainpower nor the physical endurance to keep up with the volume and sophistication of modern threats. Advances in behavioral and vulnerability analysis, as well as in identifying indicators of compromise, all suffer from the same fatal flaw: they are based on a human perspective and analysis of a problem, which can err, is slow and tends toward over-simplification. Machines are less likely to suffer from such constraints.
Machine learning and data mining go hand in hand. Machine learning focuses on prediction based on properties learned from earlier data. This is how we can now differentiate malicious files from benign ones. Data mining focuses on the discovery of previously unknown properties of data, so that they can be used in future machine learning decisions. In my next installment, we’ll examine how ML actually works.
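As a small preview, the split between discovery and prediction described above can be sketched in a few lines. This is an illustration under assumptions, not anyone's production method: the unlabeled points are made-up two-dimensional feature vectors, k-means is used as one common example of a discovery technique, and the function names are hypothetical.

```python
import math


def nearest(centroids, point):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], point))


def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points so runs are repeatable."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Group every point with its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(centroids, p)].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


# Discovery: no labels are given; the grouping emerges from the data alone.
unlabeled = [(0.10, 0.20), (0.90, 0.80), (0.15, 0.25),
             (0.85, 0.90), (0.20, 0.10), (0.95, 0.85)]
centroids = kmeans(unlabeled, k=2)

# Prediction: the discovered groups are reused to place a previously unseen point.
new_point = (0.88, 0.82)
group = nearest(centroids, new_point)
print(f"discovered centroids: {centroids}")
print(f"new point assigned to group {group}")
```

The first step finds structure nobody specified in advance. The second applies that structure to new data, which is the hand-off from data mining to machine learning that the paragraph above describes.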