Machine Learning leverages a four-phase process: Collection, Extraction, Learning and Classification.
Like DNA analysis, file analysis starts with massive data quantities – specific types of files (executables, PDFs, Microsoft Word® documents, Java, etc.). Millions of files are collected from industry sources, proprietary repositories and inputs from active computers.
The goal is to ensure:
- statistically significant sample sizes
- sample files of the broadest type and authorship (author groups such as Microsoft, Adobe, etc.)
- an unbiased collection, not over-collecting specific file types.
Files are then reviewed and placed into three buckets: known and verified valid; known and verified malicious; and unknown. An accurate review is imperative – a malicious file placed in the valid bucket, or a valid file placed in the malicious bucket, would bias the models built from the collection.
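The collection goals above can be checked mechanically. The following is a minimal sketch, not the actual collection tooling: the bucket names, the `summarize_collection` and `overrepresented_types` helpers, and the 50 percent share limit are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical labels for the three buckets described in the text.
BUCKETS = ("valid", "malicious", "unknown")


def summarize_collection(samples):
    """Count samples per bucket and per file type.

    `samples` is an iterable of (file_type, bucket) pairs.
    """
    by_bucket = Counter()
    by_type = Counter()
    for file_type, bucket in samples:
        if bucket not in BUCKETS:
            raise ValueError(f"unexpected bucket: {bucket!r}")
        by_bucket[bucket] += 1
        by_type[file_type] += 1
    return by_bucket, by_type


def overrepresented_types(by_type, max_share=0.5):
    """Return file types whose share of the corpus exceeds max_share.

    The 0.5 default is an illustrative assumption, not a documented limit.
    """
    total = sum(by_type.values())
    return sorted(t for t, n in by_type.items() if total and n / total > max_share)
```

A collection in which one file type dominates would show up in `overrepresented_types`, signaling that the sample should be rebalanced before any model is trained on it.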
The extraction of attributes follows, which is substantively different from behavior identification or malware analysis historically conducted by threat researchers. Rather than seeking things analysts believe might be malicious, this approach leverages the compute capacity of machines and data-mining to identify the broadest possible set of file characteristics — some as basic as the file size and others as complex as the first logic leap in the binary.
The atomic characteristics are then extracted, depending on file type (.exe, .dll, .com, .pdf, .java, .doc, .ppt, etc.). Identifying the broadest possible set of attributes removes the bias that manual classification introduces. Using millions of attributes also raises the cost to an attacker of creating a piece of malware that could go undetected. This attribute identification and extraction process creates a file genome comparable to the human genome. It can be used to mathematically determine the expected characteristics of files, just as human DNA analysis is used to determine the characteristics and behaviors of cells.
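To make the idea of atomic characteristics concrete, here is a small sketch that extracts a handful of generic attributes from raw file bytes. It is only an illustration: the `extract_features` function and the four attributes it returns are assumptions, and a real extractor would be specific to each file type and produce far more attributes.

```python
import math
from collections import Counter


def extract_features(data: bytes) -> dict:
    """Extract a few type-independent attributes from raw file bytes."""
    size = len(data)
    counts = Counter(data)  # occurrences of each byte value

    # Shannon entropy in bits per byte; high values often indicate
    # compressed or encrypted content.
    entropy = 0.0
    for n in counts.values():
        p = n / size
        entropy -= p * math.log2(p)

    # Share of bytes that are printable ASCII characters.
    printable = sum(1 for b in data if 32 <= b < 127)

    return {
        "size": size,
        "entropy": entropy,
        "printable_ratio": printable / size if size else 0.0,
        "distinct_bytes": len(counts),
    }
```

None of these attributes says anything about maliciousness on its own. The point of the approach described above is that the models, not the analyst, decide which attributes matter.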
Once extracted, the attributes are normalized and converted to numerical values for use in statistical models. Vectorization and machine learning are then applied to remove human bias and to speed analytical processing. Leveraging the attributes identified in extraction, mathematicians then develop statistical models that predict whether a file is benign or malicious. Dozens of models are created, each measured against key metrics to verify its predictive accuracy. Ineffective models are scrapped. Effective models are subjected to multiple levels of testing.
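Normalization and vectorization can be sketched as follows. This example assumes min-max scaling, which is one common choice and not necessarily the one used in practice; the `vectorize` function and the feature names in the usage example are hypothetical.

```python
def vectorize(records, feature_names):
    """Turn feature dictionaries into min-max-normalized numeric vectors.

    Returns the vectors plus the per-feature (min, max) bounds, which
    would be needed to scale new files the same way later.
    """
    if not records:
        raise ValueError("at least one record is required")

    # Find the observed range of each feature across the corpus.
    bounds = {}
    for name in feature_names:
        column = [float(r[name]) for r in records]
        bounds[name] = (min(column), max(column))

    vectors = []
    for r in records:
        row = []
        for name in feature_names:
            lo, hi = bounds[name]
            span = hi - lo
            # A constant feature carries no information; map it to 0.0.
            row.append((float(r[name]) - lo) / span if span else 0.0)
        vectors.append(row)
    return vectors, bounds
```

For example, three files with sizes 100, 300 and 200 would be mapped to 0.0, 1.0 and 0.5 on the size axis, so that attributes measured in very different units become comparable inputs to a model.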
The first level starts with a sample of known files. Later stages involve the entire file corpus (tens of millions of files). The final models are then loaded into a production environment for use in file classification.
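The step of scrapping ineffective models and keeping effective ones might look like the sketch below. The `evaluate` and `select_models` helpers, the choice of accuracy and false-positive rate as the key measurements, and the threshold values are all assumptions for illustration; the text does not say which measurements are actually used.

```python
def evaluate(model, samples):
    """Return (accuracy, false_positive_rate) for a model on labeled samples.

    `model` is any callable mapping a feature vector to a truthy value
    (malicious) or falsy value (benign); `samples` is a list of
    (vector, is_malicious) pairs drawn from known files.
    """
    if not samples:
        raise ValueError("at least one labeled sample is required")
    correct = false_positives = benign_total = 0
    for vector, is_malicious in samples:
        predicted = bool(model(vector))
        if predicted == is_malicious:
            correct += 1
        if not is_malicious:
            benign_total += 1
            if predicted:
                false_positives += 1
    accuracy = correct / len(samples)
    fpr = false_positives / benign_total if benign_total else 0.0
    return accuracy, fpr


def select_models(models, samples, min_accuracy=0.95, max_fpr=0.01):
    """Keep only the models that meet both (hypothetical) thresholds."""
    kept = {}
    for name, model in models.items():
        accuracy, fpr = evaluate(model, samples)
        if accuracy >= min_accuracy and fpr <= max_fpr:
            kept[name] = (accuracy, fpr)
    return kept
```

The same `select_models` call could be repeated at each testing level, first on a small sample of known files and later on the full corpus, with only the survivors moving on.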
It’s important to remember that for every file scrutinized, millions of attributes are analyzed to differentiate between legitimate files and malware. This is how machine learning identifies malware – whether known or unknown – and achieves unprecedented levels of accuracy. It divides a single file into an astronomical number of characteristics and analyzes each against hundreds of millions of other files to reach a decision about the health of each characteristic.
Once built, statistical models can be used by math engines to classify unknown files (e.g., files never seen before). This analysis takes milliseconds and is extremely precise because of the breadth of the file characteristics analyzed.
Because classification relies on statistical models, the result is not opaque. A “confidence score” is included as part of the process. This score provides incremental insight that can inform decisions regarding what action to take – block, quarantine, monitor or analyze further.
An important distinction between a machine-learning approach and a traditional approach is that the mathematical approach builds models that specifically determine if a file is benign or malicious. It also returns a response of “suspicious” if confidence about a file's malicious intent is less than 20 percent and there are no other indications of maliciousness. An enterprise can thus gain a holistic perspective on the files running in its environment.
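The way a confidence score can drive a verdict and a follow-up action is sketched below. Only the 20 percent threshold comes from the text. The `verdict` function, its `flagged` and `other_indicators` parameters, and the `ACTIONS` policy table are hypothetical, and the sketch assumes the score expresses confidence in malicious intent on a scale from 0.0 to 1.0.

```python
SUSPICIOUS_BELOW = 0.20  # the 20 percent threshold stated in the text


def verdict(flagged, confidence, other_indicators=False):
    """Map a model result to 'benign', 'suspicious' or 'malicious'.

    flagged          -- True if the model considers the file malicious
    confidence       -- confidence in malicious intent, 0.0 to 1.0
    other_indicators -- True if anything else points to maliciousness
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if not flagged:
        return "benign"
    if confidence < SUSPICIOUS_BELOW and not other_indicators:
        return "suspicious"
    return "malicious"


# Hypothetical policy: each enterprise would choose its own mapping.
ACTIONS = {
    "benign": "allow",
    "suspicious": "monitor",
    "malicious": "quarantine",
}
```

A file flagged with a confidence of 0.10 and no other indicators would be reported as suspicious and, under this example policy, monitored rather than quarantined.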