Malware prediction

by Ketan Nilangekar


Data science and commercially available AI/ML implementations now make it possible to predict whether a vulnerability can be weaponized into malware. This could be a critical moment in cybersecurity, as it allows vulnerability management to be truly proactive and reduces the remediation workload. But why bother with this? And even if we did, how could it be done? Let's take a look.

Vulnerability Management, A.K.A. the Story of Sisyphus

Vulnerability management lies at the intersection of two spheres of pretty big data sets. The first is the universe of vulnerabilities that are getting published and updated every day by vendors and researchers all over the world. The second is your attack surface, which is constantly changing as well. We are told to keep scanning our attack surfaces every day, week, month, and to hope that we catch and patch the vulnerabilities that may or may not one day be weaponized against us. So, just like Sisyphus in Greek mythology, we keep rolling the boulder of scanning up the hill every day, week, month, and it just keeps rolling back down with little to show for it.

This might not be as bad as it sounds were it not for the fact that the boulder keeps getting heavier and the hill keeps getting steeper every day. Why? Consider this: 17308 vulnerabilities were published in 2019. In 2015, this number was 6487. And these are only the vulnerabilities that are accounted for in NVD. Our research tells us that there were at least 5745 vulnerability advisories from open source projects with no CVE numbers assigned to them in the past year alone. It is clear that this problem will continue to grow as software continues to drive more and more of our everyday life.
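To put the two NVD figures above in perspective, here is a quick back-of-the-envelope calculation. It uses only the 2015 and 2019 counts quoted in this post and assumes steady compound growth between them, which is a simplification and not something the data itself guarantees.

```python
# Published vulnerability counts quoted in the text above (NVD).
published = {2015: 6487, 2019: 17308}

years = 2019 - 2015
growth_factor = published[2019] / published[2015]

# Compound annual growth rate implied by just these two data points.
# Assumes smooth year-over-year growth, which is a simplification.
implied_annual_growth = growth_factor ** (1 / years) - 1

print(f"Growth factor 2015 -> 2019: {growth_factor:.2f}x")
print(f"Implied compound annual growth: {implied_annual_growth:.1%}")
```

In other words, the yearly count grew to roughly 2.7 times its 2015 level in four years, before counting the advisories that never received a CVE number.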

With this deluge, though, comes noise. And a lot of it. Very few of these vulnerabilities are actually going to be of interest to you based on your attack surface. Of those, even fewer have been or can be exploited. Fewer still will be exploitable remotely over the network, rather than being a script-kiddie-grade vulnerability that requires local access and root privileges. Even among remotely exploitable vulnerabilities that require little or no privilege, chances are that only a select few will be weaponized into malware. This is not to say that the amount of malware released every year is anything to ignore, as this report from MalwareBytes tells us. But it is almost certain that you will be looking for that proverbial needle in the haystack: the “big one” that you must fix.
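The narrowing described above can be pictured as a funnel of filters. The sketch below is purely illustrative: the records, the identifiers, the `Vuln` fields and the `attack_surface` set are all made up for this example, and the attack vector and privilege values simply borrow CVSS-style labels.

```python
from dataclasses import dataclass


@dataclass
class Vuln:
    vuln_id: str
    product: str
    attack_vector: str        # CVSS-style: "NETWORK" or "LOCAL"
    privileges_required: str  # CVSS-style: "NONE", "LOW" or "HIGH"
    exploit_known: bool


# Hypothetical feed entries, invented for illustration only.
feed = [
    Vuln("VULN-0001", "webserver", "NETWORK", "NONE", True),
    Vuln("VULN-0002", "webserver", "LOCAL", "HIGH", True),
    Vuln("VULN-0003", "database", "NETWORK", "HIGH", True),
    Vuln("VULN-0004", "mailclient", "NETWORK", "NONE", True),
    Vuln("VULN-0005", "database", "NETWORK", "LOW", False),
    Vuln("VULN-0006", "webserver", "NETWORK", "NONE", False),
]

# Hypothetical inventory of what we actually run.
attack_surface = {"webserver", "database"}

# Each stage mirrors one step of the narrowing described in the text.
stages = [
    ("on our attack surface", lambda v: v.product in attack_surface),
    ("exploit known", lambda v: v.exploit_known),
    ("remotely exploitable", lambda v: v.attack_vector == "NETWORK"),
    ("little or no privilege", lambda v: v.privileges_required in {"NONE", "LOW"}),
]

remaining = feed
print(f"{'published':<24}{len(remaining)}")
for label, keep in stages:
    remaining = [v for v in remaining if keep(v)]
    print(f"{label:<24}{len(remaining)}")
```

Each stage throws away part of what the previous one kept, which is exactly why the haystack is so large compared to the needle. What no simple filter like this can tell us is which of the survivors will actually be weaponized.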

So when we run our vulnerability scans every day, week, month, what are we really getting out of them? What are we really getting out of pushing the boulder up each time only for it to roll back down? Can we do anything to lighten the load perhaps?

The Promise of Exploit and Malware Prediction

Commercially available AI and ML algorithms now give us the ability to crunch very large historical datasets into models that can predict the chances of exploits and malware emerging for a given vulnerability. If our assertion about the kinds of vulnerabilities that eventually get weaponized into malware is correct, then in theory we can train a neural net to identify those kinds of vulnerabilities early on and give us a decent indicator of the probability of their being used in an exploit or malware.

The dataset that we would have to use may include features such as attack vector, weakness type and description for vulnerabilities that have been successfully exploited or weaponized in the past. It is important to choose the right feature set to build the ML model. Some features, such as vendor / product lists, may end up biasing the model, while others, like title / summary, may have limited impact on the effectiveness of the model. This dataset needs to be carefully curated, labeled and partitioned. The choice of ML model (neural net, SVM, etc.) depends on the size of the dataset. Layers and NLP techniques that effectively summarize the vulnerability will be very useful in improving the accuracy of the model.
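As a rough illustration of the curation steps above, the sketch below encodes attack vector, weakness type and description into numeric features and partitions a labeled dataset so that both classes appear in the training and test sets. Everything in it is an assumption made for the example: the records are invented, the bag-of-words encoding stands in for whatever NLP technique one would really use, and the helper names (`tokenize`, `build_vocab`, `encode`, `stratified_split`) are hypothetical. Vendor and product are deliberately left out, in line with the bias concern mentioned above.

```python
import random
import re
from collections import defaultdict

# Invented records. label 1 = later exploited or weaponized, 0 = not.
records = [
    {"attack_vector": "NETWORK", "weakness": "CWE-787", "label": 1,
     "description": "Out-of-bounds write in packet parser allows remote code execution"},
    {"attack_vector": "NETWORK", "weakness": "CWE-89", "label": 1,
     "description": "SQL injection in login form lets remote attackers read the database"},
    {"attack_vector": "NETWORK", "weakness": "CWE-787", "label": 1,
     "description": "Heap overflow in image decoder allows remote code execution"},
    {"attack_vector": "NETWORK", "weakness": "CWE-79", "label": 1,
     "description": "Stored cross-site scripting in comment field"},
    {"attack_vector": "LOCAL", "weakness": "CWE-200", "label": 0,
     "description": "Local user can read debug log containing configuration details"},
    {"attack_vector": "LOCAL", "weakness": "CWE-787", "label": 0,
     "description": "Out-of-bounds write in admin-only command line tool"},
    {"attack_vector": "NETWORK", "weakness": "CWE-200", "label": 0,
     "description": "Server banner discloses version string"},
    {"attack_vector": "LOCAL", "weakness": "CWE-79", "label": 0,
     "description": "Cross-site scripting in locally hosted help page"},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(rows):
    """Collect the categories and words seen in the given rows."""
    vectors = sorted({r["attack_vector"] for r in rows})
    weaknesses = sorted({r["weakness"] for r in rows})
    words = sorted({w for r in rows for w in tokenize(r["description"])})
    return vectors, weaknesses, words


def encode(row, vocab):
    """One-hot encode the categories and add a binary bag of words."""
    vectors, weaknesses, words = vocab
    tokens = set(tokenize(row["description"]))
    return (
        [1 if row["attack_vector"] == v else 0 for v in vectors]
        + [1 if row["weakness"] == w else 0 for w in weaknesses]
        + [1 if w in tokens else 0 for w in words]
    )


def stratified_split(rows, test_fraction=0.25, seed=0):
    """Split rows so that each label keeps its share in both partitions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for r in rows:
        by_label[r["label"]].append(r)
    train, test = [], []
    for group in by_label.values():
        group = group[:]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


train, test = stratified_split(records)
# Build the vocabulary from the training rows only, so nothing from the
# test partition leaks into the features.
vocab = build_vocab(train)
X_train = [encode(r, vocab) for r in train]
X_test = [encode(r, vocab) for r in test]
y_train = [r["label"] for r in train]
y_test = [r["label"] for r in test]

print(f"train rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"features per row: {len(X_train[0])}")
```

The resulting vectors could then be handed to whichever model fits the size of the dataset. A real pipeline would work on far more records and would likely replace the bag of words with a richer text representation.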

The Use Case

A successful prediction engine should be able to flag vulnerabilities that are most likely to be exploited or weaponized with a high degree (>95%) of accuracy. These inputs can be used to prioritize vulnerabilities and reduce the remediation workload. It is important to highlight the fact that this prioritization should be a guideline for SecOps and should not be taken as a pass to ignore other critical vulnerabilities even if they are not flagged.
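One way such predictions might feed into prioritization is sketched below. The scores, the `weaponization_score` field, the 0.5 threshold and the `triage` helper are all assumptions made for illustration, not the output of any real engine. Note that critical vulnerabilities that were not flagged land in their own review list instead of being dropped.

```python
# Hypothetical scan findings with an invented model score attached.
findings = [
    {"id": "VULN-0001", "severity": "CRITICAL", "weaponization_score": 0.91},
    {"id": "VULN-0002", "severity": "MEDIUM", "weaponization_score": 0.12},
    {"id": "VULN-0003", "severity": "CRITICAL", "weaponization_score": 0.08},
    {"id": "VULN-0004", "severity": "HIGH", "weaponization_score": 0.67},
    {"id": "VULN-0005", "severity": "LOW", "weaponization_score": 0.03},
]

# Assumed cut-off; in practice this would be tuned on held-out data.
FLAG_THRESHOLD = 0.5


def triage(items, threshold=FLAG_THRESHOLD):
    """Split findings into flagged, still-critical and backlog lists."""
    flagged = sorted(
        (f for f in items if f["weaponization_score"] >= threshold),
        key=lambda f: f["weaponization_score"],
        reverse=True,
    )
    unflagged = [f for f in items if f["weaponization_score"] < threshold]
    # Critical findings are kept visible even when the model did not flag them.
    still_review = [f for f in unflagged if f["severity"] == "CRITICAL"]
    backlog = [f for f in unflagged if f["severity"] != "CRITICAL"]
    return flagged, still_review, backlog


flagged, still_review, backlog = triage(findings)
print("fix first:     ", [f["id"] for f in flagged])
print("still critical:", [f["id"] for f in still_review])
print("backlog:       ", [f["id"] for f in backlog])
```

The ordering is a guideline for SecOps: the flagged list says where to start, not what to ignore.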

In the ongoing fight against cyber threats, techniques like this will help us keep up with the adversaries by increasing the effectiveness of our proactive security tools.

