Machine learning methods are often presented by developers of security solutions as a silver bullet, or a magic catch-all technology that will protect users from a huge range of threats. But just how justified are these claims? Unless explanations are provided as to where and how exactly these technologies are used, these assertions appear to be little more than a marketing ploy.
For many years, machine learning technology has been a working component of Kaspersky Lab’s security products, and our firm belief is that they must not be seen as a super technology capable of combating all threats. Yes, they are a highly effective protection tool, but just one tool among many. My colleague Alexey Malanov even made the point of writing an article on the Myths about machine learning in cybersecurity.
At Kaspersky Lab, machine learning can be found in a number of different areas, especially when dealing with the interesting task of spam detection. This particular task is in fact much more challenging than it appears to be at first glance. A spam filter’s job is not only to detect and filter out all messages with undesired content but, more importantly, it has to ensure all legitimate messages are delivered to the recipient. In other words, type I errors, or so-called false positives, need to be kept to a minimum.
Another aspect that should not be forgotten is that the spam detection system needs to respond quickly. It must work pretty much instantaneously; otherwise, it will hinder the normal exchange of email traffic.
A graphic representation can be provided in a project management triangle, only in our case the three corners represent speed, absence of false positives, and the quality of spam detection; no compromise is possible on any of these three. If we were to go to extremes, for example, spam could be filtered manually – this would provide 100% effectiveness, but minimal speed. In another extreme case, very rigid rules could be imposed, so no email messages whatsoever would pass – the recipient would receive no spam and no legitimate messages. Yet another approach would be to filter out only known spam; in that case, some spam messages would still reach the recipient. To find the right balance inside the triangle, we use machine learning technologies, part of which is an algorithm enabling the classifier to pass prompt and error-free verdicts for every email message.
How is this algorithm built? Obviously, it requires data as input. However, before data is fed into the classifier, is must be cleansed of any ‘noise’, which is yet another problem that needs to be solved. The greatest challenge about spam filtration is that different people may have different criteria for deciding which messages are valid, and which are spam. One user may see sales promotion messages as outright spam, while another may consider them potentially useful. A message of this kind creates noise and thus complicates the process of building a quality machine learning algorithm. Using the language of statistics, there may be so-called outlier values in the dataset, i.e., values that are dramatically different from the rest of the data. To address this problem, we implemented automatic outlier filtration, based on the Isolation Forest algorithm customized for this purpose. Naturally, this removes only some of the noise data, but has already made life much easier for our algorithms.
After this, we obtain data that is practically ‘clean’. The next task is to convert the data into a format that the classifier can understand, i.e., into a set of identifiers, or features. Three of the main types of features used in our classifier are:
- Text features – fragments of text that often occur in spam messages. After preprocessing, these can be used as fairly stable features.
- Expert features – features based on expert knowledge accumulated over many years in our databases. They may be related to domains, the frequency of headers, etc.
- Raw features. Perhaps the most difficult to understand. We use parts of the message in their raw form to identify features that we have not yet factored in. The message text is either transformed using word embedding or reduced to the Bag-of-Words model (i.e., formed into a multiset of words which does not account for grammar and word order), and then passed to the classifier, which autonomously identifies features.
All these features and their combinations will help us in the final stage – the launch of the classifier.
What we eventually want to see is a system that produces a minimum of false positives, works fast and achieves its principal aim – filtering out spam. To do this, we build a complex of classifiers, and it is unique for each set of features. For example, the best results for expert features were demonstrated by gradient boosting – the sequential building up of a composition of machine learning algorithms, in which each subsequent algorithm aims to compensate for the shortcomings of all previous algorithms. Unsurprisingly, boosting has demonstrated good results in solving a broad range of problems involving numerical and category features. As a result, the verdicts of all classifiers are integrated, and the system produces a final verdict.
Our technologies also take into account potential problems such as over-training, i.e., a situation when an algorithm works well with a training data sample, but is ineffective with a test sample. To preclude this sort of problem from occurring, the parameters of classification algorithms are selected automatically, with the help of a Random Search algorithm.
This is a general overview of how we use machine learning to combat spam. To see how effective this method is, it is best to view the results of independent testing.