Identificación de "malware" perteneciente a ataques APT mediante la selección de características altamente discriminatorias usando técnicas de "Machine Learning"

  1. Martín Liras, Luis Francisco
Supervised by:
  1. Miguel Ángel Prada Medrano Director
  2. Adolfo Rodríguez de Soto Director

Defence university: Universidad de León

Defence date: 03 February 2023

Committee:
  1. Luis Magdalena Layos Chair
  2. Vicente Matellán Olivera Secretary
  3. Alejandro de la Calle Negro Committee member

Type: Thesis

Abstract

This thesis aims to contribute to the detection of cybersecurity attacks known as “Advanced Persistent Threats (APTs)”. These attacks are characterised by how difficult they are to detect, by their severity, and by the fact that they mainly target organisations such as companies or governmental institutions. Anti-malware software is not always able to identify this type of malware, because it is often disguised as benign software or as generic malware (the kind sent daily to millions of people), and experts are required to detect it. The line of research carried out in this work builds a solution for the identification of APTs through the detection of the malware used in the attack. Several machine learning techniques have allowed us to classify malware according to its likely use in such an attack. After the introductory chapter 1, this thesis describes previous work in chapter 2 and the methodology used throughout the work in chapter 3. Chapter 4 is devoted to the description of the first corpus of data generated, a set of 19,457 malware samples with 1,941 different binary and numerical features. To the author’s knowledge, this is the most complete repository published to date for the purpose of identifying malware belonging to APT attacks. The analysis of the dataset shows that there is a relationship between the APT malware samples. Chapter 5 details the selection of the 238 most discriminative features, which allow APT attack malware to be identified within a set of generic malware samples. The automatic feature selection revealed knowledge about APT-related malware, such as the importance of the functions imported by the malware samples and of the APIs used during their execution in identifying that a sample could belong to an APT. The classification experiments performed on this feature pre-selection yielded very good results, allowing more than 97% of the samples to be detected as APT malware.
Three years after the initial dataset was obtained, a second, smaller dataset with a similar structure was generated from malware and APT samples of this later period. Chapter 7 describes the validation performed on this second dataset, which was obtained independently of the first one. The original model, trained on the first dataset, remained adequate for detecting malware belonging to APTs when validated on the second dataset: the classifiers continued to provide a very high classification accuracy of over 90%. The set of most discriminative features was also recalculated for the new dataset using the same techniques as for the first one. The new feature set was very different from the first, which would indicate that malware samples evolve over time. All of the above suggests that a system for identifying malware pertaining to APT attacks should periodically recalculate this set of features. However, the work carried out allows us to argue that the set of features initially proposed is sufficiently discriminative, even after a long period of time. Moreover, it has been shown that a fixed view of malware, one in which neither malware nor its characteristics evolve, cannot be assumed. On the contrary, the environment is not stationary, owing to the adversarial nature of malware. The characteristics of new malware samples related to APT campaigns undergo some changes (e.g. the packers used or the features that are most important in the new dataset) because they need to evolve in response to advances in malware detection. For this reason, it seems that the classification accuracies can be extrapolated to future malware. Finally, the assessment on a completely new dataset provided insight into new trends in the development of malware that could be investigated in future work.