A Comparative Study of Two-Stage Intrusion Detection Using Modern Machine Learning Approaches on the CSE-CIC-IDS2018 Dataset

Abstract

Intrusion detection is a critical component of cybersecurity, enabling timely identification and mitigation of network threats. This study proposes a novel two-stage intrusion detection framework using the CSE-CIC-IDS2018 dataset, a comprehensive and realistic benchmark for network traffic analysis. The research explores two distinct approaches: the stacked autoencoder (SAE) approach and the Apache Spark-based (ASpark) approach. Each of these approaches employs a unique feature representation technique. The SAE approach leverages an autoencoder to learn non-linear, data-driven feature representations. In contrast, the ASpark approach uses principal component analysis (PCA) to reduce dimensionality and retain 95% of the data variance. In both approaches, a binary classifier first identifies benign and attack traffic, generating probability scores that are subsequently used as features alongside the reduced feature set to train a multi-class classifier for predicting specific attack types. The results demonstrate that the SAE approach achieves superior accuracy and robustness, particularly for complex attack types such as DoS attacks, including SlowHTTPTest, FTP-BruteForce, and Infilteration. The SAE approach consistently outperforms ASpark in terms of precision, recall, and F1-scores, highlighting its ability to handle overlapping feature spaces effectively. However, the ASpark approach excels in computational efficiency, completing classification tasks significantly faster than SAE, making it suitable for real-time or large-scale applications. Both methods show strong performance for distinct and well-separated attack types, such as DDOS attack-HOIC and SSH-Bruteforce. This research contributes to the field by introducing a balanced and effective two-stage framework, leveraging modern machine learning models and addressing class imbalance through a hybrid resampling strategy. The findings emphasize the complementary nature of the two approaches, suggesting that a combined model could achieve a balance between accuracy and computational efficiency. This work provides valuable insights for designing scalable, high-performance intrusion detection systems in modern network environments.

Description

Keywords

intrusion detection, stacked autoencoder, apache spark, machine learning, principal component analysis, cybersecurity, CSE-CIC-IDS2018

Citation

Hewapathirana, I. U. (2025). A Comparative Study of Two-Stage Intrusion Detection Using Modern Machine Learning Approaches on the CSE-CIC-IDS2018 Dataset. Knowledge, 5(1), 6. https://doi.org/10.3390/knowledge5010006

Endorsement

Review

Supplemented By

Referenced By