Proposed Discriminative Lexical Features for Real-time Detection of Malware Uniform Resource Locator

M  Olalere; M  T  Abdullah; R  Mahmod  and A  Abdullah

doi:10.17485/ijst/2016/v9i46/107081


Indian Journal of Science and Technology

Year: 2016, Volume: 9, Issue: 46, Pages: 1-10

Original Article

M. Olalere¹, M. T. Abdullah²*, R. Mahmod² and A. Abdullah²

¹Cyber Security Science Department, Federal University of Technology, Minna, 920102, Niger, Nigeria
²Information Security Research Group, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, 43400, Selangor, Malaysia; [email protected]

*Author for correspondence
M. T. Abdullah, Information Security Research Group, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, 43400, Selangor, Malaysia; [email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objective: To identify discriminative lexical features of malware URLs through manual examination, and to study the prevalence of these features, leading to a proposed set of discriminative lexical features for real-time detection of malware URLs.

Methods/Statistical Analysis: Manual examination of malware URLs drawn from an existing blacklist allowed the authors to identify discriminative lexical features. Empirical analysis, carried out on both the existing blacklisted malware URLs and newly collected malware URLs, was then used to determine whether attackers craft malware URLs in a consistent way. To evaluate the proposed lexical features, two machine learning models used in previous studies were applied to our dataset of malware URLs and benign URLs. Using the same models allows the performance of the proposed lexical features to be compared directly with the feature groups proposed in those studies.

Findings: Our first step was to manually examine blacklisted malware URLs. This step led to the identification of 12 discriminative lexical features, which were later reduced to 11. The second step was an empirical analysis of the identified features in existing blacklisted malware URLs and newly collected malware URLs. The results of this analysis revealed that there is indeed consistency in the way attackers craft malware URLs, which implies that the identified lexical features are common features of malware URLs. The evaluation results show that the proposed lexical features outperform previously proposed feature groups.

Applications/Improvements: Discriminative features are required to build a real-time malware URL detection system with a machine learning algorithm. The proposed lexical features are a set of discriminative features that rely on the textual properties of malware URLs.
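To make the idea of features that "rely on textual properties" of a URL concrete, the following is a minimal sketch of lexical feature extraction. The function name `lexical_features` and every feature in it (lengths, dot and digit counts, an IP-address host check) are illustrative assumptions; the abstract does not enumerate the 11 proposed features, so this should not be read as the authors' feature set.

```python
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Compute a few illustrative textual properties of a URL.

    Hypothetical features for illustration only; they are NOT the
    11 features proposed in the paper.
    """
    # Prepend a scheme if missing so that urlsplit can locate the host.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    path = parts.path
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(path),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_path_segments": len([s for s in path.split("/") if s]),
        # Rough check for a dotted-quad IPv4 host such as 192.168.0.1.
        "has_ip_host": host.replace(".", "").isdigit() and host.count(".") == 3,
    }


if __name__ == "__main__":
    print(lexical_features("http://example.com/a/b.exe"))
```

Because such features need only the URL string, with no page download or DNS lookup, they can be computed quickly, which is what makes a purely lexical approach a plausible fit for real-time detection. The resulting dictionaries could be turned into numeric vectors and passed to any standard classifier.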

Keywords: Attackers, Blacklist, Lexical Features, Malware URL, Real-time Malware URL Detection