Apriori Algorithm Application on the Prevalence of Computer Malware

Objective: The study aims to identify the characteristics and sources of computer malware. Methods: A data mining technique, specifically the Apriori algorithm, was used to determine the characteristics, types, and sources of malware once it infiltrates the computer system. Data were gathered through a survey questionnaire administered via Google Forms, which two hundred five (205) IT students answered. Findings: The analysis shows that the Apriori algorithm generates association rules on computer malware with 98% accuracy. The rules indicate that computer malware spreads quickly through the use of flash drives and that the most common malware infecting the computer laboratory is a virus. Application/Improvements: The computer laboratory policy should include a clear provision on the use of flash drives inside the laboratory to avoid spreading computer malware that can damage computer hardware and software. Additionally, a comprehensive study on the use of the Apriori algorithm is recommended.


Introduction
In this modern age, data grows in complexity and volume. Even small datasets may hold essential information waiting to be uncovered through the data mining process. Data mining is a technique used to interpret large data and elucidate hidden patterns and information 1. Further, data mining searches for patterns and relationships in large databases 2. Data mining is widely used in the medical field, banking firms, and the food and drug sector to ensure customer safety by using artificial intelligence analysis, usually applied to large-scale datasets 3. Moreover, hidden patterns and relationships are revealed through data mining functions such as clustering, classification, prediction, and association. One essential task in data mining is association rule mining. The association rule was first introduced in 1993 as a data mining technique that identifies and extracts frequent patterns, associations, correlations, and relationships among sets of items in databases 2. For example, association rules have been used to analyze customer buying habits and behavior. The goal of market basket analysis is to identify the products that customers frequently purchase together. If the customer buys Item A (apple, sandwich), then Item B (drinks) will most likely be purchased as well. This information can be used to organize and display the products close to each other, making them more accessible for the customer to purchase 4,5. Association rule mining has also been applied in microeconomics to product selection, based on the parameters that retailers frequently apply to support their product selection decisions 6. The results showed that the model was capable of identifying cross-selling effects implicitly by using frequent item sets, instead of having to calculate cross-selling parameters explicitly. An essential characteristic of association rule mining is that it separates the mining problem into sub-problems for efficient computing.
One sub-problem is finding frequent item sets from the database, and the other is generating association rules from those frequent item sets 7,8. Association rule mining with the Apriori algorithm has also been used to elicit significant information in the medical field, specifically about patients with diabetic conditions. It is vital to note that medical information may be categorical, quantitative, or Boolean. Data mining with association rules as described is concerned only with extracting meaning from Boolean data 9.
Moreover, association rule mining can be used to identify computer malware that commonly destroys files and computer hardware. Computer malware is a computer program designed to damage computer systems and applications 10. Computer malware comes in different forms, including spyware, ransomware, viruses, worms, Trojan horses, adware, or any malicious code that infiltrates a computer. According to Panda Security research, about 230,000 new malware samples are produced every day, and the number is predicted to keep increasing each year 11. As technology advances, hackers use email to collect information and to generate money through ransomware attacks. However, according to a researcher from Erlangen-Nuremberg University, people are not cautious about the effect of opening unknown links sent through email. About 78% of people still open unknown email or spam delivered by an unknown sender 12. Furthermore, 81% of internet users have become victims of a data breach, and internet users do not have a system that automatically detects data breaches 13.
In the Philippines, the Department of Information and Communications Technology (DICT) formulated the National Cybersecurity Plan 2022 to ensure the safety of every Filipino in using the Internet. The primary goals of this Plan are as follows: (1) to continually provide the cooperation of public and military networks, (2) to implement a quick response to cybersecurity threats during and after an attack, (3) to coordinate efficiently with law enforcement agencies, and (4) to educate society about cybersecurity 14. Access to the Internet is a fundamental element in improving the quality of education. Some State Universities and Colleges (SUCs) have acquired internet services to provide better training. Instructors utilize the Internet to access online materials to supplement new learning, and students have a wide range of access to learning, not just at the library but also through virtual libraries on the Internet. Interactive teaching methods, supported by the Internet, enable teachers to pay more attention to individual students' needs and support shared learning.
Moreover, access to the Internet helps educational administrators reduce costs and improve the quality of schools and colleges 15. Having internet access at school also entails great responsibility in accessing information. Teachers and students should know the possible threats attached to the Internet. Some emails contain computer malware; once the malware penetrates the computer system, it can affect the computers of other users 16. Therefore, educating users about computer malware is essential to avoid becoming a victim of it.
There are many association rule algorithms, such as the Apriori, Eclat, and FP-growth algorithms 17. This study utilizes association rule mining, specifically the Apriori algorithm, to raise students' awareness of how computer malware affects the computer and destroys the data and files that students store. It likewise identifies the characteristics and sources of computer malware.

Methodology
The study uses Knowledge Discovery in Databases (KDD). KDD is a process that unveils hidden knowledge in large databases 18. The study follows the KDD process shown in Figure 1. Data were collected through a survey questionnaire created using Google Forms. The researcher asked permission from the teacher in charge to conduct the study and distributed the Google Forms link so that respondents could answer the survey questionnaire. A total of 205 out of 322 Information Technology students, or 63.66%, participated in the study. Table 1 shows the total respondents per year level.

Preprocessing
The preprocessing stage improves data reliability by removing attributes that are insignificant to the data mining process, as presented in Table 2. For example, Age, Year Level, and Knowledge in computer virus information were found irrelevant to identifying computer malware and were discarded. Moreover, multiple answers under Characteristics of Malware and Sources of Malware were broken into individual responses and placed in one column.
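The preprocessing step described here can be illustrated with a minimal Python sketch. It is not the authors' procedure: the sample rows are made up, the `Type of Malware` column name and the `;` separator between multiple answers are assumptions, and `preprocess` and `to_long` are hypothetical helper names.

```python
# Attributes the study discarded as irrelevant to identifying malware.
DROPPED = {"Age", "Year Level", "Knowledge in computer virus information"}
# Attributes that may hold several answers in one cell.
MULTI_ANSWER = ("Characteristics of Malware", "Sources of Malware")


def preprocess(rows, sep=";"):
    """Drop irrelevant attributes and split multi-answer cells into lists."""
    cleaned = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in DROPPED}
        for col in MULTI_ANSWER:
            # "a; b" becomes ["a", "b"]; empty fragments are ignored.
            parts = kept.get(col, "").split(sep)
            kept[col] = [p.strip() for p in parts if p.strip()]
        cleaned.append(kept)
    return cleaned


def to_long(cleaned, col):
    """Place every individual response of one attribute in a single column."""
    return [answer for row in cleaned for answer in row[col]]


# Made-up survey rows, for illustration only.
sample_rows = [
    {"Age": "19", "Year Level": "2nd", "Type of Malware": "Virus",
     "Characteristics of Malware": "Slows down computer; Deletes files",
     "Sources of Malware": "Flash drive"},
    {"Age": "20", "Year Level": "3rd", "Type of Malware": "Worm",
     "Characteristics of Malware": "Slows down computer",
     "Sources of Malware": "Flash drive; Email"},
]

cleaned = preprocess(sample_rows)
sources_column = to_long(cleaned, "Sources of Malware")
```

Here `sources_column` holds one response per entry, which is the shape needed before the answers are translated into codes.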

Transforming
In the transformation stage, the researcher carefully translated the answers of the respondents into codes. After the data in Table 3 had been cleaned, the resulting ARFF file was loaded into the Weka application.
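As a rough illustration of this stage, the sketch below translates answers into short codes and renders them as ARFF text with nominal attributes, the form Weka's association tools typically expect. The coding scheme (`S1`, `T1`, ...), the attribute names, the relation name, and the sample answers are all assumptions; the study's actual codes are those in its Table 3.

```python
def build_codebook(answers, prefix):
    """Assign a short code (e.g. S1, S2) to each distinct answer, in first-seen order."""
    codebook = {}
    for answer in answers:
        if answer not in codebook:
            codebook[answer] = f"{prefix}{len(codebook) + 1}"
    return codebook


def to_arff(relation, columns):
    """Render coded nominal columns as ARFF text.

    `columns` maps an attribute name (no spaces) to its list of coded values;
    every list must have one entry per respondent.
    """
    lines = [f"@relation {relation}", ""]
    for name, values in columns.items():
        domain = ",".join(sorted(set(values)))
        lines.append(f"@attribute {name} {{{domain}}}")
    lines += ["", "@data"]
    for record in zip(*columns.values()):
        lines.append(",".join(record))
    return "\n".join(lines) + "\n"


# Made-up answers, for illustration only.
sources = ["Flash drive", "Email", "Flash drive"]
types = ["Virus", "Worm", "Virus"]

source_codes = build_codebook(sources, "S")
type_codes = build_codebook(types, "T")

arff_text = to_arff("malware_survey", {
    "source": [source_codes[s] for s in sources],
    "type": [type_codes[t] for t in types],
})
```

Writing `arff_text` to a file with an `.arff` extension would give a file that can be opened in Weka; keeping the codebooks makes it possible to translate the mined rules back into the original answers.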

Data Mining
The researcher utilized the Apriori algorithm. Apriori is a seminal algorithm for finding frequent itemsets. The items within each transaction are sorted in lexicographic order 8. Association rule mining: Let $I = \{a_1, a_2, a_3, \ldots, a_n\}$ be the set of attributes, called items. Let $TD = \{Dt_1, Dt_2, Dt_3, \ldots, Dt_n\}$ be the transaction database. Each transaction in TD has a unique transaction ID and contains a subset of the items in I. A rule X ⇒ Y, where X and Y are disjoint subsets of I, means that if a transaction in TD contains X, then it also contains Y 19. The item sets X and Y are called the antecedent and consequent of the rule, respectively.
The support supp(X) of an item set X is defined as:

$$\mathrm{supp}(X) = \frac{\text{number of transactions in which } X \text{ appears}}{\text{total number of transactions}}$$

Apriori Algorithm: Pseudocode
• Join Step: C_k is generated by joining L_{k-1} with itself.
• Prune Step: Any (k-1)-item set that is not frequent cannot be a subset of a frequent k-itemset.

C_k: candidate item set of size k
L_k: frequent item set of size k

L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t
    L_{k+1} = candidates in C_{k+1} with min_support
end
return ∪_k L_k;
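The support formula and the join and prune steps above can be sketched in Python as follows. This is a minimal, unoptimized illustration rather than the Weka implementation used in the study; the function names, the item labels, the four demo transactions, and the 0.5 minimum support are assumptions chosen for the example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def apriori(transactions, min_support):
    """Return {frequent itemset: support} using level-wise candidate generation."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    # L_1: frequent 1-itemsets.
    current = {frozenset([i]) for i in items
               if support(frozenset([i]), transactions) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset, transactions)
        # Join step: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # L_{k+1}: candidates that meet the minimum support.
        current = {c for c in candidates
                   if support(c, transactions) >= min_support}
        k += 1
    return frequent


# Made-up transactions, for illustration only.
demo_transactions = [
    {"flash_drive", "virus"},
    {"flash_drive", "virus", "slow_computer"},
    {"email", "worm"},
    {"flash_drive", "virus"},
]

frequent_itemsets = apriori(demo_transactions, min_support=0.5)
```

With these demo transactions, the item set {flash_drive, virus} appears in 3 of 4 transactions, so its support is 0.75 and it survives the 0.5 threshold along with its two single-item subsets.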