Integrated Cluster-based Rule Induction Mining of Temporal Data for Time-Series Analysis

Objectives: To deliver better performance than the existing Mining Comprehensible Classification Rules for Time-Series (MCCR-TS) method, using the zoo dataset collected from the UCI repository, comprising more than 1000 records. Methods/Analysis: Research on time-series data has been evolving as a new trend owing to the wide range of applications it supports. One widely researched topic is web usage mining and the effective evaluation of time-series movements and the transactions associated with them. Most earlier studies focused on identifying time-series data from the entire logs. However, patterns obtained this way may not be accurate enough for evaluation, because the differentiated behaviors of user patterns are not taken into consideration. Findings: In this paper, we examine time series for temporal data using integrated cluster-based rule induction mining, by presenting a novel algorithm, namely, Integrated Cluster-based Rule Induction Mining for Time-Series analysis (ICRIM-TS). The algorithm discovers temporal data, and the similarities between users are evaluated by the proposed time-series measure. To the best of our knowledge, this is the first work on mining and identifying user behaviors from time-series data that gives preference to both user relations and temporal data at the same time. Improvement: Through experimental evaluation under various settings, the performance of web usage mining is evaluated in terms of user precision, error threshold value, and number of rules extracted.


Keywords: Cluster-based Rule Induction Mining, Temporal Data, Time-series Analysis, User Relations, Web Usage Mining

Introduction
Temporal data mining addresses tasks such as classification, clustering, segmentation, forecasting, and prediction. Web mining is conventionally divided into three sub-groups, as represented in Figure 1. Web content mining deals with the actual textual and multimedia information in pages, while web structure mining concentrates on the structure of web documents. Conventionally, information obtained by combining web content with web structure has been widely used for searching and for ranking the pages returned for a specific query. The third kind of web data, web usage, reveals the surfing behavior of users and has received a great deal of attention for different types of applications. Owing to the complexity involved in analyzing different user behaviors, web usage mining has attracted much attention in recent times as a means of understanding human behavior.
Web usage mining is further divided into three categories based on the nature of the data. The first, web server data, relates to the user logs accumulated at the web server, which comprise the IP address from which the request was placed, the timestamp, and the URIs of the requested and referral documents. The second, the application server log, is created dynamically by different application servers. Finally, the third, the application-level log, is provided by the user for a specific application.
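As an illustration of the first category, a server-log entry of the kind described above (IP address, timestamp, requested URI, referral document) can be extracted from an access-log line as in the following sketch. The sample line, regular expression, and field names are illustrative assumptions and not part of ICRIM-TS.

```python
import re

# Illustrative pattern for a combined-log-format entry; the field names are
# assumptions for this sketch, not part of the ICRIM-TS specification.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) \S+" (?P<status>\d+) \S+ '
    r'"(?P<referrer>[^"]*)"'
)

def parse_log_line(line):
    """Extract IP, timestamp, requested URI and referrer from one access-log line."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

# Hypothetical access-log line, for illustration only.
entry = parse_log_line(
    '10.0.0.1 - - [12/Mar/2016:10:15:32 +0000] '
    '"GET /zoo/hours.html HTTP/1.1" 200 512 "http://example.com/index.html"'
)
```

Each parsed entry then supplies exactly the fields (IP, timestamp, URIs) that the pre-processing phase operates on.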
In this paper, we describe a new approach to discover and apply Integrated Cluster-based Rule Induction Mining (ICRIM) for time series. Our approach combines the following properties in a unique way.
(i) The ICRIM framework incorporates a clustering model and an induction-based decision rule model. (ii) The evolutionary clustering model discovers web data clusters, analyzes visitor behavior on the web site, and optimally segregates similar user interests. (iii) Rule induction mining generates inferences and uncovers implicit, hidden behavioral aspects of web usage by investigating the web server and client logs. (iv) The framework achieves flexibility through time-series analysis under varied timing constraints, using efficient time-series segmentation techniques.
To achieve the properties mentioned above, ICRIM uses the Fuzzy C-Means (FCM) algorithm, the back-propagation algorithm, and a rule induction mining technique based on a tree structure, sorting instances down the tree from the root node to a leaf node, as illustrated in Figure 1. In addition, we derive a novel similarity measure for time series using sliding window segmentation, to manage the complexity involved in clustering and to avoid the "curse of dimensionality" problem. The remainder of the article is structured as follows. In the rest of this section, related work is discussed. The novel idea for time-series analysis, which we call ICRIM-TS, is introduced in Section 2. Section 3 presents a detailed analysis of experiments conducted using the zoo dataset derived from the UCI repository. Finally, Section 4 summarizes the results and discussion and gives an outlook on future work.
With the advancement of hardware and communication technologies, time series are gaining more popularity than ever in fields ranging from data processing to intrusion monitoring and anomaly detection. A time series in integrated cluster and induction rule mining represents a collection of data values gathered at definite intervals of time to determine various characteristics of an entity.
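As a rough illustration of the FCM clustering component named above, the following is a minimal one-dimensional Fuzzy C-Means sketch. It is not the paper's Java implementation; the deterministic initialization, fuzzifier m = 2, iteration count, and sample data are all assumptions made for this sketch (it also assumes c >= 2).

```python
def fuzzy_c_means(points, c=2, m=2.0, iters=50):
    """Minimal 1-D Fuzzy C-Means; returns (centers, membership matrix u). Assumes c >= 2."""
    # Deterministic initialization: spread the initial centers across the sorted data.
    srt = sorted(points)
    centers = [srt[k * (len(srt) - 1) // (c - 1)] for k in range(c)]
    u = [[0.0] * c for _ in points]
    for _ in range(iters):
        # Membership update: u[k][i] = 1 / sum_j (d_ki / d_kj)^(2/(m-1))
        for k, x in enumerate(points):
            for i in range(c):
                d_ki = abs(x - centers[i]) or 1e-12  # guard against zero distance
                u[k][i] = 1.0 / sum(
                    (d_ki / (abs(x - centers[j]) or 1e-12)) ** (2.0 / (m - 1.0))
                    for j in range(c)
                )
        # Center update: weighted mean of the points under memberships raised to m.
        for i in range(c):
            w = [u[k][i] ** m for k in range(len(points))]
            centers[i] = sum(wk * x for wk, x in zip(w, points)) / sum(w)
    return centers, u

# Two well-separated 1-D groups of (hypothetical) log measurements.
pts = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
centers, u = fuzzy_c_means(pts, c=2)
```

Each point ends up with a membership degree in every cluster (the rows of u sum to 1), which is what lets the framework segregate overlapping user interests softly rather than with hard assignments.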
Three different approaches, polynomial, Discrete Fourier Transform (DFT), and probabilistic, to identify unknown values and answer similarity queries on the basis of predicted data were proposed 1 . An efficient algorithm using a suffix tree as the underlying data structure was presented 2 to detect all three types of periodicity patterns in data mining; the author presents an algorithm to detect symbol, sequence (partial), and segment (full-cycle) periodicity in time series. Periodicity detection in time-series data 3 is a challenging problem of great importance in many applications. While most work had focused on mining synchronous periodic patterns, the authors' focus was on presenting a more flexible model of asynchronous periodic patterns using a two-phase algorithm. They show that this algorithm not only provides linear time complexity with respect to the length of the sequence but also achieves space efficiency.
The structural analysis of temporal data and the prediction of future data values on the basis of time series are among the most important problems data analysts face in web mining. A new approach was presented to forecast the behavior of time series 4 based on the similarity of pattern sequences, using clustering techniques to group and label the samples of a given dataset. A novel weighted consensus function 5 using clustering validation techniques was used to define initial partitions, which are further divided into consensus partitions from numerous perspectives. An analysis of an incremental system 6 for clustering streaming time series was presented; it comprises the Online Divisive-Agglomerative Clustering (ODAC) system, which continuously maintains a tree-like hierarchy of clusters on the basis of a top-down strategy.
Numerous techniques have been designed to extract complex classification rules from time series. A technique 7 for temporal data mining based on classification rules that can easily be understood by domain experts was provided, using two techniques: time-series segmentation and segment representation. A rule-based classifier using a generalized Fuzzy-Rough Set (FRS) framework 8 was built with heuristic algorithms that find optimal attribute values. The use of temporal fuzzy chains 9 for designing dynamical systems was also discussed. An unsupervised ensemble learning approach 10 to time-series clustering, integrating Rival-Penalized Competitive Learning (RPCL) networks with different forms of time series, was presented. The evaluation of web mining 11 for time-series analysis was performed based on precision and the number of rules extracted, using time-series segmentation techniques. The proposed work is explained in Section 2.

Materials and Methods
Integrated cluster-based rule induction 12 web usage mining for time series involves the analysis of web log data to identify and represent time-series data.
It also obtains similarity values derived from the patterns, which improve page pre-fetching. The proposed ICRIM-TS provides a method for segmenting the time series and evaluating the similarity values used to extract useful information from these models. The framework of ICRIM-TS, shown in Figure 2, involves three major phases: (1) pre-processing of the access log file, (2) pattern discovery for time-series data, and (3) pattern analysis of access log files. The discovered sequential patterns are also applied to web log data.

Pre-processing of Access Log
In the first phase, pre-processing of the access log file is performed on the raw series obtained as input. This raw data may be noisy because of the errors involved in collecting it, so the input data is smoothed to remove the noise. Our work uses a weighted average method, where WT denotes the window time size and DistWT denotes the distance in time between two successive logs. The first step in the pre-processing phase of ICRIM-TS consists of selecting the window size and the distance between two logs. The second step evaluates the previous smoothed logs, the current value, and the average log obtained from the web server.
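The two pre-processing steps can be sketched as follows. Since the exact weighting scheme is not fully specified above, this sketch assumes inverse time-distance weights (1 / (1 + DistWT)) over the previous WT smoothed logs, with the current raw value given weight 1; the timestamps and spike value are hypothetical.

```python
def smooth_logs(values, times, WT=3):
    """Weighted-average smoothing of raw log values.

    Assumption for this sketch: each smoothed value is the weighted average of
    the current raw value (weight 1) and the previous WT smoothed logs, each
    weighted by 1 / (1 + DistWT), where DistWT is its time distance from now.
    """
    smoothed = []
    for i, (v, t) in enumerate(zip(values, times)):
        # Previous smoothed logs that fall inside the window of size WT.
        window = [(smoothed[j], times[j]) for j in range(max(0, i - WT), i)]
        num, den = v, 1.0  # current value carries weight 1
        for sv, st in window:
            w = 1.0 / (1.0 + (t - st))
            num += w * sv
            den += w
        smoothed.append(num / den)
    return smoothed

raw = [10.0, 50.0, 12.0, 11.0, 13.0]  # 50.0 plays the role of a noisy spike
out = smooth_logs(raw, times=[0, 1, 2, 3, 4], WT=3)
```

Under these assumed weights, the spike at the second position is pulled strongly toward the neighboring smoothed values, which is the intended noise-removal effect.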

Pattern Discovery for Time-series Data
The second phase of ICRIM-TS discovers patterns in the pre-processed log files produced in 3.1. Pattern discovery in ICRIM-TS relies on segmentation techniques. Among the existing segmentation algorithms, ICRIM-TS uses the sliding-window method to discover similar patterns, as it is suitable for segmenting time series obtained in real time. Two important threshold factors control the segmentation technique: the Error Tolerance Value (ETV), which corresponds to the allowed information loss, and the cost of a segment. An ETV of 0 implies no information loss, whereas a higher ETV permits correspondingly more loss. The other factor evaluated during pattern discovery is the cost of a segment, derived from the least-squares error.
A simple dataset of log files consists of m values (a_i, b_i), i = 1, 2, ..., m, where a_i denotes the independent variable and b_i denotes the dependent variable, whose value is obtained through the web servers. The cost of segmentation CSeg is obtained through the least-squares method, when CSeg is minimum, as in equation 3.2:

CSeg = sum_{i=1}^{m} R_i^2     (3.2)

The squared residual R_i^2 is the squared difference between the dependent variable b_i and the value predicted by the fitted model f(a_i, beta), where beta denotes the model parameters:

R_i^2 = (b_i - f(a_i, beta))^2     (3.3)

The pattern discovery on the access log file based on equation 3.3 is described by the pseudocode illustrated in Figure 4.
Pattern discovery in ICRIM-TS proceeds in three steps. The first step selects the error threshold value; the second evaluates the cost of each segment; and the third iterates this process until all the log files have been accessed and similar patterns have been discovered.
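The three steps above can be sketched as follows, assuming a least-squares straight-line fit f(a_i, beta) per segment: each window grows until its residual cost exceeds the error tolerance, at which point a segment boundary is emitted. The sample data and ETV value are illustrative.

```python
def segment_cost(points):
    """Least-squares cost of fitting one line b = beta0 + beta1 * a to points (a_i, b_i)."""
    n = len(points)
    ma = sum(a for a, _ in points) / n
    mb = sum(b for _, b in points) / n
    saa = sum((a - ma) ** 2 for a, _ in points)
    beta1 = sum((a - ma) * (b - mb) for a, b in points) / saa if saa else 0.0
    beta0 = mb - beta1 * ma
    # CSeg = sum of squared residuals R_i^2 (equation 3.2)
    return sum((b - (beta0 + beta1 * a)) ** 2 for a, b in points)

def sliding_window_segments(series, etv):
    """Grow each segment until its least-squares cost exceeds the Error Tolerance Value."""
    segments, start = [], 0
    for end in range(2, len(series) + 1):
        if segment_cost(series[start:end]) > etv:
            segments.append(series[start:end - 1])  # close the segment before the breach
            start = end - 1
    segments.append(series[start:])  # remaining tail forms the last segment
    return segments

# Hypothetical (time, value) pairs: a rising trend followed by a falling one.
data = [(0, 0.0), (1, 1.0), (2, 2.0), (3, 3.0), (4, 0.0), (5, -3.0), (6, -6.0)]
segs = sliding_window_segments(data, etv=0.5)
```

With ETV = 0.5, the break is placed where the single-line fit first fails, splitting the series into its two linear runs; raising the ETV merges segments and thus trades information loss for fewer segments, exactly the trade-off described above.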

Pattern Analysis of Access Log Files
The third phase analyzes the patterns to identify user interest in the log data 13 collected at a web server, on the basis of time series. A recent trend in web mining for time series is to convert a numeric time series into a symbolic representation using the Symbolic Aggregate approXimation (SAX) algorithm, which requires less storage space. The distance between the symbolic strings A' and B' of two time series is defined in equation 3.4 as

Dist(A', B') = sqrt(n / w) * sqrt( sum_{i=1}^{w} (dist(a'_i, b'_i))^2 )     (3.4)

where n is the length of the original series, w is the number of segments, and the dist function is evaluated using the lookup table for the particular log files. The pattern analysis of the access log file is described by the pseudocode illustrated in Figure 5. In the first step, the time-series log files from the web server are normalized. In the second step, the dimensionality of the time series is reduced using Piecewise Aggregate Approximation (PAA), which divides the series into equal-sized segments. In the third step, the PAA representation is discretized; this is done by identifying the number and location of the breakpoints, determined from statistical lookup tables. The last step evaluates the similarity measure given in equation 3.4.
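The first three steps of this phase (normalization, PAA reduction, breakpoint discretization) can be sketched as follows. The alphabet size of 4 and its standard-normal quartile breakpoints are illustrative choices, and the sketch assumes w divides the series length evenly.

```python
import math

# Breakpoints for an alphabet of size 4 under the standard normal distribution
# (the statistical lookup table used for SAX discretization).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def znormalize(series):
    """Step 1: normalize the series to zero mean and unit variance."""
    mu = sum(series) / len(series)
    sd = math.sqrt(sum((x - mu) ** 2 for x in series) / len(series)) or 1.0
    return [(x - mu) / sd for x in series]

def paa(series, w):
    """Step 2: Piecewise Aggregate Approximation to w equal-sized segment means.
    Assumes w divides len(series)."""
    n = len(series)
    return [sum(series[i * n // w:(i + 1) * n // w]) / (n // w) for i in range(w)]

def sax(series, w):
    """Step 3: discretize the PAA means into a symbolic word over ALPHABET."""
    word = ""
    for v in paa(znormalize(series), w):
        idx = sum(1 for b in BREAKPOINTS if v > b)  # which breakpoint bin v falls in
        word += ALPHABET[idx]
    return word

# A hypothetical log series: a low plateau followed by a high plateau.
word = sax([1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 4.8, 5.1], w=2)
```

The resulting word is a compact symbolic string; step 4 then compares two such words with equation 3.4, looking up dist for each symbol pair in the breakpoint table.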

Results and Discussion
The proposed ICRIM for time series, evaluated on user precision and the inductive rules generated, is implemented in Java. The experiments were run on an Intel Pentium IV machine with 2 GB of memory and a 3 GHz dual-processor CPU. In this section, performance evaluation tests compare the most important properties of the time series with respect to precision, inductive rules generated, and error threshold value.
The ICRIM-TS framework is evaluated using the Zoo dataset with more than 1000 records. For example, in the Zoo dataset, the number of persons visiting the zoo is evaluated on the basis of working days and hours, and different time series are evaluated with a detailed comparison. The following section discusses the results of the ICRIM-TS framework.
In this work, we have seen how web mining for ICRIM-TS is designed for time-series temporal data 14 to capture user behaviors for both user relations and temporal data, implemented in mainstream languages such as Java. We ran independent tests with a growing number of web server and client logs and a constant number of service requests sent by each user to the web services 15 . The performance graphs and tables describe the evaluation of ICRIM-TS using Sliding Window Segmentation (SWS).
Table 1 describes the performance of ICRIM-TS using SWS in terms of precision rate. Figure 6 shows the precision obtained for the web client and server logs observed from the web server using time series. Varied numbers of transactions are used in the experiments to validate ICRIM-TS using segmentation techniques. ICRIM-TS using Sliding Window Segmentation (SWS) is compared with the existing MCCR-TS based on the Gaussian Mixture Model (GMM), measured in terms of precision. As the number of transactions obtained from the zoo dataset increases, the precision obtained is higher for ICRIM-TS using SWS than for the existing GMM-based approach. The performance graph of the proposed ICRIM-TS using SWS is shown in Figure 6. The precision for similarity retrieval is 10-15% higher in the proposed ICRIM-TS using SWS. Table 2 describes the performance of ICRIM-TS using SWS in terms of inductive rules. Figure 7 shows the inductive rules generated using the zoo dataset. Different data sizes are used in the experiments to validate ICRIM-TS using SWS. ICRIM-TS using SWS is compared with the existing MCCR-TS based on GMM, measured in terms of inductive rules. As the data size increases, the number of inductive rules generated also increases, and the information obtained for similarity retrieval is higher in ICRIM-TS using SWS than in the existing MCCR-TS. The performance graph of the proposed ICRIM-TS using SWS is shown in Figure 7. The number of inductive rules generated for identifying similarity retrieval is 12-15% higher in the proposed ICRIM-TS using SWS. Table 3 describes the performance of the proposed ICRIM-TS using SWS in terms of the Error Threshold Value. An important metric for ICRIM-TS using SWS is the allowed ETV, which indicates the information loss tolerated in order to address the "curse of dimensionality".
An ETV of 0 symbolizes that no information loss occurs; a higher ETV increases the information loss incurred during similarity retrieval. Figure 8 shows the number of decisive rules obtained using sliding window segmentation. ICRIM-TS using SWS is compared with the existing MCCR-TS based on GMM, measured in terms of the Error Threshold Value (ETV). An increase in the number of decisive rules results in a higher ETV, but compared with the existing MCCR-TS model, the ETV is lower owing to the application of SWS. The performance graph of the proposed ICRIM-TS using SWS is shown in Figure 8. The ETV for obtaining the similarity measure is about 20% lower in the proposed ICRIM-TS using SWS.
Finally, it is observed that the proposed ICRIM-TS measures the similarity values obtained during different time intervals on the basis of temporal data. SWS handles the ETV and the segmentation cost effectively, and the similarity search over the web server and client logs is performed efficiently, optimally segregating user interests while addressing the curse of dimensionality.

Conclusion
In this paper, ICRIM-TS is efficiently implemented using the Sliding Window Segmentation model to identify user interest by evaluating the web server and client logs. The transactions are accessed to perform the similarity measure for different numbers of transactions while addressing the curse of dimensionality. Integrated Cluster-based Rule Induction Mining using temporal data for time-series analysis is measured with metrics such as precision, inductive rules generated, and allowed Error Threshold Value to test the effectiveness of information retrieval. The results showed that the proposed ICRIM-TS using Sliding Window Segmentation performs 70% better on users' tasks in the information retrieval process over the web server and client logs available in the web server.