Hiding Sensitive Association Rules by Eliminating a Selective Item among R.H.S. Items for each Selective Transaction

This paper focuses on hiding sensitive association rules, an important research problem in privacy preserving data mining. We present an algorithm that decreases the confidence of each sensitive rule to below the minimum confidence threshold by removing a selected item from among the consequent (R.H.S.) items of the rule, for each selected transaction. Finally, we qualitatively compare the efficiency of the proposed algorithm with that of previously published association rule hiding algorithms.


Introduction
Lately, significant advances in data collection and data storage technologies, together with the widespread use of the World Wide Web, have led to huge volumes of data. Data mining has therefore become a technique for automatically and intelligently extracting information or knowledge from large amounts of data. Although it can assist data owners in strategic planning and decision making, it may also reveal sensitive information. In parallel with the development of data mining, a variety of questions have therefore been raised, including whether data sources are used for purposes other than the one for which they were collected. This has opened a new thread in data mining: designing mining systems that remain fast on high volumes of stored data while also preventing the disclosure of sensitive information. For this purpose, privacy preserving data mining has been extensively studied by researchers 4 .
Privacy preserving association rule mining is one of the most significant and widely researched areas of data mining. Association rule mining extracts hidden relations and interesting association structures among large sets of items in transactional databases. Nowadays many organizations and companies keep their data in transactional data sets for processing and for extracting knowledge using association rule mining 5,13 .
In this paper, we focus on privacy preserving association rule mining. We assume that a certain subset of the association rules extracted from a specific dataset is considered sensitive. Our goal is then to modify the original data source in such a way that an adversary cannot mine the sensitive rules from the modified data set, while minimizing the side effects created by the hiding process, since sanitization can influence the original set of rules by (i) hiding non-sensitive rules that could be extracted before sanitization (lost rules) and (ii) creating rules that could not be extracted before sanitization (ghost rules).

Background and Related Work
A few papers entitled Privacy Preserving Data Mining (PPDM) appeared in 2000. While they introduced similar problems, their concepts of privacy were completely different.

Secure Multiparty Computation
Secure multiparty computation encrypts data values 7 , ensuring that no party learns anything about another party's data values. The goal of Secure Multiparty Computation (SMC) is that the parties involved infer nothing but the results 14 .

Obscuring Data
Another approach relies on data obscuration, modifying the data values so that real values are not revealed 1 . A major feature of PPDM techniques is that they entail modifications to the data in order to sanitize them of sensitive information (both private data items and complex data correlations) or to anonymize them with some level of uncertainty. In evaluating a PPDM algorithm it is therefore important to determine the quality of the transformed data. To do so, we need methodologies for estimating both the quality of the data, intended as the state of the individual items in the database resulting from the application of a privacy preserving technique, and the quality of the information that is exposed and extracted from the modified data by a given data mining method 5 . Verykios et al. categorized PPDM techniques along five dimensions: (1) data distribution; (2) data modification; (3) the data mining algorithm for which the privacy preservation technique is designed; (4) the data type (single data items or complex data correlations) that needs to be protected from disclosure; (5) the privacy preservation approach (heuristic, reconstruction or cryptography-based approaches). Clearly, this classification does not cover all possible PPDM algorithms, but it characterizes the algorithms designed and proposed so far by their main features. Data mining discovers inferences that are interesting but do not always hold. Methods have been proposed to alter and modify data so as to bring the support or confidence of specific rules below a threshold 3,12 .
This paper is organized as follows. First, the general problem formulation and the basic definitions of association rule mining are discussed. Then, the proposed algorithm for hiding sensitive association rules is given. Next, the experimental results of the proposed technique are presented. The last section provides the conclusion and future work.

Transactional Databases
A transactional database is a relation consisting of transactions in which each transaction t is an ordered pair, defined as t = <TID, list of items>, where TID is a unique transaction identifier and list of items is the list of items composing the transaction 1 .
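As a minimal illustration, such a database can be represented as a list of <TID, itemset> pairs; the item names below are hypothetical and not taken from the paper's datasets:

```python
# A transactional database as a list of (TID, itemset) pairs,
# following the definition t = <TID, list of items> above.
db = [
    (1, {"bread", "milk"}),
    (2, {"bread", "butter", "milk"}),
    (3, {"butter", "milk"}),
]

def transactions_containing(db, itemset):
    """Return the TIDs of transactions containing every item in itemset."""
    return [tid for tid, items in db if itemset <= items]

print(transactions_containing(db, {"bread", "milk"}))  # [1, 2]
```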

The Basics of Association Rules
Formally, association rules are defined as follows. Let I = {i 1 ,...,i n } be a set of literals, called items. Let D be a database of transactions, where each transaction t is an itemset such that t ⊆ I. A unique identifier, called TID, is associated with each transaction. An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The support of the rule is the percentage of transactions in D that contain X ∪ Y, and its confidence is the percentage of transactions containing X that also contain Y. Association rule mining algorithms depend on support and confidence and mainly have two major phases: (i) depending on a minimum support threshold (MST) set by the user and data owners, frequent itemsets are found through consecutive scans of the database; (ii) strong association rules are extracted from the frequent itemsets, limited by a minimum confidence threshold (MCT), also set by the user and data owners 3,10 .
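These two measures can be sketched in a few lines of Python over a toy database; the items shown are illustrative only:

```python
# Support and confidence of a rule X => Y over a toy database.
db = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(db, itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(db, lhs, rhs):
    """conf(X => Y) = supp(X U Y) / supp(X)."""
    return support(db, lhs | rhs) / support(db, lhs)

# {bread} => {milk} holds in 2 of 4 transactions (support 0.5) and in
# 2 of the 3 transactions containing bread (confidence 2/3).
print(support(db, {"bread", "milk"}))       # 0.5
print(confidence(db, {"bread"}, {"milk"}))  # 0.666...
```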

Side Effects
The data loss (undesirable side effects) resulting from the hiding process is defined by the statements below: 1. If a rule R has conf (R) > MCT before the hiding process and conf (R) < MCT after the sanitization process, then this rule has been lost (hidden).
2. If a rule R has conf (R) < MCT before the hiding process and conf (R) > MCT after the sanitization process, then this rule has been created and discovered (ghost rule).
Clearly, one of the aims of an association rule hiding technique is to limit, as far as possible, the lost rules (among the non-sensitive ones) and the ghost rules 9,12,13 .
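Lost and ghost rules can be identified mechanically by comparing the strong rule sets mined before and after sanitization. The sketch below assumes rules are represented as (L.H.S., R.H.S.) pairs of frozensets; the example rules are hypothetical:

```python
# Classify the side effects of a hiding process by comparing the
# strong rules mined before and after sanitization.

def side_effects(rules_before, rules_after, sensitive):
    """Return (lost_rules, ghost_rules) among the non-sensitive rules."""
    before = set(rules_before) - set(sensitive)
    after = set(rules_after) - set(sensitive)
    lost = before - after    # strong before, no longer strong after
    ghost = after - before   # not strong before, strong after
    return lost, ghost

r = lambda x, y: (frozenset(x), frozenset(y))  # helper: rule x => y
before = {r("a", "b"), r("b", "c"), r("a", "c")}
after = {r("b", "c"), r("c", "d")}
lost, ghost = side_effects(before, after, sensitive={r("a", "c")})
print(lost)   # the lost rule a => b
print(ghost)  # the ghost rule c => d
```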

Proposed Algorithm
The proposed heuristic algorithm tries to hide the sensitive association rules with the least complexity while keeping the adverse side effects of hiding to a minimum.
Each time the algorithm hides a sensitive rule, it preprocesses the transactions of the original dataset and, among all transactions, finds only those that fully support the sensitive rule. A priority is then assigned to each such transaction as follows: • Determine the number of association rules (sensitive, non-sensitive, and negative-border rules that could appear as a result of the deletion operation) supported by the transaction that share at least one item in the right hand side (for sensitive and non-sensitive rules), or at least one item in the left hand side (for negative-border rules), with an item in the R.H.S. of the current sensitive rule. Only these rules matter, because only they have something in common with the current sensitive rule and can be affected by the item elimination. • Determine the sum of the confidences of these common (sensitive, non-sensitive and negative-border) association rules. The smaller this sum is for the common sensitive rules, the sooner they are affected by the item deletion and hidden; the larger it is for the non-sensitive rules, the later they are affected; and the smaller it is for the negative-border rules, the later they are affected. Substituting these quantities into the transaction priority formula yields the priority of each transaction, and the transactions are then sorted so that the highest-priority transaction comes first. Next, among the items on the R.H.S. (right hand side) of the current sensitive rule, the item with the highest priority is selected. Because each item can appear in different (sensitive, non-sensitive and negative-border) association rules with different confidences, the same priority logic used for transactions is also applied to items. As a result, a different item may be selected for each transaction, which prevents the support of any single item from being reduced abruptly.
Selecting, at every confidence-reduction step, the item whose removal causes the fewest side effects (lost rules, ghost rules) keeps the damage low. This process continues until the confidence of the current sensitive rule falls below the MCT threshold. Thus, the algorithm has three main stages: finding and prioritizing the transactions that fully support the current sensitive rule; selecting the highest-priority item among the R.H.S. items of the rule; and removing the selected item, repeating until the rule is hidden.
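The hiding loop can be sketched as follows. Since the exact transaction and item priority formulas are not reproduced in this section, the sketch substitutes placeholder choices (first supporting transaction, an arbitrary R.H.S. item) where the real algorithm would apply the priority ordering:

```python
# Simplified sketch of the hiding loop: remove one R.H.S. item from one
# fully-supporting transaction per iteration until conf(lhs => rhs) < MCT.
# The priority-based transaction/item selection is replaced by placeholders.

def support_count(db, itemset):
    return sum(1 for t in db.values() if itemset <= t)

def confidence(db, lhs, rhs):
    denom = support_count(db, lhs)
    return support_count(db, lhs | rhs) / denom if denom else 0.0

def hide_rule(db, lhs, rhs, mct):
    while confidence(db, lhs, rhs) >= mct:
        # Transactions that fully support the sensitive rule.
        supporting = [tid for tid, t in db.items() if (lhs | rhs) <= t]
        if not supporting:
            break
        tid = supporting[0]       # placeholder for the transaction priority
        victim = next(iter(rhs))  # placeholder for the item priority
        db[tid] = db[tid] - {victim}
    return db

db = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c"}, 4: {"b", "c"}}
hide_rule(db, frozenset("a"), frozenset("b"), mct=0.5)
print(confidence(db, frozenset("a"), frozenset("b")))  # 1/3, below the MCT
```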

Performance Evaluation
We have performed extensive experiments in order to evaluate the effectiveness of the algorithm presented above. We ran the algorithm under the Windows Vista operating system on a 2.10 GHz machine with 2 GB of RAM. We used two datasets, available through FIMI 15 , whose properties are summarized in Table 1; Table 2 presents the result of mining these databases.
We compare the proposed algorithm with two published rule hiding algorithms 6,11 , which we also implemented. The first algorithm is called 1.b 6 and the second is called RRLR 11 .
The experiments carried out on these algorithms can be divided into the following general categories, with the results of each investigated separately: 1. The first category comprises tests that hide 3, 5 and 7 sensitive association rules on the dense dataset (Chess) and the sparse dataset (Mushrooms), with the evaluation criterion hiding failure (HF). This measure quantifies the percentage of the sensitive patterns that remain disclosed in the sanitized dataset. It is defined as the fraction of the sensitive association rules that appear in the sanitized database divided by the ones that appeared in the original dataset. Formally, HF = |R_P(D´)| / |R_P(D)|, where R_P(D´) is the set of sensitive rules disclosed in the sanitized dataset D´, R_P(D) is the set of sensitive rules appearing in the original dataset D, and |X| is the size of set X. Ideally, the hiding failure should be 0% 13 . Figures 2 and 3 show the results of the experiments on these algorithms and indicate that none of the algorithms suffers from hiding failure. 2. The second category comprises tests that hide 3, 5 and 7 sensitive association rules on the dense dataset (Chess) and the sparse dataset (Mushrooms), with the evaluation criterion misses cost (MC). This measure quantifies the percentage of the non-sensitive patterns that are hidden as a side effect of the sanitization process. It is computed as MC = (|~R_P(D)| − |~R_P(D´)|) / |~R_P(D)|, where ~R_P(D) is the set of all non-sensitive rules in the original database D and ~R_P(D´) is the set of all non-sensitive rules in the sanitized database D´. As one can notice, there is a trade-off between the misses cost and the hiding failure, since the more sensitive association rules one needs to hide, the more association rules one can expect to miss 13 . As Figures 4 and 5 show, the proposed algorithm performs better than algorithm 1.b and algorithm RRLR.
3. The third category comprises tests that hide 3, 5 and 7 sensitive association rules on the dense dataset (Chess) and the sparse dataset (Mushrooms), with the evaluation criterion artifactual patterns (AP). This measure quantifies the percentage of the discovered patterns that are artifacts. It is computed as AP = (|P´| − |P ∩ P´|) / |P´|, where P is the set of association rules exposed in the original database D and P´ is the set of association rules exposed in D´ 13 . Figures 6 and 7 present the number of ghost rules created by the hiding process. These figures show that algorithm RRLR produces the most ghost rules, while the proposed algorithm performs slightly better than algorithm 1.b.
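Assuming rules are represented as hashable objects, the three measures can be computed from the rule sets mined before and after sanitization as follows; the rule names in the usage example are hypothetical:

```python
# The three evaluation measures over rule sets mined from D and D'.

def hiding_failure(sens_orig, sens_san):
    """HF = |R_P(D')| / |R_P(D)|: sensitive rules still disclosed."""
    return len(sens_san) / len(sens_orig) if sens_orig else 0.0

def misses_cost(nonsens_orig, nonsens_san):
    """MC: fraction of D's non-sensitive rules no longer mineable in D'."""
    survivors = set(nonsens_orig) & set(nonsens_san)
    return (len(nonsens_orig) - len(survivors)) / len(nonsens_orig)

def artifactual_patterns(rules_orig, rules_san):
    """AP = (|P'| - |P ∩ P'|) / |P'|: fraction of D' rules that are ghosts."""
    p, p2 = set(rules_orig), set(rules_san)
    return (len(p2) - len(p & p2)) / len(p2) if p2 else 0.0

P = {"r1", "r2", "r3", "r4"}   # rules mined from the original database D
P2 = {"r2", "r3", "r5"}        # rules mined from the sanitized database D'
sens = {"r1"}                  # the sensitive rule to hide
print(hiding_failure(sens, sens & P2))       # 0.0 -- r1 was hidden
print(misses_cost(P - sens, P2 - sens))      # 1/3 -- r4 was lost
print(artifactual_patterns(P, P2))           # 1/3 -- r5 is a ghost rule
```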

Conclusion and Future Work
Association rule hiding methods can be very helpful when databases must be shared without revealing sensitive information. Accordingly, we have tried to present an algorithm such that, after the sensitive association rules have been hidden, the database can still be mined for the extraction of useful information. By eliminating a selected item among the items of the right hand side of each sensitive rule, for each transaction that fully supports the sensitive rule, and by sorting these transactions according to the priority formula, the algorithm reduces the confidence of each sensitive rule below the minimum threshold, hiding it with the fewest possible side effects each time. Finally, the algorithm was compared with algorithm 1.b and algorithm RRLR using the evaluation criteria hiding failure (HF), misses cost (MC) and artifactual patterns (AP). The results obtained indicate that the proposed algorithm outperforms the other algorithms.
As future work, we plan to test the above techniques on real datasets that differ in the dependency of their item sets. In addition, we plan to construct a new algorithm with a better run time.