Hiding Sensitive Frequent Itemsets over Privacy Preserving Distributed Data Mining

Data mining is the process of extracting hidden patterns from data. One of the most important activities in data mining is the association rule mining and the new head for data mining research area is privacy of mining. Privacy preserving data mining is a new research trend in privacy data for data mining and statistical database. Data mining can be applied on centered or distributed databases. Most efficient approaches for mining distributed databases suppose that all of the data at each site can be shared. Privacy concerns may prevent the sites from directly sharing the data, and some types of information about the data. Privacy Preserving Data Mining (PPDM) has become Alaa Kh. Juma'a Sufyan T. F. Al-Janabi & Nazar A. Ali 92 increasingly popular because it allows sharing of privacy sensitive data for analysis purposes. In this paper, the problem of privacy preserving association rule mining in horizontally distributed database is addressed by proposing a system to compute a global frequent itemsets or association rules from different sites without disclosing individual transactions. Indeed, a new algorithm is proposed to hide sensitive frequent itemsets or sensitive association rules from the global frequent itemsets by hiding them from each site individually. This can be done by modifying the original database for each site in order to decrease the support for each sensitive itemset or association rule. Experimental results show that the proposed algorithm hides rules in a distributed system with the good execution time, and with limited side effects. Also, the proposed system has the capability to calculate the global frequent itemsets from different sites and preserves the privacy for each site. Keywords-cryptography; data mining; distributed database; frequent itemsets; sensitive association rules.


I. Introduction
Privacy preserving data mining is a new research area that investigates the sideeffects of data mining methods that originate from the penetration into the privacy of individuals and organizations. Most of information systems contain private information, such as social security numbers, income, disease type, etc. Therefore, this information should be correctly protected and hidden from unauthorized access. Although, the security of data has been permanent goal in database management systems, mining of knowledge and preventing of sensitive knowledge disclosure become the most important and highest priority goal in data mining process. Basically, the sharing of data between businesses in purpose of reaching valuable information is useful, but it can bring a lot of disadvantages [1].
Recent advances in data mining algorithms increased the risk of information leakage and its confidence issue. Because of this progress, the parallel research area has been started to overcome the information leakage risks and immunization of mining environment. Privacy preserving against mining algorithms is a new research area that investigates the side-effects of data mining methods that can be derived from the privacy diffusion of persons and organizations [9].
Two problems are addressed in (PPDM); one is the protection of private data; another is the protection of sensitive rules (knowledge) contained in the data. The former settles how to get normal mining results when private data cannot be accessed accurately; the latter settles how to protect sensitive rules contained in the data from being discovered, while non-sensitive rules can still be mined normally. The latter problem is called knowledge hiding in database in (KHD) which is opposite to knowledge discovery in database (KDD) [7].
Recent studies in preserving association rule privacy have proposed many techniques like k-Anonymity methods, randomization methods, and cryptographicbased PPDM that includes Secure Multiparty Computation (SMC), homomorphic encryption, and other cryptographic techniques [2, 10 and 11].
Also, a number of techniques like perturbation and anonymization have been developed to hide association rules from being discovered from published data. In practically for a single data set, given specific rules or patterns to be hidden, many data altering techniques for hiding association rules have been proposed. They can be categorized into three basic approaches. The first approach hides one rule at a time. It first selects transactions that contain the items in a given rule. It then tries to modify items, transaction by transaction, until the confidence or support of the rule falls below minimum confidence or minimum support. The modification is done by either removing items from the transaction or inserting new items to the transactions. The second approach deals with groups of restricted patterns or sensitive association rules at a time. It first selects the transactions that contain the intersecting patterns of a group of restricted patterns. Depending on the disclosure threshold given by users, it sanitizes a percentage of the selected transactions in order to hide the restricted patterns. The third approach deals with hiding certain constrained classes of association rules [5].
C.-C. Weng et al. presented algorithm to hide frequent sensitive rule by evaluating the weight of each transaction that supports these rules and hiding these rules according to this weigh [14]. V. S. Verykios et al. presented algorithms to hide sensitive association rules, but they generated high side effects and required multiple database scans [12]. S.-L. Wang proposed an algorithm to hide sensitive items. The algorithm needs less number of database scans, but the side effects generated were higher [13].
This paper deals with the Distributed Data Mining (DDM), or more specifically, with (PPDDM). This work proposes a two-phase PPDDM system. The first phase is dedicated for privacy-preserving distributed mining by encrypting local association rule mining in each site with commutative algorithm and sending the results to all sites. In the second phase, a new algorithm is proposed to hide sensitive frequent itemsets from global support items by hiding these frequent items in each site according to the proposed algorithm. Each site will modify its original database in order to prevent any external miner from detecting any sensitive frequent itemsets. The remaining of this paper is organized as follows: Section 2 briefly discusses classes of association rule algorithms. Section 3 presents the problem description. Next, Section 4 introduces the proposed approaches and algorithm. Then, Section 5; discusses the results and performance evaluation. Finally, the paper is concluded in Section 6.

II. Classes of Association Rule Algorithms
The various approaches proposed by researchers hide sensitive information efficiently and accurately but also face the problem of side effects. The side effects occur due to correlations existing between items in the database. Side effects may decrease the informational accuracy to the users. Due to the property of correlation, association rules may possess spurious or wrong information, hide non sensitive rules unnecessarily, and accidentally disclose some sensitive rules. So the challenging task is how to protect sensitive rules from users without effecting informational accuracy to the users that is avoiding side effects as far as possible.
The algorithms which have been already developed for hiding association rules can be classified into three distinct classes, namely heuristic approaches, border-based approaches and exact approaches. The first class of approaches involves efficient and fast algorithms that selectively sanitize a set of transactions from the database to hide the sensitive knowledge [8]. Due to their efficiency and scalability, the heuristic approaches have been the focus of attention for the vast majority of researchers in the knowledge hiding field. However, there are several circumstances in which they suffer from undesirable side-effects that lead them to suboptimal solutions.
The second set of approaches considers the task of sensitive rule hiding through modification of the original borders in the lattice of the frequent and the infrequent patterns in the dataset. In these schemes, the sensitive knowledge is hidden by enforcing the revised borders (which accommodate the hiding of the sensitive itemsets) in the sanitized database. The algorithms in this class differ both in the borders that they track, use of the hiding strategy, and in the methodology that they follow to enforce the revised borders in the sanitized dataset. Finally, the third class of approaches contains non-heuristic algorithms which conceive the hiding process as a constraint satisfaction problem that they solve by using integer or linear programming. The main difference of these approaches, compared to the previous ones, is the fact that the sanitization process guarantees optimality in the hiding solution, provided that an optimal solution exists. On the other hand, these approaches are usually several orders of magnitude slower than the heuristic ones, especially due to the runtime of the integer/linear programming solver [1].

III. Problem Description
Association rule mining was first introduced by Agrawal, R. et al.
Distributed system assumed that that there are n sites S , S , …, S and the transaction database DB is horizontally divided into n non-overlapping partitions db0, db , …, db , where DB = , 0 ≤ i ≠ j ≤ . Each partition db is assigned to site S , and DB is horizontally distributed. Clearly, |DB| = |db | + |db | + … + |db |. X.sup is the local support counts of itemset X at site Si, for 0 ≤ i ≤ . The global support count of X in DB is given as  T1  ABC  T2  ABC  T3  ABC  T4  AB  T5  A  T6 AC Two problems are addressed here: one for protection the privacy for each site when we evaluate the global support itemsets; the other is to hide a sensitive frequent itemsets from global support items. In the first problem, the commutative encryption should be applied to preserve the global candidate itemsets in each site. Each site encrypts its own local frequent itemsets and some fake itemsets, and then sends the encrypted itemsets to the next site until all sites have encrypted all itemsets. Next, it merges all encrypted itemsets, and all encrypted itemsets are then decrypted site by site.

TID Items
The second problem is to hide the sensitive frequent itemsets or rules and minimize the loss items. When the global frequent for the sensitive rules satisfies these two conditions:i. Support(X=>Y)= P(X and Y) >= Min_sup ; ii. Confidence(X=>Y)= P(X/Y)= [Support( Support(X)] >= Min_conf. where X and Y represent the candidate attributes. That means that this rule is frequent and it should be hidden. However, this rule can be hidden by: • Reducing the support of confidential rules (by decreasing the support of the corresponding large XY). • Reducing the confidence of rules (by Increasing the support of X in transactions not supporting Y or decreasing the support of Y in transactions supporting both X and Y) This can be done by deleting or adding a new data to the original database. This way prevents tools from discovering these rules, but the challenge is the data quality. When a support of items is changed, some other insensitive rules will also be affected either by hiding it or supporting another frequent rule. Therefore, we need to define good ways to reduce the negative side effects on data quality.

IV. Proposed Approache and Algorithms
The main aim of our proposed system is to securely and efficiently preserve the privacy of distributed data mining of association rules on horizontally partitioned database. This section is intended to serve as a work-in-progress report on our proposed work. This work generally can be divided into two phases: The first phase is responsible for protection of the privacy for each site when we evaluate the global support itemsets. This can be done by using SMC protocol. In this proposal, we use a commutative encryption tool. Each site encrypts its own frequent itemsets or rules, and then passes it to other sites until all sites have encrypted all frequent itemsets. These are next passed to a common site to eliminate duplicates and to begin decryption. This set is then passed to each site that decrypts each frequent itemsets. The final result represents the global support frequent itemsets.
In the second phase, we need to hide sensitive frequent itemsets from global support items. This can be done when we reduce the support of confidential rules by decreasing the number of items that support these rules. This can be done by removing these items from original database in each site until the frequent itemsets become less than min_support threshold. Figure 1 represents the proposed system.
The major steps for phase one can be explained as follows (Assuming that we have three sites S1, S2 and S3):

Determination of the local frequent itemset:
• Each site determines local frequent itemset for the sensitive rule (R) by using the following prior algorithm pseudo code: Step: Ck is generated by joining Lk-1with itself. • Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. • Assume that the FI1, FI2, and FI3 represent the local support items or rules and E1, E2, and E3 represent the commutative encryption algorithm with its keys, for sites S1, S2, and S3 respectively ( we use Pohlig-Hellman algorithm to perform the commutative encryption) . Where FI1 = FI2 = …(1) FI3 = • Each site encrypts its rule and sends it to the next site: o S1 determines E1 (FI1) and sends it to S2. o S2 determines E2 (FI2) and sends it to S3. o S3 determines E3 (FI3) and sends it to S1. • Each site encrypts the received data with its key and sends the result to the next site: o S1 determines E1 (E3 (FI3)) and sends it to S2. o S2 determines E2 (E1 (FI1)) and sends it to S3. o S3 determines E3 (E2 (FI2)) and sends it to S1. • Again, each site encrypts the received data with its key and sends the result to the next site: o S1 determines E1 (E3 (E2 (FI2))) and sends it to S2.
In Phase two, a proposed algorithm is used to hide sensitive frequent itemsets or rules by decreasing the support of their generating itemsets until their support is below the minimum support threshold. The steps for hiding sensitive frequent itemsets or rules for each site in the system can be explained as follows: 1. Each site has Local frequent itemset, global frequent itemsets, local database (|T ) and minimum support threshold. 2. Input is the sensitive rule or frequent itemsets. 3. Extract all transactions that support this sensitive frequent items (T . 4. Evaluate the Site Remove Frequent Itemset (SRFI) that is representing how many frequent itemsets are needed to remove from each site as seen in eq. (2).
)/100*|T …(2) where LSF = Local support frequent, GSF =Global support frequent, and min_supp =Minimum Support Threshold. This evaluation performed according to the ratio of each item in sensitive itemsets for all database transaction in order to reduce the side effects that may result from modifying the database. 6. Sort (T ) in ascending order to minimize the impact that applied changes will have in database. 7. Sort items in itemsets in descending order according to the |Ir|. This also will minimize the side effects that can happen when modifying the database. 8. Remove items from (T ) according to the above sorting, by setting the value for this item to "0" instead of "1".
These steps will be done at all sites in the system in order to recalculate the global support itemsets. After recalculation, the sensitive frequent itemsets or rules have been hidden from the new global support itemsets. The pseudo code for the proposed algorithm is shown in Figure 2.
Let us take an example to clarify the operation of the proposed algorithm. Consider three local sites S1, S2 and S3, which have DB1, DB2 and DB3 respectively. We need to hide Sensitive Frequent Itemsets or rules "1, 15 and 16". The local frequent itemset (LSF) in each site and global frequent itemset (GSF) for these rules are explained in Table II. Let the minimum support threshold be 5%. The size of database is 30000 transactions (10000 transactions in each site). To hide this Sensitive Frequent Itemsets we need to modify the database in each site. The following steps perform the hiding operation in each site.

V. Results Analysis and Performance Evaluation
The experiments for the proposal algorithm performed on a notebook with 2G MHz processor and 2 GB memory, under Windows XP operating system ( in a distributed system setting there are three notebooks with the same properties). The sequence database generated for the experiments can be generated by using a Sequence Database Generator "SeqDBGen" [15] that works like IBM data generator [16]. Databases with sizes 30000, 60000, and 90000 transactions are generated for the series of experiments. The average length of transactions of each database (ATL) is 10 and 30 items in the generated database. The range of minimum support threshold given is 5-8%. The experimental results are obtained by averaging from 4 independent trials for each size of transaction with different sensitive frequent itemsets. Figure 3 represents the percentage of the difference between the evaluated global frequent itemset in central and distributed system with privacy. The result obtained a ratio of the global frequent itemset in the distributed system of about 99.99 % when it is compared with the global frequent itemset in the centralized system. The performance of the hiding algorithms has been measured according to two criteria: time requirements and side effects produced. Concerning the time requirements, we have considered the time needed by each algorithm to hide a specified set of rules. Concerning the side effects, we have considered the number of "lost" rules introduced by the hiding process. Hiding rules produce some changes in the original database. Such changes affect the set of rules mined. In particular, not all the rules that can be mined from the source DB can still is retrieved in the released DB. We call the former rules "lost rules".
To assess the performance of the proposed algorithms, this algorithm is used to hide sets of 5 and 10 frequent itemsets or rules mined from the datasets. In each time, the time required for hiding process will be measured. There is a file that contains the global frequent itemsets with the minimum support threshold. After the completion of the hiding process, we mine the released database and then compare the frequent itemsets or rules generated by the two databases. In order to do that, we check if the non-sensitive rules mined from the source database could still be mined in the released database. To do so, we compare each frequent itemsets or rules mined from the original database with each frequent itemsets rule mined form the released DB. If the rule is not found, it is considered as "lost". Note that this process of rule checking tends to consider the rules selected for hiding as "lost", since they cannot be retrieved from the released database. However, since they have been hidden on purpose, we excluded them from the set of "lost rules".
The proposed algorithm applied on both centralized and distributed database. Tables IV-VII below represent the results for time measurements and the lost frequent itemsets in both cases. As shown in Tables IV and VI and Figures 4, 5, 6 and 7, there is no clear change for the number of lost itemsets in the distributed system when it is compared with the centralized database system. That means the proposal algorithm not more effected when it works on the distributed database system. However, the lost frequent itemsets do not have a significant effect on the results, because they are of very low percentage. Also, from the mentioned Figures, we can observe that the number of the lost itemsets is independent of the size of database, but it is linearly-related with the size of sensitive frequent itemsets.
Furthermore, from Tables V and VII and Figures 8 and 9, it is possible to notice that the measured time is of a linear growth with the size of database. The time requirement for hiding 10 frequent itemsets is higher than time needed to hide 5. This is an expected result because in the case of hiding 10 frequent itemsets, we need more modifications in database transactions; this will consume more time. Also, we observe that the time required in the distributed system is less than time required in central system. This is because the database is divided into different sites and the hiding algorithm work independently and in parallel on each of these sites. The number of the lost frequent itemsets is dependent on the sensitive frequent itemsets or rules. If these sensitive rules have high support, they will be of more impact on the database and more items would be required to be removed from transactions. This will increase the number of lost frequent itemsets or rules.      Privacy-preserving data mining on the distributed databases has become a hot research area. Not only in this competitive, but also cooperative business environment, companies, hospitals and other organizations need to share information with others, but not sharing the data. Also, some organizations need to hide some sensitive frequent sets or rules from any other external miner. In this paper, we proposed a system to allow sites like companies or other organizations to share knowledge while protecting at the same time the privacy of each site. We have extracted the frequent itemsets or rules from the distributed database and compared the resultant rules with the rules extracted from the same database, but in a centralized system setting. We have obtained a very small percentage error (about 0.002 %) in the distributed setting case. Also, we have proposed a new algorithm for hiding sensitive itemsets in distributed database. Its operation depends on the ratio of the frequent for sensitive itemsets in each site and the ratio of item counts for each item in the sensitive itemsets for the local database.
According to the obtained results, our proposal has resulted in a very limited side effect (lost frequent itemsets), while obtaining a significant reduction in the time requirement for the case of the distributed database system. As a future work, we will continue to improve the performance of our algorithm to hide sensitive association rules with the lowest possible level of side effects (lost rules and new rules) in a distributed system. We also try to more reduction of the algorithm time requirements.