The majority of the privacy preservation methods developed in the past were based on anonymization techniques, which reduce data utility.
A non-anonymization-based solution to the privacy preservation problem, as recommended in GDPR, is presented.
Identity disclosure and attribute disclosure are not possible because the quasi identifiers are synthesized and cannot be mapped to external data sources.
Sensitive data is tokenized before analytics, and differential privacy is applied to the synthetic data, which prevents background knowledge and homogeneity attacks.
Strong and coherent privacy protection is guaranteed because the original data set is not involved in data analytics. Instead, a statistically similar synthetic data set is used, so that the analytical results on the synthetic data can be related or mapped to the original data set.
The nature of privacy threats has changed due to the emergence of applications like recommendation systems, e-commerce, etc. Conventional data analysis consisted of statistical analysis of data, especially using aggregate queries, where data was analyzed as a whole.
Digital Profiling
Social media privacy and cyberstalking
Image analytics and privacy hazards
Digital Profiling is the automated processing of person-specific data to evaluate certain attributes relating to a person, particularly to analyze and predict an individual's economic situation, buying habits, health, preferences, interests, behaviour, etc. Digital Profiling also influences group privacy, wherein an individual may be a member of one or more groups.
Social media platforms are highly vulnerable to stalking attacks. One of the common stalking techniques involves online mobs of anonymous, self-organized groups that target individuals, causing defamation, threats of violence, and technology-based attacks. Social media are also used to build trust between the perpetrator and the victim. When the victim transmits confidential data, including pictures and videos, the perpetrator abuses them for blackmail purposes.
Image data analytics is widely used in health care, social media, and e-commerce applications. In social media applications like Facebook and Instagram, users upload a lot of images every day. An image is worth more than a thousand words and hence it may reveal the emotional state of a person.
Attempts to analyse the emotional state of people and exploit them. Facebook and WhatsApp status updates can be studied using machine learning models, and sentiment analysis can reveal the social and emotional wellbeing of a person, which can in turn be exploited.
Disclosure of medication that a person is taking in secret, by virtue of promotional offers on medicine.
Another important privacy concern is identity theft, because copies of permanent account number (PAN) cards, passports and driving licenses are kept in digital form and shared. Insurance and banking firms and third parties can extract a lot of sensitive data from them, which is a serious privacy hazard.
Medical imaging deals with a visual representation of the internal structure of organs and tissues. Medical imaging may lead to leakage of personal and sensitive medical data of a person.
Data privacy has gained paramount importance in recent times, as is evident from the privacy legislation passed in more than 100 countries. Firms dealing with data-sensitive applications need to abide by the privacy legislation of their respective regions. In the recent past, a lot of promising work has been done in privacy preserving data analytics. Swarm based algorithms have also been applied to data sets alongside perturbation techniques. The swarm based algorithm developed for privacy preservation uses k-anonymity as its building block. Even though swarm algorithms are promising, they suffer from the traditional flaws of anonymization.
Synthetic data generation is one of the data sanitization methods, in which original data is replaced with synthetic data to enable privacy preserving data analytics. Data can be fully or partially synthetic, and various types of synthetic data generation methods were studied and compared in the previous literature.
As part of our research, we employed a novel algorithm called SQIDP in which quasi identifiers (QI) are synthesized, sensitive attribute(s) are tokenized, and finally, differential privacy is applied to generate a new data set from the original data set.
Algorithm: SQIDP
1. Start
2. Given a Dataset D with attributes D{a1,a2,a3...an}
3. Choose Quasi identifiers (QI) example
5. do
6. Synthesize each QI using
7. rnorm (column size, desired mean, desired standard dev.) to create synthetic data for QI.
Example. SQID {a3', a4', a5'}.
9. Tokenize the sensitive attribute (SA). Example SA {a6}. In tokenization each discrete value of the attribute is replaced with a token.
10. Merge non quasi identifiers of D, SQID and tokenized SA to generate new data set D'.
11. End
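The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the algorithm names R's `rnorm`, for which Python's `random.gauss` stands in here; the function name `sqidp_sketch`, the toy records, the choice of `age` and `hours-per-week` as quasi identifiers, and the use of each column's own mean and standard deviation as the "desired" parameters are all hypothetical. The differential privacy step is not included.

```python
import random
import statistics


def sqidp_sketch(rows, qi_columns, sa_column, seed=None):
    """Return (new_rows, token_map) for a list of dict records.

    Hypothetical helper: quasi identifiers are replaced by normal variates,
    the sensitive attribute by integer tokens, and other columns are kept.
    """
    rng = random.Random(seed)
    new_rows = [dict(row) for row in rows]  # copy, so D itself is untouched

    # Steps 5-7: synthesize each quasi identifier column independently.
    for col in qi_columns:
        values = [row[col] for row in rows]
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # assumed "desired" parameters
        for new_row in new_rows:
            new_row[col] = rng.gauss(mean, sd)

    # Step 9: replace each distinct sensitive value with a token.
    token_map = {}
    for new_row in new_rows:
        value = new_row[sa_column]
        if value not in token_map:
            token_map[value] = len(token_map) + 1
        new_row[sa_column] = token_map[value]

    # Step 10: new_rows now holds non-QI columns, synthetic QIs, and tokens.
    return new_rows, token_map


# Toy records in the style of the Adult data set (illustrative only).
D = [
    {"age": 39, "hours-per-week": 40, "workclass": "State-gov",
     "marital-status": "Never-married"},
    {"age": 50, "hours-per-week": 13, "workclass": "Self-emp-not-inc",
     "marital-status": "Married-civ-spouse"},
    {"age": 38, "hours-per-week": 40, "workclass": "Private",
     "marital-status": "Divorced"},
    {"age": 53, "hours-per-week": 40, "workclass": "Private",
     "marital-status": "Married-civ-spouse"},
]

D_prime, tokens = sqidp_sketch(
    D, ["age", "hours-per-week"], "marital-status", seed=1
)
```

With real data, the synthetic quasi identifier values would probably need rounding or clipping to each attribute's valid range; that choice is left open in this sketch.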
The algorithm was initially applied to the Adult data set, downloaded from the University of California, Irvine (UCI) machine learning repository.
The attribute marital-status is the sensitive attribute (SA), which is tokenized. Marital-status takes the values {Never-married, Married-civ-spouse, Divorced, Married-spouse-absent, Separated, Married-AF-spouse, Widowed}. The SA is tokenized using a numerical vector. The new data set D' is created by combining the non quasi identifiers of D, SQID, and the tokenized sensitive attribute (SA). D' is released instead of D for data analytics. D' contains synthetic data that closely resembles the original data but is not the original data. The mathematical transformations applied to the quasi identifiers (QI) ensure that the analytical results of D' can be applied to D without releasing the original data set. The comparison of synthesized attributes in D' with original values in D is shown in
| S.no | Data set name | No. of attributes | No. of records |
| 1 | Adult Data set | 14 | 48842 |
| 2 | Statlog Data set | 13 | 270 |
| 3 | Indian Liver Patient records | 11 | 583 |
In Section 3 we demonstrated the generation of partially synthetic data with strategic changes made to the quasi identifiers. The data set thus generated (D') can be released for analytics and the results can be applied back to the original data set. However, to make the data set more robust to privacy attacks, a differential privacy algorithm is additionally applied to D'. The Laplace mechanism of differential privacy is applied to D' to generate a differentially private data set, which makes it very difficult to predict whether an individual record was present in the data set or not. Package diffpriv
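To illustrate how the Laplace mechanism works in general, the sketch below releases a differentially private mean of one bounded numeric column. It is an assumption-laden Python example, not the diffpriv workflow used in the paper: the function names `laplace_noise` and `private_mean`, the bounds 17 and 90, the value epsilon = 1.0, and the made-up `ages` column are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Release an epsilon-differentially private mean of bounded values."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # Clip so that every record lies in the assumed range [lower, upper].
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Replacing one record moves the mean of n bounded values by at most this.
    sensitivity = (upper - lower) / len(clipped)
    # Laplace mechanism: noise scale = sensitivity / epsilon.
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(7)
ages = [rng.gauss(38.0, 13.0) for _ in range(1000)]  # made-up synthetic column
released = private_mean(ages, lower=17, upper=90, epsilon=1.0, rng=rng)
```

A smaller epsilon gives a larger noise scale and therefore stronger privacy at the cost of accuracy; the right value depends on the application and is not specified here.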
In SQIDP, the quasi identifiers were replaced with synthetic data generated as random variates drawn from a normal distribution with a specified mean and standard deviation. The mean and standard deviation of the synthetic data will therefore be very close to the mean and standard deviation of the original quasi identifiers. This ensures that the results of data analytics on the synthetic data can be mapped to the original data sets.
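How close the summary statistics are can be made explicit with a standard result for normal samples. The symbols below are introduced here for illustration and assume that the n synthetic values of a quasi identifier are independent draws using the original column's mean \mu and standard deviation \sigma:

```latex
x'_1, \dots, x'_n \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)
\quad\Longrightarrow\quad
\bar{x}' \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right),
\qquad
\operatorname{SE}(\bar{x}') = \frac{\sigma}{\sqrt{n}},
\qquad
\operatorname{SE}(s') \approx \frac{\sigma}{\sqrt{2n}} \ \text{for large } n
```

Under these assumptions the deviation of the synthetic mean and standard deviation from the specified values shrinks in proportion to 1/\sqrt{n}, so a large data set such as Adult (48842 records) would be expected to match more closely than a small one such as Statlog (270 records).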
Advantages of SQIDP:
The execution time of SQIDP was the same on all three data sets despite their different sizes, and hence it is scalable.
SQIDP addresses the background knowledge attack and the homogeneity attack because the quasi identifiers are synthesized and cannot be mapped to any external data sources.
SQIDP is a non-anonymized method and offers 100% data utility.
SQIDP is an innovative method of privacy preservation in which quasi identifiers are synthesized to leave no scope for linkage attacks, which were a common problem in previous privacy preservation techniques. SQIDP is found to be more efficient than existing techniques and a detailed comparison is given in
| Features / Techniques | K anonymity | Cryptographic techniques | Randomization | SWARM Based techniques | Multi Dimensional Sensitivity Based Anonymization (MDSBA) | SQIDP |
| Attribute disclosure and linkage attacks | Vulnerable | Not Vulnerable | Vulnerable | Not Vulnerable | Vulnerable | Not Vulnerable |
| Background knowledge attack | Vulnerable | Not Vulnerable | Vulnerable | Not Vulnerable | Vulnerable | Not Vulnerable because of synthesis of Quasi Identifiers and tokenization of sensitive attribute |
| 100% Data Utility | No | No | No | No | No | Yes |
| Scalability | No | No | No | No | Yes | Yes |
| S.no. | Performance Metric | Description |
| 1 | Data Utility | SQIDP offers 100% data utility because it is non-anonymized. |
| 2 | Robust | SQIDP is robust because it is not vulnerable to linkage attacks (because of differential privacy), background knowledge attacks (because of synthetic quasi identifiers), or attribute and identity disclosure (because of tokenization of the sensitive attribute). |
| 3 | Compliance | SQIDP complies with privacy regulations and does not use anonymization, as recommended in GDPR. |
| 4 | Execution time | All the privacy preserving techniques, including SQIDP, will have O(n) time complexity. However, SQIDP can be executed on a distributed computing platform to gain better performance. |
| 5 | Accuracy | Anonymization leads to data loss and in turn affects the results of the analytics. SQIDP is non-anonymized and hence offers accurate analytics. |
Even though Differential privacy is employed in US Census 2020
The SQIDP algorithm can be applied to text data, ensuring privacy preservation and protection from background knowledge and linkage attacks. SQIDP is a useful contribution to the field of privacy preserving data analytics that ensures data utility along with privacy. However, SQIDP is limited to text data and cannot be applied to image or video data. Extensive usage of social media has led to the creation of a huge amount of image and video data that are prone to various cyber security vulnerabilities and offer ample scope for further research.