Privacy Confidentiality Protection in Big-data Policies. Many national statistical agencies, polling organisations, medical facilities, and online and social networking businesses are gathering and analysing a large amount of data about people, such as demographic information, internet activity, energy use, contact habits and social interactions.

Big Privacy: Protecting Confidentiality in Big Data

The widespread distribution of microdata (individual-level data) promotes advances in science and public policy, lets people learn about their communities, and helps students improve their data analysis skills. Often, however, data producers cannot release microdata as collected, since doing so may expose the identities of data subjects or the values of their sensitive attributes.

Failing to maintain confidentiality (when promised) is unethical and can harm both data subjects and the data provider. In government and research settings, it may even be illegal. For instance, anyone who discloses confidential data protected by the U.S. Confidential Information Protection and Statistical Efficiency Act is subject to a maximum of $250,000 in fines and a five-year jail sentence.

Sharing protected microdata seems an easy task at first glance: simply strip unique identifiers such as names, addresses, and tax identification numbers before the data is published. However, when other readily available variables, such as aggregated geographic or demographic data, remain in the file, these anonymizing actions alone may not suffice.

These quasi-identifiers can be used to match units in the released data to other databases. As part of her PhD thesis at MIT, for example, computer scientist Latanya Sweeney showed that 97 percent of the records on publicly accessible Cambridge, MA voter registration lists could be uniquely identified using birth date and nine-digit zip code.

By matching on the details in these lists, she was able to identify Governor William Weld in an anonymized medical database. More recently, the company Netflix released supposedly de-identified data describing the movie viewing habits of more than 480,000 customers.

However, by linking to an online movie ratings website, computer scientists Arvind Narayanan and Vitaly Shmatikov were able to identify many customers, thus uncovering apparent political preferences and other potentially sensitive information.

Perhaps the most sensational privacy breach occurred in 2006, when America Online (AOL) published 20 million search queries posed by users over a span of three months in order to promote information retrieval research.

AOL recognised that web searches contain potentially identifying and sensitive information (including social security and credit card numbers!), and thus tried to anonymize the data by replacing user identifiers with random numbers.

Within a few hours of releasing the anonymized data, however, two New York Times reporters were able to discover the identity of user No. 4417749 based only on her search history, e.g., “landscapers in Lilburn, Ga,” several people with the last name Arnold, and “numb fingers.”

This breach had far-reaching consequences: several high-ranking AOL officials were dismissed, and search firms are now hesitant to release search logs and other personal details. In fact, because the data exposes so much about people, even researchers are wary of using the now publicly accessible AOL data for analysis.

While these re-identification exercises were conducted to highlight privacy issues, it is easy to imagine re-identification attacks carried out for malicious reasons, particularly against large databases on individuals.

A nosy neighbour or relative might search through a public database in an attempt to learn private details about someone they know participated in a survey or administrative database. A reporter might try to identify celebrities or politicians. Advertising agencies or creditors might mine big databases to identify good, or poor, potential customers.

And disgruntled hackers may attempt to discredit organisations by identifying people in public-use data. Whether perceived or real, the threat of breaches has significant consequences for the practice and scope of data sharing, especially for the availability of large, richly detailed data.

These threats have created a fascinating field of study for aspiring computer scientists, mathematical and statistical scientists, and social scientists. The field goes by names such as privacy-preserving methods (computer science) and statistical disclosure limitation (statistical science).

It is a field where the research challenges are broad and interdisciplinary, where there are clear opportunities for high-profile publications and external funding, and where there is a real and important potential to influence the practice of data sharing.

In this article we describe some general research topics in this field, with the intention of pointing out opportunities for students. Bearing in mind its interdisciplinary nature, we present both computer science and statistical science viewpoints, reflecting our two home departments.

We begin by describing research into how the risks of confidentiality breaches are defined and measured. We then describe some approaches to protecting data. We conclude with a few general areas in which students can participate in research.

We note that big privacy includes many more topics that we do not cover for lack of space. These include, for example, systems for privately collecting data, access control in web and social networking applications, data security and cryptography, and secure computing protocols.

These are equally rich and complementary research areas that are essential for big data to be used safely and confidentially.

Defining and Measuring Confidentiality Risks

Both the computer science and statistical science communities have developed a variety of criteria and methods for quantifying confidentiality risks. Indeed, a major thrust of research funded by the US National Science Foundation (including grants to us) is to combine these two viewpoints, taking the best of what each has to offer.

In reviewing risk metrics, we do not try to cover all of them. Instead, we focus on a few major ones with which we are most familiar.

Measures used in practice in statistical science tend to be informal and heuristic in nature. For example, a standard risk heuristic for the disclosure of tabular magnitude data for business establishments (e.g., tables of total payroll within groupings of employee size) is that no establishment should contribute more than a specified percentage of the cell total, and no cell should contain fewer than three establishments. Cells that do not fulfil these conditions are either suppressed or perturbed.
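As a rough illustration, the two heuristics above can be sketched as a simple check on each table cell. The 70 percent dominance threshold, the function names, and the payroll figures are hypothetical choices made for this sketch; real agencies set their own thresholds and typically do not publish them.

```python
from typing import Dict, List, Optional


def cell_is_safe(contributions: List[float],
                 max_share: float = 0.7,   # hypothetical dominance threshold
                 min_count: int = 3) -> bool:
    """Return True if a magnitude-table cell passes both heuristics."""
    # Threshold rule: the cell needs at least `min_count` establishments.
    if len(contributions) < min_count:
        return False
    # Dominance rule: no establishment may exceed `max_share` of the total.
    total = sum(contributions)
    if total > 0 and max(contributions) / total > max_share:
        return False
    return True


def protect_table(cells: Dict[str, List[float]]) -> Dict[str, Optional[float]]:
    """Publish the cell total when safe; suppress (None) otherwise."""
    return {name: (sum(values) if cell_is_safe(values) else None)
            for name, values in cells.items()}


# Made-up payroll contributions per establishment, grouped by employee size.
payroll = {
    "1-9 employees": [120.0, 95.0, 110.0, 130.0],
    "10-49 employees": [900.0, 40.0, 60.0],   # one establishment dominates
    "50+ employees": [5000.0, 4800.0],        # fewer than three establishments
}
published = protect_table(payroll)
```

In this toy table only the first cell is published; the second fails the dominance rule and the third fails the minimum-count rule.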

The most general and mathematically principled disclosure risk measures are based on Bayesian probabilities of re-identification, that is, posterior probabilities that intruders will learn details about data subjects, given the published data and a set of assumptions about the knowledge and behaviour of the intruder.

In the face of uncertainty, agencies can compute these measures under a range of scenarios of intruder knowledge as a way to identify particularly risky records and make an informed decision on data release policy (the goal of statistical science in general). Computing these probabilities in practice is computationally challenging and requires creative methodology, especially for big data.
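A deliberately simplified sketch of this idea follows. It assumes the intruder knows the target is in the file, knows some of the target's attributes without error, and places a uniform prior over all records that match; under those assumptions the posterior match probability for a record is one over the number of matching records. The records, attribute names, and the 0.5 cut-off are invented for illustration, and realistic risk models are considerably richer.

```python
from collections import Counter

# Made-up released records (direct identifiers already removed).
records = [
    {"age": 34, "zip": "27701", "sex": "F"},
    {"age": 34, "zip": "27701", "sex": "M"},
    {"age": 34, "zip": "27701", "sex": "M"},
    {"age": 51, "zip": "27705", "sex": "F"},
]


def match_probabilities(rows, known_keys):
    """Posterior probability of a correct match for each record, assuming a
    uniform prior over records that agree with the intruder's knowledge."""
    keys = [tuple(row[k] for k in known_keys) for row in rows]
    counts = Counter(keys)
    return [1.0 / counts[key] for key in keys]


# Two scenarios of intruder knowledge.
scenario_a = match_probabilities(records, ["age", "zip"])
scenario_b = match_probabilities(records, ["age", "zip", "sex"])

# Flag records whose risk exceeds a hypothetical cut-off in the worse scenario.
risky = [i for i, p in enumerate(scenario_b) if p > 0.5]
```

Comparing scenarios shows how extra intruder knowledge raises the risk for specific records, which is the kind of comparison an agency might use when deciding what to release.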

Some of the early attempts in computer science to measure confidentiality risk were aimed at thwarting reidentification attacks (which we mentioned in the introduction) by ensuring that the record of no person is unique in the data.

This inspired a popular notion of privacy called K-Anonymity, which requires that microdata be published in such a way that no individual's record can be distinguished from at least K-1 other records. While this apparently prevents the privacy breaches discussed in the introduction, it has a notable disadvantage: an adversary (especially one with prior knowledge) may still learn sensitive information.

For example, suppose a hospital releases K-anonymous patient microdata, and you know that your neighbour Bob is in the data. If everyone in the anonymous group containing Bob has either cancer or the flu, and you know that Bob does not have the flu, then you can deduce that Bob has cancer.
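The hospital example can be made concrete with a small sketch. The table, the quasi-identifier names, and the helper functions below are hypothetical; the point is only that a table can pass a K-anonymity check and still reveal Bob's diagnosis to someone with background knowledge.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group records that are indistinguishable on the quasi-identifiers."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return groups


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return all(len(group) >= k for group in classes.values())


# Made-up, already generalized patient microdata.
patients = [
    {"age": "30-39", "zip": "277**", "disease": "cancer"},
    {"age": "30-39", "zip": "277**", "disease": "flu"},
    {"age": "30-39", "zip": "277**", "disease": "cancer"},
    {"age": "40-49", "zip": "278**", "disease": "flu"},
    {"age": "40-49", "zip": "278**", "disease": "asthma"},
    {"age": "40-49", "zip": "278**", "disease": "flu"},
]
qi = ["age", "zip"]
three_anonymous = is_k_anonymous(patients, qi, k=3)

# Background-knowledge attack: the intruder knows Bob's quasi-identifiers
# and knows that Bob does not have the flu.
bob_group = equivalence_classes(patients, qi)[("30-39", "277**")]
possible_diseases = {row["disease"] for row in bob_group} - {"flu"}
```

Here the table is 3-anonymous, yet only one disease remains possible for Bob once the flu is ruled out.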

K-Anonymity has been extended to deal with this shortcoming in a variety of ways. An example is L-Diversity, which requires that each group of individuals who are indistinguishable by their quasi-identifiers (such as age, gender, and zip code) do not all share the same value for the sensitive attribute (such as disease), but instead have L well-represented values (in approximately equal proportions).
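A minimal sketch of the simplest variant, often called distinct L-diversity, is shown below: it only counts distinct sensitive values per group and does not check that they appear in roughly equal proportions, so it is weaker than the requirement described above. The function name and data are hypothetical.

```python
from collections import defaultdict


def is_distinct_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if every quasi-identifier group has at least l distinct
    sensitive values. Proportions are NOT checked in this sketch."""
    values = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        values[key].add(row[sensitive])
    return all(len(v) >= l for v in values.values())


# Made-up generalized records.
rows = [
    {"age": "30-39", "zip": "277**", "disease": "cancer"},
    {"age": "30-39", "zip": "277**", "disease": "flu"},
    {"age": "30-39", "zip": "277**", "disease": "asthma"},
    {"age": "40-49", "zip": "278**", "disease": "flu"},
    {"age": "40-49", "zip": "278**", "disease": "flu"},
    {"age": "40-49", "zip": "278**", "disease": "flu"},
]
diverse = is_distinct_l_diverse(rows, ["age", "zip"], "disease", l=2)
first_group_only = is_distinct_l_diverse(rows[:3], ["age", "zip"], "disease", l=3)
```

The full table fails the check because everyone in the second group has the flu, even though that group is 3-anonymous.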

Differential privacy is considered the current state-of-the-art disclosure metric. It eliminates (to a large extent) the confidentiality problems of K-anonymity, L-diversity, and their extensions. Differential privacy is best explained using the following opt-in/opt-out analogy.

Suppose an organisation (e.g., the Census Bureau or a search engine) wants to release microdata. Each person has two options: opt out of the microdata in order to protect their privacy, or opt in and hope that an informed intruder cannot infer sensitive information from the published microdata.

A microdata release mechanism is said to guarantee ε-differential privacy if, for any pair of inputs D1 and D2 that differ in one individual’s record (e.g., D1 contains record t and D2 does not contain t) and for any output M, the probability that the mechanism outputs M with input D1 is close to (within an exp(ε) factor of) the probability that the mechanism outputs M with input D2.

In this way, the release mechanism is insensitive to the presence (opt-in) or absence (opt-out) of a single person in the data. Differential privacy, therefore, constitutes a firm guarantee.
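One way to see the exp(ε) bound at work is the classic randomized response mechanism, sketched here as a toy. This is an illustrative assumption rather than a method described in this article: the "database" is a single person's yes/no answer, and the two neighbouring inputs differ in that answer rather than in the presence of a record. The mechanism reports the true answer with probability exp(ε) / (1 + exp(ε)) and the opposite answer otherwise.

```python
import math


def response_distribution(true_answer: int, epsilon: float) -> dict:
    """Output distribution of randomized response for a 0/1 answer."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {true_answer: p_truth, 1 - true_answer: 1.0 - p_truth}


def within_exp_epsilon(epsilon: float) -> bool:
    """Check that every output is at most exp(epsilon) times more likely
    under one input than under the neighbouring input."""
    d1 = response_distribution(1, epsilon)
    d2 = response_distribution(0, epsilon)
    bound = math.exp(epsilon) * (1.0 + 1e-9)  # small floating-point slack
    return all(d1[m] <= bound * d2[m] and d2[m] <= bound * d1[m]
               for m in (0, 1))


epsilon = 0.5
ok = within_exp_epsilon(epsilon)
ratio = response_distribution(1, epsilon)[1] / response_distribution(0, epsilon)[1]
```

The ratio of output probabilities equals exp(ε), so the bound is met with equality; a smaller ε forces the two distributions closer together and therefore reveals less about the individual's answer.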

In addition, differential privacy satisfies an essential composability property: if M1 and M2 are two mechanisms that satisfy differential privacy with parameters ε1 and ε2, then releasing the outputs of M1 and M2 together also satisfies differential privacy, with parameter ε1 + ε2.
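A practical consequence of this additive property is that an organisation can treat ε as a budget that is spent across releases. The small accountant below is a hypothetical sketch of that bookkeeping; the class name and the total budget of 1.0 are assumptions for illustration only.

```python
class PrivacyBudget:
    """Track cumulative epsilon under basic (additive) composition."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> float:
        """Record a release with parameter epsilon; return what remains."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon
        return self.total - self.spent


budget = PrivacyBudget(total_epsilon=1.0)  # hypothetical overall budget
budget.charge(0.3)   # first release, epsilon_1 = 0.3
budget.charge(0.5)   # second release, epsilon_2 = 0.5
# Together the two releases satisfy (0.3 + 0.5)-differential privacy.
```

A third release with ε = 0.5 would be refused by this accountant, since it would push the combined parameter past the chosen total.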

Other established notions of privacy (e.g., k-anonymity and l-diversity) do not satisfy composability, so two privacy-preserving releases that each use these notions can together result in a privacy breach.

Methods for Protecting Public Release Data

As with risk measures, methods for altering or perturbing data before release have been developed by both computer scientists and statistical scientists. Indeed, very similar approaches are often created independently in the two communities!

Aside from whether a method of privacy protection results in low disclosure risk (according to one of the metrics mentioned in the previous section), there are two important considerations when developing such a method. First, the method must produce outputs that preserve useful information about the input.

Note that any privacy protection leads to some loss of utility (after all, we are trying to hide individual-specific properties). Therefore, it is typically a good idea to list the kinds of statistical analyses that need to be performed on the data, and then optimise the output to best support those analyses.

Some approaches assume an interactive setting, where a data analyst queries the dataset and perturbed results are returned for these queries. Second, a method of privacy protection should be simulatable: an intruder must be presumed to know the method of privacy protection.

For example, a method that reports an individual’s age (x) as [x-10, x+10] is not simulatable, as an attacker who knows this algorithm can deduce that the individual’s age is x, the midpoint of the interval. Next, we present a few important classes of privacy protection methods.
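The age-interval example is easy to demonstrate. In the sketch below, the first function is the flawed method from the text and the attacker simply takes the midpoint. The fixed-bin alternative is an assumed contrast added for illustration: because its interval boundaries do not depend on the exact age, knowing the algorithm reveals only the bin.

```python
def centred_age_interval(age: int, half_width: int = 10):
    """Flawed method from the text: report age x as [x-10, x+10]."""
    return (age - half_width, age + half_width)


def attacker_recovers_age(interval):
    """An attacker who knows the method takes the interval's midpoint."""
    low, high = interval
    return (low + high) // 2


def binned_age_interval(age: int, width: int = 20):
    """Hypothetical alternative: report a fixed, data-independent bin."""
    low = (age // width) * width
    return (low, low + width - 1)


leaked = attacker_recovers_age(centred_age_interval(37))
same_bin = binned_age_interval(37) == binned_age_interval(25)
```

With the centred interval the attacker recovers the exact age, while with fixed bins two different ages in the same bin produce identical output.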


Aggregation

By turning atypical records (which are normally most at risk) into typical records, aggregation decreases disclosure risks. For instance, there may be only one individual in a city with a specific combination of demographic features, but several individuals in a state with those features.

Releasing data for this individual with city-level geography may carry a high risk of disclosure, whereas releasing the data with state-level geography may not. Aggregation is closely related to K-anonymity. Unfortunately, aggregation makes analysis at finer levels difficult and sometimes impossible, and it creates ecological inference problems (relationships seen at aggregated levels may not apply at disaggregated levels).

There is a substantial literature on aggregation techniques that satisfy k-anonymity, l-diversity (and variants), and differential privacy.
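The effect of aggregation on uniqueness can be sketched as follows. The generalization rules (ten-year age bands and three-digit zip prefixes) and the data are hypothetical choices; the sketch only shows that coarsening quasi-identifiers enlarges the smallest group of indistinguishable records.

```python
from collections import Counter


def generalize(record):
    """Coarsen age to a ten-year band and zip code to its first 3 digits."""
    decade = (record["age"] // 10) * 10
    return {"age": f"{decade}-{decade + 9}",
            "zip": record["zip"][:3] + "**"}


def smallest_group(rows):
    """Size of the smallest set of records sharing the same (age, zip)."""
    counts = Counter((row["age"], row["zip"]) for row in rows)
    return min(counts.values())


# Made-up records; each is unique on (age, zip) before aggregation.
people = [
    {"age": 31, "zip": "27701"},
    {"age": 36, "zip": "27705"},
    {"age": 38, "zip": "27707"},
    {"age": 62, "zip": "27514"},
    {"age": 67, "zip": "27516"},
]
before = smallest_group(people)
after = smallest_group([generalize(p) for p in people])
```

Before aggregation every record is unique; afterwards the smallest group has two records, at the cost of losing exact ages and local geography.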


Suppression

Agencies can remove sensitive values from the published data. They might suppress entire variables or just the data values that are at risk. In general, suppressing particular data values creates data that are missing because of their actual values, which makes the released data difficult to analyse accurately.

For example, if revenues are deleted because they are high, estimates of the revenue distribution based on the released data would be biased low.
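A small numerical sketch shows the size this bias can take. The revenue figures and the suppression threshold are invented; the naive analyst simply averages whatever values remain.

```python
from statistics import mean

# Made-up revenues; the two largest are considered at risk.
revenues = [20, 25, 30, 35, 40, 45, 50, 400, 900]
threshold = 300  # hypothetical suppression threshold

# Suppress (replace with None) every value above the threshold.
released = [r if r <= threshold else None for r in revenues]

# A naive analysis ignores the suppressed cells.
observed = [r for r in released if r is not None]
true_mean = mean(revenues)
naive_mean = mean(observed)
```

Because the values are missing precisely when they are large, the naive mean of the released data falls far below the mean of the original data.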

Data swapping

To discourage users from matching, agencies can swap data values for selected records (for example, exchange the age, ethnicity, and sex values of at-risk records with those of other records), since matches might then be based on incorrect data. Swapping is used widely by federal agencies.

Swap rates are generally believed to be low, because swapping at high rates breaks relationships between swapped and unswapped variables. Agencies do not report the rates to the public, and thus these algorithms are not simulatable.
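The following is a minimal sketch of random swapping, not any agency's actual procedure (those details are generally not public). The function, the choice of partner at random, and the data are assumptions. It illustrates why swapping preserves the marginal distribution of the swapped variables while disturbing their link to the unswapped ones.

```python
import random


def swap_values(rows, columns, at_risk_indices, seed=0):
    """Exchange the values of `columns` between each at-risk record and a
    randomly chosen partner record. Other columns stay in place."""
    rng = random.Random(seed)
    swapped = [dict(row) for row in rows]  # copy; leave the input untouched
    for i in at_risk_indices:
        partner = rng.choice([j for j in range(len(rows)) if j != i])
        for col in columns:
            swapped[i][col], swapped[partner][col] = (
                swapped[partner][col], swapped[i][col])
    return swapped


# Made-up microdata; record 0 is treated as at risk.
data = [
    {"age": 92, "sex": "F", "income": 18},
    {"age": 34, "sex": "M", "income": 52},
    {"age": 41, "sex": "F", "income": 47},
    {"age": 29, "sex": "M", "income": 39},
]
protected = swap_values(data, ["age", "sex"], at_risk_indices=[0])
```

The set of (age, sex) pairs in the file is unchanged, so tabulations of those variables alone are unaffected, but the pairing of age and sex with income is no longer reliable for the swapped records.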

Adding random noise

Agencies can protect numerical data by adding some randomly selected amount, such as a random draw from a normal distribution with a mean equal to zero, to the observed values or to the answers to statistical queries. Adding noise to values reduces the ability to match accurately on the perturbed data, and distorts the values of sensitive variables.

The degree of confidentiality protection depends on the nature of the noise distribution; e.g., a large variance gives greater protection. Adding noise with large variance, however, introduces measurement error that stretches marginal distributions and attenuates regression coefficients.
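The attenuation effect can be checked with a short simulation, sketched here under simple assumptions: a predictor drawn from a standard normal distribution, an outcome that is exactly twice the predictor, and added noise with the same variance as the predictor. In that setting the classical measurement-error result says the fitted slope shrinks by the factor var(x) / (var(x) + var(noise)), which is one half here. The sample size and seed are arbitrary.

```python
import random


def variance(values):
    """Sample variance (n - 1 denominator)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(42)
x = [rng.gauss(0.0, 1.0) for _ in range(20000)]
y = [2.0 * xi for xi in x]                        # true slope is 2
noisy_x = [xi + rng.gauss(0.0, 1.0) for xi in x]  # noise variance 1

true_slope = slope(x, y)
noisy_slope = slope(noisy_x, y)   # expected to be close to 1, not 2
spread_ratio = variance(noisy_x) / variance(x)  # expected to be close to 2
```

The noisy predictor has roughly twice the variance of the original, and the regression slope estimated from it is roughly half the true slope.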

Strong privacy guarantees like differential privacy can be given by adding noise from a heavy tailed distribution (like a Laplace distribution) to query responses.
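As a sketch of how this works for a single counting query: adding or removing one person's record changes a count by at most 1, so Laplace noise with scale 1/ε is the standard calibration for ε-differential privacy. The function names and data below are hypothetical, and the Laplace draw is built from the difference of two exponential draws so that only the standard library is needed.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale) as the difference of two independent
    exponential draws, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(rows, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy answer to 'how many rows satisfy predicate?'.
    A count has sensitivity 1, so the noise scale is 1 / epsilon."""
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(7)
ages = [23, 35, 41, 58, 62, 67, 71, 80]   # made-up data


def is_senior(age):
    return age >= 65


noisy_answer = private_count(ages, is_senior, epsilon=1.0, rng=rng)
```

Each noisy answer varies from run to run, but the noise has mean zero, so averages of many hypothetical answers centre on the true count of 3. Answering the same query repeatedly would, of course, consume additional privacy budget.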

Synthetic data

The fundamental idea of synthetic data is to replace original data values at high risk of disclosure with values simulated from probability distributions. These distributions are specified to reproduce as many of the relationships in the original data as possible. Synthetic data approaches come in two flavours: partial and full synthesis.

Partially synthetic data comprise the units originally surveyed, with some subset of the collected values replaced by simulated values. For example, the agency could simulate sensitive or identifying variables for units in the sample with unusual combinations of demographic characteristics; or, the agency could replace all data for selected sensitive variables.

Fully synthetic data comprise an entirely simulated data set; the units originally sampled are not on the file. Typical algorithms either construct a comprehensive model of the data or transform the data into another space (e.g., Fourier).

The resulting representation is then perturbed, and synthetic data are sampled from it. In fact, one of the data products of the US Census Bureau publishes statistics on people’s commuting patterns using a synthetic data generation technique that satisfies a variant of differential privacy.
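A deliberately crude sketch of partial synthesis follows. It fits a single normal distribution to one sensitive column and replaces the values of flagged records with draws from it. This is an assumption made to keep the example short: real synthesizers model the sensitive variable conditionally on other variables so that relationships are preserved, and the function and data here are hypothetical.

```python
import random
from statistics import mean, stdev


def partially_synthesize(rows, column, at_risk_indices, seed=0):
    """Replace `column` for at-risk records with draws from a normal
    distribution fitted to that column. All other values are kept."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    mu, sigma = mean(values), stdev(values)
    synthetic = [dict(row) for row in rows]  # copy; input stays unchanged
    for i in at_risk_indices:
        synthetic[i][column] = rng.gauss(mu, sigma)
    return synthetic


# Made-up survey records; record 3 has an unusual demographic profile.
survey = [
    {"age": 34, "income": 52.0},
    {"age": 41, "income": 47.0},
    {"age": 29, "income": 39.0},
    {"age": 97, "income": 61.0},
]
released = partially_synthesize(survey, "income", at_risk_indices=[3])
```

The released file has the same units and the same non-sensitive values as the original, but the income shown for the flagged record is a simulated value rather than the one collected.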

Research Challenges

Although recent research has shed a great deal of light on formal disclosure metrics and on new, provably private methods that provide useful statistical information, many interesting research challenges in this field remain open to students.

For example, most privacy work has considered data where each record corresponds to a particular person, and the records are usually treated as independent. Ensuring the privacy of linked data, such as social networks, where individuals are connected to other people, and relational data, where various types of entities may be connected to each other, is an important problem.

In such data, reasoning about privacy is tricky since information about a person can be leaked via links to other people. Releasing the same data sequentially over time is another interesting problem. Across releases, attackers may link individuals and infer additional sensitive information that they could not have inferred from a single release.

Finally, we need to develop techniques that can preserve privacy while maintaining utility as the data we deal with becomes increasingly high-dimensional. An important open area for study is understanding the theoretical trade-offs between privacy and utility.

To close this article, we urge students to learn more about opportunities in this research field. We will be happy to point you to places where you can read more; just send an email describing your interests to one of us via the subscription page. And we promise to keep your email private, of course!
