Privacy as confidentiality

When the right to be left alone is interpreted from a technical perspective, privacy usually means preventing personal data from becoming accessible to others, especially to the general public. Accordingly, the goal of data protection technologies is to enable the use of services while minimising the amount of information disclosed. In this context, information includes both data directly exchanged with a service and information that is indirectly available in the metadata associated with the use of the service.

Metadata is discussed separately below. Approaches to minimising the exposure of direct data include cryptographic protection and obfuscation-based inference control. Cryptography prevents unauthorised access to information. Obfuscation, in turn, ensures that information leaking to unauthorised parties is limited in quantity or cannot be linked to specific individuals. Obfuscation relaxes the requirements for confidentiality because the data can no longer be individually identified.

Select one or more options.

What relates to the right to be left alone?
You are using an online service that requests your personal identity code. You are certain that the service has a justified reason for the request. What else do you check before sending the information, if you want to ensure that your identity code does not fall into the wrong hands?

Cryptographic protection of data (advanced)

Cryptographic protection aiming at confidentiality is based on one of the following attacker models:

  1. The recipient is trusted, but the transmission channel is not.
  2. The recipient is not trusted, and the data must be kept confidential not only during transmission but also during processing.

1. Protecting data during transmission is achieved through end-to-end encryption (E2EE). Here, the “ends” refer to the origin and destination of the message, for example the sender and recipient of an email. They can also be the client and server in services based on the client-server model. E2EE ensures that no third party (e.g. routers, applications used for communication) can read the message while it is travelling between the ends. In addition, E2EE protects message integrity by preventing intermediaries from modifying the transmitted data and provides authentication, allowing the ends to be certain of each other’s identity.

Devices at both ends of E2EE communication require an encryption key. These are usually symmetric keys, meaning that the same key both encrypts and decrypts the data. A key can be chosen by one end and delivered to the other (key transport), or generated jointly by both ends (key exchange). Integrity and authentication are often implemented using digital signatures and message authentication codes (MAC). E2EE is usually implemented as part of a protocol: for example, TLS is widely used in client–server scenarios, and PGP is a common email encryption mechanism.
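The key exchange and MAC steps described above can be sketched roughly as follows. This is a toy illustration using only the Python standard library, not how TLS or PGP implement them: the modulus P, the generator G, and all function names are assumptions made for the example, the group is far too small for real use, and the exchange is unauthenticated, so on its own it would not stop an active intermediary.

```python
import hashlib
import hmac
import secrets

# Toy public parameters (assumption for illustration only): a real system
# would use a standardised group or an elliptic curve.
P = 2**127 - 1
G = 3

def keypair():
    """Return (private, public) values for one end of the key exchange."""
    private = secrets.randbelow(P - 3) + 2  # value in [2, P - 2]
    return private, pow(G, private, P)

def derive_key(own_private, peer_public):
    """Derive the symmetric key from the shared Diffie-Hellman secret."""
    shared = pow(peer_public, own_private, P)
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()

def tag(key, message):
    """Compute a message authentication code (HMAC-SHA256)."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key, message, received_tag):
    """Check the MAC in constant time."""
    return hmac.compare_digest(tag(key, message), received_tag)

# Each end generates its own key pair and sends only the public value.
alice_priv, alice_pub = keypair()
bob_priv, bob_pub = keypair()

# Both ends arrive at the same symmetric key without ever transmitting it.
alice_key = derive_key(alice_priv, bob_pub)
bob_key = derive_key(bob_priv, alice_pub)

message = b"meet at noon"
mac = tag(alice_key, message)
```

If an intermediary changes the message in transit, verify(bob_key, ...) returns False, which is how the recipient detects the modification.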

E2EE is now the dominant practice in instant messaging applications, such as Signal, WhatsApp, Facebook Messenger, and Viber. These are based on the Signal Protocol or a close variant of it. It provides authenticated messaging between users and end-to-end confidentiality. Messages remain confidential even if the messaging server is compromised, and earlier messages remain confidential even if users’ long-term keys are later compromised. These properties are based on the fact that key exchange is repeated to generate new encryption keys, and keys are updated even when the same party sends multiple messages consecutively.
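The idea of updating the key for every message, even when the same party sends several in a row, can be sketched as a simple key-derivation chain. This is a simplified illustration loosely modelled on that idea and not the Signal Protocol itself: the function name ratchet_step, the constants, and the starting secret are assumptions, and the repeated key exchange between the parties is left out.

```python
import hashlib
import hmac

def ratchet_step(chain_key):
    """Derive a one-off message key and the next chain key from the current one."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key

# Hypothetical starting point: in practice this would come from a key exchange.
initial_chain_key = hashlib.sha256(b"hypothetical shared secret").digest()

chain_key = initial_chain_key
message_keys = []
for _ in range(3):
    # Each message gets its own key; the old chain key is then discarded.
    message_key, chain_key = ratchet_step(chain_key)
    message_keys.append(message_key)
```

Because the derivation is one-way, someone who obtains the current chain key cannot compute the keys of messages that were sent earlier.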

Protocols guarantee E2EE only as long as the authentication mechanisms of the parties function as expected. For example, the confidentiality of TLS is based on services keeping their keys secret and on the public key infrastructure (PKI) functioning in the authentication of communication parties. In turn, the confidentiality of WhatsApp relies on the difficulty of spoofing phone numbers, allowing users to be confident that they are communicating with the intended person.

Computerphile’s E2EE video illustrates the topic. The video is motivated by a comment from a British policymaker suggesting that, in order to combat crime, a government backdoor should be introduced into E2EE communication. While watching, also consider the “nothing to hide” argument that the video briefly mentions.

2. Protecting data during processing: Sometimes a data recipient must be allowed to perform some processing, even though it may be considered a potential attacker. Processing can even be fully outsourced to the recipient, or the sender may participate in the processing.

Examples of outsourced processing include the use of cloud services for processing big data, or a database in which the sender wishes to perform searches. The problem with outsourced processing is that access to certain parts of the data may reveal information about the sender to the processor; for example, in a patent database, reading certain information may reveal business intentions, or searching for a specific user’s message in a messaging directory may reveal relationships between users. This issue can be mitigated using private information retrieval, which enables database queries without revealing which record is being accessed.
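One classic way to realise private information retrieval can be sketched as follows. The sketch assumes that the database is replicated on two servers that do not collude, which is an assumption of this example and not something the text requires; the records and function names are hypothetical. Each server sees only a random-looking selection of records, yet the two answers together yield exactly the wanted record.

```python
import secrets

def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def server_answer(database, selection):
    """Run by a server: XOR together the records whose selection bit is set."""
    answer = bytes(len(database[0]))
    for record, selected in zip(database, selection):
        if selected:
            answer = xor_bytes(answer, record)
    return answer

def make_queries(num_records, wanted_index):
    """Run by the client: two selection vectors that differ only at wanted_index."""
    first = [secrets.randbelow(2) for _ in range(num_records)]
    second = list(first)
    second[wanted_index] ^= 1
    return first, second

# Hypothetical database with equal-length records, held by both servers.
database = [b"rec-A", b"rec-B", b"rec-C", b"rec-D"]

query_1, query_2 = make_queries(len(database), wanted_index=2)
answer_1 = server_answer(database, query_1)  # computed by server 1
answer_2 = server_answer(database, query_2)  # computed by server 2

# All records except the wanted one cancel out in the XOR.
result = xor_bytes(answer_1, answer_2)
```

Taken alone, each query is a uniformly random bit vector, so neither server learns which record was requested.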

Another example of outsourced processing is digital shops or digital banking, where the server returns information to the user depending on inputs. A shop processes payments and delivers digital purchases (e.g. a game or a film), and a bank executes payments once the user has been authenticated. However, users’ purchasing habits can reveal a great deal about them. In this case, privacy can be preserved by using an oblivious transfer protocol, in which the service can transfer an item without knowing which item it is (from a set of many).
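A 1-out-of-2 oblivious transfer can be sketched with textbook RSA as follows. This is a toy illustration under several assumptions: the key is built from two small Mersenne primes and is far too weak for real use, the items are encoded as integers smaller than N, both parties are assumed to follow the protocol honestly, and all names are hypothetical.

```python
import secrets

# Toy RSA key held by the service (assumption: illustration only).
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent

def service_offer():
    """Service: publish two random values, one per item."""
    return secrets.randbelow(N), secrets.randbelow(N)

def buyer_choose(x0, x1, choice):
    """Buyer: blind the random value of the chosen item with a secret k."""
    k = secrets.randbelow(N)
    v = ((x1 if choice else x0) + pow(k, E, N)) % N
    return v, k

def service_respond(items, x0, x1, v):
    """Service: mask both items; only one mask equals the buyer's k."""
    k0 = pow((v - x0) % N, D, N)
    k1 = pow((v - x1) % N, D, N)
    return (items[0] + k0) % N, (items[1] + k1) % N

def buyer_recover(masked, choice, k):
    """Buyer: remove the mask from the chosen item."""
    return (masked[choice] - k) % N

items = (1111, 2222)  # hypothetical items encoded as integers
x0, x1 = service_offer()
v, k = buyer_choose(x0, x1, choice=1)
masked = service_respond(items, x0, x1, v)
received = buyer_recover(masked, 1, k)
```

The service sees only v, which looks random whichever item was chosen, and the buyer can unmask only the chosen item because the other mask depends on the service’s private exponent.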

When the sender of the data also participates in its processing, this is referred to as collaborative computation. The results of the computation may be of interest to the sender, the recipient, both, or third parties. If the participants do not trust each other, any of them may be a potential attacker. Typical applications include database comparison or statistical calculation on datasets. For example, when searching for similarities between two databases, protocols can be used that allow the parties to compute intersections of datasets without revealing anything else about the data. Such protocols can be explored under the terms Multi-Party Computation and Private Set Intersection.
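A private set intersection based on commutative blinding can be sketched as follows. This is a minimal sketch assuming two honest-but-curious parties, here called Alice and Bob; the toy modulus P, the example items, and the function names are assumptions, and a real protocol would use a properly chosen group.

```python
import hashlib
import secrets

P = 2**127 - 1  # toy prime modulus (assumption: illustration only)

def hash_to_group(item):
    """Map an item to a number modulo P."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % P

def blind(values, secret):
    """Raise each value to a secret exponent; the order of blinding does not matter."""
    return [pow(value, secret, P) for value in values]

def new_secret():
    return secrets.randbelow(P - 3) + 2

alice_items = ["alice@example.com", "carol@example.com", "dave@example.com"]
bob_items = ["bob@example.com", "carol@example.com", "dave@example.com", "erin@example.com"]
alice_secret, bob_secret = new_secret(), new_secret()

# Alice blinds her items and sends them to Bob, who blinds them again
# and returns them in the same order.
alice_once = blind([hash_to_group(x) for x in alice_items], alice_secret)
alice_twice = blind(alice_once, bob_secret)

# Bob blinds his own items and sends them to Alice, who adds her blinding.
bob_once = blind([hash_to_group(x) for x in bob_items], bob_secret)
bob_twice = set(blind(bob_once, alice_secret))

# Equal items give equal doubly blinded values, so Alice can match them.
intersection = [item for item, value in zip(alice_items, alice_twice) if value in bob_twice]
```

Alice learns which of her own items Bob also holds, while Bob’s remaining items reach her only in blinded form.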

Data obfuscation (advanced)

When confidentiality is achieved cryptographically, the cost is reduced efficiency and flexibility, as computation requires both processing power and bandwidth. In addition, encryption limits the possibilities for data processing. Obfuscation does not conceal data but restricts the inferences that can be made from it. It is important to assess how much can still be inferred from the data after obfuscation. It should also be noted that obfuscation reduces the information available to everyone, including legitimate users and not only the attacker, which limits where these techniques can be used.

Obfuscation techniques are not suitable for protecting data in transit. Instead, they are suitable for privacy-preserving outsourcing, collaborative computation, and publication. There are five types of obfuscation techniques:

  • Anonymisation: Identifiers are removed from the data points so that they become unlinkable, that is, they cannot be grouped as belonging to the same person. However, it should be noted that achieving full anonymisation is extremely difficult, and it is also difficult to determine when data is truly anonymous. The data itself contains information that can be correlated with different attributes or database records. These can then be combined with other data to find common factors and “fingerprints” that may ultimately lead to the identification of individuals.
  • Generalisation: The precision of the data is reduced, for example by sharing only the first two digits of a phone number or by giving an age group instead of an exact age.
  • Suppression: Part of the data is removed. For example, half of the phone numbers in a database are left out, or municipalities with fewer than five cases of a disease are not disclosed.
  • Dummy data: Fake records are added to the dataset so that it is difficult for an attacker to distinguish which data is real. Ideally, from the attacker’s perspective, every record in the dataset is equally likely to be a dummy. However, generating good dummy data is difficult, and it is often easy to filter out the most likely dummy entries.
  • Perturbation: Noise is added to the data to make it harder to draw conclusions. However, the noise must not be easy to filter out: if the attacker succeeds in removing it, too much may be revealed.

When using obfuscation techniques, it should be noted that any one of them alone rarely provides effective protection. It is therefore common to combine different techniques to achieve the desired protective effect. The topic will be revisited in the context of databases.
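Three of the techniques above, generalisation, suppression, and perturbation, can be sketched on a small dataset as follows. The records, thresholds, noise scale, and function names are all hypothetical and chosen only for illustration; the example makes no claim about how much protection these particular settings would give.

```python
import random
from collections import Counter

def generalise_age(age, width=10):
    """Generalisation: replace an exact age with an age group."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def suppress_small_counts(counts, threshold=5):
    """Suppression: withhold counts that are below the threshold."""
    return {key: (count if count >= threshold else None)
            for key, count in counts.items()}

def perturb(value, scale=1.0, rng=random):
    """Perturbation: add Laplace noise (difference of two exponential samples)."""
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return value + noise

# Hypothetical records: (age, municipality) of reported cases.
records = [(34, "A")] * 6 + [(35, "A")] * 6 + [(71, "B")] * 3

age_groups = Counter(generalise_age(age) for age, _ in records)
cases_by_municipality = suppress_small_counts(
    Counter(municipality for _, municipality in records)
)
noisy_total = perturb(len(records), scale=2.0)
```

Here municipality B, with only three cases, is withheld from the published counts, and the published total differs from the true total by a random amount.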

Confidentiality of metadata

Instead of, or in addition to, direct data, metadata can also be used to breach confidentiality and extract personal information from data. Metadata is generated by communications, the device, and its location.

Metadata of communications: Network-layer information such as IP address, data volume, timing, or connection duration is available to observers even if the data itself is encrypted or obfuscated. This information can be used to reveal potentially sensitive details. For example, in remote healthcare services, messages between the client and the doctor are encrypted, but the mere fact that a person is consulting a specialist in a particular field may reveal a possible illness. Another example is employees’ web browsing habits, which may allow inferences to be made about, for example, trade secrets or business partners.

Device metadata: In online services, information about the user’s device is sent to the service provider along with service requests so that the provider can optimise its responses. Even if users are anonymous at the network layer, device information may contain identifiers that allow service providers to track users across the web. This is because the combination of device characteristics (device, browser, browser version, language, plugins in use, operating system, resolution, fonts) is nearly unique for each user. Tracking is also performed using cookies, but this is a different matter from device metadata.
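How a combination of device characteristics can act as an identifier can be sketched as follows. This is a simplified illustration and not how any particular tracker works: the attribute names, their values, and the function name are hypothetical.

```python
import hashlib
import json

def fingerprint(attributes):
    """Combine device characteristics into one short, stable identifier."""
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Hypothetical characteristics sent along with two separate requests.
visit_1 = {
    "browser": "ExampleBrowser 120",
    "language": "fi-FI",
    "os": "ExampleOS 14",
    "resolution": "2560x1440",
    "fonts": ["Arial", "Noto Sans", "Ubuntu Mono"],
}
visit_2 = dict(reversed(list(visit_1.items())))  # same values, different order

other_device = dict(visit_1, resolution="1920x1080")

same_user = fingerprint(visit_1) == fingerprint(visit_2)
different_user = fingerprint(visit_1) != fingerprint(other_device)
```

No cookie or login is involved: the identifier is recomputed from the characteristics on every visit, so it stays the same for as long as the device configuration does.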

Location metadata can be obtained, for example, when users make queries in location-based services, from GPS data in photographs or social media content, or from data collected by access points. The data can be clustered, combined, and analysed using various methods, potentially revealing almost everything about a person’s life: place of residence, workplace, leisure locations, social contacts, medical visits, holidays, daily schedules, whether the person is at home, and so on. The list is virtually endless. In addition, inferences can be made about, for example, age and gender, and, ultimately, a person’s future movements can be predicted. From a privacy perspective, maintaining the confidentiality of location data is extremely important.
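How little analysis is needed to infer a place of residence can be sketched as follows. The sketch assumes that the location a person is most often at during the night is their home, which is a simplification; the data points, the night hours, the grid precision, and the function name are hypothetical.

```python
from collections import Counter
from datetime import datetime

def likely_home(points, night_start=22, night_end=6, precision=3):
    """Return the grid cell seen most often at night, or None if there is no data."""
    cells = Counter(
        (round(lat, precision), round(lon, precision))
        for timestamp, lat, lon in points
        if timestamp.hour >= night_start or timestamp.hour < night_end
    )
    return cells.most_common(1)[0][0] if cells else None

# Hypothetical (timestamp, latitude, longitude) points, e.g. from photo metadata.
points = [
    (datetime(2024, 1, 1, 23, 0), 60.1699, 24.9384),
    (datetime(2024, 1, 2, 2, 0), 60.1701, 24.9381),
    (datetime(2024, 1, 2, 10, 0), 60.2055, 24.6559),   # daytime, ignored
    (datetime(2024, 1, 2, 23, 30), 60.1702, 24.9379),
]

home_cell = likely_home(points)
```

Rounding the coordinates to three decimals groups nearby points into cells roughly a hundred metres across, so small differences between individual readings do not hide the pattern.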

Answer the question.

The key sources of metadata are (a) communications, (b) device, and (c) location. Their significance for privacy naturally varies case by case, but according to the description in the text, what is their order from most critical to least critical?