The DNS Privacy solutions presented here ensure that DNS queries made by an individual end user can’t be observed by eavesdroppers as they pass across the Internet. Only the operators of DNS privacy servers have access to the details of the queries. For operational reasons, such as monitoring server performance or detecting and mitigating attacks, operators need to keep logs of the DNS queries they see; in some circumstances they may need to share those logs with other operators. To preserve end user privacy, as RFC6973 [17] observes, it is important that the data in these logs limits the identifiability of end users and, more generally, that the data in the logs is kept to the minimum required for the purpose, a process the RFC terms data minimisation.
Data minimising a trace or log of network traffic therefore includes ensuring that the recorded data does not contain privacy-sensitive information. This is typically personal data, or data that can be used to link a record to an individual, but may also include other confidential information, for example on the structure of an internal corporate network. In the case of identifiability, this means that if individual user identifiers cannot be omitted altogether, pseudonyms should be used instead.
The problem of effectively ensuring that DNS query logs do not contain privacy-sensitive information is not one that currently has a generally agreed solution. This page gives an overview of current approaches to identifier pseudonymisation. As RFC7626 [16] makes clear, the main privacy risk in DNS is connecting DNS queries to an individual, so at present the main focus is on pseudonymising client IP addresses (though, of course, the MAC address, VLAN identifier and ARP data may also identify a user in particularly localised environments).
De-identifying a dataset is generally done using either anonymisation or pseudonymisation. This discussion uses the definitions from RFC6973 [17] Section 3, with additional observations from van Dijkhuizen et al. [1]
Anonymisation. To enable anonymity of an individual, there must exist a set of individuals that appear to have the same attribute(s) as the individual. To the attacker or the observer, these individuals must appear indistinguishable from each other.
Pseudonymisation. The true identity is deterministically replaced with an alternate identity (a pseudonym). When the pseudonymisation schema is known, the process can be reversed, so the original identity becomes known again.
In practice there is a fine line between the two; for example, how to categorise a deterministic algorithm for pseudonymisation of IP addresses that produces a group of pseudonyms for a single given address?
Awareness of the need for pseudonymising data, so that significant corpuses of captures could be shared for research purposes, sparked research into pseudonymisation, particularly of IP addresses, in the late 1990s and early 2000s. Several techniques reflecting different requirements for the pseudonymised addresses and different performance/resource tradeoffs emerged over the course of the decade. Developments over the last decade have been both a blessing and a curse; the large increase in size between an IPv4 and an IPv6 address, for example, renders some techniques (in particular TSA) impractical, but also makes available a much larger amount of input entropy. Pseudonymised IPv6 addresses are therefore much better placed to resist brute force re-identification attacks than IPv4 addresses. Several authors (e.g. Brenker & Arnes [11]) have observed that today any IPv4 address pseudonymisation is vulnerable to a brute force attack. This is particularly so if the attacker can ensure that packets are captured by the target and can send forged traffic with arbitrary source and destination addresses to the target, which permits an attack along the lines of a cryptographic chosen plaintext attack.
Categorising techniques for anonymising logs
The ways in which data may be pseudonymised can be classified into some broad categories.
This list is derived from RFC6235 and van Dijkhuizen et al. [1]
A pseudonymising technique may also have properties desirable in a particular application:
Since May 2010, Google Analytics has provided a facility that allows website owners to request that all their users' IP addresses are pseudonymised within Google Analytics processing. This very basic pseudonymisation simply sets to zero the least significant 8 bits of IPv4 addresses, and the least significant 80 bits of IPv6 addresses. The level of pseudonymisation this produces is perhaps questionable. There are some analysis results [13] which suggest that the impact of this on reducing the accuracy of determining the user’s location from their IP address is less than might be hoped; the average discrepancy in identification of the user city for UK users is no more than 17%.
Anonymisation: Format-preserving, Filtering (grey marking).
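The truncation described above can be sketched in a few lines of Python. This is only an illustration of the bit masking, not Google's implementation; the function name truncate_ip is hypothetical.

```python
import ipaddress


def truncate_ip(addr: str) -> str:
    """Zero the low 8 bits of an IPv4 address or the low 80 bits of an IPv6 address.

    Hypothetical helper illustrating the masking only.
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        drop, cls = 8, ipaddress.IPv4Address
    else:
        drop, cls = 80, ipaddress.IPv6Address
    # Shift the low bits out and back in again so that they become zero.
    masked = (int(ip) >> drop) << drop
    return str(cls(masked))


print(truncate_ip("192.0.2.123"))                             # 192.0.2.0
print(truncate_ip("2001:db8:1234:5678:9abc:def0:1234:5678"))  # 2001:db8:1234::
```

Every user in the same IPv4 /24 or IPv6 /48 maps to the same output, which is why the result still localises the user fairly well.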
Since 2006, PowerDNS have included a data minimisation tool, dnswasher [14], with their PowerDNS product. This is a PCAP filter that performs a one-to-one mapping of end user IP addresses to pseudonymised addresses. A table of user IP addresses and their de-identified counterparts is kept; the first IPv4 user address is translated to 0.0.0.1, the second to 0.0.0.2 and so on. The de-identified address therefore depends on the order in which addresses arrive in the input, and when running over a large amount of data the address translation tables can grow to a significant size.
Anonymisation: Format-preserving, Enumeration.
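A minimal sketch of enumeration, covering IPv4 only, is shown here. It is not the dnswasher code; the class name EnumeratingMapper and its interface are assumptions made for illustration.

```python
import ipaddress


class EnumeratingMapper:
    """Map each new IPv4 address to the next unused counter value.

    Hypothetical illustration of enumeration; not taken from dnswasher.
    """

    def __init__(self) -> None:
        self._table = {}      # original address -> pseudonymised address
        self._next_value = 1  # the first address seen becomes 0.0.0.1

    def map(self, addr: str) -> str:
        ip = ipaddress.IPv4Address(addr)
        if ip not in self._table:
            self._table[ip] = ipaddress.IPv4Address(self._next_value)
            self._next_value += 1
        return str(self._table[ip])


mapper = EnumeratingMapper()
print(mapper.map("198.51.100.7"))  # 0.0.0.1
print(mapper.map("203.0.113.9"))   # 0.0.0.2
print(mapper.map("198.51.100.7"))  # 0.0.0.1 again: the table is consulted
```

The table gains one entry per distinct address and is never pruned, which is the memory growth noted above. Feeding the same addresses in a different order yields different pseudonyms.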
Used in TCPdpriv [2], this algorithm stores a set of original and pseudonymised IP address pairs. When a new IP address arrives, it is compared with previous addresses to determine the longest prefix match. The new address is pseudonymised by using the same prefix, with the remainder of the address pseudonymised with a random value. The use of a random value means that TCPdpriv is not deterministic; different pseudonymised values will be generated on each run. The need to store previous addresses means that TCPdpriv has significant and unbounded memory requirements, and because of the need to allocate pseudonymised addresses sequentially it cannot be used in parallel processing.
Anonymisation: Format-preserving, prefix preservation (general), replacement, random substitution.
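The idea of a randomised prefix-preserving map can be sketched as follows. This is not TCPdpriv's data structure. Instead of storing address pairs and searching for the longest prefix match, the sketch remembers one random flip bit per prefix seen so far, which has the same prefix-preserving effect. The class name is hypothetical and only IPv4 is handled.

```python
import ipaddress
import random


class RandomPrefixPreservingMap:
    """Prefix-preserving IPv4 map driven by stored random bits (a sketch).

    Assumption: one random flip bit is remembered for every prefix seen,
    rather than the address-pair table that TCPdpriv itself keeps.
    """

    def __init__(self) -> None:
        self._rng = random.SystemRandom()
        self._flip = {}  # prefix as a bit string -> flip (0 or 1) for the next bit

    def map(self, addr: str) -> str:
        bits = format(int(ipaddress.IPv4Address(addr)), "032b")
        out = []
        for i, bit in enumerate(bits):
            prefix = bits[:i]
            if prefix not in self._flip:
                # First time this prefix is seen: choose its flip at random.
                self._flip[prefix] = self._rng.getrandbits(1)
            out.append(str(int(bit) ^ self._flip[prefix]))
        return str(ipaddress.IPv4Address(int("".join(out), 2)))


m = RandomPrefixPreservingMap()
a = m.map("192.0.2.1")
b = m.map("192.0.2.200")
# a and b share exactly their first 24 bits, as the two inputs do.
```

The stored flips grow with every new prefix and differ on each run, mirroring the unbounded memory use and non-determinism described above.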
Cryptographic prefix-preserving pseudonymisation was originally proposed as an improvement to the prefix-preserving map implemented in TCPdpriv, described in Xu et al. [3] and implemented in the Crypto-PAn tool [4]. Crypto-PAn is now frequently used as a name for the algorithm. Initially it was described for IPv4 addresses only; extension for IPv6 addresses was proposed in Harvan & Schönwälder [5] and implemented in snmpdump [6]. This uses a cryptographic algorithm rather than a random value, and thus the pseudonymisation is determined uniquely by the encryption key, and is deterministic. It requires a separate AES encryption for each output bit, so has a non-trivial calculation overhead. This can be mitigated to some extent (for IPv4, at least) by pre-calculating results for some number of prefix bits.
Pseudonymisation: Format-preserving, prefix preservation (general), cryptographic permutation.
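The structure of a keyed prefix-preserving scheme can be illustrated as below. Note the substitution: Crypto-PAn uses AES, whereas this sketch uses HMAC-SHA256 from the Python standard library as a stand-in keyed function, so its output will not match Crypto-PAn's. The function name is hypothetical.

```python
import hashlib
import hmac
import ipaddress


def prefix_preserving_pseudonym(addr: str, key: bytes) -> str:
    """Keyed, deterministic, prefix-preserving address mapping (a sketch).

    Assumption: HMAC-SHA256 stands in for the AES-based function of
    Crypto-PAn; only the overall structure is the same.
    """
    ip = ipaddress.ip_address(addr)
    width = ip.max_prefixlen  # 32 for IPv4, 128 for IPv6
    cls = ipaddress.IPv4Address if ip.version == 4 else ipaddress.IPv6Address
    bits = format(int(ip), f"0{width}b")
    out = 0
    for i in range(width):
        # One keyed computation per output bit, depending only on the
        # preceding i input bits; this is what preserves prefixes.
        message = f"{ip.version}:{bits[:i]}".encode("ascii")
        digest = hmac.new(key, message, hashlib.sha256).digest()
        flip = digest[0] >> 7  # top bit of the digest, 0 or 1
        out = (out << 1) | (int(bits[i]) ^ flip)
    return str(cls(out))


key = b"example key, not for production use"
a = prefix_preserving_pseudonym("192.0.2.1", key)
b = prefix_preserving_pseudonym("192.0.2.200", key)
# a and b share exactly their first 24 bits, on every run with this key.
```

No per-address state is kept, so the mapping can be computed in parallel, at the cost of 32 or 128 keyed operations per address.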
Proposed in Ramaswamy & Wolf [7], Top-hash Subtree-replicated Anonymisation (TSA) originated in response to the requirement for faster processing than Crypto-PAn. It uses hashing for the most significant byte of an IPv4 address, and a pre-calculated binary tree structure for the remainder of the address. To save memory space, replication is used within the tree structure, reducing the size of the pre-calculated structures to a few MB for IPv4 addresses. Address pseudonymisation is done via hash and table lookup, and so requires minimal computation. However, due to the much increased address space for IPv6, TSA is not memory efficient for IPv6.
Pseudonymisation: Format-preserving, prefix preservation (general), cryptographic permutation.
A recently-released proposal from PowerDNS [8], ipcipher is a simple pseudonymisation technique for IPv4 and IPv6 addresses. IPv6 addresses are encrypted directly with AES-128 using a key (which may be derived from a passphrase). IPv4 addresses are similarly encrypted, but using a recently proposed (and confusingly closely named) cipher, ipcrypt [9], suitable for 32-bit block lengths. However, the author of ipcrypt has since indicated [10] that it has low security, and further analysis has revealed it is vulnerable to attack. At the time of writing, progress on ipcipher appears to have stalled.
Pseudonymisation: Format-preserving, cryptographic permutation.
van Rijswijk-Deij et al. [15] have recently described work using Bloom filters to categorise query traffic and record the traffic as the state of multiple filters. By this means, it is possible to determine with a high probability if, for example, a particular query was made, but the set of queries made cannot be recovered from the filter. Similarly, by mixing queries from a sufficient number of users in a single filter, it becomes practically impossible to determine if a particular user performed a particular query. Large numbers of queries can be tracked in a memory-efficient way. As filter status is stored, this approach cannot be used to regenerate traffic, and so cannot be used with tools used to process live traffic.
Anonymisation: Generalisation.
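For readers unfamiliar with Bloom filters, a generic one is sketched below. It is not the implementation described by van Rijswijk-Deij et al.; the filter size, the number of hash functions and the use of SHA-256 are arbitrary assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter over strings (illustrative parameters only)."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably added"; False means "definitely not added".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
seen.add("example.com.")
print("example.com." in seen)  # True
```

Only the bit array is stored, so the queries that set those bits cannot be listed from the filter; a membership test can return a false positive but never a false negative.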
(A placeholder list).
TTL/Hoplimit (if present) can be used to fingerprint client OS.
MAC address/VLAN.
Lack of randomisation of the DNS ID can likewise be used to fingerprint the client.
All queries down a single TCP stream must come from the same host.