%\appendix
\openchapterblock
\chapter{De-identification}
\label{appendix:deidentification}
\section{Detailed analysis of studies claiming use of ``non-identifiable'' data}
\subsection{Analysis of Dataset 1}
% \nb{Paul Gastin (by coincidence) is one of the authors of \cite{Robertson2015}, and the related paper \cite{Woods2015}, which show as the top two Google Scholar search results for the terms ``Australian'' ``football'' ``non-identifiable'' (in a private browser window to prevent bias from any of my prior searches)}
In a study of selection attributes in elite junior Australian Rules football \cite{Robertson2015}, the authors claim ``access and consent to non-identifiable testing data was provided by each of the relevant state-based organisations and the study was approved by the relevant human research ethics advisory group.'' No details were provided about whether individual consent of participants had also been sought. The data includes player birthdates, performance and anthropometric data, as well as the outcome variable of interest: whether the player was drafted. The study used logistic regression and the JRip algorithm to predict whether the player was drafted as a function of the other variables in the dataset. The use of logistic regression and the JRip algorithm in this manner implies that the authors must have had access to individual data rows for each player rather than group averages. The use of the term ``non-identifiable'' is questionable; the study specifically dealt with elite junior players who may progress into elite Australian Rules football, at which point their names and birthdates are likely to become public information. These details could then be linked back to the full attributes of the junior player using birthdate as a key. Thus, while the participants may have been ``non-identifiable'' at the time of the study, it is likely that some of them will become re-identifiable once more details become public. The classification of the data as ``non-identifiable'' means that it could become part of a research databank kept long term, despite the data containing details that are likely to reveal participants at a future date.
The same authors published a second paper \cite{Woods2015}, analysing the same dataset as \cite{Robertson2015} from the perspective of age distribution. As some players did not participate in all performance tests (e.g. if injured, players would not be able to participate in performance tests that stress the injured part of the body), the number of players used to report averages differs between the two papers; there were up to $n=292$ drafted players in \cite{Woods2015} who were measured, but only $n=212$ complete player profiles selected for use in \cite{Robertson2015}. By comparing the differences between the two papers, it is possible to infer details about the group of players who were included in the first publication, but not the second. For example, \cite{Robertson2015} reports a mean drafted player height of 186.4 cm ($n=281$) and a mean drafted player body mass of 79.5 kg ($n=282$), whereas \cite{Woods2015} reports a mean drafted player height of 185.7 cm ($n=212$) and a mean drafted player body mass of 79.5 kg ($n=212$). From the difference, it is possible to calculate that the excluded group (69 and 70 players, respectively) was taller than average (188.5 cm) and had greater body mass (81.0 kg). These differences alone give some indication of the characteristics of the excluded group. Furthermore, an attacker may have information from auxiliary sources as to which players belong to the excluded group\footnote{For example, a 2017 AFL Media article names players who were not able to complete some of the tests that year due to injury \url{http://www.lions.com.au/news/2017-10-06/afl-draft-combine-wrap}}. In a similar manner, an attacker could calculate the average performance results (in the tests that injured players were able to perform) for the excluded group.
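
The height calculation above can be written out explicitly. As a sketch, write $n_A$ and $\bar{x}_A$ for the size and reported mean of the larger group, and $n_B$ and $\bar{x}_B$ for those of the smaller group. This notation is introduced here for illustration and does not appear in either paper, and the calculation assumes that the smaller group is a subset of the larger one. The mean of the excluded players is then
\[
\bar{x}_{\mathrm{excl}}
= \frac{n_A \bar{x}_A - n_B \bar{x}_B}{n_A - n_B}
= \frac{281 \times 186.4 - 212 \times 185.7}{281 - 212}
= \frac{52378.4 - 39368.4}{69}
\approx 188.55~\mathrm{cm}.
\]
Assuming each published mean is rounded to the nearest 0.1 cm (an error of at most $\delta = 0.05$ cm), the recovered value is accurate to within roughly
\[
\frac{(n_A + n_B)\,\delta}{n_A - n_B}
= \frac{(281 + 212) \times 0.05}{69}
\approx 0.36~\mathrm{cm}.
\]
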
While the performance attributes of individuals are not sensitive (identifiable lists of the top performers in each test are published by the AFL each year), the lack of public identifiable data covering every individual suggests that only the results of the top performers are intended to be public information. In this case, there were too many players in the excluded group for knowledge of the group statistics to allow an attacker to infer individual player results. However, it appears that the attack was only avoided by chance (because multiple players with incomplete profiles needed to be excluded) rather than by design, and it could thus pose a threat to participant privacy in other studies where the excluded group with incomplete profiles is small. This highlights the need for a mechanism to preserve privacy at the data level rather than post-publication, to ensure that participants are not revealed by the differences between results when multiple publications discuss the same dataset from different perspectives.
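
To illustrate why a small excluded group is the dangerous case, consider the hypothetical situation in which two publications report means $\bar{x}_{n}$ and $\bar{x}_{n-1}$ over $n$ and $n-1$ players from the same dataset (the symbols are illustrative only and are not taken from the studies discussed). The single excluded player's value then follows directly:
\[
x_{\mathrm{excl}} = n\,\bar{x}_{n} - (n-1)\,\bar{x}_{n-1}.
\]
If each reported mean is rounded with an error of at most $\delta$, the recovered value is uncertain by at most $(2n-1)\,\delta$. Rounding therefore offers some incidental protection when $n$ is large, but that protection diminishes as the study becomes smaller or as means are reported to more decimal places.
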
\subsection{Analysis of Dataset 2}
Greenham et al. \cite{Greenham2017} perform a pilot study to measure game style in Australian Rules Football. They state that their research is exempt from ethics review, as ``non-identifiable player data, from identifiable team-based data-sets, were used in this study''.
Their measure of game style was derived from 12 variables. In this re-analysis, each variable was re-considered from the perspective of de-identification.
Nine of their variables were either publicly reported, or could potentially have been derived from public data (e.g. \textit{Shot at goal accuracy}); thus, for the purposes of de-identification, there is no need to consider these further. \textit{Location of Goal attempts} is potentially a de-identification issue, as it is possible to re-identify players if their location is known; however, the study only used the proportion of goal attempts taken from close to the goal, and it is possible that Champion Data precomputed this prior to providing the data to the researchers. \textit{Ball Speed} was derived from video footage. While video footage is obviously identifiable, the use of public video footage is not a privacy issue, and the research may still be exempt from ethics approval. \textit{Offensive and defensive player numbers in the 50 m zone} and the closely related \textit{differential in team player numbers} were derived ``using video footage recorded behind the goals''. This is an issue as, contrary to the authors' claim, behind-the-goals video footage is not public\footnote{Clubs are provided with ``exclusive behind the goals vision'' recordings \url{https://www.foxsports.com.au/afl/geelong-coach-chris-scott-explains-why-afl-coaches-bother-going-to-games-inperson-in-2016-with-video-technology/news-story/c9b99fbf9472483491294056c03bf25b}, which are considered a ``game-changer'' for football analysis \url{http://www.afl.com.au/news/2018-02-18/secret-spies-the-life-of-an-opposition-analyst}. A news report in 2013 revealed that clubs paid \$28,000/year each for the footage, with prices expected to rise to \$60,000/year per club in 2014.
\url{https://www.theage.com.au/sport/afl/afl-doubles-tv-costs-20131025-2w7d1.html}}, nor de-identified\footnote{Even in the hypothetical case that the authors were to ask the video provider to blur out faces and player numbers in the behind-the-goals video, the position and movements of players evident within the video would still allow particular players in the footage to be re-identified.}. However, one could potentially argue that a spectator could observe the same information if they had reserved the right seat at the game.
The table in the paper only provides summary statistics taken over the entire group of games. However, the visual ``game style plot'' (a parallel coordinates visualisation) shows z-scores for individual teams, identified by team name. Nevertheless, it is unlikely that one could infer details of individual players from this plot. While the information revealed in the paper was limited, it demonstrates that attention needs to be given to data revealed in figures and visualisations, not just the main text and tables.
% http://www.afl.com.au/news/2018-02-18/secret-spies-the-life-of-an-opposition-analyst
% "THE AFL's decision to begin supplying behind-the-goals vision to clubs about a decade ago was "the biggest game-changer in this role", according to Harding. "
% http://www.heraldsun.com.au/sport/afl/afl-seeks-better-behind-goa-vision/news-story/367ae612ab0ed52c1b0c78e5b4596136?sv=33f5e47008ac5a8cd791cac81f766488
% (afl wants more broadcasters to record behind the goals so that there is a high quality video feed)
% Evidence that clubs pay for footage:
% https://www.theage.com.au/sport/afl/afl-doubles-tv-costs-20131025-2w7d1.html
% (clubs can buy behind the goals vision at a cost of $28K (2013) to $60K (2014))
% "A move by the AFL to award its media department the contract to provide crucial match-day vision to clubs at more than double the previous cost has received mixed reviews across the competition."
% "AFL had put the contract through a tender process and awarded it to AFL Media, which is now headed by former Foxtel chief Peter Campbell. "
% https://www.foxsports.com.au/afl/behind-the-goals-vision-has-revealed-the-culprits-responsible-for-north-melbournes-fifth-consecutive-loss/news-story/8d827f9d7670d3f7311d759bfa44f8aa
% (seems that sport performance analysts have access to behind the goals footage) (show often involves a discussion with coach, so perhaps that's how they found out about details)
% https://www.foxsports.com.au/afl/geelong-coach-chris-scott-explains-why-afl-coaches-bother-going-to-games-inperson-in-2016-with-video-technology/news-story/c9b99fbf9472483491294056c03bf25b
% "But in 2016 ... plenty of extra camera angles provided to clubs including exclusive behind the goals vision"
%
% Arguably, is only for games played in public -- so could have potentially collected this information by selecting a seat in the correct location at the game.
\subsection{Analysis of Dataset 3}
Jacob et al. \cite{Jacob2016} perform a pilot study investigating the link between genetic polymorphisms and performance in Australian Rules Football. The study collected individual consent of players (and parental consent for players under 18). The study stated that to ``ensure anonymity, the players were assigned a randomised, non-identifiable code.''
From a de-identification perspective, two aspects of this study make de-identification difficult. Firstly, it involved only 30 participants. While small samples present well-understood issues for validity, as they risk describing the group rather than the population, this can also be understood as a privacy issue, as a description of the characteristics of the group can be used to make inferences about its members. Secondly, genetic polymorphisms can vary in distribution between racial and ethnic groups; thus, revealing genetic markers of a sub-group of the study may unintentionally reveal the likely racial or ethnic profile of that sub-group.
The study does not reveal the players studied, stating only that the study ``recruited 30 sub-elite Australian [Rules] Football players''; presumably this is to prevent individual players being identified. However, all authors of the study were from the University of Notre Dame Australia, Fremantle, Australia; it is thus likely that the players were recruited from a club within close proximity of Fremantle. Furthermore, a publicly available author pre-publication copy of the study mentions East Fremantle Football Club in the acknowledgement section. East Fremantle Football Club, and other clubs in the area, publicly publish the names of players in their teams. Thus, attempting to remove the name of the club from the published paper provided only superficial privacy, and it is reasonable to assume that an attacker could infer the list of players that potentially participated in the study.
The study considered polymorphisms of 9 genes, and published regression coefficients for the association between each genotype and performance. Amongst these, the study found ``the ACE [angiotensin-converting enzyme] DD genotype, associated with higher plasma ACE levels, had the greatest positive impact on [\arf{}] players in traditional power and aerobic athletic assessments, as well as in sport-specific skill assessments.'' However, it is also necessary to consider what information this reveals to an attacker regarding the identity of participants. The ACE DD genotype is known to occur in lower proportions within certain ethnic groups: notably, a study of blood and kidney donors \cite{Lester1999} found that Australian Aboriginals had a D allele frequency of 14\%, compared to 55\% for Australian Caucasians; furthermore, one tribe of Australian Aboriginals was found to have a D allele frequency of just 3\%. In Jacob et al., the performance results published for the DD genotype were derived from a group of just 10 players. While the study does not specify who these players were, an attacker can refine the possible candidates by inferring that they were unlikely to be Australian Aboriginals given that they had the DD genotype.
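
As a rough illustration of the strength of this inference, the allele frequencies reported in \cite{Lester1999} can be converted to approximate DD genotype frequencies. This sketch assumes Hardy--Weinberg equilibrium (an assumption introduced here, not one made in either study), under which a D allele frequency $p$ gives a DD genotype frequency of $p^2$:
\[
P(\mathrm{DD} \mid p = 0.14) = 0.14^2 \approx 0.020, \qquad
P(\mathrm{DD} \mid p = 0.55) = 0.55^2 \approx 0.303, \qquad
P(\mathrm{DD} \mid p = 0.03) = 0.03^2 \approx 0.001.
\]
Under that assumption, observing the DD genotype multiplies an attacker's prior odds that a player belongs to the first group rather than the second by a likelihood ratio of approximately $0.020 / 0.303 \approx 0.065$.
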
In three cases, a genotype corresponded to only a single player. Thus the regression coefficients for these genotypes correspond to the player profile of a specific player. While the study does not make the identity of the player known, hypothetically, if the player were to exhibit exceptional results for one of the tests, an attacker familiar with the team may be able to infer the likely identity of the player for that test, and use it to look up their results on the other tests (as they were the only player with that specific genotype). As before, the genotype itself also reveals information about that player's race and ethnicity, which could help the attacker reduce the possible candidates.
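
The claim that such a coefficient exposes a single player can be made concrete under a simplifying assumption. Suppose, hypothetically, that the model is an ordinary least-squares regression in which each genotype $g$ enters through an indicator variable $d_{ig}$, equal to 1 if player $i$ has genotype $g$ and 0 otherwise (the study's actual model specification may differ). If only player $j$ has genotype $g$, the least-squares condition for the coefficient $\beta_g$ reduces to a single term,
\[
\sum_{i} d_{ig}\,(y_i - \hat{y}_i) = y_j - \hat{y}_j = 0,
\]
where $y_i$ is the observed test result and $\hat{y}_i$ the fitted value for player $i$. The fitted value therefore reproduces that player's result exactly, and $\beta_g$ equals $y_j$ minus the contribution of the remaining terms in the model for player $j$.
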
% \nb{Due to sensitive nature of discussion involving race / ethnicity, may be worth reviewing language to determine if anything could inadvertently cause offence. The term ``Australian Aboriginal'' was the term used in the study of blood and kidney donors by Lester, 1999 \cite{Lester1999}. The reason for specifically stating ``Australian Aboriginal'' rather than the more common classification ``Indigenous Australian'' used by the AFL is because the study of blood and kidney donors by Lester does not include statistics for the DD genotype frequency amongst Torres Strait Islander people.}
\closechapterblock