Nanna Bonde Thylstrup – Dataset Ethics: Deleting traces, encountering remains

Nanna described her presentation as covering projects she is currently working on, and said she was both nervous and excited about potential collaborations with the work we do at the Institute, which will hopefully lead to fruitful conversations. One of these projects is called AI Reuse, which examines the politics and ethics of reuse in Artificial Intelligence (AI) and Machine Learning (ML) systems and draws on critical work on these systems to advise the Danish public sector on what to be mindful of when implementing such technologies. Denmark has declared its intention to take the lead in implementing digital transformations within the public sector, making it a particularly receptive environment in which to conduct this work. She is also a member of the data ethics component of PIN, a research project funded by Innovation Fund Denmark on the ethics of AI systems in the Danish news sector. Within this work, Nanna draws on both Wendy's and Zeerak's work to understand the ethics of NLP systems. Finally, she is a member of the research collective Uncertain Archives, which mobilizes critical archival theory to understand the politics and ethics of big data.

ML systems require ever-increasing amounts of data to run and evolve, and as a result the demand for datasets around the world has also increased. Nanna's work focuses on how to encounter these dataset archives responsibly. She described how, while datasets are fundamental to ML systems, the archives are rarely an object of study in themselves, and have only recently been focussed on by people like Timnit Gebru and others. Eun Seo Jo and Timnit Gebru situate them as critical archives of sociocultural data. Nanna therefore engages with datasets through the lens of critical archival studies, which is helpful for thinking about both the politics of datasets and the ethics of dataset encounters.

Hence the title of the talk, 'Dataset Ethics: Deleting traces, encountering remains'. Nanna described how datasets can be meaningfully approached as archives: they record the historical moments and feelings that gave rise to them, and they inherently inhabit entanglements with gender, labour and colonial legacies. It is therefore necessary to consider the ethics of engaging with these archives, which is where the critical archival theory element becomes so important.

As any good researcher does, Nanna looked at the etymology of the word 'dataset', referring to the OED's definition of "a collection of data", a term in use only since 1958 and thus relatively recent. Its conceptual history, however, is embedded in a much longer history of quantification and the standardization of numbers, and of their role in informing political decision-making, scientific exploration, corporate strategy and the like. She described datasets as unambiguous in terminology but as playing a rhetorical role in politics, power, and quantification.

One particular example of this comes from Jacqueline Wernimont's book Numbered Lives, which describes how enslaved people were counted as three-fifths of a person for US census purposes, demonstrating how power is embodied in data collection, governance and interpretation. Conversely, Nanna described how subversive movements have also used datasets to combat oppression, particularly feminist, LGBTQ, Black and other social justice movements.

This embodiment of data is particularly fascinating, reinforcing the capacity of datasets to describe historical moments and the events behind them, and, as Nanna described, situating the archives as "potential places for the recovery of suppressed or marginalized histories." And it is not only social justice movements for which the context of a dataset matters; Nanna also referred to disciplines that demand it, using the fascinating example of plant phenomics, which requires both accuracy and traceability in its use of datasets, whereas economics focuses primarily on accuracy. These differing needs have implications for reuse potential, as data collected with one priority in mind may not be suitable for another.

Further, in commercial contexts traceability is not a primary focus; the emphasis falls instead on the aggregation of data and the linkages between datasets. So while data as originally collected conforms to the law of the place where it was gathered, and to its author's intentions, once it is transformed through ML practices it becomes "unmoored" from those original factors and becomes a separate, dissolute entity. All of these considerations come into play when drawing on archival studies for datasets. Finally, the notion of consent is easily obscured once data is transformed, reused, resold and redistributed across public and private sectors. There are very few ethical guidelines for how to treat this consent, or the data that was provided under it, something whose implications Nanna and her colleagues are exploring.

So, in this moment where these issues are being investigated more thoroughly, a disparate but related group of scholars is looking into the politics and ethics of datasets, particularly as they relate to machine learning. Nanna mentioned a number of them, listed here for reference:

  • Science and technology studies (STS), e.g. David Ribes, Stephen Slota
  • Platform infrastructure studies, e.g. Jean-Christophe Plantin, Paul Edwards
  • Artistic research, e.g. Kate Crawford and Trevor Paglen, Adam Harvey, Everest Pipkin, Philipp Schmitt, Linda Kronman, Mimi Onuoha
  • Information science, e.g. Irene Pasquetto, Christine Borgman
  • ML/AI ethics, e.g. Emily Denton, Alex Hanna, Os Keyes, Vinay Prabhu and Abeba Birhane, Deb Raji, Timnit Gebru, Margaret Mitchell, Zeerak Waseem

Nanna reported that there have been instances where corporations and universities have removed access to particular datasets used for machine learning, which suggests that serious questions are being asked about their liability and ethics. However, this approach does not solve all of the issues, and it even creates new ones, as deletion does not always remove all the associated data, especially where it has already been repurposed. This is a feature of data and digital media more broadly, as has been covered in detail by people like Wendy in her work on software and memory. It is in the nature of digital media to create problems for understanding, through its endless re/creation of content and its inability to fix errors in previous versions of datasets. These traces then remain in the datasets used for training ML and AI systems, perpetuating the problems, as many have demonstrated.

Fundamentally, as Nanna describes, "understanding datasets as iterative spaces in which we repeat, rehearse, re-encounter and re-member the past as present also opens up to an ethical impulse that can help us acknowledge the traces that haunt us even if formal history has put them to rest."

Nanna referred to one really interesting artwork by Everest Pipkin called 'Lacework', which creates slow-moving, hallucinatory vignettes using videos from the MIT dataset 'Moments in Time'. Pipkin says they encountered a lot of difficulty in creating the work, partly because of the enormous size of the dataset, much of which has never been watched by researchers, but also because the researchers did not ask for permission to use the films. There are also "moments of extreme emotion" (Pipkin) in the archive: they describe love, loss, death and pornography as all featuring within the dataset. Again this calls into question the labour that went into making the films, the Amazon Mechanical Turk workers who uploaded them, and the control exerted by those maintaining the archive. Here Nanna also referred to the notion of "second-order violence", developed by Fuentes (2016) and Agostinho (2019) from Hartman (1997), in which those traces of violence are then reused.

Finally, Nanna expanded on the concept of traceability as put forward by Timnit Gebru et al. (2018): those using ML practices are rarely informed about the provenance of the data they employ; as long as it fulfills their purposes, there is no need to engage with where it came from. In more conventional forms of manufacturing, every element of a product has a supply chain that can be traced for accountability purposes, but this is not so in ML and digital media. Traceability is therefore a potential suggestion for how to proceed towards a more accountable dataset archive, though not a straightforward one. As Nanna closed with: "Cultural theories of the archive, however, frame traces as a cultural problem of remains. These approaches to traceability configure traces not as a question of linear paths that can be followed to the Ursprung of the problem, but rather as a form of cultural iteration that enfolds pasts and futures within the present in ways that are perhaps felt, but not always remain legible to us." In thinking through how these problems can be addressed, Nanna's work, and the work of those she referred to throughout, feels incredibly important, and has not, so far anyway, received the attention it needs and deserves.

Thank you, Nanna, for such a thorough, and thoroughly thought-provoking, presentation that resonates with the work being done at the Institute and beyond. We look forward to seeing how it progresses!