The privacy risks posed by generative AI are very real. From increased surveillance to exposure to more effective phishing and vishing campaigns than ever before, generative AI erodes privacy en masse and indiscriminately, while providing bad actors, whether criminal, state-sponsored, or governmental, with the tools they need to target individuals and groups.
The clearest solution to this problem involves consumers and users collectively turning their backs on AI hype, demanding transparency from those who develop or implement so-called AI features, and effective regulation from the government bodies that oversee their operations. Although worth striving for, this isn’t likely to happen anytime soon.
What remains are reasonable, even if necessarily incomplete, approaches to mitigating generative AI privacy risks. The long-term, sure-fire, yet boring prediction is that the more educated the public becomes about data privacy in general, the smaller the privacy risks posed by the mass adoption of generative AI will be.
What Do We Mean by Generative AI?
The hype around AI is so ubiquitous that a survey of what people mean by generative AI is hardly necessary. Of course, none of these “AI” features, functionalities, and products actually represent examples of true artificial intelligence, whatever that would look like. Rather, they’re mostly examples of machine learning (ML), deep learning (DL), and large language models (LLMs).
Generative AI, as the name suggests, can generate new content – whether text (including programming languages), audio (including music and human-like voices), or videos (with sound, dialogue, cuts, and camera changes). All this is achieved by training LLMs to identify, match, and reproduce patterns in human-generated content.
Let’s take ChatGPT as an example. Like many LLMs, it’s trained in three broad stages:
- Pre-training: During this phase, the LLM is “fed” textual material from the internet, books, academic journals, and anything else that contains potentially relevant or useful text.
- Supervised instruction fine-tuning: Models are trained to respond more coherently to instructions using high-quality instruction-response pairs, typically sourced from humans.
- Reinforcement learning from human feedback (RLHF): LLMs like ChatGPT often undergo this additional training stage, during which human feedback on the model’s responses, including feedback gathered from real user interactions, is used to refine the model’s alignment with typical use cases.
All three stages of the training process involve data, whether massive stores of pre-gathered data (like those used in pre-training) or data gathered and processed almost in real time (like that used in RLHF). It’s that data that carries the lion’s share of the privacy risks stemming from generative AI.
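To make this concrete, below is a minimal, hypothetical sketch in Python. The example records and the naive keyword filter are invented for illustration and don’t reflect any specific vendor’s pipeline; the point is simply to show where personal data can slip into each of the three stages.

```python
# Hypothetical examples of the kinds of records that feed each training stage.
# None of this reflects any specific vendor's pipeline; it simply illustrates
# where personal data can enter.

# 1. Pre-training: raw text scraped from the web may embed personal details.
pretraining_doc = (
    "Jane Doe, 34, of Springfield, wrote on her blog that she is switching "
    "clinics after her diagnosis..."  # scraped verbatim, PII and all
)

# 2. Supervised instruction fine-tuning: instruction-response pairs.
sft_example = {
    "instruction": "Summarize the customer complaint below.",
    "response": "The customer, reachable at jane.doe@example.com, reports...",
}

# 3. RLHF: logged user conversations plus human preference labels.
rlhf_example = {
    "prompt": "Draft a resignation letter. My manager is Alex Smith at Acme Corp.",
    "candidate_responses": ["Dear Alex, ...", "To whom it may concern, ..."],
    "preferred": 0,  # chosen by a human reviewer who also sees the prompt
}

# A naive keyword filter like this is roughly what automated screening has to
# work with -- and it misses far more than it catches.
SENSITIVE_MARKERS = ("@", "diagnosis", "resignation")
for record in (pretraining_doc, sft_example["response"], rlhf_example["prompt"]):
    if any(marker in record for marker in SENSITIVE_MARKERS):
        print("Potential personal data found in:", record[:60], "...")
```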
What Are the Privacy Risks Posed by Generative AI?
Privacy is compromised when personal information concerning an individual (the data subject) is made available to other individuals or entities without the data subject’s consent. LLMs are pre-trained and fine-tuned on an extremely wide range of data that can and often does include personal data. This data is typically scraped from publicly available sources, but not always.
Even when that data is taken from publicly available sources, having it aggregated and processed by an LLM and then essentially made searchable through the LLM’s interface could be argued to be a further violation of privacy.
The reinforcement learning from human feedback (RLHF) stage complicates things. At this training stage, real interactions with human users are used to iteratively correct and refine the LLM’s responses. This means that a user’s interactions with an LLM can be viewed, shared, and disseminated by anyone with access to the training data.
In most cases, this isn’t a privacy violation, given that most LLM developers include privacy policies and terms of service that require users to consent before interacting with the LLM. The privacy risk here lies rather in the fact that many users are not aware that they’ve agreed to such data collection and use. Such users are likely to reveal private and sensitive information during their interactions with these systems, not realizing that these interactions are neither confidential nor private.
This brings us to the three main ways in which generative AI poses privacy risks:
- Large stores of pre-training data potentially containing personal information are vulnerable to compromise and exfiltration.
- Personal information included in pre-training data can be leaked to other users of the same LLM through its responses to queries and instructions.
- Personal and confidential information provided during interactions with LLMs ends up with the LLM developers’ employees and possibly third-party contractors, from where it can be viewed or leaked.
These are all risks to users’ privacy, but the chances of personally identifiable information (PII) ending up in the wrong hands still seem fairly low. That is, at least, until data brokers enter the picture. These companies specialize in sniffing out PII and collecting, aggregating, and disseminating if not outright broadcasting it.
With PII and other personal data having become something of a commodity and the data-broker industry springing up to profit from this, any personal data that gets “out there” is all too likely to be scooped up by data brokers and spread far and wide.
The Privacy Risks of Generative AI in Context
Before looking at the risks generative AI poses to users’ privacy in the context of specific products, services, and corporate partnerships, let’s step back and take a more structured look at the full palette of generative AI risks. Writing for the IAPP, Moraes and Previtali took a data-driven approach to refining Solove’s 2006 “A Taxonomy of Privacy”, reducing the 16 privacy risks described therein to 12 AI-specific privacy risks.
These are the 12 privacy risks included in Moraes and Previtali’s revised taxonomy:
- Surveillance: AI exacerbates surveillance risks by increasing the scale and ubiquity of personal data collection.
- Identification: AI technologies enable automated identity linking across various data sources, increasing risks related to personal identity exposure.
- Aggregation: AI combines various pieces of data about a person to make inferences, creating risks of privacy invasion.
- Phrenology and physiognomy: AI infers personality or social attributes from physical characteristics, a new risk category not in Solove’s taxonomy.
- Secondary use: AI exacerbates the use of personal data for purposes other than those originally intended by making data easy to repurpose.
- Exclusion: AI worsens the failure to inform users about, or give them control over, how their data is used through opaque data practices.
- Insecurity: AI’s data requirements and storage practices increase the risk of data leaks and improper access.
- Exposure: AI can reveal sensitive information, for example when generative models reproduce it in their outputs.
- Distortion: AI’s ability to generate realistic but fake content heightens the spread of false or misleading information.
- Disclosure: AI can cause improper sharing of data when it infers additional sensitive information from raw data.
- Increased accessibility: AI makes sensitive information more accessible to a wider audience than intended.
- Intrusion: AI technologies invade personal space or solitude, often through surveillance measures.
This makes for some fairly alarming reading. It’s important to note that this taxonomy, to its credit, takes into account generative AI’s tendency to hallucinate – to generate and confidently present factually inaccurate information. This phenomenon, even though it rarely reveals real information, is also a privacy risk. The dissemination of false and misleading information affects the subject’s privacy in ways that are more subtle than in the case of accurate information, but it affects it nonetheless.
Let’s drill down to some concrete examples of how these privacy risks come into play in the context of actual AI products.
Direct Interactions with Text-Based Generative AI Systems
The simplest case is the one that involves a user interacting directly with a text-based generative AI system, like ChatGPT or Gemini. The user’s interactions with many of these products are logged, stored, and used for RLHF, supervised instruction fine-tuning, and even the pre-training of other LLMs.
An analysis of the privacy policies of many services like these also reveals other data-sharing activities underpinned by very different purposes, like marketing and data brokerage. This is a whole other type of privacy risk posed by generative AI: these systems can be characterized as huge data funnels, collecting data provided by users as well as that which is generated through their interactions with the underlying LLM.
Interactions with Embedded Generative AI Systems
Some users might be interacting with generative AI interfaces that are embedded in whatever product they’re ostensibly using. The user may know that they’re using an “AI” feature, but they’re less likely to know what that entails in terms of data privacy risks. What comes to the fore with embedded systems is this lack of appreciation that personal data shared with the LLM could end up in the hands of developers and data brokers.
There are two degrees of lack of awareness here: some users realize they’re interacting with a generative AI product, while others believe they’re simply using whatever product the generative AI is built into or accessed through. In either case, the user may well have technically consented (and probably did) to the terms and conditions associated with their interactions with the embedded system.
Other Partnerships That Expose Users to Generative AI Systems
Some companies embed or otherwise include generative AI interfaces in their software in ways that are less obvious, leaving users interacting – and sharing information – with third parties without realizing it. Luckily, “AI” has become such an effective selling point that it’s unlikely that a company would keep such implementations secret.
Another phenomenon in this context is the growing backlash that such companies have experienced after trying to share user or customer data with generative AI companies such as OpenAI. The data removal company Optery, for example, recently reversed a decision to share user data with OpenAI on an opt-out basis, meaning that users were enrolled in the program by default.
Not only were customers quick to voice their disappointment, but the company’s data-removal service was promptly delisted from Privacy Guides’ list of recommended data-removal services. To Optery’s credit, it quickly and transparently reversed its decision, but it’s the general backlash that’s significant here: people are starting to appreciate the risks of sharing data with “AI” companies.
The Optery case makes for a good example here because its users are, in some sense, at the vanguard of the growing skepticism surrounding so-called AI implementations. The kinds of people who opt for a data-removal service are also, typically, those who will pay attention to changes in terms of service and privacy policies.
Evidence of a Burgeoning Backlash Against Generative AI Data Use
Privacy-conscious consumers haven’t been the only ones to raise concerns about generative AI systems and their associated data privacy risks. At the legislative level, the EU’s Artificial Intelligence Act categorizes risks according to their severity, with data privacy being the explicitly or implicitly stated criterion for ascribing severity in most cases. The Act also addresses the issues of informed consent we discussed earlier.
The US, notoriously slow to adopt comprehensive, federal data privacy legislation, has at least some guardrails in place thanks to Executive Order 14110. Again, data privacy concerns are at the forefront of the purposes given for the Order: “irresponsible use [of AI technologies] could exacerbate societal harms such as fraud, discrimination, bias, and disinformation” – all related to the availability and dissemination of personal data.
Returning to the consumer level, it’s not just particularly privacy-conscious consumers that have balked at privacy-invasive generative AI implementations. Microsoft‘s now-infamous “AI-powered” Recall feature, destined for its Windows 11 operating system, is a prime example. Once the extent of privacy and security risks was revealed, the backlash was enough to cause the tech giant to backpedal. Unfortunately, Microsoft seems not to have given up on the idea, but the initial public reaction is nonetheless heartening.
Staying with Microsoft, its Copilot program has been widely criticized for both data privacy and data security problems. As Copilot was trained on GitHub data (mostly source code), controversy also arose around Microsoft’s alleged violations of the licenses under which programmers and developers published that code. It’s in cases like this that the lines between data privacy and intellectual property rights begin to blur, granting the former a monetary value – something that’s not easily done.
Perhaps the greatest indication that AI is becoming a red flag in consumers’ eyes is the lukewarm, if not outright wary, public response to Apple’s initial AI launch, specifically with regard to its data-sharing agreements with OpenAI.
The Piecemeal Solutions
There are steps legislators, developers, and companies can take to ameliorate some of the risks posed by generative AI. These are specialized solutions to specific aspects of the overarching problem; no single one of them is expected to be enough, but all of them working together could make a real difference.
- Data minimization. Minimizing the amount of data collected and stored is a reasonable goal, but it’s directly opposed to generative AI developers’ desire for training data.
- Transparency. Insight into what data is processed, and how, when generating a given output is one way to ensure privacy in generative AI interactions. Given the current state of the art in ML, however, this may not even be technically feasible in many cases.
- Anonymization. Any PII that can’t be excluded from training data (through data minimization) should be anonymized. The problem is that many popular anonymization and pseudonymization techniques are easily defeated, as the sketch after this list illustrates.
- User consent. Requiring users to consent to the collection and sharing of their data is essential, but it’s too open to abuse and too prone to consumer complacency to be effective on its own. What’s needed here is informed consent, and since most consumers, properly informed, would not consent to such data sharing, the incentives are misaligned.
- Securing data in transit and at rest. Another foundation of both data privacy and data security, protecting data through cryptographic and other means can always be made more effective. However, generative AI systems tend to leak data through their interfaces, making this only part of the solution.
- Enforcing copyright and IP law in the context of so-called AI. ML can operate in a “black box,” making it difficult if not impossible to trace what copyrighted material and IP ends up in which generative AI output.
- Audits. Another crucial guardrail measure thwarted by the black-box nature of LLMs and the generative AI systems they support. Compounding this inherent limitation is the closed-source nature of most generative AI products, which limits audits to only those performed at the developer’s convenience.
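To illustrate the anonymization point above, here is a short, hypothetical Python sketch showing why naive pseudonymization is so easily defeated: hashing an email address looks irreversible, but anyone holding a list of candidate addresses can recover the original simply by hashing and comparing.

```python
import hashlib

def pseudonymize(email: str) -> str:
    """Replace an email address with a hex digest -- a common but weak technique."""
    return hashlib.sha256(email.lower().encode()).hexdigest()

# A training record is "anonymized" before release or sharing.
released_record = {"user": pseudonymize("jane.doe@example.com"), "prompt": "..."}

# An attacker (or data broker) with a list of candidate emails can hash each
# one and compare, re-identifying the supposedly anonymous record.
candidate_emails = ["john.smith@example.com", "jane.doe@example.com"]
for email in candidate_emails:
    if pseudonymize(email) == released_record["user"]:
        print("Re-identified:", email)
```

Keyed hashing, aggregation, or differential-privacy techniques raise the bar considerably, but they only help with data that was recognized as personal in the first place.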
All of these approaches to the problem are valid and necessary, but none is sufficient. They all require legislative support to come into meaningful effect, meaning that they’re doomed to be behind the times as this dynamic field continues to evolve.
The Clear Solution
The solution to the privacy risks posed by generative AI is neither revolutionary nor exciting, but taken to its logical conclusion, its results could be both. The clear solution involves everyday consumers becoming aware of the value of their data to companies and the pricelessness of data privacy to themselves.
Consumers are the sources and engines behind the private information that powers what’s called the modern surveillance economy. Once a critical mass of consumers starts to stem the flow of private data into the public sphere and starts demanding accountability from the companies that deal in personal data, the system will have to self-correct.
The encouraging thing about generative AI is that, unlike current advertising and marketing models, it need not involve personal information at any stage. Pre-training and fine-tuning data need not include PII or other personal data, and users need not expose such data during their interactions with generative AI systems.
To remove their personal information from training data, people can go right to the source and remove their profiles from the various data brokers (including people search sites) that aggregate public records, bringing them into circulation on the open market. Personal data removal services automate the process, making it quick and easy. Of course, removing personal data from these companies’ databases has many other benefits and no downsides.
People also generate personal data when interacting with software, including generative AI. To stem the flow of this data, users will have to be more mindful that their interactions are being recorded, reviewed, analyzed, and shared. Their options for avoiding this boil down to restricting what they reveal to online systems and using on-device, open-source LLMs wherever possible. People, on the whole, already do a good job of modulating what they discuss in public – we just need to extend these instincts into the realm of generative AI.
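For those who want to take the on-device route, here is a minimal sketch, assuming a local Ollama installation serving its default REST endpoint and a model that has already been pulled; the specifics will vary by tool, but the point is that the prompt never leaves the machine.

```python
import json
import urllib.request

# Assumes Ollama is running locally and a model (e.g. "llama3") has been pulled.
# The prompt below stays on the device: no cloud provider, no RLHF pipeline,
# no third-party reviewers.
payload = json.dumps({
    "model": "llama3",
    "prompt": "Draft a polite reply to my landlord about the broken heater.",
    "stream": False,
}).encode()

request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])
```

Locally run open-source models won’t match the largest hosted systems on every task, but for many everyday uses they’re more than good enough, and they keep personal data exactly where it belongs.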