OpenAI Faces Scrutiny Over Data Retention and User Control Challenges
Legal Mandates and Regulatory Oversight Reshape Data Retention
OpenAI faces significant challenges over user data retention, stemming from both legal mandates and regulatory investigations. On June 6, 2025, Reuters reported that OpenAI had appealed a U.S. court order compelling the company to preserve all user interactions with ChatGPT and its API. The order, issued in the ongoing copyright lawsuit filed by The New York Times, has effectively suspended OpenAI’s customary data deletion protocols for a broad spectrum of users.
According to OpenAI, the preservation order covers non-enterprise users: ChatGPT Free, Plus, Pro, and Team subscriptions, as well as standard API usage. The only exceptions are customers operating under Enterprise, Education, or Zero Data Retention (ZDR) agreements. OpenAI asserts that access to the retained data is restricted to legal and security personnel, but the implication stands: sensitive information sent to generative AI services may be retained indefinitely, even after a user tries to delete it. This changes the risk landscape for organizations, because proprietary data, personally identifiable information (PII), protected health information (PHI), and source code might persist outside their control.
Further underscoring these concerns, a joint investigation by privacy commissioners from Canada, Quebec, British Columbia, and Alberta into OpenAI OpCo, LLC, concluded on May 6, 2026. This probe examined OpenAI’s compliance with various privacy acts, focusing on issues such as the appropriate purpose for collecting, using, and disclosing personal information, valid consent, transparency regarding its models, accuracy of generated information, user access and deletion rights, and the adequacy of retention and disposal procedures.
Challenges in Controlling Sensitive Data within Development Workflows
Developers using OpenAI’s tools, particularly the Codex CLI, report difficulty keeping sensitive files away from AI models. A GitHub discussion, active as of October 22, 2025, highlighted an important nuance: direct file references using the @ prefix (e.g., @config/settings.py) respect .gitignore, so ignored files are kept out of the model context, but commands run across the workspace, such as rg or cat, can still read those ignored files locally, and their contents may then surface in Codex responses. This raises the question of how to guarantee that sensitive material, such as .env files, credentials, or local configurations, is never inadvertently transmitted to the model or to remote contexts.
A related GitHub issue, dating back to August 28, 2025, described the need for a feature to explicitly exclude sensitive files. This issue noted that prior discussions (issue #205) had identified two primary use cases: safeguarding sensitive data and excluding large or irrelevant files. Although a Rust-based implementation (codex-rs) was referenced, a comparable exclusion feature did not appear to exist by that date, prompting a call to restart the design discussion.
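To make the requested behavior concrete, the following is a minimal sketch of a deny-list check that a wrapper or pre-flight script could run before any file content is handed to a model. It is not a Codex feature and does not reflect how Codex or codex-rs work internally; the pattern list and the function names is_sensitive and filter_paths are hypothetical, chosen only for illustration.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical deny-list (an assumption for this sketch, not a Codex setting).
# Patterns are checked against both the file name and the workspace-relative path.
SENSITIVE_PATTERNS = [".env", ".env.*", "*.pem", "*.key", "secrets/*"]


def is_sensitive(path, patterns=SENSITIVE_PATTERNS):
    """Return True if the file name or the relative path matches a deny-list pattern."""
    posix = PurePosixPath(path)
    return any(
        fnmatchcase(posix.name, pattern) or fnmatchcase(posix.as_posix(), pattern)
        for pattern in patterns
    )


def filter_paths(paths, patterns=SENSITIVE_PATTERNS):
    """Split candidate paths into (allowed, blocked) lists, preserving input order."""
    allowed, blocked = [], []
    for path in paths:
        (blocked if is_sensitive(path, patterns) else allowed).append(path)
    return allowed, blocked


if __name__ == "__main__":
    candidates = ["src/app.py", ".env", "config/.env.local", "certs/server.pem"]
    allowed, blocked = filter_paths(candidates)
    print("allowed:", allowed)
    print("blocked:", blocked)
```

A check like this only covers files that pass through the wrapper. As the discussion above points out, a shell command executed inside the workspace bypasses it, so filesystem permissions or a sandbox would still be needed for a real guarantee.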
Beyond development tools, users of the OpenAI API for Retrieval-Augmented Generation (RAG) face similar challenges. One user reported that even with a system prompt explicitly instructing the model to "Use ONLY source named ‘PerksPlus.pdf’ to generate your answer" and with the temperature set to 0 to make responses as deterministic as possible, the model still drew on other available sources. It distinguished between sources and provided references correctly, but it did not follow the instruction to exclude them. A prompt instruction is a request rather than an access control: any source placed in the model's context can still influence the answer.
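One way to avoid depending on instruction-following is to filter retrieved passages by source before the prompt is assembled, so that excluded documents never reach the model at all. The sketch below assumes a pipeline in which each retrieved passage carries its source file name; the forum post does not describe the user's retrieval code, so the Chunk type, the helper names, and the second file name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # file name the passage was retrieved from
    text: str    # passage content


def restrict_to_sources(chunks, allowed_sources):
    """Keep only the chunks whose source is explicitly allowed."""
    allowed = set(allowed_sources)
    return [chunk for chunk in chunks if chunk.source in allowed]


def build_context(chunks):
    """Render chunks as the context block that would be placed in the prompt."""
    return "\n\n".join(f"[{chunk.source}] {chunk.text}" for chunk in chunks)


if __name__ == "__main__":
    retrieved = [
        Chunk("PerksPlus.pdf", "Gym memberships are reimbursed."),
        Chunk("OtherPolicy.pdf", "Plan B covers dental."),  # hypothetical second source
    ]
    kept = restrict_to_sources(retrieved, ["PerksPlus.pdf"])
    print(build_context(kept))
```

Because the filter runs before any API call, the restriction holds regardless of temperature or prompt wording.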
Unpredictability and Mitigation Strategies for AI Interaction
The inherent unpredictability of generative AI systems introduces further data security concerns, and ChatGPT's behavior in particular has been described as fundamentally unreliable. A critical bug reported on the OpenAI Developer Community forum showed how a combination of custom instructions and saved memory entries could silently disable core features such as file parsing and Advanced Data Analysis (ADA) for months, without any warning or error message. As a result, ChatGPT could not process common file types such as .xls, .xlsx, .csv, and .pdf for the affected user, while other users on the free tier experienced no such issues. The problem was resolved only after the user deleted the problematic memory items and rewrote the custom instructions to explicitly permit Python use, step-by-step output, and tool execution.
This incident underscores the challenge of managing GenAI systems, which are dynamic, externally connected, and difficult to monitor. Those properties make it harder to protect sensitive data such as PII, PHI, intellectual property, and source code from exposure or exfiltration. In response, some users and experts advocate alternative AI solutions that offer stronger data privacy protections. Recommended alternatives include Anthropic's Claude, which reportedly does not train on chat data; Google Gemini AI Studio when accessed via a paid API; Cohere for enterprise-grade embeddings and semantic search; and locally run models, such as Mistral models served through Ollama, for complete data control.
The ongoing legal and technical complexities surrounding data retention and user control necessitate a heightened focus on robust data governance and transparent AI system design.
This digest was compiled from:
- https://github.com/openai/codex/issues/2847
- https://github.com/openai/codex/discussions/5523
- https://www.youtube.com/watch?v=5PuofaVqXNI
- https://www.upwind.io/feed/mitigating-genai-data-exposure-in-light-of-openais-new-data-retention-policy
- https://community.openai.com/t/instruction-to-exclude-certain-information-when-generating-answer/470451
