Data Management Implications for Generative AI
We will remember 2023 as the year the AI age went mainstream, catapulted by the technology that everyone’s talking about: ChatGPT.
Generative AI language models like ChatGPT have captured our imagination because, for the first time, we are able to see AI holding a conversation with us like an actual person, and generating essays, poetry and other new content that we consider creative. Generative AI solutions seem full of groundbreaking potential for faster and better innovation, productivity and time-to-value. However, their limitations are not yet widely understood, nor are the data privacy and data management best practices that should accompany them.
Recently, many in the tech and security community have sounded warning bells over the lack of understanding and of sufficient regulatory guardrails around the use of AI technology. We are already seeing concerns around the reliability of outputs from AI tools, leaks of IP and sensitive data, and privacy and security violations.
Samsung’s incident with ChatGPT made headlines after the tech giant unwittingly leaked its own secrets into the AI service. Samsung is not alone: A study by Cyberhaven found that 4% of employees have put sensitive corporate data into the large language model. Many are unaware that when they train a model with their corporate data, the AI company may be able to reuse that data elsewhere.
And as if cyber criminals needed more fodder, there’s this revelation from Recorded Future, a cybersecurity intelligence firm: “Within days of the ChatGPT launch, we identified many threat actors on dark web and special-access forums sharing buggy but functional malware, social engineering tutorials, money-making schemes, and more — all enabled by the use of ChatGPT.”
On the privacy front, when an individual signs up with a tool like ChatGPT, the tool can access that person’s IP address, browser settings and browsing activity, just like today’s search engines. But the risk is higher, because “without an individual’s consent, it could disclose political beliefs or sexual orientation and could mean embarrassing or even career-ruining information is released,” according to Jose Blaya, the Director of Engineering at Private Internet Access.
Clearly, we need better regulations and standards for implementing these new AI technologies. But the discussion so far has overlooked data governance and data management, which can play a pivotal role in enterprise adoption and safe usage of AI.
It’s All About the Data
Here are three areas we should focus on:
- Data governance and transparency with training data: A core issue revolves around proprietary pretrained AI models, or large language models (LLMs). Machine learning programs using LLMs incorporate massive data sets from many sources. The trouble is, an LLM is a black box that provides little if any transparency on the source data. We don’t know if the sources are credible, unbiased and accurate, or whether they are illegal because they contain PII or fraudulent data. OpenAI, for one, doesn’t share its source data. The Washington Post analyzed Google’s C4 data set, spanning 15 million websites, and discovered dozens of unsavory websites sporting inflammatory and PII data among other questionable content. We need data governance that requires transparency in the data sources that are used and in the validity and credibility of the knowledge from those sources. For instance, your AI bot might be training on data from unverified sources or fake news sites, and that biased knowledge could end up in a new policy or R&D plan at your company.
- Data segregation and data domains: Currently, different AI vendors have different policies on how they handle the privacy of data you provide. Unwittingly, your employees may be feeding data in their prompts to an LLM, not knowing that the model may incorporate your data into its knowledge base. Companies may unwittingly expose trade secrets, software code and personal data to the world. Some AI solutions provide workarounds such as APIs that protect data privacy by keeping your data out of the pre-trained model, but this limits their value since the ideal use case is to augment a pre-trained model with your situation-specific data while keeping your data private. One solution is to make pre-trained AI tools understand the concept of “domains” of data. The “general” domain of training data is used for pre-training, and it is shared across entities, while “proprietary data”-based training model augmentation is securely confined to the boundaries of your organization. Data management can ensure these boundaries are created and preserved.
- The derivative works of AI: A third area of data management relates to the data generated by the AI process and its ultimate owner. Let’s say I use an AI bot to solve a coding issue. If something was not done correctly, resulting in bugs or errors, normally I would know who did what so that I could investigate and fix it. But with AI, my organization is liable for any errors or bad outcomes that result from the task I ask the AI to perform, even though we don’t have transparency into the process or source data. You can’t blame the machine: somewhere along the line a human caused the error or bad outcome. What about IP? Do you own the IP of a work created with a generative AI tool? How would you defend that in court? Claims are already being litigated in the art world, according to Harvard Business Review.
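To make the “domains” idea from the second point concrete, here is a minimal Python sketch. It assumes each record already carries a domain label; the names Record, GENERAL, PROPRIETARY and split_by_domain are hypothetical and do not come from any vendor’s API. The only design choice it illustrates is failing closed: anything not explicitly labeled “general” stays inside the organization.

```python
from dataclasses import dataclass

# Hypothetical domain labels, mirroring the "general" vs. "proprietary" split.
GENERAL = "general"
PROPRIETARY = "proprietary"


@dataclass(frozen=True)
class Record:
    text: str
    domain: str  # expected to be GENERAL or PROPRIETARY


def split_by_domain(records):
    """Return (shareable, internal_only).

    Only records explicitly labeled GENERAL are shareable; any other
    label, including an unknown one, is kept internal.
    """
    shareable, internal_only = [], []
    for record in records:
        if record.domain == GENERAL:
            shareable.append(record)
        else:
            internal_only.append(record)
    return shareable, internal_only


records = [
    Record("Public product FAQ", GENERAL),
    Record("Unreleased chip design notes", PROPRIETARY),
    Record("Unlabeled memo", "unknown"),
]
shareable, internal_only = split_by_domain(records)
print([r.text for r in shareable])
print([r.text for r in internal_only])
```

In practice the labels would come from a data management or classification system rather than being set by hand, and the boundary would be enforced where data leaves the organization.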
Data Management Tactics to Consider Now
In these early days, we don’t know what we don’t know about AI regarding the risks from bad data, privacy and security, intellectual property and other sensitive data sets. AI is also a broad domain with multiple approaches, such as LLMs and logic-based automation. These are just some of the topics to explore through a combination of data governance policies and data management practices:
- Pause experimentation with generative AI until you have a governance framework overseeing strategy, policies and procedures to mitigate risk and validate the outcomes.
- Incorporate data management guidelines: it starts by having a solid understanding of your own data wherever it resides. Where’s your sensitive PII and customer data? How much IP data do you have and where do those files live? Can you monitor usage to make sure these data types aren’t inadvertently fed into an AI tool and to prevent a security or privacy breach?
- Don’t give the AI application more data than is needed and don’t share any sensitive, proprietary data. Lock down/encrypt IP and customer data to prevent it from being shared.
- Find out how and if the AI tool can be transparent with data sources.
- Can the vendor protect your data? Google shares this statement in its blog, but the “how” is not clear: “Whether a company is training a model in Vertex AI or building a customer service experience on Generative AI App Builder, private data is kept private, and not used in the broader foundation model training corpus.” Read each AI tool’s contractual language to know whether any data you feed it can be kept private.
- Tag data from derivative works by the owner, or the individual or department that commissioned the project. This is helpful because you may be ultimately liable for any work produced at your company and you want to know how AI was incorporated into the process and by whom.
- Ensure portability of data between domains. For instance, a team may want to strip data of its IP and identifying characteristics and feed it into a general training data set for future use. Automation and tracking of this process are vital.
- Stay informed of any industry regulations and guidelines in the works and talk to peers at other organizations to see how they are approaching risk mitigation and data management.
- Before you begin any generative AI project, consult a legal expert to understand risks and processes to follow in the event of data leakage, privacy and IP violations, malicious actors or false/erroneous results.
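Two of the tactics above, keeping sensitive data out of prompts and tagging derivative works by owner, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production control: the two regular expressions cover only email addresses and US Social Security number formats, and the function names and metadata fields (redact, tag_output, owner, tool) are hypothetical.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(prompt):
    """Replace matches with placeholders and report which data types were found."""
    found = []
    for label, pattern in PATTERNS.items():
        prompt, count = pattern.subn(f"[{label}]", prompt)
        if count:
            found.append(label)
    return prompt, found


def tag_output(content, owner, tool):
    """Attach ownership metadata to AI-generated work (hypothetical fields)."""
    return {"content": content, "owner": owner, "tool": tool, "ai_generated": True}


clean, found = redact("Email jane.doe@example.com about SSN 123-45-6789.")
print(clean)
print(found)
print(tag_output("draft summary", owner="legal-team", tool="example-llm"))
```

The list returned by redact could feed the usage monitoring mentioned earlier, and the tag dictionary could be stored alongside the generated file so that the commissioning person or department is always traceable.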
A Pragmatic Approach to AI in the Enterprise
AI is advancing rapidly and holds tremendous promise with the potential to accelerate innovation, cut costs and improve user experience at a pace we have never seen before. But like most powerful tools, AI needs to be used carefully in the right context with the appropriate data governance and data management guardrails in place. Clear standards have not yet emerged for data management with AI, and this is an area that requires further exploration. Meanwhile, enterprises should tread carefully and ensure they clearly understand the data exposure, data leakage and potential data security risks before using AI applications.
About the author: Krishna Subramanian is President and COO of Komprise, a provider of unstructured data management solutions.