Coming to Grips with Prompt Lock-In
Nothing stays still for long in the world of GenAI. The system prompts you’ve written for GPT-4 or Llama-3 8B give you one answer today, but may tell you something different tomorrow. This is the danger of a condition known as prompt lock-in.
System prompts set the conditions and the context for the answer the user expects from the large language model (LLM). Combined with other techniques, such as fine-tuning and retrieval-augmented generation (RAG), system prompts are a critical tool for getting the most from an LLM.
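In chat-style APIs, the system prompt typically rides along as the first message of every request, preconditioning the model before the user's input arrives. A minimal sketch (the model name and prompt text are illustrative):

```python
# Build an OpenAI-style chat request where a system prompt sets the
# conditions and context before the user's question is seen.
def build_request(user_input: str) -> dict:
    return {
        "model": "gpt-4",  # illustrative model name
        "messages": [
            {
                "role": "system",
                "content": "You are a support agent. "
                           "Answer only from the provided documentation.",
            },
            {"role": "user", "content": user_input},
        ],
    }
```

Every technique layered on top, from RAG to fine-tuning, still depends on how the model interprets that first message.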
However, system prompts don’t work like normal computer programs, says Luis Ceze, computer science professor and CEO of OctoAI, a cloud-based AI platform that enables customers to run a variety of LLMs.
“In a typical program, you write the steps, you execute those steps, and there’s a pretty high confidence that you know what these steps do,” Ceze tells Datanami in a recent interview. “You can validate, you can test them. There’s a long history in developing software that way.”
“What a prompt is, fundamentally,” he continues, “is a way of getting a large language model to find the corner in this super complex, high-dimensional latent space of what is the stuff that you’re talking about that you want the model to do to continue completing the sentence.”
It’s truly amazing that we’re able to get so much out of LLMs using such a technique, Ceze adds. It can be used to summarize text, to have an LLM generate new text based on input, or even to exhibit some form of reasoning by generating the steps needed to complete a task.
But there’s a catch, Ceze says.
“All of that is highly dependent on the model itself,” he says. “If you write a set of prompts that work really well for a given model, and you go and replace that model with a different model because, as we said, there’s a new model every other week, it could be that these prompts won’t work as well anymore. Then you have to go and adjust the prompt.”
If you fail to adjust those prompts when the model changes, you can succumb to prompt lock-in. The prompts themselves stay the same, but the new model interprets them differently, potentially giving you worse outcomes even though nothing on your end changed. That’s a big departure from previous software development patterns, and one that today’s AI application and system designers must adjust to.
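One defensive pattern is to tie each system prompt to the model it was validated against, so a model swap fails loudly instead of silently degrading. A hypothetical sketch (model names and prompt text are illustrative):

```python
# Register system prompts per model, so swapping in a new model forces an
# explicit review instead of silently reusing a prompt tuned for another LLM.
SYSTEM_PROMPTS = {
    "gpt-4": "You are a concise assistant. Answer in at most three sentences.",
    "llama-3-8b": "Answer briefly. Do not go beyond the given context.",
}

def system_prompt_for(model: str) -> str:
    """Return the prompt validated for this model, or fail fast."""
    try:
        return SYSTEM_PROMPTS[model]
    except KeyError:
        raise ValueError(
            f"No system prompt validated for {model!r}; "
            "re-test your prompts before switching models."
        )
```

The point isn’t the lookup itself but the failure mode: a missing entry surfaces at deploy time, not as subtly worse answers in production.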
“I feel like it’s definitely a change in how we build software,” Ceze says. “The way we used to build software is we had modules that you could easily compose. And there’s an expectation of composability, that combining module A and module B, you have some expectation of what the model would do, what the combined modules will do in terms of behavior of the software.
“But with the way building LLMs work, where you set these system prompts to pre-condition the model that are subject to change as the models evolve, and given how fast they’re evolving, you have to continuously update them,” he continues. “It’s an interesting observation of what’s happening building with LLMs.”
Even if the LLM update delivers more parameters, a bigger context window, and overall better capabilities, the GenAI application may end up performing worse than it did before, unless that prompt is updated, Ceze says.
“It’s a new challenge to deal with,” he says. “The model might be better, but you might get worse results because you didn’t adjust the system prompts, the things that tell the models what they should do as a baseline behavior.”
The current GenAI trend has developers using a multitude of LLMs to handle various tasks, a model “cocktail,” as it were, as opposed to using one LLM for everything, which would lead to cost and performance issues. This lets developers take advantage of models that do certain things very well, like the broad language understanding of GPT-4, while using smaller LLMs that may be less expensive but still provide good performance at other tasks.
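A model cocktail often amounts to a simple router that sends each task type to the cheapest model known to handle it well. A hypothetical sketch (the task names and model assignments are illustrative, not a recommendation):

```python
# Route each task type to a model suited for it, rather than sending
# everything to the largest, most expensive LLM.
ROUTES = {
    "summarize": "llama-3-8b",       # small and cheap, good enough here
    "classify": "llama-3-8b",
    "complex_reasoning": "gpt-4",    # broad language understanding
}

def pick_model(task: str, default: str = "gpt-4") -> str:
    """Choose a model per task; fall back to the default for unknown tasks."""
    return ROUTES.get(task, default)
```

Note that every entry in a table like this carries its own model-specific system prompts, which is exactly why the maintenance burden grows with the number of models, as the next paragraph describes.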
As the number of LLMs in a GenAI application goes up, the number of system prompts that a developer must keep up-to-date also goes up, which adds to the cost. These are considerations that AI developers must keep in mind as they bring the various components together to create an application.
“You’re optimizing performance and accuracy, you’re optimizing speed, and then cost,” Ceze says. “These three things. And then of course system complexity because the more complex it is, the harder it is to keep going.”
Related Items:
Taking GenAI from Good to Great: Retrieval-Augmented Generation and Real-Time Data
Birds Aren’t Real. And Neither Is MLOps