This data was originally featured in the May 1st, 2024 newsletter found here: INBOX INSIGHTS, MAY 1, 2024: AI ETHICS, MODEL TUNING
In this week’s Data Diaries, let’s discuss model tuning. Many AI services from the big tech companies, such as Google’s AI Studio, OpenAI’s Platform, Anthropic’s Console, and IBM WatsonX Studio, offer the ability to create tuned models. But what does that mean, and why would you do it?
Large language models work based on the prompts we give them. In general, the more specific, relevant text we provide in a prompt, the more likely we are to get a satisfactory output for common tasks. The key phrase there is common tasks – the major use cases like summarization, extraction, classification, rewriting, question answering, and generation all have thousands or millions of examples around the web that models have trained on.
Sometimes, however, you want a model to perform a very specific task in a very specific way – and because language is naturally ambiguous, language models may not always do things the same way even when instructed to, much in the same way a toddler may not do things the same way even with firm instructions.
Generally speaking, you get better performance out of models by providing a few examples. You might have a specific style of summarization, so in your prompt, you’d specify a few examples of the right and wrong way a model should summarize your input text.
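To make that concrete, here’s a minimal sketch of what a few-shot summarization prompt might look like in Python. The example texts, example summaries, and the build_prompt() helper are purely illustrative placeholders, not any particular vendor’s API.

```python
# A minimal sketch of a few-shot prompt for a specific summarization style.
# The example texts, summaries, and helper below are hypothetical placeholders,
# not any specific vendor's API.

FEW_SHOT_PROMPT = """Summarize the text in one sentence, in plain language.

Text: The quarterly report showed revenue growth of 12%, driven primarily by
strong performance in the enterprise segment and two new product launches.
Summary: Revenue grew 12% last quarter because of enterprise sales and new products.

Text: Our survey of 500 customers found that response time was the single
biggest driver of satisfaction scores across every segment.
Summary: A 500-customer survey found response time matters most for satisfaction.

Text: {input_text}
Summary:"""


def build_prompt(input_text: str) -> str:
    """Drop the new text into the few-shot template before sending it to a model."""
    return FEW_SHOT_PROMPT.format(input_text=input_text)


if __name__ == "__main__":
    print(build_prompt("The new onboarding flow cut support tickets in half within a month."))
```

The point of the examples isn’t just the summaries themselves; they show the model the length, tone, and structure you expect, which it then imitates for the new input.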
But sometimes, you need a model to conform exactly to a format, and even a few examples may not be enough to guarantee that output. That’s when you switch from prompting to model tuning. How it works is relatively straightforward: you provide a LOT of specific examples of the way you want a model to do a task, and then with the help of AI infrastructure (like that provided by the big AI tech companies), you essentially change how the model works by teaching it those examples.
For example, suppose you were building a system to do something like sentiment analysis. If you’ve ever done sentiment analysis with a large language model, you can tell it to provide only a numerical score and most of the time it will – but some of the time it wants to wax rhapsodic about your input text. That’s fine if you’re using a language model in a consumer interface like ChatGPT. That’s not fine if you’ve incorporated the language model into other software, like your CRM.
In that case, you’d want to build at least a thousand examples of exactly how you want the model to respond, in key-value pairs that look like this toy example:
- Input: Score the sentiment of this text: “I really hate when my food is delivered cold.”
- Output: -5
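In practice, tuning services usually want this data as structured records rather than a literal spreadsheet, often as JSONL (one JSON record per line). Here’s a rough sketch in Python of what a few of those pairs might look like; the field names, example rows, and file name are placeholders, since each platform expects its own schema.

```python
import json

# A hypothetical handful of tuning examples for the sentiment-scoring task.
# A real tuning set would contain at least hundreds or thousands of rows,
# and each vendor expects its own field names; these are placeholders.
examples = [
    {"input": "Score the sentiment of this text: \"I really hate when my food is delivered cold.\"",
     "output": "-5"},
    {"input": "Score the sentiment of this text: \"The driver was early and everything was still hot!\"",
     "output": "5"},
    {"input": "Score the sentiment of this text: \"The order arrived. It was food.\"",
     "output": "0"},
]

# Write the examples as JSONL, one record per line, a format many
# fine-tuning services accept for training data.
with open("sentiment_tuning_data.jsonl", "w", encoding="utf-8") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")
```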
You’d have many, many specific examples of this in what’s essentially a spreadsheet, and you’d give that to the training software to tune the model so it becomes really, really good at this one task and delivers exactly the output you want.
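As one concrete illustration, here’s roughly what kicking off a tuning job can look like, assuming the OpenAI fine-tuning API and its Python SDK; Google, Anthropic, and IBM each have their own equivalents, and OpenAI’s chat fine-tuning in particular expects the records converted into its “messages” schema first. Treat this as a sketch, not a drop-in recipe.

```python
# A minimal sketch of starting a tuning job, assuming the OpenAI fine-tuning
# API and Python SDK; other platforms (Google AI Studio, Anthropic's Console,
# IBM WatsonX) have their own equivalents with different steps and formats.
# Note: OpenAI's chat fine-tuning expects "messages" records, so the key-value
# pairs above would first be converted to that schema.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the prepared training file.
training_file = client.files.create(
    file=open("sentiment_tuning_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the tuning job against a base model that supports fine-tuning.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # example base model; check current availability
)

print(f"Tuning job started: {job.id}")
```

Once the job finishes, the platform gives you a new model identifier you call instead of the base model, and that tuned model should reliably return just the numerical score.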
As you migrate and evolve from end-user, consumer use of generative AI to organizational and enterprise use cases, these predictable, reliable responses become more and more important. When integrated into other software, there’s no opportunity to go back and ask the model to do it again, so tuning the model for a specific use case is essential.
The key takeaway to remember is that tuning language models makes them very good at one specific task. If you have a mission-critical task you need the model to do right all the time, tuning the model is the way to go.