October 05, 2023

Finetuning GPT3.5 – Rick Sanchez

Authors:

Yue Liu
Christos Frantzolas
Bharat Sharma

Editors:

Dmitrijs Kass
Puya Sharif

Introduction

Large Language Models (LLMs) have taken the world by storm. Most prominent among them, ChatGPT has disrupted our daily and professional lives. From high-school essays to creative writing, coding, and journalism, AI writing assistants are advancing quickly, and we are better off embracing them. While prompt engineering often helps align the output of these models to specific tasks, it can fall short, particularly when the output must follow a certain style or tone. Furthermore, extensive prompt engineering adds tokens to every request, which increases both the cost and the latency of a model. This is where fine-tuning LLMs proves its worth. In this post, we share our experience of fine-tuning GPT-3.5 using OpenAI’s API. Our objective? To create a model that mimics the unique style of Rick Sanchez, a popular character from the animated series “Rick and Morty”, without requiring a system message to guide its behavior.

Finetuning GPT3.5

Prepare your dataset

The first, and often most complex, step is to create a fine-tuning dataset. Despite the availability of public Rick and Morty scripts, this task required significant effort. We used the ysharma/rickandmorty dataset, which contains a subset of episodes from seasons 1, 2, and 3. Each line in the dataset represents a single utterance from the show, labeled with the corresponding season, episode, and character name.

To understand the intricacy of constructing a dataset for fine-tuning a conversational bot, we need to consider the required format. Here’s a single training example:

{
    "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."}
    ]
}

Each example starts with a system message, followed by user and assistant messages. In our case, the system message was consistent across all examples:

system_message = "You are a conversational AI assistant that impersonates Rick Sanchez from the animated series Rick and Morty. Your task is to craft the next line in a dialogue based on the chat history provided. This history includes messages exchanged between you and the user. Each message from the user begins with the name of a character from the show who supposedly said it. You must use this information to formulate a response in Rick's signature style."

We prefixed each non-Rick utterance with the corresponding character’s name, since Rick may respond differently depending on who he is replying to.

The simplest option is to put one user message and one assistant message in each training example. However, a single user message often doesn’t provide enough context. For example, Rick’s response to the question “What’s wrong?” can vary significantly depending on the preceding dialogue. Here are two examples:

{
    "messages": [
        {"role": "system", "content": system_message},
        {"role": "user", "content": "Morty: What's wrong?"},
        {"role": "assistant", "content": "The portal gun is not working!"},
    ]
}
# and
{
    "messages": [
        {"role": "system", "content": system_message},
        {"role": "user", "content": "Morty: What's wrong?"},
        {"role": "assistant", "content": "I am a cucumber!"},
    ]
}

By fine-tuning the model on such examples, we train it to say something about portal guns and cucumbers even if they never appeared in the conversation. Alright, what about adding a larger number of preceding lines from the script? Good thinking! However, deciding on the number of preceding lines to include is not straightforward, and using a fixed number of previous utterances can lead to problems with incoherence. Imagine a training example that contains three utterances from one scene and four more from an unrelated following scene on the show. Such incoherent examples may degrade the model if they become too prevalent.

Typical datasets for fine-tuning conversational bots, like OpenAssistant and Reddit comment chains, don’t pose such an issue since each data point contains just one coherent dialogue. Based on considerations set forth in this section, we manually cherry-picked 52 coherent, non-overlapping dialogues. According to OpenAI, 50 to 100 training examples generally suffice for noticeable improvements from fine-tuning.
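To make the format concrete, here is a minimal sketch of how one hand-picked dialogue could be turned into a single multi-turn training example. The helper name `dialogue_to_example`, the `(speaker, line)` tuple layout, and the choice to merge consecutive non-Rick lines into one user message are assumptions made for illustration; this is not the exact code used for this post.

```python
import json

# Shortened for the sketch; in practice use the full system message shown earlier.
SYSTEM_MESSAGE = "You are a conversational AI assistant that impersonates Rick Sanchez."


def dialogue_to_example(dialogue, system_message=SYSTEM_MESSAGE):
    """Convert one coherent dialogue into a chat-format training example.

    `dialogue` is a list of (speaker, line) tuples in script order.
    Rick's lines become assistant messages; every other line becomes a
    user message prefixed with the speaker's name.
    """
    messages = [{"role": "system", "content": system_message}]
    for speaker, line in dialogue:
        if speaker == "Rick":
            role, content = "assistant", line
        else:
            role, content = "user", f"{speaker}: {line}"
        # Merge consecutive turns with the same role so that roles alternate.
        if messages[-1]["role"] == role:
            messages[-1]["content"] += "\n" + content
        else:
            messages.append({"role": role, "content": content})
    # Drop trailing user turns: the example should end with Rick's reply.
    while messages and messages[-1]["role"] != "assistant":
        messages.pop()
    if len(messages) < 3:
        raise ValueError("dialogue needs at least one user turn and one Rick reply")
    return {"messages": messages}


# The first line is made up for illustration; the other two come from the example above.
dialogue = [
    ("Summer", "Grandpa looks upset."),
    ("Morty", "What's wrong?"),
    ("Rick", "The portal gun is not working!"),
]
example = dialogue_to_example(dialogue)
print(json.dumps(example, indent=2))
```

Because the whole dialogue ends up in one example, Rick's reply is trained together with the context that motivates it, which is the property the single-turn examples above lack.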

Fine-tuning in 3 steps

Step 1 – Upload training and validation splits

With the dataset ready, we divided it into training and validation sets. The validation set lets us monitor the fine-tuning process and detect overfitting. Using OpenAI’s Files API, we uploaded both sets, setting the stage for the next steps.

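import os

import openai  # these snippets use the pre-1.0 openai SDK (for example 0.28)


openai.api_key = os.getenv("OPENAI_API_KEY")

# Placeholder paths to the JSONL splits; replace them with your own files.
file_name_train = "train.jsonl"
file_name_valid = "valid.jsonl"


# Upload the training split.
with open(file_name_train, "rb") as f:
    file_upload_train = openai.File.create(file=f, purpose="fine-tune")


# Upload the validation split under an explicit display name.
with open(file_name_valid, "rb") as f:
    file_upload_validation = openai.File.create(
        file=f,
        purpose="fine-tune",
        user_provided_filename="validation-file",
    )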
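The upload above expects `file_name_train` and `file_name_valid` to point to JSONL files with one training example per line. The post does not show how the split was made, so the following is only a sketch: the helper names, the 20% validation fraction, the seed, and the placeholder examples are all assumptions.

```python
import json
import os
import random
import tempfile


def split_examples(examples, valid_fraction=0.2, seed=42):
    """Shuffle with a fixed seed and return (train, valid) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, round(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def write_jsonl(examples, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


# Placeholder examples standing in for the 52 hand-picked dialogues.
examples = [
    {"messages": [
        {"role": "user", "content": f"Morty: placeholder question {i}"},
        {"role": "assistant", "content": f"placeholder reply {i}"},
    ]}
    for i in range(52)
]

train, valid = split_examples(examples)
out_dir = tempfile.mkdtemp()
file_name_train = os.path.join(out_dir, "train.jsonl")
file_name_valid = os.path.join(out_dir, "valid.jsonl")
write_jsonl(train, file_name_train)
write_jsonl(valid, file_name_valid)
print(len(train), len(valid))  # 42 10
```

Splitting at the level of whole dialogues, rather than individual utterances, keeps lines from the same scene out of both sets at once.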

Step 2 – Create a fine-tuned model

Next, we initiated a fine-tuning job by specifying the model, datasets, and hyperparameters:

finetuning_job = openai.FineTuningJob.create(
    training_file=file_upload_train["id"],
    model="gpt-3.5-turbo",
    validation_file=file_upload_validation["id"],
    hyperparameters={"n_epochs": 3},
)

Step 3 – Retrieve results

You can access the status of a fine-tuning job at any point:

# Reuse the ID of the job created in Step 2.
job_id = finetuning_job["id"]
job = openai.FineTuningJob.retrieve(job_id)
print(job["status"])

You can also obtain the job ID from a list of previous fine-tuning jobs:

# List 10 fine-tuning jobs
jobs = openai.FineTuningJob.list(limit=10)
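Since a fine-tuning job runs asynchronously, it can be convenient to wait for it in a loop. The sketch below is one way to do that under stated assumptions: `wait_for_job` is a hypothetical helper, and the status check is passed in as a function so that the example runs without network access. The statuses in the demo are simulated.

```python
import time

# Statuses after which a fine-tuning job no longer changes.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_status, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call `fetch_status()` repeatedly until it returns a terminal status.

    Raises TimeoutError if the job is still not finished after `max_polls` checks.
    """
    status = fetch_status()
    polls = 1
    while status not in TERMINAL_STATUSES:
        if polls >= max_polls:
            raise TimeoutError(f"job still '{status}' after {polls} polls")
        sleep(poll_seconds)
        status = fetch_status()
        polls += 1
    return status


# Simulated sequence of statuses, used here instead of real API calls.
simulated = iter(["validating_files", "queued", "running", "running", "succeeded"])
final_status = wait_for_job(lambda: next(simulated), poll_seconds=0)
print(final_status)  # succeeded

# With the pre-1.0 openai SDK used in this post, the call could look like:
#   wait_for_job(lambda: openai.FineTuningJob.retrieve(job_id)["status"])
```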

Once fine-tuning had finished, we collected the job events for the given fine-tuning job ID. These events contain the training loss and mean token accuracy.

import pandas as pd


job_events = openai.FineTuningJob.list_events(id=job_id, limit=60)["data"]


# Transform a list of OpenAIObjects to a dataframe.
train_df = pd.DataFrame(job_events)
train_df = train_df[train_df["type"] == "metrics"].reset_index(drop=True)
train_df["step"] = train_df["data"].apply(lambda x: x["step"])
train_df["train_loss"] = train_df["data"].apply(lambda x: x["train_loss"])
train_df["train_mean_token_accuracy"] = train_df["data"].apply(
    lambda x: x["train_mean_token_accuracy"]
)
train_df[["step", "train_loss", "train_mean_token_accuracy"]]

Interestingly, validation metrics aren’t available in the list of events. As such, we acquired them in the following way:

import requests
from io import StringIO


result_file = openai.FineTuningJob.retrieve(job_id)["result_files"][0]
url = f"https://api.openai.com/v1/files/{result_file}/content"
headers = {"Authorization": f"Bearer {openai.api_key}"}
result_response = requests.get(url, headers=headers)


data_str = result_response.content.decode("utf-8")
val_df = pd.read_csv(StringIO(data_str))


# Use only validation metrics since training accuracy is unreliable.
val_df = val_df[["step", "valid_loss", "valid_mean_token_accuracy"]]

The file we retrieved here also contains training loss and training accuracy. However, only the training loss aligns with the data obtained from the list of events – the training accuracy is always reported as zero. Below, we plot the training and validation loss.
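The same results file can also be read programmatically, for example to find the step at which the validation loss was lowest and to see whether it rises again afterwards. The sketch below uses only the standard library. The function name is hypothetical, the sample numbers are made up purely for illustration (they are not our results), and rows with an empty `valid_loss` are skipped in case validation is not reported at every step.

```python
import csv
from io import StringIO


def best_validation_step(results_csv_text):
    """Return (step, valid_loss) for the lowest validation loss, or None."""
    best = None
    for row in csv.DictReader(StringIO(results_csv_text)):
        raw = (row.get("valid_loss") or "").strip()
        if not raw:
            continue  # no validation metrics reported for this step
        step, loss = int(row["step"]), float(raw)
        if best is None or loss < best[1]:
            best = (step, loss)
    return best


# Made-up sample in the shape of the results file (illustrative values only).
sample_csv = """step,train_loss,valid_loss,valid_mean_token_accuracy
1,2.10,2.05,0.55
2,1.60,,
3,1.20,1.50,0.62
4,0.80,1.45,0.64
5,0.50,1.70,0.61
"""

print(best_validation_step(sample_csv))  # (4, 1.45)
```

In this made-up sample the training loss keeps falling while the validation loss rises after step 4, which is the pattern that would suggest overfitting and fewer epochs.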

Web UI

Running the fine-tuned model in a Python IDE is not the most user-friendly or accessible way to showcase your work. To rectify this, you can use Gradio’s web UI. Not only can you have your model operational within two minutes, but it also provides a publicly sharable link that remains active for 72 hours. This allows you to quickly create demos for colleagues and clients. Note that your machine needs to stay on, because the app runs on it.

Here’s ready-to-use code from Gradio. Just substitute “gpt-3.5-turbo” with your fine-tuned model’s name, which begins with “ft:gpt-3.5-turbo-0613:”. Here’s an example of a slightly adapted interface featuring a conversation between Morty and Rick:
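One adaptation is worth spelling out, independently of the interface itself: the model was trained on user messages that begin with a character name, and the goal was to run it without a system message. The following is a minimal sketch of a helper that prepares the messages accordingly. The function name, the default speaker, and the assumption that the chat history arrives as a list of `[user_text, assistant_text]` pairs are ours; the second user line in the demo is made up.

```python
def build_messages(message, history, speaker="Morty"):
    """Convert a chat history plus a new message into chat-format messages.

    `history` is assumed to be a list of [user_text, assistant_text] pairs.
    Every user turn is prefixed with the speaker's name, as in the training
    data, and no system message is added.
    """
    messages = []
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": f"{speaker}: {user_text}"})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": f"{speaker}: {message}"})
    return messages


history = [["What's wrong?", "The portal gun is not working!"]]
for m in build_messages("Can you fix it?", history):
    print(m["role"], "->", m["content"])

# Hypothetical wiring (requires `gradio` and the pre-1.0 `openai` package):
#   def predict(message, history):
#       response = openai.ChatCompletion.create(
#           model="ft:gpt-3.5-turbo-0613:...",  # your fine-tuned model's name
#           messages=build_messages(message, history),
#       )
#       return response["choices"][0]["message"]["content"]
#
#   gr.ChatInterface(predict).launch(share=True)
```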

Reflections

Here are the main takeaways from this exercise:

– It worked. The fine-tuned model successfully impersonated Rick Sanchez without a system message. Although it’s not perfect, the results are noteworthy, given we only used 52 training examples. However, caution is required when fine-tuning a model on a small dataset, especially if the model’s inherent knowledge already covers the fine-tuning domain extensively. In such cases, fine-tuning could potentially lead to poorer performance than simply prompt-engineering the original model.

– API-based fine-tuning is simpler than fine-tuning open-source models: there’s no need for infrastructure setup or QLoRA configuration [1]. However, this comes at the expense of control over the fine-tuning process. We could only adjust the number of epochs, with no control over the learning rate or early stopping, and model checkpointing is cumbersome. Considering this, API-based fine-tuning of GPT-3.5 is an excellent option for rapid prototyping. More resource-intensive fine-tuning of open-source models can then be explored in subsequent iterations.

– Transforming a show’s script into a fine-tuning dataset can be complex. Ideally, we need distinct dialogues separated into scenes, each accompanied by a context description at the start of the dialogue. Generating such metadata is labor-intensive and could potentially be outsourced to GPT-4.

– Evaluating fine-tuning with cross-entropy loss offers only a single quantitative aspect of the entire evaluation process. Another critical component is human evaluation, where experts compare and score outputs from different models. Depending on GPT-4’s specific domain knowledge, evaluation tasks could potentially be delegated to it. In practical terms, there’s often a trade-off between the degree of automation and the quality of qualitative evaluation.

Resources

[1] Dettmers, Tim, et al. “QLoRA: Efficient Finetuning of Quantized LLMs.” arXiv preprint arXiv:2305.14314 (2023).