July 16, 2024

Enhancing retrieval systems with Domain Adaptation

Authors:

  • Lycke Fureby
  • Filippa Hansen

Editors:

  • Svante Sörberg
  • Puya Sharif

Given a large, unlabeled, domain-specific set of documents, what is the most effective set of techniques for achieving good retrieval performance?

Introduction

Retrieval Augmented Generation (RAG) is currently being used across various industries and enterprises as it allows for up-to-date, traceable, and fact-based Large Language Model (LLM) generated answers, which is crucial in professional settings. The retrieval component in such RAG systems is often based on open-source or proprietary pre-trained embedding models, commonly one of the models provided by OpenAI. These embedding models have been trained on large amounts of data: Wikipedia pages, Reddit threads, Stack Overflow posts, and PubMed articles, to name a few examples. However, when such models are employed in retrieval systems intended for very domain-specific text, often drastically different from the data used in pre-training, a steep decrease in retrieval performance is observed.

Figure 1: Visual representation of how a dense retrieval system works. Central for the system is an embedding model, which can be fine-tuned in order to improve performance across domain-specific fields.
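To make the mechanics of Figure 1 concrete, here is a minimal sketch of a dense retrieval loop using the sentence-transformers library. The model name and documents are illustrative placeholders, not the ones used in this project.

```python
# Minimal dense retrieval sketch: embed a corpus, embed a query,
# rank documents by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained embedding model

documents = [
    "Remove the front seat assembly before replacing the seat belt pretensioner.",
    "The front impact sensors are mounted behind the front bumper fascia.",
]

# Embed the corpus once, offline.
doc_embeddings = model.encode(documents, convert_to_tensor=True, normalize_embeddings=True)

# At query time, embed the query and retrieve the most similar documents.
query = "Where are the front impact sensors located?"
query_embedding = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]
for hit in hits:
    print(documents[hit["corpus_id"]], hit["score"])
```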

Examples of domain-specific data include technical documentation, medical records, and financial documents. To concretise the importance of domain adaptation in such settings, take the word “trauma” as an example. Imagine that you are working with a medical dataset, using a pre-trained embedding model not specifically trained on similar data. This embedding model might interpret “trauma” as the mental condition caused by shock, stress, or fear, while in a medical context the word has a different meaning, referring to a physical injury. Another example is the word “suspension”, which in the automotive domain refers to a part of a vehicle and has nothing to do with suspension from, say, school.

One approach to improving retrieval performance across domain-specific fields is to fine-tune the embedding model used in the system on data from the same domain. However, this is often easier said than done. Successful fine-tuning requires high-quality labeled training data: in this context, each document should be labeled with a question that can be answered with the content in the document. Naturally, this is not the case for most domain-specific datasets. Performing manual labeling is extremely time-consuming and expensive when working with large amounts of data. Instead, a more feasible approach is to generate synthetic questions from the documents by utilizing a generative LLM.

Project core

Based on the above background and motivations, the primary project aim was to explore and compare various approaches to optimize domain adaptation of a dense retrieval system. We investigated whether synthetic training data is a viable option for fine-tuning embedding models in such fields, and whether a retrieval system utilizing a small embedding model fine-tuned on synthetic data could outperform a system utilizing a larger, off-the-shelf embedding model.

Project overview

As a data resource for this project, we used Operation CHARM https://charm.li/, a corpus consisting of over 50 000 car repair and diagnosis manuals. Our strategy for adapting a retrieval system to the “car manuals domain” can be roughly divided into five stages: data pre-processing, synthetic training data generation, curating sub-datasets, fine-tuning the embedding model using the sub-datasets, and evaluation. Each link in this chain proved crucial to the success of the project, and we will now dive deeper into some of the techniques used.

Figure 2: Overview of methods and techniques employed during this project for domain adaptation of a dense retrieval system.

Generating synthetic training data

Figure 3: Overview of how synthetic queries were generated.

Acquiring high-quality training data is not easy, and while synthetic queries generated by an LLM are a start, we found that simply prompting the LLM to produce queries matching each document did not result in realistic data. Often, the generated queries were long, overly detailed, and simply not human-like, as can be seen in the two examples listed below:

  • “What are the specific diagnostic trouble codes (DTCs) that may trigger the check engine light in some 2007-2009 Hyundai Accent vehicles, and what are the associated issues with each code?”
  • “How can the Terminal with lead wire kit (TRK011) be used to repair a damaged connector in the instrument panel of the Hyundai Accent L4-1.6L?”

Fine-tuning an embedding model on these extremely detailed, keyword-specific questions would not boost performance on real, human-written user queries: instead of learning the semantic similarities crucial for semantic search, the model would simply learn to match keywords. This would make the embedding model superfluous, and we might as well use traditional lexical (keyword) search.

Given this, we wanted to find efficient methods, usable with large datasets, that could improve the quality of the generated queries. First, we tested different prompting strategies, and techniques such as role prompting and few-shot prompting yielded noticeable improvements. However, prompting only got us so far, and the generated questions still did not quite meet the desired quality.
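As an illustration, the sketch below combines role prompting and few-shot prompting to generate one query per document. The model name, prompt wording, and few-shot examples are our own assumptions for the sketch, not the project's actual prompts.

```python
# Sketch of role + few-shot prompting for synthetic query generation.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = """Example document: "Front seat assembly, removal and installation ..."
Example question: "How do I remove the front seat in my car?"

Example document: "Front impact sensor, location and replacement ..."
Example question: "Where are the front impact sensors located?"
"""

def generate_query(document: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            # Role prompting: the model acts as a non-expert car owner.
            {"role": "system", "content": (
                "You are a car owner with no mechanical training. "
                "Write one short, natural question that the document below answers. "
                "Avoid error codes and technical jargon."
            )},
            # Few-shot prompting: show the style of question we want.
            {"role": "user", "content": FEW_SHOT + "\nDocument:\n" + document},
        ],
    )
    return response.choices[0].message.content.strip()
```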

Leveraging LLMs for higher quality training data

Another approach was tested: leveraging an LLM for corpus refinement. The basic idea was to simplify the corpus, modifying each document before generating the questions to make them more realistic and less detailed. Depending on the corpus and the domain, this can be done in many ways, and it is beneficial to be creative and to know the data you are working with.

In our case, each document began with a title, and after manual inspection, it was discovered that these were often structured in a non-semantic manner, containing lots of technical words and sometimes even specific error codes. This carried over to the generated queries: they often contained the same error codes and technical terms as the titles, making them less realistic and less human-like.

Based on these observations, we decided to send each document title to an LLM, prompted to simplify, summarize, and remove specific error codes while keeping important document context. We then replaced the original titles with the LLM-summarized ones, and used these modified documents as input for generating training queries. The results were very promising; below are examples of queries generated after LLM-based corpus refinement, followed by a sketch of the refinement step:

  • “How do I remove the front seat assembly in order to replace the seat belt system?”
  • “Where are the front impact sensors located in my Hyundai car?”
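A minimal sketch of this title-refinement step, again with an illustrative model name and prompt:

```python
# Sketch of LLM-based corpus refinement: simplify each raw document title
# before it is used for query generation.
from openai import OpenAI

client = OpenAI()

def refine_title(raw_title: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (
                "Rewrite this car-manual title as a short, plain-language summary. "
                "Remove error codes and part numbers but keep the essential context.\n\n"
                f"Title: {raw_title}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()

# The refined title then replaces the original one in the document
# before the query-generation prompt is applied.
```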

Staged fine-tuning

After high-quality training data had been secured, the embedding model fine-tuning process could begin. The Sentence Transformers library contains a multitude of loss functions suitable for this, and the selection of loss can often be narrowed down based on the training data format. We found that staged fine-tuning, first with a loss function adapted for document-query pairs, followed by a contrastive loss function teaching the model not only what is correct but also what is incorrect, yielded the best performance.
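Below is a sketch of staged fine-tuning with the Sentence Transformers `model.fit` API. The specific loss combination shown (MultipleNegativesRankingLoss on document-query pairs, then TripletLoss on triplets with hard negatives) is one reasonable choice rather than a prescription; the model name, hyperparameters, and the `query_doc_pairs` / `query_triplets` variables are assumptions for the example.

```python
# Staged fine-tuning sketch with Sentence Transformers.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # small base model (illustrative)

# Stage 1: (query, positive document) pairs from the synthetic data.
# query_doc_pairs is assumed to be a list of (query, document) tuples.
pair_examples = [InputExample(texts=[q, pos]) for q, pos in query_doc_pairs]
pair_loader = DataLoader(pair_examples, shuffle=True, batch_size=32)
pair_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(pair_loader, pair_loss)], epochs=1, warmup_steps=100)

# Stage 2: (query, positive, hard negative) triplets from hard negative mining.
# query_triplets is assumed to be a list of (query, positive, negative) tuples.
triplet_examples = [InputExample(texts=[q, pos, neg]) for q, pos, neg in query_triplets]
triplet_loader = DataLoader(triplet_examples, shuffle=True, batch_size=32)
triplet_loss = losses.TripletLoss(model)
model.fit(train_objectives=[(triplet_loader, triplet_loss)], epochs=1, warmup_steps=100)

model.save("finetuned-embedding-model")
```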

Figure 4: Overview of how an LLM was utilized to evaluate hard negative passages.

To be able to use a contrastive loss function, negative examples are necessary. The process of hard negative mining, i.e. finding the negative examples that are most difficult to distinguish from positives, is not straightforward for an unlabeled corpus. With basic strategies, such as taking documents that score highly but are not labeled as positives and labeling them as negatives, an issue of false negatives arises.

This was a concern during our project, since, given how the data was generated, each query had only one matching, positively labeled document. All other documents were thus potential, but not guaranteed, negatives. To tackle this, we once again leveraged an LLM, see Figure 4. The potential hard negative examples were sent to the LLM, together with the query and a prompt instructing the model to determine whether the document actually was a true negative, in which case it was labeled as such. The LLM hence acted as a domain expert, supervising the labeling to avoid false negatives.
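A sketch of how this LLM-verified hard negative mining could look is given below. The retrieval step mirrors the earlier dense retrieval sketch, and `llm_says_answers` is a hypothetical helper that wraps an LLM call (like the ones sketched above) and parses a yes/no verdict.

```python
# Sketch of LLM-verified hard negative mining (Figure 4).
from sentence_transformers import util

def mine_hard_negatives(query, positive_id, doc_embeddings, documents, model, top_k=10):
    # Retrieve the top-scoring candidate documents for the query.
    query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(query_emb, doc_embeddings, top_k=top_k)[0]

    negatives = []
    for hit in hits:
        doc_id = hit["corpus_id"]
        if doc_id == positive_id:
            continue  # skip the known positive document
        # Ask the LLM, acting as a domain expert, whether the candidate
        # actually answers the query. llm_says_answers(query, doc) -> bool
        # is an assumed helper, not a library function.
        if not llm_says_answers(query, documents[doc_id]):
            negatives.append(doc_id)  # verified (true) hard negative
    return negatives
```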

Evaluating the success of domain adaptation

To get an idea of how much fine-tuning can boost retrieval performance, evaluation can be conducted on synthetic questions, generated in the same manner as the training data but from documents not seen during training. However, to get a fair assessment of the system's capabilities, real human-written questions should be used in evaluation. During this project, we calculated the following commonly used information retrieval metrics on both synthetic and human-annotated test sets (a sketch of the evaluation step follows the list):

  1. Recall measures how often a relevant document is retrieved at all, while
  2. NDCG (Normalized Discounted Cumulative Gain) and
  3. MRR (Mean Reciprocal Rank) also take the ordering of the relevant document into account, rewarding higher scores when the relevant document is retrieved at position 1 rather than, say, position 5.
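Below is a sketch of how these metrics can be computed with the `InformationRetrievalEvaluator` from Sentence Transformers; the queries, corpus, and relevance judgments are toy placeholders standing in for the held-out synthetic or human-annotated test sets.

```python
# Evaluation sketch: Recall@k, MRR@k, and NDCG@k via Sentence Transformers.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")  # base or fine-tuned model

queries = {"q1": "Where are the front impact sensors located?"}
corpus = {
    "d1": "The front impact sensors are mounted behind the front bumper fascia.",
    "d2": "Remove the front seat assembly before replacing the seat belt pretensioner.",
}
relevant_docs = {"q1": {"d1"}}  # which corpus documents answer each query

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="car-manuals-test")
score = evaluator(model)  # computes Recall@k, MRR@k, NDCG@k, among others
print(score)              # exact return format depends on the library version
```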

By comparing the scores obtained on the same test sets before versus after fine-tuning an embedding model, we got an indication of how much retrieval performance could be increased using synthetic data. As a benchmark, an embedding model ten times larger was used, to investigate whether fine-tuning a small model could outperform sheer model size.

How much could we improve the retrieval system on a domain-specific corpus?

A lot! Our experiments demonstrated substantial improvements in retrieval performance after fine-tuning on synthetic data. The system that combined LLM-based corpus refinement and hard negative mining with staged fine-tuning showed the largest gains, significantly outperforming the baseline on both synthetic and human-annotated test data. Improvements in recall as well as in document ranking, measured with NDCG and MRR, were observed across all test sets.

The results were particularly striking when our fine-tuned model was compared against a much larger pre-trained embedding model. On both synthetic and human-annotated test datasets, our model consistently exceeded the performance of the larger model.

Figure 5: Average recall across the synthetic test datasets for a small base embedding model, a 10 times larger base embedding model, and our best fine-tuned small embedding model.

Figure 6: Average recall across the human-annotated test datasets for a small base embedding model, a 10 times larger base embedding model, and our best fine-tuned small embedding model.

Closing remarks

Key insights from our research highlight the value of synthetic data as a resource, confirming its practicality and effectiveness for fine-tuning models when labeled datasets are unavailable. LLM-based data processing techniques, such as corpus refinement, were in our case effective for improving synthetic data quality, and can be adapted and applied in countless ways depending on the data.

Our findings demonstrate that even smaller models, when properly fine-tuned, can outperform larger pre-trained models, offering a more efficient and cost-effective solution for applications where resources are limited.

Looking ahead, there are several avenues for future work. One promising direction is expanding our domain adaptation techniques to other specialized fields, such as medical records or financial documents, to validate the generalizability of our approach. Additionally, experimenting with more sophisticated synthetic data generation techniques could further enhance the realism of the generated queries. Finally, it would be interesting to integrate our fine-tuned models into practical applications, allowing us to assess their performance in live environments and gather user feedback for continuous improvement.

Wanna discuss advanced RAG systems with us?