February 20, 2024

RAG system for advanced document search at large enterprises

  • Davit Janezashvili
  • Puya Sharif
  • Emil Larsson

In today’s world, technology is changing fast, and companies need to stay agile and innovative to keep up. One of the technologies making waves is the Retrieval-Augmented Generation (RAG) system. These tools are changing the game for businesses by making it easier to navigate their internal data through ChatGPT-style question-answer interfaces. Modulai is working closely with multiple large enterprises to tailor RAG solutions that fit just right for them.

But this isn’t just about throwing around big tech terms. It’s a closer look at why and how Modulai uses RAG systems to make a difference in performance. We’re talking real-life examples that show everything from the drawing board to the final product. It’s all about how Modulai’s RAG solutions hit the mark every time, blending into companies’ existing operations while bringing that much-needed flexibility and custom fit.


Modulai has worked closely with various large enterprises to tailor an advanced search solution for internal company data. The solution is designed to manage sensitive and proprietary data, which means security and scalability are important objectives. The system can be deployed on-premises, on any cloud platform, and can be set up to use local language models without sharing data with third parties.


Many companies need performant internal search capabilities but are reluctant to use SaaS where data integrity can be compromised. They want to be in control of their data, be able to customize the functionality to fit specific needs, and be able to interact with a diverse set of data and documents.


Based on learnings from multiple deployments of custom RAG solutions, we have put together a performant technology stack designed specifically for customizability. Each component has been developed, tested, and configured to deliver a best-in-class experience, letting users explore, search, and access various modes of data, with a reasoning and presentation layer optimized to answer complex questions intuitively and conversationally, with direct reference to the underlying documents.

System design

Our design philosophy for Retrieval-Augmented Generation (RAG) systems emphasizes three key elements: multi-agent architecture, security, and adaptability. By dividing the system into specific, adaptable agents, each responsible for a distinct function such as ingestion, retrieval, or response generation, we enhance the system’s resilience and compatibility with various service and storage providers.

The multi-agent architecture allows for easy updates and maintenance without overhauling the entire system, ensuring that as new models or other key technologies emerge, the system can incorporate them with minimal disruption. For instance, integrating new databases or changing the response generation LLM can be done within the respective configuration layer without affecting the rest of the system. The design also centers on configurable components, allowing customization to meet specific organizational challenges. This customization is essential for ensuring seamless integration with existing infrastructures and supporting businesses as they evolve and adapt to their industries.

Security is integral, not an afterthought. We prioritize it, establishing a foundation that protects data integrity and confidentiality. This approach is crucial for safe integration with databases such as Pinecone, ElasticSearch, and OpenSearch and storage solutions such as S3, GCS, and Azure Blob. We achieve this through encryption, access controls, and regular security audits, ensuring data is secure at rest and in transit.


Adaptability is achieved through a service-oriented architecture, facilitating seamless integration with cloud and on-premise infrastructures such as AWS, Azure, and GCP, using ECS, EKS, or a custom Kubernetes (k8s) setup. This flexibility allows the system to be tailored to different organizational needs, whether scaling up for large enterprises or offering precision for niche markets.

Using SaaS, cloud-based and locally hosted Large Language Models (LLMs), and embedding models enhances the system’s versatility in processing and generating responses. This approach provides options for organizations concerned with data privacy or those seeking to leverage the computational power of cloud services.

Integrations with frameworks such as LangChain and LlamaIndex expand the system’s capabilities, offering developers and organizations more tools for building efficient RAG applications. They represent our commitment to staying at the forefront of technological advancements, ensuring the system remains a relevant and powerful tool for information retrieval and generation.

In summary, our approach to RAG system design strikes a balance between flexibility, security, and efficiency. This ensures that the system can meet the complex demands of large organizations while maintaining the simplicity needed for easy integration and operation. By focusing on modularity, security, and adaptability, we provide a robust foundation for building RAG applications that are resilient, flexible, and capable of evolving with technological advancements.

Data ingestion

Feeding information into RAG is set up to be flexible and efficient. We offer various ways to input data, like direct API calls, scheduled fetches, and updates triggered by events. This flexibility keeps the system current with fresh data from different sources.

We handle a mix of data types, including PDFs, DOCX files, and images, and we’re working to support audio and video. For images, especially scanned documents, we use Optical Character Recognition (OCR) to turn pictures into text that can be searched, helping the system understand and file away various data types.

Combining neural search with traditional semantic and SQL searches is a powerful way to organize data. This combination helps to improve understanding of the content and makes it easier to locate specific data. Our team is continually updating and fine-tuning the models used in the system to better comprehend new data and the types of information people are likely to search for.

Extracting important bits from the data, like names and places, is a big part of what we do, thanks to Named Entity Recognition (NER). We also pull out metadata and keywords, which helps keep everything organized and easy to find later.

How we break down and arrange data is key. We use more models to split sentences and organize data to mirror how documents are set up. This makes pulling up the information you need quicker and more accurate.
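As a minimal sketch of this idea, the snippet below splits text on paragraph boundaries first and falls back to sentence boundaries only when a paragraph exceeds a size limit. The function name and the character limit are illustrative, not our production pipeline.

```python
def chunk_document(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks that respect paragraph, then sentence, boundaries.

    Illustrative sketch: real splitters also track headings, tables, and
    token counts rather than raw characters.
    """
    chunks = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
            continue
        # Paragraph too long: accumulate sentences up to the limit.
        current = ""
        for sentence in paragraph.replace("? ", "?|").replace(". ", ".|").split("|"):
            if current and len(current) + len(sentence) + 1 > max_chars:
                chunks.append(current.strip())
                current = ""
            current += sentence + " "
        if current.strip():
            chunks.append(current.strip())
    return chunks
```

Because chunk boundaries follow the document's own structure, each embedding ends up describing one coherent unit of meaning, which is what makes later matching precise.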

Retrieval strategy

We’ve adopted a flexible retrieval strategy to sift through the vast information available. This strategy blends dense and sparse search techniques to address various information needs effectively.

The chart outlines a retrieval system that starts with a user's query. This input is then processed using a query augmentation technique and, along with the original query, is passed to the search component. The system conducts a hybrid search for all queries, and then further enhances the results with the Parent Document Retrieval technique. To support the search, metadata is extracted from the original query and used as additional search parameters. To ensure the results are varied and relevant, the outcomes of these searches are then merged using the Maximal Marginal Relevance (MMR) method. Lastly, Contextual Reranking rearranges the results to better align with the query's context.
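The MMR merge step in this flow can be sketched in a few lines: iteratively pick the candidate most relevant to the query and least similar to what is already selected. Here similarities are supplied as plain dicts in place of real embedding math, and every name is illustrative.

```python
def mmr_select(candidates, query_sim, pairwise_sim, lam=0.7, top_k=3):
    """Maximal Marginal Relevance selection (sketch).

    candidates: doc ids; query_sim[d]: similarity(doc, query);
    pairwise_sim[(a, b)]: similarity between two docs.
    lam trades off relevance (high lam) against diversity (low lam).
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_k:
        def mmr(d):
            # Redundancy = similarity to the closest already-selected doc.
            redundancy = max(
                (pairwise_sim.get((d, s), pairwise_sim.get((s, d), 0.0))
                 for s in selected),
                default=0.0,
            )
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The diversity term is what keeps the merged result set from being several near-duplicates of the single best match.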

Hybrid search

Dense search relies on neural network embeddings and is particularly good at grasping the nuanced meanings between documents and queries. This capability allows us to find relevant information beyond simple keyword matches. On the other hand, sparse search, which focuses on keyword and phrase matching through algorithms like BM25, offers precise answers to straightforward queries.
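A minimal way to blend the two signals is a weighted sum over normalized scores. The sketch below assumes the dense and sparse scores per document have already been produced by their respective backends; the `alpha` weight and min-max normalization are illustrative choices, not the exact production scheme.

```python
def hybrid_scores(dense: dict[str, float], sparse: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    """Blend dense (embedding) and sparse (BM25-style) scores per document.

    Each score set is min-max normalized so the two scales are comparable;
    alpha weights the dense contribution. Illustrative sketch only.
    """
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
        return {doc: (s - lo) / span for doc, s in scores.items()}

    d, s = normalize(dense), normalize(sparse)
    docs = d.keys() | s.keys()
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
            for doc in docs}
```

Tuning `alpha` per corpus lets keyword-heavy collections lean sparse while conversational content leans dense.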

Query augmentation

To enhance search capabilities, we also use query augmentation and multi-querying. These methods help expand, rephrase, and create hypothetical document embeddings to cast a wider net for relevant data, especially useful for complex, multi-hop queries requiring a deeper understanding and retrieval of information.
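A bare-bones multi-query loop might look like the following. Here `generate_variants` stands in for an LLM call that rephrases the query, and `retrieve` for any search backend; both are hypothetical placeholders.

```python
def multi_query_retrieve(query, generate_variants, retrieve, top_k=5):
    """Run retrieval over the original query plus generated rephrasings,
    deduplicating the pooled results while preserving first-seen order.

    generate_variants: callable(str) -> list[str], e.g. an LLM prompt.
    retrieve: callable(str) -> list of doc ids for one query.
    """
    queries = [query] + generate_variants(query)
    seen, pooled = set(), []
    for q in queries:
        for doc_id in retrieve(q):
            if doc_id not in seen:
                seen.add(doc_id)
                pooled.append(doc_id)
    return pooled[:top_k]
```

For multi-hop questions, each rephrasing can surface documents the literal query would never match, at the cost of one extra LLM call per variant.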

Parent document retrieval

We break down large documents into smaller chunks, creating embeddings for these pieces to reflect their content accurately. When a query comes in, it’s matched against these embeddings to find relevant information. This process ensures that the responses are both specific to the query and rich in context.
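One way to sketch parent-document retrieval: match the query against small chunks, but return the enclosing parent documents so the generator sees full context. Toy keyword overlap stands in for embedding similarity here, and all names are illustrative.

```python
def retrieve_parents(query, chunks, top_k=2):
    """chunks: list of (chunk_text, parent_id) pairs.

    Returns parent ids ranked by their best-matching chunk; a real system
    would score chunks by embedding similarity instead of word overlap.
    """
    q_terms = set(query.lower().split())
    scored = []
    for text, parent in chunks:
        overlap = len(q_terms & set(text.lower().split()))
        scored.append((overlap, parent))
    scored.sort(key=lambda t: t[0], reverse=True)

    parents, seen = [], set()
    for score, parent in scored:
        if score > 0 and parent not in seen:
            seen.add(parent)
            parents.append(parent)
    return parents[:top_k]
```

The key property is the indirection: small chunks keep matching precise, while the parent mapping keeps the returned context complete.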

File and repository structure

The retrieval strategy accounts for the data repository’s structure, be it a database or a directory hierarchy. By factoring in elements like database schemas and document structures, this approach can improve search accuracy and supply an answer-generation model with context relevant to the query.


We use techniques like Reciprocal Rank Fusion (RRF), Maximal Marginal Relevance (MMR), and contextual ranking to merge results from different searches. These methods help prioritize the most relevant information, ensuring the retrieved data is pertinent and contextually appropriate.
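As an example of one of these merging methods, RRF can be written in a few lines. Each document's fused score is the sum of reciprocal ranks across the input lists; the constant k = 60 is the value commonly used in the literature.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids via Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it
    (rank is 1-based). Documents near the top of any list rise; documents
    consistently mid-ranked across lists beat one-list outliers.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is attractive precisely because it needs no score normalization: it only looks at ranks, so dense and sparse lists with incompatible score scales can be fused directly.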

The retrieval agent combines various search methodologies, query enhancement techniques, and result-merging tactics to deliver precise, relevant, and contextually fitting information. This approach is designed to meet the needs of a diverse audience, balancing technical depth with practical application.

This approach reflects a commitment to providing reliable information retrieval solutions without claiming to be the ultimate or only option. We aim to support the community by offering a system that can adapt to the evolving information retrieval landscape, acknowledging the importance of technical innovation and practical utility.

Generation process

In the heart of the system lies a smart process aimed at crafting replies that hit the mark regarding relevance, accuracy, and context alignment.

The flowchart illustrates a streamlined query handling system. It begins with the user's question, which then goes through a summarization of previous interactions. Next, relevant details are added to refine the query. Depending on the query's nature, it either exits the system if deemed invalid or out of scope, or proceeds to a process that links prompts together for context. Following this, the query undergoes a series of steps including function execution, mathematical computations, and identification of key entities. The system then creates a response and reviews it internally. If the self-critique deems the reply adequate, it is returned to the user; otherwise, it undergoes a regeneration process for improvement.


This begins with summarizing conversation history, ensuring responses fit the context better, making them more on point. To keep RAG conversations positive and on track, we set up checks to keep out harmful or off-topic content. When the system gets a new query, it first checks if it’s something we can handle, filtering out stuff that doesn’t fit our wheelhouse. This step ensures we focus our efforts where the system can provide the answer. We then guide the query to the right part of our system for the best handling, making the process more precise.

Prompt chaining

For complex tasks, we break down the query into smaller parts that are easier for the Large Language Model (LLM) to manage. This breakdown helps the system think through things more clearly and handle tasks better. Together with this, it makes it possible to trace and debug each step of model reasoning.
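A minimal prompt chain could look like this; `call_llm` is a hypothetical stand-in for any chat-completion API, and the returned trace is what makes each reasoning step inspectable and debuggable.

```python
def run_chain(steps, call_llm):
    """Execute a sequence of prompt templates, feeding each step's output
    into the next via a {prev} slot.

    steps: list of prompt templates containing "{prev}".
    call_llm: callable(str) -> str, a placeholder for a real LLM client.
    Returns (final_answer, trace) where trace records every prompt/output.
    """
    prev, trace = "", []
    for template in steps:
        prompt = template.format(prev=prev)
        prev = call_llm(prompt)
        trace.append({"prompt": prompt, "output": prev})
    return prev, trace
```

In practice each template would be a focused sub-task ("extract entities from {prev}", "compute the totals in {prev}"), so a failure can be localized to the exact step whose output went wrong.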

Spotting and sorting the main entities in the question—like documents, places, or concepts—is key to understanding and tailoring responses. For algebraic or computational tasks, we’re all about using different LLM tools and agents or pulling in data from outside through APIs to get the job done.

A big part of our job is asking follow-up questions when things aren’t clear. This makes the conversation feel more natural and helps us better grasp what the user is after. Asking more questions doesn’t just improve answers; it also makes the chat more lively and more like talking to a human.


Toward the end of the process, self-critique is used to validate responses and ensure they’re as good as possible. Citing specific documents or paragraphs when referencing information ensures clarity and precision in responses, reinforcing the relevance and specificity of answers.

Regarding shaping prompts, we’re all about setting up the LLM for success. This means building a kind of thought map to help guide the system towards the best responses, using examples to show how to align with different needs, and tweaking the setup to work well with different types of models like GPT-4-turbo, Claude V2.0, and Mixtral.

Keeping the language professional is key, especially when a business-like tone is needed. This ensures responses aren’t just correct but also fit the conversation’s vibe.

Continuous evaluation

We’ve established a straightforward validation and evaluation process for refining our Retrieval-Augmented Generation (RAG) system. This step is essential for ensuring the system works well and meets the needs of different users. At the heart of this effort is an automated setup that continuously tests and tweaks the system to keep it running smoothly in various situations.

Synthetic data

The first part of checking the system’s quality uses synthetic datasets. We put together sets of questions, contexts, and answers covering a broad range of types—from yes/no questions to more detailed inquiries and multi-hop fact-based questions. This approach helps us ensure our system is ready for real-world challenges by testing it against various situations, including different complexity levels and the amount of context needed to come up with answers.

Human in the loop

Next, we add a layer of testing with datasets marked by people. This is important because it lets us compare the system’s outputs to human judgment, ensuring the answers it gives align with what people would expect.

Retrieval validation

When we look at how well the system retrieves information, we focus on two things:

  1. How much of the relevant information it finds (recall)
  2. How much of the retrieved information is actually relevant (precision)

These checks help us understand how effectively the system can sift through data to find and use the right information.
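Per query, these two checks reduce to a few lines; the helper name below is illustrative.

```python
def retrieval_metrics(retrieved, relevant):
    """Compute (recall, precision) for one query.

    retrieved: doc ids the system returned.
    relevant: ground-truth doc ids for the query.
    Recall = fraction of relevant docs found; precision = fraction of
    returned docs that are relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```

Averaging these over the annotated and synthetic query sets gives the running numbers the automated evaluation loop tracks.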

LLM as a judge

To test how well the system generates answers, we try something different by having a Large Language Model (LLM) act as a reviewer. This involves looking at the responses to see if they’re correct, relevant, and factually accurate. This part of the test ensures the answers the system gives are on point and reliable.
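A hedged sketch of this pattern: a second model grades each generated answer against a rubric. The PASS/FAIL verdict format, the rubric wording, and the `call_llm` placeholder are assumptions for illustration, not our exact prompts.

```python
# Illustrative judge prompt; a production rubric would be more detailed
# and might ask for per-criterion scores rather than a single verdict.
JUDGE_TEMPLATE = (
    "You are grading a RAG answer.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer: {answer}\n"
    "Reply PASS if the answer is correct, relevant, and supported by the "
    "context; otherwise reply FAIL."
)

def judge_answer(question, context, answer, call_llm):
    """Return True if the judge model deems the answer acceptable.

    call_llm: callable(str) -> str, a placeholder for any chat-completion API.
    """
    verdict = call_llm(JUDGE_TEMPLATE.format(
        question=question, context=context, answer=answer))
    return verdict.strip().upper().startswith("PASS")
```

Keeping the verdict machine-parseable is what lets this judge run inside the automated evaluation loop rather than requiring a human to read each response.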

Custom evaluation metrics

We also design specific metrics for different use cases, fine-tuning our evaluation to match the unique requirements of each scenario. This tailored approach ensures the system is delivering the best possible performance for the specific context it’s used in.

By going through this thorough testing and refining process, we’re working hard to create a RAG system that’s not only technically reliable but also adaptable to meet the complex needs of real-world applications. This effort ensures our system is prepared to handle a wide range of questions with accuracy and reliability.

Final thoughts

At Modulai, our journey in crafting Retrieval-Augmented Generation (RAG) systems is driven by the goal of addressing the intricate demands of large organizations. Our blueprint for these systems is built on a foundation of flexibility, security, and customization, ensuring they serve the specific needs of each organization effectively. By concentrating on key components such as data ingestion, retrieval processes, and rigorous validation, we’ve developed RAG systems that stand out for their reliability, precision, and ability to tackle these organizations’ complex challenges.

A standout feature of the systems is their swift adaptability to the evolving landscape of Large Language Models (LLMs). This adaptability is essential, allowing for the seamless integration of new models as they emerge and furnishing organizations with the most current information and insights. This capability is not just an advantage but a necessity for large entities, enabling them to remain at the forefront of their industry with data-driven decisions informed by the latest advancements.

Integrating tools like LangChain and LlamaIndex plays a pivotal role in the success of RAG systems. These tools offer a reliable and efficient means to tap into and process large language models, ensuring that even the extensive demands of major organizations can be met without compromising performance or security. Pairing these advanced tools with our robust system architecture and a detailed focus on validation, we’ve crafted a solution that’s not just scalable but also deeply attuned to the multifaceted needs of large firms.

In essence, our approach to RAG systems at Modulai is about crafting solutions that are not only technologically adept but also profoundly aligned with the strategic objectives of large organizations. It’s about providing a system that answers questions and anticipates needs, ensuring that every piece of information retrieved and generated is a step towards more informed, strategic decision-making. Our commitment is to deliver RAG systems that empower organizations to leverage their data resources fully, enhancing their operations and decision-making processes in a world that never stands still.



Want to discuss RAG in enterprises with us?