
Navigating RAG: Selecting the optimal architecture for your needs 

Magnus Lindberg Christensen, Junior AI Expert, Solita

Published 08 Nov 2024

Reading time 10 min

2023 and 2024 have seen rapid advances in LLMs, with the emergence of numerous models varying in both size and capability. The number of LLM providers has grown as well, offering various solutions for hosting both models and data across different storage systems, be it on-premises or cloud-based. Because of these ample opportunities, it has become more important for developers and solution architects to thoughtfully consider the overall architecture of Retrieval-Augmented Generation (RAG) systems during application development.

Deciding on the right architecture isn’t a trivial task. The most suitable solution design must balance performance, security, scalability, and cost-efficiency, while still aligning with project requirements such as latency, customer expectations, and so forth.

In this blog post, we will explore various architectural approaches for RAG systems, ranging from entirely cloud-based solutions to fully local, on-premises solutions.

Overview of RAG architectural approaches

To start, it’s important to understand that the choice of architecture involves multiple decisions. The considerations are summarised below.

[Figure: summary of the architectural considerations for a RAG system]

In this overview, we will categorise RAG architectures into three primary types: On-premises solutions, cloud-based solutions, and hybrid solutions. Each of the categories represents a certain approach to balancing the trade-offs of control, scalability, and resource utilisation.

On-premises solutions

In the context of RAG applications, on-premises solutions refer to architectures where both the LLM and the data storage are hosted entirely on local infrastructure. Generally, this approach offers greater control over data and the highest level of infrastructure security, although the security of the data preprocessing itself remains a separate concern. Unfortunately, it also imposes a heavier burden in terms of setup, maintenance, and scalability. The following sections introduce several on-premises architectures that can be employed.

Local pre-trained model with local database

In this architecture, a pre-trained LLM (such as Llama or Mistral) is selected and deployed on the local infrastructure. Additionally, an appropriate embedding model is used to transform the data into numerical vectors, which are then stored in a locally hosted database, such as a vector database, Elasticsearch, or another store, depending on the system’s requirements.
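
To make the pattern concrete, here is a minimal sketch assuming the sentence-transformers and transformers libraries, with an in-memory numpy array standing in for the local database; the model names and example documents are illustrative placeholders, and a real deployment would use a proper local vector database:

```python
# Minimal local RAG sketch: local embedding model + in-memory vector store
# + locally hosted generator. Model names and documents are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # runs locally on CPU/GPU
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

documents = [
    "Our support line is open weekdays 9-17.",
    "Refunds are processed within 14 days of the request.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)  # the "local database"

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k most similar documents by cosine similarity."""
    q = embedder.encode([question], normalize_embeddings=True)
    scores = doc_vectors @ q[0]
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

question = "How long do refunds take?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(generator(prompt, max_new_tokens=100)[0]["generated_text"])
```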

  • Flexibility: Low
  • Customisability: Limited to fine-tuning
  • Scalability: Low. Requires hardware upgrades
  • Model change flexibility: Low. Requires updates to the on-premises infrastructure

Pros 

  • Enhanced privacy and security: Data never leaves the local environment, thus ensuring higher security
  • Compliance with data regulations: On-premises solutions make it easier to comply with regulations such as GDPR, especially when dealing with sensitive data
  • Complete control: Full ownership over both the model and the data infrastructure
  • Lower long-term costs: If the system is long-lived and relatively stable, hosting everything locally can result in lower operating costs by avoiding cloud subscription fees
  • No reliance on Internet connectivity: Operations can continue even if there is no Internet access

Cons 

  • High initial costs and maintenance: Setting up and maintaining local servers, databases, and the model infrastructure is resource-intensive, requiring skilled personnel and significant initial investment
  • Limited scalability: Scaling up typically involves costly hardware upgrades, making this less flexible than cloud-based solutions
  • Resource intensive: Running LLMs and databases locally requires substantial computational resources
  • Hardware obsolescence: Hardware and software may quickly become outdated, and upgrading infrastructure may be costly and potentially complex 

Ollama language (and embedding) model

Ollama is a third-party tool that simplifies access to open-source language models while keeping the hosting local. The architecture mirrors the previous example but relies on Ollama (or any other third-party application with the same goal) to streamline access to LLMs and embedding models.
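
As a rough sketch of the developer experience, the following assumes an Ollama instance running locally on its default port; the model names are just examples of models Ollama can pull:

```python
# Sketch of RAG model calls against a locally running Ollama instance
# (default endpoint http://localhost:11434; model names are examples).
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def generate(prompt: str) -> str:
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": "llama3", "prompt": prompt, "stream": False})
    return r.json()["response"]

# Embeddings would be stored in and retrieved from a local vector store;
# here we only show the two model calls that Ollama manages for us.
vector = embed("Refunds are processed within 14 days of the request.")
answer = generate("Summarise our refund policy in one sentence.")
print(len(vector), answer)
```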

  • Flexibility: Medium. Access to several open-source models
  • Customisability: Limited to fine-tuning
  • Scalability: Low. Requires on-premises hardware upgrades


Pros

  • Higher security with outsourced convenience: Sensitive data stays on-premises while benefiting from a more managed experience through Ollama
  • Customisability: Developers can choose from a variety of LLMs and embedding models tailored to their use case
  • Flexibility: Ollama’s interface lets you quickly pull and deploy an LLM, providing more flexibility to change models
  • Cost efficiency over time: Once set up, there are no recurring costs beyond infrastructure maintenance
  • Reduced development time: Ollama simplifies the management of language models, and thus the application development time may be reduced
  • Streamlined updates and integration: Ollama simplifies the integration with open-source models, thereby giving easier access to the latest models

Cons

  • Potentially increased latency: Depending on the integration specifics, latency may increase slightly, especially when combining components from multiple sources
  • Less scalability: Although Ollama simplifies model management, it does not eliminate the challenges of scaling on-premises systems
  • Third-party dependency: Even though data remains local, there remains a dependency on the Ollama platform, thus limiting complete autonomy
  • Limited customisability: Though the technique is flexible, complete customisation of model and infrastructure is still limited compared to fully do-it-yourself on-premises solutions 

Local fine-tuned model with local database

A more specialised and domain-specific variation of the pre-trained model approach involves fine-tuning the model on domain-specific data. This allows for the creation of a tailored solution that is more suited to particular tasks and industries.
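
In practice, fine-tuning a local model is often done with parameter-efficient methods such as LoRA rather than full retraining. A minimal configuration sketch using the Hugging Face peft library (the base model and hyperparameters are illustrative, not a recommendation) might look like this:

```python
# Sketch: wrap a locally hosted base model with LoRA adapters for
# domain-specific fine-tuning (model name and hyperparameters are examples).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                       # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a small fraction of weights are trained

# From here, the adapted model is trained on domain-specific examples with a
# standard training loop or transformers' Trainer, then served locally.
```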

  • Flexibility: Low. Often tied to a single model
  • Customisability: High. The fine-tuned model is highly customisable
  • Scalability: Limited. Requires hardware upgrades and re-fine-tuning if the model changes

Pros 

  • Greater precision: Fine-tuning allows the model to specialise in the domain of the application, leading to more accurate and relevant results
  • Complete data control: Both the model and data stay on-premises, ensuring privacy and regulatory compliance
  • Customisation flexibility: Fine-tuning provides more control over the model’s behaviour and performance
  • Better long-term model optimisation: Fine-tuning allows for gradual optimisation as more specific data is introduced over time

Cons

  • Longer development time: Fine-tuning models is resource-intensive and requires specialised knowledge to ensure the model is optimally trained for the task
  • High maintenance, storage, and resource demands: Fine-tuned models require additional storage and computational power to host and maintain

Custom language and embedding model

This approach goes beyond fine-tuning to also create a custom embedding model. This allows for full optimisation of both the language model and the embedding model to align with the specific domain of the application.
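
To illustrate the embedding side, here is a hedged sketch of adapting an open embedding model to domain-specific query/passage pairs using the sentence-transformers training API; the base model and the training pairs are placeholders:

```python
# Sketch: fine-tune a local embedding model on domain-specific query/passage
# pairs so that retrieval matches the application's vocabulary.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # starting point, placeholder

train_examples = [
    InputExample(texts=["How long do refunds take?",
                        "Refunds are processed within 14 days of the request."]),
    InputExample(texts=["When is support available?",
                        "Our support line is open weekdays 9-17."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # pulls matching pairs together

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("domain-embedding-model")  # deployed alongside the custom LLM
```
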
  • Flexibility: Low. Once built, this is what you use
  • Customisability: High. Both the LLM and the embedding model are fine-tuned
  • Scalability: Low. Requires hardware upgrades and re-fine-tuning when changing the models

Pros

  • Complete domain-specific optimisation: Both the language model and the embedding model can be crafted to suit a particular domain, improving performance and accuracy
  • High security: As with other on-premises solutions, this approach ensures complete data privacy
  • Long-term cost efficiency: Once the system is in place, there are no ongoing subscription costs
  • Comprehensive data and model control: Control over every element of the pipeline allows for more robust optimisation and security measures

Cons

  • High setup costs: The initial effort and cost to develop and fine-tune both models are significant
  • Resource intensive: Requires significant hardware, expertise, and time to develop and maintain
  • Less flexibility: Once a custom system is built, it can be more challenging to adapt or update it compared to off-the-shelf solutions
  • Requires advanced expertise: Custom training both the LLM and the embedding model requires deep expertise in machine learning, which could be a limiting factor for many teams

Nvidia NIM

Nvidia NIM (Nvidia Inference Microservices) is a set of containerised microservices optimised for serving LLMs and embedding models on Nvidia hardware. It is particularly suited for low-latency, high-performance applications in specific domains.
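
NIM containers expose an OpenAI-compatible HTTP interface, so once a container is deployed locally it can be called with a standard client; the endpoint URL, port, and model name below are assumptions for illustration:

```python
# Sketch: query a locally deployed NIM container through its
# OpenAI-compatible endpoint (URL, port, and model name are illustrative).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",   # whichever model the NIM container serves
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Context: ...\n\nQuestion: ..."},
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)
```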

  • Flexibility: Medium. Requires GPU-accelerated systems (Nvidia in particular)
  • Customisability: Medium. Models can be fine-tuned, but deeper customisation is constrained
  • Scalability: Low. Scaling requires further investment in Nvidia hardware, which may be costly


Pros 

  • Optimised performance: Nvidia’s hardware and software ecosystem allows for extremely low-latency solutions
  • Fine-tuned precision: Models can be fine-tuned for specific domains, thus increasing their relevancy and accuracy
  • High security and privacy: As a locally hosted solution, all the data remains within the organisation
  • Specialised for GPU acceleration: Nvidia’s hardware ecosystem is well-suited for accelerated machine learning tasks, providing significant performance boosts

Cons

  • Limited scalability: Scaling such a solution requires further investment in specialised Nvidia hardware
  • High initial costs: Setup and maintenance costs are substantial, particularly for organisations without existing Nvidia infrastructure
  • Dependency on the Nvidia ecosystem: The architecture is coupled to Nvidia hardware and software, which limits the flexibility to switch ecosystems; solutions might not be portable to non-Nvidia systems

Cloud-based solutions

At their core, cloud-based solutions leverage external infrastructure: models and data are hosted on cloud platforms and accessed via APIs. These architectures often offer greater scalability and flexibility but introduce concerns about privacy, security, and long-term costs.

API LLM endpoint with cloud database

In this architecture, the language model is selected and hosted on a cloud platform, such as Azure or AWS. The data is stored in a cloud-based database, providing a seamless and scalable solution for RAG applications.
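
As an illustration, a sketch of this setup on Azure might look as follows; the endpoints, deployment name, and index fields are placeholders, and keyword search is used for brevity where a production system would typically use vector or hybrid search:

```python
# Sketch: cloud-hosted LLM (Azure OpenAI) plus a cloud-hosted search index
# (Azure AI Search). Endpoints, deployment name, and index fields are
# illustrative placeholders.
from openai import AzureOpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

llm = AzureOpenAI(azure_endpoint="https://my-openai.openai.azure.com",
                  api_key="...", api_version="2024-02-01")
search = SearchClient(endpoint="https://my-search.search.windows.net",
                      index_name="docs", credential=AzureKeyCredential("..."))

question = "How long do refunds take?"
hits = search.search(search_text=question, top=3)    # cloud retrieval
context = "\n".join(doc["content"] for doc in hits)   # assumes a 'content' field

answer = llm.chat.completions.create(
    model="my-gpt-4o-deployment",                     # Azure deployment name
    messages=[{"role": "user",
               "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)
```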

  • Flexibility: High. It is reasonably easy to change models
  • Customisability: Low, as you are often restricted to the cloud provider’s offering
  • Scalability: High. It is easy to scale in the cloud by simply increasing the subscription


Pros 

  • Reduced initial cost: Cloud platforms eliminate the need for significant upfront investment in hardware
  • Scalability: Cloud infrastructure allows for rapid scaling to meet demand
  • Convenience: The cloud provider manages maintenance and updates, reducing the need for internal resources
  • Rapid deployment: Cloud solutions can be deployed and scaled much faster than on-premises setups 

Cons

  • Privacy and security concerns: Sensitive data is processed on external servers, raising concerns about compliance with data privacy regulations
  • Ongoing costs: While initial costs are lower, ongoing subscription fees and pay-per-use charges can add up
  • Reliance on external providers: The system is fully dependent on the reliability and availability of the cloud provider
  • Provider lock-in: Moving away from a cloud provider can be costly and complex once reliant on their ecosystem  

API LLM endpoint with API database

This architecture further decentralises the above solution by hosting the language model and database on separate cloud providers, accessed through respective APIs.
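
The defining characteristic is simply two independent API clients. The sketch below pairs a language model from one provider with a vector database from another; Anthropic and Pinecone are only examples, and the index name, vector dimension, and metadata fields are assumptions:

```python
# Sketch: model and database live with different providers and are accessed
# through their own APIs. Provider choice, index name, vector dimension, and
# metadata fields are illustrative.
import anthropic
from pinecone import Pinecone

llm = anthropic.Anthropic(api_key="...")         # provider A: the language model
index = Pinecone(api_key="...").Index("docs")    # provider B: the vector database

query_vector = [0.1] * 1536                      # produced by whichever embedding API you use
hits = index.query(vector=query_vector, top_k=3, include_metadata=True)
context = "\n".join(m.metadata["text"] for m in hits.matches)

reply = llm.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=300,
    messages=[{"role": "user",
               "content": f"Context:\n{context}\n\nQuestion: How long do refunds take?"}],
)
print(reply.content[0].text)
```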

  • Flexibility: Very high. It is easy to swap components between providers
  • Customisability: Medium. More configuration work is needed due to the multiple providers
  • Scalability: Very high. The model and database can be scaled independently


Pros

  • Flexibility: By selecting different providers for the model and database, developers can optimise each component for their specific needs
  • Scalability: Both the model and database can scale independently, allowing for more precise control of the resources

Cons

  • Increased complexity: Managing multiple APIs introduces additional complexity in synchronisation, which may lead to performance issues and potentially higher latency
  • Higher long-term costs: Utilising multiple cloud providers can lead to higher operational expenses, particularly when scaling across providers
  • Lower data control: Distributing data and processing across multiple providers can introduce more privacy risks
  • Increased data transfer latency: Data moving between providers introduces additional latency considerations
  • Data security complexity: Relying on multiple API providers requires complex security measures to ensure data integrity and privacy 

Hybrid solutions

Hybrid architectures try to combine the benefits of both on-premises and cloud-based solutions, thus offering a compromise between control and scalability. These approaches are often favoured when organisations need to balance, for example, privacy requirements with resource flexibility.

Local LLM and API database

In this architecture, a language model is hosted locally, while the database is accessed via an API from a chosen cloud provider. This approach balances local control over the model with the scalability of cloud-based storage.
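
A sketch of this split, assuming a local model served through Ollama and a managed vector database accessed over its API; Pinecone here is only an example provider, and the index name and metadata fields are placeholders:

```python
# Sketch: retrieval goes to a managed cloud vector database, while generation
# stays on-premises (here via a local Ollama endpoint). Names are placeholders.
import requests
from pinecone import Pinecone

index = Pinecone(api_key="...").Index("docs")    # cloud-hosted vector database

def generate_locally(prompt: str) -> str:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "llama3", "prompt": prompt, "stream": False})
    return r.json()["response"]

query_vector = [0.1] * 1536                      # from a locally run embedding model
hits = index.query(vector=query_vector, top_k=3, include_metadata=True)
context = "\n".join(m.metadata["text"] for m in hits.matches)

print(generate_locally(f"Context:\n{context}\n\nQuestion: How long do refunds take?"))
```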

  • Flexibility: High. Combines both on-premises and cloud capabilities
  • Customisability: Medium. Local control of the LLM, but limited control over the database
  • Scalability: Medium. Difficult to scale the local model, but very easy to scale the database

Pros

  • Privacy for model processing: Keeping the LLM on-premises ensures that sensitive data remains under local control
  • Lower infrastructure costs: By outsourcing the database, organisations can reduce the costs associated with maintaining large-scale storage
  • Scalability for data: Cloud databases offer flexible storage solutions that can grow with the application’s need
  • Reduced on-premises storage burden: Offloading the database reduces the storage burden on local servers and makes the system more efficient

Cons 

  • Latency and synchronisation issues: The separation between the model and database can introduce latency, particularly when handling large volumes of data
  • Complexity in management: Maintaining synchronisation between local and cloud components adds to the overall system complexity
  • GDPR and data privacy: Careful considerations are required to ensure compliance with data protection regulations such as GDPR
  • Increased cloud dependence for data: While the model is local, database access still relies on cloud availability and performance 

API LLM with local database

In this architecture, the language model is hosted in the cloud, while the data is stored locally. This setup is useful for organisations that prioritise control over their data but wish to leverage the scalability and flexibility of cloud-based LLMs.
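
A sketch of this arrangement, assuming a local on-disk Chroma store for the documents and a cloud-hosted LLM API for generation; the store path, collection name, example document, and model name are illustrative:

```python
# Sketch: documents and embeddings stay in a local Chroma store on disk,
# while generation calls a cloud-hosted LLM API. Names are placeholders.
import chromadb
from openai import OpenAI

store = chromadb.PersistentClient(path="./local_rag_db")   # local, on-disk database
docs = store.get_or_create_collection("docs")
docs.add(ids=["1"], documents=["Refunds are processed within 14 days of the request."])

question = "How long do refunds take?"
hits = docs.query(query_texts=[question], n_results=1)      # local retrieval
context = "\n".join(hits["documents"][0])

llm = OpenAI(api_key="...")                                  # cloud-hosted model
answer = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)
```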

  • Flexibility: High. Leverages cloud capabilities for LLMs but retains the sensitive data locally
  • Customisability: Low. LLMs in the cloud can’t be deeply customised
  • Scalability: Medium. Great scalability for the model, yet limited for the local storage, as this requires further hardware investments


Pros

  • Data control: Sensitive data is stored locally, providing greater security and compliance with privacy regulations
  • Lower initial costs: By relying on cloud-hosted LLMs, organisations avoid the need for expensive local hardware to support large models
  • Scalability: The cloud-based model can scale to meet increasing demand without the need for costly hardware upgrades

Cons

  • Dependency on cloud provider: The system’s performance is reliant on the availability and reliability of the cloud provider
  • Latency issues: There may be increased latency when communicating between the cloud-hosted model and the local database
  • Ongoing cloud costs: Utilising cloud-based LLMs introduces recurring subscription fees, which can accumulate over time

Conclusions

Selecting the optimal architecture requires careful consideration of multiple factors, including latency requirements, data security, scalability, and resource availability. On-premises solutions provide unparalleled control and security but come with high upfront and ongoing costs for maintenance. Cloud-based solutions offer flexibility and scalability but introduce privacy concerns and may be costly in the long term. Hybrid architectures attempt to strike a balance between these, allowing for both scalability and control, though they may introduce more complexity and synchronisation challenges.

By evaluating the unique needs of the application, such as domain-specific requirements, the size and sensitivity of the data, and the available resources for development and maintenance, developers can make informed decisions that ensure a robust and efficient architecture tailored to their project’s goals.

Ultimately, there is no one-size-fits-all solution, but careful planning and consideration will lead to an architecture that best meets the needs of both the organisation and its users. 
