Building a Private Offline RAG System with Hybrid Retrieval and Reranking

7 June 2026 by TechStora

Introduction to Building a Private Offline RAG System

Building a private, offline Retrieval-Augmented Generation (RAG) system over personal documents is challenging but rewarding. This implementation combines dense and sparse retrieval to support natural-language queries over a personal corpus of PDFs. Its distinguishing feature is that it runs entirely on local hardware, with no cloud APIs or external servers.

Dense embeddings capture semantic meaning, while sparse methods such as BM25 match exact terms, so the system can find relevant passages even when one of the two signals misses them. A cross-encoder then reranks the retrieved candidates, giving the local large language model (LLM) better context for generating cited answers.
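To make the sparse side concrete, here is a minimal, standard-library sketch of Okapi BM25 scoring. It is illustrative only: the tokenizer, the `k1` and `b` defaults, and the sample documents are assumptions, not details taken from the system described here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n_docs) or 1.0
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical mini-corpus; an exact identifier such as "BRCA1" only
# matches the document that literally contains it.
docs = [
    "BRCA1 mutations raise the risk of breast cancer.",
    "Tumor suppressor genes regulate cell division.",
    "The TP53 gene encodes the p53 protein.",
]
scores = bm25_scores("BRCA1 risk", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Documents that share no term with the query score exactly zero, which is why a sparse retriever is usually paired with a dense one rather than used alone.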

Understanding the Hardware Challenges

One of the primary challenges during development was the hardware, specifically an older GPU, the GTX 1080 Ti. Running under WSL2, it caused frequent process freezes and GPU unresponsiveness. The embedding model, BGE-M3, compounded the problem by hanging during long ingestion runs.

Troubleshooting initially focused on the batch size, but adjusting it did not help. Even small inputs were slow: a single 600-character chunk could take more than 90 seconds to process. This pointed to a fundamental incompatibility between the embedding model and the GPU in this setup.
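One way to check whether batch size is really the culprit is to measure per-chunk latency at several batch sizes. The sketch below does that with an injected `encode_fn`; the function names and the `fake_encode` stand-in are hypothetical, and the real model's encode call would be passed in its place.

```python
import time


def per_chunk_latency(encode_fn, chunks, batch_size):
    """Return average wall-clock seconds per chunk at a given batch size."""
    start = time.perf_counter()
    for i in range(0, len(chunks), batch_size):
        encode_fn(chunks[i:i + batch_size])
    return (time.perf_counter() - start) / len(chunks)


def fake_encode(texts):
    # Hypothetical stand-in for the real embedding call.
    return [[float(len(t))] for t in texts]


# 600-character chunks, matching the chunk size mentioned above.
chunks = ["x" * 600 for _ in range(32)]
results = {bs: per_chunk_latency(fake_encode, chunks, bs) for bs in (1, 8, 32)}
for bs, seconds in results.items():
    print(f"batch_size={bs:>2}: {seconds:.6f} s per chunk")
```

If latency per chunk stays high at every batch size, as it did here, the problem likely lies in the device or driver stack rather than in batching.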

Transitioning the Embedding Process to the CPU

The eventual fix for the GPU freezes was to move the embedding process to the CPU. Although the initial assumption was that the GPU would be faster, the BGE-M3 model turned out to be relatively lightweight, requiring only about 1 GB of memory. That made it feasible to run embedding on the CPU without significant performance degradation.

Switching to CPU-based embedding eliminated the freezes entirely and left the GPU free for other tasks, such as running the LLM. With this setup, a 100-chunk corpus can be embedded in under a minute. The change proved critical to stabilizing the system.
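A minimal sketch of CPU-pinned embedding might look like the following. It assumes the `sentence-transformers` package and the `BAAI/bge-m3` checkpoint, neither of which the text confirms as the loader actually used. The helper names are hypothetical, and `toy_encode` exists only so the batching logic can run without downloading a model.

```python
def load_cpu_encoder(model_name="BAAI/bge-m3"):
    """Load a dense encoder pinned to the CPU.

    Assumption: sentence-transformers is installed; the original system
    may have used a different loader.
    """
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(model_name, device="cpu")
    return model.encode


def embed_corpus(chunks, encode_fn, batch_size=16):
    """Embed chunks in small batches and return one vector per chunk."""
    vectors = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors.extend(encode_fn(batch))
    return vectors


def toy_encode(texts):
    # Hypothetical stand-in encoder: [character count, space count].
    return [[len(t), t.count(" ")] for t in texts]


chunks = [f"chunk number {i}" for i in range(100)]
vectors = embed_corpus(chunks, toy_encode)
# With the real model: vectors = embed_corpus(chunks, load_cpu_encoder())
```

Keeping the encoder behind a plain function argument means the device choice lives in one place, so the LLM can keep the GPU while embedding stays on the CPU.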

Optimizing Dense and Sparse Embeddings

The system uses dense embeddings to capture the semantic meaning of text and sparse embeddings to match exact terms, such as gene names or identifiers. Combining the two gives a retrieval mechanism that balances precision and recall: dense search catches paraphrases, while sparse search catches literal matches. The results of both are then fused so that the most relevant passages are passed on for further processing.
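The text does not say which fusion method is used. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the system's actual method. The chunk IDs and the constant `k=60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each ID earns 1 / (k + rank) from every list it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical chunk IDs returned by each retriever, best first.
dense_ranking = ["c3", "c1", "c7"]
sparse_ranking = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

RRF uses only rank positions, so it avoids having to put dense similarity scores and BM25 scores on a common scale.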

To improve result quality, a cross-encoder reranks the fused candidates. This step improves the context passed to the LLM, helping it generate more accurate, cited answers. The dual embedding strategy, coupled with reranking, forms the backbone of this private offline RAG system.
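The reranking and citation steps can be sketched as follows. The scorer is injected: `overlap_score` is a hypothetical stand-in that counts shared words. A real cross-encoder, for example `CrossEncoder.predict` from `sentence-transformers`, takes the same list of (query, passage) pairs and could be passed instead. The passage dictionaries and the prompt wording are assumptions.

```python
def rerank(query, passages, score_fn, top_k=3):
    """Score each (query, passage) pair jointly and keep the best top_k."""
    pairs = [(query, p["text"]) for p in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]


def build_cited_prompt(query, passages):
    """Number the passages so the LLM can cite them as [1], [2], ..."""
    lines = [
        f"[{i}] ({p['source']}) {p['text']}"
        for i, p in enumerate(passages, start=1)
    ]
    context = "\n".join(lines)
    return (
        "Answer using only the numbered passages and cite them like [1].\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


def overlap_score(pairs):
    # Hypothetical stand-in for a cross-encoder: count of shared words.
    return [
        len(set(q.lower().split()) & set(t.lower().split()))
        for q, t in pairs
    ]


passages = [
    {"source": "notes.pdf", "text": "BM25 ranks documents by term frequency"},
    {"source": "paper.pdf", "text": "cross-encoder reranking improves retrieval precision"},
    {"source": "misc.pdf", "text": "unrelated text about cooking"},
]
query = "how does reranking improve retrieval"
top = rerank(query, passages, overlap_score, top_k=2)
prompt = build_cited_prompt(query, top)
```

Because the cross-encoder scores every candidate pair individually, it is typically run only on the short fused list rather than on the whole corpus.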

Lessons Learned and Practical Insights

Building a private offline RAG system highlighted several important lessons. First, hardware compatibility is a crucial consideration, especially when working with older GPUs. The issues encountered with the GTX 1080 Ti under WSL2 underscore the importance of thoroughly testing hardware configurations.

Second, the choice of processing unit (CPU versus GPU) should be guided by the specific requirements of each task. In this case, offloading the embedding process to the CPU not only resolved the freezing issues but also improved overall system performance. Lastly, combining dense and sparse embeddings with reranking proved essential, as this approach substantially improves retrieval accuracy.