Evaluating RAG pipelines on a million documents with OpenVINO on 5th Gen Intel Xeon Processors
Summary
This paper demonstrates how the OpenVINO toolkit [1] can accelerate Retrieval-Augmented Generation (RAG) pipelines by using the AI acceleration features of 5th Gen Intel Xeon Scalable processors [2] to optimize large language models (LLMs) and vector embedding models.
Enterprises across industries can benefit from LLM-based RAG applications that extend generative AI workloads with modern information retrieval techniques to address many of the limitations of LLMs, in particular their tendency to generate factually incorrect content, or “hallucinations”. The OpenVINO toolkit leverages the features of the 5th Gen Intel Xeon Scalable processors to optimize two of the most performance-critical runtime components of RAG pipelines: embedding models and LLM inference.
In this document, we show that by using the OpenVINO toolkit together with 5th Gen Intel Xeon Scalable processors, a speedup of close to 2x can be obtained in the LLM-generation step of a RAG pipeline. We also show that an additional 1.9x can be obtained with model quantization, for a total speedup of up to 3.7x. Finally, we show that OpenVINO can build a vector database containing more than a million documents extracted from Wikipedia up to 2x faster than out-of-the-box PyTorch.
Market positioning
LLMs have been at the center of attention for many enterprises because of their ability to generate human-like content for many different use cases across industries, including customer service, education, healthcare, law, research, and entertainment. Despite their impressive performance in generative AI tasks, LLMs require substantial computational resources to be trained or fine-tuned to gain domain- or company-specific knowledge, incorporate up-to-date information, and reduce the chance of hallucinations. With RAG, enterprises can benefit from the generative AI capabilities of LLMs without spending the time and cost required to train or fine-tune an LLM, while ensuring the generative model uses relevant and up-to-date information to produce higher quality content that aligns with their specific business policies and priorities.
In addition to the intrinsic difficulty LLMs have in generating high-quality content for prompts that extend beyond their training data, their large size and compute-intensive nature make inference challenging in scenarios with low latency requirements. The OpenVINO toolkit enables enterprises to take advantage of the AI acceleration features of 5th Gen Intel Xeon Scalable processors to optimize embedding models and LLMs, both of which are critical to RAG deployments.
Dell and Intel conducted LLM-based RAG performance testing using the latest Dell PowerEdge R760 [3] with 5th Gen Intel Xeon Scalable processors. This platform supports the higher memory bandwidth requirements of demanding generative AI tasks through DDR5-5600 MT/s modules. It also supports eight hot-plug 2.5”/3.5” HDD/SSD drives and offers Value SAS (vSAS) SSD support, which expands the number of storage options available at an affordable and performant SSD tier. These drives can be configured with the Dell rack solution for small businesses that require a scalable server optimized for enterprise-class workloads.
The Dell PowerEdge R760 with 5th Gen Intel Xeon Scalable processors further helps enterprises execute on their goals for AI-driven innovation, automation, sustainability, and a Zero-Trust approach to security. Together, the OpenVINO toolkit and the 5th Gen Intel Xeon Scalable processors offer multiple advanced CPU capabilities for AI, including:
- Built-in AI acceleration with Intel® Advanced Matrix Extensions (Intel AMX) in each core. Intel AMX supports BF16 for inferencing and fine-tuning, as well as INT8 quantization for inferencing tasks, which significantly reduces runtime and memory requirements.
- Intel® Advanced Vector Extensions 512 (Intel AVX-512) to boost the performance of non-deep-learning vector computations.
- Efficient model weight compression to INT8 and INT4 via the Optimum Python module and the Intel Neural Network Compression Framework (NNCF); a minimal sketch of this weight-compression flow is shown below.
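As a reference point for how this compression is typically driven, the snippet below sketches weight compression to INT4 with optimum-intel, which delegates the compression work to NNCF while exporting the model to OpenVINO IR. The model identifier, output directory, and compression settings are illustrative assumptions rather than the exact commands used in our tests.

```python
# Minimal sketch: compress an LLM's weights to INT4 (or INT8) with optimum-intel,
# which hands the compression to NNCF during export to OpenVINO IR.
# The model ID and output directory below are illustrative assumptions.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
output_dir = "llama-2-7b-chat-ov-int4"

quant_config = OVWeightQuantizationConfig(bits=4)   # use bits=8 for INT8 weight compression
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,                       # convert the PyTorch checkpoint to OpenVINO IR
    quantization_config=quant_config,  # NNCF compresses the weights during conversion
)
model.save_pretrained(output_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(output_dir)
```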
System Configuration Tested
In our tests we used a Dell PowerEdge R760 enterprise rack server with 5th Gen Intel Xeon Scalable processors and 512GB of memory. Tables 1, 2, and 3 describe the Hardware platform, Software Configuration, and Workload Configuration used in our tests, respectively.
Table 1. Hardware Configuration
System | Dell Inc PowerEdge R760 |
Chassis | Rack Mount Chassis |
CPU Model | INTEL(R) XEON(R) PLATINUM 8562Y+ |
Microarchitecture | EMR_MCC |
Sockets | 2 |
Cores per Socket | 32 |
Hyperthreading | Enabled |
CPUs | 128 |
Intel Turbo Boost | Enabled |
Base Frequency | 2.8GHz |
All-core Maximum Frequency | 3.8GHz |
Maximum Frequency | 4.1GHz |
NUMA Nodes | 2 |
Prefetchers | L2 HW: Enabled, L2 Adj.: Enabled, DCU HW: Enabled, DCU IP: Enabled, AMP: Disabled, Homeless: Enabled, LLC: Enabled |
Accelerators | DLB 2 [0], DSA 2 [0], IAA 2 [0], QAT 2 [0] |
Installed Memory | 512GB (16x32GB DDR5 5600 MT/s [5600 MT/s]) |
Hugepagesize | 2048 kB |
Transparent Huge Pages | always |
Automatic NUMA Balancing | Disabled |
NIC | 2x Ethernet Controller X710 for 10GBASE-T for SFP, 2x Ethernet Controller E810-C for QSFP |
Disk | 4 x 3.5 TB SAMSUNG MZQL23T8HCLS-00B7C, 1 x 894.3 GB INTEL SSDSC2KG96 |
BIOS | 2.0.0 |
Microcode | 0x21000200 |
TDP | 300 watts |
Power & Perf Policy | Performance (0) |
Frequency Governor | performance |
Frequency Driver | intel_pstate |
Max C-State | 9 |
Table 2. Software Configuration
OS | Red Hat Enterprise Linux 8.9 |
Kernel | 4.18.0-513.24.1.el8_9.x86_64 |
Orchestration | RedHat OpenShift, Kubernetes v1.27.8 |
Framework | openvino-nightly 2024.2.0.dev20240412, langchain 0.1.17, langchain-community 0.0.37 |
Other Software | nncf 2.9.0, optimum-intel 1.17.0.dev0+0540b12, qdrant-client 1.8.2, transformers 4.38.2, Python 3.10.14, Datasets 2.18.0, Wheel 0.43.0, Git 2.39.3, GCC 8.5.0, Findutils 4.6.0, Sentencepiece 0.2.0, sentence-transformers 2.5.1, reader 3.12 |
Table 3. Workload Configuration
LLM Model | Meta Llama-2-7b-hf |
Embedding Model | all-mpnet-base-v2 |
Rerank Model | bge-reranker-base |
Batch Size | 1 |
Beam Width | 1 (greedy search) |
Datasets | rag-mini-wikipedia, Wikipedia dump (800 MB) |
Max Output Token Size | 256 |
Precision | BFloat16, INT8, INT4 |
Input data processing | Langchain's RecursiveCharacterTextSplitter with chunk size 1000 and chunk overlap 100 |
Vector DB distance metric | Cosine similarity |
Retriever top documents | 10 |
Reranker top documents | 3 |
Vector DB - number of edges per node in the index graph (m) | 16 |
Vector DB - Number of neighbors to consider during the index building (ef) | 11 |
Performance Results
For our tests, we used Langchain to implement a RAG question answering (QA) pipeline. We used the Llama-2-7B-chat LLM to generate human-like responses grounded in documents retrieved from a Qdrant vector database. The use of OpenVINO in this application is based on the LLM Chatbot demo notebook created by the OpenVINO team [4].
Before building the vector database, more than 300,000 articles were extracted from an XML dump of the English-language Wikipedia database and converted into multiple 1MB plain-text files that include only each article's title, main content, and some basic metadata. More than one million documents were then created from the plain-text files using Langchain's text splitters [5] and stored in the Qdrant vector database. Figure 1 shows a diagram of the RAG QA pipeline implemented for our tests.
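As an illustration of this database-build step, the sketch below chunks the plain-text files with Langchain's RecursiveCharacterTextSplitter (chunk size 1000, overlap 100, as in Table 3) and stores the embeddings in a Qdrant collection. The file paths, collection name, Qdrant URL, and the OpenVINOEmbeddings settings are assumptions, and the snippet is a sketch of the approach rather than the exact script used in our tests.

```python
# Minimal sketch: split the Wikipedia plain-text files into ~1,000-character chunks with
# 100-character overlap (Table 3) and store their embeddings in a Qdrant collection.
# Paths, URLs, and the collection name are illustrative assumptions.
from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OpenVINOEmbeddings
from langchain_community.vectorstores import Qdrant

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = [p.read_text(encoding="utf-8") for p in Path("wiki_txt").glob("*.txt")]
docs = splitter.create_documents(texts)   # over a million chunks for the full extract

# Embedding model run through OpenVINO on the Xeon CPU
embeddings = OpenVINOEmbeddings(
    model_name_or_path="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={"device": "CPU"},
)

vector_db = Qdrant.from_documents(
    docs,
    embeddings,
    url="http://localhost:6333",     # running Qdrant server
    collection_name="wikipedia_rag",
    distance_func="Cosine",          # cosine similarity, as in Table 3
)
```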
For the LLM input prompt, a subset of eight questions was selected from the rag-mini-wikipedia dataset to be used as queries. The rag-mini-wikipedia dataset is a convenient choice for testing RAG-based QA systems because it provides A) a text corpus split into passages, each uniquely identified by a numeric ID, and B) a list of questions, each with a corresponding short answer and the IDs of the passages most relevant to it.
Figure 1. RAG pipeline implemented for the QA system
In addition to storing the vector embeddings of the articles, Qdrant was used to retrieve, for each of the eight input questions, the 10 documents with the greatest similarity to the query. The BAAI/bge-reranker-base model was then used to refine these results and select the 3 most relevant of the 10 retrieved documents, which, combined with the input question, compose the final LLM input prompt. Each question was run independently through the RAG pipeline, and the first-token and next-token latencies were recorded. The next-token latency is a key metric in text-generation workloads and is generally considered acceptable if kept under 100 ms, based on a typical human reading speed of 5-10 words per second. For each query we also recorded the average end-to-end latency, which includes the information retrieval and text generation steps.
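One possible implementation of this retrieve-then-rerank step is sketched below. It reuses the vector_db store from the previous sketch, scores candidates with the sentence-transformers CrossEncoder interface to BAAI/bge-reranker-base, and assembles an illustrative prompt template; these details are assumptions rather than the exact code used in our tests.

```python
# Minimal sketch: retrieve the top-10 chunks from Qdrant, rescore them with the
# BAAI/bge-reranker-base cross-encoder, and keep the 3 best for the LLM prompt.
# `vector_db` is the Qdrant store built in the previous sketch; the prompt template is illustrative.
from sentence_transformers import CrossEncoder

retriever = vector_db.as_retriever(search_kwargs={"k": 10})  # top-10 by cosine similarity
reranker = CrossEncoder("BAAI/bge-reranker-base")

def build_prompt(question: str) -> str:
    candidates = retriever.get_relevant_documents(question)
    scores = reranker.predict([(question, d.page_content) for d in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    context = "\n\n".join(d.page_content for _, d in ranked[:3])  # 3 most relevant documents
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```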
Figure 2 shows the inference performance speedup obtained with OpenVINO optimizations using BFloat16, INT8, and INT4, compared to PyTorch. The OpenVINO inference is accelerated by the built-in AI acceleration (Intel AMX) of the 5th Gen Intel Xeon Scalable processor, together with the OpenVINO toolkit, Optimum [6], and Intel's NNCF [7].
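For completeness, the sketch below shows one way a compressed OpenVINO model can be served through a standard transformers text-generation pipeline and wrapped for Langchain. The directory name carries over from the earlier weight-compression sketch and the generation settings mirror Table 3; all of it is an illustrative assumption rather than our exact harness.

```python
# Minimal sketch: load the OpenVINO-compressed model and expose it to Langchain
# through a transformers text-generation pipeline. The directory name matches the
# earlier weight-compression sketch and is an illustrative assumption.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline
from langchain_community.llms import HuggingFacePipeline

model_dir = "llama-2-7b-chat-ov-int4"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
ov_model = OVModelForCausalLM.from_pretrained(model_dir, device="CPU")

generator = pipeline(
    "text-generation",
    model=ov_model,
    tokenizer=tokenizer,
    max_new_tokens=256,   # max output token size from Table 3
    do_sample=False,      # greedy search (beam width 1)
)
llm = HuggingFacePipeline(pipeline=generator)  # drop-in LLM for the Langchain RAG chain
```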
The following key observations can be made from Figure 2:
- Using OpenVINO keeps the next-token latency well under the typical requirement for LLM-based QA systems (under 100 ms) [8].
- By using OpenVINO, a speedup of up to 1.9x can be obtained compared to PyTorch when using BF16.
- Additional speedups of 1.4x and 1.9x can be obtained with OpenVINO when INT8 and INT4 quantizations are used, respectively.
Figure 2. Next-token latency in a RAG pipeline using OpenVINO with 5th Gen Intel Xeon Scalable processors
In our tests, the first-token latency, which is typically higher and more compute intensive than the next-token latency, is always less than 0.8 seconds when using OpenVINO; in contrast, PyTorch results in an average first-token latency of 1.8 seconds. Figure 3 shows that a 2x speedup can be obtained when using OpenVINO to build the vector database compared to the time required with PyTorch. Figure 4 shows the normalized end-to-end runtimes of the RAG pipeline, including the retrieval, reranking, and text generation times. A speedup of 1.6x is obtained on average when using OpenVINO BF16 compared to PyTorch, with an additional 1.2x and 1.3x when using INT8 and INT4 model quantization, respectively.
Figure 3. Normalized time required to build vector DB with more than a million documents extracted from Wikipedia
Figure 4. Normalized average end-to-end runtime of the QA system. This includes the average time required to run through the full RAG pipeline: retrieval, reranking, and text generation. Different queries and data types may produce different numbers of output tokens; nevertheless, in our tests OpenVINO is on average at least 1.6x faster than PyTorch.
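For readers who want to reproduce these metrics, first- and next-token latencies can be approximated by timestamping the output of a transformers streamer, as sketched below. The timing logic, and the reuse of the tokenizer and ov_model objects from the earlier sketch, are assumptions about one possible measurement harness rather than the exact code behind Figures 2 through 4.

```python
# Minimal sketch: approximate first-token (TTFT) and next-token (TPOT) latencies by
# timestamping each chunk produced by a transformers streamer. `tokenizer` and `ov_model`
# come from the earlier sketch; each streamed chunk is treated as roughly one token.
import time
from threading import Thread

from transformers import TextIteratorStreamer

def measure_latencies(prompt: str, max_new_tokens: int = 256):
    inputs = tokenizer(prompt, return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    worker = Thread(
        target=ov_model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens, do_sample=False),
    )
    start = time.perf_counter()
    worker.start()
    stamps = [time.perf_counter() - start for _ in streamer]  # one timestamp per streamed chunk
    worker.join()
    first_token_latency = stamps[0]
    next_token_latency = (stamps[-1] - stamps[0]) / max(len(stamps) - 1, 1)
    return first_token_latency, next_token_latency
```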
Terminology
Table 4. Results terminology explained.
Term | Description |
Retrieval Augmented Generation (RAG) | A set of techniques used in generative AI to reference specific knowledge databases from which relevant context is extracted in order to generate an output. |
Embedding | Vector embeddings are representations of data as numerical arrays. |
LLM | Deep learning models that are trained on large amounts of data and that can recognize and generate human-like text. |
Vector Database | A database engine that specializes in the efficient storage, indexing, and search of numerical vectors. |
Quantization | The use of lower-precision data types to reduce the computational and memory requirements of AI models. |
First-Token Latency | First-token latency, also called Time to First Token (TTFT), refers to the time it takes the LLM to process the user’s input prompt and generate the first output token. |
Next-Token Latency | Next-token latency, also called Time per Output Token (TPOT) or inter-token latency (ITL), refers to the time it takes, on average, for a language model to generate each output token after the first. |
Conclusion
In this document, we demonstrated the effectiveness of OpenVINO in taking advantage of the AI acceleration features of 5th Gen Intel Xeon Scalable processors applied to an LLM-based RAG pipeline. Our results show that by using OpenVINO on 5th Gen Intel Xeon Scalable processors, the following performance improvements can be achieved compared to PyTorch:
- Approximately 2x speedup in next-token latency using BF16 and an additional 1.9x speedup with model quantizations.
- More than 2.3x speedup in first-token latency.
- 2x faster vector database construction with over a million documents.
These performance improvements can be easily incorporated into RAG pipelines implemented with Langchain and OpenVINO, enabling enterprises to build scalable AI solutions that address most of the shortcomings of LLMs, especially in terms of output quality and response time, by leveraging advanced information retrieval techniques and the AI acceleration capabilities of 5th Gen Intel Xeon Scalable processors. The RAG pipeline itself can be further improved by introducing additional elements for query rewriting and document summarization.
With the rapid adoption of GenAI, developers are faced with a dizzying array of options. For those wanting to explore more, we also recommend checking out the Open Platform for Enterprise AI (OPEA).
“OPEA with the support of the broader community, will address critical pain points of RAG adoption and scale today. It will also define a platform for the next phases of developer innovation that harnesses the potential value generative AI can bring to enterprises and all our lives,” said Melissa Evers, vice president of the Software and Advanced Technology Group and general manager of Strategy to Execution at Intel Corporation.
References
[1] "Intel distribution of OpenVINO toolkit," [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html. [Accessed 15 July 2024].
[2] "5th Gen Intel Xeon Scalable Processors," [Online]. Available: https://www.intel.com/content/www/us/en/products/docs/processors/xeon/5th-gen-xeon-scalable-processors.html. [Accessed 1 July 2024].
[3] "Dell PowerEdge R760 Rack Server," [Online]. Available: https://www.dell.com/en-us/shop/dell-poweredge-servers/new-poweredge-r760-rack-server/spd/poweredge-r760/pe_r760_15724_vi_vp.
[4] "Create an LLM-powered Chatbot using OpenVINO," 15 June 2024. [Online]. Available: https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/llm-chatbot/llm-chatbot.ipynb.
[5] "Text Splitters | Langchain," [Online]. Available: https://js.langchain.com/v0.1/docs/modules/data_connection/document_transformers/.
[6] "Optimum," [Online]. Available: https://huggingface.co/docs/optimum/en/index.
[7] "Optimum," [Online]. Available: https://huggingface.co/docs/optimum/en/index.
[8] "Optimum," [Online]. Available: https://huggingface.co/docs/optimum/en/index.
[9] "openvinotoolkit/nncf: Neural Network Compression Framework for enhanced OpenVINO inference," [Online]. Available: https://github.com/openvinotoolkit/nncf.
[10] "Driving GenAI Advancements: Dell PowerEdge R760 with the Latest 5th Gen Intel Xeon Scalable Processors," April 2024. [Online]. Available: https://infohub.delltechnologies.com/en-us/t/driving-genai-advancements-dell-poweredge-r760-with-the-latest-5th-gen-intel-r-xeon-r-scalable-processors/.