Whereas the previous section described the conceptual architecture, this section describes the specific software modules and components used in the design.
UneeQ Digital Human Platform
The UneeQ Digital Human Platform is used in the digital assistant solution and implemented with an industry-leading software stack. UneeQ is an independent software vendor (ISV) specializing in immersive customer experiences. Its platform offers a simple interface and a reliable digital assistant that looks and sounds natural. Enterprises use the UneeQ portal and tools to create instances of digital assistants and conversation streams.
The UneeQ Digital Human Platform is built on Unreal Engine by Epic Games, a 3D graphics game engine, and adds the following distinct components, all delivered as images that can be deployed and managed using Kubernetes:
- Digital Human renderer—The renderer is the component that displays the details of the avatar representing the digital assistant. This component requires GPUs to achieve high quality.
- Digital Human variant options—The platform offers multiple digital assistant variant options, allowing users to choose a digital assistant that best suits their needs.
- Virtual background design capabilities—The platform provides virtual background design capabilities, enabling users to customize the digital environment in which the digital assistant operates.
- Persona creation—The platform allows persona creation, enabling users to define the personality and behavior of the digital assistant.
- Voice and language customization—The platform offers voice and language customization options, enabling the digital assistant to communicate in a way that aligns with the user’s preferences.
- Integrated NLP of choice—The platform integrates with the NLP system (see the description of the Conversational Pod below), enabling the digital assistant to understand and respond to user queries.
- Desktop and mobile browser support—The platform supports users interacting with digital assistants through the web browser on their desktop or mobile phone.
A core component of the UneeQ Digital Human Platform is the NVIDIA Audio2Face microservice, which animates 3D character facial characteristics to match any audio input and is part of NVIDIA AI Enterprise.
Dell Orchestration and Conversation Management
The Dell Orchestration and Conversation Management component is a subsystem of multiple software components that enables the secure definition and management of multiple tenants and users simultaneously. It is necessary because different tenants, or types of users, might be defined in the system, and each use case implemented by the solution might require access to a different combination of backend systems such as LLMs and IR systems. For example, an enterprise might choose to provide help desk support to its employees using a digital assistant while offering a digital assistant-enabled storefront to its customers. While the digital assistant avatar might be identical, different user experience (UX) elements might be enabled for each use case, and different backend systems might need to be accessed.
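To make the tenant-to-backend mapping concrete, the following minimal sketch shows one way such routing could be expressed. All names, endpoints, and profiles are hypothetical illustrations of the pattern described above, not part of the actual Dell subsystem:

```python
# Hypothetical sketch of per-tenant backend routing; all names and
# endpoints are illustrative, not the actual subsystem interface.
from dataclasses import dataclass

@dataclass
class BackendConfig:
    llm_endpoint: str        # LLM inference service for this tenant
    retrieval_endpoint: str  # IR system (for example, a retrieval collection)
    ux_profile: str          # UX elements enabled for this use case

# Each tenant (employee help desk, customer storefront, ...) maps to
# its own combination of backend systems and UX profile.
TENANT_BACKENDS = {
    "employee-helpdesk": BackendConfig(
        llm_endpoint="http://llm.internal:8000/v1",
        retrieval_endpoint="http://retrieval.internal/collections/it-docs",
        ux_profile="helpdesk",
    ),
    "customer-storefront": BackendConfig(
        llm_endpoint="http://llm.internal:8000/v1",
        retrieval_endpoint="http://retrieval.internal/collections/catalog",
        ux_profile="storefront",
    ),
}

def resolve_backends(tenant_id: str) -> BackendConfig:
    """Return the backend combination configured for a tenant."""
    try:
        return TENANT_BACKENDS[tenant_id]
    except KeyError:
        raise ValueError(f"Unknown tenant: {tenant_id}") from None
```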
It is usually necessary to integrate the digital assistant solution into an existing customer environment seamlessly and securely. In particular, the entry point of the solution is provided by this subsystem, which resides in its own Kubernetes namespace and provides multiple capabilities, including ingress control, user interface management, transaction orchestration, and conversation management.
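As a minimal illustration of the namespace isolation described above, the sketch below creates a dedicated namespace with the official kubernetes Python client. The namespace name is an assumption, and the sketch assumes cluster credentials in the default kubeconfig:

```python
# Minimal sketch: create the dedicated namespace for the orchestration
# subsystem using the official kubernetes Python client. The namespace
# name is illustrative; the real deployment would also create the
# ingress, services, and deployments inside it.
from kubernetes import client, config

config.load_kube_config()  # assumes credentials in ~/.kube/config
core = client.CoreV1Api()

namespace = client.V1Namespace(
    metadata=client.V1ObjectMeta(name="dell-orchestration")
)
core.create_namespace(body=namespace)
```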
Pryon Retrieval Engine
The Pryon Retrieval Engine is a commercial off-the-shelf information retrieval system. It uses several proprietary LLMs to represent ingested content so that the most relevant content can be found rapidly. Content is topically grouped into “collections,” which are queried individually (see the sketch after this list). Key Pryon characteristics include:
- High-precision ingestion is the starting point for accuracy. Pryon emulates human-like document analysis, employing proprietary OCR technology to extract text in reading order from images, graphics, schematics, and handwritten notes. It uses vision segmentation to identify and label key components, performs content normalization and filtering to remove unnecessary objects, and employs visual semantic segmentation to assemble smarter document chunks.
- Accurate retrieval begins with understanding the question. Pryon's retrieval process uses a proprietary combination of query expansion, out-of-domain detection, and query embedding to comprehensively understand natural language queries and match them against ingested content. Then, Pryon uses three proprietary models to quickly find and rank the best-matched content.
- Enterprise scale is key to sustained value. Pryon seamlessly integrates with major enterprise systems such as Microsoft SharePoint, Confluence, AWS S3, Google Drive, Zendesk, ServiceNow, Salesforce, and Documentum, enabling users to effortlessly construct a unified retrieval knowledge base without content consolidation.
- Maximum security is a prerequisite to AI adoption. Pryon Retrieval Engine maintains a document-level access control list (ACL) and can be implemented in on-premises and air-gapped environments. Pryon AI models do not train on enterprise data.
- Rapid time to value is necessary to get business buy-in. Pryon's prebuilt Ingestion, Retrieval, and Generative Engines make it possible to implement the digital assistant solution efficiently.
- A comprehensive admin console is required. The admin console provides the ability to carry out standard administration tasks, launch content ingestion, and define synonyms and validated answers to improve the guidance for the search. Query history, volume, and other metrics can also be viewed at a collection or system level.
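Pryon's actual API is not documented here; the hypothetical sketch below only illustrates the collection-scoped query pattern described above. The endpoint, payload fields, and authentication scheme are all assumptions:

```python
# Hypothetical sketch of a collection-scoped retrieval query. The URL,
# payload fields, and bearer-token auth are assumptions made for
# illustration; consult the Pryon API documentation for the real
# interface.
import requests

PRYON_BASE = "https://pryon.example.internal/api"  # hypothetical host
API_TOKEN = "..."                                  # issued per tenant

def query_collection(collection_id: str, question: str) -> list[dict]:
    """Query one collection and return ranked passages."""
    resp = requests.post(
        f"{PRYON_BASE}/collections/{collection_id}/query",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"query": question, "top_k": 5},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])
```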
Llama 3 LLM
True to its purpose as an on-premises solution, the digital assistant solution uses the Llama 3 model from Meta running on a Dell PowerEdge R760xa server with NVIDIA GPUs.
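The document does not mandate a particular inference stack; as one common way to serve Llama 3 across multiple NVIDIA GPUs, the sketch below uses the open-source vLLM library. The model ID and parallelism setting are assumptions:

```python
# Sketch: serve Meta's Llama 3 with the open-source vLLM library, one
# possible inference stack (the design does not specify vLLM). The
# model ID and tensor_parallel_size are assumptions; the latter should
# match the number of GPUs installed in the server.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,  # shard across four GPUs (assumed count)
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["How do I reset my VPN password?"], params)
print(outputs[0].outputs[0].text)
```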
AI models are typically optimized for performance before deployment to production. Optimized models offer faster inference speed, improved resource efficiency, and reduced latency, resulting in cost savings and better scalability. They can handle increased workloads, require fewer computational resources, and provide a better user experience with quick responses.
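Optimization techniques vary by deployment; purely as one illustrative example, the sketch below loads a 4-bit quantized Llama 3 with Hugging Face Transformers and bitsandbytes, trading a small amount of accuracy for lower memory use and latency. Nothing in this design states that this specific technique is the one applied:

```python
# Illustrative optimization only: 4-bit quantization with Hugging Face
# Transformers and bitsandbytes reduces GPU memory use and can improve
# throughput. The design does not specify which optimization is
# applied; TensorRT-LLM compilation is another common option.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed variant
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto",  # place layers across available GPUs
)
```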
To test the quality of LLM-generated responses, multiple methods are used (a scoring sketch follows the list):
- Using another LLM to check the response quality—Responses from GPT-3.5 Turbo and Llama 3 are ranked using ChatGPT.
- Answer relevancy—The answer relevancy metric measures the generated response's quality by evaluating how relevant the output of the LLM is compared to the provided input by considering retrieval context.
- Summarization—The summarization metric uses LLMs to determine whether the LLM is generating factually correct summaries while including the necessary details from the original text.
- Faithfulness—The faithfulness metric measures the quality of the LLM by evaluating whether the output aligns factually with the contents of the retrieved context.
- Contextual precision—The contextual precision metric measures the LLM by evaluating whether nodes in the retrieval context that are relevant to the given input are ranked higher than the irrelevant ones.
- Contextual relevancy—The contextual relevancy metric measures the output of the LLM by evaluating the overall relevance of the information in the retrieval context for a specific input.
- Contextual recall—The contextual recall metric measures the quality of the LLM by evaluating the extent to which the retrieval context aligns with the expected output.
- Hallucination—The hallucination metric determines whether the LLM generates factually correct information by comparing the output to the provided context.
- Bias—The bias metric determines whether the LLM output contains gender, racial, or political bias.
- Token cost—The token cost metric calculates the number of tokens consumed by LLM calls and the resulting cost.
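The metric names above match those of the open-source deepeval framework; assuming that (or a similar) harness is used, a single response can be scored as in the sketch below. The question, answer, and retrieval context are illustrative:

```python
# Sketch using the open-source deepeval framework, whose metric names
# match those listed above (the document does not name its harness).
# The inputs are illustrative; by default deepeval uses an OpenAI
# model as the judge, configurable via the metric's `model` argument.
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

case = LLMTestCase(
    input="How do I reset my VPN password?",
    actual_output="Open the self-service portal and choose 'Reset VPN password'.",
    retrieval_context=["VPN passwords are reset in the self-service portal."],
)

for metric in (AnswerRelevancyMetric(threshold=0.7),
               FaithfulnessMetric(threshold=0.7)):
    metric.measure(case)  # judge LLM scores the test case
    print(type(metric).__name__, metric.score, metric.reason)
```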
Other software components
Due to the considerable complexity of the solution, it is imperative to provide a single point of control, both from a security operations perspective and from an administration and management perspective.
In this validated design, the following components facilitate solution operations:
- Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It groups the containers that make up an application into logical units for easy management and discovery.
- Prometheus is an open-source systems monitoring and alerting toolkit. It collects and stores metrics as time series data; that is, metric information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. Prometheus can scrape metrics from a variety of sources, including applications, services, and infrastructure components.
- Grafana is a multiplatform open-source analytics and interactive visualization web application. It provides charts, graphs, and alerts for the web when connected to supported data sources. Grafana provides the tools to transform raw data into meaningful insights through rich, interactive dashboards.
Together, these three systems deliver full observability. Kubernetes provides the platform for running and managing applications. Prometheus, tightly integrated with Kubernetes, collects metrics and provides real-time monitoring. Grafana, in turn, uses the data collected by Prometheus to create visualizations, providing a user-friendly way to monitor the performance, reliability, and overall health of the Kubernetes-managed applications.
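As a concrete illustration of how a solution component exposes metrics for Prometheus to scrape, the sketch below uses the official prometheus_client library. The metric names, label, and port are assumptions:

```python
# Sketch: expose application metrics for Prometheus to scrape, using
# the official prometheus_client library. Metric names and the port
# are illustrative; a Kubernetes ServiceMonitor (or scrape config)
# would point Prometheus at this endpoint.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "assistant_requests_total",
    "Total digital assistant requests",
    ["tenant"],  # label: which tenant issued the request
)
LATENCY = Histogram(
    "assistant_response_seconds",
    "End-to-end response latency in seconds",
)

if __name__ == "__main__":
    start_http_server(8000)  # metrics at http://localhost:8000/metrics
    while True:
        with LATENCY.time():
            time.sleep(random.uniform(0.1, 0.5))  # stand-in for real work
        REQUESTS.labels(tenant="employee-helpdesk").inc()
```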
This validated design also uses Keycloak, which is an open-source Identity and Access Management (IAM) tool. It offers features like user federation, identity brokering, and social login. Keycloak simplifies the authentication process and eliminates the need to handle user storage or authentication. It also supports single sign-on and single sign-out. Keycloak can integrate with existing IAM solutions such as Microsoft Active Directory, Ping Identity, and Okta.
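To illustrate how a backend service might authenticate against Keycloak, the sketch below requests an access token through the standard OpenID Connect client-credentials flow. The hostname, realm, and client credentials are assumptions, and the URL path matches recent Keycloak releases:

```python
# Sketch: obtain an access token from Keycloak using the standard
# OpenID Connect client-credentials flow. Hostname, realm, and client
# credentials are illustrative; the /realms/... path matches recent
# Keycloak releases (older ones prefix it with /auth).
import requests

KEYCLOAK = "https://keycloak.example.internal"  # hypothetical host
REALM = "digital-assistant"                     # hypothetical realm

resp = requests.post(
    f"{KEYCLOAK}/realms/{REALM}/protocol/openid-connect/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "orchestration-service",   # hypothetical client
        "client_secret": "...",                 # stored in a secret
    },
    timeout=10,
)
resp.raise_for_status()
access_token = resp.json()["access_token"]  # sent as a Bearer token
```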