Validation
The solution was validated from both functional and non-functional perspectives. The functional aspects were tested throughout the development stages and through acceptance testing by independent stakeholders, in accordance with DTS's test plan, which captures testing activities, data sources, and roles and responsibilities.
DTS's development test data consists of a diverse sample of 25 user queries spanning more than 20 products and their product families, representing historical user queries for the products most commonly encountered in technical troubleshooting. Subject Matter Experts (SMEs) determined the ground truth for each query, focusing on the relevance and effectiveness of the retrieved articles in resolving the user's query for the given product. This fully annotated data is used to evaluate the effectiveness of dynamic thresholding and to select the best embedding model for semantic similarity tasks, including article retrieval. It also supports metric evaluation, such as Precision, Recall, and F1-Score.
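Dynamic thresholding can be pictured as a similarity cutoff that adapts to the best-scoring article rather than using a single fixed value. The following is a minimal sketch, assuming cosine similarity over precomputed article embeddings; the `base_threshold` and `margin` parameters and the exact thresholding rule are illustrative assumptions, not DTS's published configuration:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, articles, base_threshold=0.70, margin=0.05):
    """Return (score, article_id) pairs that clear a dynamic threshold.

    The cutoff adapts to the best-scoring article: anything within
    `margin` of the top score (and above `base_threshold`) is kept,
    so strong result sets are not truncated and weak ones are filtered.
    """
    scored = [(cosine_similarity(query_vec, vec), aid)
              for aid, vec in articles.items()]
    if not scored:
        return []
    top = max(score for score, _ in scored)
    cutoff = max(base_threshold, top - margin)
    return sorted([(s, aid) for s, aid in scored if s >= cutoff],
                  reverse=True)
```

For example, a query embedding close to two of three articles would retrieve both high scorers and drop the outlier, whereas a fixed top-k cutoff might truncate or pad the result set.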
For acceptance testing, SMEs conducted comprehensive tests across 178 unique products and 393 unique symptoms. These tests were carried out on both the legacy system and the new system. The results were annotated to highlight instances where the new system either returned irrelevant results or failed to return relevant results that the legacy system retrieved successfully.
A metrics-based approach drove decision making in both the development and acceptance phases, with Precision, Recall, and F1-Score quantifying the effectiveness of the system. Recall is typically a costly metric to calculate for IR systems, because every document in the corpus must be annotated for each query; DTS mitigated this cost in two ways. For the development set, DTS reduced the document set and annotated only the filtered article set for each query. For acceptance testing, only the results returned by the legacy system and the new system were annotated as relevant or not. With this approach, DTS can measure precision on both test sets, "absolute" recall on the development test set, and recall "improvement" over the legacy system on the acceptance test set.
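Given the SME annotations, the three metrics are computed per query from the set of retrieved article IDs and the set of relevant article IDs. A minimal sketch; the set-based representation of the annotations is an assumption about how they are stored:

```python
def precision_recall_f1(retrieved, relevant):
    """Compute Precision, Recall, and F1-Score for one query.

    retrieved: set of article IDs the system returned.
    relevant:  set of article IDs annotated as relevant (on the
               development set this covers the filtered article pool,
               not the full corpus, so recall is "absolute" only with
               respect to that pool).
    """
    tp = len(retrieved & relevant)          # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, retrieving {A, B, C} when SMEs marked {A, C, D} as relevant yields two true positives, giving precision 2/3, recall 2/3, and F1 2/3. Averaging these per-query values over the test set produces the system-level figures.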
The new system improves retrieval precision from less than 40 percent up to 75 percent.