Brief evaluation and findings
The team made the following key observations from the experiments.
The SLM-based workflow agents demonstrated proficiency in identifying the individual symptoms mentioned in the text and classifying each with improved accuracy. However, human ground truth labels focused on more holistic root causes or primary request classifications. This discrepancy highlights the challenge SLMs face in replicating the domain expertise and business knowledge that are required to identify overarching themes with precision. Both the SLM-based agentic workflow and GPT-4-Turbo had difficulty identifying root causes or major themes and focused instead on individual symptoms. This shared difficulty suggests a common challenge in AI-based support classification systems: synthesizing multiple symptoms into a single core issue remains elusive.
The agentic workflow had difficulty with scenarios requiring a broader business understanding. For instance, in cases where the ground truth label indicated [fee based support]-[out of warranty]-[customer declined], the agents failed to extract the necessary cues from the case notes to determine the appropriate labels.
Agents consistently demonstrated difficulties in extending T1 and T1-T2 labels while adhering to formatting and consistency guidelines. This suggests that, even in agent and role-play settings, SLMs produce inferior results without the required domain knowledge and reasoning capabilities.
The phi3 model using the few-shot prompt approach underperformed, consistently producing hallucinated and incorrect labels. The GPT-4-Turbo single-prompt model demonstrated improved reasoning and better classifications, suggesting that advanced high-end models are more suitable for agentic workflows requiring complex decision-making steps.
Overall, with SLMs, the agentic workflow outperformed single-prompt approaches, which aligns with academic literature on the benefits of multi-agent systems.
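The formatting and consistency guidelines mentioned above are not spelled out in this section. As a minimal sketch, assuming the bracketed, hyphen-joined tier format seen in the examples that follow (one to three tiers such as [boot]-[no boot]), a format check on model output could look like the following. The names TIER, LABEL_RE, and parse_label are hypothetical and are not part of the team's workflow.

```python
import re

# Assumption: a label is one to three bracketed tiers joined by hyphens,
# e.g. "[boot]-[no boot]-[bios settings: boot]". This is inferred from the
# examples in this paper, not taken from a published specification.
TIER = r"\[([^\[\]]+)\]"
LABEL_RE = re.compile(rf"^{TIER}(?:-{TIER})?(?:-{TIER})?$")


def parse_label(label: str) -> tuple[str, ...]:
    """Return the normalized tiers of a well-formed label.

    Raises ValueError when the text does not follow the assumed format,
    which is one way to flag free-text or malformed model output.
    """
    match = LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"malformed label: {label!r}")
    # Unused optional tiers come back as None and are dropped.
    return tuple(t.strip().lower() for t in match.groups() if t is not None)
```

Under these assumptions, parse_label("[boot]-[no boot]") yields the two tiers "boot" and "no boot", while a free-text answer such as "Hardware failures" raises ValueError and can be routed back to the agent for correction.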
The following case note examples show the performance discrepancies between human ground truth, the agentic workflow, and baseline models.
Example 1: Cable Inquiry
Case note: “Customer received only three cables (USB, Mini-DP, power cable) and sought clarification about connections.”
Ground truth: [products and services]-[request for information]-[installation and usage queries]
Agentic workflow classification: [products and services]-[logistics]
Proposed new label: [{products and services}]-[{logistics}]-[{cable shipment}]
GPT-4-Turbo single-prompt based classification: [display (external)]-[configuration]-[connection]
phi3 single-prompt based classification: Hardware failures
This example highlights the workflow's focus on specific symptoms, namely cable receipt, rather than the overall nature of the inquiry, which was installation guidance.
Example 2: Out-of-Warranty Support
Case note: “Customer called regarding a no-boot issue, was informed of out-of-warranty status, and declined paid support options.”
Ground truth: [fee based support]-[out of warranty]-[customer declined]
Agentic workflow classifications: [boot]-[no boot], [products & services]-[returns]
Proposed new labels: [boot]-[no boot]-[incomplete os install/boot loop], [support call disconnected]-[{customer declined paid support options}]
phi3 single-prompt based classification: [power supply/adapter]-[faulty or not working]
GPT-4-Turbo single-prompt based classification: [boot]-[no boot]
This case demonstrates the workflow's inability to capture the business context of out-of-warranty support and customer decisions.
Example 3: SSD Boot Inquiry
Case note: “Customer installed a new SSD and sought guidance about ensuring the system boots from that SSD.”
Ground truth: [software]-[windows os]-[usage]
Agentic workflow classification: [hdd/ssd], [boot]-[no boot]
Proposed new labels: Multiple inconsistent labels, including "hardware issue", "software installation", and more.
GPT-4-Turbo single-prompt based classification: [boot]-[no boot]-[bios settings: boot]
phi3 single-prompt based classification: [system]-[boot issues]
This example showcases the challenge of correctly categorizing a software-related inquiry when hardware components are involved.
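One way to make the gaps in these examples measurable is to count how many leading tiers of a prediction agree with the ground truth. The sketch below is illustrative only: the metric and the helper names split_tiers and matched_depth are assumptions, not the evaluation method used by the team, and the labels are copied from Examples 1 and 2.

```python
def split_tiers(label: str) -> list[str]:
    """Split '[a]-[b]-[c]' into ['a', 'b', 'c'] (assumes the bracketed format)."""
    return [t.strip().lower() for t in label.strip().strip("[]").split("]-[")]


def matched_depth(truth: str, predicted: str) -> int:
    """Count leading tiers on which the prediction agrees with the ground truth."""
    depth = 0
    for truth_tier, predicted_tier in zip(split_tiers(truth), split_tiers(predicted)):
        if truth_tier != predicted_tier:
            break  # Deeper tiers are not credited once a parent tier differs.
        depth += 1
    return depth


# Labels taken from Examples 1 and 2 above.
truth_1 = "[products and services]-[request for information]-[installation and usage queries]"
truth_2 = "[fee based support]-[out of warranty]-[customer declined]"

agentic_1 = matched_depth(truth_1, "[products and services]-[logistics]")
gpt4_1 = matched_depth(truth_1, "[display (external)]-[configuration]-[connection]")
agentic_2 = matched_depth(truth_2, "[boot]-[no boot]")
```

With these inputs, agentic_1 is 1 because only the top tier matches, while gpt4_1 and agentic_2 are 0. A prefix-based score of this kind credits a partially correct hierarchy, which exact-match accuracy would not.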