Brief evaluation and findings

Key observations

The team made the following key observations from the experiments.
Symptom identification versus root cause classification
The SLM-based workflow agents were proficient at identifying the individual symptoms mentioned in the text and classifying each one with improved accuracy. However, the human ground truth labels focused on more holistic root causes or primary request classifications. This discrepancy highlights the challenge that SLMs face in replicating the domain expertise and business knowledge required to identify overarching themes with precision. Both the SLM-based agentic workflow and GPT-4-Turbo had difficulty identifying root causes or major themes and focused instead on individual symptoms. This shared difficulty suggests a common challenge in AI-based support classification systems: synthesizing multiple symptoms into a single core issue remains elusive.
Lack of business context
The agentic workflow had difficulty with scenarios requiring a broader business understanding. For instance, in cases where the ground truth label indicated [fee based support]-[out of warranty]-[customer declined], the agents failed to extract the necessary cues from the case notes to determine the appropriate labels.
Label extension and consistency
Agents consistently demonstrated difficulties in extending T1 and T1-T2 labels while adhering to formatting and consistency guidelines. This suggests that, even in agent and role-play settings, SLMs produce inferior results without the required domain knowledge and reasoning capabilities.
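A formatting check of this kind can be sketched in a few lines. The following is a minimal illustration, not the validation the team used: it assumes that a well-formed label is one to three bracketed tiers joined by hyphens, a pattern inferred only from the labels shown in this article. The names `LABEL_RE` and `parse_label` are hypothetical.

```python
import re
from typing import List, Optional

# Assumed label shape, inferred from the examples in this article:
# one to three bracketed tiers joined by hyphens, e.g. [boot]-[no boot].
# Braces and nested brackets inside a tier are treated as malformed.
TIER = r"\[[^\[\]{}]+\]"
LABEL_RE = re.compile(rf"^{TIER}(?:-{TIER}){{0,2}}$")


def parse_label(label: str) -> Optional[List[str]]:
    """Return the normalized tier names, or None if the label is malformed."""
    label = label.strip()
    if not LABEL_RE.match(label):
        return None
    # Extract the text between each pair of brackets and normalize case.
    return [tier.strip().lower() for tier in re.findall(r"\[([^\[\]]+)\]", label)]


if __name__ == "__main__":
    for candidate in (
        "[fee based support]-[out of warranty]-[customer declined]",
        "[boot]-[no boot]",
        "Hardware failures",
    ):
        print(candidate, "->", parse_label(candidate))
```

A check like this flags only formatting problems, such as free-text output with no bracketed tiers. It says nothing about whether a well-formed label is the correct one.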
Comparative performance
The phi3 model using the few-shot prompt approach underperformed, consistently producing hallucinated and incorrect labels. The GPT-4-Turbo single-prompt model demonstrated improved reasoning and better classifications, suggesting that advanced high-end models are more suitable for agentic workflows requiring complex decision-making steps.
Overall, when using SLMs, the agentic workflow outperformed the single-prompt approach, which aligns with academic literature on the benefits of multi-agent systems.
Illustrative examples
The following case note examples show the performance discrepancies between human ground truth, the agentic workflow, and baseline models.
Example 1: Cable Inquiry
Case note: “Customer received only three cables (USB, Mini-DP, power cable) and sought clarification about connections.”
Ground truth: [products and services]-[request for information]-[installation and usage queries]
Agentic workflow classification: [products and services]-[logistics]
Proposed new label: [{products and services}-[{logistics}]-[{cable shipment}]]
GPT-4-Turbo single-prompt classification: [display (external)]-[configuration]-[connection]
phi3 single-prompt classification: Hardware failures
This example highlights the workflow's focus on specific symptoms, namely cable receipt, rather than the overall nature of the inquiry, which was installation guidance.
Example 2: Out-of-Warranty Support
Case note: “Customer called regarding a no-boot issue, was informed of out-of-warranty status, and declined paid support options.”
Ground truth: [fee based support]-[out of warranty]-[customer declined]
Agentic workflow classifications: [boot]-[no boot], [products & services]-[returns]
Proposed new labels: [boot]-[no boot]-[incomplete os install/boot loop], [support call disconnected]-[{customer declined paid support options}]
phi3 single-prompt classification: [power supply/adapter]-[faulty or not working]
GPT-4-Turbo single-prompt classification: [boot]-[no boot]
This case demonstrates the workflow's inability to capture the business context of out-of-warranty support and customer decisions.
Example 3: SSD Boot Inquiry
Case note: “Customer installed a new SSD and sought guidance about ensuring the system boots from that SSD.”
Ground truth: [software]-[windows os]-[usage]
Agentic workflow classification: [hdd/ssd], [boot]-[no boot]
Proposed new labels: Multiple inconsistent labels, including "hardware issue", "software installation", and more.
GPT-4-Turbo single-prompt classification: [boot]-[no boot]-[bios settings: boot]
phi3 single-prompt classification: [system]-[boot issues]
This example showcases the challenge of correctly categorizing a software-related inquiry when hardware components are involved.
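The gap between a predicted label and the ground truth in these examples can be made concrete by counting how many leading tiers agree. The following is a minimal sketch under the assumption that labels are bracketed tiers joined by hyphens, as in the examples above. The helpers `tiers` and `matching_depth` are hypothetical and aren't part of the evaluation that the team ran.

```python
import re
from typing import List


def tiers(label: str) -> List[str]:
    """Split a label such as [a]-[b]-[c] into normalized tier names."""
    return [t.strip().lower() for t in re.findall(r"\[([^\[\]]+)\]", label)]


def matching_depth(predicted: str, truth: str) -> int:
    """Count how many leading tiers of the prediction match the ground truth."""
    depth = 0
    for pred_tier, true_tier in zip(tiers(predicted), tiers(truth)):
        if pred_tier != true_tier:
            break
        depth += 1
    return depth


# Labels taken from Example 1 (Cable Inquiry).
truth = (
    "[products and services]-[request for information]"
    "-[installation and usage queries]"
)
print(matching_depth("[products and services]-[logistics]", truth))  # 1
print(matching_depth("[display (external)]-[configuration]-[connection]", truth))  # 0
print(matching_depth("Hardware failures", truth))  # 0
```

Exact matching on tier names is deliberately strict. A label that is semantically close to the ground truth but worded differently still scores zero, so this measure understates partial agreement.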