Workflow 3 evaluation
In Experiment 3, the team evaluated the final agentic workflow against a dataset of support case notes annotated by human experts. An initial manual assessment of 10 records revealed significant deviations from the human-generated ground truth labels. In an ML context, "ground truth" refers to information that is known to be accurate and reliable, typically because it is obtained through direct observation or measurement.
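The spot check amounts to comparing the workflow's predicted label with the expert label record by record. The minimal sketch below illustrates that comparison; the record structure and label strings are hypothetical, not taken from the team's dataset.

```python
# Hypothetical spot check: compare workflow predictions to human-expert
# ground truth labels and report exact-match agreement.
records = [
    {"id": 1, "predicted": "Hardware > Storage > Disk failure",
     "ground_truth": "Hardware > Storage > Disk failure"},
    {"id": 2, "predicted": "Software > OS > Boot failure",
     "ground_truth": "Software > Drivers > Boot failure"},
]

matches = sum(r["predicted"] == r["ground_truth"] for r in records)
print(f"Exact-match agreement: {matches}/{len(records)} records")
```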
These deviations prompted the team to postpone a larger automated evaluation. The decision was further influenced by the substantial runtime of the agentic workflow, which ranged from 5 to 10 minutes per record on local machines and varied with the number of technical issues present in the case notes.
To establish comparative baselines, the team evaluated the same text samples using both the GPT-4-Turbo and phi3:latest models with a single-prompt, few-shot learning approach. For more information, see What Is Few-Shot Learning?
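The following sketch shows what such a single-prompt, few-shot baseline can look like with the phi3:latest model served through Ollama; the prompt wording, example notes, and taxonomy labels are illustrative assumptions, not the team's actual prompt or data.

```python
# Hypothetical single-prompt, few-shot classification baseline using the
# ollama Python client; requires a local Ollama server with phi3 pulled.
import ollama

FEW_SHOT_PROMPT = """Classify the support case note into one category.

Note: "Server fails to POST after a memory upgrade."
Category: Hardware > Memory

Note: "Application crashes on login with error 0x80004005."
Category: Software > Application

Note: "{note}"
Category:"""

def classify(note: str) -> str:
    # One prompt per record: the few-shot examples above stand in for
    # fine-tuning or multi-step agentic reasoning.
    response = ollama.chat(
        model="phi3:latest",
        messages=[{"role": "user", "content": FEW_SHOT_PROMPT.format(note=note)}],
    )
    return response["message"]["content"].strip()

print(classify("RAID controller reports a degraded virtual disk."))
```

The same prompt can be sent to GPT-4-Turbo through the OpenAI chat completions API to produce the second baseline.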