The SOAS team derived the following best practices from their analysis:
- The accuracy of NLP tasks can be substantially increased, from 5 percent to 35 percent in some cases, by using agentic frameworks. SLMs or medium-scale LLMs perform better in multi-agent settings with strategies such as role play. Specific prompt engineering techniques and agentic design patterns, such as CAMEL, MAD, ReConcile, ReAct, and Self-Reflection, have emerged and can improve performance when implemented in agentic workflows. For more information, see Appendix: Advanced Prompting Techniques for LLM Agents. Further, with careful agentic workflow engineering, hallucination can be minimized by imposing Critic roles as guardrails for scoping, proofreading, and verifying content. The "self-consistency" implementation replicates a particular role in parallel and assigns each replica the same task to complete independently. Another role then reviews and analyzes the independent results to extract consistent insights, such as a majority-vote response. Because the calls to the SLMs run in parallel, this implementation also saves inference time. Although hallucination can be controlled with proper workflow engineering, when LLMs do hallucinate, even if it happens rarely, the error can cascade through the workflow and degrade the final outputs.
- Our findings are consistent with academic literature, which suggests that rather than using long-context LLMs, breaking down bigger tasks into smaller tasks and incorporating agentic role-play and actor-critic strategies gives better and more consistent results. Further, breaking down bigger tasks allows for the independent enhancement of each step. For example, in the business case described here, steps such as the text summarization prompt or T1 label selection can be improved independently of other steps. LangGraph facilitates such individual workflow step enhancements.
- Most agentic frameworks are tested or built around mainstream commercial high-end LLMs, such as GPT-4 variants, Anthropic Claude 3.5 Sonnet, or Gemini 1.5 Pro. While some work has been done toward supporting open-source models and SLMs, these models often produce unexpected results. High-end LLMs are therefore better suited to get the best results with agentic workflows. Further, using high-end LLMs may eliminate several workflow steps that are necessary when using SLMs. This is because high-end LLMs exhibit better reasoning capabilities and can follow instructions, produce JSON consistently, and perform function calling, including the ability to call multiple functions in parallel. Depending on the compute infrastructure, high-end LLMs may also offer faster runtime.
- SLMs, especially those that this paper describes, are prone to hallucinations, particularly with long text, and may ignore or fail to follow instructions. During role-play discussions, the SOAS team observed numerous instances of SLMs engaging in conversations that were relevant to the context but not meaningful to the task, such as models thanking one another for feedback. These exchanges ultimately rendered the role-play sessions inefficient. Further, SLMs produce JSON output inconsistently and may lack tool support and function calling abilities. Another issue with SLMs and open-source LLMs in general is that different LLMs may have their own specific set of instruction tokens for use in prompts. While libraries such as Ollama and LangChain abstract these LLM-specific prompt templates or instruction tokens, such as phi3's <|user|>..<|assistant|>, other hosted API endpoints may not support such libraries, making it difficult to swap LLMs for experimentation. In such cases, custom tools must be implemented that keep the main prompt content intact and add LLM-specific token wrappers around it so that different LLMs can be tried with ease. In either case, decoupling prompts from the main Python notebook or code is useful because prompts can be organized neatly in a separate folder and versioned there. Combined with the differences among LLMs in their ability to follow instructions, produce JSON, and reason, these considerations make it generally difficult to swap LLMs without changing prompts.
- Traditional ML and deep learning techniques usually include a hyperparameter tuning stage, which generally takes considerable experimentation time. Similarly, implementing an agentic workflow has a large exploration space that involves:
  - Identifying and defining roles, prompts associated with roles, and workflows or role interaction patterns
  - Using different LLMs/SLMs. Some LLMs/SLMs are better for some tasks than others and have varying 'intelligence' and reasoning capabilities.
  - Tool engineering
- Multi-agent workflows and agent role-play interactions make it difficult to evaluate the workflow and to trace where something went wrong. Doing so means sifting through long chat-like conversations, and it may be necessary to use LLMs to analyze them. Role-play interactions also take considerable time, so even after an issue is resolved, testing the fix is slow because the entire flow must finish before the result can be checked.
- Tool development for agent access requires careful thinking and planning. Most tools necessitate retrieving custom data and presenting it to the LLM role for subsequent analysis and tasks. Key aspects when developing agent tools include:
  - Making outputs that agents can readily ingest: For a web search agent, a tool that only retrieves URLs containing relevant results is not particularly useful. Ideally, the tool would retrieve relevant text chunks from the web for the query so that the agent can readily use them. Therefore, when designing tools, careful thought is required regarding what information (data or metadata) must be presented to the LLM and in which format.
  - Making tool outputs consistent: If the final agentic application requires consistency of output (that is, the same output for the same input), tools must be developed so that the same query or input data always produces the same output.
  - Caching: If running a tool is a time-consuming process, caching tool input/output is highly beneficial to ensure consistent and faster responses. Semantic caching may come in handy for information retrieval tools.
  - Graceful failures: When a tool fails, it is imperative that the agentic workflow is told what happened so that the effect of the error does not cascade. Therefore, identifying failure cases and converting errors into helpful messages that the agent will not misinterpret is important to prevent undesirable effects.
  - Minimizing single points of failure: If, for example, the omission of certain results from a tool affects the rest of the agentic workflow, that can have an undesirable effect on the outcome. Implementing strategies to ensure that the tool retrieves all the important information is important. For instance, ensure that recall is high enough for the information retrieval system, allowing all candidate information to be retrieved so that the LLM can choose the most specific or relevant information. If the tool does not recall critical information to begin with, the rest of the workflow that uses the tool is not effective. When engineering the agent workflow, it is equally important to analyze each step carefully to determine whether it can become a single point of failure.
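Several of the tool-design aspects above (agent-ready output, consistent output, caching, and graceful failure) can be combined in a thin wrapper around a retrieval function. The following is a minimal sketch, not the SOAS team's implementation: `make_agent_tool`, `fake_search`, and the JSON field names are hypothetical, the cache is a plain exact-match dictionary rather than a semantic cache, and sorting the chunks assumes that their order carries no ranking information.

```python
import json
from typing import Callable, Dict, List


def make_agent_tool(fetch: Callable[[str], List[str]]) -> Callable[[str], str]:
    """Wrap a raw retrieval function (hypothetical) for use by an agent."""
    cache: Dict[str, str] = {}  # exact-match cache, keyed by normalized query

    def tool(query: str) -> str:
        # Normalize case and whitespace so equivalent queries share one entry.
        key = " ".join(query.lower().split())
        if key in cache:
            return cache[key]
        try:
            chunks = fetch(key)
        except Exception as exc:
            # Graceful failure: return a clear, structured message instead of
            # raising. Errors are not cached, so a later retry can succeed.
            return json.dumps({
                "status": "error",
                "results": [],
                "message": f"Retrieval failed ({exc}). No results are "
                           "available for this query; do not infer any.",
            })
        # Consistent output: de-duplicate and sort (assumes order is not a
        # relevance ranking), then serialize with stable key order.
        payload = json.dumps(
            {"status": "ok", "results": sorted(set(chunks))}, sort_keys=True
        )
        cache[key] = payload
        return payload

    return tool


# Stand-in for a real search backend, used only to exercise the wrapper.
calls: List[str] = []


def fake_search(query: str) -> List[str]:
    calls.append(query)
    if "fail" in query:
        raise TimeoutError("backend timed out")
    return ["chunk B", "chunk A"]


search_tool = make_agent_tool(fake_search)
first = search_tool("Power  Supply fault")
second = search_tool("power supply fault")  # served from the cache
failed = search_tool("fail now")            # structured error, not an exception
```

Returning text chunks inside a small, fixed JSON envelope gives the agent something it can use directly, and the explicit `status` field lets a downstream step distinguish "no results" from "the tool broke". The wrapper does not address recall; that still depends on the retrieval backend.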
- Lack of domain knowledge in SLMs/LLMs can lead to hallucination and confusion. For example, misinterpretation of domain-specific terminology can lead to undesirable effects.
- The generative and stochastic nature of LLMs means that outputs are prone to inconsistency issues. Additional tools, tool caching, output validation techniques, output templates, and guardrails must be implemented to ensure consistency.
- Allowing agents to write and run code without imposing any guardrails can have undesirable effects and introduce security risks.
- Make each workflow stage unit-testable: It is advisable to test individual workflow stages while the agentic workflow is being designed rather than only after the entire workflow has been implemented.
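The "self-consistency" pattern from the first practice, written as a stage that can be unit-tested in isolation, might look like the sketch below. It is an illustration under stated assumptions rather than the SOAS team's code: `self_consistent_label`, `ask_model`, and the prompt text are hypothetical, the reviewing role is replaced by a plain majority vote over normalized answers, and the model call is injected as a callable so that a stub can stand in for a real SLM endpoint during tests.

```python
import queue
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def self_consistent_label(
    text: str,
    ask_model: Callable[[str], str],  # hypothetical: prompt in, answer out
    n_samples: int = 5,
) -> Tuple[str, float]:
    """Ask the same role n_samples times in parallel, then majority-vote."""
    prompt = f"Classify the following text with a single label.\n\n{text}"
    # Parallel, independent calls: each replica sees the identical prompt.
    with ThreadPoolExecutor(max_workers=n_samples) as pool:
        answers: List[str] = list(pool.map(ask_model, [prompt] * n_samples))
    # Normalize case and whitespace so trivially different answers agree.
    votes = Counter(answer.strip().lower() for answer in answers)
    # On a tie, Counter returns the tied answer it encountered first.
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


# Stub model for testing: a thread-safe queue of canned answers.
canned: "queue.Queue[str]" = queue.Queue()
for reply in ["Billing", "billing ", "Shipping", "billing", "Returns"]:
    canned.put(reply)


def stub_model(prompt: str) -> str:
    return canned.get()


label, agreement = self_consistent_label(
    "Customer was charged twice.", stub_model, n_samples=5
)
```

The agreement ratio returned alongside the label is one possible signal for deciding when to escalate a low-consensus result to a Critic role or a human reviewer. Because the stage depends only on the injected callable, it can be tested without running the rest of the workflow.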