We’re building a multi-agent system similar to the drive-thru example. However, the user might ask a question that we need to answer before continuing with the flow. I’ve been loading the whole FAQ knowledge base into the agent context, which was fine at first, but it has grown to the point where the context is dominated by the FAQs (approximate token counts):
- base: 300
- agent: 1100
- FAQs: 3500
- other session context: 200
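A quick tally of the breakdown above makes the imbalance concrete (figures are the approximate counts from the list, not measured values):

```python
# Rough per-request token budget, using the approximate figures above.
budget = {"base": 300, "agent": 1100, "faqs": 3500, "other_session": 200}

total = sum(budget.values())
faq_share = budget["faqs"] / total

print(total)               # 5100 tokens per request
print(f"{faq_share:.0%}")  # FAQs are ~69% of the prompt
```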
This obviously hurts latency. The console shows replies can drop from the current 4-5 s to ~2 s (which still isn't fast) if the FAQs aren't always in the context. I'm considering several options for the case where the user asks a question the agent doesn't have the knowledge to answer:
1. A tool that calls an LLM that does have the FAQs loaded in context, and returns the reply to the user's question
2. A tool that calls an LLM that does embedding-based RAG. This would be worse in terms of accuracy, but better in latency. It is feasible right now with the small KB we have
3. A tool that creates a Task to spawn an agent that has the FAQs loaded in context
4. A tool that creates a Task to spawn an agent that has a tool to do RAG
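For the embedding-based option, the retrieval tool itself can be quite small. A minimal sketch, assuming a hypothetical `FAQS` list and using bag-of-words cosine similarity as a stand-in for a real embedding model:

```python
import math
from collections import Counter

# Hypothetical FAQ knowledge base; a real system would load this from storage.
FAQS = [
    ("What are your opening hours?", "We are open 8am-10pm every day."),
    ("Do you take card payments?", "Yes, all major cards are accepted."),
]

def _vec(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-count vector."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_faq(user_question: str, top_k: int = 1) -> list[str]:
    """Tool entry point: return the top-k FAQ answers for a question."""
    q = _vec(user_question)
    ranked = sorted(FAQS, key=lambda f: _cosine(q, _vec(f[0])), reverse=True)
    return [answer for _, answer in ranked[:top_k]]
```

With a KB this small, the retrieved answers could either be returned to the user directly or passed to a cheap LLM call for rephrasing; either way, only the matched snippets (not all 3500 tokens of FAQs) touch the main agent's context.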
Discarded:
- A tool that returns the whole FAQ set as a message - this would persist in the conversation history!
---
My intuition is that (3) would be best for accuracy (all FAQs in context), but latency would again be slow when answering questions. At least it would free the main agent on the standard path.
On the other hand, (4) looks faster but more complicated.
What's the best practice here? Is there an example of this? I see this, but it assumes that (a) the last turn requires RAG, and (b) the user message is already formatted in the best way to query RAG.
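One common way to address both (a) and (b) at once is to expose retrieval as a tool: the agent only invokes it on the turns that actually need the FAQs, and the model writes a standalone search query as the tool argument instead of the raw user message being fed to RAG. A sketch of such a tool definition, as a generic JSON-schema-style dict rather than any specific vendor's API:

```python
# Illustrative tool definition; the name "answer_faq" and the exact schema
# shape are assumptions, not a particular framework's required format.
answer_faq_tool = {
    "name": "answer_faq",
    "description": (
        "Look up the FAQ knowledge base. Call this only when the user asks "
        "a question you cannot answer from the current context."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": (
                    "A self-contained rephrasing of the user's question, "
                    "with any pronouns and references resolved."
                ),
            }
        },
        "required": ["query"],
    },
}
```

Because the model decides when to call the tool, concern (a) goes away, and because it generates the `query` argument itself, concern (b) becomes a prompt-quality problem (the `description` nudges it to rewrite the question) rather than a pipeline limitation.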