Evals for Chatbots

With services like chatjet.app, a chatbot is quickly set up. It works on your own knowledge base and is often already better than no chatbot at all. It answers your customers' questions, shows you which questions they ask in the first place, and can start a dialog with prospects.

The real potential is uncovered with evaluations, or evals for short. Evals are a structured method to analyze errors, fix them, and set up automated tools such as an LLM-as-a-judge to continuously monitor for errors. Chatbots that have gone through one or a few eval iterations are much better than ad-hoc chatbots: errors are reduced, the conversation can be steered to meet your requirements, and quality drift is detected quickly. Evals are also a foundation for continuous improvement. Try a new model? Your evals will tell you whether the chatbot gets better or worse.
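To make this concrete, here is a minimal sketch of such an automated check over chatbot traces. It assumes an OpenAI-compatible Python client; the judge prompt, the model name, and the trace format are illustrative examples and not part of chatjet.app.

```python
# Minimal sketch of an LLM-as-a-judge check over chatbot traces.
# Judge prompt, model name, and trace format are hypothetical.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a support chatbot answer.
Question: {question}
Answer: {answer}
Reply with exactly PASS if the answer is grounded and helpful, otherwise FAIL."""

def judge(trace: dict) -> bool:
    """Return True if the judge marks the trace as passing."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(**trace)}],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")

# Example traces: question/answer pairs pulled from chatbot logs.
traces = [
    {"question": "Do you offer a free trial?", "answer": "Yes, 14 days."},
    {"question": "Can I export my data?", "answer": "I don't know."},
]

pass_rate = sum(judge(t) for t in traces) / len(traces)
print(f"pass rate: {pass_rate:.0%}")  # re-run after a model change to spot drift
```

Tracking a single pass rate like this over time is what makes "did the new model make things better or worse?" an answerable question rather than a gut feeling.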

This guide is a brief walk-through of evals for chatbots. It uses chatjet.app-based chatbots as an example, but it applies to other chatbots as well.

Steps:

  1. Error Analysis Part A - Open Coding
  2. Error Analysis Part B - Axial Coding (a minimal sketch of steps 1 and 2 follows this list)
  3. Bootstrapping synthetic traces for an LLM-as-a-judge (not needed if you have real traces)
  4. Creating and aligning an LLM-as-a-judge
  5. Setting up monitoring with evaluation tools
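As a preview of steps 1 and 2, the sketch below shows the difference between open codes (free-form notes written while reading each trace) and axial codes (a small set of recurring error categories derived from those notes). The traces, notes, and category names are made-up examples.

```python
# Minimal sketch of steps 1-2: open coding, then axial coding.
# Traces, notes, and categories are illustrative, not chatjet.app output.
from collections import Counter

# Step 1 (open coding): read each trace and write a free-form note
# describing the first error you see.
open_codes = {
    "trace-001": "quoted an outdated price from the old pricing page",
    "trace-002": "answered in English although the user wrote in German",
    "trace-003": "invented a refund policy that does not exist",
}

# Step 2 (axial coding): group the free-form notes into a small set
# of recurring error categories.
axial_codes = {
    "trace-001": "stale knowledge",
    "trace-002": "wrong language",
    "trace-003": "hallucinated policy",
}

# Counting the categories shows where to focus the first fixes.
print(Counter(axial_codes.values()).most_common())
```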

This guide is based on the evals courses by Hamel and Shreya.