
Evaluating LLM Performance at Scale: A Guide to Building Automated LLM Evaluation Frameworks

Streamline the evaluation of complex LLM systems, including RAG-based models, using automated frameworks like Promptfoo, Ragas, and DeepEval. Ensure robust performance and address LLM-specific challenges.


Updated on: March 14, 2024

Introduction

It’s thrilling to harness the generative power of Large Language Models (LLMs) in real-world applications. However, they’re also known for creative, and sometimes hallucinated, responses. Once you have your LLMs in place, the questions arise: How well do they work for my specific needs? How much can we trust them? Are they safe to deploy in production and put in front of users?

Perhaps you are trying to build or integrate an automated evaluation system for your LLMs. In this blog post, we’ll explore how to add an evaluation framework to your system, which evaluation metrics fit different goals, and which open-source evaluation tools are available. By the end of this guide, you’ll know how to evaluate your LLMs and which of the latest open-source tools come in handy.

Note: This article discusses use cases including RAG-based chatbots. If you’re particularly interested in building a RAG-based chatbot, we recommend reading our previous post on Retrieval-Augmented Generation (RAG) first.

Why do we need LLM evaluation?

Imagine that you’ve built an LLM-based chatbot on your knowledge base in the healthcare or legal field. However, you’re hesitant to deploy it to production because its rapid response capability, while impressive, comes with drawbacks. The chatbot can respond to user queries 24/7 and generate answers almost instantly, but there’s a lingering concern: it sometimes fails to address questions directly, makes claims that don’t align with the facts, or adopts a negative tone toward users.

Or picture this scenario: you’ve developed a marketing analysis tool that can use any LLM, or you’ve researched various prompt engineering techniques. Now it’s time to wrap up the project by choosing the most promising approach among all the options. To support that choice, you need quantitative results for comparison rather than instinct.

One way to address this is through human feedback. ChatGPT, for example, uses reinforcement learning from human feedback (RLHF) to fine-tune the LLM based on human rankings. However, this is a labor-intensive process and is therefore hard to scale up and automate.

Alternatively, you can curate a production or synthetic dataset and adopt various evaluation metrics depending on your needs. You can even define your own grading rubric using code snippets or your own words. Simply put, given the question, the answer, and (optionally) the context, you can apply a deterministic metric or use an LLM to make judgements against user-defined criteria. As a fast, scalable, customizable, and cost-effective approach, it has garnered industry attention. In the next section, we’ll go over common evaluation metrics for LLMs in production use cases.

Evaluation Metrics

There are essentially two types of evaluation metrics: reference-based and reference-free. Conventional reference-based metrics usually compute a score by comparing the actual output with the ground truth (GT) at a token level. They’re deterministic, but they don’t always align with human judgements, according to a recent study (G-Eval: https://arxiv.org/abs/2303.16634). What’s more, GT answers aren’t always available in real-world datasets. In contrast, reference-free metrics don’t need GT answers and align better with human judgements. We’ll discuss both types but mainly focus on reference-free metrics, which tend to be more useful in production-level evaluations.

In this section, we will mention a few open-source LLM evaluation tools. Ragas, as its name suggests, is specifically designed for RAG-based LLM systems, whereas promptfoo and DeepEval support general LLM systems.

Reference-based Metrics

If you have the GT answers to your queries, you can use the reference-based metrics to provide different angles for evaluation. Here we will discuss a few popular reference-based metrics.

Answer Correctness

A straightforward way to measure correctness is the semantic similarity between the GT and generated answers. However, this alone doesn’t take factual correctness into account. Ragas therefore combines semantic similarity and factual correctness with a weighted average. More specifically, it uses an LLM to identify the true positives (TP), false positives (FP), and false negatives (FN) among the claims in the answers, then calculates an F1 score as the factual correctness. This way the metric takes both semantic similarity and factual correctness into consideration and provides a more reliable result.
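As a rough sketch of this idea (not Ragas’s exact implementation; the embed helper and the TP/FP/FN claim counts are assumed to come from your own embedding model and an upstream LLM judge):

import numpy as np

def answer_correctness(ground_truth: str, answer: str,
                       tp: int, fp: int, fn: int,
                       embed, w_factual: float = 0.75) -> float:
    """Weighted average of factual correctness (claim-level F1) and
    semantic similarity between the GT and the generated answer.

    embed(text) -> vector is an assumed embedding helper; tp/fp/fn are
    claim counts produced by an LLM judge. The 0.75/0.25 weighting is
    just an example, not a prescribed default."""
    f1 = tp / (tp + 0.5 * (fp + fn)) if tp else 0.0
    v_gt, v_ans = embed(ground_truth), embed(answer)
    sim = float(np.dot(v_gt, v_ans) /
                (np.linalg.norm(v_gt) * np.linalg.norm(v_ans)))
    return w_factual * f1 + (1.0 - w_factual) * sim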

Context Precision

This metric measures the retriever’s ability to rank relevant contexts highly. A common approach is to calculate a weighted cumulative precision, which gives higher importance to top-ranked contexts and can handle different levels of relevance.
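One common formulation averages precision@k over the relevant ranks, so hits near the top of the list are rewarded more. A minimal sketch, assuming an upstream judge has already assigned a 0/1 (or graded) relevance to each retrieved chunk in ranked order:

def context_precision(relevance: list[float]) -> float:
    """Weighted cumulative precision over ranked retrieved contexts.
    relevance[k] is the judged relevance of the chunk at rank k+1."""
    score, hits = 0.0, 0.0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        score += (hits / k) * rel  # precision@k, counted only at relevant ranks
    total_relevant = sum(relevance)
    return score / total_relevant if total_relevant else 0.0

print(context_precision([1, 0, 1]))  # ~0.83; the same two hits at ranks 2-3 score ~0.58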

Context Recall

This metric measures how much of the GT answer can be attributed to the retrieved context, or in other words, how much the retrieved context helps derive the answer. It can be computed with a simple formula: the number of GT sentences that can be ascribed to the context divided by the total number of GT sentences.
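A minimal sketch, assuming an LLM judge has already labeled each GT sentence as attributable to the retrieved context or not:

def context_recall(attributed: list[bool]) -> float:
    """Fraction of ground-truth sentences supported by the retrieved context."""
    return sum(attributed) / len(attributed) if attributed else 0.0

print(context_recall([True, True, False, True]))  # 0.75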

Reference-free Metrics

Answer Relevancy

One of the most common use cases of LLMs is question answering. The first thing we want to ensure is that our model directly answers the question and stays on topic. There are different ways to measure this. For example, Ragas uses an LLM to reverse-engineer possible questions from the answer generated by your model and calculates the cosine similarity between the generated questions and the actual question. The idea behind this method is that, given a clear and complete answer, we should be able to reconstruct the original question. DeepEval, on the other hand, calculates the percentage of relevant statements over all statements extracted from the answer.
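A rough sketch of the Ragas-style idea, where generate_questions and embed are assumed helpers backed by your own LLM and embedding model:

import numpy as np

def answer_relevancy(question: str, answer: str,
                     generate_questions, embed, n: int = 3) -> float:
    """Mean cosine similarity between the original question and n questions
    an LLM reverse-engineers from the answer."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    q_vec = embed(question)
    reconstructed = generate_questions(answer, n)  # e.g. "What question does this answer?"
    return float(np.mean([cos(q_vec, embed(q)) for q in reconstructed]))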

Faithfulness

Can I trust my model? LLMs are known for hallucination, so we might have “trust issues” when interacting with them. A common evaluation approach is to calculate the percentage of truthful claims over all claims extracted from the answer. You can use an LLM to determine whether a claim is truthful by checking that it doesn’t contradict any claim in the context, as DeepEval does, or, more strictly, by requiring that it can be inferred from the context, as Ragas does.
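A minimal sketch of that claim-checking loop, where the claims come from an upstream extraction step and judge is an assumed LLM call returning True when a claim is supported by the context:

def faithfulness(claims: list[str], context: str, judge) -> float:
    """Share of answer claims the judge considers supported by the context.
    Use a strict judge ("can this be inferred from the context?") or a looser
    one ("does this avoid contradicting the context?") depending on your needs."""
    if not claims:
        return 1.0
    verdicts = [judge(claim, context) for claim in claims]
    return sum(verdicts) / len(verdicts)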

Perplexity

This is a token-level deterministic metric that does not involve another LLM as a judge. It tells you how certain your model is about the generated answer; a lower score implies greater confidence in its prediction. Note that your model output must include the log probabilities of the output tokens, as they are used to compute the metric.
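Concretely, perplexity is the exponential of the negative mean token log probability. A minimal sketch, assuming you already have the log probabilities returned alongside the completion:

import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of the generated answer; lower means the model was more
    confident in the tokens it produced."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(perplexity([-0.11, -0.38, -0.05, -0.92]))  # ~1.44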

Toxicity

There are different ways to compute a toxicity score. You can use a classification model to detect the tone, or you can use an LLM to determine whether the answer is appropriate based on predefined criteria. For example, DeepEval’s built-in toxicity metric calculates the percentage of toxic opinions over all opinions extracted from the answer, while Ragas applies a majority-voting ensemble by prompting the LLM multiple times for its judgment.
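A minimal sketch of the majority-voting pattern, where judge is an assumed LLM call that returns "toxic" or "ok" and is sampled with temperature > 0 so the votes can differ:

from collections import Counter

def is_toxic_by_vote(answer: str, judge, n_votes: int = 5) -> bool:
    """Flag the answer as toxic only if most of the repeated judgments agree."""
    votes = Counter(judge(answer) for _ in range(n_votes))
    return votes["toxic"] > n_votes // 2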

RAG systems have become a popular choice in industry since LLMs are prone to hallucination. Therefore, in addition to the metrics above, we’d like to introduce a metric specifically designed for RAG. Note that two of the reference-based metrics above, context precision and context recall, are also RAG-specific.

Context Relevancy

Ideally, the retrieved context should contain just enough information to answer the question. We can use this metric to evaluate how much of the context is actually necessary and thus evaluate the quality of the RAG’s retriever. One way to measure it is the percentage of relevant sentences over all sentences in the retrieved context. The other way is a simple variation of this: the percentage of relevant statements over all statements in the retrieved context.

G-Eval

We just introduced eight popular evaluation metrics, but you might still have evaluation criteria for your own project that none of them covers. In that case, you can craft your own grading rubric and use it as part of the LLM evaluation prompt. This is the G-Eval framework: it uses large language models with chain-of-thought (CoT) reasoning and a form-filling paradigm to assess the quality of NLG outputs. You can either define the criteria at a high level or spell out exactly when a response earns or loses a point. You can make the prompt zero-shot by only stating the criteria, or few-shot by giving it a few examples. We usually ask for the score and rationale in a JSON format for further analysis. For example, G-Eval criteria for an LLM designed to write real-estate listing descriptions might look as follows.

High-level criteria prompt:

Check if the output is crafted with a professional tone suitable for the real-estate industry.

Specific criteria prompt:

    Grade the output by the following specifications, keeping track of the points scored and the reason why each point is earned or lost:

    Did the output include all information mentioned in the context? +1 point

    Did the output avoid red-flag words like 'expensive' and 'needs TLC'? +1 point

    Did the output convey an enticing tone? +1 point

    Calculate the score and provide the rationale. Pass the test only if it didn't lose any points. Output your response in the following JSON format:
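The JSON schema itself is whatever your downstream analysis expects; an illustrative format (our own choice, not one prescribed by any particular tool) might be:

    {"score": <0-3>, "rationale": "<why each point was earned or lost>", "pass": <true or false>}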

Open-Source Tools

There are many open-source LLM evaluation frameworks; here we compare a few of the most popular ones at the time of writing that automate the LLM evaluation process.

Promptfoo

Pros

  • Offers many customizable metrics; metrics can be defined with a Python or JavaScript script, a webhook, or in your own words
  • Offers a user interface where you can visualize results, override evaluation results, add comments from human feedback, and run new evaluations
  • Allows users to create shareable links to evaluation results
  • Allows users to extend existing evaluation datasets using LLMs

Cons

  • Testing and debugging can be non-trivial, as it is primarily a command-line package

Ragas

Pros 

  • Designed specifically for RAG systems
  • Generates synthetic test sets and evaluation datasets using an evolutionary generation paradigm
  • Integrates with various tools, including LlamaIndex and LangChain

Cons

  • No built-in UI, but allows users to visualize results using third-party plugins like Zeno
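As a rough sketch of what running a Ragas evaluation might look like (imports and column names have shifted between Ragas versions, and the judge LLM is picked up from your environment, so treat this as illustrative rather than copy-paste ready):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One row per test case; "contexts" holds the retrieved chunks for that question.
data = Dataset.from_dict({
    "question": ["What is the notice period for ending the lease?"],
    "answer": ["The lease requires 60 days' written notice."],
    "contexts": [["Tenants must provide written notice 60 days before moving out."]],
    "ground_truth": ["A 60-day written notice is required."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)              # aggregate score per metric
print(result.to_pandas())  # row-level scores for error analysis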

DeepEval

Pros 

  • Offers more built-in metrics than the others, e.g., summarization, bias, toxicity, and knowledge retention
  • Allows users to generate synthetic datasets and manage evaluation datasets
  • Integrates with LlamaIndex and Hugging Face
  • Allows real-time evaluation during fine-tuning, enabled by the Hugging Face integration
  • Compatible with pytest, so it can be seamlessly integrated into other workflows

Cons 

  • Visualization is not open-source
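A minimal sketch of a pytest-style DeepEval check (exact class and argument names may differ across DeepEval releases):

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the notice period for ending the lease?",
        actual_output="The lease requires 60 days' written notice.",
        retrieval_context=["Tenants must provide written notice 60 days before moving out."],
    )
    # Fails the test if the LLM-judged relevancy score falls below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

Because this is plain pytest, it slots into an existing CI pipeline; DeepEval also ships its own test runner for richer reporting.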

Conclusion

Evaluating LLM performance can be complex, as there is no universal solution; the right approach depends on your use case and test set. In this post, we introduced the general workflow for LLM evaluation and open-source tools with good visualization features. We also discussed popular metrics and open-source frameworks for both RAG-based and general LLM systems, which reduce the dependency on labor-intensive human feedback.

To get started with LLM evaluation frameworks like promptfoo, Ragas, and DeepEval, Shakudo integrates all of these tools, along with over 100 other data tools, as part of your data and AI stack. With Shakudo, you decide the best evaluation metrics for your use case, deploy your datasets and models in your cluster, run evaluations, and visualize results with ease.

Are you looking to leverage the latest and greatest in LLM technologies? Go from development to production in a flash with Shakudo: the integrated development and deployment environment for RAG, LLM, and data workflows. Schedule a call with a Shakudo expert to learn more!

References

G-Eval https://arxiv.org/abs/2303.16634

Promptfoo https://www.promptfoo.dev/docs/intro

DeepEval https://docs.confident-ai.com/docs/getting-started

Ragas https://docs.ragas.io/en/stable/index.html
