Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.


G-Eval is a framework designed to evaluate the outputs of large language models (LLMs) using the interpretive and reasoning capabilities of the models themselves. Introduced in the paper “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment”, G-Eval represents a significant advance in natural language generation (NLG) evaluation, producing scores that align more closely with human judgment than traditional metrics such as BLEU or ROUGE.

 

Applicable Models:

G-Eval has been tested extensively with GPT-3.5 and GPT-4 and can be extended to other LLMs with comparable capabilities, provided they support chain-of-thought (CoT) prompting and, for the optional score normalization described below, expose output token probabilities.
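The optional normalization step described under “Formulae” depends on the evaluator model exposing its output token probabilities. The snippet below is a minimal sketch of checking for that capability, assuming an OpenAI-compatible Python client; the model name and prompt are illustrative placeholders, not part of G-Eval itself.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any chat model that returns token log-probabilities
        messages=[{"role": "user", "content": "Rate the coherence of the following text from 1 to 5. Reply with a single digit. Text: <output to evaluate>"}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,  # probabilities for the top candidate score tokens
    )

    # Each entry pairs a candidate score token (e.g. "1".."5") with its log-probability.
    for candidate in response.choices[0].logprobs.content[0].top_logprobs:
        print(candidate.token, candidate.logprob)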

 

Background:

Conventional NLG evaluation metrics often fail to capture the semantic richness and contextual nuance of LLM outputs. G-Eval addresses this gap by using LLMs themselves as evaluators, combining chain-of-thought reasoning with a structured form-filling paradigm to deliver detailed, task-specific metrics. This approach has shown higher correlation with human evaluations than traditional metrics, making it a robust tool for assessing complex LLM tasks such as summarization and dialogue generation on criteria like coherence.

 

Formulae:

The G-Eval process can be summarized as follows:

1. Define Evaluation Task:

• Specify the evaluation criteria (e.g., coherence, relevance, factuality).

• Provide definitions for the criteria (e.g., “Coherence: the collective quality of all sentences in the actual output.”).

2. Generate Evaluation Steps:

• Use chain-of-thought reasoning to guide the LLM in breaking down the evaluation task into interpretable steps.

3. Construct the Prompt:

• Concatenate the evaluation criteria and generated steps with the task-specific inputs (e.g., the source text and the LLM-generated output to be evaluated).

• Request a score from the LLM based on a predefined scale (e.g., 1–5).

4. Optional Normalization:

• If raw token probabilities are available, refine the score by taking the probability-weighted sum of the candidate scores; this reduces the bias toward a few dominant integer scores and yields finer-grained values (see the formula and sketch below).
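The probability-weighted refinement in step 4 corresponds to the scoring rule reported in the G-Eval paper: given the candidate scores s_1, …, s_n on the rating scale (e.g., 1–5) and the probability p(s_i) that the evaluator LLM assigns to each score token, the final score is their expected value:

    \text{score} = \sum_{i=1}^{n} p(s_i)\, s_i

As a small, hypothetical illustration (the helper name and the probability values are invented for this example), the same computation in Python:

    # Hypothetical helper: probability-weighted G-Eval score from a mapping of
    # candidate scores to probabilities (e.g. derived from the evaluator's top token probabilities).
    def weighted_geval_score(score_probs: dict[int, float]) -> float:
        total = sum(score_probs.values())  # renormalise in case the probabilities are truncated
        return sum(s * p for s, p in score_probs.items()) / total

    # Example: most probability mass on 4, some on 3 and 5 -> score 4.0
    print(weighted_geval_score({3: 0.2, 4: 0.6, 5: 0.2}))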

 

Applications:

NLP Benchmarks: Evaluating outputs in text summarization, dialogue generation, and translation tasks.

Human-Machine Interaction: Assessing coherence, factual correctness, or semantic relevance of LLM outputs.

Research and Development: Implementing task-specific evaluation metrics in frameworks like DeepEval to benchmark LLMs.
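As an illustration of the DeepEval usage mentioned above, the sketch below defines a coherence metric and evaluates a single test case. The class and attribute names shown (GEval, LLMTestCase, LLMTestCaseParams, measure, score, reason) follow DeepEval's documented interface but may differ between library versions, and the inputs are placeholders.

    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams

    # Define a task-specific criterion; the library generates the chain-of-thought
    # evaluation steps and the scoring prompt from this description.
    coherence = GEval(
        name="Coherence",
        criteria="Coherence: the collective quality of all sentences in the actual output.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )

    # Placeholder test case: the prompt given to the system under test and its output.
    test_case = LLMTestCase(
        input="Summarise the meeting notes in three sentences.",
        actual_output=(
            "The team agreed to ship version 2 next week. QA begins on Monday. "
            "Documentation updates were deferred to the following sprint."
        ),
    )

    coherence.measure(test_case)  # runs the G-Eval prompt against the evaluator LLM
    print(coherence.score)        # numeric score for the test case
    print(coherence.reason)       # generated explanation for the score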

 

Impact:

G-Eval offers a significant improvement in aligning automated evaluations with human judgment. By leveraging the semantic understanding of LLMs, it captures aspects of generative text that traditional metrics struggle to assess, provides flexibility for domain-specific criteria, and enhances interpretability by generating explanations for its scores. However, its reliance on LLMs introduces challenges such as score variability across runs and a potential bias toward LLM-generated text, underscoring the need for careful implementation and validation.

About the metric


GitHub stars:

  • 269

GitHub forks:

  • 28



Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.