Hi, and greetings to all readers and innovators. My name is Ton, and I'm a Senior QA Engineer at Hugeman Consultant Company (Hugeman). Hugeman is a business and technology growth partner with a strong track record of helping our clients grow their businesses 10x through our consultancy. At Hugeman we constantly challenge established norms and invent new ways to serve our clients, so today I'd like to share one of the interesting new QA frameworks we've been implementing recently.
As you might already be aware, the AI trend is unstoppable, and it's becoming evident that office workers like us will increasingly rely on ChatGPT in our daily work. As QA professionals, we're used to testing all kinds of applications, mobile and web alike, with manual and automated tests covering both functional and non-functional aspects. But what if the product we need to deliver is a chatbot like ChatGPT that includes specialized business insights? This presents a unique challenge for QA engineers.
In traditional QA, we write test cases with specific expected results and confirm pass or fail outcomes. However, testing AI models like ChatGPT is different. Instead of binary outcomes, we evaluate performance on a scale (0.0 to 1.0), aiming to fine-tune the model towards perfection. This article will introduce this paradigm shift and guide you through the new landscape of AI model testing.
We need to move beyond traditional testing methods and ensure our AI models generate accurate, relevant, and contextually precise responses. This means diving into knowledge graphs, understanding Retrieval-Augmented Generation (RAG) systems, and using tools like DeepEval and RAGAS metrics. So, how do we approach this new landscape? In this article, we’ll cover the key steps for testing AI systems, the necessary tools, and how to interpret the results to ensure our Chatbot solutions are of the highest quality. Let’s dive into the future of QA together!
What is a Knowledge Graph, and Why Do We Need to Use It?
A knowledge graph is an advanced data structure that organizes information as a network of nodes (entities) and edges (relationships). This interconnected format provides a comprehensive and contextual representation of data, making it easier to navigate and analyze. Unlike traditional databases that store data in isolated tables, knowledge graphs interlink data points, offering a more holistic view of information.
In our company, we use knowledge graphs to transform business insights into vector representations that the GPT-4o model can work with. This integration brings together the structured, relational data of knowledge graphs with the advanced language understanding of GPT-4o, resulting in highly accurate and contextually relevant AI responses. By aligning our business knowledge with the capabilities of GPT-4o, we achieve a seamless integration that enhances the effectiveness of our AI applications.
Traditional relational databases rely on SQL for queries, whereas the graph databases that typically back knowledge graphs, such as Neo4j, use the Cypher query language. Although learning Cypher might seem challenging at first, it's just another query language, and its syntax is designed to be intuitive and user-friendly, much like SQL. If you encounter difficulties, ChatGPT can assist in drafting Cypher queries, simplifying the process and ensuring that you can leverage the full power of knowledge graphs. This makes transitioning from SQL to Cypher more approachable and less daunting.
For QA professionals, knowledge graphs offer significant advantages over traditional databases. They enhance testing processes by linking test cases, requirements, and defects, providing a more integrated view of the testing landscape. This interconnected view helps QA teams quickly identify issues, understand their implications, and ensure complete test coverage. The ability to see how different data points relate to each other in a knowledge graph can reveal hidden dependencies and potential problem areas that might be missed with traditional databases.
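To make this concrete, here is a minimal sketch of querying such a graph from Python, assuming a Neo4j-backed knowledge graph and the official neo4j driver; the TestCase and Requirement labels and the COVERS relationship are hypothetical examples, not our actual schema.

```python
# A minimal sketch, assuming a Neo4j-backed knowledge graph and the official
# neo4j Python driver. The TestCase/Requirement labels and the COVERS
# relationship are hypothetical examples, not a production schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Find every test case that covers a requirement that is still open.
cypher = """
MATCH (t:TestCase)-[:COVERS]->(r:Requirement)
WHERE r.status = 'open'
RETURN t.name AS test_case, r.id AS requirement
"""

with driver.session() as session:
    for record in session.run(cypher):
        print(record["test_case"], "covers", record["requirement"])

driver.close()
```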
By merging our business insights with GPT-4o’s capabilities, our knowledge graph-based approach ensures top-tier AI performance and quality. This powerful combination not only improves the accuracy and relevance of AI responses but also enhances our ability to manage and utilize business knowledge effectively. In the rapidly evolving field of AI, knowledge graphs represent a critical tool for staying ahead and delivering high-quality, contextually aware AI solutions.
Evaluating AI Models
Evaluating AI models is a critical step in developing reliable and effective AI systems. This process involves assessing the model’s performance, accuracy, and ability to generalize from training data to unseen scenarios. Effective evaluation ensures that the AI system can perform its intended tasks under various conditions and maintain high-quality results.
Steps in Evaluating AI Models:
- Define Objectives and Metrics: The first step is to clearly define what you want to achieve with your AI model and identify the metrics that will be used to measure success. Common metrics include accuracy, precision, recall, F1 score, and more specific measures like RAGAS metrics for evaluating language models.
- Create a Test Dataset: It is crucial to have a dataset that the model has not seen during training. This dataset should be representative of the real-world data the model will encounter. The test set allows you to assess how well the model generalizes beyond its training data.
- Run Evaluations: Using the test dataset, run the model to generate predictions. Compare these predictions against the actual outcomes to calculate the chosen metrics. This step can be automated using tools like DeepEval, which facilitates the efficient evaluation of large language models (LLMs) on various metrics (see the short sketch after this list).
- Analyze Results: Evaluate the model’s performance based on the calculated metrics. Identify strengths and weaknesses, such as areas where the model performs well and scenarios where it fails. This analysis helps in understanding the model’s behavior and areas needing improvement.
- Iterate and Improve: Based on the analysis, make necessary adjustments to the model. This could involve retraining with more data, fine-tuning parameters, or altering the model architecture. Repeat the evaluation process to ensure that these changes have led to improvements.
- Deploy and Monitor: Once satisfied with the model’s performance, deploy it in a real-world environment. Continuously monitor its performance and gather feedback to make further enhancements.
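As a rough illustration of steps 2 to 4, here is a hedged sketch using DeepEval; the sample data and the get_chatbot_answer helper are placeholders standing in for your own test dataset and the chatbot under test.

```python
# A rough sketch of steps 2-4 with DeepEval. The sample data and the
# get_chatbot_answer helper are placeholders for your own test dataset
# and the chatbot under test.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase


def get_chatbot_answer(question: str) -> str:
    # Placeholder: call your actual chatbot / RAG pipeline here.
    return "You can return any item within 30 days of purchase."


# Step 2: a (tiny) test dataset the model has not seen during training.
test_data = [
    {
        "question": "What is your return policy?",
        "context": ["All items can be returned within 30 days of purchase."],
    },
]

# Step 3: run the model and wrap each prediction in a test case.
test_cases = [
    LLMTestCase(
        input=row["question"],
        actual_output=get_chatbot_answer(row["question"]),
        retrieval_context=row["context"],
    )
    for row in test_data
]

# Step 4: calculate the chosen metrics (scores are on a 0.0-1.0 scale;
# the thresholds below are illustrative, not recommendations).
evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)],
)
```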
Knowing RAGAS Metrics
RAGAS (Retrieval-Augmented Generation Assessment) metrics are specifically designed to evaluate the performance of language models, particularly those used in Retrieval-Augmented Generation (RAG) systems. These metrics help ensure that the models generate accurate, relevant, and contextually appropriate responses, which is crucial for applications like chatbots and virtual assistants.
Key RAGAS and Related Metrics:
- Answer Relevancy: This metric evaluates how relevant the model’s response is to the given query. It ensures that the answers provided are directly related to the questions asked.
- Example: If a user asks about a return policy, the response should be about the return policy, not something unrelated like shipping details.
- Faithfulness: Faithfulness measures the accuracy of the response, ensuring that the information provided is correct and not hallucinated. This metric is crucial for maintaining trustworthiness in AI-generated responses.
- Example: If the policy is a 30-day return, the model shouldn’t say 15 days.
- Contextual Precision: This metric assesses the model’s ability to provide precise answers within the context of the query. High contextual precision indicates that the model can focus on relevant details without including unnecessary information.
- Contextual Recall: Contextual recall measures how well the model captures all relevant aspects of the context. It ensures that no important details are omitted in the response.
- Contextual Relevancy: This combines precision and recall within the context, providing a holistic view of how well the model understands and responds to the context of the query.
- Bias: This metric evaluates whether the model’s responses are free from biases that could affect the fairness and neutrality of the answers. It is crucial for ensuring that the AI system provides unbiased and equitable responses.
- Toxicity: Toxicity measures the presence of harmful or inappropriate content in the responses. This metric helps maintain the safety and appropriateness of the AI’s interactions.
- Hallucination: Hallucination measures the frequency and severity of fabricated or incorrect information in the model’s responses. Minimizing hallucination is essential for maintaining the reliability of the AI.
RAGAS metrics provide a comprehensive framework for evaluating the performance of RAG systems. By focusing on these metrics, developers can ensure that their AI models deliver high-quality, accurate, and contextually appropriate responses. Tools like DeepEval facilitate the evaluation process, allowing for the efficient assessment of these metrics and continuous improvement of AI systems.
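As a small, hedged example of checking the safety-oriented metrics above with DeepEval (the metric class names follow DeepEval's built-ins, and the sample texts are made up for this example):

```python
# Illustrative only: the metric class names follow DeepEval's built-ins,
# and the sample texts are made up for this example.
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric, ToxicityMetric, BiasMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="You can return any item within 30 days of purchase.",
    # HallucinationMetric checks the output against this reference context.
    context=["All items can be returned within 30 days of purchase."],
)

# For these safety metrics a lower score is better, so the threshold acts as a ceiling.
evaluate(
    test_cases=[test_case],
    metrics=[
        HallucinationMetric(threshold=0.5),
        ToxicityMetric(threshold=0.5),
        BiasMetric(threshold=0.5),
    ],
)
```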
DeepEval: A Tool to Measure RAGAS and Other Metrics
DeepEval is an open-source evaluation framework designed to assess the performance of large language models (LLMs). It provides a comprehensive suite of metrics, including RAGAS, to evaluate various aspects of LLM outputs, making it an essential tool for developers working with AI systems.
Features of DeepEval:
- Wide Range of Metrics: DeepEval offers a variety of ready-to-use metrics, including answer relevancy, faithfulness, contextual recall, contextual precision, RAGAS, hallucination, toxicity, and bias. These metrics help ensure that AI models generate accurate, relevant, and safe responses.
- Custom Metric Integration: Users can create custom metrics by inheriting DeepEval’s base metric class. This flexibility allows developers to tailor the evaluation framework to their specific needs and incorporate new metrics as required.
- Bulk Evaluation: DeepEval supports the evaluation of entire datasets in bulk, making it efficient to assess large volumes of data. This is particularly useful for continuous integration and deployment (CI/CD) pipelines.
- Parallel Processing: The framework runs evaluations concurrently, optimizing speed and efficiency. This feature ensures that evaluations are completed quickly, even for large datasets.
- Seamless Integration: DeepEval integrates smoothly with popular frameworks like LangChain and LlamaIndex, allowing developers to incorporate it into their existing workflows easily. It also supports integration with Confident AI for continuous evaluation and monitoring throughout the lifecycle of an AI model.
- User-Friendly Interface: The tool is designed to be easy to use, with a Pytest-like interface for writing and running tests. This familiar structure makes it accessible for developers who are already experienced with Python testing frameworks.
Using DeepEval:
To get started with DeepEval, developers can install the framework using pip and, optionally, create an account on the Confident AI platform to log test results. Writing a test case involves defining the input, actual output, and context, and then applying the relevant metrics to evaluate the model's performance. DeepEval's comprehensive documentation and examples make it straightforward to set up and use, ensuring that developers can quickly begin evaluating their AI models effectively.
DeepEval’s robust evaluation capabilities make it an invaluable tool for developers aiming to create high-quality, reliable AI systems. By leveraging its wide range of metrics and user-friendly interface, developers can ensure their models meet the highest standards of performance and accuracy.
Implementing a Python Script to Evaluate Chatbot Responses
Implementing a Python script to evaluate chatbot responses involves setting up a framework that can systematically test the chatbot’s performance using predefined metrics. Here’s a step-by-step guide to achieve this:
Step 1: Install Necessary Libraries
First, ensure you have the required libraries. For this example, we will use DeepEval, which you can install using pip:
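```bash
pip install -U deepeval
```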
Step 2: Create a Test Case
Define the test case by specifying the input question, the expected output, and the context from which the chatbot should retrieve information.
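Such a test file might look like the following sketch; the file name test_chatbot.py and the sample data are illustrative, and in a real test the actual output would come from your chatbot or RAG pipeline.

```python
# test_chatbot.py -- the file name and sample data are illustrative; in a real
# test, actual_output would come from your chatbot / RAG pipeline.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_return_policy_answer():
    test_case = LLMTestCase(
        input="What is your return policy?",
        actual_output="You can return any item within 30 days of purchase.",
        # The knowledge the chatbot is expected to ground its answer in.
        retrieval_context=["All items can be returned within 30 days of purchase."],
    )
    # Passes if the answer relevancy score meets or exceeds the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```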
Step 3: Set Environment Variables
Ensure your environment variables are set up correctly. For instance, you might need to set your OpenAI API key:
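```bash
# DeepEval's LLM-based metrics call OpenAI by default, so an API key is required.
export OPENAI_API_KEY="your-api-key-here"
```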
Step 4: Run the Test
Run the test script to evaluate the chatbot’s response:
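```bash
# Using the illustrative file name from Step 2; plain pytest can also run it,
# since assert_test integrates with Pytest.
deepeval test run test_chatbot.py
```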
Step 5: Interpret the Results
The output will indicate whether the test passed based on the specified metrics. If the answer relevancy metric meets the threshold, the test passes; otherwise, it fails.
Customizing the Evaluation:
DeepEval allows for extensive customization. You can create custom metrics by inheriting from the base metric class. This flexibility enables you to tailor the evaluation process to meet specific requirements.
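For example, a toy custom metric might look like the following sketch; it assumes DeepEval's BaseMetric interface, and the exact hook names can vary slightly between versions.

```python
# A rough sketch of a custom metric, assuming DeepEval's BaseMetric interface;
# the exact hook names can vary slightly between versions.
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class AnswerLengthMetric(BaseMetric):
    """Toy metric: full score if the answer stays within a word limit."""

    def __init__(self, max_words: int = 100, threshold: float = 0.5):
        self.max_words = max_words
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        word_count = len(test_case.actual_output.split())
        self.score = 1.0 if word_count <= self.max_words else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # DeepEval may call the async variant; reuse the synchronous logic.
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Answer Length"
```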
Example of Running Multiple Metrics:
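The sketch below reuses the earlier illustrative test case and runs the four core RAG metrics against it in one call:

```python
# Illustrative only: reuses the earlier sample data and runs four metrics at once.
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    FaithfulnessMetric,
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="You can return any item within 30 days of purchase.",
    expected_output="Items can be returned within 30 days of purchase.",
    retrieval_context=["All items can be returned within 30 days of purchase."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.7),
    ContextualPrecisionMetric(threshold=0.7),
    ContextualRecallMetric(threshold=0.7),
]

# Evaluates every metric against the test case and reports a per-metric score.
evaluate(test_cases=[test_case], metrics=metrics)
```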
Using this script, you can efficiently evaluate chatbot responses, ensuring they are relevant, accurate, and contextually appropriate.
Traditional QA with Test Cases vs. QA for AI Models with Special Test Cases
Traditional QA with Test Cases:
In traditional QA, test cases are designed to validate the functionality of applications. These test cases specify inputs, expected outputs, and the steps to execute the test. They cover various scenarios to ensure the application works correctly under different conditions. This approach is straightforward and effective for deterministic systems where the outputs are predictable based on given inputs.
Example:
A test case for a login feature might include:
- Input: Username and password
- Expected Output: Successful login message
Testers execute these steps manually or through automated scripts, comparing the actual output to the expected results.
QA for AI Models with Special Test Cases:
Testing AI models, especially those involving natural language processing like chatbots, requires a different approach. These models generate outputs based on probabilistic algorithms, making their behavior less predictable. Special test cases for AI models include not just the expected output but also contextual information to assess the model’s performance.
Example:
A special test case for a chatbot might include:
- Input: User query
- Expected Output: Relevant and accurate response
- Expected Context: Knowledge base from which the chatbot retrieves information
These test cases evaluate multiple aspects, such as relevancy, accuracy, and contextual appropriateness, using metrics like Answer Relevancy, Faithfulness, and Contextual Precision. Tools like DeepEval facilitate this by providing a structured framework to assess these metrics.
Comparison:
- Deterministic vs. Probabilistic: Traditional QA deals with predictable outputs, whereas AI model testing deals with variable outputs.
- Manual vs. Metric-based Evaluation: Traditional QA often involves manual checks, while AI testing relies on automated metrics.
- Focus on Functionality vs. Contextual Relevance: Traditional QA focuses on functional correctness, while AI testing emphasizes context and relevancy.
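To make the contrast concrete, here is a hedged sketch that puts a traditional assertion next to a metric-based check; the login and ask_chatbot helpers are hypothetical stand-ins for the systems under test.

```python
# Hypothetical stand-ins for the systems under test.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def login(username: str, password: str) -> str:
    return "Successful login"  # placeholder for the real login API


def ask_chatbot(question: str) -> str:
    return "You can return any item within 30 days of purchase."  # placeholder chatbot


def test_login_traditional():
    # Deterministic: one exact expected output, binary pass/fail.
    assert login("user", "password") == "Successful login"


def test_chatbot_probabilistic():
    # Probabilistic: no single correct string; the answer passes if its
    # relevancy score clears a threshold on the 0.0-1.0 scale.
    question = "What is your return policy?"
    test_case = LLMTestCase(input=question, actual_output=ask_chatbot(question))
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```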
This shift in approach ensures that AI models meet the desired standards of performance and reliability, catering to the nuances of language and context.
Interpreting Results and Setting Goals
Interpreting the results of AI model evaluations is crucial for understanding the model’s performance and identifying areas for improvement. Here’s how to approach this process effectively:
Understanding the Metrics:
Each metric used in the evaluation provides insights into different aspects of the model’s performance. For instance:
- Answer Relevancy: Indicates how relevant the model’s responses are to the input queries.
- Faithfulness: Measures the accuracy and truthfulness of the responses.
- Contextual Precision and Recall: Assess the model’s ability to provide precise and comprehensive answers based on the context.
Analyzing the Results:
Once the evaluation is complete, analyze the results by examining the scores for each metric. High scores in Answer Relevancy and Faithfulness suggest that the model provides accurate and relevant responses. Conversely, low scores may indicate issues such as irrelevant or incorrect answers, requiring further investigation.
Setting Goals:
Based on the analysis, set realistic and achievable goals for improvement. These goals should be specific, measurable, and time-bound. For example:
- Increase Answer Relevancy score from 0.6 to 0.8 within the next development cycle.
- Reduce the Hallucination rate by 50% by refining the model’s training data.
Continuous Monitoring and Improvement:
Evaluation is an ongoing process. Regularly monitor the model's performance and compare it against the set goals. Use the insights gained to iteratively improve the model. This might involve retraining the model with more data, fine-tuning hyperparameters, or modifying the architecture.
Feedback Loop:
Incorporate feedback from users and stakeholders into the evaluation process. This real-world feedback can provide valuable insights that may not be apparent from automated metrics alone.
By interpreting the results accurately and setting clear goals, you can ensure continuous improvement in your AI models, leading to more reliable and effective systems.
Conclusion
The integration of AI into various domains presents unique challenges and opportunities for QA professionals. Moving beyond traditional testing methods, QA engineers must now embrace new tools and techniques to evaluate AI systems effectively. Knowledge graphs and RAG systems, combined with powerful language models like GPT-4o, offer unprecedented capabilities in handling complex queries and generating accurate responses.
DeepEval provides a robust framework for assessing these AI models, ensuring they meet high standards of accuracy, relevancy, and contextual appropriateness. By leveraging metrics such as Answer Relevancy, Faithfulness, and Contextual Precision, QA professionals can systematically evaluate and improve their AI systems.
As we transition into this new era of AI-driven applications, continuous learning and adaptation are essential. Embracing these advanced evaluation methods and tools will not only enhance the quality of AI models but also ensure they deliver reliable and trustworthy results. By setting clear goals and regularly monitoring performance, we can drive continuous improvement and innovation in AI, ultimately leading to more effective and user-friendly solutions.