Estimating QA effort for AI features.
What changes when intelligence enters the system?
Estimating QA effort has never been an exact science, but with AI it becomes a bit like trying to measure fog. You can see it, feel it, but it doesn’t stay still long enough to pin down. The good news is that you can bring structure to it, as long as you know what kind of AI you’re dealing with and what exactly you’re testing.
Step 1: Define what you’re actually testing
The very first question is simple: are you testing the feature itself, the quality of the AI output, or both?
If your team is only validating the feature behaviour, such as user flows, permissions, data input and output handling, then you can apply your usual estimation models: functional complexity, number of test cases, environments, automation, regression, and so on.
However, if you are testing the quality of the AI output as well, that’s a different story. Suddenly, your effort depends on how the AI is built, what data it consumes, and how variable its responses can be.
Step 2: Understand what kind of AI you’re dealing with
Before you begin estimating, you need to know what type of AI your team is working on. That will define the shape and scope of your QA effort.
Machine Learning (ML) Systems
ML systems usually follow a fixed training and inference pipeline. Your main variables are the dataset, model, behaviour, thresholds, and feature design.
You’ll need to understand:
What data is used for training and testing (its quality, size, and representativeness)?
What are the acceptance criteria for launch?
How will you measure pass and fail? Which metrics matter most?
Does your feature or product allow the user to enter a prompt, or is the system fully automated?
What are the edge cases and potential bias boundaries?
Does your team have a good understanding of the technology under test? If there’s a knowledge gap, invest time in learning how the model and its algorithms work because it’ll make your testing sharper and more meaningful.
Do you need to include security testing, such as checking for prompt injections, data leakage, or model manipulation?
Do you need to run performance and load tests to see how the AI system behaves under stress or high request volumes?
Is there a clear retraining strategy and who owns it?
How is data lineage tracked (where the data comes from and how it’s transformed)?
How is model drift monitored after deployment?
Do you understand how the model’s features and weights influence its predictions?
Can the team explain or visualise model behaviour?
Is there a reproducibility plan? Can you recreate the same results with the same data and parameters?
How is model performance monitored in production, and are there alerts for anomalies?
Example:
A fraud detection model that flags transactions as “suspicious” or “safe”. The testing effort will depend on how many datasets you can access (both valid and invalid samples), how the thresholds are configured, and how easily you can reproduce the inference environment.
QA approach:
Treat this as a continuously learning system. Validate datasets, review metrics, and confirm reproducibility across model versions. Include statistical sampling in your test planning, and work closely with your data and ML teams to prepare a realistic test environment with representative datasets.
If the feature allows users to enter prompts, your test effort will increase slightly because it introduces additional variables that can influence the output. You can create test data or seed prompts to guide testing, but you will never be able to cover every possible variation. Focus on typical, high-impact and risky scenarios first, then expand gradually as you learn more about the model’s behaviour.
Large Language Models (LLMs)
LLM-based features are far more unpredictable. They are context-driven and generative, which means you can’t always define a single “expected output”. Estimating testing effort for these systems depends heavily on how they are designed, integrated, and configured.
Before you begin, make sure you understand:
Does the AI system use RAG (Retrieval-Augmented Generation)?
How will you measure pass and fail? what evaluation method or metrics will you rely on (accuracy, factuality, relevance, tone)?
Does your feature or product allow users to enter a prompt, or is the system fully automated?
Does the AI solution under test maintain context or memory, and how persistent is that memory?
Do you have access to the system prompts? Familiarising yourself with them helps you understand how the model is being instructed to behave.
How many model versions or vendors are in play (for example, Gemini 2.5 Flash, GPT-4o mini, etc.)?
What temperature or sampling parameters are used, and how much randomness can you expect in the outputs?
Do you need to include security testing, such as checking for prompt injections, data leakage, or model manipulation?
Do you need to run performance and load tests to understand how the system behaves under stress or high request volumes?
Do you need to test how the system detects and manages harmful, biased, or sensitive content?
Can you access or control prompt and response versioning to ensure consistency during regression testing?
Do you plan to use LLM-as-a-judge to automate output evaluation and speed up scoring?
What is the context window limit, and how does truncation or token overflow affect performance?
Are there fallback mechanisms in case the model times out, produces an empty response, or fails to retrieve context?
Is there sufficient observability and logging to diagnose model behaviour and drift after deployment?
Example:
Imagine a helpdesk assistant that drafts responses to customer queries.
If it uses RAG with your company’s knowledge base, you’ll need to test the quality of information retrieval as well as the accuracy of the generated response. If the system maintains conversational context, plan multi-turn tests to check continuity and consistency of tone and facts.
QA approach:
Start with exploratory testing to get a feel for how the system behaves. Then design evaluation prompts that cover key intents, edge cases, and failure scenarios. Design a scoring card to track results against versions carefully for each LLM model you test.
At this stage, consider using LLM-as-a-judge - a separate model that automatically evaluates outputs against your predefined criteria. For instance, you can feed the original prompt, the model’s response, and a reference answer into another LLM to score accuracy or tone. While it’s not a substitute for human evaluation, it can save significant time once calibrated and can scale your regression testing efficiently. Over time, this becomes an invaluable way to detect changes in response quality across different model or prompt updates.
Other AI Architectures
Depending on your product, you might also encounter:
Computer vision models (image or video analysis)
Recommendation engines (personalisation and ranking)
Speech-to-text or text-to-speech systems
Each comes with its own data biases and quality measures (BLEU, ROUGE, WER, etc.), so your estimation should include dataset preparation and evaluation time.
I’ll dive deeper into how to run LLM evals properly in my next piece, so keep an eye out for it.
Step 3: Identify variables that influence the output
AI systems don’t follow fixed logic trees, so every small variable matters. When estimating QA effort, make sure you account for:
Data quality and diversity – garbage in, garbage out still applies
User prompt cases and variations in phrasing or intent
Datasets – their source, structure, and representativeness
System prompts and hidden context that influence behaviour
Model parameters – temperature, top-p, context window length, and other settings that affect determinism
Versioning – model, dataset, embeddings, and code
Integration points – RAG pipelines, APIs, vector databases, and third-party dependencies
User input range and behaviour unpredictability
Environment and configuration variables – model endpoints, latency, caching, and scaling behaviour
Monitoring and observability data – how outputs, failures, and performance are logged and tracked
Even a small change in data, configuration, or prompt phrasing can alter the outcome. That’s why version control is critical across every layer of your AI system — not just code, but also data, prompts, model configurations, and environment settings.
Step 4: Build a data strategy for testing
Once you know the moving parts, you’ll need the right dataset to test them.
Start with a representative sample of real-world inputs. Make sure your data reflects how people will actually use the system, not just how you expect them to. Keep an eye on balance because if your dataset is too clean or too uniform, you’ll miss real-world edge cases.
Create control prompts or labelled datasets for repeatability, and include negative testing - trick prompts, ambiguous inputs, and edge cases that push the system to its limits. You can also use data augmentation to expand your test coverage by generating variations of existing samples or prompts.
Define success metrics before you begin testing. For example, decide how you’ll measure factual accuracy, tone, relevance, or response stability.
And always start with exploratory testing. This stage is essential for any AI feature because you’re not just testing functionality, you’re learning how the system behaves. Use exploratory sessions to understand where the AI drifts, how it responds to variations, and what kinds of prompts or data trigger unexpected results. Document what you discover and feed it into your formal test planning.
If you already have evaluation pipelines in place, you can even introduce LLM-as-a-judge to help score or summarise responses automatically during exploration. It won’t replace human judgement, but it can help you spot trends faster and prioritise what needs deeper manual review.
This early insight will shape your formal test cases, help you identify high-risk areas, and ultimately save a lot of time when you move into structured testing.
Step 5: Manage versions and feedback loops
Introduce versioning into your AI testing process from day one. Each model update, dataset change, or prompt tweak can affect results dramatically, so track them in the same way you would track software releases.
When a fix is deployed, you’ll need to know exactly which version of the model or dataset it applies to. This allows you to reproduce, retest, and report with confidence.
As your test coverage grows, combine version control with automated evaluation pipelines using LLM-as-a-judge to detect regressions in output quality quickly across model updates.
Step 6: Accept that you’ll never cover it all
The number of potential test cases for an AI feature is vastly higher than for traditional systems, simply because of the probabilistic nature of AI. If user prompts or input data can vary, the space of possible outputs grows exponentially.
This is why I often recommend releasing AI features as beta, with Human-in-the-Loop (HITL) feedback built into the product interface. Allow users to score outputs or flag poor results. Capture that data, feed it back to your AI engineers, and use it to improve inference quality.
It is a practical balance: you reduce the QA cycle by focusing on high-priority test cases that meet your minimal acceptance criteria, while learning directly from real-world use.
Just make sure your legal and compliance teams are aligned before capturing any user data for analysis. Transparency builds trust, both internally and externally.
✅ QA Checklist: Estimating effort for AI features
Before you start
Define what you’re testing: the feature itself, the AI output, or both
Clarify what “quality” means to you and the stakeholders.
Make sure your team understands the AI technology under test
Identify any knowledge gaps and fill them early (learn how the model works before estimating)
Understand the system
Identify whether the solution uses ML, LLM, RAG, or another AI approach
Learn how data flows through the system, from input to output
Review training and testing datasets for quality, size, balance, and representativeness
Understand how algorithms process data, including canonical equivalents, weights, and thresholds
Confirm versioning for model, dataset, embeddings, prompts, and code
Check whether the system keeps context or memory, and how persistent it is
Review system prompts and context instructions that guide model behaviour
Identify integration points (APIs, RAG pipelines, vector databases, etc.)
Define testing scope
Decide how you’ll measure pass and fail. What metrics or evaluation methods apply (accuracy, relevance, tone, factuality, etc.)?
Define clear acceptance criteria for launch
Plan security testing (prompt injections, data leakage, model manipulation)
Include performance and load testing to see how the system behaves under pressure
Test how the system detects and manages harmful, biased, or sensitive content
Identify edge cases, bias boundaries, and fairness thresholds
Understand randomness factors like temperature, top-p, and other sampling parameters
Plan your data and test approach
Build a representative, realistic dataset that reflects real user behaviour
Create control prompts or labelled datasets for repeatability
Include negative testing such as trick prompts, ambiguous data, and edge cases
Use data augmentation to expand coverage if datasets are limited
Define measurable success metrics before formal testing
Start with exploratory testing to learn how the system behaves
Document findings from exploratory testing and use them to shape structured test cases
Work with AI engineers to prepare an appropriate test environment and data setup
Use automation and feedback loops
Introduce version control for all AI artefacts (model, data, prompts, embeddings, and configs)
Consider using LLM-as-a-judge for automated output evaluation and regression checks
Capture and review feedback from Human-in-the-Loop (HITL) processes or beta users
Include early adopters from your customer list to gather real-world feedback sooner
Ensure observability and logging are in place to track outputs, failures, and drift
Keep learning and improving
Document what you learn from each test cycle
Review and discuss results regularly with your data and AI teams
Monitor model drift and bias over time
Adjust your test strategy as the model, data, or architecture evolves
Stay in touch with Support and Customer Success teams to gain insights into customer’s feedback.
Final thoughts
Estimating QA effort for AI features is less about counting test cases and more about understanding how intelligence behaves.
Start with clarity on what you’re testing: the feature, the output, or both.
Map out your AI architecture, identify the variables, and plan your data strategy early.
Stay flexible, explore first, and embrace versioning, automation and feedback loops.
Because in AI, the best test plans aren’t written once. They change with every test, every feedback, and every surprise. The systems we test keep learning, and we need to learn with them. The future of QA isn’t about ticking boxes; it’s about understanding how things behave, asking good questions, and keeping up as the technology evolves.
Thanks for reading! I hope this helps you and your team as AI becomes part of your workflow. If you want to discuss anything in more detail, just drop me a message :-)
