Accuracy is not assurance: Why AI test and evaluation is only the starting point

Written by Charles Choyce | Thursday 4 June 2026

Test and evaluation is essential before deploying any trained AI system. It gives developers, buyers, and users evidence of how a model performs against defined datasets, scenarios, and metrics.

For UK Defence, this evidence matters. AI may support target acquisition, intelligence triage, decision support, autonomy, and mission planning. In each case, poor performance exposes system developers and users to the risk of mission failure, safety incidents, legal scrutiny, and reputational fallout when mitigation falls short.

But test and evaluation is not enough.

Evaluation metrics such as accuracy, precision, recall, F1 scores, and benchmark results tell us how a model behaved under tested conditions. They do not prove how the wider AI-enabled system will behave in operation. A model can perform well in evaluation and still be inappropriate, unreliable, or unsafe once deployed.

A benchmark is not the battlefield.

Figure: Satellite image with civilian and military vehicles identified and classified in bounding boxes.

The impossible scale of the input space

Consider an AI image detection model used to support Defence target acquisition. At first glance, evaluating performance may appear straightforward. Provide labelled images, measure how often the model detects the correct object, then decide whether it is accurate enough.

The reality is more difficult.

Even a relatively small 255 by 255 Red-Green-Blue image contains 65,025 pixels. Each pixel is represented by three colour channels, each typically stored with 8 bits, giving 256 possible values per channel, or about 16.7 million possible colours per pixel. If we were to generate every possible 255 by 255 image, the number of images would be astronomically large, far more than the estimated number of atoms in the observable universe.

This means no test programme can evaluate the model across the full possible input space. It is not merely impractical, but physically impossible.
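The arithmetic behind this claim can be checked directly. The short sketch below works through the numbers from the example above. The figure of roughly 10^80 atoms in the observable universe is a commonly quoted order-of-magnitude estimate and is used here as an assumption.

```python
import math

WIDTH = HEIGHT = 255       # image dimensions from the example above
CHANNELS = 3               # red, green, blue
BITS_PER_CHANNEL = 8       # 256 possible values per channel

pixels = WIDTH * HEIGHT
colours_per_pixel = (2 ** BITS_PER_CHANNEL) ** CHANNELS
total_bits = pixels * CHANNELS * BITS_PER_CHANNEL   # bits describing one image

# The number of distinct images is 2 ** total_bits. Count its decimal
# digits with logarithms rather than printing the whole integer.
digits_in_image_count = math.floor(total_bits * math.log10(2)) + 1

# Assumption: about 10**80 atoms in the observable universe.
ATOMS_EXPONENT = 80

print(f"pixels: {pixels:,}")
print(f"colours per pixel: {colours_per_pixel:,}")
print(f"distinct images: about 10^{digits_in_image_count - 1:,}")
print(f"atoms in the observable universe: about 10^{ATOMS_EXPONENT}")
```

The count of distinct images is a number with roughly 470,000 decimal digits, against 81 digits for the atom estimate.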

Instead, test and evaluation must use a dataset that represents expected operating conditions. That dataset may include different platforms, sensors, angles, ranges, lighting conditions, weather, backgrounds, terrains, and object types. It may include examples of camouflage, obscuration, clutter, and degraded imagery.

But it remains a sample.

The implication is critical. We cannot know with certainty how the model will perform in every operational situation. We can only estimate performance based on available evidence. That estimate may be strong and statistically meaningful, but it is still bounded by the dataset, assumptions, and scenarios used in evaluation.
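One common way to express this kind of estimate is a confidence interval around the measured accuracy. The sketch below uses the Wilson score interval, chosen here for illustration rather than prescribed by any particular evaluation standard. The test-set sizes and scores are hypothetical, and the interval assumes the test images are independent and representative of operation.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# Hypothetical figures: the same 97% observed accuracy on two test-set sizes.
for correct, total in [(97, 100), (9700, 10000)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: 95% interval {low:.3f} to {high:.3f}")
```

A larger test set narrows the interval, but only for conditions the dataset actually covers. The interval says nothing about situations that were never sampled.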

False positives and false negatives are not abstract errors

In Defence target acquisition, model errors can cause severe harm.

A false positive could mean a civilian vehicle is incorrectly classified as a threat. A false negative could mean an actual threat is missed. Both outcomes matter. The first may create unacceptable risk to civilians, legality, proportionality, and mission legitimacy. The second may expose personnel, assets, or the mission to avoidable danger.

This is why metrics alone are insufficient. A headline score may hide the distribution of errors that matter most. A model that performs well on average may still fail under specific conditions.

A 2023 study found that an AI model for military vehicle recognition achieved 97.2% average accuracy on radar imagery, yet dropped to 65.3% when the same images were contrast-balanced. The model had relied partly on background brightness, showing how headline accuracy can conceal brittle behaviour under realistic sensor-processing changes. (Source)
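Breaking results down by condition is one simple way to surface this kind of weakness. The sketch below is illustrative only: the condition names and counts are invented and are not taken from the study cited above.

```python
from collections import defaultdict

# Hypothetical evaluation records: (condition, prediction_was_correct).
# Counts are invented for illustration, not taken from the cited study.
records = (
    [("clear daylight", True)] * 930 + [("clear daylight", False)] * 20
    + [("low contrast", True)] * 30 + [("low contrast", False)] * 20
)

def accuracy_by_condition(rows):
    """Return {condition: (correct, total, accuracy)} for each slice."""
    counts = defaultdict(lambda: [0, 0])   # condition -> [correct, total]
    for condition, is_correct in rows:
        counts[condition][0] += int(is_correct)
        counts[condition][1] += 1
    return {c: (ok, n, ok / n) for c, (ok, n) in counts.items()}

overall = sum(is_correct for _, is_correct in records) / len(records)
print(f"overall accuracy: {overall:.1%}")
for condition, (ok, n, acc) in accuracy_by_condition(records).items():
    print(f"{condition}: {ok}/{n} = {acc:.1%}")
```

In this invented data the headline figure is 96%, while the small low-contrast slice sits at 60%. The average looks healthy because the weak slice is a small share of the test set.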

This is also why the intended role of the AI model matters. A model used to triage imagery for initial human review carries different risks and assurance requirements from a model used to support time-sensitive target classification. The same model performance may be acceptable in one role and unacceptable in another.

The broader lesson for AI assurance

This challenge is not limited to image detection or Defence.

AI is useful because it can identify patterns in complex data where deterministic systems are unavailable, impractical, or too narrow. But the same complexity that makes the model useful also makes it difficult to predict completely.

If a neural network is complex enough to solve a problem that cannot be handled by fixed rules, it will also be complex enough to be non-transparent, and to produce errors when its inputs vary only slightly.

A slight change in an image, sensor reading, or data distribution may produce an unexpected output. The system may be confident and wrong. It may perform well in testing, then degrade when data quality, user behaviour, system integrations, or operational context changes.
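A shift in data distribution can be monitored once a system is in service. The sketch below uses the Population Stability Index (PSI), one of several possible drift measures, to compare a single input feature between test data and in-service data. The feature (mean image brightness), the sample values, and the function name are hypothetical.

```python
import math

def population_stability_index(reference, live, bins=10, eps=1e-6):
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    if width == 0:
        raise ValueError("reference values must not all be identical")

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clamp out-of-range values
        # eps avoids log(0) when a bin is empty
        return [max(c / len(values), eps) for c in counts]

    ref_p, live_p = proportions(reference), proportions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_p, live_p))

# Hypothetical mean-brightness values per image, scaled to [0, 1].
test_set = [i / 100 for i in range(100)]
in_service = [min(1.0, v + 0.3) for v in test_set]   # simulated brighter imagery

print(f"PSI, test set vs itself: {population_stability_index(test_set, test_set):.3f}")
print(f"PSI, test set vs in-service: {population_stability_index(test_set, in_service):.3f}")
```

A PSI of zero means the two distributions match across the bins, and larger values indicate a larger shift. A check like this flags that inputs have moved away from what was tested; it does not say whether the model still performs acceptably.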

AI should be used, but it must be treated as a component that can and will fail over time. The surrounding processes, controls, and architecture should be designed to remain resilient when it does. System-level assurance supports this by building trust into the overall solution, not just the model.

From test evidence to system-level assurance

Test and evaluation provides the evidence base. Assurance decides whether that evidence is sufficient for the intended use.

Taking the Defence threat recognition model as an example, what additional questions does this framing surface?

This expands the discussion from “how accurate is the model?” to “can this AI-enabled system be trusted for this purpose, in this context, with these controls?”

Guardrails, failure analysis and human-machine teaming

As full certainty is impossible, assurance must focus on making uncertainty visible, bounded, and managed.

That requires system-level controls: defined operating envelopes, clear restrictions on use, confidence thresholds, independent corroboration, audit logs, incident reporting, fallback procedures, and revalidation after updates.
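Some of these controls can be expressed as simple gating logic around the model. The sketch below is a minimal illustration, not a reference design: the field names, thresholds, and routing messages are all hypothetical, and real limits would come from the assurance case for the specific system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    label: str
    confidence: float    # model score in [0, 1]
    range_km: float      # sensor-to-object distance
    corroborated: bool   # confirmed by an independent source

# Hypothetical limits; real values would come from the assurance case.
MIN_CONFIDENCE = 0.90
MAX_RANGE_KM = 5.0

def route(detection: Detection) -> str:
    """Return a handling decision; nothing here triggers an action."""
    if detection.range_km > MAX_RANGE_KM:
        return "reject: outside operating envelope"
    if detection.confidence < MIN_CONFIDENCE:
        return "refer: low confidence, human review"
    if not detection.corroborated:
        return "refer: awaiting independent corroboration"
    return "present: to operator with confidence shown"

audit_log = []
for d in [
    Detection("vehicle", 0.97, 2.0, True),
    Detection("vehicle", 0.97, 8.0, True),
    Detection("vehicle", 0.60, 2.0, True),
    Detection("vehicle", 0.95, 2.0, False),
]:
    decision = route(d)
    audit_log.append((d, decision))   # every decision is recorded
    print(decision)
```

The envelope check comes first so that a high confidence score cannot override the fact that the input lies outside tested conditions.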

It also requires failure mode analysis. Teams should identify where the model is likely to fail, what harm could result, how failures can be detected, and what mitigations are required before deployment.

Finally, it requires effective human-machine teaming. A human-in-the-loop is not a safeguard by default. The human must have the time, training, authority, and information needed to challenge the model. Where appropriate, interfaces should show uncertainty clearly and avoid presenting AI outputs as more definitive than the evidence supports.

Our research paper on how human-autonomous teams fail, and crucially how we can analyse failures earlier in system development lifecycles, is here. Our method, "Design for Interaction Failure", conducts behavioural failure analysis through human- and machine-based lenses.

Find out more

For AI used in high-stakes settings, test and evaluation is essential but not enough for real assurance. The whole system must be considered and understood to raise risks early and manage them confidently.

Synoptix brings a strong systems thinking heritage to this challenge, built on 15 years supporting some of the UK’s most complex Defence programmes. By assessing how AI performs in real operational conditions and how people actually use it, we surface vulnerabilities that are often missed, including degraded performance and unexpected emergent behaviours, to build a clear, defensible case for trust in operation.

To go beyond tick-box compliance to evidence-based, uncertainty-informed, intentional design, sign up to our upcoming webinar or find out more about our work on our website.
