The easiest way to make AI assurance look rigorous is to focus on the model.
Models produce scores. They can be benchmarked, tested, red-teamed, compared, tuned, and monitored. They give us artefacts that fit neatly into evaluation reports and governance processes. If the model performs well enough across the right test sets, the system starts to feel controlled.
The problem is that deployed AI systems do not fail as models. They fail as systems.
A model may perform well in isolation and still produce unacceptable outcomes once it is placed inside a workflow, connected to other software, interpreted by a human, constrained by an organisation, or exposed to an operating environment that was not quite the one imagined during design. This should be treated as a core expectation of system behaviour, not a rare edge case. In complex real-world settings, the gap between model behaviour and system behaviour is exactly where many risks appear.
That is why much of the current approach to AI assurance is weaker than it appears. It often aims at the most measurable part of the system rather than the part that matters most: the behaviour of the whole system in its operational environment. To change this, we need three system-level conceptual shifts:
- A shift from model assurance to system assurance
- A shift from proving predictable behaviour to managing irreducible uncertainty
- A shift from evidence collection to assurance argument
The model is not the system
A deployed AI system includes the model, but also the data pipeline, user interface, surrounding workflow, escalation routes, monitoring arrangements, fallback processes, maintenance practices, organisational incentives, user training, and the environment in which the system operates. It also includes people. Not as external users of the system, but as part of how the socio-technical system behaves. This reality shows why model performance and system performance are not the same thing.
A powerful question is to ask what operating context the assurance claim applies to. In autonomous systems safety, this is often made explicit through an operational design domain model: a definition of the environments, interactions, and scenarios within which the system is intended to operate safely. If that context is incomplete - and it always will be, as the designed domain will always be a subset of the true operational domain - the system may encounter situations that were never analysed, and the safety claim simply does not apply. That idea transfers directly to AI assurance more generally. A model is not “safe” or “trustworthy” in the abstract. It is only supported for particular uses, under particular conditions, with particular assumptions and controls. Once those assumptions change, the assurance claim changes with them.
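One way to make this concrete is to write the operating context down as data and check each real situation against it. The sketch below is only an illustration: the `OperatingContext` and `Situation` types, their fields, and the claims-handling example are all hypothetical, and a real operational design domain would be far richer.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingContext:
    """Hypothetical description of where an assurance claim applies."""
    user_roles: frozenset[str]
    input_languages: frozenset[str]
    max_document_pages: int


@dataclass(frozen=True)
class Situation:
    """One observed use of the system (also hypothetical)."""
    user_role: str
    input_language: str
    document_pages: int


def out_of_context_reasons(ctx: OperatingContext, s: Situation) -> list[str]:
    """Return the reasons a situation falls outside the assured context."""
    reasons = []
    if s.user_role not in ctx.user_roles:
        reasons.append(f"user role '{s.user_role}' was not analysed")
    if s.input_language not in ctx.input_languages:
        reasons.append(f"language '{s.input_language}' was not analysed")
    if s.document_pages > ctx.max_document_pages:
        reasons.append(f"{s.document_pages} pages exceeds the analysed maximum")
    return reasons


# Illustrative values only.
assured = OperatingContext(
    user_roles=frozenset({"claims_handler"}),
    input_languages=frozenset({"en"}),
    max_document_pages=20,
)

inside = Situation("claims_handler", "en", 5)
outside = Situation("customer", "en", 45)

print(out_of_context_reasons(assured, inside))   # empty list: the claim applies
print(out_of_context_reasons(assured, outside))  # two reasons: the claim does not apply
```

An empty list does not show that the system is safe. It only shows that the situation is one the assurance claim was written to cover.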
A model can score highly on a benchmark and still drive poor decisions. It can produce plausible outputs that are misunderstood by users. It can be technically correct in ways that are operationally useless. It can pass a pre-deployment assessment and then become unsafe because the environment changed around it.
When assurance focuses too heavily on the model, it tends to substitute available measurements for meaningful ones. Accuracy becomes a stand-in for value. Red-team examples become a stand-in for robustness. Bias tests become a stand-in for fairness in operation. Nominal human “approval” becomes a stand-in for meaningful oversight and accountability.
None of those measures is useless, and model-level evidence still matters. The problem is that it is incomplete on its own, and sometimes incomplete evidence is more dangerous than no evidence at all because it creates unwarranted confidence. A benchmark result may justify “the system performs well right now” but not “the system is resilient to failure under real operating conditions”.
The question should not be, “Is the model good?” The question should be, “Is the system good enough, under realistic conditions, without unacceptable consequences, and why should we believe that this will remain true?” If assurance does not test this distinction, it can accumulate evidence without actually reducing the uncertainty that matters.
Useful AI cannot be assured by prediction alone
There is a nasty trade-off at the centre of AI systems: the more useful they become, the more uncertainty is present in their behaviour, and so the harder they often become to assure.
A narrow deterministic AI component can be tested against a relatively bounded set of behaviours. A more capable AI component is attractive precisely because it can operate across higher-variety situations: it can interpret ambiguous inputs, generalise across cases, summarise messy information, generate options, and adapt to context. But that flexibility is exactly what creates the assurance problem.
The more we ask an AI system to operate in ambiguous, high-variety environments, the more we expose it to uncertainty. Not just uncertainty about whether the model will produce the right output, but uncertainty about how the wider system will respond to that output.
Who will trust it? Who will ignore it? Who will overrule it? Who will become dependent on it? What happens when it is confidently wrong? What happens when it is partly right in a way that misleads the user?
In autonomous systems work, this problem is often approached at the level of scenarios and decisions rather than model outputs alone. A hazardous scenario is not just “the system made an error”. It is a combination of what the system was doing, what was true in the environment, and what decision the system made. The risk sits in the relationship between context, perception, understanding, and action.
That is a useful lesson for AI assurance more broadly. The dangerous output is not always the obviously wrong one. Sometimes the risk is a plausible output placed into the wrong workflow, used by the wrong person, at the wrong moment, with the wrong level of trust - in other words: a “dangerous success”.
This is also why assurance cannot rely on exhaustive testing. As operating environments become more open, the number of possible scenarios grows faster than our ability to identify and test them. Good assurance has to reduce uncertainty where it can, bound it where possible, monitor it where necessary, and design around the fact that some situations will not have been fully anticipated.
Some of that uncertainty, though, is intrinsic to the kind of capability we are trying to build. Good assurance starts by acknowledging that, not by pretending it can all be tested away.
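A rough piece of arithmetic shows why exhaustive testing runs out of road. The sketch below assumes, purely for illustration, that a scenario is a combination of a few independent context factors. The factor names and counts are invented, and real factors are rarely independent or this easy to enumerate.

```python
from math import prod

# Hypothetical context factors and how many distinct values each can take.
factor_levels = {
    "user_role": 4,
    "input_type": 6,
    "workload": 3,
    "upstream_data_quality": 3,
    "level_of_user_trust": 3,
}


def scenario_count(levels: dict[str, int]) -> int:
    """Number of distinct combinations if every factor varies independently."""
    return prod(levels.values())


base = scenario_count(factor_levels)
# One more factor with 5 values multiplies the total rather than adding to it.
extended = scenario_count({**factor_levels, "time_pressure": 5})

print(base)      # 648
print(extended)  # 3240
```

Each new dimension of openness multiplies the scenario space, while test effort usually grows only additively.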
Assurance is an argument, not a pile of evidence
Testing matters. Risk assessment matters. Monitoring matters. Documentation matters. But none of them, on their own, justify trust.
Trust in a high-stakes AI system can only be justified by an argument that connects evidence to claims about system behaviour. A benchmark result has to be interpreted. A test has to be shown to represent something important. A control has to be shown to reduce a specific risk. A monitoring process has to be connected to action.
Without that argument, assurance becomes a collection of disconnected artefacts.
A useful assurance case is not a compliance binder assembled to defend a decision that has already been made. It is a live engineering artefact. It states what must be true for the system to be trusted, what evidence supports that belief, where the evidence is weak, what assumptions are being made, and what would force reconsideration.
This last point can’t be a nice-to-have. You must not only collect supporting evidence but also actively search for defeaters: credible reasons why the claim might be false. A claim is not well supported just because evidence points in its favour. It is stronger when serious objections have been identified, examined, and either resolved or explicitly accepted as residual uncertainty.
This matters particularly because AI assurance is vulnerable to confirmation bias. If a team starts with the belief that a system is ready to deploy, it is easy to gather evidence that supports that belief: a good benchmark, a successful pilot, a red-team report, a fairness assessment, a sign-off from a human reviewer. The harder and more useful question is: what evidence would undermine the claim?
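The structure of such an argument can be sketched as a small data model in which a claim is tied to its evidence, assumptions, and defeaters. This is a minimal illustration and not a standard assurance-case notation: the class names, the status labels, and the rule that a claim with no recorded defeaters is merely "unchallenged" are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class DefeaterStatus(Enum):
    OPEN = "open"          # identified but not yet examined
    RESOLVED = "resolved"  # examined and shown not to undermine the claim
    ACCEPTED = "accepted"  # explicitly accepted as residual uncertainty


@dataclass
class Defeater:
    description: str
    status: DefeaterStatus = DefeaterStatus.OPEN


@dataclass
class Claim:
    statement: str
    evidence: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    defeaters: list[Defeater] = field(default_factory=list)

    def assessment(self) -> str:
        """Illustrative rule: favourable evidence alone is not enough."""
        if not self.evidence:
            return "unsupported"
        if not self.defeaters:
            return "unchallenged"  # nobody has looked for objections yet
        if any(d.status is DefeaterStatus.OPEN for d in self.defeaters):
            return "contested"
        return "supported"


# Hypothetical example claim.
claim = Claim(
    statement="Reviewers catch incorrect summaries before they reach a customer",
    evidence=["pilot review logs"],
    assumptions=["reviewers have time to read the source document"],
)
first = claim.assessment()

claim.defeaters.append(Defeater("Reviewers approve without reading under time pressure"))
second = claim.assessment()

claim.defeaters[0].status = DefeaterStatus.ACCEPTED
third = claim.assessment()

print(first, second, third)  # unchallenged contested supported
```

The design choice worth noting is that a claim with evidence but no recorded defeaters is never reported as "supported". That pushes the team to look for objections before it can claim support.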
AI systems change. Their data changes. Their users change. Their operating environment changes. Their integration points change. Their threat model changes. Static audits and point-in-time assessments decay quickly when the object being assessed keeps moving. This is a recognised weakness in AI assurance: an audit may capture a model, product, or governance process at one moment while missing how risks emerge later through deployment, adaptation, or downstream use.
That does not mean audits are pointless. It means they need to be connected to change management, monitoring, incident response, and a living assurance argument. Otherwise they provide the appearance of control without the mechanism to preserve it.
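One possible mechanism for preserving control is to record what each piece of evidence depended on, then flag it when that basis changes. The sketch below assumes a hypothetical `SystemState` snapshot and hand-declared dependencies. Working out what a piece of evidence really depends on is itself a judgement that code cannot make.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemState:
    """Hypothetical snapshot of the things an audit implicitly depends on."""
    model_version: str
    data_pipeline_version: str
    integration_points: frozenset
    intended_users: frozenset


@dataclass(frozen=True)
class Evidence:
    name: str
    collected_against: SystemState
    depends_on: tuple  # names of SystemState fields this evidence relies on


def stale_evidence(items, current):
    """Return (evidence name, changed fields) for evidence whose basis has moved."""
    stale = []
    for ev in items:
        changed = [
            f for f in ev.depends_on
            if getattr(ev.collected_against, f) != getattr(current, f)
        ]
        if changed:
            stale.append((ev.name, changed))
    return stale


# Illustrative values: a new integration point was added after the audit.
at_audit = SystemState("m-1.0", "p-3", frozenset({"crm"}), frozenset({"claims_handler"}))
today = SystemState("m-1.0", "p-3", frozenset({"crm", "email"}), frozenset({"claims_handler"}))

evidence = [
    Evidence("benchmark report", at_audit, ("model_version",)),
    Evidence("workflow hazard analysis", at_audit, ("integration_points", "intended_users")),
]

print(stale_evidence(evidence, today))
# [('workflow hazard analysis', ['integration_points'])]
```

Here the benchmark report still stands because the model has not changed. The workflow analysis is flagged for review because the system around the model has.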