In regulated financial crime work, “good enough” AI is rarely good enough: you need decisions you can defend.
Dr Janet Bastiman is Chief Data Scientist at Napier AI, where she leads data science work supporting anti-money laundering and financial crime compliance.
She focuses on translating complex models into outcomes teams can audit, explain, and improve.
Regulation is pushing AI out of the “black box” era. The practical response is explainable workflows, proportionate oversight, and an evidence trail that stands up under challenge.
Financial crime teams adopted machine learning early because monitoring transactions at scale is more than human teams can handle unaided.
What has changed is that models play a greater role in shaping who gets investigated, who gets delayed, and who gets treated as higher risk.
Regulation Cares About Outcomes and Evidence
The EU AI Act sets higher expectations where AI can materially affect people’s lives.
In the UK, the government’s pro-innovation approach still expects sector regulators to enforce accountability, documentation, and control.
“AI in compliance” is high stakes: a wrong decision can mean missed criminal activity, or the wrong person treated as suspicious.
Global standard-setters are also watching the threat landscape shift, including AI-enabled tactics such as deepfakes, which can complicate identity and verification controls.
The FATF’s horizon scan on AI and deepfakes shows model governance and fraud resilience converging in the same operational workflows.
The FCA’s AI sandbox work with NVIDIA shows how quickly experimentation is moving in regulated markets, which raises the bar on oversight as well as innovation.
Model Risk Management Is Spreading Beyond Traditional Use Cases
Organisations whose decisions affect important areas of people’s lives need clear model ownership, validation, monitoring, and the ability to challenge model outputs.
LLMs are strong at summarising unstructured information, but they are poor substitutes for classification models in review-or-discount workflows.
If the goal is “review or discount”, as in many AML workflows, you need a model designed for classification and engineered to show why it reached its conclusion.
Make Explanations Evidence-Led
When a model reduces false positives, the success metric is whether reviewers can trust the reasons for discounting a potential issue, especially when those decisions may later be challenged.
The practical test is straightforward: would you accept this explanation from a human analyst?
In AML settings, explainability usually needs four things:
How confident the system is, and when it is uncertain.
The key evidence points that drove the decision.
Links back to source records that can be checked quickly.
Plain language that fits into an audit narrative.
This lines up with the ICO guidance on explaining decisions made with AI, emphasising transparency and accountability in ways people can act on.
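As a rough illustration, the four elements above could be carried in a single explanation record. This is a minimal sketch, not a description of any vendor's implementation: the class names, field names, and the 0.7 uncertainty threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EvidencePoint:
    """One fact that drove the decision, tied to a checkable source record."""
    description: str
    source_record_id: str  # hypothetical ID of the underlying transaction or document


@dataclass(frozen=True)
class AlertExplanation:
    decision: str                       # "review" or "discount"
    confidence: float                   # model confidence in the decision, 0.0 to 1.0
    evidence: Tuple[EvidencePoint, ...]
    uncertainty_threshold: float = 0.7  # illustrative value, not a recommendation

    @property
    def is_uncertain(self) -> bool:
        # Flag low-confidence decisions explicitly instead of hiding them in a score.
        return self.confidence < self.uncertainty_threshold

    def narrative(self) -> str:
        """Plain-language summary that could be pasted into a case write-up."""
        lines = [
            f"Recommended action: {self.decision} "
            f"(confidence {self.confidence:.0%})."
        ]
        if self.is_uncertain:
            lines.append("Confidence is below the threshold; treat this as uncertain.")
        for point in self.evidence:
            lines.append(f"- {point.description} [source: {point.source_record_id}]")
        return "\n".join(lines)
```

Keeping the source record ID next to each evidence point means a reviewer can check the claim directly, which is the same standard you would apply to a human analyst's note.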
Watch for “Models in a Trench Coat”
Teams are exploring agentic workflows: multiple models chained together, sometimes with automated actions attached.
Some of these systems may actually be “lots of different AI models sort of in a trench coat”, and some of them may be what we would think of as agentic.
When a decision is produced by a chain, explainability must cover the full chain, including every step that shaped the final answer.
A defensible AI decision needs a record someone else can follow without reopening the whole model.
That is why evidence links, version history, and review notes matter as much as the headline score or alert outcome.
Oversight That Holds Up When Volumes Are High
Human oversight is essential, but simply stating “human in the loop” does little on its own.
The control is whether review, escalation, and challenge still work when the queue is busy.
Scale Review Like a Regulated Process
Model use in financial crime should follow the same pattern institutions already use for high-impact human decision-making.
Most firms have second-line and third-line reviews for high-impact work.
AI-assisted work should pass through those same review lines, with controls that are proportionate to risk.
"If you failed your driving test, you didn't hallucinate all of the bits you got wrong - you got them wrong."
Calling errors “hallucinations” can make them sound mystical and unmanageable, when most failures are plain errors: wrong evidence, wrong reasoning, or wrong context. They should be addressed as such.
A workable oversight design often includes:
Tiered review: sample low-risk decisions, and require mandatory review for high-impact outcomes.
Rotating case sampling to detect drift and bias early.
Clear override routes, so analysts can challenge a model and record why.
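The three controls above can be sketched in a few lines. This is an illustrative example under stated assumptions: the function names, the "high" impact label, the 5% sampling rate, and the period string are hypothetical, and hash-based sampling is just one simple way to rotate which low-risk cases get reviewed.

```python
import hashlib
from dataclasses import dataclass


def in_sample(case_id: str, period: str, rate: float) -> bool:
    """Deterministically pick roughly `rate` of cases; a new `period` rotates the sample."""
    digest = hashlib.sha256(f"{period}:{case_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def requires_review(case_id: str, impact: str, period: str,
                    low_risk_rate: float = 0.05) -> bool:
    """Tiered review: every high-impact case, plus a rotating sample of the rest."""
    if impact == "high":
        return True
    return in_sample(case_id, period, low_risk_rate)


@dataclass(frozen=True)
class Override:
    case_id: str
    analyst_id: str
    model_decision: str
    analyst_decision: str
    reason: str


def record_override(case_id: str, analyst_id: str, model_decision: str,
                    analyst_decision: str, reason: str) -> Override:
    """An analyst can overrule the model, but only with a written reason."""
    if not reason.strip():
        raise ValueError("An override must include a written reason.")
    return Override(case_id, analyst_id, model_decision, analyst_decision, reason)
```

Because the sample is derived from the case ID and the period, the same case is always selected or skipped within a period, so the selection itself can be reproduced later. The sampled reviews are what would feed any drift or bias checks; the sketch does not perform those checks.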
"When a firm needs to defend an automated decision, the technical question is whether it can reconstruct what happened: which system ran, on what data, with what output, and who reviewed it."
The Defensible Decision Trail: What Good Looks Like
If regulators ask “why did you do that?”, an AI programme succeeds or fails on documentation as much as accuracy.
The goal is to reconstruct a decision and show it was reasonable, controlled, and monitored.
Audit the Pipeline, Not Only the Outcome
A useful starting point is an implementation audit: data in, outputs out, and how results were scored and reviewed.
In production, organisations should track which model version ran, what it was trained on, and how performance changed each time it retrained.
For complex systems, add traceability across the chain: at a given timestamp, which model IDs ran, on which data, and in what order.
What Good Traceability Looks Like
Inputs, model versions, timestamps, review actions, and evidence links should let a team replay a decision without guessing what the system did at the time.
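A minimal sketch of such a record, assuming a chain of models whose steps are logged in order: the class names, field names, and model IDs used here are hypothetical and only illustrate the shape of the data.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ChainStep:
    order: int                   # position in the chain, starting at 1
    model_id: str
    model_version: str
    input_record_ids: List[str]  # links back to source records
    output: str
    ran_at: str                  # ISO 8601 timestamp, UTC


@dataclass
class ReviewAction:
    reviewer_id: str
    action: str                  # e.g. "confirmed" or "overridden"
    note: str


@dataclass
class DecisionTrail:
    case_id: str
    steps: List[ChainStep] = field(default_factory=list)
    review_actions: List[ReviewAction] = field(default_factory=list)

    def add_step(self, model_id: str, model_version: str,
                 input_record_ids: List[str], output: str,
                 ran_at: Optional[str] = None) -> ChainStep:
        """Append the next step in the chain, stamping it with the current UTC time if none is given."""
        timestamp = ran_at or datetime.now(timezone.utc).isoformat()
        step = ChainStep(len(self.steps) + 1, model_id, model_version,
                         list(input_record_ids), output, timestamp)
        self.steps.append(step)
        return step

    def replay_order(self) -> List[str]:
        """List which model ran, in what order, and what it produced."""
        ordered = sorted(self.steps, key=lambda s: s.order)
        return [f"{s.order}. {s.model_id}@{s.model_version} -> {s.output}"
                for s in ordered]

    def to_json(self) -> str:
        """Serialise the whole trail so it can be stored alongside the case."""
        return json.dumps(asdict(self), sort_keys=True)
```

The point of the structure is that every step, including intermediate ones, carries its own model version and input references, so the chain can be walked without re-running anything.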
The same record is what lets you respond logically and proportionately when something goes wrong.
If bad input data poisons a retraining cycle, you need to isolate impact, roll back, and remediate quickly.
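Isolating impact becomes a simple query once each decision is logged with the model version that produced it. The sketch below assumes a hypothetical log of dictionaries with `case_id` and `model_version` keys and a version history ordered from oldest to newest; real systems will store this differently.

```python
from typing import Iterable, List, Mapping, Sequence


def affected_cases(decision_log: Iterable[Mapping[str, str]],
                   bad_versions: Iterable[str]) -> List[str]:
    """Return the cases decided by any model version trained on the bad data."""
    bad = set(bad_versions)
    return sorted({entry["case_id"] for entry in decision_log
                   if entry["model_version"] in bad})


def rollback_target(version_history: Sequence[str],
                    bad_versions: Iterable[str]) -> str:
    """Return the most recent version released before the first bad one."""
    bad = set(bad_versions)
    bad_positions = [i for i, v in enumerate(version_history) if v in bad]
    if not bad_positions:
        raise ValueError("None of the bad versions appear in the history.")
    first_bad = min(bad_positions)
    if first_bad == 0:
        raise ValueError("No earlier version exists to roll back to.")
    return version_history[first_bad - 1]
```

The affected cases are the ones to re-review during remediation, and the rollback target is the version to restore while the retraining data is fixed.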
What To Ask Before You Scale AI in Financial Crime
Defensibility improves when the questions are embedded in governance early.
Can we explain the decision? Focus on why this case moved in this direction and what evidence supported it.
Can we prove the evidence? Explanations should link back to source records, not invented references.
Can we replay it? Model IDs, timestamps, training data, and performance logs should make this possible.
The simplest test is this: if you had to defend the decision to a regulator tomorrow, would you be comfortable with the record you have today?
FAQs
What Does “Explainable AI” Mean in AML?
It means showing confidence, the key drivers behind a decision, and links back to evidence so investigators can verify and write up the rationale.
Is a Large Language Model Enough for Financial Crime Decisioning?
LLMs can help with summarising and triage, but review-or-discount decisions usually need models designed for classification and engineered for audit and traceability.
How Do You Add Human Oversight without Slowing Everything Down?
Use tiered controls: sample low-risk decisions, require mandatory review for high-impact outcomes, and rotate case sampling to detect drift early.
How Do You Make AI Decisions Defensible to Regulators?
Record inputs, model versions, timestamps, outputs, review actions, and evidence used, then monitor performance over time and be able to roll back safely.
Sam Kendall works on digital marketing at Beyond Encryption, helping build B2B marketing activity around research, first principles, and sustainable growth. He writes about marketing effectiveness, positioning, customer communications, and digital culture, with longer-form work published at ATNL.