Model Risk Management for Machine Learning and Large Language Models in Financial Services

Why this still matters (even with "new" AI)

Model risk never went away; it just got a bigger stage. The classic supervision playbook—clear ownership, sound development and use, independent validation, and ongoing monitoring—remains the anchor. United States supervisors set that bar in 2011 (SR 11-7 and the companion Office of the Comptroller of the Currency bulletin) and have not walked it back. The language may predate today's models, but the expectations—effective challenge, evidence, and control—map cleanly to machine learning and large language models. (Federal Reserve)

Across the Channel, the United Kingdom Prudential Regulation Authority formalised this into five principles (governance, model lifecycle, independent review, data/technology infrastructure, and reporting) and made one point very clearly: treat model risk as a risk discipline in its own right, not as a side-effect of analytics. The message is the same in the European Central Bank's guide to internal models: institutions are expected to run an effective model risk framework across all models in use. (Bank of England)

What changes when the model can generate text

Generative systems add two practical wrinkles to a familiar lifecycle.

First, evaluation is no longer just about predictive accuracy on neatly labelled test sets. You need a lightweight "evaluation harness" that checks whether answers are grounded in approved sources, whether the system refuses off-limits requests, and whether personal data is being mishandled. This is not exotic; it is simply extending validation to cover the behaviours these systems enable. NIST's Artificial Intelligence Risk Management Framework is a helpful way to organise those checks across govern → map → measure → manage. (NIST Publications)

Second, logging needs more care. Supervisors expect you to be able to reconstruct how a decision was made. With large language models, that means keeping just enough information—prompts, retrieved citations, model version, guardrail hits—to explain the outcome later, while avoiding indiscriminate collection of personal data. The principle is old, the artefacts are new. (Federal Reserve)

A bank example: collections and hardship support

Imagine a retail bank that uses a large-language-model assistant to help agents draft hardship responses. Under classic model risk management, that assistant is a model plus a workflow. The development team documents intended use ("draft only; human approval required"), known limitations (may omit context beyond the retrieved window), and datasets (policy documents, training notes, and retrieval sources). Validation replicates accuracy tests, but also runs red-team probes for prompt injection and checks that every statement in a draft is traceable back to a cited source. Operations then monitor two things: quality (are humans overriding for the same reason repeatedly?) and safety (are guardrails firing when they should?). That is the familiar intake → validation → approval → monitoring cycle, applied to a new class of model. (Federal Reserve)

An insurer example: underwriting assistants

In commercial lines, an underwriting assistant might summarise broker submissions, pull exclusions from policy wordings, and flag missing evidence. The model risk work here is less about the glamour of the model and more about the plumbing: can you prove the summary came from your approved libraries; do you record when an underwriter accepts or edits a suggestion; can you roll back to a previous prompt template if quality dips? Independent review should still provide "effective challenge"—re-running tests with fresh samples and stress scenarios (e.g., long attachments, near-duplicates, adversarial phrasing). The supervisory expectations—independence, documentation, and replicability—are unchanged. (Bank of England)

What your committee actually wants to see

Most approval forums are looking for three things presented plainly:

Purpose and ownership. Who is responsible, what problem the model solves, where it must not be used, and which controls surround it. (This mirrors SR 11-7's emphasis on governance and clear accountability.) (Federal Reserve)
Independent testing. Not just re-running the team's metrics, but challenging conceptual soundness, data lineage, and failure modes. For generative systems, that includes grounding and refusal checks. (Federal Reserve)
Monitoring and change control. Drift and stability metrics for predictive parts; sampling and quality notes for text generation; a signed changelog for prompts, retrieval settings, and guardrails. PRA SS1/23 is explicit that firms should manage this as a first-class risk discipline, with reporting to senior management. (Bank of England)

Keep the pack short. A four-to-six page memo with links to evidence (tests, lineage views, logs) tends to travel better than a hundred slides.

"Buy" models still carry "use" risk

Even if you consume a foundation model as a service, you own the use case. That means you still run intake, validation, approval, and monitoring on the workflow you built around the model: retrieval, redaction, prompt templates, and human oversight. ECB material on internal models and the EBA's work on machine learning for internal ratings-based approaches point in the same direction: supervisors are less interested in your marketing label for a tool and more interested in whether the control environment matches its real impact. (European Banking Supervision)

A pragmatic way to start (without freezing delivery)

Most teams get moving with a 30-day plan:

Days 1–5: list live and near-live models; name the owner; sketch intended use and limits; capture data lineage at a simple level (sources, transformations, where used).
Days 6–15: write a short validation plan and run it. For large language models, include a dozen grounding/refusal/privacy checks and keep raw outputs with pass/fail notes.
Days 16–25: stand up monitoring: a simple dashboard for quality (e.g., human edit rate), stability (e.g., retrieval hit rates), and safety (guardrail events).
Days 26–30: assemble a memo for the approval forum and agree a change-control process (prompts, retrieval settings, model versions).

You will notice this mirrors the traditional lifecycle—on purpose. It lets risk and audit recognise the shape of the programme, even when the model type is new. (Federal Reserve)

Where data governance fits

None of the above works if you cannot trace where inputs came from or show that sensitive fields are handled appropriately. Basel's principles for risk data aggregation are still the best shorthand for "can your executives rely on this output?"—accurate, complete, timely, traceable. For machine learning and large language models, add masking or tokenisation where feasible and keep a clear approval path for access to raw stores. (Bank for International Settlements)

Need help operationalising model risk (without slowing teams)?

We can help you stand up a modern model risk workflow in four weeks: intake templates, validation harness for machine learning and large language models, monitoring with meaningful signals, and a change-control process that product teams can actually live with. You get a concise approval memo and evidence that maps to what supervisors expect.

Book a free consultation

Sources and further reading

Federal Reserve / OCC SR 11-7 (2011) and OCC 2011-12: the foundation for model risk governance, independent validation, and monitoring. (Federal Reserve)
Bank of England PRA SS1/23 (2023): five principles and the expectation to treat model risk as a standalone risk discipline. (Bank of England)
ECB Guide to internal models (2024/2025 editions): expectations for an effective model risk framework across models in use. (European Banking Supervision)
EBA machine-learning for IRB (2023 follow-up report): where supervisors see machine learning fitting within internal ratings-based approaches. (European Banking Authority)
NIST AI Risk Management Framework 1.0: a practical way to structure evaluation and monitoring for modern models. (NIST Publications)