In an era where every service is expected to be available around the clock, resilience has become the new measure of trust. Customers demand reliability, regulators demand demonstrable assurance, and even a brief outage can damage reputation faster than systems can recover. Financial institutions today operate in an environment where resilience is both a compliance mandate and a brand promise.

Opening the TAB Global–Red Hat industry briefing, Foo Boon Ping, President and Managing Editor of TAB Global, reminded participants that the greatest cost of a service outage is not financial but relational and reputational — the erosion of trust. He emphasised that artificial intelligence (AI), now integral to banking operations, adds new complexity: systems must be not only always on, but always trustworthy. As AI enables predictive and autonomous recovery, the challenge is to ensure that automation enhances accountability rather than obscuring it.

The event convened senior executives from across technology, operations, business lines, risk and regulatory affairs. It focused on how AI-driven systems are transforming resilience from recovery-based to prevention-oriented — a transition that demands transparency, explainability and governance discipline equal to the scale of automation itself.

This framing — that resilience is not merely a matter of uptime but of confidence — set the tone for the presentations and dialogue that followed. Trust, participants agreed, has become both the objective and the test of every operational resilience strategy.

From regulation to architecture

Richard Harmon, Global Head of Financial Services at Red Hat, opened his remarks by referencing a New York Federal Reserve study, “Cyber Risk and the US Financial System – A Pre-Mortem Analysis”, which modelled the failure of one of the five largest banks in the United States (US) and found that a single one-day disruption could impact 38% of the US payment network. He noted that the research demonstrated how interdependence and third-party concentration could propagate systemic risk throughout the financial system.

This, Harmon said, underlined a point that regulators worldwide — from the European Union, under the Digital Operational Resilience Act (DORA), to the Monetary Authority of Singapore (MAS), with its Technology Risk Management (TRM) and outsourcing guidelines — have begun to codify: “You can outsource a service, but you can’t outsource the risk.” The ability to delegate operations to a technology partner does not absolve the financial institution of accountability for continuity, compliance or customer outcomes.

He argued that this shift signals a deeper transformation — from regulatory policy to regulatory architecture. Frameworks such as DORA, TRM and the Australian Prudential Regulation Authority’s (APRA) Cross-Industry Prudential Standard (CPS) 230 now require institutions to demonstrate that recovery objectives are achievable within hours, not days. The onus has moved from “Do you have a plan?” to “Can you prove your plan works — and works in real time?”

Harmon described this evolution as a move towards isomorphic architectures, in which the design of governance and oversight mirrors the structure of the systems it supervises: rather than merely supervising “generative AI”, the aim is to verify “explainable AI”. In practice, this means resilience must be verifiable not just technically but institutionally.
“The goal,” he explained, “is to make compliance and resilience indistinguishable — to design them as one.” Such architectures, he added, are becoming essential to preserving trust in an AI-driven regulatory landscape.

Resilience by design in an AI-enabled stack

Kelvin Loh, Head of Solutions Architecture for Red Hat ASEAN and Korea, expanded on how automation and observability can reinforce resilience in hybrid and multi-cloud environments. He emphasised that modern architectures must be open, consistent and composable, enabling applications to recover and scale dynamically without creating new silos of dependency.

Loh illustrated how human-in-the-loop and human-on-the-loop models work together to sustain operational confidence. The first ensures human judgment remains central to sensitive, risk-weighted processes; the second allows AI systems to execute recovery autonomously under structured supervision. “Automation,” Loh said, “should bring services back online in order of customer impact, not just technical sequence.”

He showed how GitOps practices and tools such as Ansible Lightspeed enable continuous monitoring, auditability and real-time recovery orchestration. When combined with machine learning for anomaly detection, these systems can predict and mitigate failures before they escalate. Yet, Loh cautioned, automation is not autonomy — trust must still rest on explainable mechanisms and visible control.

In describing AI as an amplifier rather than a replacement for resilience, Loh underscored that transparency remains paramount. Each recovery event must be traceable end to end — not to eliminate human involvement, but to ensure that when automation acts, humans understand why and how.

What practitioners are asking

The moderated discussion that followed moved from frameworks to practical frontline realities. Participants leaned into the difficult questions that still separate aspiration from implementation: How can AI enhance recovery beyond the infrastructure layer? How can it anticipate and prevent service degradation before it reaches the customer? And, crucially, could its growing role in resilience also introduce new forms of fragility?

A participant from a global bank began by asking how AI could strengthen business and service recovery, not just technical restoration. The question reframed resilience as a customer-facing challenge rather than a data-centre problem — how to ensure that, in an outage, not only systems but services, channels and communications recover in harmony. It reflected an awareness that even a brief disruption can unravel trust if customers cannot transact, call or receive accurate information.

Loh responded that institutions are already beginning to close the gap between observability and service continuity. He explained that AI-driven playbooks are being used to automate recovery actions, ranking them according to business impact rather than infrastructure hierarchy, while language-based assistants now guide engineers through incident workflows, documenting each decision as it happens. “The intent,” he said, “is not to replace human response, but to ensure the system already knows the first steps before we start reacting.” He added that automation can extend beyond recovery to orchestration — coordinating alerts, dependencies and hand-offs across technology and operations teams so that the business response mirrors the speed and precision of the technical one.
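To make the idea concrete, here is a minimal sketch of what impact-ranked recovery automation might look like. It is an illustration under stated assumptions, not a description of Red Hat's tooling or any bank's runbook: the service names, impact scores and dependency data are hypothetical, and a production playbook would also encode approvals, health checks and audit logging.

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class RecoveryAction:
    # Sort key: higher customer impact recovers first (hence the negation).
    sort_index: float = field(init=False, repr=False)
    service: str = field(compare=False)
    customer_impact: float = field(compare=False)  # e.g., affected customers per minute
    depends_on: tuple = field(default=(), compare=False)

    def __post_init__(self):
        self.sort_index = -self.customer_impact

def recovery_order(actions):
    """Order restarts by customer impact, promoting each service's dependencies first."""
    ranked = sorted(actions)                   # highest customer impact first
    by_name = {a.service: a for a in ranked}
    done, plan = set(), []

    def visit(action):
        if action.service in done:
            return
        for dep in action.depends_on:          # a dependency must be healthy
            visit(by_name[dep])                # before its dependents restart
        done.add(action.service)
        plan.append(action.service)

    for action in ranked:
        visit(action)
    return plan

if __name__ == "__main__":
    # Hypothetical service catalogue for illustration only.
    actions = [
        RecoveryAction("batch-reporting", customer_impact=5),
        RecoveryAction("payments-api", customer_impact=900, depends_on=("core-ledger",)),
        RecoveryAction("mobile-banking", customer_impact=700, depends_on=("payments-api",)),
        RecoveryAction("core-ledger", customer_impact=400),
    ]
    print(recovery_order(actions))
    # ['core-ledger', 'payments-api', 'mobile-banking', 'batch-reporting']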
Another participant asked for examples of how AI is being used to monitor and identify potential failures. The question reflected a shift from reactive recovery to proactive prevention — the desire to catch early warning signs before customers are affected.

The ensuing discussion noted that machine-learning-based anomaly detection is now widely used to identify subtle deviations in performance and data flow, enabling pre-emptive intervention (a simple sketch of this pattern appears below). Some banks are extending this with simulation environments and virtualised testbeds that allow continuous stress testing of applications and dependencies. Reference was made to Project Litmus, Singapore’s national initiative for evaluating the robustness and safety of AI systems, as an analogue for how the private sector might adopt structured validation environments for operational risk.

Picking up on this, Harmon elaborated on the concept of digital twins — virtual replicas of entire systems that can be used to model complex failure scenarios and recovery pathways. He explained that this approach allows institutions to experiment safely, exploring cascading dependencies and rehearsing coordinated recovery before real incidents occur. Harmon described this as “a shift towards continuous validation,” where resilience is proven iteratively rather than assumed. “The lesson,” he said, “is that resilience can’t just be designed — it has to be demonstrated and continually re-verified.”

A final question raised a more philosophical tension: as AI becomes integral to monitoring, detection and recovery, could it itself become a new point of failure? The participant wondered whether the same algorithms that help prevent outages might also amplify them if they behave unexpectedly or if data quality deteriorates.

Harmon acknowledged that paradox. “You can outsource the function,” he said, “but not the risk.” He warned that AI systems, left unchecked, can drift silently or fail in ways that are opaque to human operators. The answer, he argued, lies in transparency, traceability and testability — ensuring that every model and automated decision can be interrogated, audited and overridden. He added that the industry is still far from achieving explainable resilience: “We can map where data flows, but not always why it fails.”

Harmon suggested that this is where isomorphic thinking — aligning regulatory oversight and architectural design — may play a role. If supervisory frameworks evolve in parallel with technology, he said, institutions and regulators could begin to share a common language of assurance, rather than one perpetually lagging the other.

The next questions for AI and trust

Several themes from the briefing now frame the frontier of operational resilience. Automation and oversight are converging as human-in-the-loop and human-on-the-loop paradigms define new modes of control. Continuous testing and simulation, exemplified by initiatives such as Project Litmus and digital-twin modelling, are turning assurance into an ongoing exercise rather than a periodic one. And structural alignment, or isomorphism, is emerging as the architecture of trust between institutions and regulators.

The question that remains is how these frameworks will evolve. How will regulators verify the performance of AI-driven recovery systems? What forms of evidence will count as proof of resilience? Can accountability persist when decisions occur at machine speed?
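As a footnote to the prevention discussion, the following minimal sketch shows the kind of statistical anomaly detection participants described: a rolling z-score over a service metric that flags sharp deviations before they become outright failures. The metric, window and thresholds are invented for illustration; production systems would use richer, multivariate models with seasonality handling.

```python
from collections import deque
from statistics import mean, stdev

class RollingAnomalyDetector:
    """Flag metric readings that deviate sharply from recent history."""

    def __init__(self, window: int = 60, z_threshold: float = 3.0):
        self.history = deque(maxlen=window)   # sliding window of recent readings
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the window."""
        anomalous = False
        if len(self.history) >= 10:           # need enough history to be meaningful
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                anomalous = True
        self.history.append(value)
        return anomalous

# Hypothetical usage: payment-API latency in milliseconds.
detector = RollingAnomalyDetector(window=60, z_threshold=3.0)
latencies = [120, 118, 125, 122, 119, 121, 117, 124, 120, 123, 122, 310]
for t, ms in enumerate(latencies):
    if detector.observe(ms):
        # In production this would page an operator or trigger a recovery playbook,
        # keeping a human on the loop rather than acting blindly.
        print(f"t={t}: latency {ms} ms flagged as anomalous")
```

Pairing such detectors with synthetic traffic from a digital twin is one way to rehearse the continuous validation Harmon described, before a live incident forces the issue.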
The answers are still emerging, but one principle is clear: in a financial ecosystem defined by automation, trust itself has become the measure of resilience. Building and sustaining that trust will require not just technological sophistication, but transparency, collaboration and the willingness to keep testing what we believe our systems can withstand.