Swiss Cheese Model for AI Safety

This paper, “Swiss Cheese Model for AI Safety: A Taxonomy and Reference Architecture for Multi-Layered Guardrails of Foundation Model Based Agents”, by Md Shamsujjoha, Qinghua Lu, Dehai Zhao, and Liming Zhu (Data61, CSIRO, Australia), addresses growing concerns about AI safety in Foundation Model (FM)-based agents.

The paper highlights that while FM-based agents are revolutionizing application development thanks to their versatility and ability to adapt to a wide range of tasks, their rapidly growing capabilities and autonomy introduce significant challenges for AI safety. These challenges include the potential for generating harmful or offensive content, producing dangerous or unintended outcomes, and spreading disinformation and misinformation. Existing guardrail approaches are often insufficient: they primarily focus on functional correctness and are typically single-layered, applied narrowly to specific agent artifacts, and therefore cannot effectively manage the inherently autonomous and non-deterministic nature of FM-based agents. If a single guardrail fails, the associated risk can pass through unchecked and affect the final result.

To address these critical issues, the authors propose a robust solution: multi-layered guardrails. The core contributions of this paper are:

  • A Comprehensive Taxonomy of Runtime Guardrails: Based on a systematic literature review (SLR), the paper presents a comprehensive taxonomy to categorize runtime guardrails from a software architecture perspective. This taxonomy comprises two primary categories:
    • Quality Attributes: These are essential for designing runtime guardrails, ensuring they meet critical performance, security, and reliability goals. Key attributes include:
      • Accuracy (mitigating hallucinations, misinformation, disinformation).
      • Efficiency (preventing resource-intensive tasks, endless loops).
      • Privacy (handling sensitive data, preventing leakage).
      • Security (protecting from malicious activities, data breaches, adversarial attacks).
      • Safety (preventing harmful or misleading outputs).
      • Fairness (addressing bias and discrimination).
      • Compliance (adhering to legal and regulatory standards, copyright protection).
      • Generalizability (functioning effectively across diverse scenarios without prior configurations).
      • Customizability (providing tailored protection to meet specific requirements).
      • Adaptability (adjusting and remaining effective under varying conditions).
      • Traceability (tracking and recording origins, processes, and decision paths).
      • Portability (being easily adapted and applied across different FM-based agents).
      • Interoperability (working seamlessly across differing agents and technologies).
      • Interpretability (clarity and transparency of guardrail operations).
    • Design Options: These represent practical approaches for implementing guardrails. They include:
      • Actions: Such as Block, Filter, Flag, Modify, Validate, Parallel calls, Retry, Fall back, Human intervention, Defer, Isolate, Redundancy, and Evaluate.
      • Targets: Guardrails can be applied to various elements, including Pipelines (Prompts, Intermediate Results, Final Results) and Artifacts (Goals, Context, Memory, Reasoning, Plans, Workflows, Tools, Knowledge Bases, Other Agents, FMs, Execution Time).
      • Rules: Uniform, priority-enabled, context-dependent, and negotiable (hard/soft).
      • Applicability Scope: Industry, organizational, team, and user levels.
      • Modality: Single-modal (text, image, audio) or multimodal.
      • Underlying Models: Rule-based, hybrid, and machine learning models (narrow models and FMs).
  • A Novel Reference Architecture for Multi-Layered Guardrails: Inspired by the Swiss Cheese Model, the paper proposes a reference architecture for designing multi-layered runtime guardrails for FM-based agents. In this model, each “cheese slice” represents a protective layer within the agent system. These layers are designed to protect specific quality attributes (e.g., privacy, security), specific pipeline stages (e.g., prompts, intermediate results, final results), and agent artifacts (e.g., goals, plans, tools). The key insight is that while each layer may have its own weaknesses (i.e., “holes”), these holes are positioned differently across layers. Therefore, the combined layers create a robust defense against failures, ensuring that if one layer fails, another can catch and mitigate the issue. The architecture also incorporates an AgentOps infrastructure for continuous monitoring and logging, feeding data back to activate relevant guardrails.
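To make the Swiss Cheese idea concrete, the layered pipeline described above can be sketched in code. The sketch below is illustrative, not from the paper: the layer names, the redaction/keyword heuristics, and the small set of actions (a subset of the taxonomy's Block, Modify, Flag, etc.) are all hypothetical. The key behavior it demonstrates is that each "cheese slice" inspects the content in turn, a Modify from one slice feeds into the next, and any slice can still Block what an earlier slice missed.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Action(Enum):
    """A small subset of the guardrail actions in the paper's taxonomy."""
    PASS = "pass"
    BLOCK = "block"
    MODIFY = "modify"


@dataclass
class GuardrailResult:
    action: Action
    content: str
    reason: Optional[str] = None


# Each "cheese slice" is a function from content to a GuardrailResult.
GuardrailLayer = Callable[[str], GuardrailResult]


def privacy_layer(content: str) -> GuardrailResult:
    """Hypothetical privacy slice: redacts a naive email pattern."""
    redacted = re.sub(r"\S+@\S+", "[REDACTED]", content)
    if redacted != content:
        return GuardrailResult(Action.MODIFY, redacted, "email redacted")
    return GuardrailResult(Action.PASS, content)


def safety_layer(content: str) -> GuardrailResult:
    """Hypothetical safety slice: blocks content on a simple keyword match."""
    banned = {"dangerous_instruction"}
    if any(word in content for word in banned):
        return GuardrailResult(Action.BLOCK, "", "unsafe content")
    return GuardrailResult(Action.PASS, content)


def run_layers(content: str, layers: List[GuardrailLayer]) -> GuardrailResult:
    """Pass content through each slice in turn.

    A BLOCK from any layer stops the pipeline immediately; a MODIFY feeds
    the modified content into the next layer, so a later slice can catch
    a risk that slipped through a "hole" in an earlier one.
    """
    result = GuardrailResult(Action.PASS, content)
    for layer in layers:
        result = layer(result.content)
        if result.action == Action.BLOCK:
            return result
    return result


# Example: the privacy slice redacts the email; the safety slice passes it on.
out = run_layers("contact me at a@b.com", [privacy_layer, safety_layer])
```

In a real deployment each slice would target a different pipeline stage or artifact (prompts, plans, tool calls) and report into an AgentOps-style monitoring layer; the point here is only the composition pattern, in which independent layers with different weaknesses jointly reduce the chance of an end-to-end failure.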

This proposed taxonomy and reference architecture aim to provide concrete and robust guidance for researchers and practitioners to build AI-safety-by-design from a software architecture perspective. The methodology employed in this study involved a systematic literature review to identify relevant research and synthesize findings. The authors also discuss potential threats to validity, such as search and selection bias, and the generalizability of guardrails, suggesting continual re-evaluation and refinement.

The paper concludes by setting the stage for future work, which includes developing guardrail services for a scientific agent platform that will implement the proposed reference architecture and integrate the various design options outlined in the taxonomy.

Reference: Shamsujjoha, M., Lu, Q., Zhao, D., & Zhu, L. (2025, March). Swiss Cheese Model for AI Safety: A Taxonomy and Reference Architecture for Multi-Layered Guardrails of Foundation Model Based Agents. In 2025 IEEE 22nd International Conference on Software Architecture (ICSA) (pp. 37-48). IEEE.

Podcast Link

https://notebooklm.google.com/notebook/685bacd2-76e5-48cb-bf94-8d8feb4c3ef3/audio
